public inbox for linux-kernel@vger.kernel.org
* resuming swsusp twice
@ 2005-07-13 18:59 Andy Isaacson
  2005-07-14 14:58 ` Stefan Seyfried
  2005-07-15  8:33 ` Pavel Machek
  0 siblings, 2 replies; 8+ messages in thread
From: Andy Isaacson @ 2005-07-13 18:59 UTC (permalink / raw)
  To: linux-kernel

Yesterday I booted my laptop to 2.6.13-rc2-mm1, suspended to swsusp, and
then resumed.  It ran fine overnight, including a fair amount of IO
(running firefox, rsyncing ~/Mail/archive from my mail server, hg pull,
etc).  This morning I did a swsusp:

	echo shutdown > /sys/power/disk
	echo disk > /sys/power/state

and got a panic along the lines of "Unable to find swap space, try
swapon -a".  Unfortunately I was in a hurry and didn't record the error
messages.  I powered off, then a few minutes later powered on again.

At this point, it resumed *to the swsusp state from yesterday*!
As soon as I realized what had happened, I powered off (not
shutdown) and rebooted.

On the next boot it did not find a swsusp signature and booted normally;
ext3 did a normal recovery and seemed OK, but I was suspicious and did a
fsck -f, which revealed a lot of damage; most of the damage seemed to be
in the hg repo which had been pulled from www.kernel.org/hg/.

It's extremely unfortunate that there is *any* failure mode in swsusp
that can result in this behavior.

I will try to reproduce, but I'm curious if anyone else has seen this.

-andy


* Re: resuming swsusp twice
  2005-07-13 18:59 resuming swsusp twice Andy Isaacson
@ 2005-07-14 14:58 ` Stefan Seyfried
  2005-07-14 17:54   ` Andy Isaacson
  2005-07-15  8:33 ` Pavel Machek
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Seyfried @ 2005-07-14 14:58 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: LKML

Andy Isaacson wrote:
> Yesterday I booted my laptop to 2.6.13-rc2-mm1, suspended to swsusp, and
> then resumed.  It ran fine overnight, including a fair amount of IO
> (running firefox, rsyncing ~/Mail/archive from my mail server, hg pull,
> etc).  This morning I did a swsusp:
> 
> 	echo shutdown > /sys/power/disk
> 	echo disk > /sys/power/state
> 
> and got a panic along the lines of "Unable to find swap space, try

a panic? it should only be an error message, but the machine should
still be alive.

> swapon -a".  Unfortunately I was in a hurry and didn't record the error
> messages.  I powered off, then a few minutes later powered on again.

Powered off hard or "shutdown -h now"?

> At this point, it resumed *to the swsusp state from yesterday*!
> As soon as I realized what had happened, I powered off (not
> shutdown) and rebooted.

Good.

> On the next boot it did not find a swsusp signature and booted normally;
> ext3 did a normal recovery and seemed OK, but I was suspicious and did a
> fsck -f, which revealed a lot of damage; most of the damage seemed to be

this is expected in this case, unfortunately.

> in the hg repo which had been pulled from www.kernel.org/hg/.
> 
> It's extremely unfortunate that there is *any* failure mode in swsusp
> that can result in this behavior.

I of course won't say that this cannot happen, but by design, the swsusp
signature is invalidated even before reading the image, so theoretically
it should not happen.
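
A minimal sketch of that ordering, assuming a device file whose first bytes hold a marker. The marker strings, `SIG_LEN`, and `try_resume` are placeholders invented for illustration, not names or values from the swsusp source:

```python
import os

SIG_LEN = 10
RESUME_SIG = b"SUSPENDED!"   # hypothetical marker: "a valid image follows"
PLAIN_SIG = b"PLAIN-SWAP"    # hypothetical marker: "ordinary swap, no image"


def try_resume(path):
    """Return the image bytes if a resume marker is present, else None."""
    with open(path, "r+b") as dev:
        if dev.read(SIG_LEN) != RESUME_SIG:
            return None  # no image marker: boot normally
        # Step 1: invalidate the marker and push it to stable storage
        # before touching the image.
        dev.seek(0)
        dev.write(PLAIN_SIG)
        dev.flush()
        os.fsync(dev.fileno())
        # Step 2: only now read the image itself.
        dev.seek(SIG_LEN)
        return dev.read()
```

With this ordering a second call on the same file returns `None`, so the same image cannot be resumed twice, provided the invalidating write really reaches the disk.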

> I will try to reproduce, but I'm curious if anyone else has seen this.

i have not seen anything like that, but i am not always running the
latest & greatest kernel.
-- 
Stefan Seyfried                  \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices      \ wanted to be paid for what I write."
SUSE LINUX Products GmbH, Nürnberg \                    -- Leonard Cohen


* Re: resuming swsusp twice
  2005-07-14 14:58 ` Stefan Seyfried
@ 2005-07-14 17:54   ` Andy Isaacson
  2005-07-14 18:36     ` Stefan Seyfried
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Andy Isaacson @ 2005-07-14 17:54 UTC (permalink / raw)
  To: Stefan Seyfried; +Cc: LKML

On Thu, Jul 14, 2005 at 04:58:12PM +0200, Stefan Seyfried wrote:
> Andy Isaacson wrote:
> > Yesterday I booted my laptop to 2.6.13-rc2-mm1, suspended to swsusp, and
[snip]
> > and got a panic along the lines of "Unable to find swap space, try
>
> a panic? it should only be an error message, but the machine should
> still be alive.

Well, the console was left on the swsusp VT (guess that's not surprising)
and I was hurrying to catch the train, so I didn't investigate, I just
held down the power button for 5 seconds.

> > swapon -a".  Unfortunately I was in a hurry and didn't record the error
> > messages.  I powered off, then a few minutes later powered on again.
>
> Powered off hard or "shutdown -h now"?

Hard.  It's a Thinkpad X40 with ACPI, so I hold down the power button
for a few seconds to power off.

> > At this point, it resumed *to the swsusp state from yesterday*!
[snip severe ext3 damage]
> > It's extremely unfortunate that there is *any* failure mode in swsusp
> > that can result in this behavior.
>
> I of course won't say that this cannot happen, but by design, the swsusp
> signature is invalidated even before reading the image, so theoretically
> it should not happen.

Yes, I'd seen that happen on earlier swsusps, so I was quite surprised
when it blew up like this.

Perhaps the image should be more rigorously checked?  I'm wishing that
it would verify that the header and the image matched, after it finishes
reading the image.  For example, computing the hash

MD5(header || image)     (|| denotes "concatenate" in crypto pseudocode.)

and storing that hash in a final trailing block.  Additionally, of
course, as soon as the resume has read the image it should overwrite the
header; and the header should include jiffies or something along those
lines to ensure that it won't accidentally have the same contents as the
previous image's header.

The hash doesn't have to be MD5; even a CRC should suffice I think...
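
A rough sketch of the idea, assuming a made-up 16-byte header layout (a stamp standing in for "jiffies or something", plus the image length). `make_header`, `seal`, and `verify` are hypothetical names, not swsusp functions:

```python
import hashlib
import struct

HEADER_LEN = 16  # two little-endian 64-bit fields; layout is an assumption


def make_header(stamp, image_len):
    """Pack a per-suspend stamp and the image length into a header."""
    return struct.pack("<QQ", stamp, image_len)


def seal(header, image):
    """Append MD5(header || image) as a trailing block."""
    return header + image + hashlib.md5(header + image).digest()


def verify(blob):
    """Check that the trailing digest matches the header and image."""
    if len(blob) < HEADER_LEN:
        return False
    header, rest = blob[:HEADER_LEN], blob[HEADER_LEN:]
    _stamp, image_len = struct.unpack("<QQ", header)
    image, trailer = rest[:image_len], rest[image_len:]
    return trailer == hashlib.md5(header + image).digest()
```

Swapping `hashlib.md5` for a CRC such as `zlib.crc32` would follow the same pattern. Note that a check like this only catches a header that does not belong to the image; an old image whose header was never invalidated would still verify.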

-andy


* Re: resuming swsusp twice
  2005-07-14 17:54   ` Andy Isaacson
@ 2005-07-14 18:36     ` Stefan Seyfried
  2005-07-14 21:45       ` Andy Isaacson
  2005-07-15  8:35     ` Pavel Machek
  2005-07-15  8:38     ` Pavel Machek
  2 siblings, 1 reply; 8+ messages in thread
From: Stefan Seyfried @ 2005-07-14 18:36 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: LKML

Andy Isaacson wrote:

> Perhaps the image should be more rigorously checked?  I'm wishing that
> it would verify that the header and the image matched, after it finishes

in your case, the header and the image matched. There was no new image
on disk. And no new header.

> reading the image.  For example, computing the hash
> 
> MD5(header || image)     (|| denotes "concatenate" in crypto pseudocode.)
> 
> and storing that hash in a final trailing block.  Additionally, of
> course, as soon as the resume has read the image it should overwrite the
> header; and the header should include jiffies or something along those

the header is actually overwritten _prior_ to reading the image back. Or
it should be; obviously it was not in your case.

> lines to ensure that it won't accidentally have the same contents as the
> previous image's header.
> 
> The hash doesn't have to be MD5; even a CRC should suffice I think...

But the failure you have seen now - failure to invalidate the resume
header - could also happen as long as we do not fix the reason for your
failure. If we fix it, we don't need additional security nets ;-)

But i have no idea what went wrong for you, i'll have a look at the code
but i doubt that i'll find much of interest.

One thing which would be interesting:
You don't happen to have multiple swap partitions?
-- 
Stefan Seyfried                  \ "I didn't want to write for pay. I
QA / R&D Team Mobile Devices      \ wanted to be paid for what I write."
SUSE LINUX Products GmbH, Nürnberg \                    -- Leonard Cohen


* Re: resuming swsusp twice
  2005-07-14 18:36     ` Stefan Seyfried
@ 2005-07-14 21:45       ` Andy Isaacson
  0 siblings, 0 replies; 8+ messages in thread
From: Andy Isaacson @ 2005-07-14 21:45 UTC (permalink / raw)
  To: Stefan Seyfried; +Cc: LKML

On Thu, Jul 14, 2005 at 08:36:15PM +0200, Stefan Seyfried wrote:
> But the failure you have seen now - failure to invalidate the resume
> header - could also happen as long as we do not fix the reason for your
> failure. If we fix it, we don't need additional security nets ;-)

So if the header is overwritten before the pages are read back in, that
implies that the overwriting IO did not get to disk in my failing case.
Since plenty of other IO did end up on disk (scribbling on my ext3 in the
process), I wonder what could be different there...

> But i have no idea what went wrong for you, i'll have a look at the code
> but i doubt that i'll find much of interest.
> 
> One thing which would be interesting:
> You don't happen to have multiple swap partitions?

One root partition, one swap partition, no swap files or anything.

The only interesting thing I can think of is that my swap partition is
only 512MB while the machine has 1.25GB RAM.  (Installed Ubuntu and took
the defaults before installing the SODIMM.)
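
For what it's worth, a quick way to flag that kind of mismatch before suspending might look like the sketch below. It parses `/proc/meminfo`-style text; the `image_fraction` of 0.5 is a made-up placeholder, not a figure from swsusp, and both function names are hypothetical:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_looks_too_small(text, image_fraction=0.5):
    """Guess whether free swap is smaller than an assumed image size."""
    info = parse_meminfo(text)
    estimated_image_kb = info["MemTotal"] * image_fraction
    return info["SwapFree"] < estimated_image_kb
```

On a live system this would be called as `swap_looks_too_small(open("/proc/meminfo").read())`. It is only a heuristic; how much swap the image really needs depends on the kernel and on memory use at suspend time.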

FWIW, I have suspended and resumed a few times since the failure and
haven't seen a repeat of the problem.  I am seeing some other problems
with 2.6.13-rc2-mm1 that I didn't see before - DRM/i830 lockups after
swsusp - that might be masking the problem, but I have done the
boot-swsusp-resume-swsusp-resume successfully.

I'm at a loss as to what I might have done to trigger the problem.

-andy


* Re: resuming swsusp twice
  2005-07-13 18:59 resuming swsusp twice Andy Isaacson
  2005-07-14 14:58 ` Stefan Seyfried
@ 2005-07-15  8:33 ` Pavel Machek
  1 sibling, 0 replies; 8+ messages in thread
From: Pavel Machek @ 2005-07-15  8:33 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: linux-kernel

Hi!

> Yesterday I booted my laptop to 2.6.13-rc2-mm1, suspended to swsusp, and
> then resumed.  It ran fine overnight, including a fair amount of IO
> (running firefox, rsyncing ~/Mail/archive from my mail server, hg pull,
> etc).  This morning I did a swsusp:
> 
> 	echo shutdown > /sys/power/disk
> 	echo disk > /sys/power/state
> 
> and got a panic along the lines of "Unable to find swap space, try
> swapon -a".  Unfortunately I was in a hurry and didn't record the error
> messages.  I powered off, then a few minutes later powered on again.
> 
> At this point, it resumed *to the swsusp state from yesterday*!
> As soon as I realized what had happened, I powered off (not
> shutdown) and rebooted.

Bad, very bad.

> On the next boot it did not find a swsusp signature and booted normally;
> ext3 did a normal recovery and seemed OK, but I was suspicious and did a
> fsck -f, which revealed a lot of damage; most of the damage seemed to be
> in the hg repo which had been pulled from www.kernel.org/hg/.

You should not let ext3 do a journal replay. That way, the damage
will hopefully be somewhat smaller.

> It's extremely unfortunate that there is *any* failure mode in swsusp
> that can result in this behavior.

Well, I've never seen that one before...
								Pavel
-- 
teflon -- maybe it is a trademark, but it should not be.


* Re: resuming swsusp twice
  2005-07-14 17:54   ` Andy Isaacson
  2005-07-14 18:36     ` Stefan Seyfried
@ 2005-07-15  8:35     ` Pavel Machek
  2005-07-15  8:38     ` Pavel Machek
  2 siblings, 0 replies; 8+ messages in thread
From: Pavel Machek @ 2005-07-15  8:35 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Stefan Seyfried, LKML

Hi!

> > I of course won't say that this cannot happen, but by design, the swsusp
> > signature is invalidated even before reading the image, so theoretically
> > it should not happen.
> 
> Yes, I'd seen that happen on earlier swsusps, so I was quite surprised
> when it blew up like this.
> 
> Perhaps the image should be more rigorously checked?  I'm wishing that
> it would verify that the header and the image matched, after it finishes
> reading the image.  For example, computing the hash
> 
> MD5(header || image)     (|| denotes "concatenate" in crypto pseudocode.)
> 
> and storing that hash in a final trailing block.  Additionally, of
> course, as soon as the resume has read the image it should overwrite the
> header; and the header should include jiffies or something along those
> lines to ensure that it won't accidentally have the same contents as the
> previous image's header.
> 
> The hash doesn't have to be MD5; even a CRC should suffice I think...

That's quite a lot of complexity... just fix the bug.

								Pavel
-- 
teflon -- maybe it is a trademark, but it should not be.


* Re: resuming swsusp twice
  2005-07-14 17:54   ` Andy Isaacson
  2005-07-14 18:36     ` Stefan Seyfried
  2005-07-15  8:35     ` Pavel Machek
@ 2005-07-15  8:38     ` Pavel Machek
  2 siblings, 0 replies; 8+ messages in thread
From: Pavel Machek @ 2005-07-15  8:38 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Stefan Seyfried, LKML

Hi!

> > I of course won't say that this cannot happen, but by design, the swsusp
> > signature is invalidated even before reading the image, so theoretically
> > it should not happen.
> 
> Yes, I'd seen that happen on earlier swsusps, so I was quite surprised
> when it blew up like this.
> 
> Perhaps the image should be more rigorously checked?  I'm wishing that
> it would verify that the header and the image matched, after it finishes
> reading the image.  For example, computing the hash
> 
> MD5(header || image)     (|| denotes "concatenate" in crypto pseudocode.)
> 
> and storing that hash in a final trailing block.  Additionally, of
> course, as soon as the resume has read the image it should overwrite the
> header; and the header should include jiffies or something along those
> lines to ensure that it won't accidentally have the same contents as the
> previous image's header.
> 
> The hash doesn't have to be MD5; even a CRC should suffice I think...

Actually, what you want is an "if filesystems are newer than suspend
image, panic" test. There is more than one way that can happen.
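
Such a test could be sketched roughly as follows, assuming some timestamp is available for both the image and each filesystem (for example a superblock write time). `StaleImageError` and `check_image_is_current` are hypothetical names, and where the timestamps come from is left open:

```python
class StaleImageError(Exception):
    """Raised when a filesystem has changed since the image was saved."""


def check_image_is_current(image_saved_at, fs_last_written_at):
    """Refuse to resume if any filesystem was written after the image.

    image_saved_at: timestamp recorded when the image was written.
    fs_last_written_at: mapping of mount point -> last-write timestamp.
    """
    newer = sorted(
        name for name, written in fs_last_written_at.items()
        if written > image_saved_at
    )
    if newer:
        raise StaleImageError(
            "filesystems modified after suspend image: " + ", ".join(newer)
        )
```

The point of the design is that it does not care why the image is stale, so it would cover a missed invalidation and other ways of booting into an old image alike.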

Are you sure you did not do

suspend kernel 1
boot kernel 2
attempt to suspend kernel 2 but fail ("not enough swap space")
boot kernel 1 ("and successfully resume, corrupting data")

?
								Pavel
-- 
teflon -- maybe it is a trademark, but it should not be.

