ext4: 3.17? problems

* ext4: 3.17? problems
@ 2014-09-28 10:44 Pavel Machek
  2014-09-28 12:46 ` Theodore Ts'o
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Pavel Machek @ 2014-09-28 10:44 UTC (permalink / raw)
  To: kernel list; +Cc: jack, linux-ext4, tytso, kernel list, adilger.kernel

Hi!

After update to debian testing, my machine sometimes fails to
reboot. (aptitude upgrade seems to be the trigger).

So I had to hard power-down the machine. That should be perfectly
safe, as ext4 has a journal, and this is plain SATA disk, right?

On next boot to Debian stable, I got stacktrace, and messages about
ext4 corruption. Back to Debian testing. systemd ran fsck, determined
it can't fix it, dropped me into emergency shell, _but mounted the
filesystem, anyway_. Oops.

Now I'm getting

fsck 1.42.12
...
Inodes that were part of a corrupted orphan linked list found <y>
Deleted inode has zero dtime <y>
(6 inodes) was part of the orphaned inode list. FIXED.
Block bitmap differences.
Free inode counts wrong.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: ext4: 3.17? problems
  2014-09-28 10:44 ext4: 3.17? problems Pavel Machek
@ 2014-09-28 12:46 ` Theodore Ts'o
  2014-09-30 21:01   ` Pavel Machek
  2014-09-29  9:36 ` Dmitry Monakhov
  2014-09-29 11:44 ` Jan Kara
  2 siblings, 1 reply; 8+ messages in thread
From: Theodore Ts'o @ 2014-09-28 12:46 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, jack, linux-ext4, adilger.kernel

On Sun, Sep 28, 2014 at 12:44:56PM +0200, Pavel Machek wrote:
> 
> After update to debian testing, my machine sometimes fails to
> reboot. (aptitude upgrade seems to be the trigger).
> 
> So I had to hard power-down the machine. That should be perfectly
> safe, as ext4 has a journal, and this is plain SATA disk, right?
> 
> On next boot to Debian stable, I got stacktrace, and messages about
> ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> it can't fix it, dropped me into emergency shell, _but mounted the
> filesystem, anyway_. Oops.

I've been running 3.17-rc4 plus the ext4 dev patches and due to either
regressions in i915 or the X server (not sure which) over the last
couple of weeks, I've had to power-down my system a number of times
after the system has hung when either shutting down the X server or
when trying to add or remove an external display.  So I've had to
unfortunately do a fair number of hard-power-offs on my T540p, and
I've not noticed anything like what you've described.

Can you give any more details?  Are you using LVM or dm-crypt?  Is
this repeatable?

Cheers,

							- Ted


* Re: ext4: 3.17? problems
  2014-09-28 10:44 ext4: 3.17? problems Pavel Machek
  2014-09-28 12:46 ` Theodore Ts'o
@ 2014-09-29  9:36 ` Dmitry Monakhov
  2014-09-29 11:44 ` Jan Kara
  2 siblings, 0 replies; 8+ messages in thread
From: Dmitry Monakhov @ 2014-09-29  9:36 UTC (permalink / raw)
  To: Pavel Machek, kernel list
  Cc: jack, linux-ext4, tytso, kernel list, adilger.kernel

On Sun, 28 Sep 2014 12:44:56 +0200, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
> 
> After update to debian testing, my machine sometimes fails to
> reboot. (aptitude upgrade seems to be the trigger).
> 
> So I had to hard power-down the machine. That should be perfectly
> safe, as ext4 has a journal, and this is plain SATA disk, right?
> 
AFAIU you have some corruption on your fs (the root cause is unknown
at this moment).
So you have the following stages:
1) fs corruption
2) boot -> mount attempt
3) fsck
During (1), once the ext4 driver finds the error, it calls ext4_error,
which tags the sb with the FS_ERROR flag.
During (2), it finds that tag and clears s_orphan, which results
in the complaint you saw during (3).

My idea is that (2) and (3) are consequences of (1).
Please provide more details (dmesg) about the initial error.
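
The two superblock fields involved in the stages above can be inspected
directly. What follows is only a minimal sketch, not the kernel's code. It
assumes that the flag called FS_ERROR corresponds to the "errors detected"
bit (0x0002) of the s_state field, that s_orphan corresponds to the
s_last_orphan field, and that both sit at the offsets given in the kernel's
ext4 on-disk layout documentation. The function names are hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024         # primary superblock starts 1024 bytes into the device
EXT4_MAGIC = 0xEF53              # s_magic, little-endian 16-bit at offset 0x38
STATE_ERRORS_DETECTED = 0x0002   # assumed to be the bit set via ext4_error


def read_superblock_summary(raw):
    """Decode the error flag and orphan list head from raw superblock bytes."""
    # s_magic (0x38) and s_state (0x3A) are adjacent little-endian 16-bit fields.
    magic, state = struct.unpack_from("<HH", raw, 0x38)
    # s_last_orphan (0xE8) holds the inode number heading the orphan list.
    (last_orphan,) = struct.unpack_from("<I", raw, 0xE8)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    return {
        "errors_detected": bool(state & STATE_ERRORS_DETECTED),
        "last_orphan": last_orphan,
    }


def summarize_device(path):
    """Read the primary superblock from a device or image file (read-only)."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return read_superblock_summary(dev.read(1024))
```

If errors_detected is true while last_orphan is 0 after a mount attempt,
that would be consistent with stage (2) having cleared the list, although
it does not prove it.
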
> On next boot to Debian stable, I got stacktrace, and messages about
> ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> it can't fix it, dropped me into emergency shell, _but mounted the
> filesystem, anyway_. Oops.
> 
> Now I'm getting
> 
> fsck 1.42.12
> ...
> Inodes that were part of a corrupted orphan linked list found <y>
> Deleted inode has zero dtime <y>
> (6 inodes) was part of the orphaned inode list. FIXED.
> Block bitmap differences.
> Free inode counts wrong.


* Re: ext4: 3.17? problems
  2014-09-28 10:44 ext4: 3.17? problems Pavel Machek
  2014-09-28 12:46 ` Theodore Ts'o
  2014-09-29  9:36 ` Dmitry Monakhov
@ 2014-09-29 11:44 ` Jan Kara
  2 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2014-09-29 11:44 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, jack, linux-ext4, tytso, adilger.kernel

  Hello,

On Sun 28-09-14 12:44:56, Pavel Machek wrote:
> After update to debian testing, my machine sometimes fails to
> reboot. (aptitude upgrade seems to be the trigger).
> 
> So I had to hard power-down the machine. That should be perfectly
> safe, as ext4 has a journal, and this is plain SATA disk, right?
  Yes, it should be safe.

> On next boot to Debian stable, I got stacktrace, and messages about
> ext4 corruption. Back to Debian testing. systemd ran fsck, determined
  It would be really good to get those messages... Ideally you could also
use
  e2image -r <partition> - | bzip2 -c > <image file>
to store fs metadata before doing anything else with the fs to a usb stick.
That is invaluable for future analysis.
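
The same capture step can be scripted. This is a minimal sketch under stated
assumptions: it relies on the form shown in the e2image man page, where
giving "-" as the image file writes the raw image to standard output, and it
compresses with Python's bz2 module instead of piping to the bzip2 binary.
The function names and the example paths are hypothetical.

```python
import bz2
import shutil
import subprocess


def e2image_command(partition):
    """Build the e2image invocation that writes a raw metadata image to stdout."""
    # "-r" selects a raw image; "-" as the image file means standard output.
    return ["e2image", "-r", partition, "-"]


def save_metadata_image(partition, out_path):
    """Stream the metadata image of `partition` into a bzip2-compressed file."""
    with bz2.open(out_path, "wb") as out:
        proc = subprocess.Popen(e2image_command(partition), stdout=subprocess.PIPE)
        # Copy in chunks so the (mostly zero-filled) raw image is never held in memory.
        shutil.copyfileobj(proc.stdout, out)
        proc.stdout.close()
        if proc.wait() != 0:
            raise RuntimeError("e2image exited with status %d" % proc.returncode)
```

For example, save_metadata_image("/dev/sdX1", "/mnt/usb/sdX1.e2i.bz2"), run
as root while the filesystem is not mounted read-write, with both paths
replaced by real ones.
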

> it can't fix it, dropped me into emergency shell, _but mounted the
> filesystem, anyway_. Oops.
  What kernel versions are you running in Debian testing and stable?
My guess would be that kernel had problems only during orphan inode
recovery (i.e. when deleting already deleted files) and we let the mount
proceed if this fails because it's a relatively harmless problem.

> Now I'm getting
> 
> fsck 1.42.12
> ...
> Inodes that were part of a corrupted orphan linked list found <y>
> Deleted inode has zero dtime <y>
> (6 inodes) was part of the orphaned inode list. FIXED.
> Block bitmap differences.
> Free inode counts wrong.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: ext4: 3.17? problems
  2014-09-28 12:46 ` Theodore Ts'o
@ 2014-09-30 21:01   ` Pavel Machek
  2014-09-30 23:18     ` Henrique de Moraes Holschuh
  2014-10-01  8:48     ` Jan Kara
  0 siblings, 2 replies; 8+ messages in thread
From: Pavel Machek @ 2014-09-30 21:01 UTC (permalink / raw)
  To: Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel,
	Dmitry Monakhov

Hi!

On Sun 2014-09-28 08:46:58, Theodore Ts'o wrote:
> On Sun, Sep 28, 2014 at 12:44:56PM +0200, Pavel Machek wrote:
> > 
> > After update to debian testing, my machine sometimes fails to
> > reboot. (aptitude upgrade seems to be the trigger).
> > 
> > So I had to hard power-down the machine. That should be perfectly
> > safe, as ext4 has a journal, and this is plain SATA disk, right?
> > 
> > On next boot to Debian stable, I got stacktrace, and messages about
> > ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> > it can't fix it, dropped me into emergency shell, _but mounted the
> > filesystem, anyway_. Oops.
> 
> I've been running 3.17-rc4 plus the ext4 dev patches and due to either
> regressions in i915 or the X server (not sure which) over the last
> couple of weeks, I've had to power-down my system a number of times
> after the system has hung when either shutting down the X server or
> when trying to add or remove an external display.  So I've had to
> unfortunately do a fair number of hard-power-offs on my T540p, and
> I've not noticed anything like what you've described.

Ok, I'm not 100% sure it was 3.17-rcX... but according to the logs, it
was: 3.17-rc4.

> Can you give any more details?  Are you using LVM or dm-crypt?  Is
> this repeatable?

No, I don't think it is repeatable in a useful way for debugging, but it
is not the first time it has happened here. No LVM or dm-crypt in use.

> > So I had to hard power-down the machine. That should be perfectly
> > safe, as ext4 has a journal, and this is plain SATA disk, right?
> > 
> AFAIU you have some corruption on your fs (the root cause is unknown
> at this moment).
> So you have the following stages:
> 1) fs corruption
> 2) boot -> mount attempt
> 3) fsck
> During (1), once the ext4 driver finds the error, it calls ext4_error,
> which tags the sb with the FS_ERROR flag.
> During (2), it finds that tag and clears s_orphan, which results
> in the complaint you saw during (3).

I tried to search syslog, but could not find the original messages. It
happened during shutdown. I guess syslog was already stopped at that
point...?

Logs say:

Sep 28 11:45:38 amd NetworkManager[3422]: <info> Activation (tun0) successful, device activated.
Sep 28 11:45:38 amd nm-dispatcher: Dispatching action 'up' for tun0
Sep 28 11:45:39 amd systemd[1]: Stopping OpenBSD Secure Shell server...
Sep 28 11:45:39 amd systemd[1]: Starting OpenBSD Secure Shell server...
Sep 28 11:45:39 amd systemd[1]: Started OpenBSD Secure Shell server.
Sep 28 11:45:41 amd NetworkManager[3422]: <warn> Could not send ARP for local address 10.10.0.14: Failed to execute child process "/sbin/arping" (No such file or directory)
Sep 28 11:45:49 amd ntpdate[1413]: adjust time server 193.85.174.5 offset 0.002797 sec
Sep 28 12:17:01 amd /USR/SBIN/CRON[3612]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Sep 28 12:58:12 amd rsyslogd: [origin software="rsyslogd" swVersion="8.4.0" x-pid="3380" x-info="http://www.rsyslog.com"] start
Sep 28 12:58:12 amd systemd[1]: Starting Load Kernel Modules...
Sep 28 12:58:12 amd systemd[1]: Mounted POSIX Message Queue File System.
Sep 28 12:58:12 amd systemd[1]: Starting udev Kernel Socket.
Sep 28 12:58:12 amd systemd[1]: Listening on udev Kernel Socket.
Sep 28 12:58:12 amd systemd[1]: Starting udev Control Socket.
Sep 28 12:58:12 amd systemd[1]: Listening on udev Control Socket.
Sep 28 12:58:12 amd systemd[1]: Starting udev Coldplug all Devices...
Sep 28 12:58:12 amd systemd[1]: Started Set Up Additional Binary Formats.
Sep 28 12:58:12 amd systemd[1]: Starting Dispatch Password Requests to Console Directory Watch.
Sep 28 12:58:12 amd systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
Sep 28 12:58:12 amd systemd[1]: Mounting Debug File System...
Sep 28 12:58:12 amd kernel: Initializing cgroup subsys cpu
Sep 28 12:58:12 amd kernel: Linux version 3.17.0-rc4 (pavel@amd) (gcc version 4.9.1 (Debian 4.9.1-12) ) #1 SMP Sun Sep 14 21:24:53 CEST 2014

> > After update to debian testing, my machine sometimes fails to
> > reboot. (aptitude upgrade seems to be the trigger).
> > 
> > So I had to hard power-down the machine. That should be perfectly
> > safe, as ext4 has a journal, and this is plain SATA disk, right?
>   Yes, it should be safe.

Good.

> > On next boot to Debian stable, I got stacktrace, and messages about
> > ext4 corruption. Back to Debian testing. systemd ran fsck, determined
>   It would be really good to get those messages... Ideally you could also
> use
>   e2image -r <partition> - | bzip2 -c > <image file>
> to store fs metadata before doing anything else with the fs to a usb stick.
> That is invaluable for future analysis.

Too late for that :-(.

> > it can't fix it, dropped me into emergency shell, _but mounted the
> > filesystem, anyway_. Oops.
>   What kernel versions are you running in Debian testing and stable?

Debian testing was 3.17-rc4, AFAICT. For debian stable -- not sure.

> My guess would be that kernel had problems only during orphan inode
> recovery (i.e. when deleting already deleted files) and we let the mount
> proceed if this fails because it's a relatively harmless problem.

Is there some phase during shutdown where journalling no longer
protects fs integrity?

Thanks,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: ext4: 3.17? problems
  2014-09-30 21:01   ` Pavel Machek
@ 2014-09-30 23:18     ` Henrique de Moraes Holschuh
  2014-10-01  8:50       ` Jan Kara
  2014-10-01  8:48     ` Jan Kara
  1 sibling, 1 reply; 8+ messages in thread
From: Henrique de Moraes Holschuh @ 2014-09-30 23:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel,
	Dmitry Monakhov

On Tue, 30 Sep 2014, Pavel Machek wrote:
> > > So I had to hard power-down the machine. That should be perfectly
> > > safe, as ext4 has a journal, and this is plain SATA disk, right?
> >   Yes, it should be safe.
> 
> Good.

...

> Is there some phase during shutdown where journalling no longer
> protects fs integrity?

Hmm... what kind of backing device?  Because I have Crucial/Micron M500 SSDs
here that _always_ complain (in a SMART counter/attribute) that they have
been subject to a sudden poweroff *when subject to a normal system
shutdown*.

This is scaring me a great deal.  Are we doing something different for SSDs
in the scsi-sd or libata shutdown paths?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh


* Re: ext4: 3.17? problems
  2014-09-30 21:01   ` Pavel Machek
  2014-09-30 23:18     ` Henrique de Moraes Holschuh
@ 2014-10-01  8:48     ` Jan Kara
  1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2014-10-01  8:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel,
	Dmitry Monakhov

On Tue 30-09-14 23:01:08, Pavel Machek wrote:
> > > On next boot to Debian stable, I got stacktrace, and messages about
> > > ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> >   It would be really good to get those messages... Ideally you could also
> > use
> >   e2image -r <partition> - | bzip2 -c > <image file>
> > to store fs metadata before doing anything else with the fs to a usb stick.
> > That is invaluable for future analysis.
> 
> Too late for that :-(.
  OK, you can take a note for next time ;)

> > > it can't fix it, dropped me into emergency shell, _but mounted the
> > > filesystem, anyway_. Oops.
> >   What kernel versions are you running in Debian testing and stable?
> 
> Debian testing was 3.17-rc4, AFAICT. For debian stable -- not sure.
  OK, there were some changes to orphan list locking in 3.17-rc1. If I
screwed up it could cause orphan list corruption. But for now I don't think
that's the issue.

> > My guess would be that kernel had problems only during orphan inode
> > recovery (i.e. when deleting already deleted files) and we let the mount
> > proceed if this fails because it's a relatively harmless problem.
> 
> Is there some phase during shutdown where journalling no longer
> protects fs integrity?
  No. We first finish all modifications to the fs and only after that clean
up the journal. So that makes all changes to the fs protected.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: ext4: 3.17? problems
  2014-09-30 23:18     ` Henrique de Moraes Holschuh
@ 2014-10-01  8:50       ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2014-10-01  8:50 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Pavel Machek, Theodore Ts'o, kernel list, jack, linux-ext4,
	adilger.kernel, Dmitry Monakhov, linux-scsi

On Tue 30-09-14 20:18:01, Henrique de Moraes Holschuh wrote:
> On Tue, 30 Sep 2014, Pavel Machek wrote:
> > > > So I had to hard power-down the machine. That should be perfectly
> > > > safe, as ext4 has a journal, and this is plain SATA disk, right?
> > >   Yes, it should be safe.
> > 
> > Good.
> 
> ...
> 
> > Is there some phase during shutdown where journalling no longer
> > protects fs integrity?
> 
> Hmm... what kind of backing device?  Because I have Crucial/Micron M500 SSDs
> here that _always_ complain (in a SMART counter/attribute) that they have
> been subject to a sudden poweroff *when subject to a normal system
> shutdown*.
> 
> This is scaring me a great deal.  Are we doing something different for SSDs
> in the scsi-sd or libata shutdown paths?
  Nothing I'm aware of but this is more a question for SCSI guys (added to
CC).

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


end of thread

Thread overview: 8+ messages
2014-09-28 10:44 ext4: 3.17? problems Pavel Machek
2014-09-28 12:46 ` Theodore Ts'o
2014-09-30 21:01   ` Pavel Machek
2014-09-30 23:18     ` Henrique de Moraes Holschuh
2014-10-01  8:50       ` Jan Kara
2014-10-01  8:48     ` Jan Kara
2014-09-29  9:36 ` Dmitry Monakhov
2014-09-29 11:44 ` Jan Kara
