* ext4: 3.17? problems
From: Pavel Machek @ 2014-09-28 10:44 UTC
To: kernel list; +Cc: jack, linux-ext4, tytso, kernel list, adilger.kernel

Hi!

After update to debian testing, my machine sometimes fails to
reboot. (aptitude upgrade seems to be the trigger).

So I had to hard power-down the machine. That should be perfectly
safe, as ext4 has a journal, and this is plain SATA disk, right?

On next boot to Debian stable, I got stacktrace, and messages about
ext4 corruption. Back to Debian testing. systemd ran fsck, determined
it can't fix it, dropped me into emergency shell, _but mounted the
filesystem, anyway_. Oops.

Now I'm getting

fsck 1.42.12
...
Inodes that were part of a corrupted orphan linked list found <y>
Deleted inode has zero dtime <y>
(6 inodes) was part of the orphaned inode list. FIXED.
Block bitmap differences.
Free inode counts wrong.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: ext4: 3.17? problems
From: Theodore Ts'o @ 2014-09-28 12:46 UTC
To: Pavel Machek; +Cc: kernel list, jack, linux-ext4, adilger.kernel

On Sun, Sep 28, 2014 at 12:44:56PM +0200, Pavel Machek wrote:
>
> After update to debian testing, my machine sometimes fails to
> reboot. (aptitude upgrade seems to be the trigger).
>
> So I had to hard power-down the machine. That should be perfectly
> safe, as ext4 has a journal, and this is plain SATA disk, right?
>
> On next boot to Debian stable, I got stacktrace, and messages about
> ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> it can't fix it, dropped me into emergency shell, _but mounted the
> filesystem, anyway_. Oops.

I've been running 3.17-rc4 plus the ext4 dev patches and due to either
regressions in i915 or the X server (not sure which) over the last
couple of weeks, I've had to power-down my system a number of times
after the system has hung when either shutting down the X server or
when trying to add or remove an external display. So I've had to
unfortunately do a fair number of hard-power-offs on my T540p, and
I've not noticed anything like what you've described.

Can you give any more details? Are you using LVM or dm-crypt? Is
this repeatable?

Cheers,

- Ted
* Re: ext4: 3.17? problems
From: Pavel Machek @ 2014-09-30 21:01 UTC
To: Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel, Dmitry Monakhov

Hi!

On Sun 2014-09-28 08:46:58, Theodore Ts'o wrote:
> On Sun, Sep 28, 2014 at 12:44:56PM +0200, Pavel Machek wrote:
> >
> > After update to debian testing, my machine sometimes fails to
> > reboot. (aptitude upgrade seems to be the trigger).
> >
> > So I had to hard power-down the machine. That should be perfectly
> > safe, as ext4 has a journal, and this is plain SATA disk, right?
> >
> > On next boot to Debian stable, I got stacktrace, and messages about
> > ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> > it can't fix it, dropped me into emergency shell, _but mounted the
> > filesystem, anyway_. Oops.
>
> I've been running 3.17-rc4 plus the ext4 dev patches and due to either
> regressions in i915 or the X server (not sure which) over the last
> couple of weeks, I've had to power-down my system a number of times
> after the system has hung when either shutting down the X server or
> when trying to add or remove an external display. So I've had to
> unfortunately do a fair number of hard-power-offs on my T540p, and
> I've not noticed anything like what you've described.

Ok, I'm not 100% sure it was 3.17-rcX... but according to logs, it
is 3.17-rc4.

> Can you give any more details? Are you using LVM or dm-crypt? Is
> this repeatable?

No, I don't think it is repeatable in a useful way for debugging, but it
is not the first time it happened here. No LVM or dm-crypt in use.

> > So I had to hard power-down the machine. That should be perfectly
> > safe, as ext4 has a journal, and this is plain SATA disk, right?
> >
> AFAIU you have some corruption on your fs (the root cause is unknown
> at this moment)
> So you have the following stages:
> 1) fs corruption
> 2) boot -> mount attempt
> 3) fsck
> During (1), once the ext4 driver finds this error, it will call ext4_error,
> which will tag the sb with the FS_ERROR flag.
> During (2) it will find that tag and clear s_orphan, which results
> in the complaint you have seen during (3)

I tried to search syslog, but could not find original messages. It
happened during shutdown. I guess syslog was already stopped at that
point?

Logs say:

Sep 28 11:45:38 amd NetworkManager[3422]: <info> Activation (tun0) successful, device activated.
Sep 28 11:45:38 amd nm-dispatcher: Dispatching action 'up' for tun0
Sep 28 11:45:39 amd systemd[1]: Stopping OpenBSD Secure Shell server...
Sep 28 11:45:39 amd systemd[1]: Starting OpenBSD Secure Shell server...
Sep 28 11:45:39 amd systemd[1]: Started OpenBSD Secure Shell server.
Sep 28 11:45:41 amd NetworkManager[3422]: <warn> Could not send ARP for local address 10.10.0.14: Failed to execute child process "/sbin/arping" (No such file or directory)
Sep 28 11:45:49 amd ntpdate[1413]: adjust time server 193.85.174.5 offset 0.002797 sec
Sep 28 12:17:01 amd /USR/SBIN/CRON[3612]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 28 12:58:12 amd rsyslogd: [origin software="rsyslogd" swVersion="8.4.0" x-pid="3380" x-info="http://www.rsyslog.com"] start
Sep 28 12:58:12 amd systemd[1]: Starting Load Kernel Modules...
Sep 28 12:58:12 amd systemd[1]: Mounted POSIX Message Queue File System.
Sep 28 12:58:12 amd systemd[1]: Starting udev Kernel Socket.
Sep 28 12:58:12 amd systemd[1]: Listening on udev Kernel Socket.
Sep 28 12:58:12 amd systemd[1]: Starting udev Control Socket.
Sep 28 12:58:12 amd systemd[1]: Listening on udev Control Socket.
Sep 28 12:58:12 amd systemd[1]: Starting udev Coldplug all Devices...
Sep 28 12:58:12 amd systemd[1]: Started Set Up Additional Binary Formats.
Sep 28 12:58:12 amd systemd[1]: Starting Dispatch Password Requests to Console Directory Watch.
Sep 28 12:58:12 amd systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
Sep 28 12:58:12 amd systemd[1]: Mounting Debug File System...
Sep 28 12:58:12 amd kernel: Initializing cgroup subsys cpu
Sep 28 12:58:12 amd kernel: Linux version 3.17.0-rc4 (pavel@amd) (gcc version 4.9.1 (Debian 4.9.1-12) ) #1 SMP Sun Sep 14 21:24:53 CEST 2014

> > After update to debian testing, my machine sometimes fails to
> > reboot. (aptitude upgrade seems to be the trigger).
> >
> > So I had to hard power-down the machine. That should be perfectly
> > safe, as ext4 has a journal, and this is plain SATA disk, right?
> Yes, it should be safe.

Good.

> > On next boot to Debian stable, I got stacktrace, and messages about
> > ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> It would be really good to get those messages... Ideally you could also
> use
>   e2image -r <partition> | bzip2 -c
> to store fs metadata before doing anything else with the fs to a usb stick.
> That is invaluable for future analysis.

Too late for that :-(.

> > it can't fix it, dropped me into emergency shell, _but mounted the
> > filesystem, anyway_. Oops.
> What kernel versions are you running in Debian testing and stable?

Debian testing was 3.17-rc4, AFAICT. For debian stable -- not sure.

> My guess would be that kernel had problems only during orphan inode
> recovery (i.e. when deleting already deleted files) and we let the mount
> proceed if this fails because it's a relatively harmless problem.

Is there some phase during shutdown where journalling no longer
protects fs integrity?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
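The log excerpt in this message jumps from the 12:17:01 cron entry straight to the rsyslogd start at 12:58:12, which fits the guess that syslog had already stopped when the failure happened. A minimal Python sketch for locating such gaps in a traditional syslog file follows. The function name `boot_gaps` and the `year` parameter are hypothetical, and the matching on "rsyslogd:" plus a trailing "start" is an assumption based only on the line quoted above.

```python
from datetime import datetime


def boot_gaps(lines, year=2014):
    """Return (last_stamp_before, rsyslogd_start_stamp) pairs.

    Assumes classic syslog lines that begin with a 15-character
    'Mon DD HH:MM:SS' timestamp, which carries no year.
    """
    gaps = []
    prev = None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        # Assumption: rsyslogd announces a (re)start with a line ending in "start".
        if "rsyslogd:" in line and line.rstrip().endswith("start") and prev is not None:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps


if __name__ == "__main__":
    sample = [
        "Sep 28 12:17:01 amd /USR/SBIN/CRON[3612]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
        'Sep 28 12:58:12 amd rsyslogd: [origin software="rsyslogd" swVersion="8.4.0" x-pid="3380" x-info="http://www.rsyslog.com"] start',
    ]
    for before, after in boot_gaps(sample):
        print("nothing logged between", before, "and", after)
```

Anything the kernel printed inside such a window never reached this file, so a persistent journal or a serial/netconsole capture would be the place to look on the next occurrence.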
* Re: ext4: 3.17? problems
From: Henrique de Moraes Holschuh @ 2014-09-30 23:18 UTC
To: Pavel Machek
Cc: Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel, Dmitry Monakhov

On Tue, 30 Sep 2014, Pavel Machek wrote:
> > > So I had to hard power-down the machine. That should be perfectly
> > > safe, as ext4 has a journal, and this is plain SATA disk, right?
> > Yes, it should be safe.
>
> Good.

...

> Is there some phase during shutdown where journalling no longer
> protects fs integrity?

Hmm... what kind of backing device? Because I have Crucial/Micron M500 SSDs
here that _always_ complain (in a SMART counter/attribute) that they have
been subject to a sudden poweroff *when subject to a normal system
shutdown*.

This is scaring me a great deal. Are we doing something different for SSDs
in the scsi-sd or libata shutdown paths?

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
* Re: ext4: 3.17? problems
From: Jan Kara @ 2014-10-01 8:50 UTC
To: Henrique de Moraes Holschuh
Cc: Pavel Machek, Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel, Dmitry Monakhov, linux-scsi

On Tue 30-09-14 20:18:01, Henrique de Moraes Holschuh wrote:
> On Tue, 30 Sep 2014, Pavel Machek wrote:
> > > > So I had to hard power-down the machine. That should be perfectly
> > > > safe, as ext4 has a journal, and this is plain SATA disk, right?
> > > Yes, it should be safe.
> >
> > Good.
>
> ...
>
> > Is there some phase during shutdown where journalling no longer
> > protects fs integrity?
>
> Hmm... what kind of backing device? Because I have Crucial/Micron M500 SSDs
> here that _always_ complain (in a SMART counter/attribute) that they have
> been subject to a sudden poweroff *when subject to a normal system
> shutdown*.
>
> This is scaring me a great deal. Are we doing something different for SSDs
> in the scsi-sd or libata shutdown paths?

Nothing I'm aware of but this is more a question for SCSI guys (added to
CC).

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: ext4: 3.17? problems
From: Jan Kara @ 2014-10-01 8:48 UTC
To: Pavel Machek
Cc: Theodore Ts'o, kernel list, jack, linux-ext4, adilger.kernel, Dmitry Monakhov

On Tue 30-09-14 23:01:08, Pavel Machek wrote:
> > > On next boot to Debian stable, I got stacktrace, and messages about
> > > ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> > It would be really good to get those messages... Ideally you could also
> > use
> >   e2image -r <partition> | bzip2 -c
> > to store fs metadata before doing anything else with the fs to a usb stick.
> > That is invaluable for future analysis.
>
> Too late for that :-(.

OK, you can take a note for next time ;)

> > > it can't fix it, dropped me into emergency shell, _but mounted the
> > > filesystem, anyway_. Oops.
> > What kernel versions are you running in Debian testing and stable?
>
> Debian testing was 3.17-rc4, AFAICT. For debian stable -- not sure.

OK, there were some changes to orphan list locking in 3.17-rc1. If I
screwed up it could cause orphan list corruption. But for now I don't think
that's the issue.

> > My guess would be that kernel had problems only during orphan inode
> > recovery (i.e. when deleting already deleted files) and we let the mount
> > proceed if this fails because it's a relatively harmless problem.
>
> Is there some phase during shutdown where journalling no longer
> protects fs integrity?

No. We first finish all modifications to the fs and only after that
clean up the journal. So that makes all changes to the fs protected.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
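The ordering described in this reply (finish the modifications first, clean up the journal only afterwards) can be pictured with a toy write-ahead journal. This is a minimal Python sketch of the general technique, not jbd2 or ext4 code. `ToyDisk`, `commit`, `checkpoint`, `replay` and the `crash_after` parameter are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Hypothetical model: 'blocks' is the home location of metadata,
    'journal' holds committed transactions not yet cleaned up."""
    blocks: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)


def commit(disk, updates):
    # Step 1: the whole transaction becomes durable in the journal first.
    disk.journal.append(dict(updates))


def checkpoint(disk, crash_after=None):
    # Step 2: copy journalled updates to their home location.
    # Step 3: only then drop the transaction from the journal.
    # crash_after simulates power loss after that many block writes.
    while disk.journal:
        txn = disk.journal[0]
        for written, (block, value) in enumerate(txn.items()):
            if crash_after is not None and written >= crash_after:
                return False  # simulated power cut; journal entry still present
            disk.blocks[block] = value
        disk.journal.pop(0)
    return True


def replay(disk):
    # Mount-time recovery: re-apply every transaction still in the journal.
    for txn in disk.journal:
        disk.blocks.update(txn)
    disk.journal.clear()


if __name__ == "__main__":
    disk = ToyDisk()
    commit(disk, {"block_bitmap": 1, "inode_table": 2})
    checkpoint(disk, crash_after=1)   # power cut halfway through
    print("after crash :", disk.blocks, "journal entries:", len(disk.journal))
    replay(disk)
    print("after replay:", disk.blocks, "journal entries:", len(disk.journal))
```

Because the journal entry is removed only after every block of the transaction has reached its home location, a power cut at any point in `checkpoint` leaves enough in the journal for `replay` to finish the job. That is the property the reply relies on for the shutdown path.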
* Re: ext4: 3.17? problems
From: Dmitry Monakhov @ 2014-09-29 9:36 UTC
To: Pavel Machek, kernel list
Cc: jack, linux-ext4, tytso, kernel list, adilger.kernel

On Sun, 28 Sep 2014 12:44:56 +0200, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
> After update to debian testing, my machine sometimes fails to
> reboot. (aptitude upgrade seems to be the trigger).
>
> So I had to hard power-down the machine. That should be perfectly
> safe, as ext4 has a journal, and this is plain SATA disk, right?
>
AFAIU you have some corruption on your fs (the root cause is unknown
at this moment)
So you have the following stages:
1) fs corruption
2) boot -> mount attempt
3) fsck
During (1), once the ext4 driver finds this error, it will call ext4_error,
which will tag the sb with the FS_ERROR flag.
During (2) it will find that tag and clear s_orphan, which results
in the complaint you have seen during (3)

My idea is that (2) and (3) are consequences of (1).
Please provide more details (dmesg) about the initial error.
> On next boot to Debian stable, I got stacktrace, and messages about
> ext4 corruption. Back to Debian testing. systemd ran fsck, determined
> it can't fix it, dropped me into emergency shell, _but mounted the
> filesystem, anyway_. Oops.
>
> Now I'm getting
>
> fsck 1.42.12
> ...
> Inodes that were part of a corrupted orphan linked list found <y>
> Deleted inode has zero dtime <y>
> (6 inodes) was part of the orphaned inode list. FIXED.
> Block bitmap differences.
> Free inode counts wrong.
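The three stages in Dmitry's message can be modelled in a few lines. This is a toy Python sketch that follows his description only; it is not the kernel's mount path or e2fsck's logic. `ToyFs`, `mount`, `fsck_stale_orphans` and the inode numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFs:
    """Hypothetical stand-in for the superblock fields Dmitry mentions."""
    error_flag: bool = False          # set by stage (1), the ext4_error path
    orphan_head: int = 0              # head of the orphan list, 0 = empty
    orphan_links: dict = field(default_factory=dict)  # inode -> next inode


def mount(fs):
    # Stage (2): with the error flag set, the list head is simply dropped.
    # The inodes themselves keep their stale list links.
    if fs.error_flag:
        fs.orphan_head = 0
        return "skipped orphan recovery"
    # Normal path: walk the list and release each orphaned inode.
    ino = fs.orphan_head
    while ino:
        ino = fs.orphan_links.pop(ino)
    fs.orphan_head = 0
    return "orphans cleaned"


def fsck_stale_orphans(fs):
    # Stage (3): report inodes that still carry a list link but are
    # no longer reachable from the list head.
    reachable = set()
    ino = fs.orphan_head
    while ino and ino not in reachable:
        reachable.add(ino)
        ino = fs.orphan_links.get(ino, 0)
    return sorted(set(fs.orphan_links) - reachable)


if __name__ == "__main__":
    fs = ToyFs(orphan_head=12, orphan_links={12: 15, 15: 0})
    fs.error_flag = True                      # stage (1)
    print(mount(fs))                          # stage (2)
    print("fsck would flag inodes:", fsck_stale_orphans(fs))  # stage (3)
```

In this model the fsck complaint is a symptom of the earlier error flag rather than a second, independent corruption, which is why the initial kernel messages matter most for diagnosis.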
* Re: ext4: 3.17? problems
From: Jan Kara @ 2014-09-29 11:44 UTC
To: Pavel Machek; +Cc: kernel list, jack, linux-ext4, tytso, adilger.kernel

Hello,

On Sun 28-09-14 12:44:56, Pavel Machek wrote:
> After update to debian testing, my machine sometimes fails to
> reboot. (aptitude upgrade seems to be the trigger).
>
> So I had to hard power-down the machine. That should be perfectly
> safe, as ext4 has a journal, and this is plain SATA disk, right?
Yes, it should be safe.

> On next boot to Debian stable, I got stacktrace, and messages about
> ext4 corruption. Back to Debian testing. systemd ran fsck, determined
It would be really good to get those messages... Ideally you could also
use
  e2image -r <partition> | bzip2 -c
to store fs metadata before doing anything else with the fs to a usb stick.
That is invaluable for future analysis.

> it can't fix it, dropped me into emergency shell, _but mounted the
> filesystem, anyway_. Oops.
What kernel versions are you running in Debian testing and stable?

My guess would be that kernel had problems only during orphan inode
recovery (i.e. when deleting already deleted files) and we let the mount
proceed if this fails because it's a relatively harmless problem.

> Now I'm getting
>
> fsck 1.42.12
> ...
> Inodes that were part of a corrupted orphan linked list found <y>
> Deleted inode has zero dtime <y>
> (6 inodes) was part of the orphaned inode list. FIXED.
> Block bitmap differences.
> Free inode counts wrong.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
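The e2image command in this message is shorthand and names no output. The example in the e2image(8) man page passes `-` as the image file so the raw image goes to standard output and can be piped into bzip2. A minimal Python sketch under that assumption follows. The helper names are hypothetical, and the device and output paths in the usage note are placeholders, not values from this thread.

```python
import subprocess


def build_e2image_pipeline(device):
    """Return the two argv lists for: e2image -r DEVICE - | bzip2 -c"""
    dump_cmd = ["e2image", "-r", device, "-"]   # '-' = write image to stdout
    compress_cmd = ["bzip2", "-c"]              # compress stdin to stdout
    return dump_cmd, compress_cmd


def save_metadata_image(device, out_path):
    """Run the pipeline and store the compressed metadata image at out_path.

    Needs root and an unmounted (or read-only) filesystem to be meaningful;
    out_path should live on a different device, e.g. a USB stick.
    """
    dump_cmd, compress_cmd = build_e2image_pipeline(device)
    with open(out_path, "wb") as out:
        dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
        compress = subprocess.Popen(compress_cmd, stdin=dump.stdout, stdout=out)
        dump.stdout.close()  # let e2image see a broken pipe if bzip2 exits early
        compress_rc = compress.wait()
        dump_rc = dump.wait()
    if dump_rc != 0 or compress_rc != 0:
        raise RuntimeError(f"e2image exited {dump_rc}, bzip2 exited {compress_rc}")
```

A call would look like `save_metadata_image("/dev/sdXN", "/mnt/usb/sdXN.e2i.bz2")`, with both paths replaced by real ones. Taking the image before running fsck keeps the pre-repair metadata available for later analysis.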