linux-btrfs.vger.kernel.org archive mirror
* 3.11.5 kernel infinite loop
@ 2013-11-06  1:52 Russell Coker
  2013-11-06 12:37 ` Duncan
  0 siblings, 1 reply; 3+ messages in thread
From: Russell Coker @ 2013-11-06  1:52 UTC
  To: linux-btrfs

I have a system running the Debian package of 3.11.5 with an AMD Opteron 1212 
processor (2*64bit cores), 8G of RAM, and an Intel 120G SSD for the root and 
home subvols.  It has a RAID-1 array of 2*3TB disks for bulk storage (movies 
etc) but that probably isn't relevant to this problem.

On the root filesystem I have cron jobs making daily snapshots of / and /home 
and additional snapshots of /home every 15 minutes.  At midnight a cron job 
removes older snapshots.  For the last 8 days the system has been reliably 
hanging at about 5 minutes after midnight and the subvol removal cron job is 
the only thing that has happened then.

So it seems clear to me that on my system 3.11.5 hangs a few minutes 
after ~98 subvols are removed at the same time.

Last night I watched it happen and deleted a few dozen extra subvols to test 
whether it would repeat.  That wasn't such a good idea and I rebooted the 
system many times before giving up and booting 3.10.11 which is now working 
correctly.

When running 3.11.5 I was seeing kernel log messages such as the following 
shortly after boot.  After that it got into a state where an ssh session 
didn't work and the X login prompt didn't even flash its cursor.  In that 
state it could still forward packets (the system in question is an ethernet 
bridge which I use to connect my workstation to the Internet) but couldn't do 
much else.  The NFS server processes locked and sshd wouldn't complete the 
login process for new connection attempts.

[   68.056003] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-cleaner:270]     
[   68.144004] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:271]
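
For anyone collecting similar reports, here is a minimal sketch of pulling the CPU, task name, PID and stall time out of soft lockup lines like the two above. The regular expression is inferred only from those two lines, and the function and field names are made up for illustration; other kernel versions may format the message differently.

```python
import re

# Pattern inferred from the two log lines quoted above (an assumption,
# not a documented kernel log format). Field names are illustrative.
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup line found in log_text."""
    events = []
    for match in SOFT_LOCKUP_RE.finditer(log_text):
        events.append({
            "uptime": float(match.group("uptime")),    # seconds since boot
            "cpu": int(match.group("cpu")),            # CPU that stalled
            "seconds": int(match.group("seconds")),    # reported stall time
            "task": match.group("task"),               # truncated task name
            "pid": int(match.group("pid")),
        })
    return events


# The two lines from this report, used as sample input.
SAMPLE = """\
[   68.056003] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-cleaner:270]
[   68.144004] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:271]
"""

if __name__ == "__main__":
    for event in parse_soft_lockups(SAMPLE):
        print(event["cpu"], event["task"], event["pid"], event["seconds"])
```

Feeding it the saved kernel log from several boots would show whether the same two btrfs threads are the ones stuck each time.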

Prior to the lockup those two kernel processes had used most CPU time.  I'm 
not sure whether prior to the lockup they were in some sort of CPU loop or 
whether they were just reading a lot of data from a fast SSD and acting 
correctly.

As an aside I ordered a replacement server last week when I wasn't sure if 
this was a hardware or a software problem.  This will allow me to test some 
things in more detail on the old server after the new one is running, however 
I don't own a spare SSD so if it's an SSD-specific issue then I have limited 
ability to test.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: 3.11.5 kernel infinite loop
  2013-11-06  1:52 3.11.5 kernel infinite loop Russell Coker
@ 2013-11-06 12:37 ` Duncan
  2013-11-09  1:22   ` Chris Samuel
  0 siblings, 1 reply; 3+ messages in thread
From: Duncan @ 2013-11-06 12:37 UTC
  To: linux-btrfs

Russell Coker posted on Wed, 06 Nov 2013 12:52:38 +1100 as excerpted:

> I have a system running the Debian package of 3.11.5 with an AMD Opteron
> 1212 processor (2*64bit cores), 8G of RAM, and an Intel 120G SSD for the
> root and home subvols.  It has a RAID-1 array of 2*3TB disks for bulk
> storage (movies etc) but that probably isn't relevant to this problem.
> 
> On the root filesystem I have cron jobs making daily snapshots of / and
> /home and additional snapshots of /home every 15 minutes.  At midnight a
> cron job removes older snapshots.  For the last 8 days the system has
> been reliably hanging at about 5 minutes after midnight and the subvol
> removal cron job is the only thing that has happened then.

I believe there's a btrfs-critical stable-series patch in 3.11.6, that 
you're probably missing with 3.11.5.  (There were unfortunately some 
crossed signals and the patch was skipped for a couple weeks after it 
should have gone in, but it's in now.)

Yes... Just checked the 3.11.6 changelog:

Josef Bacik (1):
      Btrfs: use right root when checking for hash collision
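
As a rough illustration of the check implied above (is a given 3.11.y kernel older than 3.11.6?), here is a small sketch. The function names are hypothetical, and it assumes the input string carries the upstream version number. A distribution's `uname -r` string may not include the upstream stable patch level, in which case the package version is the thing to compare instead.

```python
import re

# First 3.11.y release whose changelog (quoted above) lists the patch.
FIXED_IN = (3, 11, 6)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string such as '3.11.5-1-amd64'.

    A missing patch component (e.g. '3.12') is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def predates_fix_in_3_11(release):
    """True if release is in the 3.11.y series and older than 3.11.6.

    This says nothing about other series; it only encodes the
    3.11.5 vs 3.11.6 comparison discussed in this thread.
    """
    version = parse_kernel_release(release)
    return version[:2] == FIXED_IN[:2] and version < FIXED_IN


if __name__ == "__main__":
    for candidate in ("3.11.5", "3.11.6", "3.10.11"):
        print(candidate, predates_fix_in_3_11(candidate))
```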


Note that there's another critical patch in-flight, patching a bug 
triggered by btrfs balance on filesystems with pre-allocated files (like 
systemd does with its journal and various torrent clients do with their 
downloads).  But this one is currently being held up because stable rules 
require it to be in current mainline first, and 3.12 is out, but the two-
week 3.13 commit window that would normally be open now is suspended for 
a week, as Linus is traveling without a reliable net connection.  So the 
patch can't hit mainline, and thus won't hit stable unless an exception 
is made, until after Linus' vacation, when the commit window opens and 
the patch is accepted.

See previous discussion here on this list for it, or simply don't do any 
balances if you're running systemd or with any other pre-allocated-file 
apps such as torrent clients running, until after you get that patch.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: 3.11.5 kernel infinite loop
  2013-11-06 12:37 ` Duncan
@ 2013-11-09  1:22   ` Chris Samuel
  0 siblings, 0 replies; 3+ messages in thread
From: Chris Samuel @ 2013-11-09  1:22 UTC
  To: Duncan; +Cc: linux-btrfs


On Wed, 6 Nov 2013 12:37:32 PM Duncan wrote:

> Note that there's another critical patch in-flight, patching a bug 
> triggered by btrfs balance on filesystems with pre-allocated files (like 
> systemd does with its journal and various torrent clients do with their 
> downloads).  But this one is currently being held up because stable rules 
> require it to be in current mainline first, and 3.12 is out, but the two-
> week 3.13 commit window that would normally be open now is suspended for 
> a week, as Linus is traveling without a reliable net connection.  So the 
> patch can't hit mainline, and thus won't hit stable unless an exception 
> is made, until after Linus' vacation, when the commit window opens and 
> the patch is accepted.

Greg K-H has said he'll accept stable patches that haven't hit the mainline 
during this period.

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP

