linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Filesystem hang on kernel 4.2.0 with copy reflink
@ 2016-01-04  8:35 Mark Zealey
  2016-01-04 10:41 ` Jack Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Zealey @ 2016-01-04  8:35 UTC (permalink / raw)
  To: linux-btrfs

Hi there, I've run into a very strange hang with btrfs. I was trying to 
restore a directory (postgres database) from a readonly snapshot. To do 
this i used the command `cp -ar --reflink=always`. This worked fine for 
100s of files, however when it got to a particular file 16 kworker 
threads (I have 8 processors in this system) got marked as being in D 
state (with 0 cpu usage or disk usage) and I could not access the btrfs 
file system any more. I can't see any kernel message or OOPS. Can you 
please let me know what additional debug information I can provide to 
help track this issue down in the kernel?

System is latest ubuntu 14.04 LTS with a backported wily kernel (package 
linux-image-4.2.0-22-generic):

4.2.0-22-generic #27~14.04.1-Ubuntu SMP Fri Dec 18 10:57:53 UTC 2015 
x86_64 x86_64 x86_64 GNU/Linux

Thanks

Mark

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Filesystem hang on kernel 4.2.0 with copy reflink
  2016-01-04  8:35 Filesystem hang on kernel 4.2.0 with copy reflink Mark Zealey
@ 2016-01-04 10:41 ` Jack Wang
  2016-01-04 12:11   ` Mark Zealey
  0 siblings, 1 reply; 4+ messages in thread
From: Jack Wang @ 2016-01-04 10:41 UTC (permalink / raw)
  To: Mark Zealey; +Cc: linux-btrfs

Hi Mark,

Could you do below when the hang happens, and post the dmesg.

echo w > /proc/sysrq-trigger

2016-01-04 9:35 GMT+01:00 Mark Zealey <mark@markandruth.co.uk>:
> Hi there, I've run into a very strange hang with btrfs. I was trying to
> restore a directory (postgres database) from a readonly snapshot. To do this
> i used the command `cp -ar --reflink=always`. This worked fine for 100s of
> files, however when it got to a particular file 16 kworker threads (I have 8
> processors in this system) got marked as being in D state (with 0 cpu usage
> or disk usage) and I could not access the btrfs file system any more. I
> can't see any kernel message or OOPS. Can you please let me know what
> additional debug information I can provide to help track this issue down in
> the kernel?
>
> System is latest ubuntu 14.04 LTS with a backported wily kernel (package
> linux-image-4.2.0-22-generic):
>
> 4.2.0-22-generic #27~14.04.1-Ubuntu SMP Fri Dec 18 10:57:53 UTC 2015 x86_64
> x86_64 x86_64 GNU/Linux
>
> Thanks
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Filesystem hang on kernel 4.2.0 with copy reflink
  2016-01-04 10:41 ` Jack Wang
@ 2016-01-04 12:11   ` Mark Zealey
  2016-01-09  9:28     ` Duncan
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Zealey @ 2016-01-04 12:11 UTC (permalink / raw)
  To: Jack Wang; +Cc: linux-btrfs

It overflowed the dmesg buffer but hopefully contains enough cores - 
https://mark.zealey.org/download/btrfs_crash.txt

Some other output:

# mount
/dev/sdb1 on / type btrfs (rw,noatime,skip_balance,subvol=@)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /sys/firmware/efi/efivars type efivarfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs 
(rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup 
(rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup 
(rw,relatime,net_cls,release_agent=/run/cgmanager/agents/cgm-release-agent.net_cls)
/dev/sdb1 on /home type btrfs (rw,noatime,skip_balance,subvol=@home)
/dev/sdb3 on /boot/efi type vfat (rw)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc 
(rw,noexec,nosuid,nodev)
rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw)
systemd on /sys/fs/cgroup/systemd type cgroup 
(rw,noexec,nosuid,nodev,none,name=systemd)

  ps auwx|grep ' D'
root       275  0.0  0.0      0     0 ?        D    Jan02   2:29 
[btrfs-transacti]
root       361  0.0  0.0      0     0 ?        D    13:30   0:00 
[kworker/u16:5]
root       404  0.0  0.0      0     0 ?        D    13:31   0:00 
[kworker/u16:7]
root      1127  0.0  0.0      0     0 ?        D    13:54   0:00 
[kworker/u16:0]
root      1137  0.0  0.0      0     0 ?        D    13:54   0:00 
[kworker/u16:2]
root      1189  2.3  0.0  25932  2216 pts/7    D+   13:55   0:02 cp -vax 
--reflink=always /.snapshots/psql/var/lib/postgresql/ .
root      1191  0.0  0.0      0     0 ?        D    13:55   0:00 
[kworker/u16:3]
root      1197  0.0  0.0      0     0 ?        D    13:55   0:00 
[kworker/u16:4]
root      1200  0.0  0.0      0     0 ?        D    13:55   0:00 
[kworker/u16:8]
root      1201  0.0  0.0      0     0 ?        D    13:55   0:00 
[kworker/u16:10]
root      1230  0.0  0.0      0     0 ?        D    13:55   0:00 
[kworker/u16:15]
root      1231  0.0  0.0      0     0 ?        D    13:55   0:00 
[kworker/u16:16]
root     14569  0.0  0.0      0     0 ?        D    12:18   0:00 
[kworker/u16:9]
root     14572  0.0  0.0      0     0 ?        D    12:19   0:00 
[kworker/u16:11]
root     14573  0.0  0.0      0     0 ?        D    12:19   0:00 
[kworker/u16:12]
root     14582  0.0  0.0      0     0 ?        D    12:19   0:00 
[kworker/u16:13]
root     32228  0.0  0.0      0     0 ?        D    13:17   0:00 
[kworker/u16:1]

The last output of the cp:

‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25009’ -> 
‘./postgresql/9.5/main/base/16385/25009’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25011’ -> 
‘./postgresql/9.5/main/base/16385/25011’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25012’ -> 
‘./postgresql/9.5/main/base/16385/25012’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25243’ -> 
‘./postgresql/9.5/main/base/16385/25243’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25246’ -> 
‘./postgresql/9.5/main/base/16385/25246’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25248’ -> 
‘./postgresql/9.5/main/base/16385/25248’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25249’ -> 
‘./postgresql/9.5/main/base/16385/25249’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25251’ -> 
‘./postgresql/9.5/main/base/16385/25251’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25254’ -> 
‘./postgresql/9.5/main/base/16385/25254’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25256’ -> 
‘./postgresql/9.5/main/base/16385/25256’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25257’ -> 
‘./postgresql/9.5/main/base/16385/25257’
‘/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25283’ -> 
‘./postgresql/9.5/main/base/16385/25283’

And those (and other files) that it would have copied:

-rw------- 1 postgres postgres         0 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25243
-rw------- 1 postgres postgres         0 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25246
-rw------- 1 postgres postgres      8192 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25248
-rw------- 1 postgres postgres      8192 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25249
-rw------- 1 postgres postgres         0 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25251
-rw------- 1 postgres postgres         0 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25254
-rw------- 1 postgres postgres      8192 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25256
-rw------- 1 postgres postgres      8192 Dec 30 18:11 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25257
-rw------- 1 postgres postgres 409624576 Dec 30 19:10 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25283
-rw------- 1 postgres postgres    122880 Dec 30 18:29 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25283_fsm
-rw------- 1 postgres postgres         0 Dec 30 18:22 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25284
-rw------- 1 postgres postgres      8192 Dec 30 18:22 
/.snapshots/psql/var/lib/postgresql/9.5/main/base/16385/25285

Also I have quota tracking enabled on the btrfs volume if that makes any 
difference.

Mark

On 04/01/16 12:41, Jack Wang wrote:
> Hi Mark,
>
> Could you do below when the hang happens, and post the dmesg.
>
> echo w > /proc/sysrq-trigger
>
> 2016-01-04 9:35 GMT+01:00 Mark Zealey <mark@markandruth.co.uk>:
>> Hi there, I've run into a very strange hang with btrfs. I was trying to
>> restore a directory (postgres database) from a readonly snapshot. To do this
>> i used the command `cp -ar --reflink=always`. This worked fine for 100s of
>> files, however when it got to a particular file 16 kworker threads (I have 8
>> processors in this system) got marked as being in D state (with 0 cpu usage
>> or disk usage) and I could not access the btrfs file system any more. I
>> can't see any kernel message or OOPS. Can you please let me know what
>> additional debug information I can provide to help track this issue down in
>> the kernel?
>>
>> System is latest ubuntu 14.04 LTS with a backported wily kernel (package
>> linux-image-4.2.0-22-generic):
>>
>> 4.2.0-22-generic #27~14.04.1-Ubuntu SMP Fri Dec 18 10:57:53 UTC 2015 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> Thanks
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Filesystem hang on kernel 4.2.0 with copy reflink
  2016-01-04 12:11   ` Mark Zealey
@ 2016-01-09  9:28     ` Duncan
  0 siblings, 0 replies; 4+ messages in thread
From: Duncan @ 2016-01-09  9:28 UTC (permalink / raw)
  To: linux-btrfs

Mark Zealey posted on Mon, 04 Jan 2016 14:11:00 +0200 as excerpted:

> Also I have quota tracking enabled on the btrfs volume if that makes any
> difference.

I'm not sure whether it makes a difference for this particular hang, but 
I do know that btrfs quotas are simply not stable thru at least 4.3.  I'm 
not sure what 4.4 status is, but my general recommendation regarding 
btrfs quotas is...

If you need quotas, use a more mature filesystem where they work 
reliably, if you don't, then turn them off for now, and don't turn them 
on again until at least two complete kernel cycles have passed without a 
known quota bug or instability, which, if 4.4 is indeed known-bug-free 
and it and 4.5 remain so thru 4.6, means that would be the earliest I 
could recommend turning it on, and that's only if there's no known quota 
bugs in 4.4 or 4.5 or 4.6 by the time of 4.6 release.

Unless of course you're deliberately and specifically testing btrfs 
quotas and working with the devs on fixing quota specific issues, in 
which case, thank you. =:^)

Meanwhile, the rather long history of problems with quotas on btrfs is 
both the reason I'm suggesting two complete cycles without quota issues 
before enabling them, and a good reason to be skeptical as to current 
releases' quota code or the likelihood of that two-releases-quota-bug-
free happening any time soon.  There's definitely a lot of work going 
into the feature and there's gotta be a point where it actually works, 
but they've rewritten the code three times and are still dealing with 
bugs, and it has been in a "try back in a couple kernel cycles" state for 
years, now, so... who knows?  Kinda reminds me of how long the raid56 
code took, only quota was being worked on before that, and is still being 
worked on, so...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-01-09  9:28 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-04  8:35 Filesystem hang on kernel 4.2.0 with copy reflink Mark Zealey
2016-01-04 10:41 ` Jack Wang
2016-01-04 12:11   ` Mark Zealey
2016-01-09  9:28     ` Duncan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).