Oops while rebalancing, now unmountable.

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Oops while rebalancing, now unmountable.
@ 2010-11-08 17:10 Shane Shrybman
  2010-11-08 17:55 ` Chris Mason
  2010-11-09 13:42 ` Chris Mason
  0 siblings, 2 replies; 22+ messages in thread
From: Shane Shrybman @ 2010-11-08 17:10 UTC (permalink / raw)
  To: linux-btrfs

Hi,

Got an oops last week while rebalancing that seems to have left me with
a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
small misc. patchs.

=EF=BB=BFI tried my last reliable 2.6.35 based kernel without the trans=
parent
hugetlb patchset and the btrfs was not mountable there either.
=EF=BB=BF
There has been a few unclean shutdowns lately due to power issues. But
things seemed fine after several days of use.

Label: none  uuid: b8a1b5a4-dce4-405f-89f4-903c13e92174
        Total devices 1 FS bytes used 1.25TB
        devid    1 size 1.36TB used 1.30TB path /dev/sdc1

Btrfs v0.19

Here's the oops I dug out of the log.

btrfs: found 97 extents
btrfs: sdc1 checksum verify failed on 624993189888 wanted 57EC7EA0 foun=
d 872AA969 level 0
message repeated 2 times
BUG: unable to handle kernel NULL pointer dereference at 0000001c
IP: [<791338e4>] btrfs_print_leaf+0x14/0xb00
*pde =3D 00000000=20
Oops: 0000 [#1] SMP=20
last sysfs file: /sys/block/sdd/removable
Modules linked in: loop nvidia(P) agpgart acpi_cpufreq mperf nfsd expor=
tfs nfs lockd fscache auth_rpcgss sunrpc coretemp netconsole usbhid tda=
7432 tuner tea5767 tda8290 tda18271 tuner_xc2028 xc5000 snd_cmipci game=
port snd_opl3_lib snd_hwdep snd_mpu401_uart tda9887 tuner_simple tuner_=
types mt20xx tea5761 msp3400 snd_hda_codec_realtek snd_hda_intel snd_hd=
a_codec snd_pcm_oss snd_mixer_oss snd_pcm saa7115 snd_seq_dummy snd_seq=
_oss snd_seq_midi snd_seq_midi_event snd_seq r8169 bttv uhci_hcd ivtv v=
ideobuf_dma_sg videobuf_core btcx_risc snd_rawmidi cx2341x i2c_algo_bit=
 snd_timer tveeprom mii rtc_cmos rtc_core rtc_lib snd_seq_device ehci_h=
cd snd usbcore soundcore snd_page_alloc [last unloaded: tvaudio]

Pid: 26877, comm: btrfs Tainted: P            2.6.36-aa1 #5 P35-S3G/P35=
-S3G
EIP: 0060:[<791338e4>] EFLAGS: 00010296 CPU: 1
EIP is at btrfs_print_leaf+0x14/0xb00
EAX: f2374000 EBX: 00000000 ECX: 091847f5 EDX: 00000000
ESI: f2374000 EDI: 00000065 EBP: 82d73abc ESP: 82d73988
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process btrfs (pid: 26877, ti=3D82d72000 task=3D79527120 task.ti=3D82d7=
2000)
Stack:
 f2374000 847f4000 00000091 00001000 000001af 00000000 00000021 0000100=
0
<0> 0000002b 00000000 00000000 847f4000 00000091 8ac1a000 00000091 0000=
0037
<0> e329489c 82d739ec 7912145f 00000002 00000001 e329489c 82d73a9f 82d7=
3a9f
Call Trace:
 [<7912145f>] ? unlock_up+0xcf/0xe0
 [<79126767>] ? btrfs_search_slot+0x537/0x630
 [<791316f0>] ? lookup_inline_extent_backref+0xd0/0x4f0
 [<79129fc8>] ? update_block_group+0xe8/0x2c0
 [<7913263c>] ? __btrfs_free_extent+0x60c/0x7d0
 [<791a8e18>] ? rb_erase+0x208/0x280
 [<79132d74>] ? run_clustered_refs+0x224/0x9d0
 [<79159ba9>] ? map_extent_buffer+0xb9/0xc0
 [<791335bd>] ? btrfs_run_delayed_refs+0x9d/0x170
 [<7913d590>] ? btrfs_commit_transaction+0x80/0x640
 [<791756bd>] ? tree_insert+0x6d/0x80
 [<79176f28>] ? btrfs_update_reloc_root+0x108/0x180
 [<79045840>] ? autoremove_wake_function+0x0/0x50
 [<79178315>] ? prepare_to_merge+0x215/0x230
 [<7917bfdb>] ? relocate_block_group+0x4db/0x5a0
 [<7917c181>] ? btrfs_relocate_block_group+0xe1/0x270
 [<791649c6>] ? btrfs_relocate_chunk+0x56/0x520
 [<791597cc>] ? unmap_extent_buffer+0xc/0x10
 [<7914f96d>] ? btrfs_item_offset+0xbd/0xc0
 [<791665ce>] ? btrfs_balance+0x21e/0x2b0
 [<7916aa32>] ? btrfs_ioctl+0x3f2/0x8f0
 [<79076d26>] ? lru_cache_add_lru+0x26/0x40
 [<7908232f>] ? handle_pte_fault+0x39f/0x600
 [<79083438>] ? __pte_alloc+0x78/0xd0
 [<7916a640>] ? btrfs_ioctl+0x0/0x8f0
 [<790aa833>] ? vfs_ioctl+0x33/0x50
 [<790aa9c5>] ? do_vfs_ioctl+0x85/0x530
 [<7901efec>] ? do_page_fault+0x14c/0x3e0
 [<790945fd>] ? follow_hugetlb_page+0x8d/0x260
 [<790aaed0>] ? sys_ioctl+0x60/0x70
 [<79002b50>] ? sysenter_do_call+0x12/0x26
Code: 75 f8 8b 7d fc 89 ec 5d c3 8d b4 26 00 00 00 00 8d bc 27 00 00 00=
 00 55 89 e5 57 56 89 c6 53 81 ec 28 01 00 00 89 95 f0 fe ff ff <8b> 42=
 1c ba 03 00 00 00 e8 3f 03 ef ff 8b 50 60 89 95 3c ff ff=20
EIP: [<791338e4>] btrfs_print_leaf+0x14/0xb00 SS:ESP 0068:82d73988
CR2: 000000000000001c
---[ end trace 11a6d62f12dc6118 ]---


# btrfsck  /dev/sdc1
checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
Segmentation fault

Is it a total write off or can a repair be made? I don't care about the
data,  as long as I can mount it and delete the corrupted files.

Regards,

Shane


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-08 17:10 Oops while rebalancing, now unmountable Shane Shrybman
@ 2010-11-08 17:55 ` Chris Mason
  2010-11-08 20:39   ` Shane Shrybman
  2010-11-09 13:42 ` Chris Mason
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Mason @ 2010-11-08 17:55 UTC (permalink / raw)
  To: Shane Shrybman; +Cc: linux-btrfs

Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> Hi,
> 
> Got an oops last week while rebalancing that seems to have left me with
> a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> small misc. patchs.

Have you tried the 2.6.36 + the btrfs unstable git tree?  It may be able
to help here.

If that doesn't do it we'll have to try old copies of the super block.
I doubt balancing caused this corruption, it probably dug up an old
corruption because it reads every single metadata block on the FS.

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-08 17:55 ` Chris Mason
@ 2010-11-08 20:39   ` Shane Shrybman
  2010-11-08 21:04     ` Chris Mason
  0 siblings, 1 reply; 22+ messages in thread
From: Shane Shrybman @ 2010-11-08 20:39 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Mon, 2010-11-08 at 12:55 -0500, Chris Mason wrote:
> Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> > Hi,
> > 
> > Got an oops last week while rebalancing that seems to have left me with
> > a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> > small misc. patchs.
> 
> Have you tried the 2.6.36 + the btrfs unstable git tree?  It may be able
> to help here.

Ok, I have built and booted the 2.6.36 + the btrfs unstable git tree
kernel and ran btrfsck with the same results.

# btrfsck /dev/sdc1
checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
Segmentation fault

Thanks,

Shane


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-08 20:39   ` Shane Shrybman
@ 2010-11-08 21:04     ` Chris Mason
  2010-11-08 21:25       ` Shane Shrybman
  0 siblings, 1 reply; 22+ messages in thread
From: Chris Mason @ 2010-11-08 21:04 UTC (permalink / raw)
  To: Shane Shrybman; +Cc: linux-btrfs

Excerpts from Shane Shrybman's message of 2010-11-08 15:39:25 -0500:
> On Mon, 2010-11-08 at 12:55 -0500, Chris Mason wrote:
> > Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> > > Hi,
> > > 
> > > Got an oops last week while rebalancing that seems to have left me with
> > > a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> > > small misc. patchs.
> > 
> > Have you tried the 2.6.36 + the btrfs unstable git tree?  It may be able
> > to help here.
> 
> Ok, I have built and booted the 2.6.36 + the btrfs unstable git tree
> kernel and ran btrfsck with the same results.
> 
> # btrfsck /dev/sdc1
> checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
> checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
> checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
> Segmentation fault

Ok, if you pull down the latest btrfs-progs from git, you'll find a new
option to btrfsck, which tries alternate copies of the super block.

btrfsck -s 1 /dev/sdc1
btrfsck -s 2 /dev/sdc1
btrfsck -s 3 /dev/sdc1

If we're really lucky, one of these will work.

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-08 21:04     ` Chris Mason
@ 2010-11-08 21:25       ` Shane Shrybman
  0 siblings, 0 replies; 22+ messages in thread
From: Shane Shrybman @ 2010-11-08 21:25 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Mon, 2010-11-08 at 16:04 -0500, Chris Mason wrote:
> Excerpts from Shane Shrybman's message of 2010-11-08 15:39:25 -0500:
> > On Mon, 2010-11-08 at 12:55 -0500, Chris Mason wrote:
> > > Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> > > > Hi,
> > > > 
> > > > Got an oops last week while rebalancing that seems to have left me with
> > > > a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> > > > small misc. patchs.
> > > 
> > > Have you tried the 2.6.36 + the btrfs unstable git tree?  It may be able
> > > to help here.
> > 
> > Ok, I have built and booted the 2.6.36 + the btrfs unstable git tree
> > kernel and ran btrfsck with the same results.
> > 
> > # btrfsck /dev/sdc1
> > checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
> > checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
> > checksum verify failed on 625055924224 wanted C3DFFE41 found 9F99998
> > Segmentation fault
> 
> Ok, if you pull down the latest btrfs-progs from git, you'll find a new
> option to btrfsck, which tries alternate copies of the super block.
> 
> btrfsck -s 1 /dev/sdc1
> btrfsck -s 2 /dev/sdc1
> btrfsck -s 3 /dev/sdc1
> 
> If we're really lucky, one of these will work.

No luck yet :)

# ./btrfsck  -s 1 /dev/sdc1
using SB copy 1, bytenr 67108864
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
Segmentation fault
# ./btrfsck  -s 0 /dev/sdc1
using SB copy 0, bytenr 65536
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
Segmentation fault
# ./btrfsck  -s 2 /dev/sdc1
using SB copy 2, bytenr 274877906944
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
Segmentation fault
# ./btrfsck  -s 3 /dev/sdc1
using SB copy 3, bytenr 1125899906842624
No valid Btrfs found on /dev/sdc1






^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-08 17:10 Oops while rebalancing, now unmountable Shane Shrybman
  2010-11-08 17:55 ` Chris Mason
@ 2010-11-09 13:42 ` Chris Mason
  2010-11-09 18:21   ` Shane Shrybman
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Mason @ 2010-11-09 13:42 UTC (permalink / raw)
  To: Shane Shrybman; +Cc: linux-btrfs

Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> Hi,
> 
> Got an oops last week while rebalancing that seems to have left me with
> a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> small misc. patchs.

We have a confirmed and reproducible case where the transparent
hugepages are corrupting btrfs (and only btrfs).  I'll work with Andrea
on figuring out the cause.

So, the first step to trying to fix it is to grab the latest btrfsck and
see if some old copies of the super are working:

btrfsck -s 1 /dev/xxx
btrfsck -s 2 /dev/xxx

-chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-09 13:42 ` Chris Mason
@ 2010-11-09 18:21   ` Shane Shrybman
  2010-11-14 19:55     ` Shane Shrybman
  0 siblings, 1 reply; 22+ messages in thread
From: Shane Shrybman @ 2010-11-09 18:21 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Tue, 2010-11-09 at 08:42 -0500, Chris Mason wrote:
> Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> > Hi,
> > 
> > Got an oops last week while rebalancing that seems to have left me with
> > a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> > small misc. patchs.
> 
> We have a confirmed and reproducible case where the transparent
> hugepages are corrupting btrfs (and only btrfs).  I'll work with Andrea
> on figuring out the cause.
> 
> So, the first step to trying to fix it is to grab the latest btrfsck and
> see if some old copies of the super are working:
> 
> btrfsck -s 1 /dev/xxx
> btrfsck -s 2 /dev/xxx
> 

Yeah, I tried that with the latest btrfsck (last commit was:
btrfs-debug-tree: add -d option ...)

# ./btrfsck  -s 1 /dev/sdc1
using SB copy 1, bytenr 67108864
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
Segmentation fault
# ./btrfsck  -s 0 /dev/sdc1
using SB copy 0, bytenr 65536
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
Segmentation fault
# ./btrfsck  -s 2 /dev/sdc1
using SB copy 2, bytenr 274877906944
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
Segmentation fault
# ./btrfsck  -s 3 /dev/sdc1
using SB copy 3, bytenr 1125899906842624
No valid Btrfs found on /dev/sdc1

Hmm, odd that btrfsck -s 0 /dev/sdc1 finds a different checksum than
before.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-09 18:21   ` Shane Shrybman
@ 2010-11-14 19:55     ` Shane Shrybman
  2010-11-14 20:42       ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: Shane Shrybman @ 2010-11-14 19:55 UTC (permalink / raw)
  To: aarcange; +Cc: linux-btrfs, Chris Mason

On Tue, 2010-11-09 at 13:21 -0500, Shane Shrybman wrote:
> On Tue, 2010-11-09 at 08:42 -0500, Chris Mason wrote:
> > Excerpts from Shane Shrybman's message of 2010-11-08 12:10:57 -0500:
> > > Hi,
> > > 
> > > Got an oops last week while rebalancing that seems to have left me with
> > > a corrupted btrfs. Kernel was ~2.6.36 + Transparent hugetlb patchset +
> > > small misc. patchs.
> > 
> > We have a confirmed and reproducible case where the transparent
> > hugepages are corrupting btrfs (and only btrfs).  I'll work with Andrea
> > on figuring out the cause.
> > 
> > So, the first step to trying to fix it is to grab the latest btrfsck and
> > see if some old copies of the super are working:
> > 
> > btrfsck -s 1 /dev/xxx
> > btrfsck -s 2 /dev/xxx
> > 
> 
> Yeah, I tried that with the latest btrfsck (last commit was:
> btrfs-debug-tree: add -d option ...)
> 
> # ./btrfsck  -s 1 /dev/sdc1
> using SB copy 1, bytenr 67108864
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> Segmentation fault
> # ./btrfsck  -s 0 /dev/sdc1
> using SB copy 0, bytenr 65536
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> Segmentation fault
> # ./btrfsck  -s 2 /dev/sdc1
> using SB copy 2, bytenr 274877906944
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> checksum verify failed on 625055924224 wanted C3DFFE41 found FFFFFF88
> Segmentation fault
> # ./btrfsck  -s 3 /dev/sdc1
> using SB copy 3, bytenr 1125899906842624
> No valid Btrfs found on /dev/sdc1
> 
> Hmm, odd that btrfsck -s 0 /dev/sdc1 finds a different checksum than
> before.
> 

Hi Andrea!

Long time since our last bug fix :) I still have fond memories of
2.4.23-aa kernels, best of all time!

I couldn't find any other mention of this corruption issue with THP and
btrfs, so I was just checking to see if there has been any progress?

Regards,

Shane



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-14 19:55     ` Shane Shrybman
@ 2010-11-14 20:42       ` Andrea Arcangeli
  2010-11-14 22:00         ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-14 20:42 UTC (permalink / raw)
  To: Shane Shrybman; +Cc: linux-btrfs, Chris Mason

Hi Shane,

On Sun, Nov 14, 2010 at 02:55:07PM -0500, Shane Shrybman wrote:
> Hi Andrea!
> 
> Long time since our last bug fix :) I still have fond memories of
> 2.4.23-aa kernels, best of all time!

Nice memories of good times :)

> I couldn't find any other mention of this corruption issue with THP and
> btrfs, so I was just checking to see if there has been any progress?

btrfs misses this:

+       .migratepage    = btree_migratepage,

It's a bug that can trigger upstream too (not only with THP) if there
are hugepage allocations (like while incrasing nr_hugepages). Chris
already fixed it with an experimental patch.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-14 20:42       ` Andrea Arcangeli
@ 2010-11-14 22:00         ` Christoph Hellwig
  2010-11-14 22:12           ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-11-14 22:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Shane Shrybman, linux-btrfs, Chris Mason, linux-mm, linux-fsdevel

On Sun, Nov 14, 2010 at 09:42:06PM +0100, Andrea Arcangeli wrote:
> btrfs misses this:
> 
> +       .migratepage    = btree_migratepage,
> 
> It's a bug that can trigger upstream too (not only with THP) if there
> are hugepage allocations (like while incrasing nr_hugepages). Chris
> already fixed it with an experimental patch.

If the lack of an obscure method causes data corruption something
is seriously wrong with THP.  At least from the 10.000 foot view
I can't quite figure what the exact issue is, though.
fallback_migrate_page seems to do the right thing to me for that
case.

Btw, there's also another issue with the page migration code when used
for filesystem pages.  If directly calls into ->writepage instead
of using the flusher threads.  On most filesystems this will
"only" cause nasty I/O patterns, but on ext4 for example it will
be more nasty as ext3 doesn't do conversions from delayed allocations to
real ones.  So unless you're doing a lot of overwrites it will be
hard to make any progress in writeout().

Btw, what codepath does THP call migrate_pages from?  If you don't
use an explicit thread writeout will be a no-op on btrfs and XFS, too.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-14 22:00         ` Christoph Hellwig
@ 2010-11-14 22:12           ` Andrea Arcangeli
  2010-11-15 18:23             ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-14 22:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shane Shrybman, linux-btrfs, Chris Mason, linux-mm, linux-fsdevel

On Sun, Nov 14, 2010 at 05:00:18PM -0500, Christoph Hellwig wrote:
> On Sun, Nov 14, 2010 at 09:42:06PM +0100, Andrea Arcangeli wrote:
> > btrfs misses this:
> > 
> > +       .migratepage    = btree_migratepage,
> > 
> > It's a bug that can trigger upstream too (not only with THP) if there
> > are hugepage allocations (like while incrasing nr_hugepages). Chris
> > already fixed it with an experimental patch.
> 
> If the lack of an obscure method causes data corruption something
> is seriously wrong with THP.  At least from the 10.000 foot view

I just wrote above that it can happen upstream without THP. It's not
THP related at all. THP is the consumer, this is a problem in migrate
that will trigger as well with migrate_pages or all other possible
migration APIs.

If more people would be using hugetlbfs they would have noticed
without THP.

> I can't quite figure what the exact issue is, though.
> fallback_migrate_page seems to do the right thing to me for that
> case.
> 
> Btw, there's also another issue with the page migration code when used
> for filesystem pages.  If directly calls into ->writepage instead
> of using the flusher threads.  On most filesystems this will
> "only" cause nasty I/O patterns, but on ext4 for example it will
> be more nasty as ext3 doesn't do conversions from delayed allocations to
> real ones.  So unless you're doing a lot of overwrites it will be
> hard to make any progress in writeout().

+static int btree_migratepage(struct address_space *mapping,
+                       struct page *newpage, struct page *page)
+{
+       /*
+        * we can't safely write a btree page from here,
+        * we haven't done the locking hook
+        */
+       if (PageDirty(page))
+               return -EAGAIN;

fallback_migrate_page would call writeout() which is apparently not
ok in btrfs for locking issues leading to corruption.

> Btw, what codepath does THP call migrate_pages from?  If you don't
> use an explicit thread writeout will be a no-op on btrfs and XFS, too.

THP never calls migrate_pages, it's memory compaction that calls it
from inside alloc_pages(order=9). It got noticed only with THP because
it makes more frequent hugepage allocations than nr_hugepages in
hugetlbfs (and maybe there are more THP users already).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-14 22:12           ` Andrea Arcangeli
@ 2010-11-15 18:23             ` Christoph Hellwig
  2010-11-15 18:46               ` Chris Mason
  2010-11-15 18:46               ` Andrea Arcangeli
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-11-15 18:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Hellwig, Shane Shrybman, linux-btrfs, Chris Mason,
	linux-mm, linux-fsdevel

On Sun, Nov 14, 2010 at 11:12:22PM +0100, Andrea Arcangeli wrote:
> I just wrote above that it can happen upstream without THP. It's not
> THP related at all. THP is the consumer, this is a problem in migrate
> that will trigger as well with migrate_pages or all other possible
> migration APIs.
> 
> If more people would be using hugetlbfs they would have noticed
> without THP.

Okay, it seems THP is really just the messenger for bad VM practices
here.

> +static int btree_migratepage(struct address_space *mapping,
> +                       struct page *newpage, struct page *page)
> +{
> +       /*
> +        * we can't safely write a btree page from here,
> +        * we haven't done the locking hook
> +        */
> +       if (PageDirty(page))
> +               return -EAGAIN;
> 
> fallback_migrate_page would call writeout() which is apparently not
> ok in btrfs for locking issues leading to corruption.

Hmm, it seems the issue for that particular problem is indeedin btrfs.
If it needs external locking for writing out data it should not
implement ->writepage to start with.  Chris, can you explain what's
going on with the btree code? It's pretty funny both in the
btree_writepage which goes directly into extent_write_full_page
if PF_MEMALLOC is not set, but otherwise does much more complicated
work, and also in btree_writepages which skips various WB_SYNC_NONE,
including the very weird check for for_kupdate.

What's the story behing all this and the corruption that Andrea found?

> > Btw, what codepath does THP call migrate_pages from?  If you don't
> > use an explicit thread writeout will be a no-op on btrfs and XFS, too.
> 
> THP never calls migrate_pages, it's memory compaction that calls it
> from inside alloc_pages(order=9). It got noticed only with THP because
> it makes more frequent hugepage allocations than nr_hugepages in
> hugetlbfs (and maybe there are more THP users already).

Well, s/THP/compaction/ and the same problem applies.  The modern
filesystem all have stopped from writeback happening either at all
or at least for the delalloc case from direct reclaim.  Calling
into this code from alloc_pages for filesystem backed pages is thus
rather pointless.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 18:23             ` Christoph Hellwig
@ 2010-11-15 18:46               ` Chris Mason
  2010-11-15 19:03                 ` Christoph Hellwig
  2010-11-16 21:48                 ` Shane Shrybman
  2010-11-15 18:46               ` Andrea Arcangeli
  1 sibling, 2 replies; 22+ messages in thread
From: Chris Mason @ 2010-11-15 18:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrea Arcangeli, Shane Shrybman, linux-btrfs, linux-mm,
	linux-fsdevel

Excerpts from Christoph Hellwig's message of 2010-11-15 13:23:14 -0500:
> On Sun, Nov 14, 2010 at 11:12:22PM +0100, Andrea Arcangeli wrote:
> > I just wrote above that it can happen upstream without THP. It's not
> > THP related at all. THP is the consumer, this is a problem in migrate
> > that will trigger as well with migrate_pages or all other possible
> > migration APIs.
> >=20
> > If more people would be using hugetlbfs they would have noticed
> > without THP.
>=20
> Okay, it seems THP is really just the messenger for bad VM practices
> here.
>=20
> > +static int btree_migratepage(struct address_space *mapping,
> > +                       struct page *newpage, struct page *page)
> > +{
> > +       /*
> > +        * we can't safely write a btree page from here,
> > +        * we haven't done the locking hook
> > +        */
> > +       if (PageDirty(page))
> > +               return -EAGAIN;
> >=20
> > fallback_migrate_page would call writeout() which is apparently not
> > ok in btrfs for locking issues leading to corruption.
>=20
> Hmm, it seems the issue for that particular problem is indeedin btrfs.
> If it needs external locking for writing out data it should not
> implement ->writepage to start with.  Chris, can you explain what's
> going on with the btree code? It's pretty funny both in the
> btree_writepage which goes directly into extent_write_full_page
> if PF_MEMALLOC is not set, but otherwise does much more complicated
> work, and also in btree_writepages which skips various WB_SYNC_NONE,
> including the very weird check for for_kupdate.

So, I had THP + a patched btrfs running all weekend and I can safely say
I've fixed this one now.=20

>=20
> What's the story behing all this and the corruption that Andrea found?

For the metadata blocks, btrfs gets into a problematic lock inversion
where it needs to record that a block has been written so that it will
be properly recowed when someone tries to change it again.

Basically the rule for btree_writepage:

1) lock the extent buffer (different from the page)
2) mark the metadata block as written
3) lock the page
4) call writepage

Btrfs does this correctly everywhere it uses writepage, and everyone
else either uses writepages or is PF_MEMALLOC, except for the page
migration code, which just jumps to step 4.

So, my current fix adds a migrate page hook and adds a warning into the
code to make sure we protest loudly when the block isn't marked as
written.  Since this shakedown worked well, I'm changing the warning to
a BUG().

The check for kupdate in btree_writepages is different.  Once we write
something, we have to do a good amount of work in order to modify it
again.  The btrfs log commits make sure that we write metadata from time
to time, so we don't really need help from the flusher threads unless.

We also don't want to waste time writing metadata from
balance_dirty_pages.  It'll just make more allocations later as we
wander around and recow things, and it is much more likely to be seeky
than the file IO.  So we setup a threshold where we don't bother doing
metadata IO unless there is a good amount pending.

I'm fine with removing the metadata writepage entirely, it didn't use to
have this many rules and it seems like a better idea to have it not
there at all.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=3Dmailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 18:23             ` Christoph Hellwig
  2010-11-15 18:46               ` Chris Mason
@ 2010-11-15 18:46               ` Andrea Arcangeli
  2010-11-15 19:03                 ` Chris Mason
  2010-11-15 19:12                 ` Christoph Hellwig
  1 sibling, 2 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-15 18:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shane Shrybman, linux-btrfs, Chris Mason, linux-mm, linux-fsdevel

On Mon, Nov 15, 2010 at 01:23:14PM -0500, Christoph Hellwig wrote:
> On Sun, Nov 14, 2010 at 11:12:22PM +0100, Andrea Arcangeli wrote:
> > +static int btree_migratepage(struct address_space *mapping,
> > +                       struct page *newpage, struct page *page)
> > +{
> > +       /*
> > +        * we can't safely write a btree page from here,
> > +        * we haven't done the locking hook
> > +        */
> > +       if (PageDirty(page))
> > +               return -EAGAIN;
> > 
> > fallback_migrate_page would call writeout() which is apparently not
> > ok in btrfs for locking issues leading to corruption.
> 
> Hmm, it seems the issue for that particular problem is indeedin btrfs.

I've been reading the writeout() in mm/migrate.c and I wonder if maybe
that should have been WB_SYNC_ALL or if we miss a
wait_on_page_writeback in after ->writepage() returns? Can you have a
look there? We check the PG_writeback bit when the page is not dirty
(well before fallback_migrate_page is called), but after calling
writeout() we don't return to wait on PG_writeback. We make sure to
hold the page lock after ->writepage returns but that doesn't mean
PG_writeback isn't still set.

> If it needs external locking for writing out data it should not
> implement ->writepage to start with.  Chris, can you explain what's
> going on with the btree code? It's pretty funny both in the
> btree_writepage which goes directly into extent_write_full_page
> if PF_MEMALLOC is not set, but otherwise does much more complicated
> work, and also in btree_writepages which skips various WB_SYNC_NONE,
> including the very weird check for for_kupdate.
> 
> What's the story behing all this and the corruption that Andrea found?
> 
> > > Btw, what codepath does THP call migrate_pages from?  If you don't
> > > use an explicit thread writeout will be a no-op on btrfs and XFS, too.
> > 
> > THP never calls migrate_pages, it's memory compaction that calls it
> > from inside alloc_pages(order=9). It got noticed only with THP because
> > it makes more frequent hugepage allocations than nr_hugepages in
> > hugetlbfs (and maybe there are more THP users already).
> 
> Well, s/THP/compaction/ and the same problem applies.  The modern
> filesystem all have stopped from writeback happening either at all
> or at least for the delalloc case from direct reclaim.  Calling
> into this code from alloc_pages for filesystem backed pages is thus
> rather pointless.

Compaction practically only happens in the context of the task
allocating memory (in my tree it is also used by kswapd). Not
immediate to ask a separate daemon to invoke it. Not sure why this
should screw delalloc. Compaction isn't freeing any memory at all,
it's not reclaim. It just defragments and moves stuff around and it
may have to write dirty pages to do so.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 18:46               ` Chris Mason
@ 2010-11-15 19:03                 ` Christoph Hellwig
  2010-11-16 21:48                 ` Shane Shrybman
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-11-15 19:03 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Andrea Arcangeli, Shane Shrybman, linux-btrfs,
	linux-mm, linux-fsdevel

On Mon, Nov 15, 2010 at 01:46:02PM -0500, Chris Mason wrote:
> For the metadata blocks, btrfs gets into a problematic lock inversion
> where it needs to record that a block has been written so that it will
> be properly recowed when someone tries to change it again.
> 
> Basically the rule for btree_writepage:
> 
> 1) lock the extent buffer (different from the page)
> 2) mark the metadata block as written
> 3) lock the page
> 4) call writepage
> 
> Btrfs does this correctly everywhere it uses writepage, and everyone
> else either uses writepages or is PF_MEMALLOC, except for the page
> migration code, which just jumps to step 4.
>
> So, my current fix adds a migrate page hook and adds a warning into the
> code to make sure we protest loudly when the block isn't marked as
> written.  Since this shakedown worked well, I'm changing the warning to
> a BUG().
> 

This sounds to me like you shouldn't bother to use ->writepage
for the case that adheres to your locking protocol, but just call into
extent_write_full_page directly.  ->writepage is supposed to directly
callable from the VM, and not require filesystems specific calling
conventions.  Just calling extent_write_full_page directly and
making btree_writepage do the PF_MEMALLOC unconditionally should
also fix the page migration corruption.  And at the same time
making btree_writepage future proof.

Btw, magic like the one there currently does need at least a long
describing comment.

> The check for kupdate in btree_writepages is different.  Once we write
> something, we have to do a good amount of work in order to modify it
> again.  The btrfs log commits make sure that we write metadata from time
> to time, so we don't really need help from the flusher threads unless.
>
> We also don't want to waste time writing metadata from
> balance_dirty_pages.  It'll just make more allocations later as we
> wander around and recow things, and it is much more likely to be seeky
> than the file IO.  So we setup a threshold where we don't bother doing
> metadata IO unless there is a good amount pending.
> 
> I'm fine with removing the metadata writepage entirely, it didn't use to
> have this many rules and it seems like a better idea to have it not
> there at all.

for_kupdate only covers a tiny subset of the flusher threads, as it's
only set for the older_than_this still writeback.  It doesn't cover
regular percentage background reclaim not other asynchronous activity
from the flusher threads, like wakeup_flusher_threads or the laptop-mode
I/O completion.

At the very least it should check for_kupdate || for_background to cover
all background writeback, which is what the few other uses of
for_kupdate already do, but I suspect you simply want to not mark
the btree inode as hashed in the inode hash and skip background
writeback completely.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 18:46               ` Andrea Arcangeli
@ 2010-11-15 19:03                 ` Chris Mason
  2010-11-15 19:16                   ` Andrea Arcangeli
  2010-11-15 19:12                 ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Chris Mason @ 2010-11-15 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Hellwig, Shane Shrybman, linux-btrfs, linux-mm,
	linux-fsdevel

Excerpts from Andrea Arcangeli's message of 2010-11-15 13:46:57 -0500:
> On Mon, Nov 15, 2010 at 01:23:14PM -0500, Christoph Hellwig wrote:
> > On Sun, Nov 14, 2010 at 11:12:22PM +0100, Andrea Arcangeli wrote:
> > > +static int btree_migratepage(struct address_space *mapping,
> > > +                       struct page *newpage, struct page *page)
> > > +{
> > > +       /*
> > > +        * we can't safely write a btree page from here,
> > > +        * we haven't done the locking hook
> > > +        */
> > > +       if (PageDirty(page))
> > > +               return -EAGAIN;
> > >=20
> > > fallback_migrate_page would call writeout() which is apparently not
> > > ok in btrfs for locking issues leading to corruption.
> >=20
> > Hmm, it seems the issue for that particular problem is indeedin btrfs=
.
>=20
> I've been reading the writeout() in mm/migrate.c and I wonder if maybe
> that should have been WB_SYNC_ALL or if we miss a
> wait_on_page_writeback in after ->writepage() returns? Can you have a
> look there? We check the PG_writeback bit when the page is not dirty
> (well before fallback_migrate_page is called), but after calling
> writeout() we don't return to wait on PG_writeback. We make sure to
> hold the page lock after ->writepage returns but that doesn't mean
> PG_writeback isn't still set.

It always returns either -EIO or -EAGAIN, so the caller will try again
and then end up waiting on PageWriteback?

>=20
> > If it needs external locking for writing out data it should not
> > implement ->writepage to start with.  Chris, can you explain what's
> > going on with the btree code? It's pretty funny both in the
> > btree_writepage which goes directly into extent_write_full_page
> > if PF_MEMALLOC is not set, but otherwise does much more complicated
> > work, and also in btree_writepages which skips various WB_SYNC_NONE,
> > including the very weird check for for_kupdate.
> >=20
> > What's the story behing all this and the corruption that Andrea found=
?
> >=20
> > > > Btw, what codepath does THP call migrate_pages from?  If you don'=
t
> > > > use an explicit thread writeout will be a no-op on btrfs and XFS,=
 too.
> > >=20
> > > THP never calls migrate_pages, it's memory compaction that calls it
> > > from inside alloc_pages(order=3D9). It got noticed only with THP be=
cause
> > > it makes more frequent hugepage allocations than nr_hugepages in
> > > hugetlbfs (and maybe there are more THP users already).
> >=20
> > Well, s/THP/compaction/ and the same problem applies.  The modern
> > filesystem all have stopped from writeback happening either at all
> > or at least for the delalloc case from direct reclaim.  Calling
> > into this code from alloc_pages for filesystem backed pages is thus
> > rather pointless.
>=20
> Compaction practically only happens in the context of the task
> allocating memory (in my tree it is also used by kswapd). Not
> immediate to ask a separate daemon to invoke it. Not sure why this
> should screw delalloc. Compaction isn't freeing any memory at all,
> it's not reclaim. It just defragments and moves stuff around and it
> may have to write dirty pages to do so.

The short version is that when the VM jumps in and starts doing single
page IO, we badly fragment files.  The FS wants writepages for
everything, so that we can do smart delayed allocation and so that we
can write out things in units bigger than 4KB.

We most recently hashed this out in threads with Mel Gorman around
balance_dirty_pages, but it has a very big impact on performance.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=3Dmailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 18:46               ` Andrea Arcangeli
  2010-11-15 19:03                 ` Chris Mason
@ 2010-11-15 19:12                 ` Christoph Hellwig
  2010-11-15 19:18                   ` Chris Mason
  2010-11-15 19:29                   ` Andrea Arcangeli
  1 sibling, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-11-15 19:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Hellwig, Shane Shrybman, linux-btrfs, Chris Mason,
	linux-mm, linux-fsdevel

On Mon, Nov 15, 2010 at 07:46:57PM +0100, Andrea Arcangeli wrote:
> I've been reading the writeout() in mm/migrate.c and I wonder if maybe
> that should have been WB_SYNC_ALL or if we miss a
> wait_on_page_writeback in after ->writepage() returns? Can you have a
> look there? We check the PG_writeback bit when the page is not dirty
> (well before fallback_migrate_page is called), but after calling
> writeout() we don't return to wait on PG_writeback. We make sure to
> hold the page lock after ->writepage returns but that doesn't mean
> PG_writeback isn't still set.

I didn't even notice that, but the WB_SYNC_NONE does indeed seem
buggy to me.  If we set the sync_mode to WB_SYNC_NONE filesystem
can and frequently do trylock operations and might just skip to
write it out completely.

So we defintively do need to change writeout to do a WB_SYNC_ALL
writeback.  In addition to that we'll also need the
wait_on_page_writeback call to make sure we actually wait for I/O
to finish.

Also what protects us from updating the page while we write it out?
PG_writeback on many filesystems doesn't protect writes from modifying
the in-flight buffer, and just locking the page after ->writepage
is racy without a check that nothing changed.

> Compaction practically only happens in the context of the task
> allocating memory (in my tree it is also used by kswapd). Not
> immediate to ask a separate daemon to invoke it. Not sure why this
> should screw delalloc. Compaction isn't freeing any memory at all,
> it's not reclaim. It just defragments and moves stuff around and it
> may have to write dirty pages to do so.

kswapd is fine.  Other task allocation memory are direct reclaimers.
Direct reclaim through the filesystem delalloc conversion and the I/O
stack guarantees you stack overflows, that's why filesystems refuse
to do anything in ->writepage for this case.  btrfs and XFS have
explicit checks for PF_MEMALLOC (with a carve out for kswapd in XFS),
and ext4 only writes already allocated blocks in ->writepage but never
does delalloc conversions.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 19:03                 ` Chris Mason
@ 2010-11-15 19:16                   ` Andrea Arcangeli
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-15 19:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Shane Shrybman, linux-btrfs, linux-mm,
	linux-fsdevel

On Mon, Nov 15, 2010 at 02:03:55PM -0500, Chris Mason wrote:
> It always returns either -EIO or -EAGAIN, so the caller will try again
> and then end up waiting on PageWriteback?

Returning any error from ->writepage will make writeout return -EIO so
aborting the migration for that page. If no error is returned from
->writepage, writeout will return -EAGAIN the caller will try again
after wait_on_page_writeback. I think I misread the code when in prev
mail I worried about not waiting on PG_writeback after writeout()... :)

So the ideal would be not to return errors when ->writepage submitted
the writeback I/O successfully but if it returns -EIO/-EAGAIN there's
no risk whatsoever (except compaction will be less effective).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 19:12                 ` Christoph Hellwig
@ 2010-11-15 19:18                   ` Chris Mason
  2010-11-15 19:29                   ` Andrea Arcangeli
  1 sibling, 0 replies; 22+ messages in thread
From: Chris Mason @ 2010-11-15 19:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrea Arcangeli, Shane Shrybman, linux-btrfs, linux-mm,
	linux-fsdevel

Excerpts from Christoph Hellwig's message of 2010-11-15 14:12:04 -0500:
> On Mon, Nov 15, 2010 at 07:46:57PM +0100, Andrea Arcangeli wrote:
> > I've been reading the writeout() in mm/migrate.c and I wonder if mayb=
e
> > that should have been WB_SYNC_ALL or if we miss a
> > wait_on_page_writeback in after ->writepage() returns? Can you have a
> > look there? We check the PG_writeback bit when the page is not dirty
> > (well before fallback_migrate_page is called), but after calling
> > writeout() we don't return to wait on PG_writeback. We make sure to
> > hold the page lock after ->writepage returns but that doesn't mean
> > PG_writeback isn't still set.
>=20
> I didn't even notice that, but the WB_SYNC_NONE does indeed seem
> buggy to me.  If we set the sync_mode to WB_SYNC_NONE filesystem
> can and frequently do trylock operations and might just skip to
> write it out completely.
>=20
> So we defintively do need to change writeout to do a WB_SYNC_ALL
> writeback.  In addition to that we'll also need the
> wait_on_page_writeback call to make sure we actually wait for I/O
> to finish.
>=20
> Also what protects us from updating the page while we write it out?
> PG_writeback on many filesystems doesn't protect writes from modifying
> the in-flight buffer, and just locking the page after ->writepage
> is racy without a check that nothing changed.
>=20

Oh, I should have thought of that.  Btrfs (and most of the time xfs?)
will wait on PageWriteback internally, but for the ext crowd we're in
trouble.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=3Dmailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 19:12                 ` Christoph Hellwig
  2010-11-15 19:18                   ` Chris Mason
@ 2010-11-15 19:29                   ` Andrea Arcangeli
  2010-11-15 20:54                     ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2010-11-15 19:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Shane Shrybman, linux-btrfs, Chris Mason, linux-mm, linux-fsdevel

On Mon, Nov 15, 2010 at 02:12:04PM -0500, Christoph Hellwig wrote:
> I didn't even notice that, but the WB_SYNC_NONE does indeed seem
> buggy to me.  If we set the sync_mode to WB_SYNC_NONE filesystem
> can and frequently do trylock operations and might just skip to
> write it out completely.

Scary stuff, so WB_SYNC_NONE wouldn't submit the dirty part of the
page down for I/O, so that it's all clean after wait_on_page_writeback
returns? (well of course unless the dirty bit was set again)

> So we defintively do need to change writeout to do a WB_SYNC_ALL
> writeback.  In addition to that we'll also need the
> wait_on_page_writeback call to make sure we actually wait for I/O
> to finish.

Ok that is ok... I misread it sorry. But the writeback must be started
by WB_SYNC_NONE (or _ALL) for wait_on_page_writeback to be effective.

migration will abort if ->writepage returns error, that's safe
though. It will retry calling on wait_on_page_writeback only if
->writepage returns 0.

> Also what protects us from updating the page while we write it out?
> PG_writeback on many filesystems doesn't protect writes from modifying
> the in-flight buffer, and just locking the page after ->writepage
> is racy without a check that nothing changed.

migrate established migration ptes already so nobody can write to the
page through pagetables. The only thing left is O_DIRECT which is also
taken care by the page count check in migrate_page_move_mapping,
before migrate_page called by fallback_migrate_page can succeed. So
nothing can be modifying the page if we go ahead with migrate_page
(and no pte dirty bit can happen either). The page is also locked down
for the whole migration so all writes syscalls should be stopped.

> kswapd is fine.  Other task allocation memory are direct reclaimers.
> Direct reclaim through the filesystem delalloc conversion and the I/O
> stack guarantees you stack overflows, that's why filesystems refuse
> to do anything in ->writepage for this case.  btrfs and XFS have
> explicit checks for PF_MEMALLOC (with a carve out for kswapd in XFS),
> and ext4 only writes already allocated blocks in ->writepage but never
> does delalloc conversions.

I didn't realize the stack overflow issue was specific to delalloc. I
think it's ok here to skip ->writepage for delalloc, it's not
mandatory, memory compaction isn't supposed to do much I/O anyway,
it's supposed to copy ram instead. Sure it'd be more reliable to
submit I/O but it's going to work pretty well, plus compaction will be
retried again later by khugepaged once every 10 sec. kswapd actually
with THP will not do anything because THP allocations are run with
__GFP_NO_KSWAPD to avoid kswapd to waste cpu by trying in the
background hard to create hugepages if 90% of ram goes in anonymous
memory (and there are background anon allocations that would wakeup
kswapd) but only 80% can be allocated as 2M contiguous beacuse 20% was
at some point allocated in slab caches.

In short with THP it's khugepaged that is supposed to run the
->writepage in migrate.c and it will run it once every 10 sec even
when it fails (and not in a 100% cpu wasting loop like kswapd), so if
you did something magic for kswapd in XFS you should do for khugepaged
too.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 19:29                   ` Andrea Arcangeli
@ 2010-11-15 20:54                     ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2010-11-15 20:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Hellwig, Shane Shrybman, linux-btrfs, Chris Mason,
	linux-mm, linux-fsdevel

On Mon, Nov 15, 2010 at 08:29:14PM +0100, Andrea Arcangeli wrote:
> Scary stuff, so WB_SYNC_NONE wouldn't submit the dirty part of the
> page down for I/O, so that it's all clean after wait_on_page_writeback
> returns? (well of course unless the dirty bit was set again)

It might not if we have lock contention or other resource starvation.
That's the reason why WB_SYNC_NONE was added - to not block the flusher
threads.

> I didn't realize the stack overflow issue was specific to delalloc.

It's not.  It's specific to direct reclaim.  Only ext4 special cases
delalloc, but I'm not sure if that's intentional or just an accidental
side effect of the mess that the ext4 writeback code is.

> In short with THP it's khugepaged that is supposed to run the
> ->writepage in migrate.c and it will run it once every 10 sec even
> when it fails (and not in a 100% cpu wasting loop like kswapd), so if
> you did something magic for kswapd in XFS you should do for khugepaged
> too.

If you have a PF_ flag for it that's easy to add once it goes into
mainline.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Oops while rebalancing, now unmountable.
  2010-11-15 18:46               ` Chris Mason
  2010-11-15 19:03                 ` Christoph Hellwig
@ 2010-11-16 21:48                 ` Shane Shrybman
  1 sibling, 0 replies; 22+ messages in thread
From: Shane Shrybman @ 2010-11-16 21:48 UTC (permalink / raw)
  To: Chris Mason
  Cc: Christoph Hellwig, Andrea Arcangeli, linux-btrfs, linux-mm,
	linux-fsdevel

On Mon, 2010-11-15 at 13:46 -0500, Chris Mason wrote:
> Excerpts from Christoph Hellwig's message of 2010-11-15 13:23:14 -0500:
> > On Sun, Nov 14, 2010 at 11:12:22PM +0100, Andrea Arcangeli wrote:
> > > I just wrote above that it can happen upstream without THP. It's not
> > > THP related at all. THP is the consumer, this is a problem in migrate
> > > that will trigger as well with migrate_pages or all other possible
> > > migration APIs.
> > > 
> > > If more people would be using hugetlbfs they would have noticed
> > > without THP.
> > 
> > Okay, it seems THP is really just the messenger for bad VM practices
> > here.
> > 
> > > +static int btree_migratepage(struct address_space *mapping,
> > > +                       struct page *newpage, struct page *page)
> > > +{
> > > +       /*
> > > +        * we can't safely write a btree page from here,
> > > +        * we haven't done the locking hook
> > > +        */
> > > +       if (PageDirty(page))
> > > +               return -EAGAIN;
> > > 
> > > fallback_migrate_page would call writeout() which is apparently not
> > > ok in btrfs for locking issues leading to corruption.
> > 
> > Hmm, it seems the issue for that particular problem is indeedin btrfs.
> > If it needs external locking for writing out data it should not
> > implement ->writepage to start with.  Chris, can you explain what's
> > going on with the btree code? It's pretty funny both in the
> > btree_writepage which goes directly into extent_write_full_page
> > if PF_MEMALLOC is not set, but otherwise does much more complicated
> > work, and also in btree_writepages which skips various WB_SYNC_NONE,
> > including the very weird check for for_kupdate.
> 
> So, I had THP + a patched btrfs running all weekend and I can safely say
> I've fixed this one now. 
> 

That seems like good news!

Is that btrfs patch available somewhere?

Where does this leave the existing corrupted btrfs'?

Thanks guys,

Shane


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2010-11-16 21:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-08 17:10 Oops while rebalancing, now unmountable Shane Shrybman
2010-11-08 17:55 ` Chris Mason
2010-11-08 20:39   ` Shane Shrybman
2010-11-08 21:04     ` Chris Mason
2010-11-08 21:25       ` Shane Shrybman
2010-11-09 13:42 ` Chris Mason
2010-11-09 18:21   ` Shane Shrybman
2010-11-14 19:55     ` Shane Shrybman
2010-11-14 20:42       ` Andrea Arcangeli
2010-11-14 22:00         ` Christoph Hellwig
2010-11-14 22:12           ` Andrea Arcangeli
2010-11-15 18:23             ` Christoph Hellwig
2010-11-15 18:46               ` Chris Mason
2010-11-15 19:03                 ` Christoph Hellwig
2010-11-16 21:48                 ` Shane Shrybman
2010-11-15 18:46               ` Andrea Arcangeli
2010-11-15 19:03                 ` Chris Mason
2010-11-15 19:16                   ` Andrea Arcangeli
2010-11-15 19:12                 ` Christoph Hellwig
2010-11-15 19:18                   ` Chris Mason
2010-11-15 19:29                   ` Andrea Arcangeli
2010-11-15 20:54                     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).