* Spurious "ghost" "parent transid verify failed" messages on 4.14.x, 4.20.14 @ 2019-03-12 4:00 Zygo Blaxell 2019-03-12 17:34 ` Jorge Bastos 2019-04-03 14:47 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.5 + lockdep warning Zygo Blaxell 0 siblings, 2 replies; 9+ messages in thread From: Zygo Blaxell @ 2019-03-12 4:00 UTC (permalink / raw) To: linux-btrfs On 4.14.x and 4.20.14 kernels (probably all the ones in between too, but I haven't tested those), I get what I call "ghost parent transid verify failed" errors. Here's an unedited recent example from dmesg: [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 [16864.457119] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 [16864.459153] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 [16865.409955] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 [16865.413091] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 [16865.583441] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 [16865.586602] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 [17364.870504] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17364.879696] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17364.901713] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.127722] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.129965] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.132919] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.250605] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.253633] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.577192] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17376.585174] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17380.720874] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17380.735643] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17383.282250] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17383.286598] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17433.417529] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17433.420685] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 [17881.816489] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 [17881.820072] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 [17881.824307] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 [19525.602623] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.605395] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.829326] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.834123] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.905553] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.909302] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.914374] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.917829] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.980067] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 [19525.983479] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 They seem to occur while running these simultaneously: * rsync receiving a lot of data * snapshot delete to make space for more data * bees deduping said data--possibly on the snapshot being deleted. All three seem to be needed--these messages do not appear if one of the above is not running at the time. I can get the messages on a test VM that is running nothing but the above three things on the test filesystem. Unlike the usual disaster cases when "parent transid verify failed" messages appear, these incidents seem to be entirely harmless. They've been appearing for years (since at least the first v4.14 kernels) and...nothing else happens. The filesystem doesn't go read-only. It doesn't become irreparably damaged. Nothing locks up or crashes (at the same time). The system just keeps running normally, often for months, emitting a few of these messages during longer snapshot deletes. This happens on several different host systems. There is no sign of hardware failure. It looks like a real bug, but I don't have a simpler reproducer, and sometimes weeks or months pass between incidents on a given machine. Scrubs are always clean. There are never errors recorded in btrfs dev stat: # btrfs dev stat /testfs/ [/dev/mapper/vgtest-tvdb].write_io_errs 0 [/dev/mapper/vgtest-tvdb].read_io_errs 0 [/dev/mapper/vgtest-tvdb].flush_io_errs 0 [/dev/mapper/vgtest-tvdb].corruption_errs 0 [/dev/mapper/vgtest-tvdb].generation_errs 0 [/dev/mapper/vgtest-tvdc].write_io_errs 0 [/dev/mapper/vgtest-tvdc].read_io_errs 0 [/dev/mapper/vgtest-tvdc].flush_io_errs 0 [/dev/mapper/vgtest-tvdc].corruption_errs 0 [/dev/mapper/vgtest-tvdc].generation_errs 0 dump-tree on the offending block doesn't find a metadata page in the block right after the error occurs. Sometimes it finds a newer block (higher transid) some hours later: # btrfs ins dump-tree -b 984570740736 /dev/mapper/vgtest-tvdc btrfs-progs v4.20.1 leaf 984570740736 items 151 free space 7525 generation 9882 owner EXTENT_TREE leaf 984570740736 flags 0x1(WRITTEN) backref revision 1 [...etc] Mount options common to all systems exhibiting this behavior: rw,noatime,compress,space_cache=v2 both ssd and nossd any of: compress, compress-force, compress-force=zstd, compress=zstd My guess/theory: someone is maybe holding a reference to metadata they don't hold a lock for, and the page is being updated or replaced underneath due to the snapshot delete, triggering the verification failure. The error doesn't get persisted to disk because by the time the error is discovered, the page was already deleted by the snapshot delete. No errors are getting persisted to disk, so it's probably something that happens in a read operation rather than a write operation. Every application reads and writes data, so if it were in plain read or write code someone else would have noticed by now. it's probably not a common code path. bees seems to be required to trigger the bug (or triggers it much more often) so it may be a problem somewhere in one of the ioctls bees uses. bees does a lot of FILE_EXTENT_SAME, TREE_SEARCH_V2, and LOGICAL_INO ioctls. It could be any of the three, but FILE_EXTENT_SAME does writes, and TREE_SEARCH_V2 isn't used by bees to look at items in EXTENT_TREE, which leaves LOGICAL_INO which starts from EXTENT_TREE and works outwards to subvol trees. So my guess is it's a bug in LOGICAL_INO. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 4.14.x, 4.20.14 2019-03-12 4:00 Spurious "ghost" "parent transid verify failed" messages on 4.14.x, 4.20.14 Zygo Blaxell @ 2019-03-12 17:34 ` Jorge Bastos 2019-04-03 16:07 ` Zygo Blaxell 2019-04-03 14:47 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.5 + lockdep warning Zygo Blaxell 1 sibling, 1 reply; 9+ messages in thread From: Jorge Bastos @ 2019-03-12 17:34 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Btrfs BTRFS I've got similar errors when doing a balance and send/receive at the same time, in these cases send/receive would abort because of the errors, but it all worked normally after waiting for the balance to finish, no more errors, though now I try to remember to disable any scheduled send/receive jobs before a balance. Jorge On Tue, Mar 12, 2019 at 4:03 AM Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > > On 4.14.x and 4.20.14 kernels (probably all the ones in between too, > but I haven't tested those), I get what I call "ghost parent transid > verify failed" errors. Here's an unedited recent example from dmesg: > > [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 > [16864.457119] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > [16864.459153] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > [16865.409955] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > [16865.413091] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > [16865.583441] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > [16865.586602] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > [17364.870504] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17364.879696] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17364.901713] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.127722] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.129965] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.132919] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.250605] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.253633] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.577192] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17376.585174] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17380.720874] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17380.735643] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17383.282250] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17383.286598] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17433.417529] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17433.420685] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > [17881.816489] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 > [17881.820072] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 > [17881.824307] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 > [19525.602623] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.605395] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.829326] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.834123] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.905553] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.909302] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.914374] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.917829] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.980067] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > [19525.983479] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > They seem to occur while running these simultaneously: > > * rsync receiving a lot of data > > * snapshot delete to make space for more data > > * bees deduping said data--possibly on the snapshot being > deleted. > > All three seem to be needed--these messages do not appear if one of the > above is not running at the time. I can get the messages on a test VM > that is running nothing but the above three things on the test filesystem. > > Unlike the usual disaster cases when "parent transid verify failed" > messages appear, these incidents seem to be entirely harmless. They've > been appearing for years (since at least the first v4.14 kernels) > and...nothing else happens. The filesystem doesn't go read-only. > It doesn't become irreparably damaged. Nothing locks up or crashes > (at the same time). The system just keeps running normally, often for > months, emitting a few of these messages during longer snapshot deletes. > > This happens on several different host systems. There is no sign of > hardware failure. It looks like a real bug, but I don't have a simpler > reproducer, and sometimes weeks or months pass between incidents on a > given machine. > > Scrubs are always clean. There are never errors recorded in btrfs > dev stat: > > # btrfs dev stat /testfs/ > [/dev/mapper/vgtest-tvdb].write_io_errs 0 > [/dev/mapper/vgtest-tvdb].read_io_errs 0 > [/dev/mapper/vgtest-tvdb].flush_io_errs 0 > [/dev/mapper/vgtest-tvdb].corruption_errs 0 > [/dev/mapper/vgtest-tvdb].generation_errs 0 > [/dev/mapper/vgtest-tvdc].write_io_errs 0 > [/dev/mapper/vgtest-tvdc].read_io_errs 0 > [/dev/mapper/vgtest-tvdc].flush_io_errs 0 > [/dev/mapper/vgtest-tvdc].corruption_errs 0 > [/dev/mapper/vgtest-tvdc].generation_errs 0 > > dump-tree on the offending block doesn't find a metadata page in the > block right after the error occurs. Sometimes it finds a newer block > (higher transid) some hours later: > > # btrfs ins dump-tree -b 984570740736 /dev/mapper/vgtest-tvdc > btrfs-progs v4.20.1 > leaf 984570740736 items 151 free space 7525 generation 9882 owner EXTENT_TREE > leaf 984570740736 flags 0x1(WRITTEN) backref revision 1 > [...etc] > > Mount options common to all systems exhibiting this behavior: > > rw,noatime,compress,space_cache=v2 > > both ssd and nossd > > any of: compress, compress-force, compress-force=zstd, compress=zstd > > My guess/theory: someone is maybe holding a reference to metadata > they don't hold a lock for, and the page is being updated or replaced > underneath due to the snapshot delete, triggering the verification > failure. The error doesn't get persisted to disk because by the time the > error is discovered, the page was already deleted by the snapshot delete. > No errors are getting persisted to disk, so it's probably something > that happens in a read operation rather than a write operation. Every > application reads and writes data, so if it were in plain read or write > code someone else would have noticed by now. it's probably not a common > code path. bees seems to be required to trigger the bug (or triggers it > much more often) so it may be a problem somewhere in one of the ioctls > bees uses. bees does a lot of FILE_EXTENT_SAME, TREE_SEARCH_V2, and > LOGICAL_INO ioctls. It could be any of the three, but FILE_EXTENT_SAME > does writes, and TREE_SEARCH_V2 isn't used by bees to look at items in > EXTENT_TREE, which leaves LOGICAL_INO which starts from EXTENT_TREE and > works outwards to subvol trees. So my guess is it's a bug in LOGICAL_INO. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 4.14.x, 4.20.14 2019-03-12 17:34 ` Jorge Bastos @ 2019-04-03 16:07 ` Zygo Blaxell 0 siblings, 0 replies; 9+ messages in thread From: Zygo Blaxell @ 2019-04-03 16:07 UTC (permalink / raw) To: Jorge Bastos; +Cc: Btrfs BTRFS Hmmm btrfs send is _also_ prone to use-after-free issues when running with balance (or anything that modifies readonly snapshots). That and the lockdep warning on 5.0.x points to...well, maybe not the same bug as the snapshot delete case, but either a single bug in common code, or two similar bugs in two different places. On Tue, Mar 12, 2019 at 05:34:36PM +0000, Jorge Bastos wrote: > I've got similar errors when doing a balance and send/receive at the > same time, in these cases send/receive would abort because of the > errors, but it all worked normally after waiting for the balance to > finish, no more errors, though now I try to remember to disable any > scheduled send/receive jobs before a balance. > > Jorge > > On Tue, Mar 12, 2019 at 4:03 AM Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > > > On 4.14.x and 4.20.14 kernels (probably all the ones in between too, > > but I haven't tested those), I get what I call "ghost parent transid > > verify failed" errors. Here's an unedited recent example from dmesg: > > > > [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 > > [16864.457119] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > > [16864.459153] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > > [16865.409955] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > > [16865.413091] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > > [16865.583441] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > > [16865.586602] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763 > > [17364.870504] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17364.879696] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17364.901713] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.127722] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.129965] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.132919] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.250605] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.253633] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.577192] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17376.585174] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17380.720874] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17380.735643] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17383.282250] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17383.286598] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17433.417529] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17433.420685] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779 > > [17881.816489] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 > > [17881.820072] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 > > [17881.824307] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794 > > [19525.602623] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.605395] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.829326] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.834123] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.905553] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.909302] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.914374] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.917829] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.980067] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > [19525.983479] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846 > > > > They seem to occur while running these simultaneously: > > > > * rsync receiving a lot of data > > > > * snapshot delete to make space for more data > > > > * bees deduping said data--possibly on the snapshot being > > deleted. > > > > All three seem to be needed--these messages do not appear if one of the > > above is not running at the time. I can get the messages on a test VM > > that is running nothing but the above three things on the test filesystem. > > > > Unlike the usual disaster cases when "parent transid verify failed" > > messages appear, these incidents seem to be entirely harmless. They've > > been appearing for years (since at least the first v4.14 kernels) > > and...nothing else happens. The filesystem doesn't go read-only. > > It doesn't become irreparably damaged. Nothing locks up or crashes > > (at the same time). The system just keeps running normally, often for > > months, emitting a few of these messages during longer snapshot deletes. > > > > This happens on several different host systems. There is no sign of > > hardware failure. It looks like a real bug, but I don't have a simpler > > reproducer, and sometimes weeks or months pass between incidents on a > > given machine. > > > > Scrubs are always clean. There are never errors recorded in btrfs > > dev stat: > > > > # btrfs dev stat /testfs/ > > [/dev/mapper/vgtest-tvdb].write_io_errs 0 > > [/dev/mapper/vgtest-tvdb].read_io_errs 0 > > [/dev/mapper/vgtest-tvdb].flush_io_errs 0 > > [/dev/mapper/vgtest-tvdb].corruption_errs 0 > > [/dev/mapper/vgtest-tvdb].generation_errs 0 > > [/dev/mapper/vgtest-tvdc].write_io_errs 0 > > [/dev/mapper/vgtest-tvdc].read_io_errs 0 > > [/dev/mapper/vgtest-tvdc].flush_io_errs 0 > > [/dev/mapper/vgtest-tvdc].corruption_errs 0 > > [/dev/mapper/vgtest-tvdc].generation_errs 0 > > > > dump-tree on the offending block doesn't find a metadata page in the > > block right after the error occurs. Sometimes it finds a newer block > > (higher transid) some hours later: > > > > # btrfs ins dump-tree -b 984570740736 /dev/mapper/vgtest-tvdc > > btrfs-progs v4.20.1 > > leaf 984570740736 items 151 free space 7525 generation 9882 owner EXTENT_TREE > > leaf 984570740736 flags 0x1(WRITTEN) backref revision 1 > > [...etc] > > > > Mount options common to all systems exhibiting this behavior: > > > > rw,noatime,compress,space_cache=v2 > > > > both ssd and nossd > > > > any of: compress, compress-force, compress-force=zstd, compress=zstd > > > > My guess/theory: someone is maybe holding a reference to metadata > > they don't hold a lock for, and the page is being updated or replaced > > underneath due to the snapshot delete, triggering the verification > > failure. The error doesn't get persisted to disk because by the time the > > error is discovered, the page was already deleted by the snapshot delete. > > No errors are getting persisted to disk, so it's probably something > > that happens in a read operation rather than a write operation. Every > > application reads and writes data, so if it were in plain read or write > > code someone else would have noticed by now. it's probably not a common > > code path. bees seems to be required to trigger the bug (or triggers it > > much more often) so it may be a problem somewhere in one of the ioctls > > bees uses. bees does a lot of FILE_EXTENT_SAME, TREE_SEARCH_V2, and > > LOGICAL_INO ioctls. It could be any of the three, but FILE_EXTENT_SAME > > does writes, and TREE_SEARCH_V2 isn't used by bees to look at items in > > EXTENT_TREE, which leaves LOGICAL_INO which starts from EXTENT_TREE and > > works outwards to subvol trees. So my guess is it's a bug in LOGICAL_INO. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 5.0.5 + lockdep warning 2019-03-12 4:00 Spurious "ghost" "parent transid verify failed" messages on 4.14.x, 4.20.14 Zygo Blaxell 2019-03-12 17:34 ` Jorge Bastos @ 2019-04-03 14:47 ` Zygo Blaxell 2019-07-01 3:39 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces Zygo Blaxell 1 sibling, 1 reply; 9+ messages in thread From: Zygo Blaxell @ 2019-04-03 14:47 UTC (permalink / raw) To: linux-btrfs On Tue, Mar 12, 2019 at 12:00:25AM -0400, Zygo Blaxell wrote: > On 4.14.x and 4.20.14 kernels (probably all the ones in between too, > but I haven't tested those), I get what I call "ghost parent transid > verify failed" errors. Here's an unedited recent example from dmesg: > > [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 These happen much less often on 5.0.x, but they still happen from time to time. 5.0.5 also produced a lockdep warning last night, the first such warning I have seen with this call trace: [Tue Apr 2 23:00:28 2019] WARNING: CPU: 1 PID: 4213 at kernel/locking/lockdep.c:702 register_lock_class+0x2f2/0x590 [Tue Apr 2 23:00:28 2019] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev joydev dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio dm_mod aesni_intel aes_x86_64 snd_pcm crypto_simd cryptd glue_helper snd_timer sr_mod cdrom snd soundcore sg parport_pc input_leds psmouse rtc_cmos ide_pci_generic pcspkr serio_raw floppy parport piix bochs_drm ide_core i2c_piix4 evbug evdev qemu_fw_cfg ip_tables x_tables ipv6 crc_ccitt autofs4 [Tue Apr 2 23:00:28 2019] CPU: 1 PID: 4213 Comm: btrfs-cleaner Not tainted 5.0.5-zb64-4eaee293f832+ #1 [Tue Apr 2 23:00:28 2019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [Tue Apr 2 23:00:28 2019] RIP: 0010:register_lock_class+0x2f2/0x590 [Tue Apr 2 23:00:28 2019] Code: 40 b2 f8 8d 89 c2 e8 9d 94 00 00 66 90 85 d2 74 11 48 c7 c7 c8 45 5e 8c e8 eb fb ff ff e8 fb b2 c2 00 45 31 ff e9 f6 fd ff ff <0f> 0b 8b 05 ee 41 e8 02 e9 c0 fd ff ff 65 48 8b 04 25 00 6e 01 00 [Tue Apr 2 23:00:28 2019] RSP: 0018:ffffb15f40893ab0 EFLAGS: 00010087 [Tue Apr 2 23:00:28 2019] RAX: ffffffff8c906af0 RBX: 0000000000000282 RCX: 0000000000000000 [Tue Apr 2 23:00:28 2019] RDX: ffffffff8e085a18 RSI: 000000000000076e RDI: 0000000000000000 [Tue Apr 2 23:00:28 2019] RBP: 0000000000000000 R08: ffffffff8b4dd33d R09: 0000000000000000 [Tue Apr 2 23:00:28 2019] R10: 0000000000000000 R11: ffff908c2a09a720 R12: 0000000000000000 [Tue Apr 2 23:00:28 2019] R13: 0000000000000001 R14: ffff908c2a09a720 R15: ffffffff8d623b60 [Tue Apr 2 23:00:28 2019] FS: 0000000000000000(0000) GS:ffff908d3a000000(0000) knlGS:0000000000000000 [Tue Apr 2 23:00:28 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Tue Apr 2 23:00:28 2019] CR2: 00007f65819ea1f0 CR3: 00000001328d2006 CR4: 00000000001606e0 [Tue Apr 2 23:00:28 2019] Call Trace: [Tue Apr 2 23:00:28 2019] ? __lock_acquire+0x477/0x1b30 [Tue Apr 2 23:00:28 2019] __lock_acquire+0xb1/0x1b30 [Tue Apr 2 23:00:28 2019] ? btrfs_tree_lock+0x9d/0x250 [Tue Apr 2 23:00:28 2019] ? __lock_acquire+0x477/0x1b30 [Tue Apr 2 23:00:28 2019] ? lock_acquire+0xbc/0x1c0 [Tue Apr 2 23:00:28 2019] lock_acquire+0xbc/0x1c0 [Tue Apr 2 23:00:28 2019] ? btrfs_tree_lock+0x9d/0x250 [Tue Apr 2 23:00:28 2019] _raw_write_lock+0x30/0x70 [Tue Apr 2 23:00:28 2019] ? btrfs_tree_lock+0x9d/0x250 [Tue Apr 2 23:00:28 2019] btrfs_tree_lock+0x9d/0x250 [Tue Apr 2 23:00:28 2019] ? mark_page_accessed+0x13d/0x160 [Tue Apr 2 23:00:28 2019] ? mark_extent_buffer_accessed+0x67/0x80 [Tue Apr 2 23:00:28 2019] do_walk_down+0x126/0x610 [Tue Apr 2 23:00:28 2019] ? do_raw_spin_unlock+0x49/0xc0 [Tue Apr 2 23:00:28 2019] walk_down_tree+0xb8/0x100 [Tue Apr 2 23:00:28 2019] btrfs_drop_snapshot+0x3a4/0x8d0 [Tue Apr 2 23:00:28 2019] btrfs_clean_one_deleted_snapshot+0xb2/0x110 [Tue Apr 2 23:00:28 2019] cleaner_kthread+0x11f/0x180 [Tue Apr 2 23:00:28 2019] kthread+0x106/0x140 [Tue Apr 2 23:00:28 2019] ? btrfs_destroy_pinned_extent+0x1b0/0x1b0 [Tue Apr 2 23:00:28 2019] ? kthread_park+0x90/0x90 [Tue Apr 2 23:00:28 2019] ret_from_fork+0x3a/0x50 [Tue Apr 2 23:00:28 2019] irq event stamp: 131299397 [Tue Apr 2 23:00:28 2019] hardirqs last enabled at (131299397): [<ffffffff8b2bd4fe>] kmem_cache_alloc+0x1fe/0x310 [Tue Apr 2 23:00:28 2019] hardirqs last disabled at (131299396): [<ffffffff8b2bd3b5>] kmem_cache_alloc+0xb5/0x310 [Tue Apr 2 23:00:28 2019] softirqs last enabled at (131299388): [<ffffffff8c0003a4>] __do_softirq+0x3a4/0x45f [Tue Apr 2 23:00:28 2019] softirqs last disabled at (131299339): [<ffffffff8b0a3d54>] irq_exit+0xe4/0xf0 [Tue Apr 2 23:00:28 2019] ---[ end trace 3fa5835f7315f7f2 ]--- The comments around lockdep.c:702 talk about trampling on memory, which could be a use-after-free bug. The spurious parent transid verify failures also behave like a use-after-free bug. Both events have occurred only while snapshots are being deleted. This smells a lot like a use-after-free bug in the btrfs snapshot delete code. Some time later: [Wed Apr 3 09:05:20 2019] BTRFS error (device dm-3): parent transid verify failed on 2071098556416 wanted 13332 found 13308 [Wed Apr 3 09:05:21 2019] BTRFS error (device dm-3): parent transid verify failed on 2071098556416 wanted 13332 found 13308 In this case the test machine was running a workload that includes continuous snapshot deletes--snapshots are created in parallel by a process that tries to keep 20-100 snapshots being deleted at all times (in addition to the usual 10 rsyncs + bees dedupe). The log messages here are 10 hours apart, and the first appeared 3 hours after boot. Normally, there will be a few "parent transid verify failed" messages per 24 hours of snapshot delete time. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces 2019-04-03 14:47 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.5 + lockdep warning Zygo Blaxell @ 2019-07-01 3:39 ` Zygo Blaxell 2019-07-01 5:56 ` Qu Wenruo 2020-02-04 4:27 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces - Fixed in 47c8495d358b (5.4.18/5.5.2?) Zygo Blaxell 0 siblings, 2 replies; 9+ messages in thread From: Zygo Blaxell @ 2019-07-01 3:39 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 14272 bytes --] On Wed, Apr 03, 2019 at 10:47:16AM -0400, Zygo Blaxell wrote: > On Tue, Mar 12, 2019 at 12:00:25AM -0400, Zygo Blaxell wrote: > > On 4.14.x and 4.20.14 kernels (probably all the ones in between too, > > but I haven't tested those), I get what I call "ghost parent transid > > verify failed" errors. Here's an unedited recent example from dmesg: > > > > [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 > > These happen much less often on 5.0.x, but they still happen from time > to time. I put this patch in 5.0.21: commit 5abbed1af5570f1317f31736e3862e8b7df1ca8b Author: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Date: Sat May 18 17:48:59 2019 -0400 btrfs: get a call trace when we hit ghost parent transid verify failures diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 6fe9197f6ee4..ed961d2915a1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -356,6 +356,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree, "parent transid verify failed on %llu wanted %llu found %llu", eb->start, parent_transid, btrfs_header_generation(eb)); + WARN_ON(1); ret = 1; /* and eventually (six weeks later!) got another reproduction of this bug on 5.0.21: [Sat Jun 29 16:03:51 2019] BTRFS error (device dm-24): parent transid verify failed on 192792461312 wanted 425102 found 425088 [Sat Jun 29 16:03:51 2019] WARNING: CPU: 1 PID: 308 at fs/btrfs/disk-io.c:359 verify_parent_transid+0xf7/0x1e0 [Sat Jun 29 16:03:51 2019] Modules linked in: isofs cpuid nfsv3 nfs fscache algif_skcipher af_alg mq_deadline softdog nfsd auth_rpcgss nfs_acl lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth ecdh_generic rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd xt_REDIRECT ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_nat_ipv6 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter tcp_cubic dummy ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi lp dm_crypt ppdev edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_via snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi aesni_intel [Sat Jun 29 16:03:51 2019] aes_x86_64 crypto_simd cryptd glue_helper wmi_bmof serio_raw pcspkr k10temp fam15h_power usbnet input_leds mii joydev snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss radeon parport_pc snd_mixer_oss asus_atk0110 parport rtc_cmos snd_pcm snd_timer pcc_cpufreq snd soundcore evdev acpi_cpufreq sg sp5100_tco af_packet ipv6 crc_ccitt raid10 raid0 multipath linear hid_generic dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_mod crc32c_intel ohci_pci raid1 psmouse sr_mod md_mod cdrom pdc202xx_new ide_pci_generic atiixp ide_core i2c_piix4 ohci_hcd ehci_pci ehci_hcd r8169 xhci_pci realtek xhci_hcd wmi [last unloaded: ax88179_178a] [Sat Jun 29 16:03:51 2019] CPU: 1 PID: 308 Comm: crawl_257 Tainted: G W 5.0.21-zb64-45132582ab2c+ #1 [Sat Jun 29 16:03:51 2019] Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 [Sat Jun 29 16:03:51 2019] RIP: 0010:verify_parent_transid+0xf7/0x1e0 [Sat Jun 29 16:03:51 2019] Code: 48 c1 f8 06 48 c1 e0 0c 4c 3b 74 10 50 74 2f 48 c7 c6 b0 5c 25 89 48 c7 c7 60 5d 90 89 e8 d1 5a 8f 00 85 c0 0f 85 a8 00 00 00 <0f> 0b 48 89 df 41 bc 01 00 00 00 e8 69 3f 03 00 85 c0 74 59 48 8b [Sat Jun 29 16:03:51 2019] RSP: 0018:ffffaf3c8d047730 EFLAGS: 00010296 [Sat Jun 29 16:03:51 2019] RAX: 0000000000000000 RBX: ffff9c7ab0654ae0 RCX: 0000000000000000 [Sat Jun 29 16:03:51 2019] RDX: 0000000000000000 RSI: ffff9c7fbc5e7cb8 RDI: ffff9c7fbc5e7cb8 [Sat Jun 29 16:03:51 2019] RBP: ffffaf3c8d047770 R08: ffffffff8812a25f R09: 0000000000000001 [Sat Jun 29 16:03:51 2019] R10: ffffaf3c8d047670 R11: 0000000000000000 R12: 0000000000000000 [Sat Jun 29 16:03:51 2019] R13: ffff9c7f5b4ac450 R14: 0000000000067c8e R15: ffffaf3c8d047738 [Sat Jun 29 16:03:51 2019] FS: 00007f53e7b74700(0000) GS:ffff9c7fbc400000(0000) knlGS:0000000000000000 [Sat Jun 29 16:03:51 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Sat Jun 29 16:03:51 2019] CR2: 000001df74da3000 CR3: 000000021bf52000 CR4: 00000000000406e0 [Sat Jun 29 16:03:51 2019] Call Trace: [Sat Jun 29 16:03:51 2019] btree_read_extent_buffer_pages+0xbe/0x120 [Sat Jun 29 16:03:51 2019] read_tree_block+0x42/0x70 [Sat Jun 29 16:03:51 2019] read_block_for_search.isra.11+0x1ae/0x370 [Sat Jun 29 16:03:51 2019] btrfs_next_old_leaf+0x236/0x420 [Sat Jun 29 16:03:51 2019] resolve_indirect_refs+0x486/0x910 [Sat Jun 29 16:03:51 2019] find_parent_nodes+0x443/0x1340 [Sat Jun 29 16:03:51 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 [Sat Jun 29 16:03:51 2019] iterate_extent_inodes+0xfb/0x3e0 [Sat Jun 29 16:03:51 2019] ? _raw_spin_unlock+0x27/0x40 [Sat Jun 29 16:03:51 2019] iterate_inodes_from_logical+0xa1/0xd0 [Sat Jun 29 16:03:51 2019] ? iterate_inodes_from_logical+0xa1/0xd0 [Sat Jun 29 16:03:51 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 [Sat Jun 29 16:03:51 2019] btrfs_ioctl_logical_to_ino+0xf3/0x1a0 [Sat Jun 29 16:03:51 2019] btrfs_ioctl+0xcf7/0x2cb0 [Sat Jun 29 16:03:51 2019] ? lock_acquire+0xbd/0x1c0 [Sat Jun 29 16:03:51 2019] ? lock_acquire+0xbd/0x1c0 [Sat Jun 29 16:03:51 2019] do_vfs_ioctl+0xa6/0x6e0 [Sat Jun 29 16:03:51 2019] ? do_vfs_ioctl+0xa6/0x6e0 [Sat Jun 29 16:03:51 2019] ? __fget+0x119/0x200 [Sat Jun 29 16:03:51 2019] ksys_ioctl+0x75/0x80 [Sat Jun 29 16:03:51 2019] __x64_sys_ioctl+0x1a/0x20 [Sat Jun 29 16:03:51 2019] do_syscall_64+0x65/0x1a0 [Sat Jun 29 16:03:51 2019] entry_SYSCALL_64_after_hwframe+0x49/0xbe [Sat Jun 29 16:03:51 2019] RIP: 0033:0x7f53e9468427 [Sat Jun 29 16:03:51 2019] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 [Sat Jun 29 16:03:51 2019] RSP: 002b:00007f53e7b707c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [Sat Jun 29 16:03:51 2019] RAX: ffffffffffffffda RBX: 00007f53e7b70af0 RCX: 00007f53e9468427 [Sat Jun 29 16:03:51 2019] RDX: 00007f53e7b70af8 RSI: 00000000c038943b RDI: 0000000000000004 [Sat Jun 29 16:03:51 2019] RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f53e7b70cd0 [Sat Jun 29 16:03:51 2019] R10: 0000559f3bf9cc40 R11: 0000000000000246 R12: 0000000000000004 [Sat Jun 29 16:03:51 2019] R13: 00007f53e7b70af8 R14: 00007f53e7b709d8 R15: 00007f53e7b70c00 [Sat Jun 29 16:03:51 2019] irq event stamp: 1551156398 [Sat Jun 29 16:03:51 2019] hardirqs last enabled at (1551156397): [<ffffffff880037da>] trace_hardirqs_on_thunk+0x1a/0x1c [Sat Jun 29 16:03:51 2019] hardirqs last disabled at (1551156398): [<ffffffff88dc51f9>] __schedule+0xd9/0xb70 [Sat Jun 29 16:03:51 2019] softirqs last enabled at (1551156396): [<ffffffff890003a0>] __do_softirq+0x3a0/0x45a [Sat Jun 29 16:03:51 2019] softirqs last disabled at (1551156389): [<ffffffff880a99a9>] irq_exit+0xe9/0xf0 [Sat Jun 29 16:03:51 2019] ---[ end trace d600eda2e1469b36 ]--- [Sat Jun 29 16:03:51 2019] BTRFS error (device dm-24): parent transid verify failed on 192792461312 wanted 425102 found 425088 [Sat Jun 29 16:03:51 2019] WARNING: CPU: 2 PID: 308 at fs/btrfs/disk-io.c:359 verify_parent_transid+0xf7/0x1e0 [Sat Jun 29 16:03:51 2019] Modules linked in: isofs cpuid nfsv3 nfs fscache algif_skcipher af_alg mq_deadline softdog nfsd auth_rpcgss nfs_acl lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth ecdh_generic rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd xt_REDIRECT ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_nat_ipv6 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter tcp_cubic dummy ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi lp dm_crypt ppdev edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_via snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi aesni_intel [Sat Jun 29 16:03:51 2019] aes_x86_64 crypto_simd cryptd glue_helper wmi_bmof serio_raw pcspkr k10temp fam15h_power usbnet input_leds mii joydev snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss radeon parport_pc snd_mixer_oss asus_atk0110 parport rtc_cmos snd_pcm snd_timer pcc_cpufreq snd soundcore evdev acpi_cpufreq sg sp5100_tco af_packet ipv6 crc_ccitt raid10 raid0 multipath linear hid_generic dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_mod crc32c_intel ohci_pci raid1 psmouse sr_mod md_mod cdrom pdc202xx_new ide_pci_generic atiixp ide_core i2c_piix4 ohci_hcd ehci_pci ehci_hcd r8169 xhci_pci realtek xhci_hcd wmi [last unloaded: ax88179_178a] [Sat Jun 29 16:03:51 2019] CPU: 2 PID: 308 Comm: crawl_257 Tainted: G W 5.0.21-zb64-45132582ab2c+ #1 [Sat Jun 29 16:03:52 2019] Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 [Sat Jun 29 16:03:52 2019] RIP: 0010:verify_parent_transid+0xf7/0x1e0 [Sat Jun 29 16:03:52 2019] Code: 48 c1 f8 06 48 c1 e0 0c 4c 3b 74 10 50 74 2f 48 c7 c6 b0 5c 25 89 48 c7 c7 60 5d 90 89 e8 d1 5a 8f 00 85 c0 0f 85 a8 00 00 00 <0f> 0b 48 89 df 41 bc 01 00 00 00 e8 69 3f 03 00 85 c0 74 59 48 8b [Sat Jun 29 16:03:52 2019] RSP: 0018:ffffaf3c8d047730 EFLAGS: 00010296 [Sat Jun 29 16:03:52 2019] RAX: 0000000000000000 RBX: ffff9c7ab0654ae0 RCX: 0000000000000000 [Sat Jun 29 16:03:52 2019] RDX: 0000000000000000 RSI: ffff9c7fbc9e7cb8 RDI: ffff9c7fbc9e7cb8 [Sat Jun 29 16:03:52 2019] RBP: ffffaf3c8d047770 R08: ffffffff8812a25f R09: 0000000000000001 [Sat Jun 29 16:03:52 2019] R10: ffffaf3c8d047670 R11: 0000000000000000 R12: 0000000000000000 [Sat Jun 29 16:03:52 2019] R13: ffff9c7f5b4ac450 R14: 0000000000067c8e R15: ffffaf3c8d047738 [Sat Jun 29 16:03:52 2019] FS: 00007f53e7b74700(0000) GS:ffff9c7fbc800000(0000) knlGS:0000000000000000 [Sat Jun 29 16:03:52 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Sat Jun 29 16:03:52 2019] CR2: 00000000f7e262c0 CR3: 000000021bf52000 CR4: 00000000000406e0 [Sat Jun 29 16:03:52 2019] Call Trace: [Sat Jun 29 16:03:52 2019] btree_read_extent_buffer_pages+0xbe/0x120 [Sat Jun 29 16:03:52 2019] read_tree_block+0x42/0x70 [Sat Jun 29 16:03:52 2019] read_block_for_search.isra.11+0x1ae/0x370 [Sat Jun 29 16:03:52 2019] btrfs_next_old_leaf+0x236/0x420 [Sat Jun 29 16:03:52 2019] resolve_indirect_refs+0x486/0x910 [Sat Jun 29 16:03:52 2019] find_parent_nodes+0x443/0x1340 [Sat Jun 29 16:03:52 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 [Sat Jun 29 16:03:52 2019] iterate_extent_inodes+0xfb/0x3e0 [Sat Jun 29 16:03:52 2019] ? _raw_spin_unlock+0x27/0x40 [Sat Jun 29 16:03:52 2019] iterate_inodes_from_logical+0xa1/0xd0 [Sat Jun 29 16:03:52 2019] ? iterate_inodes_from_logical+0xa1/0xd0 [Sat Jun 29 16:03:52 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 [Sat Jun 29 16:03:52 2019] btrfs_ioctl_logical_to_ino+0xf3/0x1a0 [Sat Jun 29 16:03:52 2019] btrfs_ioctl+0xcf7/0x2cb0 [Sat Jun 29 16:03:52 2019] ? lock_acquire+0xbd/0x1c0 [Sat Jun 29 16:03:52 2019] ? lock_acquire+0xbd/0x1c0 [Sat Jun 29 16:03:52 2019] do_vfs_ioctl+0xa6/0x6e0 [Sat Jun 29 16:03:52 2019] ? do_vfs_ioctl+0xa6/0x6e0 [Sat Jun 29 16:03:52 2019] ? __fget+0x119/0x200 [Sat Jun 29 16:03:52 2019] ksys_ioctl+0x75/0x80 [Sat Jun 29 16:03:52 2019] __x64_sys_ioctl+0x1a/0x20 [Sat Jun 29 16:03:52 2019] do_syscall_64+0x65/0x1a0 [Sat Jun 29 16:03:52 2019] entry_SYSCALL_64_after_hwframe+0x49/0xbe [Sat Jun 29 16:03:52 2019] RIP: 0033:0x7f53e9468427 [Sat Jun 29 16:03:52 2019] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 [Sat Jun 29 16:03:52 2019] RSP: 002b:00007f53e7b707c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [Sat Jun 29 16:03:52 2019] RAX: ffffffffffffffda RBX: 00007f53e7b70af0 RCX: 00007f53e9468427 [Sat Jun 29 16:03:52 2019] RDX: 00007f53e7b70af8 RSI: 00000000c038943b RDI: 0000000000000004 [Sat Jun 29 16:03:52 2019] RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f53e7b70cd0 [Sat Jun 29 16:03:52 2019] R10: 0000559f3bf9cc40 R11: 0000000000000246 R12: 0000000000000004 [Sat Jun 29 16:03:52 2019] R13: 00007f53e7b70af8 R14: 00007f53e7b709d8 R15: 00007f53e7b70c00 [Sat Jun 29 16:03:52 2019] irq event stamp: 1551156398 [Sat Jun 29 16:03:52 2019] hardirqs last enabled at (1551156397): [<ffffffff880037da>] trace_hardirqs_on_thunk+0x1a/0x1c [Sat Jun 29 16:03:52 2019] hardirqs last disabled at (1551156398): [<ffffffff88dc51f9>] __schedule+0xd9/0xb70 [Sat Jun 29 16:03:52 2019] softirqs last enabled at (1551156396): [<ffffffff890003a0>] __do_softirq+0x3a0/0x45a [Sat Jun 29 16:03:52 2019] softirqs last disabled at (1551156389): [<ffffffff880a99a9>] irq_exit+0xe9/0xf0 [Sat Jun 29 16:03:52 2019] ---[ end trace d600eda2e1469b37 ]--- which confirms the event comes from the LOGICAL_INO ioctl, at least. I had suspected that before based on timing and event log correlations, but now I have stack traces. It looks like insufficient locking, i.e. the eb got modified while LOGICAL_INO was looking at it. As usual for the "ghost" parent transid verify failure, there's no persistent failure, no error reported to applications, and error counts in 'btrfs dev stats' are not incremented. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces 2019-07-01 3:39 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces Zygo Blaxell @ 2019-07-01 5:56 ` Qu Wenruo 2019-07-03 4:32 ` Zygo Blaxell 2020-02-04 4:27 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces - Fixed in 47c8495d358b (5.4.18/5.5.2?) Zygo Blaxell 1 sibling, 1 reply; 9+ messages in thread From: Qu Wenruo @ 2019-07-01 5:56 UTC (permalink / raw) To: Zygo Blaxell, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 2339 bytes --] On 2019/7/1 上午11:39, Zygo Blaxell wrote: > On Wed, Apr 03, 2019 at 10:47:16AM -0400, Zygo Blaxell wrote: >> On Tue, Mar 12, 2019 at 12:00:25AM -0400, Zygo Blaxell wrote: >>> On 4.14.x and 4.20.14 kernels (probably all the ones in between too, >>> but I haven't tested those), I get what I call "ghost parent transid >>> verify failed" errors. Here's an unedited recent example from dmesg: >>> >>> [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 >> >> These happen much less often on 5.0.x, but they still happen from time >> to time. > > I put this patch in 5.0.21: > > commit 5abbed1af5570f1317f31736e3862e8b7df1ca8b > Author: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> > Date: Sat May 18 17:48:59 2019 -0400 > > btrfs: get a call trace when we hit ghost parent transid verify failures > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 6fe9197f6ee4..ed961d2915a1 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -356,6 +356,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree, > "parent transid verify failed on %llu wanted %llu found %llu", > eb->start, > parent_transid, btrfs_header_generation(eb)); > + WARN_ON(1); > ret = 1; > > /* > > and eventually (six weeks later!) got another reproduction of this bug > on 5.0.21: > [snip] > > which confirms the event comes from the LOGICAL_INO ioctl, at least. > I had suspected that before based on timing and event log correlations, > but now I have stack traces. > > It looks like insufficient locking, i.e. the eb got modified while > LOGICAL_INO was looking at it. For this case, a quick dirty fix would be try to joining a transaction (if the fs is not RO) and hold the trans handler to block current transaction from being committed. This is definitely going to impact performance but at least should avoid such transid mismatch call. In theory it should also affect any backref lookup not protected, like subvolume aware defrag. Thanks, Qu > > As usual for the "ghost" parent transid verify failure, there's no > persistent failure, no error reported to applications, and error counts > in 'btrfs dev stats' are not incremented. > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces 2019-07-01 5:56 ` Qu Wenruo @ 2019-07-03 4:32 ` Zygo Blaxell 2019-07-03 5:09 ` Qu Wenruo 0 siblings, 1 reply; 9+ messages in thread From: Zygo Blaxell @ 2019-07-03 4:32 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2759 bytes --] On Mon, Jul 01, 2019 at 01:56:08PM +0800, Qu Wenruo wrote: > > > On 2019/7/1 上午11:39, Zygo Blaxell wrote: > > On Wed, Apr 03, 2019 at 10:47:16AM -0400, Zygo Blaxell wrote: > >> On Tue, Mar 12, 2019 at 12:00:25AM -0400, Zygo Blaxell wrote: > >>> On 4.14.x and 4.20.14 kernels (probably all the ones in between too, > >>> but I haven't tested those), I get what I call "ghost parent transid > >>> verify failed" errors. Here's an unedited recent example from dmesg: > >>> > >>> [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 > >> > >> These happen much less often on 5.0.x, but they still happen from time > >> to time. > > > > I put this patch in 5.0.21: > > > > commit 5abbed1af5570f1317f31736e3862e8b7df1ca8b > > Author: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> > > Date: Sat May 18 17:48:59 2019 -0400 > > > > btrfs: get a call trace when we hit ghost parent transid verify failures > > > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > > index 6fe9197f6ee4..ed961d2915a1 100644 > > --- a/fs/btrfs/disk-io.c > > +++ b/fs/btrfs/disk-io.c > > @@ -356,6 +356,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree, > > "parent transid verify failed on %llu wanted %llu found %llu", > > eb->start, > > parent_transid, btrfs_header_generation(eb)); > > + WARN_ON(1); > > ret = 1; > > > > /* > > > > and eventually (six weeks later!) got another reproduction of this bug > > on 5.0.21: > > > [snip] > > > > which confirms the event comes from the LOGICAL_INO ioctl, at least. > > I had suspected that before based on timing and event log correlations, > > but now I have stack traces. > > > > It looks like insufficient locking, i.e. the eb got modified while > > LOGICAL_INO was looking at it. > > For this case, a quick dirty fix would be try to joining a transaction > (if the fs is not RO) and hold the trans handler to block current > transaction from being committed. Do you mean, revert "bfc61c36260c Btrfs: do not start a transaction at iterate_extent_inodes()"? Or something else? I've had the spurious parent transid verify failures since at least 4.14, years before that patch. > This is definitely going to impact performance but at least should avoid > such transid mismatch call. > > In theory it should also affect any backref lookup not protected, like > subvolume aware defrag. > > Thanks, > Qu > > > > > As usual for the "ghost" parent transid verify failure, there's no > > persistent failure, no error reported to applications, and error counts > > in 'btrfs dev stats' are not incremented. > > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces 2019-07-03 4:32 ` Zygo Blaxell @ 2019-07-03 5:09 ` Qu Wenruo 0 siblings, 0 replies; 9+ messages in thread From: Qu Wenruo @ 2019-07-03 5:09 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 3016 bytes --] On 2019/7/3 下午12:32, Zygo Blaxell wrote: > On Mon, Jul 01, 2019 at 01:56:08PM +0800, Qu Wenruo wrote: >> >> >> On 2019/7/1 上午11:39, Zygo Blaxell wrote: >>> On Wed, Apr 03, 2019 at 10:47:16AM -0400, Zygo Blaxell wrote: >>>> On Tue, Mar 12, 2019 at 12:00:25AM -0400, Zygo Blaxell wrote: >>>>> On 4.14.x and 4.20.14 kernels (probably all the ones in between too, >>>>> but I haven't tested those), I get what I call "ghost parent transid >>>>> verify failed" errors. Here's an unedited recent example from dmesg: >>>>> >>>>> [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 >>>> >>>> These happen much less often on 5.0.x, but they still happen from time >>>> to time. >>> >>> I put this patch in 5.0.21: >>> >>> commit 5abbed1af5570f1317f31736e3862e8b7df1ca8b >>> Author: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> >>> Date: Sat May 18 17:48:59 2019 -0400 >>> >>> btrfs: get a call trace when we hit ghost parent transid verify failures >>> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >>> index 6fe9197f6ee4..ed961d2915a1 100644 >>> --- a/fs/btrfs/disk-io.c >>> +++ b/fs/btrfs/disk-io.c >>> @@ -356,6 +356,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree, >>> "parent transid verify failed on %llu wanted %llu found %llu", >>> eb->start, >>> parent_transid, btrfs_header_generation(eb)); >>> + WARN_ON(1); >>> ret = 1; >>> >>> /* >>> >>> and eventually (six weeks later!) got another reproduction of this bug >>> on 5.0.21: >>> >> [snip] >>> >>> which confirms the event comes from the LOGICAL_INO ioctl, at least. >>> I had suspected that before based on timing and event log correlations, >>> but now I have stack traces. >>> >>> It looks like insufficient locking, i.e. the eb got modified while >>> LOGICAL_INO was looking at it. >> >> For this case, a quick dirty fix would be try to joining a transaction >> (if the fs is not RO) and hold the trans handler to block current >> transaction from being committed. > > Do you mean, revert "bfc61c36260c Btrfs: do not start a transaction at > iterate_extent_inodes()"? Or something else? > > I've had the spurious parent transid verify failures since at least 4.14, > years before that patch. I mean even longer trans protection. E.g. start a trans just before calling iterate_inodes_from_logical(), and end it after iterate_inodes_from_logical() call. Thanks, Qu > >> This is definitely going to impact performance but at least should avoid >> such transid mismatch call. >> >> In theory it should also affect any backref lookup not protected, like >> subvolume aware defrag. >> >> Thanks, >> Qu >> >>> >>> As usual for the "ghost" parent transid verify failure, there's no >>> persistent failure, no error reported to applications, and error counts >>> in 'btrfs dev stats' are not incremented. >>> >> > > > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces - Fixed in 47c8495d358b (5.4.18/5.5.2?) 2019-07-01 3:39 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces Zygo Blaxell 2019-07-01 5:56 ` Qu Wenruo @ 2020-02-04 4:27 ` Zygo Blaxell 1 sibling, 0 replies; 9+ messages in thread From: Zygo Blaxell @ 2020-02-04 4:27 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 15918 bytes --] On Sun, Jun 30, 2019 at 11:39:25PM -0400, Zygo Blaxell wrote: > On Wed, Apr 03, 2019 at 10:47:16AM -0400, Zygo Blaxell wrote: > > On Tue, Mar 12, 2019 at 12:00:25AM -0400, Zygo Blaxell wrote: > > > On 4.14.x and 4.20.14 kernels (probably all the ones in between too, > > > but I haven't tested those), I get what I call "ghost parent transid > > > verify failed" errors. Here's an unedited recent example from dmesg: > > > > > > [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744 > > > > These happen much less often on 5.0.x, but they still happen from time > > to time. > > I put this patch in 5.0.21: > > commit 5abbed1af5570f1317f31736e3862e8b7df1ca8b > Author: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> > Date: Sat May 18 17:48:59 2019 -0400 > > btrfs: get a call trace when we hit ghost parent transid verify failures > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 6fe9197f6ee4..ed961d2915a1 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -356,6 +356,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree, > "parent transid verify failed on %llu wanted %llu found %llu", > eb->start, > parent_transid, btrfs_header_generation(eb)); > + WARN_ON(1); > ret = 1; > > /* > > and eventually (six weeks later!) got another reproduction of this bug > on 5.0.21: > > [Sat Jun 29 16:03:51 2019] BTRFS error (device dm-24): parent transid verify failed on 192792461312 wanted 425102 found 425088 > [Sat Jun 29 16:03:51 2019] WARNING: CPU: 1 PID: 308 at fs/btrfs/disk-io.c:359 verify_parent_transid+0xf7/0x1e0 > [Sat Jun 29 16:03:51 2019] Modules linked in: isofs cpuid nfsv3 nfs fscache algif_skcipher af_alg mq_deadline softdog nfsd auth_rpcgss nfs_acl lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth ecdh_generic rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd xt_REDIRECT ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_nat_ipv6 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter tcp_cubic dummy ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi lp dm_crypt ppdev edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_via snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi aesni_intel > [Sat Jun 29 16:03:51 2019] aes_x86_64 crypto_simd cryptd glue_helper wmi_bmof serio_raw pcspkr k10temp fam15h_power usbnet input_leds mii joydev snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss radeon parport_pc snd_mixer_oss asus_atk0110 parport rtc_cmos snd_pcm snd_timer pcc_cpufreq snd soundcore evdev acpi_cpufreq sg sp5100_tco af_packet ipv6 crc_ccitt raid10 raid0 multipath linear hid_generic dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_mod crc32c_intel ohci_pci raid1 psmouse sr_mod md_mod cdrom pdc202xx_new ide_pci_generic atiixp ide_core i2c_piix4 ohci_hcd ehci_pci ehci_hcd r8169 xhci_pci realtek xhci_hcd wmi [last unloaded: ax88179_178a] > [Sat Jun 29 16:03:51 2019] CPU: 1 PID: 308 Comm: crawl_257 Tainted: G W 5.0.21-zb64-45132582ab2c+ #1 > [Sat Jun 29 16:03:51 2019] Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 > [Sat Jun 29 16:03:51 2019] RIP: 0010:verify_parent_transid+0xf7/0x1e0 > [Sat Jun 29 16:03:51 2019] Code: 48 c1 f8 06 48 c1 e0 0c 4c 3b 74 10 50 74 2f 48 c7 c6 b0 5c 25 89 48 c7 c7 60 5d 90 89 e8 d1 5a 8f 00 85 c0 0f 85 a8 00 00 00 <0f> 0b 48 89 df 41 bc 01 00 00 00 e8 69 3f 03 00 85 c0 74 59 48 8b > [Sat Jun 29 16:03:51 2019] RSP: 0018:ffffaf3c8d047730 EFLAGS: 00010296 > [Sat Jun 29 16:03:51 2019] RAX: 0000000000000000 RBX: ffff9c7ab0654ae0 RCX: 0000000000000000 > [Sat Jun 29 16:03:51 2019] RDX: 0000000000000000 RSI: ffff9c7fbc5e7cb8 RDI: ffff9c7fbc5e7cb8 > [Sat Jun 29 16:03:51 2019] RBP: ffffaf3c8d047770 R08: ffffffff8812a25f R09: 0000000000000001 > [Sat Jun 29 16:03:51 2019] R10: ffffaf3c8d047670 R11: 0000000000000000 R12: 0000000000000000 > [Sat Jun 29 16:03:51 2019] R13: ffff9c7f5b4ac450 R14: 0000000000067c8e R15: ffffaf3c8d047738 > [Sat Jun 29 16:03:51 2019] FS: 00007f53e7b74700(0000) GS:ffff9c7fbc400000(0000) knlGS:0000000000000000 > [Sat Jun 29 16:03:51 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [Sat Jun 29 16:03:51 2019] CR2: 000001df74da3000 CR3: 000000021bf52000 CR4: 00000000000406e0 > [Sat Jun 29 16:03:51 2019] Call Trace: > [Sat Jun 29 16:03:51 2019] btree_read_extent_buffer_pages+0xbe/0x120 > [Sat Jun 29 16:03:51 2019] read_tree_block+0x42/0x70 > [Sat Jun 29 16:03:51 2019] read_block_for_search.isra.11+0x1ae/0x370 > [Sat Jun 29 16:03:51 2019] btrfs_next_old_leaf+0x236/0x420 > [Sat Jun 29 16:03:51 2019] resolve_indirect_refs+0x486/0x910 > [Sat Jun 29 16:03:51 2019] find_parent_nodes+0x443/0x1340 > [Sat Jun 29 16:03:51 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 > [Sat Jun 29 16:03:51 2019] iterate_extent_inodes+0xfb/0x3e0 > [Sat Jun 29 16:03:51 2019] ? _raw_spin_unlock+0x27/0x40 > [Sat Jun 29 16:03:51 2019] iterate_inodes_from_logical+0xa1/0xd0 > [Sat Jun 29 16:03:51 2019] ? iterate_inodes_from_logical+0xa1/0xd0 > [Sat Jun 29 16:03:51 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 > [Sat Jun 29 16:03:51 2019] btrfs_ioctl_logical_to_ino+0xf3/0x1a0 > [Sat Jun 29 16:03:51 2019] btrfs_ioctl+0xcf7/0x2cb0 > [Sat Jun 29 16:03:51 2019] ? lock_acquire+0xbd/0x1c0 > [Sat Jun 29 16:03:51 2019] ? lock_acquire+0xbd/0x1c0 > [Sat Jun 29 16:03:51 2019] do_vfs_ioctl+0xa6/0x6e0 > [Sat Jun 29 16:03:51 2019] ? do_vfs_ioctl+0xa6/0x6e0 > [Sat Jun 29 16:03:51 2019] ? __fget+0x119/0x200 > [Sat Jun 29 16:03:51 2019] ksys_ioctl+0x75/0x80 > [Sat Jun 29 16:03:51 2019] __x64_sys_ioctl+0x1a/0x20 > [Sat Jun 29 16:03:51 2019] do_syscall_64+0x65/0x1a0 > [Sat Jun 29 16:03:51 2019] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [Sat Jun 29 16:03:51 2019] RIP: 0033:0x7f53e9468427 > [Sat Jun 29 16:03:51 2019] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 > [Sat Jun 29 16:03:51 2019] RSP: 002b:00007f53e7b707c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > [Sat Jun 29 16:03:51 2019] RAX: ffffffffffffffda RBX: 00007f53e7b70af0 RCX: 00007f53e9468427 > [Sat Jun 29 16:03:51 2019] RDX: 00007f53e7b70af8 RSI: 00000000c038943b RDI: 0000000000000004 > [Sat Jun 29 16:03:51 2019] RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f53e7b70cd0 > [Sat Jun 29 16:03:51 2019] R10: 0000559f3bf9cc40 R11: 0000000000000246 R12: 0000000000000004 > [Sat Jun 29 16:03:51 2019] R13: 00007f53e7b70af8 R14: 00007f53e7b709d8 R15: 00007f53e7b70c00 > [Sat Jun 29 16:03:51 2019] irq event stamp: 1551156398 > [Sat Jun 29 16:03:51 2019] hardirqs last enabled at (1551156397): [<ffffffff880037da>] trace_hardirqs_on_thunk+0x1a/0x1c > [Sat Jun 29 16:03:51 2019] hardirqs last disabled at (1551156398): [<ffffffff88dc51f9>] __schedule+0xd9/0xb70 > [Sat Jun 29 16:03:51 2019] softirqs last enabled at (1551156396): [<ffffffff890003a0>] __do_softirq+0x3a0/0x45a > [Sat Jun 29 16:03:51 2019] softirqs last disabled at (1551156389): [<ffffffff880a99a9>] irq_exit+0xe9/0xf0 > [Sat Jun 29 16:03:51 2019] ---[ end trace d600eda2e1469b36 ]--- > [Sat Jun 29 16:03:51 2019] BTRFS error (device dm-24): parent transid verify failed on 192792461312 wanted 425102 found 425088 > [Sat Jun 29 16:03:51 2019] WARNING: CPU: 2 PID: 308 at fs/btrfs/disk-io.c:359 verify_parent_transid+0xf7/0x1e0 > [Sat Jun 29 16:03:51 2019] Modules linked in: isofs cpuid nfsv3 nfs fscache algif_skcipher af_alg mq_deadline softdog nfsd auth_rpcgss nfs_acl lockd grace sunrpc bnep cpufreq_userspace cpufreq_powersave cpufreq_conservative nfnetlink_queue nfnetlink_log nfnetlink bluetooth ecdh_generic rfkill snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device binfmt_misc fuse nbd xt_REDIRECT ipt_REJECT nf_reject_ipv4 xt_nat xt_conntrack xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG ip6table_nat nf_nat_ipv6 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_mangle iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter tcp_cubic dummy ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi lp dm_crypt ppdev edac_mce_amd kvm_amd ccp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_via snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi aesni_intel > [Sat Jun 29 16:03:51 2019] aes_x86_64 crypto_simd cryptd glue_helper wmi_bmof serio_raw pcspkr k10temp fam15h_power usbnet input_leds mii joydev snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss radeon parport_pc snd_mixer_oss asus_atk0110 parport rtc_cmos snd_pcm snd_timer pcc_cpufreq snd soundcore evdev acpi_cpufreq sg sp5100_tco af_packet ipv6 crc_ccitt raid10 raid0 multipath linear hid_generic dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_mod crc32c_intel ohci_pci raid1 psmouse sr_mod md_mod cdrom pdc202xx_new ide_pci_generic atiixp ide_core i2c_piix4 ohci_hcd ehci_pci ehci_hcd r8169 xhci_pci realtek xhci_hcd wmi [last unloaded: ax88179_178a] > [Sat Jun 29 16:03:51 2019] CPU: 2 PID: 308 Comm: crawl_257 Tainted: G W 5.0.21-zb64-45132582ab2c+ #1 > [Sat Jun 29 16:03:52 2019] Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 > [Sat Jun 29 16:03:52 2019] RIP: 0010:verify_parent_transid+0xf7/0x1e0 > [Sat Jun 29 16:03:52 2019] Code: 48 c1 f8 06 48 c1 e0 0c 4c 3b 74 10 50 74 2f 48 c7 c6 b0 5c 25 89 48 c7 c7 60 5d 90 89 e8 d1 5a 8f 00 85 c0 0f 85 a8 00 00 00 <0f> 0b 48 89 df 41 bc 01 00 00 00 e8 69 3f 03 00 85 c0 74 59 48 8b > [Sat Jun 29 16:03:52 2019] RSP: 0018:ffffaf3c8d047730 EFLAGS: 00010296 > [Sat Jun 29 16:03:52 2019] RAX: 0000000000000000 RBX: ffff9c7ab0654ae0 RCX: 0000000000000000 > [Sat Jun 29 16:03:52 2019] RDX: 0000000000000000 RSI: ffff9c7fbc9e7cb8 RDI: ffff9c7fbc9e7cb8 > [Sat Jun 29 16:03:52 2019] RBP: ffffaf3c8d047770 R08: ffffffff8812a25f R09: 0000000000000001 > [Sat Jun 29 16:03:52 2019] R10: ffffaf3c8d047670 R11: 0000000000000000 R12: 0000000000000000 > [Sat Jun 29 16:03:52 2019] R13: ffff9c7f5b4ac450 R14: 0000000000067c8e R15: ffffaf3c8d047738 > [Sat Jun 29 16:03:52 2019] FS: 00007f53e7b74700(0000) GS:ffff9c7fbc800000(0000) knlGS:0000000000000000 > [Sat Jun 29 16:03:52 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [Sat Jun 29 16:03:52 2019] CR2: 00000000f7e262c0 CR3: 000000021bf52000 CR4: 00000000000406e0 > [Sat Jun 29 16:03:52 2019] Call Trace: > [Sat Jun 29 16:03:52 2019] btree_read_extent_buffer_pages+0xbe/0x120 > [Sat Jun 29 16:03:52 2019] read_tree_block+0x42/0x70 > [Sat Jun 29 16:03:52 2019] read_block_for_search.isra.11+0x1ae/0x370 > [Sat Jun 29 16:03:52 2019] btrfs_next_old_leaf+0x236/0x420 > [Sat Jun 29 16:03:52 2019] resolve_indirect_refs+0x486/0x910 > [Sat Jun 29 16:03:52 2019] find_parent_nodes+0x443/0x1340 > [Sat Jun 29 16:03:52 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 > [Sat Jun 29 16:03:52 2019] iterate_extent_inodes+0xfb/0x3e0 > [Sat Jun 29 16:03:52 2019] ? _raw_spin_unlock+0x27/0x40 > [Sat Jun 29 16:03:52 2019] iterate_inodes_from_logical+0xa1/0xd0 > [Sat Jun 29 16:03:52 2019] ? iterate_inodes_from_logical+0xa1/0xd0 > [Sat Jun 29 16:03:52 2019] ? btrfs_inode_flags_to_fsflags+0x90/0x90 > [Sat Jun 29 16:03:52 2019] btrfs_ioctl_logical_to_ino+0xf3/0x1a0 > [Sat Jun 29 16:03:52 2019] btrfs_ioctl+0xcf7/0x2cb0 > [Sat Jun 29 16:03:52 2019] ? lock_acquire+0xbd/0x1c0 > [Sat Jun 29 16:03:52 2019] ? lock_acquire+0xbd/0x1c0 > [Sat Jun 29 16:03:52 2019] do_vfs_ioctl+0xa6/0x6e0 > [Sat Jun 29 16:03:52 2019] ? do_vfs_ioctl+0xa6/0x6e0 > [Sat Jun 29 16:03:52 2019] ? __fget+0x119/0x200 > [Sat Jun 29 16:03:52 2019] ksys_ioctl+0x75/0x80 > [Sat Jun 29 16:03:52 2019] __x64_sys_ioctl+0x1a/0x20 > [Sat Jun 29 16:03:52 2019] do_syscall_64+0x65/0x1a0 > [Sat Jun 29 16:03:52 2019] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [Sat Jun 29 16:03:52 2019] RIP: 0033:0x7f53e9468427 > [Sat Jun 29 16:03:52 2019] Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 > [Sat Jun 29 16:03:52 2019] RSP: 002b:00007f53e7b707c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > [Sat Jun 29 16:03:52 2019] RAX: ffffffffffffffda RBX: 00007f53e7b70af0 RCX: 00007f53e9468427 > [Sat Jun 29 16:03:52 2019] RDX: 00007f53e7b70af8 RSI: 00000000c038943b RDI: 0000000000000004 > [Sat Jun 29 16:03:52 2019] RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f53e7b70cd0 > [Sat Jun 29 16:03:52 2019] R10: 0000559f3bf9cc40 R11: 0000000000000246 R12: 0000000000000004 > [Sat Jun 29 16:03:52 2019] R13: 00007f53e7b70af8 R14: 00007f53e7b709d8 R15: 00007f53e7b70c00 > [Sat Jun 29 16:03:52 2019] irq event stamp: 1551156398 > [Sat Jun 29 16:03:52 2019] hardirqs last enabled at (1551156397): [<ffffffff880037da>] trace_hardirqs_on_thunk+0x1a/0x1c > [Sat Jun 29 16:03:52 2019] hardirqs last disabled at (1551156398): [<ffffffff88dc51f9>] __schedule+0xd9/0xb70 > [Sat Jun 29 16:03:52 2019] softirqs last enabled at (1551156396): [<ffffffff890003a0>] __do_softirq+0x3a0/0x45a > [Sat Jun 29 16:03:52 2019] softirqs last disabled at (1551156389): [<ffffffff880a99a9>] irq_exit+0xe9/0xf0 > [Sat Jun 29 16:03:52 2019] ---[ end trace d600eda2e1469b37 ]--- > > which confirms the event comes from the LOGICAL_INO ioctl, at least. > I had suspected that before based on timing and event log correlations, > but now I have stack traces. > > It looks like insufficient locking, i.e. the eb got modified while > LOGICAL_INO was looking at it. > > As usual for the "ghost" parent transid verify failure, there's no > persistent failure, no error reported to applications, and error counts > in 'btrfs dev stats' are not incremented. I'm pretty sure this is fixed by the third tree mod log patch, although it's difficult to test without also applying the first two: 6609fee8897a Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues efad8a853ad2 Btrfs: fix use-after-free when using the tree modification log 47c8495d358b Btrfs: fix race between adding and putting tree mod seq elements and nodes The third patch fixes insufficient locking in LOGICAL_INO, FIEMAP, balance, resize, scrub error reporting, and anything else that iterates over extent references. With all three patches applied I can no longer produce ghost PTV failures on any kernel 4.20..5.5 under my usual stress test (rsync, bees, snapshots, scrub, balance). Without the third patch, I get a few PTV failures per hour on the most recent version of my stress test, as well as other usually-fatal errors like "unable to find logical 0 length 4096" and even the occasional "bad tree block", all of which have no apparent effect other than emitting the kernel log messages. Also, the affected kernel range checks out: I've observed the ghost PTV failures on kernels older than 4.14, and the tree mod log patches fix bugs introduced in 3.10. Thanks Filipe for all three patches. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2020-02-04 4:27 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-03-12 4:00 Spurious "ghost" "parent transid verify failed" messages on 4.14.x, 4.20.14 Zygo Blaxell 2019-03-12 17:34 ` Jorge Bastos 2019-04-03 16:07 ` Zygo Blaxell 2019-04-03 14:47 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.5 + lockdep warning Zygo Blaxell 2019-07-01 3:39 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces Zygo Blaxell 2019-07-01 5:56 ` Qu Wenruo 2019-07-03 4:32 ` Zygo Blaxell 2019-07-03 5:09 ` Qu Wenruo 2020-02-04 4:27 ` Spurious "ghost" "parent transid verify failed" messages on 5.0.21 - with call traces - Fixed in 47c8495d358b (5.4.18/5.5.2?) Zygo Blaxell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox