From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C4B8AC4332F for ; Sun, 13 Nov 2022 20:05:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235338AbiKMUFN (ORCPT ); Sun, 13 Nov 2022 15:05:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59464 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233847AbiKMUFN (ORCPT ); Sun, 13 Nov 2022 15:05:13 -0500 Received: from drax.kayaks.hungrycats.org (drax.kayaks.hungrycats.org [174.142.148.226]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 25B1464F2 for ; Sun, 13 Nov 2022 12:05:10 -0800 (PST) Received: by drax.kayaks.hungrycats.org (Postfix, from userid 1002) id 795D55FF2A9; Sun, 13 Nov 2022 15:05:08 -0500 (EST) Date: Sun, 13 Nov 2022 15:05:08 -0500 From: Zygo Blaxell To: linux-btrfs@vger.kernel.org Subject: Re: infinite looping in logical_ino ioctl Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Fri, Nov 11, 2022 at 10:49:17PM -0500, Zygo Blaxell wrote: > I've been chasing an infinite loop in the logical_ino ioctl that appears > when dedupe and logical_ino are used together on the same filesystem. [...] > add_all_parents: I added a simple loop counter to make it easier to get gdb to stop at the right place: size_t loop_count = 0; > 462 while (!ret && count < ref->count) { if (loop_count++ == 999999) { printk(KERN_ERR "loop_count exceeded, break here\n"); } Then I set a breakpoint there, and had the debugger set count = 1 to force logical_ino out of the loop in add_all_parents. Almost immediately, the same thread goes right back into add_all_parents with a different search key: (gdb) commands 6 >set var count=1 >c >end (gdb) continue Thread 3 hit Breakpoint 6, add_all_parents (ignore_offset=, extent_item_pos=0xffffc90008897be8, time_seq=11943, level=, ref=0xffff8881dca3a210, preftrees=0xffffc90008897ac0, parents=0xffff88842c9920a0, path=0xffff88815d347460, root=) at fs/btrfs/backref.c:466 466 printk(KERN_ERR "loop_count exceeded, break here\n"); 1: key = {objectid = 304, type = 108 'l', offset = 318775296} 2: count = 0 3: *key_for_search = {objectid = 304, type = 108 'l', offset = 721883136} 4: ref->count = 1 Thread 3 hit Breakpoint 6, add_all_parents (ignore_offset=, extent_item_pos=0xffffc90008897be8, time_seq=11943, level=, ref=0xffff88842f8a9630, preftrees=0xffffc90008897ac0, parents=0xffff88842c9920a0, path=0xffff88815d347460, root=) at fs/btrfs/backref.c:466 466 printk(KERN_ERR "loop_count exceeded, break here\n"); 1: key = {objectid = 304, type = 108 'l', offset = 318775296} 2: count = 0 3: *key_for_search = {objectid = 304, type = 108 'l', offset = 721879040} 4: ref->count = 1 Thread 3 hit Breakpoint 6, add_all_parents (ignore_offset=, extent_item_pos=0xffffc90008897be8, time_seq=11943, level=, ref=0xffff8881a5a86478, preftrees=0xffffc90008897ac0, parents=0xffff88842c9920a0, path=0xffff88815d347460, root=) at fs/btrfs/backref.c:466 466 printk(KERN_ERR "loop_count exceeded, break here\n"); 1: key = {objectid = 304, type = 108 'l', offset = 318775296} 2: count = 0 3: *key_for_search = {objectid = 304, type = 108 'l', offset = 721874944} 4: ref->count = 2 Thread 3 hit Breakpoint 6, add_all_parents (ignore_offset=, extent_item_pos=0xffffc90008897be8, time_seq=11943, level=, ref=0xffff8881a03d1f20, preftrees=0xffffc90008897ac0, parents=0xffff88842c9920a0, path=0xffff88815d347460, root=) at fs/btrfs/backref.c:466 466 printk(KERN_ERR "loop_count exceeded, break here\n"); 1: key = {objectid = 304, type = 108 'l', offset = 318775296} 2: count = 0 3: *key_for_search = {objectid = 304, type = 108 'l', offset = 721850368} 4: ref->count = 1 Thread 3 hit Breakpoint 6, add_all_parents (ignore_offset=, extent_item_pos=0xffffc90008897be8, time_seq=11943, level=, ref=0xffff88816b340c60, preftrees=0xffffc90008897ac0, parents=0xffff88842c9920a0, path=0xffff88815d347460, root=) at fs/btrfs/backref.c:466 466 printk(KERN_ERR "loop_count exceeded, break here\n"); 1: key = {objectid = 304, type = 108 'l', offset = 318775296} 2: count = 0 3: *key_for_search = {objectid = 304, type = 108 'l', offset = 721829888} 4: ref->count = 1 That did get logical_ino to eventually give up and return to userspace, which ends my debugging session for today. Next time it gets stuck, I'll put some breakpoints further up the call stack to see what the caller's state looks like. I'm not sure how to debug btrfs_next_old_item(), as gdb has some real problems controlling execution in that function, and examining varibles in there is a great way to find the bugs in gdb itself. Is there a GDB trick to get the kernel to dump out the eb contents from gdb? Calling btrfs_print_leaf(eb) from the debugger seems to make the kernel splat. I suppose I could call it when loop_count hits 999999. This part was wrong in my earlier post: > When I tried to probe more deeply what btrfs_next_old_item was doing, > I found that the code is somehow executing btrfs_next_item on line 519, > despite time_seq having the value 3722 at the time. Iteration over the > items in views at two different points in time sounds like it could > result in infinite looping, but I wasn't able to confirm that before > the gdb session died. I'm now waiting for my test VM to repro again. This is just an artifact of inline function optimization. The assembly code for that if statement is full of jumps that don't correspond neatly to source lines. If the pc is pointing anywhere near that block, gdb thinks it's on line 519.