From: FMDF
Date: Sun, 6 Feb 2022 22:22:16 +0100
Subject: Re: How to debug stuck read?
To: Matthew Wilcox
Cc: Dāvis Mosāns, linux-fsdevel@vger.kernel.org, BTRFS, kernelnewbies

On Sun, Feb 6, 2022 at 1:48 PM Matthew Wilcox wrote:
>
> On Sun, Feb 06, 2022 at 12:01:02PM +0100, FMDF wrote:
> > On Wed, Feb 2, 2022 at 10:50 PM Dāvis Mosāns wrote:
> > >
> > > On Wed, Feb 2, 2022 at 21:13, Matthew Wilcox wrote:
> > > >
> > > > On Wed, Feb 02, 2022 at 07:15:14PM +0200, Dāvis Mosāns wrote:
> > > > > I have a corrupted file on BTRFS which has CoW disabled thus no
> > > > > checksum. Trying to read this file causes the process to get stuck
> > > > > forever. It doesn't return EIO.
> > > > >
> > > > > How can I find out why it gets stuck?
> > > > >
> > > > > $ cat /proc/3449/stack | ./scripts/decode_stacktrace.sh vmlinux
> > > > > folio_wait_bit_common (mm/filemap.c:1314)
> > > > > filemap_get_pages (mm/filemap.c:2622)
> > > > > filemap_read (mm/filemap.c:2676)
> > > > > new_sync_read (fs/read_write.c:401 (discriminator 1))
> > > >
> > > > folio_wait_bit_common() is where it waits for the page to be unlocked.
> > > > Probably the problem is that btrfs isn't unlocking the page on
> > > > seeing the error, so you don't get the -EIO returned?
> > > >
> > >
> > > Yeah, but how to find where that happens.
> > > Anyway by pure luck I found memcpy that wrote outside of allocated
> > > memory and fixing that solved this issue but I still don't know how to
> > > debug this properly.
> > >
> > There is no special recipe for debugging "this properly" :)
> >
> > You wrote that "by pure luck" you found a memcpy() that wrote beyond the
> > limit of allocated memory. I suppose that you found that faulty memcpy()
> > somewhere in one of the function listed in the stack trace.
>
> I very much doubt that. The code flow here is:
>
> userspace calls read() -> VFS -> btrfs -> block layer -> return to btrfs
> -> return to VFS, wait for read to complete. So by the time anyone's
> looking at the stack trace, all you can see is the part of the call
> chain in the VFS. There's no way to see where we went in btrfs, nor
> in the block layer. We also can't see from the stack trace what
> happened with the interrupt which _should have_ cleared the lock bit
> and didn't.
>
OK, I agree. This appears to be one of those special cases where merely
reading a stack trace cannot help much... :(

My argument is about a general approach to debugging unknown code by
just reading the call chain. Many times I've been able to find out what was
wrong with code I had never seen before just by following the chain of calls
in subsystems that I know nothing about (e.g., a bug in "tty" that was
reported by Syzbot).

In this special case, if the developer doesn't know about "the interrupt
[which] _should have_ cleared the lock bit and didn't", there is nothing that
one can deduce from a stack trace.

Here one needs to know how things work, well beyond the functions that are
listed in the trace. So, probably, if one needs a "recipe" for those cases,
the recipe is just to know the subsystem(s) at hand and how the kernel
manages interrupts.

Actually, I haven't looked deeply into this issue but, from what Matthew
writes, I doubt that a faulty memcpy() can be the culprit... Davis, are you
really sure that you've fixed that bug?

Regards,

Fabio