Date: Wed, 1 Nov 2023 11:24:23 +0800
From: Ming Lei
To: Marek Marczykowski-Górecki
Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable@vger.kernel.org, regressions@lists.linux.dev,
	Alasdair Kergon, Mike Snitzer, dm-devel@lists.linux.dev, linux-mm@kvack.org,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, ming.lei@redhat.com
Subject: Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki wrote:
> > >
> > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > Then tried:
> > > > > - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > > - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > > - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > >
> > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4, order=5 case several times
> > > > > and I can't reproduce the issue there. I'm confused...
> > > >
> > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > causing hangs is most likely just a coincidence. Rather, something either
> > > > in the block layer or in the storage driver has problems with handling
> > > > bios with sufficiently high-order pages attached. This is going to be a
> > > > bit painful to debug, I'm afraid. How long does it take for you to
> > > > trigger the hang? I'm asking to get a rough estimate of how heavy a
> > > > tracing load we can afford so that we don't overwhelm the system...
> > >
> > > Sometimes it freezes just after logging in, but in the worst case it takes
> > > me about 10 min of more or less `tar xz` + `dd`.
> >
> > blk-mq debugfs is usually helpful for hang issues in the block layer or
> > underlying drivers:
> >
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> >
> > BTW, you can just collect the logs of the exact disks if you know what
> > is behind dm-crypt, which can be figured out with `lsblk`; it has to be
> > collected after the hang is triggered.
>
> dm-crypt lives on the nvme disk, this is what I collected when it
> hung:
> ...
> nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}

Two requests stay in the sw queue, but they are not related to this issue.

> nvme0n1/hctx4/type:default
> nvme0n1/hctx4/dispatch_busy:9

A non-zero dispatch_busy means that BLK_STS_RESOURCE has been returned from
nvme_queue_rq() recently and frequently.
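
BTW, if you want to confirm this is a persistent stall rather than a one-off
snapshot, re-reading the same debugfs files a few times while the system is
hung should be enough. A minimal sketch, using the disk and hctx names from
your dump above (adjust nvme0n1/hctx4 if they differ on your side):

while true; do
	grep -aH . /sys/kernel/debug/block/nvme0n1/hctx4/{dispatch_busy,state,tags}
	sleep 1
done

If dispatch_busy stays non-zero and tags:busy never drops, the hctx is stuck
rather than just momentarily busy.
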
> nvme0n1/hctx4/active:0
> nvme0n1/hctx4/run:20290468

...

> nvme0n1/hctx4/tags:nr_tags=1023
> nvme0n1/hctx4/tags:nr_reserved_tags=0
> nvme0n1/hctx4/tags:active_queues=0
> nvme0n1/hctx4/tags:bitmap_tags:
> nvme0n1/hctx4/tags:depth=1023
> nvme0n1/hctx4/tags:busy=3

Just three requests are in flight: two of them sit in the sw queue, and the
other one is in hctx->dispatch.

...

> nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx4/state:SCHED_RESTART

The request sitting in hctx->dispatch can't make progress because
nvme_queue_rq() keeps returning BLK_STS_RESOURCE for it. You can verify that
with the following bpftrace one-liner once the hang is triggered:

bpftrace -e 'kretfunc:nvme_queue_rq { @[retval, kstack]=count() }'

It is very likely that a memory allocation inside nvme_queue_rq() can't
succeed, so blk-mq has no choice but to retry by calling nvme_queue_rq() on
the above request over and over.

Thanks,
Ming
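
P.S. In case kfunc/kretfunc probes are not available on your kernel (they
need BTF), a plain kretprobe variant of the same check should work too,
assuming the nvme_queue_rq symbol is visible to kprobes:

bpftrace -e 'kretprobe:nvme_queue_rq { @[retval]=count() }'

If the allocation theory is right, the counter for retval 9 (BLK_STS_RESOURCE
in include/linux/blk_types.h, unless that value has changed) should keep
growing while the hang lasts.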