From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from rivercloud.ext.redscript.org (rivercloud.ext.redscript.org [181.214.58.244])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8013813F43A
 for ; Tue, 3 Dec 2024 06:13:15 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=181.214.58.244
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
 t=1733206407; cv=none;
 b=t3h6UfCqhE8WW+1FaoT9IKlkvKIjtY7o73UVR/1xaqKUNa2t6GvtWQXtH4V80HPTvivJP8zAe7N4Z9kjm19NpGJD4aYmy9pmx4KWaRQBUNSl4T5jDK37G3RpBM6ApK/ksOj65BqlLXEz40nsU/dfHOX0cI/+40TGGftu9C9ZpsE=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
 s=arc-20240116; t=1733206407; c=relaxed/simple;
 bh=0N8NtPsJfMNbU6P0QcZNzteIiC69tm3Bk7HDIb2D9nw=;
 h=Message-ID:Date:MIME-Version:Subject:From:To:Cc:References:
 In-Reply-To:Content-Type;
 b=KY9pIPtfMPfjkl6s5F38Y1J2eNp8qzJ8VzFizlYy3qyeP1heB91OvUIC385dVImnaJGJzcw9K39ylLBUC2c429oHYyKTE2BMmJdBjqbsGea2m/2clk4A1Dtlvvh/Uo+dv+QogxWhrR+k5GAJpzz+xXKFw77VeCVldDOaWRRjDfc=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=redscript.org;
 spf=pass smtp.mailfrom=redscript.org;
 dkim=permerror (0-bit key) header.d=redscript.org header.i=@redscript.org header.b=YmYvAQIf;
 dkim=pass (1024-bit key) header.d=redscript.org header.i=@redscript.org header.b=lVus2TFf;
 arc=none smtp.client-ip=181.214.58.244
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=redscript.org
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redscript.org
Authentication-Results: smtp.subspace.kernel.org;
 dkim=permerror (0-bit key) header.d=redscript.org header.i=@redscript.org header.b="YmYvAQIf";
 dkim=pass (1024-bit key) header.d=redscript.org header.i=@redscript.org header.b="lVus2TFf"
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed;
 d=redscript.org; s=mail2-ed25519-2024; t=1733205994;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references;
 bh=Ww+92E3Vs/yf6o1TR71WV7BDnpS35SmTQRkvttBan5o=;
 b=YmYvAQIf41cohrGT9H49khF9u/QbzQdBpJQRHjZAInGXlBrFIc/igBaHykh6+WDF2xWhNo
 KLEvGENPWJTFWSBQ==
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=redscript.org; s=mail2-rsa-2024; t=1733205994;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references;
 bh=Ww+92E3Vs/yf6o1TR71WV7BDnpS35SmTQRkvttBan5o=;
 b=lVus2TFf5sXsem8CUb/cZPWERZmFn6ZffLV59aXqluGXo1CBuxrPz6qNfPAd2CUfliHJKF
 6gNc9GAW+CfYb0qt80ZHocjl6teICsQ5PTW6owSe2ecnCoWEoZ7hlrGxp5O7YxDvyDG/HK
 Y/vqxI6u54NU+ChihQJbcruWd1uJ7Eg=
Received: by rivercloud.ext.redscript.org (OpenSMTPD) with ESMTPSA id ca24d00f
 (TLSv1.3:TLS_AES_256_GCM_SHA384:256:NO);
 Tue, 3 Dec 2024 06:06:34 +0000 (UTC)
Message-ID:
Date: Tue, 3 Dec 2024 10:06:29 +0400
Precedence: bulk
X-Mailing-List: linux-bcachefs@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] bcachefs: Allocator now directly wakes up copygc when necessary
From: Ahmad Draidi
To: linux-bcachefs@vger.kernel.org
Cc: syzbot+7bf808f7fe4a6549f36e@syzkaller.appspotmail.com
References: <20241019215605.160125-1-kent.overstreet@linux.dev>
 <92dce846-d110-4c97-afd1-0b198c1fdf4d@redscript.org>
Content-Language: en-US
In-Reply-To: <92dce846-d110-4c97-afd1-0b198c1fdf4d@redscript.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

Hello,

On 10/24/24 07:46, Ahmad Draidi wrote:
> Greetings,
>
>
> On 10/20/24 01:56, Kent Overstreet wrote:
>> copygc tries to wait in a way that balances waiting for work to
>> accumulate with running before we run out of free space - but for a
>> variety of reasons (multiple devices, io clock slop, the vagaries of
>> fragmentation) this isn't completely reliable.
>>
>> So to avoid getting stuck, add direct wakeups from the allocator to the
>> copygc thread when we start to notice we're low on free buckets.
>
> Since I switched to 6.11.x from 6.10.x, I've had "Allocator stuck?
> Waited for 30 seconds" messages and I/O would stop to the FS. No
> timeout on read, for example, but it just stops for hours, until I
> reboot. I'm able to quickly and reliably trigger this with my workload.
>
>
> I applied this patch on top of 6.11.4 but can still see "Allocator
> stuck" in dmesg. I see the following before and after the patch:-
>
> "BUG: unable to handle page fault for address: fffffffffffff81b
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page"
>
> ...
>
> "RIP: 0010:bch2_btree_path_peek_slot+0x64/0x210 [bcachefs]"
>
>
> A longer log snippet of "allocator stuck" and the above are at:
> https://pastebin.com/ptuzaryi

Just a quick update for anyone reading this. The issue is solved for me
after upgrading to 6.12.1.

>
>
> I did fsck after FS got stuck, and errors were found and fixed, but
> issue happens again, before and after the patch.
>
> Some info that might be needed: I'm using ECC RAM, 2x SAS SSDs, 2x
> SATA HDDs, LUKS, and the following opts:
>
> starting version 1.12: rebalance_work_acct_fix
> opts=metadata_replicas=2,data_replicas=2,metadata_replicas_required=2,data_replicas_required=2,
>
> metadata_checksum=xxhash,data_checksum=xxhash,compression=lz4,background_compression=gzip,metadata_target=ssd,foreground_target=ssd,
>
>
> background_target=hdd,promote_target=ssd
>
>
> Let me know if I can help.
>
>
> Thanks!
>
> Ahmad
>
>