From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from prime.voidband.net (prime.voidband.net [199.247.17.104]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EFFA51A9F99; Tue, 12 Aug 2025 07:37:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=199.247.17.104 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1754984266; cv=none; b=RhJyDE5aNDZRmHPAlhFKXZt9zNKk/tagDPWxMkq8t1Xj40Qr/LQEnWRd32YnoH3sBpA8I5Mjcd25I98dyemNTk8cSTzgzuq37TEePvRQr3hlZrSUxbeeujgbuVM7Z6g5N6qT5keuqfe8Y+m6Rf96NM9ymxdN227WCL/+c0wDYG0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1754984266; c=relaxed/simple; bh=YF2BOgtvsRz+yT+YAwLFEroi8NzBJol25StSpW8uatM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=axeL+tspsNUCfNNdSWbyRjBjye+BtAay5Eyt282pUM71zDzVC7gNMAdCuQThgKHjYaiz95DofS1FlnOllSbMGBwhRhXRh4OE3nzKDjVdQyO8ewpvKktWe75sTSLnonw9hIAE5xx9YzwRjAtvCdUJjqkHqRgv9ouB1VlgrbxCc/U= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=natalenko.name; spf=pass smtp.mailfrom=natalenko.name; dkim=pass (1024-bit key) header.d=natalenko.name header.i=@natalenko.name header.b=R2AMMGg3; arc=none smtp.client-ip=199.247.17.104 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=natalenko.name Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=natalenko.name Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=natalenko.name header.i=@natalenko.name header.b="R2AMMGg3" Received: from spock.localnet (unknown [212.20.115.26]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange x25519 server-signature ECDSA (prime256v1) server-digest SHA256) (No client certificate requested) by prime.voidband.net (Postfix) with ESMTPSA id 89CA1635B040; Tue, 12 Aug 2025 09:37:34 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=natalenko.name; s=dkim-20170712; t=1754984254; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Ko8fbK5GyQxFcouQBC+uaA5wBDCuoavWYQ2EzMwpjWI=; b=R2AMMGg3UcI+D1Q0kO3yhjRvyygTC1w++Aaq90YxKJqFZseG8At7rRfbGdaSW+zLfWooYT tJKBdsrKFfPWu+xWmebP5lZPTbF89Ein4L0rHp6/4CcTq97EWKMpcB5DpT6hZaO7HLbolf oMhtdV2QQXFYdZIbIqWZmydi3U/+cRE= From: Oleksandr Natalenko To: David Rientjes , Damien Le Moal Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, Jens Axboe , John Garry , Christoph Hellwig , "Martin K. Petersen" , linux-mm@kvack.org, Lorenzo Stoakes , Shakeel Butt , Qi Zheng , Michal Hocko , David Hildenbrand , Johannes Weiner , Andrew Morton Subject: Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages Date: Tue, 12 Aug 2025 09:37:22 +0200 Message-ID: <3553647.iZASKD2KPV@natalenko.name> In-Reply-To: <33b6c9a3-3165-4ce8-9667-afdbaff2c3ae@kernel.org> References: <5905724.LvFx2qVVIh@natalenko.name> <15056829.uLZWGnKmhe@natalenko.name> <33b6c9a3-3165-4ce8-9667-afdbaff2c3ae@kernel.org> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart3246356.PYKUYFuaPT"; micalg="pgp-sha512"; protocol="application/pgp-signature" --nextPart3246356.PYKUYFuaPT Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"; protected-headers="v1" From: Oleksandr Natalenko Date: Tue, 12 Aug 2025 09:37:22 +0200 Message-ID: <3553647.iZASKD2KPV@natalenko.name> In-Reply-To: <33b6c9a3-3165-4ce8-9667-afdbaff2c3ae@kernel.org> MIME-Version: 1.0 Hello. On =C3=BAter=C3=BD 12. srpna 2025 2:45:02, st=C5=99edoevropsk=C3=BD letn=C3= =AD =C4=8Das Damien Le Moal wrote: > On 8/12/25 5:42 AM, Oleksandr Natalenko wrote: > > On pond=C4=9Bl=C3=AD 11. srpna 2025 18:06:16, st=C5=99edoevropsk=C3=BD = letn=C3=AD =C4=8Das David Rientjes wrote: > >> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote: > >>> I'm fairly confident that the following commit > >>> > >>> 459779d04ae8d block: Improve read ahead size for rotational devices > >>> > >>> caused a regression in my test bench. > >>> > >>> I'm running v6.17-rc1 in a small QEMU VM with virtio-scsi disk. It ha= s got 1 GiB of RAM, so I can saturate it easily causing reclaiming mechanis= m to kick in. > >>> > >>> If MGLRU is enabled: > >>> > >>> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > >>> > >>> then, once page cache builds up, an OOM happens without reclaiming in= active file pages: [1]. Note that inactive_file:506952kB, I'd expect these = to be reclaimed instead, like how it happens with v6.16. > >>> > >>> If MGLRU is disabled: > >>> > >>> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms > >>> > >>> then OOM doesn't occur, and things seem to work as usual. > >>> > >>> If MGLRU is enabled, and 459779d04ae8d is reverted on top of v6.17-rc= 1, the OOM doesn't happen either. > >>> > >>> Could you please check this? > >>> > >> > >> This looks to be an MGLRU policy decision rather than a readahead=20 > >> regression, correct? > >> > >> Mem-Info: > >> active_anon:388 inactive_anon:5382 isolated_anon:0 > >> active_file:9638 inactive_file:126738 isolated_file:0 > >> > >> Setting min_ttl_ms to 1000 is preserving the working set and triggerin= g=20 > >> the oom kill is the only alternative to free memory in that configurat= ion. =20 > >> The oom kill is being triggered by kswapd for this purpose. > >> > >> So additional readahead would certainly increase that working set. Th= is=20 > >> looks working as intended. > >=20 > > OK, this makes sense indeed, thanks for the explanation. But is inactiv= e_file explosion expected and justified? > >=20 > > Without revert: > >=20 > > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl = =2Dkb >/dev/null; free -m > > 3 > > total used free shared buff/cache = available > > Mem: 690 179 536 3 57 = 510 > > Swap: 1379 12 1367 > > /* OOM happens here */ > > total used free shared buff/cache = available > > Mem: 690 177 52 3 561 = 513 > > Swap: 1379 17 1362=20 > >=20 > > With revert: > >=20 > > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl = =2Dkb >/dev/null; free -m > > 3 > > total used free shared buff/cache = available > > Mem: 690 214 498 4 64 = 476 > > Swap: 1379 0 1379 > > /* no OOM */ > > total used free shared buff/cache = available > > Mem: 690 209 462 4 119 = 481 > > Swap: 1379 0 1379 > >=20 > > The journal folder size is: > >=20 > > $ sudo du -hs /var/log/journal > > 575M /var/log/journal > >=20 > > It looks like this readahead change causes far more data to be read tha= n actually needed? >=20 > For your drive as seen by the VM, what is the value of > /sys/block/sdX/queue/optimal_io_size ? >=20 > I guess it is "0", as I see on my VM. Yes, it's 0. > So before 459779d04ae8d, the block device read_ahead_kb was 128KB only, a= nd > 459779d04ae8d switched it to be 2 times the max_sectors_kb, so 8MB. This = change > significantly improves file buffered read performance on HDDs, and HDDs o= nly. Right, max_sectors_kb is 4096. > This means that your VM device is probably being reported as a rotational= one > (/sys/block/sdX/queue/rotational is 1), which is normal if you attached an > actual HDD. If you are using a qcow2 image for that disk, then having > rotational=3D=3D1 is questionable... Yes, it's reported as rotational by default. I've just set -device scsi-hd,drive=3Dhd1,rotation_rate=3D1 so that guest w= ill see the drive as non-rotational from now on, which brings old behaviour= back. > The other issue is the device driver for the device reporting 0 for the o= ptimal > IO size, which normally happens only for SATA drives. I see the same with > virtio-scsi, which is also questionable given that the maximum IO size wi= th it > is fairly limited. So virtio-scsi may need some tweaking. >=20 > The other thing to question, I think, is setting read_ahead_kb using the > optimal_io_size limit (io_opt), which can be *very large*. For most SCSI > devices, it is 16MB, so you will see a read_ahead_kb of 32 MB. But for SC= SI > devices, optimal_io_size indicates a *maximum* IO size beyond which perfo= rmance > may degrade. So using any value lower than this, but still reasonably lar= ge, > would be better in general I think. Note that lim->io_opt for RAID arrays > actually indicates the stripe size, so generally a lot smaller than the > component drives io_opt. And this use changes the meaning of that queue l= imit, > which makes things even more confusing and finding an adequate default ha= rder. Thank you for the explanation. =2D-=20 Oleksandr Natalenko, MSE --nextPart3246356.PYKUYFuaPT Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEZUOOw5ESFLHZZtOKil/iNcg8M0sFAmia7zIACgkQil/iNcg8 M0tTChAAwRcANFL3mXoAdtvq28I6ZKLSmHdY2tZhwVBttLRFWCXreDnNiYJV/+GH gIW7xVzpfYxVZ2jXjA+pNfEng7WrXYFQKrwXKdaexfUDS0/NS0uq8ICh9uCLb+Bp 00F4E8IOL/DkrzC/AfxuXYgDo9LsCIi0+8sPWkMAuYLCxCWL52cal3yVxCq6wtlY JgxHcpxfC/zu6avLp0CrsPVxIYG2i5QV4F/9orZHUhIlitcjbvscsInu7KulDME5 DT2gm+bEKxIHCLxCzNMqGm4b1fd5OF0nFaNypnOUhy30qvLawTTqmxd/UUoHpW6p BgViyIsGv4XukLKKJOKu3/A02rzT055QixDTXybSGDLowBD7Ia9tTV1+V4JrEuw9 M/+V1zqwqIoyhsIIiu6LaKANaMATCFVrd4wSN8TkXyMh/yZzHjYuJDkrKrVjQzbv QO9qbhJ/CEyhU2bNIiAp/eITvI5Dr+S9Yj3uChZ0uJ0SYyYSF/YHq8fA8MsNwGBV gu4OIzi815wN5k+WBxhq22DFPuoDBxFmnmw/mJCXln8UG0VRdJbFvKzj5mJsj7ZZ VC1jpqT80ASM+2Cp7++uPK6rjOKr5wIYu1mxOYfyqCl3bXCd3CPQ3gHtZHi5EORn XYG2Mf/twoUjnFOIMhMz5jL438qvyA2aOyvZziDrai+fWB5VvF0= =58kA -----END PGP SIGNATURE----- --nextPart3246356.PYKUYFuaPT--