From: Oleksandr Natalenko <oleksandr@natalenko.name>
To: David Rientjes, Damien Le Moal
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, Jens Axboe,
 John Garry, Christoph Hellwig, "Martin K. Petersen", linux-mm@kvack.org,
 Lorenzo Stoakes, Shakeel Butt, Qi Zheng, Michal Hocko, David Hildenbrand,
 Johannes Weiner, Andrew Morton
Subject: Re: [REGRESSION][BISECTED] Unexpected OOM instead of reclaiming inactive file pages
Date: Tue, 12 Aug 2025 09:37:22 +0200
Message-ID: <3553647.iZASKD2KPV@natalenko.name>
In-Reply-To: <33b6c9a3-3165-4ce8-9667-afdbaff2c3ae@kernel.org>
References: <5905724.LvFx2qVVIh@natalenko.name> <15056829.uLZWGnKmhe@natalenko.name> <33b6c9a3-3165-4ce8-9667-afdbaff2c3ae@kernel.org>

Hello.

On Tuesday, 12 August 2025, 02:45:02 CEST, Damien Le Moal wrote:
> On 8/12/25 5:42 AM, Oleksandr Natalenko wrote:
> > On Monday, 11 August 2025, 18:06:16 CEST, David Rientjes wrote:
> >> On Mon, 11 Aug 2025, Oleksandr Natalenko wrote:
> >>> I'm fairly confident that the following commit
> >>>
> >>> 459779d04ae8d ("block: Improve read ahead size for rotational devices")
> >>>
> >>> caused a regression in my test bench.
> >>>
> >>> I'm running v6.17-rc1 in a small QEMU VM with a virtio-scsi disk. It has
> >>> 1 GiB of RAM, so I can easily saturate it and make the reclaim mechanism
> >>> kick in.
> >>>
> >>> If MGLRU is enabled:
> >>>
> >>> $ echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> >>>
> >>> then, once the page cache builds up, an OOM kill happens without reclaiming
> >>> inactive file pages: [1]. Note inactive_file:506952kB; I'd expect those pages
> >>> to be reclaimed instead, as happens with v6.16.
> >>>
> >>> If MGLRU is disabled:
> >>>
> >>> $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms
> >>>
> >>> then the OOM doesn't occur, and things work as usual.
> >>>
> >>> If MGLRU is enabled and 459779d04ae8d is reverted on top of v6.17-rc1,
> >>> the OOM doesn't happen either.
> >>>
> >>> Could you please check this?
> >>
> >> This looks to be an MGLRU policy decision rather than a readahead
> >> regression, correct?
> >>
> >> Mem-Info:
> >> active_anon:388 inactive_anon:5382 isolated_anon:0
> >>  active_file:9638 inactive_file:126738 isolated_file:0
> >>
> >> Setting min_ttl_ms to 1000 preserves the working set, and triggering the
> >> OOM kill is the only remaining way to free memory in that configuration.
> >> The OOM kill is being triggered by kswapd for this purpose.
> >>
> >> So additional readahead would certainly increase that working set. This
> >> looks to be working as intended.
> >
> > OK, this makes sense indeed, thanks for the explanation. But is the
> > inactive_file explosion expected and justified?
> >
> > Without the revert:
> >
> > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
> > 3
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         179         536           3          57         510
> > Swap:           1379          12        1367
> > /* OOM happens here */
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         177          52           3         561         513
> > Swap:           1379          17        1362
> >
> > With the revert:
> >
> > $ echo 3 | sudo tee /proc/sys/vm/drop_caches; free -m; sudo journalctl -kb >/dev/null; free -m
> > 3
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         214         498           4          64         476
> > Swap:           1379           0        1379
> > /* no OOM */
> >                total        used        free      shared  buff/cache   available
> > Mem:             690         209         462           4         119         481
> > Swap:           1379           0        1379
> >
> > The journal folder size is:
> >
> > $ sudo du -hs /var/log/journal
> > 575M	/var/log/journal
> >
> > It looks like this readahead change causes far more data to be read than
> > is actually needed?
>
> For your drive as seen by the VM, what is the value of
> /sys/block/sdX/queue/optimal_io_size ?
>
> I guess it is "0", as I see on my VM.

Yes, it's 0.

> So before 459779d04ae8d, the block device read_ahead_kb was only 128 KB, and
> 459779d04ae8d switched it to be 2 times max_sectors_kb, so 8 MB. This change
> significantly improves buffered file read performance on HDDs, and HDDs only.

Right, max_sectors_kb is 4096.
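
For completeness, this is roughly how I dump the relevant queue limits on the
guest (sdX is a placeholder for the virtio-scsi disk):

$ grep -H . /sys/block/sdX/queue/{rotational,optimal_io_size,max_sectors_kb,read_ahead_kb}

which here shows rotational:1, optimal_io_size:0 and max_sectors_kb:4096; on
v6.17-rc1, read_ahead_kb should then be 8192, i.e. the 2 * max_sectors_kb you
describe. If I understand the change correctly, pinning the readahead back
manually, e.g.:

$ echo 128 | sudo tee /sys/block/sdX/queue/read_ahead_kb

should also restore the pre-459779d04ae8d behaviour without reverting anything.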

> This means that your VM device is probably being reported as a rotational one
> (/sys/block/sdX/queue/rotational is 1), which is normal if you attached an
> actual HDD. If you are using a qcow2 image for that disk, then having
> rotational==1 is questionable...

Yes, it's reported as rotational by default.

I've just set -device scsi-hd,drive=hd1,rotation_rate=1, so the guest will see
the drive as non-rotational from now on, which brings the old behaviour back.
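
For reference, the relevant bit of the QEMU command line now looks roughly like
this (the controller id, bus and image path are placeholders from my setup):

-device virtio-scsi-pci,id=scsi0 \
-drive file=/path/to/disk1.qcow2,if=none,id=hd1 \
-device scsi-hd,drive=hd1,bus=scsi0.0,rotation_rate=1

As I understand it, rotation_rate=1 makes the guest report
/sys/block/sdX/queue/rotational as 0, so the rotational-only readahead sizing
from 459779d04ae8d no longer applies and read_ahead_kb stays at its old default.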

> The other issue is the device driver for the device reporting 0 for the optimal
> IO size, which normally happens only for SATA drives. I see the same with
> virtio-scsi, which is also questionable given that the maximum IO size with it
> is fairly limited. So virtio-scsi may need some tweaking.
>
> The other thing to question, I think, is setting read_ahead_kb using the
> optimal_io_size limit (io_opt), which can be *very large*. For most SCSI
> devices it is 16 MB, so you will see a read_ahead_kb of 32 MB. But for SCSI
> devices, optimal_io_size indicates a *maximum* IO size beyond which performance
> may degrade, so using a value lower than this, but still reasonably large,
> would generally be better, I think. Note that lim->io_opt for RAID arrays
> actually indicates the stripe size, so it is generally a lot smaller than the
> component drives' io_opt. This use also changes the meaning of that queue
> limit, which makes things even more confusing and finding an adequate default
> harder.

Thank you for the explanation.

-- 
Oleksandr Natalenko, MSE