Message-ID: <68705cec43ab43088327b21c9d9318036fb2a255.camel@linux.intel.com>
Subject: Re: [PATCH 1/2] mm/shmem: add shmem_insert_folio()
From: Thomas Hellström
To: Christian König, "David Hildenbrand (Arm)", intel-xe@lists.freedesktop.org
Cc: Andrew Morton, Lorenzo Stoakes, "Liam R.
Howlett" , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Hugh Dickins , Baolin Wang , Brendan Jackman , Johannes Weiner , Zi Yan , Huang Rui , Matthew Auld , Matthew Brost , Maarten Lankhorst , Maxime Ripard , Thomas Zimmermann , David Airlie , Simona Vetter , dri-devel@lists.freedesktop.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Date: Wed, 13 May 2026 10:31:47 +0200 In-Reply-To: References: <20260512110339.6244-1-thomas.hellstrom@linux.intel.com> <20260512110339.6244-2-thomas.hellstrom@linux.intel.com> <26479389-459b-4cc4-914d-e7d29d5e5cc9@kernel.org> Organization: Intel Sweden AB, Registration Number: 556189-6027 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.58.3 (3.58.3-1.fc43) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Hi, David & Christian, Thanks for the feedback. I'll respin this to see if I can come up with something that is more acceptable given the comments. A couple of questions and comments below. On Wed, 2026-05-13 at 09:47 +0200, Christian K=C3=B6nig wrote: > Hi David & Thomas, >=20 > On 5/12/26 22:03, David Hildenbrand (Arm) wrote: > > On 5/12/26 13:31, Thomas Hellstr=C3=B6m wrote: > ... > > > > > +int shmem_insert_folio(struct file *file, struct folio > > > > > *folio, > > > > > unsigned int order, > > > > > + =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff_t index, bool write= back, gfp_t > > > > > folio_gfp) > > > > > +{ > > > > > + struct address_space *mapping =3D file->f_mapping; > > > > > + struct inode *inode =3D mapping->host; > > > > > + bool promoted; > > > > > + long nr_pages; > > > > > + int ret; > > > > > + > > > > > + promoted =3D order > 0 && !folio_test_large(folio); > > > > > + if (promoted) > > > > > + prep_compound_page(&folio->page, order); > > > > > + nr_pages =3D folio_nr_pages(folio); > > > > > + > > > > > + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); > > > > > + VM_BUG_ON_FOLIO(folio_mapped(folio), folio); > > > > > + VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio); > > > > > + VM_BUG_ON_FOLIO(folio->mapping, folio); > > > > > + VM_BUG_ON(index !=3D round_down(index, nr_pages)); > > > >=20 > > > > No new VM_BUG_ON_FOLIO etc. > > >=20 > > > OK, can eliminate those. Is VM_WARN_ON_FOLIO() preferred, > > > or any other type of assert? > >=20 > > VM_WARN_ON_FOLIO() is usually what you want, or VM_WARN_ON_ONCE(). > >=20 > > >=20 > > > >=20 > > > > But in general, pushing in random allocated pages into shmem, > > > > converting them to > > > > folios is not something I particularly enjoy seeing. > > > >=20 > > >=20 > > > OK, let me understand the concern. The pages are allocated as > > > multi- > > > page folios using alloc_pages(gfp, order), but typically not > > > promoted > > > to compound pages, until inserted here. Is it that promotion that > > > is of > > > concern or inserting pages of unknown origin into shmem? Anything > > > we > > > can do to alleviate that concern? > >=20 > > It's all rather questionable. > >=20 > > A couple of points: > >=20 > > a) The pages are allocated to be unmovable, but adding them to > > shmem effectively > > =C2=A0=C2=A0 turns them movable. Now you interfere with the page alloca= tor > > logic of > > =C2=A0=C2=A0 placing movable and unmovable pages a reasonable way into > > =C2=A0=C2=A0 pageblocks that group allocations of similar types. 
> > 
> > b) A driver is not supposed to decide which folio size will be
> >    allocated for shmem.
> 
> Exactly that is one of the major reasons why we aren't using shmem as
> backing store for TTM buffers in the first place.
> 
> While HW today can usually work with everything down to 4k, it needs
> higher-order pages for optimal performance.
> 
> For example, for AMD GPUs you need 2M pages, or otherwise performance
> goes down by ~30% in quite a number of use cases.
> 
> Everything between 4k and 2M, and above 2M, is still preferred because
> it results in better L0/L1 reach, but if you can't get 2M the L2 reach
> goes down so rapidly that people start to complain immediately.
> 
> And that stuff is very specific to each vendor and HW generation. Some
> have the sweet spot at 64k, some at 256k, most at 2M.
> 
> >    I am not even sure if there is a fencing on
> >    CONFIG_TRANSPARENT_HUGEPAGE somewhere when ending up with large
> >    folios. order > PMD_ORDER is currently essentially unsupported,
> >    and I suspect your code would even allow for that (looking at
> >    ttm_pool_alloc_find_order).
> > 
> >    We also have some problems with the pagecache not actually
> >    supporting all MAX_PAGE_ORDER orders (see MAX_PAGECACHE_ORDER).
> > 
> >    You are bypassing the shmem logic to decide on that completely.
> > 
> >    While these things might not actually cause harm for you today
> >    (although I suspect some of them might in the shmem swapout code),
> >    we don't want drivers to make our life harder by doing completely
> >    unexpected things.
> 
> Yeah, but that is the requirement the HW has.
> 
> I mean, we can keep torturing the buddy allocator to give us 2M pages,
> but essentially we want to get away from those specialized solutions
> and have more of the functionality necessary to drive the HW in the
> common Linux memory management code, because that prevents vendors
> from re-implementing that stuff in their specific drivers over and
> over again.

For the code at hand, if we insert an order-10 folio, shmem will split
it at writeout time but spit out a warning (if enabled) at the same
time. For this particular use-case, I think it might make sense for the
drivers that use direct insertion to cap the page-allocator orders to
THP size (2M).

> Regards,
> Christian.
> 
> > c) You pass folio + order, which is just the red flag that you are
> >    doing something extremely dodgy.
> > 
> >    You just cast something that is not a folio, and was not allocated
> >    to be a folio, to a folio through page_folio(page). That will stop
> >    working completely in the future once we decouple struct page from
> >    struct folio.
> > 
> >    If it's not a folio with a properly set order, you should be
> >    passing page + order.
> > 
> > d) We are once more open-coding the creation of a folio, by
> >    hand-crafting it ourselves.
> > 
> >    We have folio_alloc() and friends for a reason. Where we, for
> >    example, do a page_rmappable_folio().
> > 
> >    I am pretty sure that you are missing a call to
> >    page_rmappable_folio(), resulting in the large folios not getting
> >    folio_set_large_rmappable() set.
> > 
> > e) undo_compound_page(). No words.
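
On c) and d): understood. If the driver moves to allocating real folios
up front, as you suggest below, combined with the order cap I mentioned
above, the allocation side would become something like this (untested
sketch; "order" and "gfp" stand in for what the TTM pool computes
today):

	/* Cap at PMD order (2M with 4k pages) per the THP-size cap
	 * discussed above. */
	order = min_t(unsigned int, order, PMD_ORDER);

	/*
	 * folio_alloc() hands back a properly prepped compound page and
	 * does the page_rmappable_folio() dance internally, so the
	 * driver no longer needs prep_compound_page() or a
	 * page_folio() cast.
	 */
	folio = folio_alloc(gfp, order);
	if (!folio)
		return -ENOMEM;	/* or retry with a lower order */
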
> > 
> > *maybe* it would be a little less bad if you would just allocate a
> > compound page in your driver and use page_rmappable_folio() in
> > there.

OK, yes, it sounds like a prereq for this is that the driver actually
allocates compound pages. It might be that the TTM comment about *not*
doing that is stale, but I need to check.

Would it be acceptable to export a function from core mm to split an
isolated folio?

> > That wouldn't change a) or b), though.
> > 
> > > Given the problem statement in the cover-letter, would there be a
> > > better direction to take here? We could, for example, bypass shmem
> > > and insert the folios directly into the swap-cache (although there
> > > is an issue with the swap-cache when the number of swap entries is
> > > close to being depleted).
> > 
> > Good question.
> > We'd have to keep swapoff and all of that working. For example, in
> > try_to_unuse(), we special-case shmem_unuse() to handle
> > non-anonymous pages.
> > 
> > But then, the whole swapcache operates on folios ... so I am not
> > sure if there is a lot to be won by re-implementing what shmem
> > already does?

Still, that would alleviate a) and b), right? At least as long as we
keep folio sizes within the swap-cache limits?

Thanks,
Thomas