Date: Tue, 16 Jul 2024 12:14:39 +0100
From: Ryan Roberts
To: David Hildenbrand, Lance Yang, Baolin Wang
Cc: Andrew Morton, Hugh Dickins, Jonathan Corbet, "Matthew Wilcox (Oracle)",
 Barry Song, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations
In-Reply-To: <5c58d9ea-8490-4ae6-b7bf-be816dab3356@redhat.com>
References: <20240711072929.3590000-1-ryan.roberts@arm.com>
 <20240711072929.3590000-3-ryan.roberts@arm.com>
 <9e0d84e5-2319-4425-9760-2c6bb23fc390@linux.alibaba.com>
 <756c359e-bb8f-481e-a33f-163c729afa31@redhat.com>
 <8c32a2fc-252d-406b-9fec-ce5bab0829df@arm.com>
 <5c58d9ea-8490-4ae6-b7bf-be816dab3356@redhat.com>

On 16/07/2024 11:19, David Hildenbrand wrote:
> On 16.07.24 10:31, Ryan Roberts wrote:
>> On 13/07/2024 11:45, Ryan Roberts wrote:
>>> On 13/07/2024 02:08, David Hildenbrand wrote:
>>>> On 12.07.24 14:22, Lance Yang wrote:
>>>>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang wrote:
>>>>>>
>>>>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>>>>> Expose 3 new mTHP stats for file (pagecache) folio allocations:
>>>>>>>
>>>>>>>      /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>>>>>>>      /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>>>>>>>      /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
>>>>>>>
>>>>>>> This will provide some insight on the sizes of large folios being
>>>>>>> allocated for file-backed memory, and how often allocation is failing.
>>>>>>>
>>>>>>> All non-order-0 (and most order-0) folio allocations are currently done
>>>>>>> through filemap_alloc_folio(), and folios are charged in a subsequent
>>>>>>> call to filemap_add_folio(). So count file_fallback when allocation
>>>>>>> fails in filemap_alloc_folio() and count file_alloc or
>>>>>>> file_fallback_charge in filemap_add_folio(), based on whether charging
>>>>>>> succeeded or not. There are some users of filemap_add_folio() that
>>>>>>> allocate their own order-0 folio by other means, so we would not count
>>>>>>> an allocation failure in this case, but we also don't care about order-0
>>>>>>> allocations. This approach feels like it should be good enough and
>>>>>>> doesn't require any (impractically large) refactoring.
>>>>>>>
>>>>>>> The existing mTHP stats interface is reused to provide consistency to
>>>>>>> users. And because we are reusing the same interface, we can reuse the
>>>>>>> same infrastructure on the kernel side. The one small wrinkle is that
>>>>>>> the set of folio sizes supported by the pagecache are not identical to
>>>>>>> those supported by anon and shmem; pagecache supports order-1, unlike
>>>>>>> anon and shmem, and the max pagecache order may be less than PMD-size
>>>>>>> (see arm64 with 64K base pages), again unlike anon and shmem. So we now
>>>>>>> create a hugepages-*kB directory for the union of the sizes supported by
>>>>>>> all 3 memory types and populate it with the relevant stats and controls.
>>>>>>
>>>>>> Personally, I like the idea that can help analyze the allocation of
>>>>>> large folios for the page cache.
>>>>>>
>>>>>> However, I have a slight concern about the consistency of the interface.
>>>>>>
>>>>>> For 64K, the fields layout:
>>>>>> ├── hugepages-64kB
>>>>>> │   ├── enabled
>>>>>> │   ├── shmem_enabled
>>>>>> │   └── stats
>>>>>> │       ├── anon_fault_alloc
>>>>>> │       ├── anon_fault_fallback
>>>>>> │       ├── anon_fault_fallback_charge
>>>>>> │       ├── file_alloc
>>>>>> │       ├── file_fallback
>>>>>> │       ├── file_fallback_charge
>>>>>> │       ├── shmem_alloc
>>>>>> │       ├── shmem_fallback
>>>>>> │       ├── shmem_fallback_charge
>>>>>> │       ├── split
>>>>>> │       ├── split_deferred
>>>>>> │       ├── split_failed
>>>>>> │       ├── swpout
>>>>>> │       └── swpout_fallback
>>>>>>
>>>>>> But for 8K (for pagecache), you removed some fields (of course, I
>>>>>> understand why they are not supported).
>>>>>>
>>>>>> ├── hugepages-8kB
>>>>>> │   └── stats
>>>>>> │       ├── file_alloc
>>>>>> │       ├── file_fallback
>>>>>> │       └── file_fallback_charge
>>>>>>
>>>>>> This might not be user-friendly for some user-space parsing tools, as
>>>>>> they lack certain fields for the same pattern interfaces. Of course,
>>>>>> this might not be an issue if we have clear documentation describing the
>>>>>> differences here:)
>>>>>>
>>>>>> Another possible approach is to maintain the same field layout to keep
>>>>>> consistent, but prohibit writing to the fields that are not supported by
>>>>>> the pagecache, and any stats read from them would be 0.
>>>>>
>>>>> I agree that maintaining a uniform field layout, especially at the stats
>>>>> level, might be necessary ;)
>>>>>
>>>>> Keeping a consistent interface could future-proof the design. It allows
>>>>> for the possibility that features not currently supported for 8kB pages
>>>>> might be enabled in the future.
>>>>
>>>> I'll just note that, with shmem/file effectively being disabled for order > 11,
>>>> we'll also have entries there that are effectively unused.
>>>
>>> Indeed, I mentioned that in the commit log :)
>
> Well, I think it's more extreme than what you mentioned.
>
> For example, shmem_enabled on arm64 with 64k is now effectively non-functional.
> Just like it will be for other orders in the anon-shmem case when the order
> exceeds MAX_PAGECACHE_ORDER.

Ahh I see what you are saying now; we already have precedent for non-functional
controls. (Actually, looking at the code, it looks like the shmem stats will be
unconditionally exposed, but the shmem controls will only be exposed when
CONFIG_SHMEM is enabled. I guess that should be fixed - I'll post a patch).

>
>>>
>>>>
>>>> Good question how we want to deal with that (stats are easy, but what about
>>>> when we enable something? Maybe we should document that "enabled" is only
>>>> effective when supported).
>>>
>>> The documentation already says "If enabling multiple hugepage sizes, the kernel
>>> will select the most appropriate enabled size for a given allocation." for anon
>>> THP (and I've added similar wording for my as-yet-unposted patch to add controls
>>> for page cache folio sizes). So I think we could easily add dummy *enabled
>>> controls for all sizes, that can be written to and read back consistently, but
>>> the kernel just ignores them when deciding what size to use. It would also
>>> simplify the code that populates the controls.
>>>
>>> Personally though, I'm not convinced of the value of trying to make the controls
>>> for every size look identical. What's the real value to the user to pretend that
>>> they can select a size that they cannot? What happens when we inevitably want to
>>> add some new control in future which only applies to select sizes and there is
>>> no good way to fake it for the other sizes? Why can't user space just be
>>> expected to rely on the existence of the files rather than on the existence of
>>> the directories?
>>>
>>> As always, I'll go with the majority, but just wanted to register my opinion.
>>
>> Should I assume from the lack of reply on this that everyone else is in favour
>> of adding dummy controls so that all sizes have the same set of controls? If I
>> don't hear anything further, I'll post v2 with dummy controls today or tomorrow.
>
> Sorry, busy with other stuff.
>
> Indicating only what really exists sounds cleaner. But I wonder how we would
> want to handle in general orders that are effectively non-existent?

I'm not following your distinction between orders that don't "really exist" and
orders that are "effectively non-existent".

I guess the real supported orders are:

  anon:
    min order: 2
    max order: PMD_ORDER
  anon-shmem:
    min order: 1
    max order: MAX_PAGECACHE_ORDER
  tmpfs-shmem:
    min order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
    max order: PMD_ORDER <= 11 ? PMD_ORDER : NONE
  file:
    min order: 1
    max order: MAX_PAGECACHE_ORDER

But today, controls and stats are exposed for:

  anon:
    min order: 2
    max order: PMD_ORDER
  anon-shmem:
    min order: 2
    max order: PMD_ORDER
  tmpfs-shmem:
    min order: PMD_ORDER
    max order: PMD_ORDER
  file:
    min order: Nothing yet (this patch proposes 1)
    max order: Nothing yet (this patch proposes MAX_PAGECACHE_ORDER)

So I think there is definitely a bug for shmem where the minimum order control
should be order-1 but it's currently order-2.

I also wonder about PUD-order for DAX? We don't currently have a stat/control.
If we wanted to add it in future, if we take the "expose all stats/controls for
all orders" approach, we would end up extending all the way to PUD-order and
all the orders between PMD and PUD would be dummy for all memory types. That
really starts to feel odd, so I still favour only populating what's really
supported.

I propose to fix shmem (extend down to 1, stop at MAX_PAGECACHE_ORDER) and
continue with the approach of "indicating only what really exists" for v2.

Shout if you disagree.

Thanks,
Ryan
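
For illustration only, the "real supported orders" list in this mail can be
written out as a small Python sketch that shows which hugepages-*kB directories
would be populated, and for which memory types, under the "only what really
exists" approach. This is not kernel code: the function names
(supported_orders, populated_directories) and the example inputs are
assumptions made for this sketch. The min/max rules are transcribed from the
list in the mail, and the inputs in the usage note (64K base pages,
PMD_ORDER = 13, MAX_PAGECACHE_ORDER = 11) are example values meant to resemble
the arm64 64K case mentioned in the thread.

```python
from typing import Dict, List, Optional, Tuple

# (min_order, max_order), inclusive; None means no order is supported.
OrderRange = Optional[Tuple[int, int]]


def supported_orders(pmd_order: int, max_pagecache_order: int) -> Dict[str, OrderRange]:
    """Per-memory-type order ranges, transcribed from the list in the mail."""
    # tmpfs-shmem: PMD_ORDER <= 11 ? PMD_ORDER : NONE
    tmpfs = (pmd_order, pmd_order) if pmd_order <= 11 else None
    return {
        "anon": (2, pmd_order),
        "anon-shmem": (1, max_pagecache_order),
        "tmpfs-shmem": tmpfs,
        "file": (1, max_pagecache_order),
    }


def populated_directories(ranges: Dict[str, OrderRange],
                          base_page_kb: int) -> Dict[str, List[str]]:
    """Map each hugepages-<size>kB directory to the memory types supporting it.

    Only orders in the union of the per-type ranges get a directory, and each
    directory lists only the types that support that order.
    """
    by_order: Dict[int, List[str]] = {}
    for mem_type, rng in ranges.items():
        if rng is None:
            continue
        lo, hi = rng
        for order in range(lo, hi + 1):
            by_order.setdefault(order, []).append(mem_type)
    # Folio size in kB is the base page size shifted left by the order.
    return {f"hugepages-{base_page_kb << order}kB": by_order[order]
            for order in sorted(by_order)}


if __name__ == "__main__":
    # Hypothetical inputs resembling arm64 with 64K base pages.
    layout = populated_directories(supported_orders(13, 11), base_page_kb=64)
    for name, types in layout.items():
        print(name, ", ".join(types))
```

With those example inputs, the orders above MAX_PAGECACHE_ORDER end up listing
only anon and the order-1 directory lists only anon-shmem and file, so a
user-space tool would key off the presence of each file rather than assume
that every directory carries the same set.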