[PATCH v2 1/6] Documentation: filesystems: proc: update meminfo section
From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2022-05-10 15:28 UTC
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Seth Jennings,
	Dan Streetman, Minchan Kim, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Add new entries. Minor corrections and cleanups.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/filesystems/proc.rst | 155 ++++++++++++++++++-----------
1 file changed, 99 insertions(+), 56 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 061744c436d9..736ed384750c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -942,56 +942,71 @@ can be substantial. In many cases there are other means to find out
additional memory using subsystem specific interfaces, for instance
/proc/net/sockstat for TCP memory allocations.
-The following is from a 16GB PIII, which has highmem enabled.
-You may not have all of these fields.
+Example output. You may not have all of these fields.
::
> cat /proc/meminfo
- MemTotal: 16344972 kB
- MemFree: 13634064 kB
- MemAvailable: 14836172 kB
- Buffers: 3656 kB
- Cached: 1195708 kB
- SwapCached: 0 kB
- Active: 891636 kB
- Inactive: 1077224 kB
- HighTotal: 15597528 kB
- HighFree: 13629632 kB
- LowTotal: 747444 kB
- LowFree: 4432 kB
- SwapTotal: 0 kB
- SwapFree: 0 kB
- Dirty: 968 kB
- Writeback: 0 kB
- AnonPages: 861800 kB
- Mapped: 280372 kB
- Shmem: 644 kB
- KReclaimable: 168048 kB
- Slab: 284364 kB
- SReclaimable: 159856 kB
- SUnreclaim: 124508 kB
- PageTables: 24448 kB
- NFS_Unstable: 0 kB
- Bounce: 0 kB
- WritebackTmp: 0 kB
- CommitLimit: 7669796 kB
- Committed_AS: 100056 kB
- VmallocTotal: 112216 kB
- VmallocUsed: 428 kB
- VmallocChunk: 111088 kB
- Percpu: 62080 kB
- HardwareCorrupted: 0 kB
- AnonHugePages: 49152 kB
- ShmemHugePages: 0 kB
- ShmemPmdMapped: 0 kB
+ MemTotal: 32858820 kB
+ MemFree: 21001236 kB
+ MemAvailable: 27214312 kB
+ Buffers: 581092 kB
+ Cached: 5587612 kB
+ SwapCached: 0 kB
+ Active: 3237152 kB
+ Inactive: 7586256 kB
+ Active(anon): 94064 kB
+ Inactive(anon): 4570616 kB
+ Active(file): 3143088 kB
+ Inactive(file): 3015640 kB
+ Unevictable: 0 kB
+ Mlocked: 0 kB
+ SwapTotal: 0 kB
+ SwapFree: 0 kB
+ Dirty: 12 kB
+ Writeback: 0 kB
+ AnonPages: 4654780 kB
+ Mapped: 266244 kB
+ Shmem: 9976 kB
+ KReclaimable: 517708 kB
+ Slab: 660044 kB
+ SReclaimable: 517708 kB
+ SUnreclaim: 142336 kB
+ KernelStack: 11168 kB
+ PageTables: 20540 kB
+ NFS_Unstable: 0 kB
+ Bounce: 0 kB
+ WritebackTmp: 0 kB
+ CommitLimit: 16429408 kB
+ Committed_AS: 7715148 kB
+ VmallocTotal: 34359738367 kB
+ VmallocUsed: 40444 kB
+ VmallocChunk: 0 kB
+ Percpu: 29312 kB
+ HardwareCorrupted: 0 kB
+ AnonHugePages: 4149248 kB
+ ShmemHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
+ FileHugePages: 0 kB
+ FilePmdMapped: 0 kB
+ CmaTotal: 0 kB
+ CmaFree: 0 kB
+ HugePages_Total: 0
+ HugePages_Free: 0
+ HugePages_Rsvd: 0
+ HugePages_Surp: 0
+ Hugepagesize: 2048 kB
+ Hugetlb: 0 kB
+ DirectMap4k: 401152 kB
+ DirectMap2M: 10008576 kB
+ DirectMap1G: 24117248 kB
MemTotal
Total usable RAM (i.e. physical RAM minus a few reserved
bits and the kernel binary code)
MemFree
- The sum of LowFree+HighFree
+ Total free RAM. On highmem systems, the sum of LowFree+HighFree
MemAvailable
An estimate of how much memory is available for starting new
applications, without swapping. Calculated from MemFree,
@@ -1005,8 +1020,9 @@ Buffers
Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached
- in-memory cache for files read from the disk (the
- pagecache). Doesn't include SwapCached
+ In-memory cache for files read from the disk (the
+ pagecache) as well as tmpfs & shmem.
+ Doesn't include SwapCached.
SwapCached
Memory that once was swapped out, is swapped back in but
still also is in the swapfile (if memory is needed it
@@ -1018,6 +1034,11 @@ Active
Inactive
Memory which has been less recently used. It is more
eligible to be reclaimed for other purposes
+Unevictable
+ Memory that cannot be reclaimed, such as mlocked pages,
+ ramfs backing pages, secret memfd pages etc.
+Mlocked
+ Memory locked with mlock().
HighTotal, HighFree
Highmem is all memory above ~860MB of physical memory.
Highmem areas are for use by userspace programs, or
@@ -1040,20 +1061,10 @@ Writeback
Memory which is actively being written back to the disk
AnonPages
Non-file backed pages mapped into userspace page tables
-HardwareCorrupted
- The amount of RAM/memory in KB, the kernel identifies as
- corrupted.
-AnonHugePages
- Non-file backed huge pages mapped into userspace page tables
Mapped
files which have been mmaped, such as libraries
Shmem
Total memory used by shared memory (shmem) and tmpfs
-ShmemHugePages
- Memory used by shared memory (shmem) and tmpfs allocated
- with huge pages
-ShmemPmdMapped
- Shared memory mapped into userspace with huge pages
KReclaimable
Kernel allocations that the kernel will attempt to reclaim
under memory pressure. Includes SReclaimable (below), and other
@@ -1064,9 +1075,10 @@ SReclaimable
Part of Slab, that might be reclaimed, such as caches
SUnreclaim
Part of Slab, that cannot be reclaimed on memory pressure
+KernelStack
+ Memory consumed by the kernel stacks of all tasks
PageTables
- amount of memory dedicated to the lowest level of page
- tables.
+ Memory consumed by userspace page tables
NFS_Unstable
Always zero. Previous counted pages which had been written to
the server, but has not been committed to stable storage.
@@ -1098,7 +1110,7 @@ Committed_AS
has been allocated by processes, even if it has not been
"used" by them as of yet. A process which malloc()'s 1G
of memory, but only touches 300M of it will show up as
- using 1G. This 1G is memory which has been "committed" to
+ using 1G. This 1G is memory which has been "committed" to
by the VM and can be used at any time by the allocating
application. With strict overcommit enabled on the system
(mode 2 in 'vm.overcommit_memory'), allocations which would
@@ -1107,7 +1119,7 @@ Committed_AS
not fail due to lack of memory once that memory has been
successfully allocated.
VmallocTotal
- total size of vmalloc memory area
+ total size of vmalloc virtual address space
VmallocUsed
amount of vmalloc area which is used
VmallocChunk
@@ -1115,6 +1127,37 @@ VmallocChunk
Percpu
Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
+HardwareCorrupted
+ The amount of RAM/memory in KB, the kernel identifies as
+ corrupted.
+AnonHugePages
+ Non-file backed huge pages mapped into userspace page tables
+ShmemHugePages
+ Memory used by shared memory (shmem) and tmpfs allocated
+ with huge pages
+ShmemPmdMapped
+ Shared memory mapped into userspace with huge pages
+FileHugePages
+ Memory used for filesystem data (page cache) allocated
+ with huge pages
+FilePmdMapped
+ Page cache mapped into userspace with huge pages
+CmaTotal
+ Memory reserved for the Contiguous Memory Allocator (CMA)
+CmaFree
+ Free remaining memory in the CMA reserves
+HugePages_Total
+HugePages_Free
+HugePages_Rsvd
+HugePages_Surp
+Hugepagesize
+Hugetlb
+ See Documentation/admin-guide/mm/hugetlbpage.rst.
+DirectMap4k
+DirectMap2M
+DirectMap1G
+ Breakdown of page table sizes used in the kernel's
+ identity mapping of RAM
vmallocinfo
~~~~~~~~~~~
--
2.35.3
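A side note on consuming these fields: each entry is a plain "Name: value kB" pair, so the documented fields are easy to read programmatically. Below is a minimal userspace sketch; it is an editorial illustration, not part of the patch, and the meminfo_kb() helper name is made up.

/* Sketch: pull one field out of /proc/meminfo.
 * Lines look like "MemAvailable:   27214312 kB"; the HugePages_*
 * counters carry no "kB" suffix, but the number parses the same way. */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *field)
{
	char line[256];
	size_t len = strlen(field);
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, field, len) && line[len] == ':') {
			sscanf(line + len + 1, " %ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	printf("MemTotal:     %ld kB\n", meminfo_kb("MemTotal"));
	printf("MemAvailable: %ld kB\n", meminfo_kb("MemAvailable"));
	return 0;
}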
[PATCH v2 2/6] mm: Kconfig: move swap and slab config options to the MM section
From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2022-05-10 15:28 UTC
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Seth Jennings,
	Dan Streetman, Minchan Kim, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
These are currently under General Setup. MM seems like a better fit.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
init/Kconfig | 123 ---------------------------------------------------
mm/Kconfig | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 123 insertions(+), 123 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index 4489416f1e5c..468fe27cec0b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,23 +375,6 @@ config DEFAULT_HOSTNAME
but you may wish to use a different default here to make a minimal
system more usable with less configuration.
-#
-# For some reason microblaze and nios2 hard code SWAP=n. Hopefully we can
-# add proper SWAP support to them, in which case this can be removed.
-#
-config ARCH_NO_SWAP
- bool
-
-config SWAP
- bool "Support for paging of anonymous memory (swap)"
- depends on MMU && BLOCK && !ARCH_NO_SWAP
- default y
- help
- This option allows you to choose whether you want to have support
- for so called swap devices or swap files in your kernel that are
- used to provide more virtual memory than the actual RAM present
- in your computer. If unsure say Y.
-
config SYSVIPC
bool "System V IPC"
help
@@ -1909,112 +1892,6 @@ config COMPAT_BRK
On non-ancient distros (post-2000 ones) N is usually a safe choice.
-choice
- prompt "Choose SLAB allocator"
- default SLUB
- help
- This option allows you to select a slab allocator.
-
-config SLAB
- bool "SLAB"
- depends on !PREEMPT_RT
- select HAVE_HARDENED_USERCOPY_ALLOCATOR
- help
- The regular slab allocator that is established and known to work
- well in all environments. It organizes cache hot objects in
- per cpu and per node queues.
-
-config SLUB
- bool "SLUB (Unqueued Allocator)"
- select HAVE_HARDENED_USERCOPY_ALLOCATOR
- help
- SLUB is a slab allocator that minimizes cache line usage
- instead of managing queues of cached objects (SLAB approach).
- Per cpu caching is realized using slabs of objects instead
- of queues of objects. SLUB can use memory efficiently
- and has enhanced diagnostics. SLUB is the default choice for
- a slab allocator.
-
-config SLOB
- depends on EXPERT
- bool "SLOB (Simple Allocator)"
- depends on !PREEMPT_RT
- help
- SLOB replaces the stock allocator with a drastically simpler
- allocator. SLOB is generally more space efficient but
- does not perform as well on large systems.
-
-endchoice
-
-config SLAB_MERGE_DEFAULT
- bool "Allow slab caches to be merged"
- default y
- depends on SLAB || SLUB
- help
- For reduced kernel memory fragmentation, slab caches can be
- merged when they share the same size and other characteristics.
- This carries a risk of kernel heap overflows being able to
- overwrite objects from merged caches (and more easily control
- cache layout), which makes such heap attacks easier to exploit
- by attackers. By keeping caches unmerged, these kinds of exploits
- can usually only damage objects in the same cache. To disable
- merging at runtime, "slab_nomerge" can be passed on the kernel
- command line.
-
-config SLAB_FREELIST_RANDOM
- bool "Randomize slab freelist"
- depends on SLAB || SLUB
- help
- Randomizes the freelist order used on creating new pages. This
- security feature reduces the predictability of the kernel slab
- allocator against heap overflows.
-
-config SLAB_FREELIST_HARDENED
- bool "Harden slab freelist metadata"
- depends on SLAB || SLUB
- help
- Many kernel heap attacks try to target slab cache metadata and
- other infrastructure. This option makes minor performance
- sacrifices to harden the kernel slab allocator against common
- freelist exploit methods. Some slab implementations have more
- sanity-checking than others. This option is most effective with
- CONFIG_SLUB.
-
-config SHUFFLE_PAGE_ALLOCATOR
- bool "Page allocator randomization"
- default SLAB_FREELIST_RANDOM && ACPI_NUMA
- help
- Randomization of the page allocator improves the average
- utilization of a direct-mapped memory-side-cache. See section
- 5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
- 6.2a specification for an example of how a platform advertises
- the presence of a memory-side-cache. There are also incidental
- security benefits as it reduces the predictability of page
- allocations to complement SLAB_FREELIST_RANDOM, but the
- default granularity of shuffling on the "MAX_ORDER - 1" i.e,
- 10th order of pages is selected based on cache utilization
- benefits on x86.
-
- While the randomization improves cache utilization it may
- negatively impact workloads on platforms without a cache. For
- this reason, by default, the randomization is enabled only
- after runtime detection of a direct-mapped memory-side-cache.
- Otherwise, the randomization may be force enabled with the
- 'page_alloc.shuffle' kernel command line parameter.
-
- Say Y if unsure.
-
-config SLUB_CPU_PARTIAL
- default y
- depends on SLUB && SMP
- bool "SLUB per cpu partial cache"
- help
- Per cpu partial caches accelerate object allocation and freeing
- that is local to a processor at the price of more indeterminism
- in the latency of the free. On overflow these caches will be cleared
- which requires the taking of locks that may cause latency spikes.
- Typically one would choose no for a realtime system.
-
config MMAP_ALLOW_UNINITIALIZED
bool "Allow mmapped anonymous memory to be uninitialized"
depends on EXPERT && !MMU
diff --git a/mm/Kconfig b/mm/Kconfig
index c2141dd639e3..675a6be43739 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -2,6 +2,129 @@
menu "Memory Management options"
+#
+# For some reason microblaze and nios2 hard code SWAP=n. Hopefully we can
+# add proper SWAP support to them, in which case this can be removed.
+#
+config ARCH_NO_SWAP
+ bool
+
+config SWAP
+ bool "Support for paging of anonymous memory (swap)"
+ depends on MMU && BLOCK && !ARCH_NO_SWAP
+ default y
+ help
+ This option allows you to choose whether you want to have support
+ for so called swap devices or swap files in your kernel that are
+ used to provide more virtual memory than the actual RAM present
+ in your computer. If unsure say Y.
+
+choice
+ prompt "Choose SLAB allocator"
+ default SLUB
+ help
+ This option allows you to select a slab allocator.
+
+config SLAB
+ bool "SLAB"
+ depends on !PREEMPT_RT
+ select HAVE_HARDENED_USERCOPY_ALLOCATOR
+ help
+ The regular slab allocator that is established and known to work
+ well in all environments. It organizes cache hot objects in
+ per cpu and per node queues.
+
+config SLUB
+ bool "SLUB (Unqueued Allocator)"
+ select HAVE_HARDENED_USERCOPY_ALLOCATOR
+ help
+ SLUB is a slab allocator that minimizes cache line usage
+ instead of managing queues of cached objects (SLAB approach).
+ Per cpu caching is realized using slabs of objects instead
+ of queues of objects. SLUB can use memory efficiently
+ and has enhanced diagnostics. SLUB is the default choice for
+ a slab allocator.
+
+config SLOB
+ depends on EXPERT
+ bool "SLOB (Simple Allocator)"
+ depends on !PREEMPT_RT
+ help
+ SLOB replaces the stock allocator with a drastically simpler
+ allocator. SLOB is generally more space efficient but
+ does not perform as well on large systems.
+
+endchoice
+
+config SLAB_MERGE_DEFAULT
+ bool "Allow slab caches to be merged"
+ default y
+ depends on SLAB || SLUB
+ help
+ For reduced kernel memory fragmentation, slab caches can be
+ merged when they share the same size and other characteristics.
+ This carries a risk of kernel heap overflows being able to
+ overwrite objects from merged caches (and more easily control
+ cache layout), which makes such heap attacks easier to exploit
+ by attackers. By keeping caches unmerged, these kinds of exploits
+ can usually only damage objects in the same cache. To disable
+ merging at runtime, "slab_nomerge" can be passed on the kernel
+ command line.
+
+config SLAB_FREELIST_RANDOM
+ bool "Randomize slab freelist"
+ depends on SLAB || SLUB
+ help
+ Randomizes the freelist order used on creating new pages. This
+ security feature reduces the predictability of the kernel slab
+ allocator against heap overflows.
+
+config SLAB_FREELIST_HARDENED
+ bool "Harden slab freelist metadata"
+ depends on SLAB || SLUB
+ help
+ Many kernel heap attacks try to target slab cache metadata and
+ other infrastructure. This options makes minor performance
+ sacrifices to harden the kernel slab allocator against common
+ freelist exploit methods. Some slab implementations have more
+ sanity-checking than others. This option is most effective with
+ CONFIG_SLUB.
+
+config SHUFFLE_PAGE_ALLOCATOR
+ bool "Page allocator randomization"
+ default SLAB_FREELIST_RANDOM && ACPI_NUMA
+ help
+ Randomization of the page allocator improves the average
+ utilization of a direct-mapped memory-side-cache. See section
+ 5.2.27 Heterogeneous Memory Attribute Table (HMAT) in the ACPI
+ 6.2a specification for an example of how a platform advertises
+ the presence of a memory-side-cache. There are also incidental
+ security benefits as it reduces the predictability of page
+ allocations to complement SLAB_FREELIST_RANDOM, but the
+ default granularity of shuffling on the "MAX_ORDER - 1" i.e,
+ 10th order of pages is selected based on cache utilization
+ benefits on x86.
+
+ While the randomization improves cache utilization it may
+ negatively impact workloads on platforms without a cache. For
+ this reason, by default, the randomization is enabled only
+ after runtime detection of a direct-mapped memory-side-cache.
+ Otherwise, the randomization may be force enabled with the
+ 'page_alloc.shuffle' kernel command line parameter.
+
+ Say Y if unsure.
+
+config SLUB_CPU_PARTIAL
+ default y
+ depends on SLUB && SMP
+ bool "SLUB per cpu partial cache"
+ help
+ Per cpu partial caches accelerate object allocation and freeing
+ that is local to a processor at the price of more indeterminism
+ in the latency of the free. On overflow these caches will be cleared
+ which requires the taking of locks that may cause latency spikes.
+ Typically one would choose no for a realtime system.
+
config SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SELECT_MEMORY_MODEL
--
2.35.3
[PATCH v2 3/6] mm: Kconfig: group swap, slab, hotplug and thp options into submenus
From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2022-05-10 15:28 UTC
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Seth Jennings,
	Dan Streetman, Minchan Kim, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
There are several clusters of related config options spread throughout
the mostly flat MM submenu. Group them together and put specialization
options into further subdirectories to make the MM submenu a bit more
organized and easier to navigate.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/Kconfig | 429 +++++++++++++++++++++++++++--------------------------
1 file changed, 221 insertions(+), 208 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 675a6be43739..2c5935a28edf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -9,7 +9,7 @@ menu "Memory Management options"
config ARCH_NO_SWAP
bool
-config SWAP
+menuconfig SWAP
bool "Support for paging of anonymous memory (swap)"
depends on MMU && BLOCK && !ARCH_NO_SWAP
default y
@@ -19,6 +19,191 @@ config SWAP
used to provide more virtual memory than the actual RAM present
in your computer. If unsure say Y.
+config ZSWAP
+ bool "Compressed cache for swap pages (EXPERIMENTAL)"
+ depends on SWAP && CRYPTO=y
+ select FRONTSWAP
+ select ZPOOL
+ help
+ A lightweight compressed cache for swap pages. It takes
+ pages that are in the process of being swapped out and attempts to
+ compress them into a dynamically allocated RAM-based memory pool.
+ This can result in a significant I/O reduction on swap device and,
+ in the case where decompressing from RAM is faster than swap device
+ reads, can also improve workload performance.
+
+ This is marked experimental because it is a new feature (as of
+ v3.11) that interacts heavily with memory reclaim. While these
+ interactions don't cause any known issues on simple memory setups,
+ they have not been fully explored on the large set of potential
+ configurations and workloads that exist.
+
+choice
+ prompt "Compressed cache for swap pages default compressor"
+ depends on ZSWAP
+ default ZSWAP_COMPRESSOR_DEFAULT_LZO
+ help
+ Selects the default compression algorithm for the compressed cache
+ for swap pages.
+
+ For an overview of what kind of performance can be expected from
+ a particular compression algorithm please refer to the benchmarks
+ available at the following LWN page:
+ https://lwn.net/Articles/751795/
+
+ If in doubt, select 'LZO'.
+
+ The selection made here can be overridden by using the kernel
+ command line 'zswap.compressor=' option.
+
+config ZSWAP_COMPRESSOR_DEFAULT_DEFLATE
+ bool "Deflate"
+ select CRYPTO_DEFLATE
+ help
+ Use the Deflate algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_LZO
+ bool "LZO"
+ select CRYPTO_LZO
+ help
+ Use the LZO algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_842
+ bool "842"
+ select CRYPTO_842
+ help
+ Use the 842 algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_LZ4
+ bool "LZ4"
+ select CRYPTO_LZ4
+ help
+ Use the LZ4 algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_LZ4HC
+ bool "LZ4HC"
+ select CRYPTO_LZ4HC
+ help
+ Use the LZ4HC algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_ZSTD
+ bool "zstd"
+ select CRYPTO_ZSTD
+ help
+ Use the zstd algorithm as the default compression algorithm.
+endchoice
+
+config ZSWAP_COMPRESSOR_DEFAULT
+ string
+ depends on ZSWAP
+ default "deflate" if ZSWAP_COMPRESSOR_DEFAULT_DEFLATE
+ default "lzo" if ZSWAP_COMPRESSOR_DEFAULT_LZO
+ default "842" if ZSWAP_COMPRESSOR_DEFAULT_842
+ default "lz4" if ZSWAP_COMPRESSOR_DEFAULT_LZ4
+ default "lz4hc" if ZSWAP_COMPRESSOR_DEFAULT_LZ4HC
+ default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
+ default ""
+
+choice
+ prompt "Compressed cache for swap pages default allocator"
+ depends on ZSWAP
+ default ZSWAP_ZPOOL_DEFAULT_ZBUD
+ help
+ Selects the default allocator for the compressed cache for
+ swap pages.
+ The default is 'zbud' for compatibility, however please do
+ read the description of each of the allocators below before
+ making the right choice.
+
+ The selection made here can be overridden by using the kernel
+ command line 'zswap.zpool=' option.
+
+config ZSWAP_ZPOOL_DEFAULT_ZBUD
+ bool "zbud"
+ select ZBUD
+ help
+ Use the zbud allocator as the default allocator.
+
+config ZSWAP_ZPOOL_DEFAULT_Z3FOLD
+ bool "z3fold"
+ select Z3FOLD
+ help
+ Use the z3fold allocator as the default allocator.
+
+config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
+ bool "zsmalloc"
+ select ZSMALLOC
+ help
+ Use the zsmalloc allocator as the default allocator.
+endchoice
+
+config ZSWAP_ZPOOL_DEFAULT
+ string
+ depends on ZSWAP
+ default "zbud" if ZSWAP_ZPOOL_DEFAULT_ZBUD
+ default "z3fold" if ZSWAP_ZPOOL_DEFAULT_Z3FOLD
+ default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
+ default ""
+
+config ZSWAP_DEFAULT_ON
+ bool "Enable the compressed cache for swap pages by default"
+ depends on ZSWAP
+ help
+ If selected, the compressed cache for swap pages will be enabled
+ at boot, otherwise it will be disabled.
+
+ The selection made here can be overridden by using the kernel
+ command line 'zswap.enabled=' option.
+
+config ZPOOL
+ tristate "Common API for compressed memory storage"
+ depends on ZSWAP
+ help
+ Compressed memory storage API. This allows using either zbud or
+ zsmalloc.
+
+config ZBUD
+ tristate "Low (Up to 2x) density storage for compressed pages"
+ depends on ZPOOL
+ help
+ A special purpose allocator for storing compressed pages.
+ It is designed to store up to two compressed pages per physical
+ page. While this design limits storage density, it has simple and
+ deterministic reclaim properties that make it preferable to a higher
+ density approach when reclaim will be used.
+
+config Z3FOLD
+ tristate "Up to 3x density storage for compressed pages"
+ depends on ZPOOL
+ help
+ A special purpose allocator for storing compressed pages.
+ It is designed to store up to three compressed pages per physical
+ page. It is a ZBUD derivative so the simplicity and determinism are
+ still there.
+
+config ZSMALLOC
+ tristate "Memory allocator for compressed pages"
+ depends on MMU
+ help
+ zsmalloc is a slab-based memory allocator designed to store
+ compressed RAM pages. zsmalloc uses virtual memory mapping
+ in order to reduce fragmentation. However, this results in a
+ non-standard allocator interface where a handle, not a pointer, is
+ returned by an alloc(). This handle must be mapped in order to
+ access the allocated space.
+
+config ZSMALLOC_STAT
+ bool "Export zsmalloc statistics"
+ depends on ZSMALLOC
+ select DEBUG_FS
+ help
+ This option enables code in zsmalloc to collect various
+ statistics about what's happening in zsmalloc and exports that
+ information to userspace via debugfs.
+ If unsure, say N.
+
+menu "SLAB allocator options"
+
choice
prompt "Choose SLAB allocator"
default SLUB
@@ -90,6 +275,19 @@ config SLAB_FREELIST_HARDENED
sanity-checking than others. This option is most effective with
CONFIG_SLUB.
+config SLUB_CPU_PARTIAL
+ default y
+ depends on SLUB && SMP
+ bool "SLUB per cpu partial cache"
+ help
+ Per cpu partial caches accelerate object allocation and freeing
+ that is local to a processor at the price of more indeterminism
+ in the latency of the free. On overflow these caches will be cleared
+ which requires the taking of locks that may cause latency spikes.
+ Typically one would choose no for a realtime system.
+
+endmenu # SLAB allocator options
+
config SHUFFLE_PAGE_ALLOCATOR
bool "Page allocator randomization"
default SLAB_FREELIST_RANDOM && ACPI_NUMA
@@ -114,17 +312,6 @@ config SHUFFLE_PAGE_ALLOCATOR
Say Y if unsure.
-config SLUB_CPU_PARTIAL
- default y
- depends on SLUB && SMP
- bool "SLUB per cpu partial cache"
- help
- Per cpu partial caches accelerate object allocation and freeing
- that is local to a processor at the price of more indeterminism
- in the latency of the free. On overflow these caches will be cleared
- which requires the taking of locks that may cause latency spikes.
- Typically one would choose no for a realtime system.
-
config SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SELECT_MEMORY_MODEL
@@ -250,14 +437,16 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
bool
# eventually, we can have this option just 'select SPARSEMEM'
-config MEMORY_HOTPLUG
- bool "Allow for memory hot-add"
+menuconfig MEMORY_HOTPLUG
+ bool "Memory hotplug"
select MEMORY_ISOLATION
depends on SPARSEMEM
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on 64BIT
select NUMA_KEEP_MEMINFO if NUMA
+if MEMORY_HOTPLUG
+
config MEMORY_HOTPLUG_DEFAULT_ONLINE
bool "Online the newly added memory blocks by default"
depends on MEMORY_HOTPLUG
@@ -287,6 +476,8 @@ config MHP_MEMMAP_ON_MEMORY
depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+endif # MEMORY_HOTPLUG
+
# Heavily threaded applications may benefit from splitting the mm-wide
# page_table_lock, so that faults on different parts of the user address
# space can be handled with less contention: split it at this NR_CPUS.
@@ -501,7 +692,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
See Documentation/admin-guide/mm/nommu-mmap.rst for more information.
-config TRANSPARENT_HUGEPAGE
+menuconfig TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
select COMPACTION
@@ -516,6 +707,8 @@ config TRANSPARENT_HUGEPAGE
If memory constrained on embedded, you may want to say N.
+if TRANSPARENT_HUGEPAGE
+
choice
prompt "Transparent Hugepage Support sysfs defaults"
depends on TRANSPARENT_HUGEPAGE
@@ -556,6 +749,19 @@ config THP_SWAP
For selection by architectures with reasonable THP sizes.
+config READ_ONLY_THP_FOR_FS
+ bool "Read-only THP for filesystems (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && SHMEM
+
+ help
+ Allow khugepaged to put read-only file-backed pages in THP.
+
+ This is marked experimental because it is a new feature. Write
+ support of file THPs will be developed in the next few release
+ cycles.
+
+endif # TRANSPARENT_HUGEPAGE
+
#
# UP and nommu archs use km based percpu allocator
#
@@ -640,188 +846,6 @@ config MEM_SOFT_DIRTY
See Documentation/admin-guide/mm/soft-dirty.rst for more details.
-config ZSWAP
- bool "Compressed cache for swap pages (EXPERIMENTAL)"
- depends on SWAP && CRYPTO=y
- select FRONTSWAP
- select ZPOOL
- help
- A lightweight compressed cache for swap pages. It takes
- pages that are in the process of being swapped out and attempts to
- compress them into a dynamically allocated RAM-based memory pool.
- This can result in a significant I/O reduction on swap device and,
- in the case where decompressing from RAM is faster than swap device
- reads, can also improve workload performance.
-
- This is marked experimental because it is a new feature (as of
- v3.11) that interacts heavily with memory reclaim. While these
- interactions don't cause any known issues on simple memory setups,
- they have not been fully explored on the large set of potential
- configurations and workloads that exist.
-
-choice
- prompt "Compressed cache for swap pages default compressor"
- depends on ZSWAP
- default ZSWAP_COMPRESSOR_DEFAULT_LZO
- help
- Selects the default compression algorithm for the compressed cache
- for swap pages.
-
- For an overview of what kind of performance can be expected from
- a particular compression algorithm please refer to the benchmarks
- available at the following LWN page:
- https://lwn.net/Articles/751795/
-
- If in doubt, select 'LZO'.
-
- The selection made here can be overridden by using the kernel
- command line 'zswap.compressor=' option.
-
-config ZSWAP_COMPRESSOR_DEFAULT_DEFLATE
- bool "Deflate"
- select CRYPTO_DEFLATE
- help
- Use the Deflate algorithm as the default compression algorithm.
-
-config ZSWAP_COMPRESSOR_DEFAULT_LZO
- bool "LZO"
- select CRYPTO_LZO
- help
- Use the LZO algorithm as the default compression algorithm.
-
-config ZSWAP_COMPRESSOR_DEFAULT_842
- bool "842"
- select CRYPTO_842
- help
- Use the 842 algorithm as the default compression algorithm.
-
-config ZSWAP_COMPRESSOR_DEFAULT_LZ4
- bool "LZ4"
- select CRYPTO_LZ4
- help
- Use the LZ4 algorithm as the default compression algorithm.
-
-config ZSWAP_COMPRESSOR_DEFAULT_LZ4HC
- bool "LZ4HC"
- select CRYPTO_LZ4HC
- help
- Use the LZ4HC algorithm as the default compression algorithm.
-
-config ZSWAP_COMPRESSOR_DEFAULT_ZSTD
- bool "zstd"
- select CRYPTO_ZSTD
- help
- Use the zstd algorithm as the default compression algorithm.
-endchoice
-
-config ZSWAP_COMPRESSOR_DEFAULT
- string
- depends on ZSWAP
- default "deflate" if ZSWAP_COMPRESSOR_DEFAULT_DEFLATE
- default "lzo" if ZSWAP_COMPRESSOR_DEFAULT_LZO
- default "842" if ZSWAP_COMPRESSOR_DEFAULT_842
- default "lz4" if ZSWAP_COMPRESSOR_DEFAULT_LZ4
- default "lz4hc" if ZSWAP_COMPRESSOR_DEFAULT_LZ4HC
- default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
- default ""
-
-choice
- prompt "Compressed cache for swap pages default allocator"
- depends on ZSWAP
- default ZSWAP_ZPOOL_DEFAULT_ZBUD
- help
- Selects the default allocator for the compressed cache for
- swap pages.
- The default is 'zbud' for compatibility, however please do
- read the description of each of the allocators below before
- making the right choice.
-
- The selection made here can be overridden by using the kernel
- command line 'zswap.zpool=' option.
-
-config ZSWAP_ZPOOL_DEFAULT_ZBUD
- bool "zbud"
- select ZBUD
- help
- Use the zbud allocator as the default allocator.
-
-config ZSWAP_ZPOOL_DEFAULT_Z3FOLD
- bool "z3fold"
- select Z3FOLD
- help
- Use the z3fold allocator as the default allocator.
-
-config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
- bool "zsmalloc"
- select ZSMALLOC
- help
- Use the zsmalloc allocator as the default allocator.
-endchoice
-
-config ZSWAP_ZPOOL_DEFAULT
- string
- depends on ZSWAP
- default "zbud" if ZSWAP_ZPOOL_DEFAULT_ZBUD
- default "z3fold" if ZSWAP_ZPOOL_DEFAULT_Z3FOLD
- default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
- default ""
-
-config ZSWAP_DEFAULT_ON
- bool "Enable the compressed cache for swap pages by default"
- depends on ZSWAP
- help
- If selected, the compressed cache for swap pages will be enabled
- at boot, otherwise it will be disabled.
-
- The selection made here can be overridden by using the kernel
- command line 'zswap.enabled=' option.
-
-config ZPOOL
- tristate "Common API for compressed memory storage"
- help
- Compressed memory storage API. This allows using either zbud or
- zsmalloc.
-
-config ZBUD
- tristate "Low (Up to 2x) density storage for compressed pages"
- depends on ZPOOL
- help
- A special purpose allocator for storing compressed pages.
- It is designed to store up to two compressed pages per physical
- page. While this design limits storage density, it has simple and
- deterministic reclaim properties that make it preferable to a higher
- density approach when reclaim will be used.
-
-config Z3FOLD
- tristate "Up to 3x density storage for compressed pages"
- depends on ZPOOL
- help
- A special purpose allocator for storing compressed pages.
- It is designed to store up to three compressed pages per physical
- page. It is a ZBUD derivative so the simplicity and determinism are
- still there.
-
-config ZSMALLOC
- tristate "Memory allocator for compressed pages"
- depends on MMU
- help
- zsmalloc is a slab-based memory allocator designed to store
- compressed RAM pages. zsmalloc uses virtual memory mapping
- in order to reduce fragmentation. However, this results in a
- non-standard allocator interface where a handle, not a pointer, is
- returned by an alloc(). This handle must be mapped in order to
- access the allocated space.
-
-config ZSMALLOC_STAT
- bool "Export zsmalloc statistics"
- depends on ZSMALLOC
- select DEBUG_FS
- help
- This option enables code in zsmalloc to collect various
- statistics about what's happening in zsmalloc and exports that
- information to userspace via debugfs.
- If unsure, say N.
-
config GENERIC_EARLY_IOREMAP
bool
@@ -978,17 +1002,6 @@ comment "GUP_TEST needs to have DEBUG_FS enabled"
config GUP_GET_PTE_LOW_HIGH
bool
-config READ_ONLY_THP_FOR_FS
- bool "Read-only THP for filesystems (EXPERIMENTAL)"
- depends on TRANSPARENT_HUGEPAGE && SHMEM
-
- help
- Allow khugepaged to put read-only file-backed pages in THP.
-
- This is marked experimental because it is a new feature. Write
- support of file THPs will be developed in the next few release
- cycles.
-
config ARCH_HAS_PTE_SPECIAL
bool
--
2.35.3
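For readers unfamiliar with the two kconfig constructs this patch leans on: `menuconfig` renders an option as a submenu entry, and an `if`/`endif` block adds an implicit dependency on the guarding symbol to everything inside it. A generic sketch of the pattern (the FOO symbols are illustrative, not from the patch):

menuconfig FOO
	bool "Foo support"
	help
	  Acts as both the feature switch and the submenu entry.

if FOO

config FOO_EXTRA
	bool "Extra foo functionality"
	help
	  Implicitly gains "depends on FOO" and is presented
	  inside the FOO submenu rather than at the top level.

endif # FOO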
[PATCH v2 4/6] mm: Kconfig: simplify zswap configuration
From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2022-05-10 15:28 UTC
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Seth Jennings,
	Dan Streetman, Minchan Kim, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
- CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is
not. Don't make the user chase down a technical dependency like
that; just select it automatically when zram is requested. The
CONFIG_CRYPTO dependency is redundant due to more specific deps.
- CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and
have it selected in as needed.
- CONFIG_ZSWAP: Select CRYPTO instead of depend. Common pattern.
- Make the ZSWAP suboptions and their descriptions (compression,
allocation backend) a bit more straightforward for the user.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
drivers/block/zram/Kconfig | 3 ++-
mm/Kconfig | 55 +++++++++++++++++---------------------
2 files changed, 27 insertions(+), 31 deletions(-)
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 668c6bf2554d..e4163d4b936b 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -1,8 +1,9 @@
# SPDX-License-Identifier: GPL-2.0
config ZRAM
tristate "Compressed RAM block device support"
- depends on BLOCK && SYSFS && ZSMALLOC && CRYPTO
+ depends on BLOCK && SYSFS
depends on CRYPTO_LZO || CRYPTO_ZSTD || CRYPTO_LZ4 || CRYPTO_LZ4HC || CRYPTO_842
+ select ZSMALLOC
help
Creates virtual block devices called /dev/zramX (X = 0, 1, ...).
Pages written to these disks are compressed and stored in memory
diff --git a/mm/Kconfig b/mm/Kconfig
index 2c5935a28edf..c87ffd0d98b3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -9,6 +9,9 @@ menu "Memory Management options"
config ARCH_NO_SWAP
bool
+config ZPOOL
+ bool
+
menuconfig SWAP
bool "Support for paging of anonymous memory (swap)"
depends on MMU && BLOCK && !ARCH_NO_SWAP
@@ -21,8 +24,9 @@ menuconfig SWAP
config ZSWAP
bool "Compressed cache for swap pages (EXPERIMENTAL)"
- depends on SWAP && CRYPTO=y
+ depends on SWAP
select FRONTSWAP
+ select CRYPTO
select ZPOOL
help
A lightweight compressed cache for swap pages. It takes
@@ -38,8 +42,18 @@ config ZSWAP
they have not been fully explored on the large set of potential
configurations and workloads that exist.
+config ZSWAP_DEFAULT_ON
+ bool "Enable the compressed cache for swap pages by default"
+ depends on ZSWAP
+ help
+ If selected, the compressed cache for swap pages will be enabled
+ at boot, otherwise it will be disabled.
+
+ The selection made here can be overridden by using the kernel
+ command line 'zswap.enabled=' option.
+
choice
- prompt "Compressed cache for swap pages default compressor"
+ prompt "Default compressor"
depends on ZSWAP
default ZSWAP_COMPRESSOR_DEFAULT_LZO
help
@@ -105,7 +119,7 @@ config ZSWAP_COMPRESSOR_DEFAULT
default ""
choice
- prompt "Compressed cache for swap pages default allocator"
+ prompt "Default allocator"
depends on ZSWAP
default ZSWAP_ZPOOL_DEFAULT_ZBUD
help
@@ -145,26 +159,9 @@ config ZSWAP_ZPOOL_DEFAULT
default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
default ""
-config ZSWAP_DEFAULT_ON
- bool "Enable the compressed cache for swap pages by default"
- depends on ZSWAP
- help
- If selected, the compressed cache for swap pages will be enabled
- at boot, otherwise it will be disabled.
-
- The selection made here can be overridden by using the kernel
- command line 'zswap.enabled=' option.
-
-config ZPOOL
- tristate "Common API for compressed memory storage"
- depends on ZSWAP
- help
- Compressed memory storage API. This allows using either zbud or
- zsmalloc.
-
config ZBUD
- tristate "Low (Up to 2x) density storage for compressed pages"
- depends on ZPOOL
+ tristate "2:1 compression allocator (zbud)"
+ depends on ZSWAP
help
A special purpose allocator for storing compressed pages.
It is designed to store up to two compressed pages per physical
@@ -173,8 +170,8 @@ config ZBUD
density approach when reclaim will be used.
config Z3FOLD
- tristate "Up to 3x density storage for compressed pages"
- depends on ZPOOL
+ tristate "3:1 compression allocator (z3fold)"
+ depends on ZSWAP
help
A special purpose allocator for storing compressed pages.
It is designed to store up to three compressed pages per physical
@@ -182,15 +179,13 @@ config Z3FOLD
still there.
config ZSMALLOC
- tristate "Memory allocator for compressed pages"
+ tristate
+ prompt "N:1 compression allocator (zsmalloc)" if ZSWAP
depends on MMU
help
zsmalloc is a slab-based memory allocator designed to store
- compressed RAM pages. zsmalloc uses virtual memory mapping
- in order to reduce fragmentation. However, this results in a
- non-standard allocator interface where a handle, not a pointer, is
- returned by an alloc(). This handle must be mapped in order to
- access the allocated space.
+ pages of various compression levels efficiently. It achieves
+ the highest storage density with the least amount of fragmentation.
config ZSMALLOC_STAT
bool "Export zsmalloc statistics"
--
2.35.3
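Worth noting when choosing these defaults: as the help texts above say, they only seed boot-time behavior, and the module parameters have the final word. For example, a kernel built with the zbud/lzo defaults could still be switched over on the kernel command line, assuming zstd and zsmalloc support are built in (an illustrative combination, not a recommendation from the patch):

	zswap.enabled=1 zswap.compressor=zstd zswap.zpool=zsmalloc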
[PATCH v2 5/6] mm: zswap: add basic meminfo and vmstat coverage
From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2022-05-10 15:28 UTC
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Seth Jennings,
	Dan Streetman, Minchan Kim, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Currently it requires poking at debugfs to figure out the size and
population of the zswap cache on a host. There are no counters for
reads and writes against the cache. As a result, it's difficult to
understand zswap behavior on production systems.
Print zswap memory consumption and how many pages are zswapped out in
/proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/filesystems/proc.rst | 6 ++++++
fs/proc/meminfo.c | 7 +++++++
include/linux/swap.h | 5 +++++
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
mm/zswap.c | 13 ++++++-------
6 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 736ed384750c..8b5a94cfa722 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -964,6 +964,8 @@ Example output. You may not have all of these fields.
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
+ Zswap: 1904 kB
+ Zswapped: 7792 kB
Dirty: 12 kB
Writeback: 0 kB
AnonPages: 4654780 kB
@@ -1055,6 +1057,10 @@ SwapTotal
SwapFree
Memory which has been evicted from RAM, and is temporarily
on the disk
+Zswap
+ Memory consumed by the zswap backend (compressed size)
+Zswapped
+ Amount of anonymous memory stored in zswap (original size)
Dirty
Memory which is waiting to get written back to the disk
Writeback
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..6e89f0e2fd20 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "SwapTotal: ", i.totalswap);
show_val_kb(m, "SwapFree: ", i.freeswap);
+#ifdef CONFIG_ZSWAP
+ seq_printf(m, "Zswap: %8lu kB\n",
+ (unsigned long)(zswap_pool_total_size >> 10));
+ seq_printf(m, "Zswapped: %8lu kB\n",
+ (unsigned long)atomic_read(&zswap_stored_pages) <<
+ (PAGE_SHIFT - 10));
+#endif
show_val_kb(m, "Dirty: ",
global_node_page_state(NR_FILE_DIRTY));
show_val_kb(m, "Writeback: ",
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b82c196d8867..07074afa79a7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -632,6 +632,11 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
}
#endif
+#ifdef CONFIG_ZSWAP
+extern u64 zswap_pool_total_size;
+extern atomic_t zswap_stored_pages;
+#endif
+
#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
extern void __cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask);
static inline void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 5e80138ce624..1ce8fadb2b1c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -132,6 +132,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_KSM
COW_KSM,
#endif
+#ifdef CONFIG_ZSWAP
+ ZSWPIN,
+ ZSWPOUT,
+#endif
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4a2aa2fa88db..da7e389cf33c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1392,6 +1392,10 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_KSM
"cow_ksm",
#endif
+#ifdef CONFIG_ZSWAP
+ "zswpin",
+ "zswpout",
+#endif
#ifdef CONFIG_X86
"direct_map_level2_splits",
"direct_map_level3_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 2c5db4cbedea..e3c16a70f533 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -42,9 +42,9 @@
* statistics
**********************************/
/* Total bytes used by the compressed storage */
-static u64 zswap_pool_total_size;
+u64 zswap_pool_total_size;
/* The number of compressed pages currently stored in zswap */
-static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
+atomic_t zswap_stored_pages = ATOMIC_INIT(0);
/* The number of same-value filled pages currently stored in zswap */
static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
@@ -1243,6 +1243,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
/* update stats */
atomic_inc(&zswap_stored_pages);
zswap_update_total_size();
+ count_vm_event(ZSWPOUT);
return 0;
@@ -1285,11 +1286,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
zswap_fill_page(dst, entry->value);
kunmap_atomic(dst);
ret = 0;
- goto freeentry;
+ goto stats;
}
if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
-
tmp = kmalloc(entry->length, GFP_ATOMIC);
if (!tmp) {
ret = -ENOMEM;
@@ -1304,10 +1304,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
src += sizeof(struct zswap_header);
if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
-
memcpy(tmp, src, entry->length);
src = tmp;
-
zpool_unmap_handle(entry->pool->zpool, entry->handle);
}
@@ -1326,7 +1324,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
kfree(tmp);
BUG_ON(ret);
-
+stats:
+ count_vm_event(ZSWPIN);
freeentry:
spin_lock(&tree->lock);
zswap_entry_put(tree, entry);
--
2.35.3
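With this applied, a rough picture of zswap effectiveness no longer requires debugfs. A quick sketch of reading the new counters; the sample values are the ones from the meminfo excerpt in the patch:

	$ grep -E '^(Zswap|Zswapped):' /proc/meminfo
	Zswap:              1904 kB
	Zswapped:           7792 kB
	$ grep -E '^zswp(in|out) ' /proc/vmstat

Since Zswapped is the original size and Zswap the compressed size, the sample numbers above work out to a compression ratio of roughly 7792/1904 ≈ 4.1x.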
[PATCH v2 6/6] zswap: memcg accounting
From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2022-05-10 15:28 UTC
To: Andrew Morton
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Seth Jennings,
	Dan Streetman, Minchan Kim, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Applications can currently escape their cgroup memory containment when
zswap is enabled. This patch adds per-cgroup tracking and limiting of
zswap backend memory to rectify this.
The existing cgroup2 memory.stat file is extended to show zswap
statistics analogous to what's in meminfo and vmstat. Furthermore, two
new control files, memory.zswap.current and memory.zswap.max, are
added to allow tuning zswap usage on a per-workload basis. This is
important since not all workloads benefit from zswap equally; some
even suffer compared to disk swap when memory contents don't compress
well. The optimal size of the zswap pool, and the threshold for
writeback, also depend on the size of the workload's warm set.
The implementation doesn't use a traditional page_counter transaction.
zswap is unconventional as a memory consumer in that we only know the
amount of memory to charge once expensive compression has occurred. If
zswap is disabled or the limit is already exceeded, we obviously don't
want to compress page upon page only to reject them all. Instead, the
limit is checked against current usage, then we compress and charge.
This allows some limit overrun, but not enough to matter in practice.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/admin-guide/cgroup-v2.rst | 21 +++
include/linux/memcontrol.h | 54 +++++++
mm/memcontrol.c | 196 +++++++++++++++++++++++-
mm/zswap.c | 37 ++++-
4 files changed, 293 insertions(+), 15 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 19bcd73cad03..b4c262e99b5f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1347,6 +1347,12 @@ PAGE_SIZE multiple when read back.
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
+ zswap
+ Amount of memory consumed by the zswap compression backend.
+
+ zswapped
+ Amount of application memory swapped out to zswap.
+
file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1537,6 +1543,21 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.zswap.current
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of memory consumed by the zswap compression
+ backend.
+
+ memory.zswap.max
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Zswap usage hard limit. If a cgroup's zswap pool reaches this
+ limit, it will refuse to take any more stores before existing
+ entries fault back in or are written out to disk.
+
memory.pressure
A read-only nested-keyed file.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fe580cb96683..3385ce81ecf3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,8 @@ enum memcg_stat_item {
MEMCG_PERCPU_B,
MEMCG_VMALLOC,
MEMCG_KMEM,
+ MEMCG_ZSWAP_B,
+ MEMCG_ZSWAPPED,
MEMCG_NR_STAT,
};
@@ -252,6 +254,10 @@ struct mem_cgroup {
/* Range enforcement for interrupt charges */
struct work_struct high_work;
+#ifdef CONFIG_ZSWAP
+ unsigned long zswap_max;
+#endif
+
unsigned long soft_limit;
/* vmpressure notifications */
@@ -1264,6 +1270,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
return NULL;
}
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
@@ -1680,6 +1690,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
void __memcg_kmem_uncharge_page(struct page *page, int order);
struct obj_cgroup *get_obj_cgroup_from_current(void);
+struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
@@ -1716,6 +1727,20 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg)
struct mem_cgroup *mem_cgroup_from_obj(void *p);
+static inline void count_objcg_event(struct obj_cgroup *objcg,
+ enum vm_event_item idx)
+{
+ struct mem_cgroup *memcg;
+
+ if (mem_cgroup_kmem_disabled())
+ return;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ count_memcg_events(memcg, idx, 1);
+ rcu_read_unlock();
+}
+
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1742,6 +1767,11 @@ static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
{
}
+static inline struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
static inline bool memcg_kmem_enabled(void)
{
return false;
@@ -1757,6 +1787,30 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
return NULL;
}
+static inline void count_objcg_event(struct obj_cgroup *objcg,
+ enum vm_event_item idx)
+{
+}
+
#endif /* CONFIG_MEMCG_KMEM */
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
+bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
+void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
+#else
+static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
+{
+ return true;
+}
+static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
+ size_t size)
+{
+}
+static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
+ size_t size)
+{
+}
+#endif
+
#endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 04cea4fa362a..cbb9b43bdb80 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1398,6 +1398,10 @@ static const struct memory_stat memory_stats[] = {
{ "sock", MEMCG_SOCK },
{ "vmalloc", MEMCG_VMALLOC },
{ "shmem", NR_SHMEM },
+#ifdef CONFIG_ZSWAP
+ { "zswap", MEMCG_ZSWAP_B },
+ { "zswapped", MEMCG_ZSWAPPED },
+#endif
{ "file_mapped", NR_FILE_MAPPED },
{ "file_dirty", NR_FILE_DIRTY },
{ "file_writeback", NR_WRITEBACK },
@@ -1432,6 +1436,7 @@ static int memcg_page_state_unit(int item)
{
switch (item) {
case MEMCG_PERCPU_B:
+ case MEMCG_ZSWAP_B:
case NR_SLAB_RECLAIMABLE_B:
case NR_SLAB_UNRECLAIMABLE_B:
case WORKINGSET_REFAULT_ANON:
@@ -1512,6 +1517,13 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGLAZYFREED),
memcg_events(memcg, PGLAZYFREED));
+#ifdef CONFIG_ZSWAP
+ seq_buf_printf(&s, "%s %lu\n", vm_event_name(ZSWPIN),
+ memcg_events(memcg, ZSWPIN));
+ seq_buf_printf(&s, "%s %lu\n", vm_event_name(ZSWPOUT),
+ memcg_events(memcg, ZSWPOUT));
+#endif
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
seq_buf_printf(&s, "%s %lu\n", vm_event_name(THP_FAULT_ALLOC),
memcg_events(memcg, THP_FAULT_ALLOC));
@@ -2883,6 +2895,19 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
return page_memcg_check(folio_page(folio, 0));
}
+static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg = NULL;
+
+ for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+ objcg = rcu_dereference(memcg->objcg);
+ if (objcg && obj_cgroup_tryget(objcg))
+ break;
+ objcg = NULL;
+ }
+ return objcg;
+}
+
__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
struct obj_cgroup *objcg = NULL;
@@ -2896,15 +2921,32 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
memcg = active_memcg();
else
memcg = mem_cgroup_from_task(current);
-
- for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
- objcg = rcu_dereference(memcg->objcg);
- if (objcg && obj_cgroup_tryget(objcg))
- break;
- objcg = NULL;
- }
+ objcg = __get_obj_cgroup_from_memcg(memcg);
rcu_read_unlock();
+ return objcg;
+}
+
+struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
+{
+ struct obj_cgroup *objcg;
+
+ if (!memcg_kmem_enabled() || memcg_kmem_bypass())
+ return NULL;
+ if (PageMemcgKmem(page)) {
+ objcg = __folio_objcg(page_folio(page));
+ obj_cgroup_get(objcg);
+ } else {
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = __folio_memcg(page_folio(page));
+ if (memcg)
+ objcg = __get_obj_cgroup_from_memcg(memcg);
+ else
+ objcg = NULL;
+ rcu_read_unlock();
+ }
return objcg;
}
@@ -5142,6 +5184,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
memcg->soft_limit = PAGE_COUNTER_MAX;
+#ifdef CONFIG_ZSWAP
+ memcg->zswap_max = PAGE_COUNTER_MAX;
+#endif
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
if (parent) {
memcg->swappiness = mem_cgroup_swappiness(parent);
@@ -7406,6 +7451,139 @@ static struct cftype memsw_files[] = {
{ }, /* terminate */
};
+#ifdef CONFIG_ZSWAP
+/**
+ * obj_cgroup_may_zswap - check if this cgroup can zswap
+ * @objcg: the object cgroup
+ *
+ * Check if the hierarchical zswap limit has been reached.
+ *
+ * This doesn't check for specific headroom, and it is not atomic
+ * either. But with zswap, the size of the allocation is only known
+ * once compression has occurred, and this optimistic pre-check avoids
+ * spending cycles on compression when there is already no room left
+ * or zswap is disabled altogether somewhere in the hierarchy.
+ */
+bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
+{
+ struct mem_cgroup *memcg, *original_memcg;
+ bool ret = true;
+
+ original_memcg = get_mem_cgroup_from_objcg(objcg);
+ for (memcg = original_memcg; memcg != root_mem_cgroup;
+ memcg = parent_mem_cgroup(memcg)) {
+ unsigned long max = READ_ONCE(memcg->zswap_max);
+ unsigned long pages;
+
+ if (max == PAGE_COUNTER_MAX)
+ continue;
+ if (max == 0) {
+ ret = false;
+ break;
+ }
+
+ cgroup_rstat_flush(memcg->css.cgroup);
+ pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
+ if (pages < max)
+ continue;
+ ret = false;
+ break;
+ }
+ mem_cgroup_put(original_memcg);
+ return ret;
+}
+
+/**
+ * obj_cgroup_charge_zswap - charge compression backend memory
+ * @objcg: the object cgroup
+ * @size: size of compressed object
+ *
+ * This forces the charge after obj_cgroup_may_zswap() allowed
+ * compression and storage in zswap for this cgroup to go ahead.
+ */
+void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
+{
+ struct mem_cgroup *memcg;
+
+ VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
+
+ /* PF_MEMALLOC context, charging must succeed */
+ if (obj_cgroup_charge(objcg, GFP_KERNEL, size))
+ VM_WARN_ON_ONCE(1);
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ mod_memcg_state(memcg, MEMCG_ZSWAP_B, size);
+ mod_memcg_state(memcg, MEMCG_ZSWAPPED, 1);
+ rcu_read_unlock();
+}
+
+/**
+ * obj_cgroup_uncharge_zswap - uncharge compression backend memory
+ * @objcg: the object cgroup
+ * @size: size of compressed object
+ *
+ * Uncharges zswap memory on page in.
+ */
+void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
+{
+ struct mem_cgroup *memcg;
+
+ obj_cgroup_uncharge(objcg, size);
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ mod_memcg_state(memcg, MEMCG_ZSWAP_B, -size);
+ mod_memcg_state(memcg, MEMCG_ZSWAPPED, -1);
+ rcu_read_unlock();
+}
+
+static u64 zswap_current_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ cgroup_rstat_flush(css->cgroup);
+ return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+}
+
+static int zswap_max_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->zswap_max));
+}
+
+static ssize_t zswap_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned long max;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "max", &max);
+ if (err)
+ return err;
+
+ xchg(&memcg->zswap_max, max);
+
+ return nbytes;
+}
+
+static struct cftype zswap_files[] = {
+ {
+ .name = "zswap.current",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = zswap_current_read,
+ },
+ {
+ .name = "zswap.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = zswap_max_show,
+ .write = zswap_max_write,
+ },
+ { } /* terminate */
+};
+#endif /* CONFIG_ZSWAP */
+
/*
* If mem_cgroup_swap_init() is implemented as a subsys_initcall()
* instead of a core_initcall(), this could mean cgroup_memory_noswap still
@@ -7424,7 +7602,9 @@ static int __init mem_cgroup_swap_init(void)
WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
-
+#ifdef CONFIG_ZSWAP
+ WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, zswap_files));
+#endif
return 0;
}
core_initcall(mem_cgroup_swap_init);
diff --git a/mm/zswap.c b/mm/zswap.c
index e3c16a70f533..104835b379ec 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -188,6 +188,7 @@ struct zswap_entry {
unsigned long handle;
unsigned long value;
};
+ struct obj_cgroup *objcg;
};
struct zswap_header {
@@ -359,6 +360,10 @@ static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
*/
static void zswap_free_entry(struct zswap_entry *entry)
{
+ if (entry->objcg) {
+ obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
+ obj_cgroup_put(entry->objcg);
+ }
if (!entry->length)
atomic_dec(&zswap_same_filled_pages);
else {
@@ -1096,6 +1101,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
struct zswap_entry *entry, *dupentry;
struct scatterlist input, output;
struct crypto_acomp_ctx *acomp_ctx;
+ struct obj_cgroup *objcg = NULL;
+ struct zswap_pool *pool;
int ret;
unsigned int hlen, dlen = PAGE_SIZE;
unsigned long handle, value;
@@ -1115,17 +1122,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
goto reject;
}
+ objcg = get_obj_cgroup_from_page(page);
+ if (objcg && !obj_cgroup_may_zswap(objcg))
+ goto shrink;
+
/* reclaim space if needed */
if (zswap_is_full()) {
- struct zswap_pool *pool;
-
zswap_pool_limit_hit++;
zswap_pool_reached_full = true;
- pool = zswap_pool_last_get();
- if (pool)
- queue_work(shrink_wq, &pool->shrink_work);
- ret = -ENOMEM;
- goto reject;
+ goto shrink;
}
if (zswap_pool_reached_full) {
@@ -1227,6 +1232,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
entry->length = dlen;
insert_entry:
+ entry->objcg = objcg;
+ if (objcg) {
+ obj_cgroup_charge_zswap(objcg, entry->length);
+ /* Account before objcg ref is moved to tree */
+ count_objcg_event(objcg, ZSWPOUT);
+ }
+
/* map */
spin_lock(&tree->lock);
do {
@@ -1253,7 +1265,16 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
freepage:
zswap_entry_cache_free(entry);
reject:
+ if (objcg)
+ obj_cgroup_put(objcg);
return ret;
+
+shrink:
+ pool = zswap_pool_last_get();
+ if (pool)
+ queue_work(shrink_wq, &pool->shrink_work);
+ ret = -ENOMEM;
+ goto reject;
}
/*
@@ -1326,6 +1347,8 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
BUG_ON(ret);
stats:
count_vm_event(ZSWPIN);
+ if (entry->objcg)
+ count_objcg_event(entry->objcg, ZSWPIN);
freeentry:
spin_lock(&tree->lock);
zswap_entry_put(tree, entry);
--
2.35.3
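A usage sketch for the new interface, assuming cgroup2 is mounted at /sys/fs/cgroup and a cgroup named "job" already exists (the 64M value is arbitrary):

	# echo 64M > /sys/fs/cgroup/job/memory.zswap.max    # cap the compressed pool
	# cat /sys/fs/cgroup/job/memory.zswap.current       # charged bytes, compressed
	# grep -E '^(zswap|zswapped|zswpin|zswpout) ' /sys/fs/cgroup/job/memory.stat

Once the limit is reached, obj_cgroup_may_zswap() fails the optimistic pre-check and further stores from that cgroup fall through to regular disk swap until existing entries fault back in or are written back.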