* [PATCH v2] mm: add zblock allocator
@ 2025-04-04 19:28 Vitaly Wool
2025-04-04 20:03 ` Johannes Weiner
2025-04-07 15:54 ` Johannes Weiner
0 siblings, 2 replies; 17+ messages in thread
From: Vitaly Wool @ 2025-04-04 19:28 UTC (permalink / raw)
To: linux-mm
Cc: akpm, linux-kernel, Nhat Pham, Shakeel Butt, Vitaly Wool,
Igor Belousov
zblock is a special purpose allocator for storing compressed pages.
It stores an integer number of compressed objects per block. These
blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
With zblock, it is possible to densely arrange objects of various sizes,
resulting in low internal fragmentation. This allocator also tries to
fill incomplete blocks instead of adding new ones, in many cases
providing a compression ratio substantially higher than z3fold and zbud
(though lower than zsmalloc's).
zblock does not require an MMU to operate and is also superior to zsmalloc
with regard to average performance and worst execution times, thus
allowing for better response time and real-time characteristics of the
whole system.
E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get a
5-10% higher bogo ops/s value for zblock compared to zsmalloc.
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se>
Signed-off-by: Igor Belousov <igor.b@beldev.am>
---
Changes since v1:
- adaptations to better handle 16K pages
- the block table is dropped in favor of a linked list (the entry at the
  head of the list always has free slots)
- an rbtree is created at the initialization phase to speed up the search
  for an appropriate block type (see the worked size example below)
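For orientation, here is a small worked example of the slot size math behind
the block type table (illustrative numbers only, assuming a 64-bit build with
4K pages; the authoritative values come from SLOT_SIZE() in zblock.h):

	/*
	 * sizeof(struct zblock_block) is 32 bytes on such a build, so
	 * ZBLOCK_HEADER_SIZE == 32 and, for the order-0 block type with
	 * 32 slots:
	 *
	 *	BLOCK_DATA_SIZE(0) == 4096 - 32 == 4064
	 *	SLOT_SIZE(32, 0)   == round_down(4064 / 32, 8) == 120
	 *
	 * i.e. such a block packs 32 objects of up to 120 bytes each
	 * into a single physical page.
	 */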
Performance metrics:
1 Kernel build on a Raspberry Pi 5
Kernel build on a tmpfs with zblock/zsmalloc as zswap backend (LZ4
compression):
1.1 zblock
real 25m53.040s
user 96m43.424s
sys 4m56.652s
real 25m20.748s
user 94m24.324s
sys 4m58.005s
real 25m37.486s
user 95m35.913s
sys 4m55.892s
1.2 zsmalloc
real 26m17.934s
user 97m13.342s
sys 5m2.415s
real 25m50.694s
user 95m22.065s
sys 5m1.305s
real 25m57.714s
user 96m14.675s
sys 4m59.081s
Since zswap only comes into use from minute 21 onwards, this works out to an
average advantage of about 9% for zblock.
2 stress-ng
Command: stress-ng --vm 4 --vm-bytes 8G --vm-keep --timeout 10m --metrics-brief
Results:
2.1 zblock
stress-ng: metrc: [421] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: metrc: [421] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: metrc: [421] vm 29701982 601.32 1088.63 209.89 49394.38 22873.66
2.2 zsmalloc
stress-ng: metrc: [479] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: metrc: [479] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: metrc: [479] vm 29561584 600.59 916.38 895.74 49221.11 16313.22
Object file size comparison
1 ARM64
1.1 zblock: 10K
1.2 zsmalloc: 36K
2 RISC-V
2.1 zblock: 12K
2.2 zsmalloc: 77K
Documentation/mm/zblock.rst | 24 +++
MAINTAINERS | 7 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/zblock.c | 418 ++++++++++++++++++++++++++++++++++++
mm/zblock.h | 136 ++++++++++++
6 files changed, 594 insertions(+)
create mode 100644 Documentation/mm/zblock.rst
create mode 100644 mm/zblock.c
create mode 100644 mm/zblock.h
diff --git a/Documentation/mm/zblock.rst b/Documentation/mm/zblock.rst
new file mode 100644
index 000000000000..69e302d53255
--- /dev/null
+++ b/Documentation/mm/zblock.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+zblock
+======
+
+zblock is a special purpose allocator for storing compressed pages.
+It stores an integer number of compressed objects per block. These
+blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
+
+With zblock, it is possible to densely arrange objects of various sizes,
+resulting in low internal fragmentation. This allocator also tries to
+fill incomplete blocks instead of adding new ones, in many cases
+providing a compression ratio substantially higher than z3fold and zbud
+(though lower than zsmalloc's).
+
+zblock does not require an MMU to operate and is also superior to zsmalloc
+with regard to average performance and worst execution times, thus
+allowing for better response time and real-time characteristics of the
+whole system.
+
+E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get a
+5-10% higher bogo ops/s value for zblock compared to zsmalloc.
+
diff --git a/MAINTAINERS b/MAINTAINERS
index d5dfb9186962..19283e6a08eb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26556,6 +26556,13 @@ F: Documentation/networking/device_drivers/hamradio/z8530drv.rst
F: drivers/net/hamradio/*scc.c
F: drivers/net/hamradio/z8530.h
+ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
+M: Vitaly Wool <vitaly.wool@konsulko.se>
+L: linux-mm@kvack.org
+S: Maintained
+F: Documentation/mm/zblock.rst
+F: mm/zblock.[ch]
+
ZBUD COMPRESSED PAGE ALLOCATOR
M: Seth Jennings <sjenning@redhat.com>
M: Dan Streetman <ddstreet@ieee.org>
diff --git a/mm/Kconfig b/mm/Kconfig
index 0b7f4bb5cb80..c3082809ad35 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,6 +193,14 @@ config Z3FOLD_DEPRECATED
page. It is a ZBUD derivative so the simplicity and determinism are
still there.
+config ZBLOCK
+ tristate "Fast compression allocator with high density"
+ depends on ZPOOL
+ help
+ A special purpose allocator for storing compressed pages.
+ It is designed to store same size compressed pages in blocks of
+ physical pages.
+
config Z3FOLD
tristate
default y if Z3FOLD_DEPRECATED=y
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..2018455b7baa 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -116,6 +116,7 @@ obj-$(CONFIG_ZPOOL) += zpool.o
obj-$(CONFIG_ZBUD) += zbud.o
obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_Z3FOLD) += z3fold.o
+obj-$(CONFIG_ZBLOCK) += zblock.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_NUMA) += numa.o
diff --git a/mm/zblock.c b/mm/zblock.c
new file mode 100644
index 000000000000..03e8180908e5
--- /dev/null
+++ b/mm/zblock.c
@@ -0,0 +1,418 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * zblock.c
+ *
+ * Author: Vitaly Wool <vitaly.wool@konsulko.se>
+ * Based on the work from Ananda Badmaev <a.badmaev@clicknet.pro>
+ * Copyright (C) 2022-2025, Konsulko AB.
+ *
+ * Zblock is a small object allocator with the intention to serve as a
+ * zpool backend. It operates on page blocks which consist of number
+ * of physical pages being a power of 2 and store integer number of
+ * compressed pages per block which results in determinism and simplicity.
+ *
+ * zblock doesn't export any API and is meant to be used via zpool API.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/atomic.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/preempt.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/zpool.h>
+#include "zblock.h"
+
+static struct rb_root block_desc_tree = RB_ROOT;
+
+/* add a new block to the list */
+static inline void add_block(struct zblock_block *block,
+ struct block_list *block_list)
+{
+ list_add(&block->link, &block_list->list);
+}
+
+/*
+ * Find a block with at least one free slot and claim it.
+ * We make sure that the first block, if exists, will always work.
+ */
+static inline struct zblock_block *find_block(struct block_list *block_list)
+{
+ struct list_head *l = &block_list->list;
+
+ if (!list_empty(l)) {
+ struct zblock_block *z = list_first_entry(l, typeof(*z), link);
+
+ if (z->free_slots > 0) {
+ if (--z->free_slots == 0)
+ list_move_tail(&z->link, l);
+ return z;
+ }
+ }
+ return NULL;
+}
+
+/* remove block from the list */
+static inline void remove_block(struct zblock_block *block)
+{
+ list_del_init(&block->link);
+}
+
+/* Encodes the handle of a particular slot in the pool using metadata */
+static inline unsigned long metadata_to_handle(struct zblock_block *block,
+ unsigned int block_type, unsigned int slot)
+{
+ return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
+}
+
+/* Returns block, block type and slot in the pool corresponding to handle */
+static inline struct zblock_block *handle_to_metadata(unsigned long handle,
+ unsigned int *block_type, unsigned int *slot)
+{
+ *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
+ *slot = handle & SLOT_MASK;
+ return (struct zblock_block *)(handle & PAGE_MASK);
+}
+
+
+/*
+ * allocate new block and add it to corresponding block list
+ */
+static struct zblock_block *alloc_block(struct zblock_pool *pool,
+ int block_type, gfp_t gfp,
+ unsigned long *handle)
+{
+ struct zblock_block *block;
+ struct block_list *block_list;
+
+ block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
+ if (!block)
+ return NULL;
+
+ block_list = &pool->block_lists[block_type];
+
+ /* init block data */
+ block->free_slots = block_desc[block_type].slots_per_block - 1;
+ memset(&block->slot_info, 0, sizeof(block->slot_info));
+ set_bit(0, block->slot_info);
+ *handle = metadata_to_handle(block, block_type, 0);
+
+ spin_lock(&block_list->lock);
+ add_block(block, block_list);
+ block_list->block_count++;
+ spin_unlock(&block_list->lock);
+ return block;
+}
+
+/*****************
+ * API Functions
+ *****************/
+/**
+ * zblock_create_pool() - create a new zblock pool
+ * @gfp: gfp flags when allocating the zblock pool structure
+ * @ops: user-defined operations for the zblock pool
+ *
+ * Return: pointer to the new zblock pool or NULL if the metadata allocation
+ * failed.
+ */
+static struct zblock_pool *zblock_create_pool(gfp_t gfp)
+{
+ struct zblock_pool *pool;
+ struct block_list *block_list;
+ int i;
+
+ pool = kmalloc(sizeof(struct zblock_pool), gfp);
+ if (!pool)
+ return NULL;
+
+ /* init each block list */
+ for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
+ block_list = &pool->block_lists[i];
+ spin_lock_init(&block_list->lock);
+ INIT_LIST_HEAD(&block_list->list);
+ block_list->block_count = 0;
+ }
+ return pool;
+}
+
+/**
+ * zblock_destroy_pool() - destroys an existing zblock pool
+ * @pool: the zblock pool to be destroyed
+ *
+ */
+static void zblock_destroy_pool(struct zblock_pool *pool)
+{
+ kfree(pool);
+}
+
+
+/**
+ * zblock_alloc() - allocates a slot of appropriate size
+ * @pool: zblock pool from which to allocate
+ * @size: size in bytes of the desired allocation
+ * @gfp: gfp flags used if the pool needs to grow
+ * @handle: handle of the new allocation
+ *
+ * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
+ * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
+ * a new slot.
+ */
+static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
+ unsigned long *handle)
+{
+ int block_type = -1;
+ unsigned int slot;
+ struct zblock_block *block;
+ struct block_list *block_list;
+
+ if (!size)
+ return -EINVAL;
+
+ if (size > PAGE_SIZE)
+ return -ENOSPC;
+
+ /* find basic block type with suitable slot size */
+ if (size < block_desc[0].slot_size)
+ block_type = 0;
+ else {
+ struct block_desc_node *block_node;
+ struct rb_node *node = block_desc_tree.rb_node;
+
+ while (node) {
+ block_node = container_of(node, typeof(*block_node), node);
+ if (size < block_node->this_slot_size)
+ node = node->rb_left;
+ else if (size >= block_node->next_slot_size)
+ node = node->rb_right;
+ else {
+ block_type = block_node->block_idx + 1;
+ break;
+ }
+ }
+ }
+ if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
+ return -EINVAL;
+
+ block_list = &pool->block_lists[block_type];
+
+ spin_lock(&block_list->lock);
+ block = find_block(block_list);
+ spin_unlock(&block_list->lock);
+ if (block)
+ goto found;
+
+ /* not found block with free slots try to allocate new empty block */
+ block = alloc_block(pool, block_type, gfp, handle);
+ return block ? 0 : -ENOMEM;
+
+found:
+ /* find the first free slot in block */
+ for (slot = find_first_zero_bit(block->slot_info,
+ block_desc[block_type].slots_per_block);
+ slot < block_desc[block_type].slots_per_block;
+ slot = find_next_zero_bit(block->slot_info,
+ block_desc[block_type].slots_per_block,
+ slot)) {
+ if (!test_and_set_bit(slot, block->slot_info))
+ break;
+ barrier();
+ }
+ BUG_ON(slot >= block_desc[block_type].slots_per_block);
+ *handle = metadata_to_handle(block, block_type, slot);
+ return 0;
+}
+
+/**
+ * zblock_free() - frees the allocation associated with the given handle
+ * @pool: pool in which the allocation resided
+ * @handle: handle associated with the allocation returned by zblock_alloc()
+ *
+ */
+static void zblock_free(struct zblock_pool *pool, unsigned long handle)
+{
+ unsigned int slot, block_type;
+ struct zblock_block *block;
+ struct block_list *block_list;
+
+ block = handle_to_metadata(handle, &block_type, &slot);
+ block_list = &pool->block_lists[block_type];
+
+ spin_lock(&block_list->lock);
+ /* if all slots in block are empty delete whole block */
+ if (++block->free_slots == block_desc[block_type].slots_per_block) {
+ block_list->block_count--;
+ remove_block(block);
+ spin_unlock(&block_list->lock);
+ free_pages((unsigned long)block, block_desc[block_type].order);
+ return;
+ }
+ spin_unlock(&block_list->lock);
+
+ clear_bit(slot, block->slot_info);
+}
+
+/**
+ * zblock_map() - maps the allocation associated with the given handle
+ * @pool: pool in which the allocation resides
+ * @handle: handle associated with the allocation to be mapped
+ *
+ *
+ * Returns: a pointer to the mapped allocation
+ */
+static void *zblock_map(struct zblock_pool *pool, unsigned long handle)
+{
+ unsigned int block_type, slot;
+ struct zblock_block *block;
+ unsigned long offs;
+ void *p;
+
+ block = handle_to_metadata(handle, &block_type, &slot);
+ offs = ZBLOCK_HEADER_SIZE + slot * block_desc[block_type].slot_size;
+ p = (void *)block + offs;
+ return p;
+}
+
+/**
+ * zblock_unmap() - unmaps the allocation associated with the given handle
+ * @pool: pool in which the allocation resides
+ * @handle: handle associated with the allocation to be unmapped
+ */
+static void zblock_unmap(struct zblock_pool *pool, unsigned long handle)
+{
+}
+
+/**
+ * zblock_get_total_pages() - gets the zblock pool size in pages
+ * @pool: pool being queried
+ *
+ * Returns: size in pages of the given pool.
+ */
+static u64 zblock_get_total_pages(struct zblock_pool *pool)
+{
+ u64 total_size;
+ int i;
+
+ total_size = 0;
+ for (i = 0; i < ARRAY_SIZE(block_desc); i++)
+ total_size += pool->block_lists[i].block_count << block_desc[i].order;
+
+ return total_size;
+}
+
+/*****************
+ * zpool
+ ****************/
+
+static void *zblock_zpool_create(const char *name, gfp_t gfp)
+{
+ return zblock_create_pool(gfp);
+}
+
+static void zblock_zpool_destroy(void *pool)
+{
+ zblock_destroy_pool(pool);
+}
+
+static int zblock_zpool_malloc(void *pool, size_t size, gfp_t gfp,
+ unsigned long *handle)
+{
+ return zblock_alloc(pool, size, gfp, handle);
+}
+
+static void zblock_zpool_free(void *pool, unsigned long handle)
+{
+ zblock_free(pool, handle);
+}
+
+static void *zblock_zpool_map(void *pool, unsigned long handle,
+ enum zpool_mapmode mm)
+{
+ return zblock_map(pool, handle);
+}
+
+static void zblock_zpool_unmap(void *pool, unsigned long handle)
+{
+ zblock_unmap(pool, handle);
+}
+
+static u64 zblock_zpool_total_pages(void *pool)
+{
+ return zblock_get_total_pages(pool);
+}
+
+static struct zpool_driver zblock_zpool_driver = {
+ .type = "zblock",
+ .owner = THIS_MODULE,
+ .create = zblock_zpool_create,
+ .destroy = zblock_zpool_destroy,
+ .malloc = zblock_zpool_malloc,
+ .free = zblock_zpool_free,
+ .map = zblock_zpool_map,
+ .unmap = zblock_zpool_unmap,
+ .total_pages = zblock_zpool_total_pages,
+};
+
+MODULE_ALIAS("zpool-zblock");
+
+static void delete_rbtree(void)
+{
+ while (!RB_EMPTY_ROOT(&block_desc_tree))
+ rb_erase(block_desc_tree.rb_node, &block_desc_tree);
+}
+
+static int __init create_rbtree(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
+ struct block_desc_node *block_node = kmalloc(sizeof(*block_node),
+ GFP_KERNEL);
+ struct rb_node **new = &block_desc_tree.rb_node, *parent = NULL;
+
+ if (!block_node) {
+ delete_rbtree();
+ return -ENOMEM;
+ }
+ block_node->this_slot_size = block_desc[i].slot_size;
+ block_node->block_idx = i;
+ if (i == ARRAY_SIZE(block_desc) - 1)
+ block_node->next_slot_size = PAGE_SIZE;
+ else
+ block_node->next_slot_size = block_desc[i+1].slot_size;
+ while (*new) {
+ parent = *new;
+ /* the array is sorted so we will always go to the right */
+ new = &((*new)->rb_right);
+ }
+ rb_link_node(&block_node->node, parent, new);
+ rb_insert_color(&block_node->node, &block_desc_tree);
+ }
+ return 0;
+}
+
+static int __init init_zblock(void)
+{
+ int ret = create_rbtree();
+
+ if (ret)
+ return ret;
+
+ zpool_register_driver(&zblock_zpool_driver);
+ return 0;
+}
+
+static void __exit exit_zblock(void)
+{
+ zpool_unregister_driver(&zblock_zpool_driver);
+ delete_rbtree();
+}
+
+module_init(init_zblock);
+module_exit(exit_zblock);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vitaly Wool <vitaly.wool@konsulko.se>");
+MODULE_DESCRIPTION("Block allocator for compressed pages");
diff --git a/mm/zblock.h b/mm/zblock.h
new file mode 100644
index 000000000000..0163d13593bf
--- /dev/null
+++ b/mm/zblock.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Author: Vitaly Wool <vitaly.wool@konsulko.com>
+ * Copyright (C) 2025, Konsulko AB.
+ */
+#ifndef __ZBLOCK_H__
+#define __ZBLOCK_H__
+
+#include <linux/mm.h>
+#include <linux/rbtree.h>
+#include <linux/types.h>
+
+#define SLOT_FREE 0
+#define BIT_SLOT_OCCUPIED 0
+#define BIT_SLOT_MAPPED 1
+
+#if PAGE_SIZE == 0x1000
+#define SLOT_BITS 5
+#elif PAGE_SIZE == 0x4000
+#define SLOT_BITS 8
+#else
+#error Unsupported PAGE_SIZE
+#endif
+
+#define MAX_SLOTS (1 << SLOT_BITS)
+#define SLOT_MASK ((0x1UL << SLOT_BITS) - 1)
+
+#define ZBLOCK_HEADER_SIZE round_up(sizeof(struct zblock_block), sizeof(long))
+#define BLOCK_DATA_SIZE(order) ((PAGE_SIZE << order) - ZBLOCK_HEADER_SIZE)
+#define SLOT_SIZE(nslots, order) (round_down((BLOCK_DATA_SIZE(order) / nslots), sizeof(long)))
+
+/**
+ * struct zblock_block - block metadata
+ * Block consists of several (1/2/4/8) pages and contains fixed
+ * integer number of slots for allocating compressed pages.
+ *
+ * free_slots: number of free slots in the block
+ * slot_info: contains data about free/occupied slots
+ */
+struct zblock_block {
+ struct list_head link;
+ DECLARE_BITMAP(slot_info, 1 << SLOT_BITS);
+ u32 free_slots;
+};
+
+/**
+ * struct block_desc - general metadata for block lists
+ * Each block list stores only blocks of corresponding type which means
+ * that all blocks in it have the same number and size of slots.
+ * All slots are aligned to size of long.
+ *
+ * slot_size: size of slot for this list
+ * slots_per_block: number of slots per block for this list
+ * order: order for __get_free_pages
+ */
+struct block_desc {
+ unsigned int slot_size;
+ unsigned short slots_per_block;
+ unsigned short order;
+};
+
+struct block_desc_node {
+ struct rb_node node;
+ unsigned int this_slot_size;
+ unsigned int next_slot_size;
+ unsigned int block_idx;
+};
+
+static const struct block_desc block_desc[] = {
+#if PAGE_SIZE == 0x4000
+ { SLOT_SIZE(181, 0), 181, 0 },
+ { SLOT_SIZE(150, 0), 150, 0 },
+ { SLOT_SIZE(116, 0), 116, 0 },
+ { SLOT_SIZE(94, 0), 94, 0 },
+ { SLOT_SIZE(72, 0), 72, 0 },
+ { SLOT_SIZE(54, 0), 54, 0 },
+ { SLOT_SIZE(42, 0), 42, 0 },
+#endif /* PAGE_SIZE */
+ { SLOT_SIZE(32, 0), 32, 0 },
+ { SLOT_SIZE(22, 0), 22, 0 },
+ { SLOT_SIZE(17, 0), 17, 0 },
+ { SLOT_SIZE(13, 0), 13, 0 },
+ { SLOT_SIZE(11, 0), 11, 0 },
+ { SLOT_SIZE(9, 0), 9, 0 },
+ { SLOT_SIZE(8, 0), 8, 0 },
+ { SLOT_SIZE(14, 1), 14, 1 },
+ { SLOT_SIZE(12, 1), 12, 1 },
+ { SLOT_SIZE(11, 1), 11, 1 },
+ { SLOT_SIZE(10, 1), 10, 1 },
+ { SLOT_SIZE(9, 1), 9, 1 },
+ { SLOT_SIZE(8, 1), 8, 1 },
+ { SLOT_SIZE(15, 2), 15, 2 },
+ { SLOT_SIZE(14, 2), 14, 2 },
+ { SLOT_SIZE(13, 2), 13, 2 },
+ { SLOT_SIZE(12, 2), 12, 2 },
+ { SLOT_SIZE(11, 2), 11, 2 },
+ { SLOT_SIZE(10, 2), 10, 2 },
+ { SLOT_SIZE(9, 2), 9, 2 },
+ { SLOT_SIZE(8, 2), 8, 2 },
+ { SLOT_SIZE(15, 3), 15, 3 },
+ { SLOT_SIZE(14, 3), 14, 3 },
+ { SLOT_SIZE(13, 3), 13, 3 },
+ { SLOT_SIZE(12, 3), 12, 3 },
+ { SLOT_SIZE(11, 3), 11, 3 },
+ { SLOT_SIZE(10, 3), 10, 3 },
+ { SLOT_SIZE(9, 3), 9, 3 },
+ { SLOT_SIZE(7, 3), 7, 3 }
+};
+
+/**
+ * struct block_list - stores metadata of particular list
+ * lock: protects the list of blocks
+ * list: linked list of blocks
+ * block_count: total number of blocks in the list
+ */
+struct block_list {
+ spinlock_t lock;
+ struct list_head list;
+ unsigned long block_count;
+};
+
+/**
+ * struct zblock_pool - stores metadata for each zblock pool
+ * @block_lists: array of block lists
+ * @zpool: zpool driver
+ *
+ * This structure is allocated at pool creation time and maintains metadata
+ * for a particular zblock pool.
+ */
+struct zblock_pool {
+ struct block_list block_lists[ARRAY_SIZE(block_desc)];
+ struct zpool *zpool;
+};
+
+
+#endif
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-04 19:28 Vitaly Wool
@ 2025-04-04 20:03 ` Johannes Weiner
2025-04-07 15:54 ` Johannes Weiner
1 sibling, 0 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-04-04 20:03 UTC (permalink / raw)
To: Vitaly Wool
Cc: linux-mm, akpm, linux-kernel, Nhat Pham, Shakeel Butt,
Igor Belousov
Hello Vitaly,
On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
> zblock is a special purpose allocator for storing compressed pages.
> It stores an integer number of compressed objects per block. These
> blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
>
> With zblock, it is possible to densely arrange objects of various sizes,
> resulting in low internal fragmentation. This allocator also tries to
> fill incomplete blocks instead of adding new ones, in many cases
> providing a compression ratio substantially higher than z3fold and zbud
> (though lower than zsmalloc's).
Do you have zswap/zswapped meminfo metrics from these tests?
> 1.1 zblock
>
> real 25m53.040s
> user 96m43.424s
> sys 4m56.652s
>
> real 25m20.748s
> user 94m24.324s
> sys 4m58.005s
>
> real 25m37.486s
> user 95m35.913s
> sys 4m55.892s
>
> 1.2 zsmalloc
>
> real 26m17.934s
> user 97m13.342s
> sys 5m2.415s
>
> real 25m50.694s
> user 95m22.065s
> sys 5m1.305s
>
> real 25m57.714s
> user 96m14.675s
> sys 4m59.081s
>
> Since zswap only comes into use from minute 21 onwards, this works out to an
> average advantage of about 9% for zblock.
This might be the linker step swapping out source files that are no
longer needed. Do you have metrics for zswpout and zswpin as well?
My concern with this allocator, and the other alternatives to zsmalloc
before, is the following:
You might be faster at allocating objects. But if storage density is
worse, it means you have to zswap more pages to meet the same incoming
allocation demand. That means more object allocations and more
compression, and often a higher rate of refaults and decompressions.
I would assume you're at least zswapping out more to make room for
whatever happens at 21 minutes. This doesn't seem to hurt the time,
which is promising and highlights how much faster zblock is.
But for the full picture, it would be good to see the reuse/zswpin
rates, especially in the context of overall swap activity.
And if there is no notable swapin rate, maybe test without tmpfs and
tune the memory allowance to force at least some degree of reuse.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
[not found] <1743810988579.7.125720@webmail-backend-production-7b88b644bb-5mmj8>
@ 2025-04-06 7:53 ` Igor Belousov
2025-04-07 9:00 ` Igor Belousov
0 siblings, 1 reply; 17+ messages in thread
From: Igor Belousov @ 2025-04-06 7:53 UTC (permalink / raw)
To: vitaly.wool
Cc: Johannes Weiner, linux-mm, akpm, linux-kernel, Nhat Pham,
Shakeel Butt
Hi Vitaly,
2025-04-05 03:56 Vitaly wrote:
>
>> Do you have zswap/zswapped meminfo metrics from these tests?
> Yep, and those look somewhat similar:
> - zblock:
> Zswap: 234128 kB
> Zswapped: 733216 kB
> - zsmalloc:
> Zswap: 286080 kB
> Zswapped: 774688 kB
I tested the kernel build on a 4-core virtual machine with allocated 4
GB RAM running on a Ryzen 9.
The results are the following:
1. make -j4:
1.1 zsmalloc:
real 10m59.689s
user 30m53.913s
sys 6m20.720s
zswpin 127811
zswpout 443914
zswpwb 764
Zswap: 292428 kB
Zswapped: 801536 kB
1.2 zblock:
real 11m1.971s
user 30m51.411s
sys 6m18.752s
zswpin 306020
zswpout 732145
zswpwb 2215
Zswap: 291016 kB
Zswapped: 741176 kB
2. make -j8:
2.1 zsmalloc
real 11m40.640s
user 33m3.675s
sys 6m28.126s
Zswap: 281336 kB
Zswapped: 785344 kB
zswpin 308624
zswpout 641576
zswpwb 2674
2.2 zblock
real 11m21.161s
user 32m21.012s
sys 5m53.864s
zswpin 207039
zswpout 621107
zswpwb 3391
Zswap: 326580 kB
Zswapped: 836088 kB
3. make -j16:
3.1 zsmalloc
real 12m42.544s
user 36m3.171s
sys 6m46.036s
Zswap: 249192 kB
Zswapped: 695664 kB
zswpin 193202
zswpout 778746
zswpwb 2611
3.2 zblock
real 12m12.276s
user 35m41.100s
sys 6m30.100s
zswpin 211179
zswpout 853612
zswpwb 2610
Zswap: 327544 kB
Zswapped: 721828 kB
We can observe that the gap between zsmalloc and zblock widens as the
number of threads increases.
>> My concern with this allocator, and the other alternatives to zsmalloc
>> before, is the following:
>> You might be faster at allocating objects. But if storage density is
>> worse, it means you have to zswap more pages to meet the same incoming
>> allocation demand. That means more object allocations and more
>> compression, and often a higher rate of refaults and decompressions.
I don't think we see a substantial difference in storage density, and with
16K pages zblock seems to do even better than zsmalloc. I wasn't able to
test 16K pages on x86_64 because the code has multiple dependencies on
PAGE_SIZE being 4K, but the testing on ARM64 does suggest that.
Besides, now that we use an rbtree to search for the right block type, we
can enlarge the table and get better matches and better storage density
without a significant impact on performance.
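For illustration only, assuming 4K pages and a 64-bit build: the current
order-0 types jump from 32 slots of ~120 bytes straight to 22 slots of ~184
bytes, so hypothetically adding intermediate entries such as

	{ SLOT_SIZE(28, 0), 28, 0 },	/* ~144-byte slots */
	{ SLOT_SIZE(25, 0), 25, 0 },	/* ~160-byte slots */

between the existing 32- and 22-slot entries would tighten the fit for
objects in that 120-184 byte range at the cost of a slightly larger table.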
Thanks,
Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-06 7:53 ` [PATCH v2] mm: add zblock allocator Igor Belousov
@ 2025-04-07 9:00 ` Igor Belousov
2025-04-07 15:51 ` Nhat Pham
0 siblings, 1 reply; 17+ messages in thread
From: Igor Belousov @ 2025-04-07 9:00 UTC (permalink / raw)
To: vitaly.wool
Cc: Johannes Weiner, linux-mm, akpm, linux-kernel, Nhat Pham,
Shakeel Butt
>>> Do you have zswap/zswapped meminfo metrics from these tests?
>> Yep, and those look somewhat similar:
>> - zblock:
>> Zswap: 234128 kB
>> Zswapped: 733216 kB
>> - zsmalloc:
>> Zswap: 286080 kB
>> Zswapped: 774688 kB
>
> I tested the kernel build on a 4-core virtual machine with allocated 4
> GB RAM running on a Ryzen 9.
>
> The results are the following:
[...]
Now what's funny is that when I tried to compare how 32 threaded build
would behave on a 8-core VM I couldn't do it because it OOMs with
zsmalloc as zswap backend. With zblock it doesn't, though, and the
results are:
real 12m14.012s
user 39m37.777s
sys 14m6.923s
Zswap: 440148 kB
Zswapped: 924452 kB
zswpin 594812
zswpout 2802454
zswpwb 10878
/Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-07 9:00 ` Igor Belousov
@ 2025-04-07 15:51 ` Nhat Pham
2025-04-07 16:44 ` Igor Belousov
0 siblings, 1 reply; 17+ messages in thread
From: Nhat Pham @ 2025-04-07 15:51 UTC (permalink / raw)
To: Igor Belousov
Cc: vitaly.wool, Johannes Weiner, linux-mm, akpm, linux-kernel,
Shakeel Butt
On Mon, Apr 7, 2025 at 2:00 AM Igor Belousov <igor.b@beldev.am> wrote:
>
>
> >>> Do you have zswap/zswapped meminfo metrics from these tests?
> >> Yep, and those look somewhat similar:
> >> - zblock:
> >> Zswap: 234128 kB
> >> Zswapped: 733216 kB
> >> - zsmalloc:
> >> Zswap: 286080 kB
> >> Zswapped: 774688 kB
> >
> > I tested the kernel build on a 4-core virtual machine with allocated 4
> > GB RAM running on a Ryzen 9.
> >
> > The results are the following:
> [...]
>
> Now what's funny is that when I tried to compare how 32 threaded build
> would behave on a 8-core VM I couldn't do it because it OOMs with
> zsmalloc as zswap backend. With zblock it doesn't, though, and the
> results are:
> real 12m14.012s
> user 39m37.777s
> sys 14m6.923s
> Zswap: 440148 kB
> Zswapped: 924452 kB
> zswpin 594812
> zswpout 2802454
> zswpwb 10878
>
> /Igor
May I ask what compression algorithm you are using?
And does the zswpwb come from zswap shrinker?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-04 19:28 Vitaly Wool
2025-04-04 20:03 ` Johannes Weiner
@ 2025-04-07 15:54 ` Johannes Weiner
2025-04-07 18:26 ` Vitaly Wool
1 sibling, 1 reply; 17+ messages in thread
From: Johannes Weiner @ 2025-04-07 15:54 UTC (permalink / raw)
To: Vitaly Wool
Cc: linux-mm, akpm, linux-kernel, Nhat Pham, Shakeel Butt,
Igor Belousov
On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
> @@ -0,0 +1,24 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +======
> +zblock
> +======
> +
> +zblock is a special purpose allocator for storing compressed pages.
> +It stores an integer number of compressed objects per block. These
> +blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
> +
> +With zblock, it is possible to densely arrange objects of various sizes,
> +resulting in low internal fragmentation. This allocator also tries to
> +fill incomplete blocks instead of adding new ones, in many cases
> +providing a compression ratio substantially higher than z3fold and zbud
> +(though lower than zsmalloc's).
This last part seems to be incorrect based on Igor's measurements.
> +zblock does not require an MMU to operate and is also superior to zsmalloc
> +with regard to average performance and worst execution times, thus
> +allowing for better response time and real-time characteristics of the
> +whole system.
Please delete the MMU comment here and in the changelog. As Nhat said,
this is a pointless advantage of a swap backend allocator, as swapping
itself requires an MMU.
Please also add that highmem is not supported.
Please also add that migration is not supported, which has a negative
impact on the kernel's ability to produce higher-order pages. This is
particularly unfortunate because zblock itself wants higher-order pages.
> +E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get a
> +5-10% higher bogo ops/s value for zblock compared to zsmalloc.
> +
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d5dfb9186962..19283e6a08eb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -26556,6 +26556,13 @@ F: Documentation/networking/device_drivers/hamradio/z8530drv.rst
> F: drivers/net/hamradio/*scc.c
> F: drivers/net/hamradio/z8530.h
>
> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
> +M: Vitaly Wool <vitaly.wool@konsulko.se>
> +L: linux-mm@kvack.org
> +S: Maintained
> +F: Documentation/mm/zblock.rst
> +F: mm/zblock.[ch]
> +
> ZBUD COMPRESSED PAGE ALLOCATOR
> M: Seth Jennings <sjenning@redhat.com>
> M: Dan Streetman <ddstreet@ieee.org>
It looks like you're using an old tree. Can you please rebase on top
of the mm tree? There have been some significant changes to how zswap
is calling into zpool since the other allocators have been deleted.
> @@ -0,0 +1,418 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * zblock.c
> + *
> + * Author: Vitaly Wool <vitaly.wool@konsulko.se>
> + * Based on the work from Ananda Badmaev <a.badmaev@clicknet.pro>
> + * Copyright (C) 2022-2025, Konsulko AB.
> + *
> + * Zblock is a small object allocator with the intention to serve as a
> + * zpool backend. It operates on page blocks which consist of number
> + * of physical pages being a power of 2 and store integer number of
> + * compressed pages per block which results in determinism and simplicity.
> + *
> + * zblock doesn't export any API and is meant to be used via zpool API.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/atomic.h>
> +#include <linux/list.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/preempt.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/zpool.h>
> +#include "zblock.h"
> +
> +static struct rb_root block_desc_tree = RB_ROOT;
> +
> +/* add a new block to the list */
> +static inline void add_block(struct zblock_block *block,
> + struct block_list *block_list)
> +{
> + list_add(&block->link, &block_list->list);
> +}
> +
> +/*
> + * Find a block with at least one free slot and claim it.
> + * We make sure that the first block, if exists, will always work.
> + */
> +static inline struct zblock_block *find_block(struct block_list *block_list)
> +{
> + struct list_head *l = &block_list->list;
> +
> + if (!list_empty(l)) {
> + struct zblock_block *z = list_first_entry(l, typeof(*z), link);
> +
> + if (z->free_slots > 0) {
> + if (--z->free_slots == 0)
> + list_move_tail(&z->link, l);
> + return z;
> + }
> + }
> + return NULL;
> +}
> +
> +/* remove block from the list */
> +static inline void remove_block(struct zblock_block *block)
> +{
> + list_del_init(&block->link);
> +}
> +
> +/* Encodes the handle of a particular slot in the pool using metadata */
> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
> + unsigned int block_type, unsigned int slot)
> +{
> + return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
> +}
> +
> +/* Returns block, block type and slot in the pool corresponding to handle */
> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
> + unsigned int *block_type, unsigned int *slot)
> +{
> + *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
> + *slot = handle & SLOT_MASK;
> + return (struct zblock_block *)(handle & PAGE_MASK);
> +}
> +
> +
> +/*
> + * allocate new block and add it to corresponding block list
> + */
> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
> + int block_type, gfp_t gfp,
> + unsigned long *handle)
> +{
> + struct zblock_block *block;
> + struct block_list *block_list;
> +
> + block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
This gfp comes from zswap, which currently does:
gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
So you might get highmem here that cannot be dereferenced and crash.
__GFP_MOVABLE is also wrong, as you're not implementing migration.
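A stopgap until highmem and migration are actually handled would be to mask
these off before calling into the page allocator. As a sketch only:

	/* sketch: strip flags zblock cannot honour yet */
	gfp &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
	block = (void *)__get_free_pages(gfp, block_desc[block_type].order);

The longer-term answer is of course proper migration support.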
> + if (!block)
> + return NULL;
> +
> + block_list = &pool->block_lists[block_type];
> +
> + /* init block data */
> + block->free_slots = block_desc[block_type].slots_per_block - 1;
> + memset(&block->slot_info, 0, sizeof(block->slot_info));
> + set_bit(0, block->slot_info);
> + *handle = metadata_to_handle(block, block_type, 0);
> +
> + spin_lock(&block_list->lock);
> + add_block(block, block_list);
Use list_add() directly here.
> + block_list->block_count++;
> + spin_unlock(&block_list->lock);
> + return block;
> +}
> +
> +/*****************
> + * API Functions
> + *****************/
> +/**
> + * zblock_create_pool() - create a new zblock pool
> + * @gfp: gfp flags when allocating the zblock pool structure
> + * @ops: user-defined operations for the zblock pool
> + *
> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
> + * failed.
> + */
> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
> +{
> + struct zblock_pool *pool;
> + struct block_list *block_list;
> + int i;
> +
> + pool = kmalloc(sizeof(struct zblock_pool), gfp);
> + if (!pool)
> + return NULL;
> +
> + /* init each block list */
> + for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
> + block_list = &pool->block_lists[i];
> + spin_lock_init(&block_list->lock);
> + INIT_LIST_HEAD(&block_list->list);
> + block_list->block_count = 0;
> + }
> + return pool;
> +}
> +
> +/**
> + * zblock_destroy_pool() - destroys an existing zblock pool
> + * @pool: the zblock pool to be destroyed
> + *
> + */
> +static void zblock_destroy_pool(struct zblock_pool *pool)
> +{
> + kfree(pool);
> +}
> +
> +
> +/**
> + * zblock_alloc() - allocates a slot of appropriate size
> + * @pool: zblock pool from which to allocate
> + * @size: size in bytes of the desired allocation
> + * @gfp: gfp flags used if the pool needs to grow
> + * @handle: handle of the new allocation
> + *
> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
> + * a new slot.
> + */
> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
> + unsigned long *handle)
> +{
> + int block_type = -1;
> + unsigned int slot;
> + struct zblock_block *block;
> + struct block_list *block_list;
> +
> + if (!size)
> + return -EINVAL;
> +
> + if (size > PAGE_SIZE)
> + return -ENOSPC;
> +
> + /* find basic block type with suitable slot size */
> + if (size < block_desc[0].slot_size)
> + block_type = 0;
> + else {
> + struct block_desc_node *block_node;
> + struct rb_node *node = block_desc_tree.rb_node;
> +
> + while (node) {
> + block_node = container_of(node, typeof(*block_node), node);
> + if (size < block_node->this_slot_size)
> + node = node->rb_left;
> + else if (size >= block_node->next_slot_size)
> + node = node->rb_right;
> + else {
> + block_type = block_node->block_idx + 1;
> + break;
> + }
> + }
> + }
> + if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
> + return -EINVAL;
> +
> + block_list = &pool->block_lists[block_type];
> +
> + spin_lock(&block_list->lock);
> + block = find_block(block_list);
> + spin_unlock(&block_list->lock);
> + if (block)
> + goto found;
> + /* not found block with free slots try to allocate new empty block */
> + block = alloc_block(pool, block_type, gfp, handle);
> + return block ? 0 : -ENOMEM;
> +
> +found:
> + /* find the first free slot in block */
> + for (slot = find_first_zero_bit(block->slot_info,
> + block_desc[block_type].slots_per_block);
> + slot < block_desc[block_type].slots_per_block;
> + slot = find_next_zero_bit(block->slot_info,
> + block_desc[block_type].slots_per_block,
> + slot)) {
> + if (!test_and_set_bit(slot, block->slot_info))
> + break;
> + barrier();
Ah, so find_block() reserves a slot in block->free_slots. You can race
with another allocation for the exact bit, but there is one bit
guaranteed to remain. Clever. This could use a comment.
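Something along these lines, just as a suggestion for the wording:

	/*
	 * find_block() already reserved a slot for us by decrementing
	 * free_slots under the list lock, so even though we may race
	 * with other allocators for any individual bit, at least one
	 * zero bit is guaranteed to remain until we claim it with
	 * test_and_set_bit() here.
	 */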
> + }
> + BUG_ON(slot >= block_desc[block_type].slots_per_block);
> + *handle = metadata_to_handle(block, block_type, slot);
> + return 0;
> +}
> +
> +/**
> + * zblock_free() - frees the allocation associated with the given handle
> + * @pool: pool in which the allocation resided
> + * @handle: handle associated with the allocation returned by zblock_alloc()
> + *
> + */
> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
> +{
> + unsigned int slot, block_type;
> + struct zblock_block *block;
> + struct block_list *block_list;
> +
> + block = handle_to_metadata(handle, &block_type, &slot);
> + block_list = &pool->block_lists[block_type];
> +
> + spin_lock(&block_list->lock);
> + /* if all slots in block are empty delete whole block */
> + if (++block->free_slots == block_desc[block_type].slots_per_block) {
> + block_list->block_count--;
> + remove_block(block);
> + spin_unlock(&block_list->lock);
> + free_pages((unsigned long)block, block_desc[block_type].order);
> + return;
> + }
> + spin_unlock(&block_list->lock);
> +
> + clear_bit(slot, block->slot_info);
The list handling here is unusual. find_block() only checks the first
block in the list. Filling a block puts it at the tail, but partially
freeing a block doesn't rotate it back to the head. AFAICS this can
lead to an ugly edge case where you have an unbounded amount of free
memory trapped in partial blocks, but you keep allocating new blocks:
You start with [new block]. Once full, list_move_tail() is a nop.
You allocate a new one, [partial]->[full].
Keep going, you can have [partial]->[full]->[full]->[full]->[full]
until it's hundreds of gigabytes.
The workload now unmaps randomly and thus partially draining many
blocks. So you have: [partial]->[full]->[partial]->... with many
gigabytes of free slots in partially used blocks.
Once the first partial block fills up, it rotates to the tail:
[full]->[partial]->...[full]
find_block() will fail. You allocate a new block, fill it, move it to
the tail. The old [full] is back at the head. Rinse, repeat.
You could make find_block() keep going, but it might have to go
through a million [full] blocks before finding a partial one.
This is the reason why other allocator schemes, like zsmalloc e.g.,
have separate partial and full lists.
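For reference, a sketch of what that could look like in zblock (hypothetical,
not something this patch implements):

	struct block_list {
		spinlock_t lock;
		struct list_head partial;	/* blocks with free slots */
		struct list_head full;		/* fully occupied blocks */
		unsigned long block_count;
	};

zblock_alloc() would then only ever scan 'partial' (moving a block to 'full'
once its last slot is claimed), and zblock_free() would move a block back to
'partial' as soon as one of its slots is released.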
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-07 15:51 ` Nhat Pham
@ 2025-04-07 16:44 ` Igor Belousov
2025-04-07 17:00 ` Nhat Pham
0 siblings, 1 reply; 17+ messages in thread
From: Igor Belousov @ 2025-04-07 16:44 UTC (permalink / raw)
To: Nhat Pham
Cc: vitaly.wool, Johannes Weiner, linux-mm, akpm, linux-kernel,
Shakeel Butt
Hi Nhat,
2025-04-07 19:51 skrev Nhat Pham:
> On Mon, Apr 7, 2025 at 2:00 AM Igor Belousov <igor.b@beldev.am> wrote:
>>
>>
>> >>> Do you have zswap/zswapped meminfo metrics from these tests?
>> >> Yep, and those look somewhat similar:
>> >> - zblock:
>> >> Zswap: 234128 kB
>> >> Zswapped: 733216 kB
>> >> - zsmalloc:
>> >> Zswap: 286080 kB
>> >> Zswapped: 774688 kB
>> >
>> > I tested the kernel build on a 4-core virtual machine with allocated 4
>> > GB RAM running on a Ryzen 9.
>> >
>> > The results are the following:
>> [...]
>>
>> Now what's funny is that when I tried to compare how 32 threaded build
>> would behave on a 8-core VM I couldn't do it because it OOMs with
>> zsmalloc as zswap backend. With zblock it doesn't, though, and the
>> results are:
>> real 12m14.012s
>> user 39m37.777s
>> sys 14m6.923s
>> Zswap: 440148 kB
>> Zswapped: 924452 kB
>> zswpin 594812
>> zswpout 2802454
>> zswpwb 10878
>>
>> /Igor
>
> May I ask what compression algorithm you are using?
It's LZ4 for all the test runs.
> And does the zswpwb come from zswap shrinker?
Haven't looked into that, to be honest.
/Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-07 16:44 ` Igor Belousov
@ 2025-04-07 17:00 ` Nhat Pham
2025-04-08 9:20 ` Igor Belousov
0 siblings, 1 reply; 17+ messages in thread
From: Nhat Pham @ 2025-04-07 17:00 UTC (permalink / raw)
To: Igor Belousov
Cc: vitaly.wool, Johannes Weiner, linux-mm, akpm, linux-kernel,
Shakeel Butt
On Mon, Apr 7, 2025 at 9:44 AM Igor Belousov <igor.b@beldev.am> wrote:
>
> Hi Nhat,
>
> 2025-04-07 19:51 skrev Nhat Pham:
> > On Mon, Apr 7, 2025 at 2:00 AM Igor Belousov <igor.b@beldev.am> wrote:
> >>
> >>
> >> >>> Do you have zswap/zswapped meminfo metrics from these tests?
> >> >> Yep, and those look somewhat similar:
> >> >> - zblock:
> >> >> Zswap: 234128 kB
> >> >> Zswapped: 733216 kB
> >> >> - zsmalloc:
> >> >> Zswap: 286080 kB
> >> >> Zswapped: 774688 kB
> >> >
> >> > I tested the kernel build on a 4-core virtual machine with allocated 4
> >> > GB RAM running on a Ryzen 9.
> >> >
> >> > The results are the following:
> >> [...]
> >>
> >> Now what's funny is that when I tried to compare how 32 threaded build
> >> would behave on a 8-core VM I couldn't do it because it OOMs with
> >> zsmalloc as zswap backend. With zblock it doesn't, though, and the
> >> results are:
> >> real 12m14.012s
> >> user 39m37.777s
> >> sys 14m6.923s
> >> Zswap: 440148 kB
> >> Zswapped: 924452 kB
> >> zswpin 594812
> >> zswpout 2802454
> >> zswpwb 10878
> >>
> >> /Igor
> >
> > May I ask what compression algorithm you are using?
>
> It's LZ4 for all the test runs.
Can you try zstd and let me know how it goes :)
>
> > And does the zswpwb come from zswap shrinker?
>
> Haven't looked into that, to be honest.
Can you check:
/sys/module/zswap/parameters/shrinker_enabled
>
> /Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-07 15:54 ` Johannes Weiner
@ 2025-04-07 18:26 ` Vitaly Wool
0 siblings, 0 replies; 17+ messages in thread
From: Vitaly Wool @ 2025-04-07 18:26 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, akpm, linux-kernel, Nhat Pham, Shakeel Butt,
Igor Belousov
Hi Johannes,
thanks for the review, some comments follow below.
On 4/7/25 17:54, Johannes Weiner wrote:
> On Fri, Apr 04, 2025 at 09:28:13PM +0200, Vitaly Wool wrote:
>> @@ -0,0 +1,24 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +======
>> +zblock
>> +======
>> +
>> +zblock is a special purpose allocator for storing compressed pages.
>> +It stores an integer number of compressed objects per block. These
>> +blocks consist of several physical pages (2**n, i.e. 1/2/4/8).
>> +
>> +With zblock, it is possible to densely arrange objects of various sizes,
>> +resulting in low internal fragmentation. This allocator also tries to
>> +fill incomplete blocks instead of adding new ones, in many cases
>> +providing a compression ratio substantially higher than z3fold and zbud
>> +(though lower than zsmalloc's).
>
> This last part seems to be incorrect based on Igor's measurements.
Well, I surely don't mind deleting that.
>> +zblock does not require an MMU to operate and is also superior to zsmalloc
>> +with regard to average performance and worst execution times, thus
>> +allowing for better response time and real-time characteristics of the
>> +whole system.
>
> Please delete the MMU comment here and in the changelog. As Nhat said,
> this is a pointless advantage of a swap backend allocator, as swapping
> itself requires an MMU.
That's correct indeed, but I still believe zblock may be a good match for
ZRAM, even better than zsmalloc, and that is where MMU independence
comes into play.
> Please also add that highmem is not supported.
>
> Please also add that migration is not supported, which has a negative
> impact on the kernel's ability to produce higher-order pages. This is
> particularly unfortunate because zblock itself wants higher-order pages.
Right. We plan to add migration support in the near future though.
>> +E.g. on a series of stress-ng tests run on a Raspberry Pi 5, we get a
>> +5-10% higher bogo ops/s value for zblock compared to zsmalloc.
>> +
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index d5dfb9186962..19283e6a08eb 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -26556,6 +26556,13 @@ F: Documentation/networking/device_drivers/hamradio/z8530drv.rst
>> F: drivers/net/hamradio/*scc.c
>> F: drivers/net/hamradio/z8530.h
>>
>> +ZBLOCK COMPRESSED SLAB MEMORY ALLOCATOR
>> +M: Vitaly Wool <vitaly.wool@konsulko.se>
>> +L: linux-mm@kvack.org
>> +S: Maintained
>> +F: Documentation/mm/zblock.rst
>> +F: mm/zblock.[ch]
>> +
>> ZBUD COMPRESSED PAGE ALLOCATOR
>> M: Seth Jennings <sjenning@redhat.com>
>> M: Dan Streetman <ddstreet@ieee.org>
>
> It looks like you're using an old tree. Can you please rebase on top
> of the mm tree? There have been some significant changes to how zswap
> is calling into zpool since the other allocators have been deleted.
>
>> @@ -0,0 +1,418 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * zblock.c
>> + *
>> + * Author: Vitaly Wool <vitaly.wool@konsulko.se>
>> + * Based on the work from Ananda Badmaev <a.badmaev@clicknet.pro>
>> + * Copyright (C) 2022-2025, Konsulko AB.
>> + *
>> + * Zblock is a small object allocator with the intention to serve as a
>> + * zpool backend. It operates on page blocks which consist of number
>> + * of physical pages being a power of 2 and store integer number of
>> + * compressed pages per block which results in determinism and simplicity.
>> + *
>> + * zblock doesn't export any API and is meant to be used via zpool API.
>> + */
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/list.h>
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +#include <linux/preempt.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/zpool.h>
>> +#include "zblock.h"
>> +
>> +static struct rb_root block_desc_tree = RB_ROOT;
>> +
>> +/* add a new block to the list */
>> +static inline void add_block(struct zblock_block *block,
>> + struct block_list *block_list)
>> +{
>> + list_add(&block->link, &block_list->list);
>> +}
>> +
>> +/*
>> + * Find a block with at least one free slot and claim it.
>> + * We make sure that the first block, if exists, will always work.
>> + */
>> +static inline struct zblock_block *find_block(struct block_list *block_list)
>> +{
>> + struct list_head *l = &block_list->list;
>> +
>> + if (!list_empty(l)) {
>> + struct zblock_block *z = list_first_entry(l, typeof(*z), link);
>> +
>> + if (z->free_slots > 0) {
>> + if (--z->free_slots == 0)
>> + list_move_tail(&z->link, l);
>> + return z;
>> + }
>> + }
>> + return NULL;
>> +}
>> +
>> +/* remove block from the list */
>> +static inline void remove_block(struct zblock_block *block)
>> +{
>> + list_del_init(&block->link);
>> +}
>> +
>> +/* Encodes the handle of a particular slot in the pool using metadata */
>> +static inline unsigned long metadata_to_handle(struct zblock_block *block,
>> + unsigned int block_type, unsigned int slot)
>> +{
>> + return (unsigned long)(block) | (block_type << SLOT_BITS) | slot;
>> +}
>> +
>> +/* Returns block, block type and slot in the pool corresponding to handle */
>> +static inline struct zblock_block *handle_to_metadata(unsigned long handle,
>> + unsigned int *block_type, unsigned int *slot)
>> +{
>> + *block_type = (handle & (PAGE_SIZE - 1)) >> SLOT_BITS;
>> + *slot = handle & SLOT_MASK;
>> + return (struct zblock_block *)(handle & PAGE_MASK);
>> +}
>> +
>> +
>> +/*
>> + * allocate new block and add it to corresponding block list
>> + */
>> +static struct zblock_block *alloc_block(struct zblock_pool *pool,
>> + int block_type, gfp_t gfp,
>> + unsigned long *handle)
>> +{
>> + struct zblock_block *block;
>> + struct block_list *block_list;
>> +
>> + block = (void *)__get_free_pages(gfp, block_desc[block_type].order);
>
> This gfp comes from zswap, which currently does:
>
> gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
>
> So you might get highmem here that cannot be dereferenced and crash.
>
> __GFP_MOVABLE is also wrong, as you're not implementing migration.
Ack.
>> + if (!block)
>> + return NULL;
>> +
>> + block_list = &pool->block_lists[block_type];
>> +
>> + /* init block data */
>> + block->free_slots = block_desc[block_type].slots_per_block - 1;
>> + memset(&block->slot_info, 0, sizeof(block->slot_info));
>> + set_bit(0, block->slot_info);
>> + *handle = metadata_to_handle(block, block_type, 0);
>> +
>> + spin_lock(&block_list->lock);
>> + add_block(block, block_list);
>
> Use list_add() directly here.
This is probably too small a thing to argue about, but I would rather
keep add_block() and move the block_count increment into it. Would you agree?
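I.e. something like this (sketch of what I have in mind):

	static inline void add_block(struct zblock_block *block,
				     struct block_list *block_list)
	{
		/* caller holds block_list->lock */
		list_add(&block->link, &block_list->list);
		block_list->block_count++;
	}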
>> + block_list->block_count++;
>> + spin_unlock(&block_list->lock);
>> + return block;
>> +}
>> +
>> +/*****************
>> + * API Functions
>> + *****************/
>> +/**
>> + * zblock_create_pool() - create a new zblock pool
>> + * @gfp: gfp flags when allocating the zblock pool structure
>> + * @ops: user-defined operations for the zblock pool
>> + *
>> + * Return: pointer to the new zblock pool or NULL if the metadata allocation
>> + * failed.
>> + */
>> +static struct zblock_pool *zblock_create_pool(gfp_t gfp)
>> +{
>> + struct zblock_pool *pool;
>> + struct block_list *block_list;
>> + int i;
>> +
>> + pool = kmalloc(sizeof(struct zblock_pool), gfp);
>> + if (!pool)
>> + return NULL;
>> +
>> + /* init each block list */
>> + for (i = 0; i < ARRAY_SIZE(block_desc); i++) {
>> + block_list = &pool->block_lists[i];
>> + spin_lock_init(&block_list->lock);
>> + INIT_LIST_HEAD(&block_list->list);
>> + block_list->block_count = 0;
>> + }
>> + return pool;
>> +}
>> +
>> +/**
>> + * zblock_destroy_pool() - destroys an existing zblock pool
>> + * @pool: the zblock pool to be destroyed
>> + *
>> + */
>> +static void zblock_destroy_pool(struct zblock_pool *pool)
>> +{
>> + kfree(pool);
>> +}
>> +
>> +
>> +/**
>> + * zblock_alloc() - allocates a slot of appropriate size
>> + * @pool: zblock pool from which to allocate
>> + * @size: size in bytes of the desired allocation
>> + * @gfp: gfp flags used if the pool needs to grow
>> + * @handle: handle of the new allocation
>> + *
>> + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or
>> + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
>> + * a new slot.
>> + */
>> +static int zblock_alloc(struct zblock_pool *pool, size_t size, gfp_t gfp,
>> + unsigned long *handle)
>> +{
>> + int block_type = -1;
>> + unsigned int slot;
>> + struct zblock_block *block;
>> + struct block_list *block_list;
>> +
>> + if (!size)
>> + return -EINVAL;
>> +
>> + if (size > PAGE_SIZE)
>> + return -ENOSPC;
>> +
>> + /* find basic block type with suitable slot size */
>> + if (size < block_desc[0].slot_size)
>> + block_type = 0;
>> + else {
>> + struct block_desc_node *block_node;
>> + struct rb_node *node = block_desc_tree.rb_node;
>> +
>> + while (node) {
>> + block_node = container_of(node, typeof(*block_node), node);
>> + if (size < block_node->this_slot_size)
>> + node = node->rb_left;
>> + else if (size >= block_node->next_slot_size)
>> + node = node->rb_right;
>> + else {
>> + block_type = block_node->block_idx + 1;
>> + break;
>> + }
>> + }
>> + }
>> + if (WARN_ON(block_type < 0 || block_type >= ARRAY_SIZE(block_desc)))
>> + return -EINVAL;
>> +
>> + block_list = &pool->block_lists[block_type];
>> +
>> + spin_lock(&block_list->lock);
>> + block = find_block(block_list);
>> + spin_unlock(&block_list->lock);
>> + if (block)
>> + goto found;
>> + /* not found block with free slots try to allocate new empty block */
>> + block = alloc_block(pool, block_type, gfp, handle);
>> + return block ? 0 : -ENOMEM;
>> +
>> +found:
>> + /* find the first free slot in block */
>> + for (slot = find_first_zero_bit(block->slot_info,
>> + block_desc[block_type].slots_per_block);
>> + slot < block_desc[block_type].slots_per_block;
>> + slot = find_next_zero_bit(block->slot_info,
>> + block_desc[block_type].slots_per_block,
>> + slot)) {
>> + if (!test_and_set_bit(slot, block->slot_info))
>> + break;
>> + barrier();
>
> Ah, so find_block() reserves a slot in block->free_slots. You can race
> with another allocation for the exact bit, but there is one bit
> guaranteed to remain. Clever. This could use a comment.
Will do.
>> + }
>> + BUG_ON(slot >= block_desc[block_type].slots_per_block);
>> + *handle = metadata_to_handle(block, block_type, slot);
>> + return 0;
>> +}
>> +
>> +/**
>> + * zblock_free() - frees the allocation associated with the given handle
>> + * @pool: pool in which the allocation resided
>> + * @handle: handle associated with the allocation returned by zblock_alloc()
>> + *
>> + */
>> +static void zblock_free(struct zblock_pool *pool, unsigned long handle)
>> +{
>> + unsigned int slot, block_type;
>> + struct zblock_block *block;
>> + struct block_list *block_list;
>> +
>> + block = handle_to_metadata(handle, &block_type, &slot);
>> + block_list = &pool->block_lists[block_type];
>> +
>> + spin_lock(&block_list->lock);
>> + /* if all slots in block are empty delete whole block */
>> + if (++block->free_slots == block_desc[block_type].slots_per_block) {
>> + block_list->block_count--;
>> + remove_block(block);
>> + spin_unlock(&block_list->lock);
>> + free_pages((unsigned long)block, block_desc[block_type].order);
>> + return;
>> + }
>> + spin_unlock(&block_list->lock);
>> +
>> + clear_bit(slot, block->slot_info);
>
> The list handling here is unusual. find_block() only checks the first
> block in the list. Filling a block puts it at the tail, but partially
> freeing a block doesn't rotate it back to the head. AFAICS this can
> lead to an ugly edge case where you have an unbounded amount of free
> memory trapped in partial blocks, but you keep allocating new blocks:
>
> You start with [new block]. Once full, list_move_tail() is a nop.
>
> You allocate a new one, [partial]->[full].
>
> Keep going, you can have [partial]->[full]->[full]->[full]->[full]
> until it's hundreds of gigabytes.
>
> The workload now unmaps randomly and thus partially draining many
> blocks. So you have: [partial]->[full]->[partial]->... with many
> gigabytes of free slots in partially used blocks.
>
> Once the first partial block fills up, it rotates to the tail:
> [full]->[partial]->...[full]
>
> find_block() will fail. You allocate a new block, fill it, move it to
> the tail. The old [full] is back at the head. Rinse, repeat.
>
> You could make find_block() keep going, but it might have to go
> through a million [full] blocks before finding a partial one.
>
> This is the reason why other allocator schemes, like zsmalloc e.g.,
> have separate partial and full lists.
The initial variant of zblock_free() had a list_move() for that very
purpose, but since we hadn't hit this corner case in our testing so far,
I decided to remove it. As an interesting coincidence, I seem to have hit
it minutes before I received your response, so I've restored it and am
now checking whether that is enough.
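For reference, the restored variant looks roughly like this (a sketch from
my working tree; the list-head field names may end up different in v3):

static void zblock_free(struct zblock_pool *pool, unsigned long handle)
{
	unsigned int slot, block_type;
	struct zblock_block *block;
	struct block_list *block_list;

	block = handle_to_metadata(handle, &block_type, &slot);
	block_list = &pool->block_lists[block_type];

	spin_lock(&block_list->lock);
	/* if all slots in the block are empty, delete the whole block */
	if (++block->free_slots == block_desc[block_type].slots_per_block) {
		block_list->block_count--;
		remove_block(block);
		spin_unlock(&block_list->lock);
		free_pages((unsigned long)block, block_desc[block_type].order);
		return;
	}
	/*
	 * The block has a free slot again: rotate it to the head of
	 * the list so that find_block() sees it before the full blocks
	 * and free space in partial blocks doesn't stay trapped behind
	 * them.
	 */
	list_move(&block->link, &block_list->list);
	spin_unlock(&block_list->lock);

	clear_bit(slot, block->slot_info);
}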
~Vitaly
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-07 17:00 ` Nhat Pham
@ 2025-04-08 9:20 ` Igor Belousov
2025-04-08 19:55 ` Johannes Weiner
0 siblings, 1 reply; 17+ messages in thread
From: Igor Belousov @ 2025-04-08 9:20 UTC (permalink / raw)
To: Nhat Pham
Cc: vitaly.wool, Johannes Weiner, linux-mm, akpm, linux-kernel,
Shakeel Butt
>> >> >>> Do you have zswap/zswapped meminfo metrics from these tests?
>> >> >> Yep, and those look somewhat similar:
>> >> >> - zblock:
>> >> >> Zswap: 234128 kB
>> >> >> Zswapped: 733216 kB
>> >> >> - zsmalloc:
>> >> >> Zswap: 286080 kB
>> >> >> Zswapped: 774688 kB
>> >> >
>> >> > I tested the kernel build on a 4-core virtual machine with allocated 4
>> >> > GB RAM running on a Ryzen 9.
>> >> >
>> >> > The results are the following:
>> >> [...]
>> >>
>> >> Now what's funny is that when I tried to compare how 32 threaded build
>> >> would behave on a 8-core VM I couldn't do it because it OOMs with
>> >> zsmalloc as zswap backend. With zblock it doesn't, though, and the
>> >> results are:
>> >> real 12m14.012s
>> >> user 39m37.777s
>> >> sys 14m6.923s
>> >> Zswap: 440148 kB
>> >> Zswapped: 924452 kB
>> >> zswpin 594812
>> >> zswpout 2802454
>> >> zswpwb 10878
>> >>
>> >> /Igor
>> >
>> > May I ask what compression algorithm you are using?
>>
>> It's LZ4 for all the test runs.
>
> Can you try zstd and let me know how it goes :)
Sure. zstd/8 cores/make -j32:
zsmalloc:
real 7m36.413s
user 38m0.481s
sys 7m19.108s
Zswap: 211028 kB
Zswapped: 925904 kB
zswpin 397851
zswpout 1625707
zswpwb 5126
zblock:
real 7m55.009s
user 39m23.147s
sys 7m44.004s
Zswap: 253068 kB
Zswapped: 919956 kB
zswpin 456843
zswpout 2058963
zswpwb 3921
>> > And does the zswpwb come from zswap shrinker?
>>
>> Haven't looked into that, to be honest.
>
> Can you check:
>
> /sys/module/zswap/parameters/shrinker_enabled
It's 'Y' so the answer is yes.
/Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-08 9:20 ` Igor Belousov
@ 2025-04-08 19:55 ` Johannes Weiner
2025-04-08 21:11 ` Nhat Pham
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Johannes Weiner @ 2025-04-08 19:55 UTC (permalink / raw)
To: Igor Belousov
Cc: Nhat Pham, vitaly.wool, linux-mm, akpm, linux-kernel,
Shakeel Butt, Yosry Ahmed
On Tue, Apr 08, 2025 at 01:20:11PM +0400, Igor Belousov wrote:
> >> >> Now what's funny is that when I tried to compare how 32 threaded build
> >> >> would behave on a 8-core VM I couldn't do it because it OOMs with
> >> >> zsmalloc as zswap backend. With zblock it doesn't, though, and the
> >> >> results are:
> >> >> real 12m14.012s
> >> >> user 39m37.777s
> >> >> sys 14m6.923s
> >> >> Zswap: 440148 kB
> >> >> Zswapped: 924452 kB
> >> >> zswpin 594812
> >> >> zswpout 2802454
> >> >> zswpwb 10878
> >>
> >> It's LZ4 for all the test runs.
> >
> > Can you try zstd and let me know how it goes :)
>
> Sure. zstd/8 cores/make -j32:
>
> zsmalloc:
> real 7m36.413s
> user 38m0.481s
> sys 7m19.108s
> Zswap: 211028 kB
> Zswapped: 925904 kB
> zswpin 397851
> zswpout 1625707
> zswpwb 5126
>
> zblock:
> real 7m55.009s
> user 39m23.147s
> sys 7m44.004s
> Zswap: 253068 kB
> Zswapped: 919956 kB
> zswpin 456843
> zswpout 2058963
> zswpwb 3921
So zstd results in nearly double the compression ratio, which in turn
cuts total execution time *almost in half*.
The numbers speak for themselves. Compression efficiency >>> allocator
speed, because compression efficiency ultimately drives the continuous
*rate* at which allocations need to occur. You're trying to optimize a
constant coefficient at the expense of a higher-order one, which is a
losing proposition.
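(To put numbers on it: with LZ4 the zblock run above compressed 924452 kB
of anonymous memory into 440148 kB, roughly 2.1:1, and the build took
12m14s of real time; with zstd the ratio is about 3.6:1 for zblock and
4.4:1 for zsmalloc, and the build finishes in well under 8 minutes.)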
This is a general NAK from me on any new allocators that cannot match
or outdo zsmalloc storage density in common scenarios. I'm sorry, but
I really don't see any reason to do this.
We also should probably make zstd the zswap default.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-08 19:55 ` Johannes Weiner
@ 2025-04-08 21:11 ` Nhat Pham
2025-04-08 21:38 ` Vitaly Wool
2025-04-09 17:59 ` Igor Belousov
2 siblings, 0 replies; 17+ messages in thread
From: Nhat Pham @ 2025-04-08 21:11 UTC (permalink / raw)
To: Johannes Weiner
Cc: Igor Belousov, vitaly.wool, linux-mm, akpm, linux-kernel,
Shakeel Butt, Yosry Ahmed
On Tue, Apr 8, 2025 at 12:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 08, 2025 at 01:20:11PM +0400, Igor Belousov wrote:
>
> So zstd results in nearly double the compression ratio, which in turn
> cuts total execution time *almost in half*.
>
> The numbers speak for themselves. Compression efficiency >>> allocator
> speed, because compression efficiency ultimately drives the continuous
Yeah, a good compression ratio == better performance, assuming we have an
allocator that can ensure good enough storage density to take advantage
of the compression ratio. I think the experiments show that.
Upstream-speaking, we don't need the no-MMU use case, so with this I
struggle to see the point of including this new allocator.
> *rate* at which allocations need to occur. You're trying to optimize a
> constant coefficient at the expense of a higher-order one, which is a
> losing proposition.
>
> This is a general NAK from me on any new allocators that cannot match
> or outdo zsmalloc storage density in common scenarios. I'm sorry, but
> I really don't see any reason to do this.
I'll wait for Igor's and Vitaly's response, but this is a preliminary
NAK from me too, I suppose.
>
> We also should probably make zstd the zswap default.
Agree.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-08 19:55 ` Johannes Weiner
2025-04-08 21:11 ` Nhat Pham
@ 2025-04-08 21:38 ` Vitaly Wool
2025-04-08 22:05 ` Nhat Pham
2025-04-09 17:59 ` Igor Belousov
2 siblings, 1 reply; 17+ messages in thread
From: Vitaly Wool @ 2025-04-08 21:38 UTC (permalink / raw)
To: Johannes Weiner, Igor Belousov
Cc: Nhat Pham, linux-mm, akpm, linux-kernel, Shakeel Butt,
Yosry Ahmed
On 4/8/25 21:55, Johannes Weiner wrote:
> On Tue, Apr 08, 2025 at 01:20:11PM +0400, Igor Belousov wrote:
>>>>>> Now what's funny is that when I tried to compare how 32 threaded build
>>>>>> would behave on a 8-core VM I couldn't do it because it OOMs with
>>>>>> zsmalloc as zswap backend. With zblock it doesn't, though, and the
>>>>>> results are:
>>>>>> real 12m14.012s
>>>>>> user 39m37.777s
>>>>>> sys 14m6.923s
>>>>>> Zswap: 440148 kB
>>>>>> Zswapped: 924452 kB
>>>>>> zswpin 594812
>>>>>> zswpout 2802454
>>>>>> zswpwb 10878
>>>>
>>>> It's LZ4 for all the test runs.
>>>
>>> Can you try zstd and let me know how it goes :)
>>
>> Sure. zstd/8 cores/make -j32:
>>
>> zsmalloc:
>> real 7m36.413s
>> user 38m0.481s
>> sys 7m19.108s
>> Zswap: 211028 kB
>> Zswapped: 925904 kB
>> zswpin 397851
>> zswpout 1625707
>> zswpwb 5126
>>
>> zblock:
>> real 7m55.009s
>> user 39m23.147s
>> sys 7m44.004s
>> Zswap: 253068 kB
>> Zswapped: 919956 kB
>> zswpin 456843
>> zswpout 2058963
>> zswpwb 3921
>
> So zstd results in nearly double the compression ratio, which in turn
> cuts total execution time *almost in half*.
>
> The numbers speak for themselves. Compression efficiency >>> allocator
> speed, because compression efficiency ultimately drives the continuous
> *rate* at which allocations need to occur. You're trying to optimize a
> constant coefficient at the expense of a higher-order one, which is a
> losing proposition.
Well, not really. This is an isolated use case with
a. significant computing power under the hood
b. relatively few cores
c. relatively short test
d. 4K pages
If any of these isn't true, zblock dominates.
!a => zstd is too slow
!b => parallelization gives more effect
!c => zsmalloc starts losing due to having to deal with internal
fragmentation
!d => compression efficiency of zblock is better.
Even !d alone makes zblock a better choice for ARM64 based servers.
~Vitaly
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-08 21:38 ` Vitaly Wool
@ 2025-04-08 22:05 ` Nhat Pham
2025-04-08 23:12 ` Vitaly Wool
0 siblings, 1 reply; 17+ messages in thread
From: Nhat Pham @ 2025-04-08 22:05 UTC (permalink / raw)
To: Vitaly Wool
Cc: Johannes Weiner, Igor Belousov, linux-mm, akpm, linux-kernel,
Shakeel Butt, Yosry Ahmed
On Tue, Apr 8, 2025 at 2:38 PM Vitaly Wool <vitaly.wool@konsulko.se> wrote:
>
>
>
> On 4/8/25 21:55, Johannes Weiner wrote:
> > On Tue, Apr 08, 2025 at 01:20:11PM +0400, Igor Belousov wrote:
> >>>>>> Now what's funny is that when I tried to compare how 32 threaded build
> >>>>>> would behave on a 8-core VM I couldn't do it because it OOMs with
> >>>>>> zsmalloc as zswap backend. With zblock it doesn't, though, and the
> >>>>>> results are:
> >>>>>> real 12m14.012s
> >>>>>> user 39m37.777s
> >>>>>> sys 14m6.923s
> >>>>>> Zswap: 440148 kB
> >>>>>> Zswapped: 924452 kB
> >>>>>> zswpin 594812
> >>>>>> zswpout 2802454
> >>>>>> zswpwb 10878
> >>>>
> >>>> It's LZ4 for all the test runs.
> >>>
> >>> Can you try zstd and let me know how it goes :)
> >>
> >> Sure. zstd/8 cores/make -j32:
> >>
> >> zsmalloc:
> >> real 7m36.413s
> >> user 38m0.481s
> >> sys 7m19.108s
> >> Zswap: 211028 kB
> >> Zswapped: 925904 kB
> >> zswpin 397851
> >> zswpout 1625707
> >> zswpwb 5126
> >>
> >> zblock:
> >> real 7m55.009s
> >> user 39m23.147s
> >> sys 7m44.004s
> >> Zswap: 253068 kB
> >> Zswapped: 919956 kB
> >> zswpin 456843
> >> zswpout 2058963
> >> zswpwb 3921
> >
> > So zstd results in nearly double the compression ratio, which in turn
> > cuts total execution time *almost in half*.
> >
> > The numbers speak for themselves. Compression efficiency >>> allocator
> > speed, because compression efficiency ultimately drives the continuous
> > *rate* at which allocations need to occur. You're trying to optimize a
> > constant coefficient at the expense of a higher-order one, which is a
> > losing proposition.
>
> Well, not really. This is an isolated use case with
> a. significant computing power under the hood
> b. relatively few cores
> c. relatively short test
> d. 4K pages
>
> If any of these isn't true, zblock dominates.
> !a => zstd is too slow
> !b => parallelization gives more effect
> !c => zsmalloc starts losing due to having to deal with internal
> fragmentation
> !d => compression efficiency of zblock is better.
>
> Even !d alone makes zblock a better choice for ARM64 based servers.
>
> ~Vitaly
Could you expand on each point? And do you have data to show this?
For b, we run zswap + zsmalloc on hosts with hundreds of cores, and
have not found zsmalloc to be a noticeable bottleneck yet, FWIW.
For c - in longer runs, how does zblock perform better than zsmalloc?
In fact, my understanding is that zsmalloc does compaction, which
should help with internal fragmentation over time. zblock doesn't seem
to do this, or maybe I missed it?
For d too: I see that you hard-code special configurations for zblock
blocks in the case of 0x4000 page size, but how does that help with
compression efficiency?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-08 22:05 ` Nhat Pham
@ 2025-04-08 23:12 ` Vitaly Wool
0 siblings, 0 replies; 17+ messages in thread
From: Vitaly Wool @ 2025-04-08 23:12 UTC (permalink / raw)
To: Nhat Pham
Cc: Johannes Weiner, Igor Belousov, linux-mm, akpm, linux-kernel,
Shakeel Butt, Yosry Ahmed
>>> So zstd results in nearly double the compression ratio, which in turn
>>> cuts total execution time *almost in half*.
>>>
>>> The numbers speak for themselves. Compression efficiency >>> allocator
>>> speed, because compression efficiency ultimately drives the continuous
>>> *rate* at which allocations need to occur. You're trying to optimize a
>>> constant coefficient at the expense of a higher-order one, which is a
>>> losing proposition.
>>
>> Well, not really. This is an isolated use case with
>> a. significant computing power under the hood
>> b. relatively few cores
>> c. relatively short test
>> d. 4K pages
>>
>> If any of these isn't true, zblock dominates.
>> !a => zstd is too slow
>> !b => parallelization gives more effect
>> !c => zsmalloc starts losing due to having to deal with internal
>> fragmentation
>> !d => compression efficiency of zblock is better.
>>
>> Even !d alone makes zblock a better choice for ARM64 based servers.
>>
>> ~Vitaly
>
> Could you expand on each point? And do you have data to show this?
>
> For b, we run zswap + zsmalloc on hosts with hundreds of cores, and
> have not found zsmalloc to be a noticeable bottleneck yet, FWIW.
I don't have the numbers at hand; I think Igor will be able to provide
those tomorrow.
> For c - in longer runs, how does zblock perform better than zsmalloc?
> In fact, my understanding is that zsmalloc does compaction, which
> should help with internal fragmentation over time. zblock doesn't seem
> to do this, or maybe I missed it?
The thing is, zblock doesn't have to. Imagine a street with cars parked
along the side. If cars of different lengths keep driving in and out,
you'll end up with gaps in between that longer cars won't be able to
squeeze into. This is why zsmalloc does compaction.
With zblock, only cars of the same length are allowed to park on any
given street, so a street is either full or has a spot that fits.
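To put the analogy in code terms, each "street" corresponds to one block
type, roughly like this (illustrative field names and numbers only, not
the actual table from the patch):

static const struct block_desc {
	unsigned int slot_size;		/* every "car" on this street has the same length */
	unsigned int slots_per_block;	/* how many of them fit in one block */
	unsigned int order;		/* the block itself is 2^order pages */
} block_desc[] = {
	{ .slot_size =  256, .slots_per_block = 16, .order = 0 },
	{ .slot_size =  512, .slots_per_block =  8, .order = 0 },
	{ .slot_size = 1024, .slots_per_block =  8, .order = 1 },
	/* ... */
};

Since every slot in a given block has the same size, freeing and
reallocating objects can never leave a gap that is too small to reuse, so
there is simply nothing to compact.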
> For d too. I see that you hard code special configurations for zblock
> blocks in the case of 0x4000 page size, but how does that help with
> compression efficiency?
Well, to answer that properly I'd need to dig more into how zsmalloc
operates, but I would guess that zsmalloc's chunk sizes are simply
multiplied by 4 in the 16K page case, so you lose all the granularity you
used to have. I'm not completely certain, though.
Meanwhile, I did a quick measurement run with zblock and zsmalloc on a
Raspberry Pi 5 (native kernel build test), with zstd as the compression
backend. The results are the following:
1. zsmalloc
*** The build was OOM killed ***
real 26m58.876s
user 95m32.425s
sys 4m39.017s
Zswap: 250944 kB
Zswapped: 871536 kB
zswpin 108
zswpout 54473
663296 /mnt/tmp/build/
2. zblock
real 27m31.579s
user 96m42.845s
sys 4m40.464s
Zswap: 66592 kB
Zswapped: 563168 kB
zswpin 243
zswpout 35262
1423200 /mnt/tmp/build/
You can see from the size of the build folder that the first run was
terminated prematurely, nowhere near the end of the build.
So I can re-run the tests on an 8-core high-performance ARM64 machine
with 16K pages tomorrow, but so far everything we have seen points in one
direction: zblock is clearly superior to zsmalloc in the 16K page
configuration.
Besides, zblock can do even better if we extend the hardcoded table you
mentioned (and BTW, it could be generated automatically at init, but I
don't see the point in that).
~Vitaly
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-08 19:55 ` Johannes Weiner
2025-04-08 21:11 ` Nhat Pham
2025-04-08 21:38 ` Vitaly Wool
@ 2025-04-09 17:59 ` Igor Belousov
2025-04-10 7:02 ` Igor Belousov
2 siblings, 1 reply; 17+ messages in thread
From: Igor Belousov @ 2025-04-09 17:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Nhat Pham, vitaly.wool, linux-mm, akpm, linux-kernel,
Shakeel Butt, Yosry Ahmed
Hi Johannes,
>> Sure. zstd/8 cores/make -j32:
>>
>> zsmalloc:
>> real 7m36.413s
>> user 38m0.481s
>> sys 7m19.108s
>> Zswap: 211028 kB
>> Zswapped: 925904 kB
>> zswpin 397851
>> zswpout 1625707
>> zswpwb 5126
>>
>> zblock:
>> real 7m55.009s
>> user 39m23.147s
>> sys 7m44.004s
>> Zswap: 253068 kB
>> Zswapped: 919956 kB
>> zswpin 456843
>> zswpout 2058963
>> zswpwb 3921
>
> So zstd results in nearly double the compression ratio, which in turn
> cuts total execution time *almost in half*.
>
> The numbers speak for themselves. Compression efficiency >>> allocator
> speed, because compression efficiency ultimately drives the continuous
> *rate* at which allocations need to occur. You're trying to optimize a
> constant coefficient at the expense of a higher-order one, which is a
> losing proposition.
Actually, there was a slight bug in the zblock code for the 4K page case
which caused storage inefficiency for small (i.e. well-compressed) memory
blocks. With that one fixed, the results look a lot brighter for zblock:
1. zblock/zstd/8 cores/make -j32 bzImage
real 7m28.290s
user 37m27.055s
sys 7m18.629s
Zswap: 221516 kB
Zswapped: 904104 kB
zswpin 425424
zswpout 2011503
zswpwb 4111
2a. zblock/zstd/16 cores/make -j16 modules
real 15m53.119s
user 199m45.722s
sys 36m21.544s
zswpin 26600
zswpout 287021
zswpwb 0
Zswap: 205908 kB
Zswapped: 858516 kB
2b. zsmalloc/zstd/16 cores/make -j16 modules
real 16m31.052s
user 207m3.612s
sys 37m49.891s
zswpin 27044
zswpout 296763
zswpwb 61
Zswap: 198740 kB
Zswapped: 868020 kB
So what we see is:
* on 4K pages, zblock matches zsmalloc with regard to storage density on
longer tests and gives better performance in virtually all scenarios
* on 16K pages, zblock is superior to zsmalloc in terms of both
performance and storage density.
> This is a general NAK from me on any new allocators that cannot match
> or outdo zsmalloc storage density in common scenarios. I'm sorry, but
> I really don't see any reason to do this.
Given the above, I sincerely hope that you reconsider.
Thanks,
Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] mm: add zblock allocator
2025-04-09 17:59 ` Igor Belousov
@ 2025-04-10 7:02 ` Igor Belousov
0 siblings, 0 replies; 17+ messages in thread
From: Igor Belousov @ 2025-04-10 7:02 UTC (permalink / raw)
To: Johannes Weiner
Cc: Nhat Pham, vitaly.wool, linux-mm, akpm, linux-kernel,
Shakeel Butt, Yosry Ahmed
> Hi Johannes,
>
>>> Sure. zstd/8 cores/make -j32:
>>>
>>> zsmalloc:
>>> real 7m36.413s
>>> user 38m0.481s
>>> sys 7m19.108s
>>> Zswap: 211028 kB
>>> Zswapped: 925904 kB
>>> zswpin 397851
>>> zswpout 1625707
>>> zswpwb 5126
>>>
>>> zblock:
>>> real 7m55.009s
>>> user 39m23.147s
>>> sys 7m44.004s
>>> Zswap: 253068 kB
>>> Zswapped: 919956 kB
>>> zswpin 456843
>>> zswpout 2058963
>>> zswpwb 3921
>>
>> So zstd results in nearly double the compression ratio, which in turn
>> cuts total execution time *almost in half*.
>>
>> The numbers speak for themselves. Compression efficiency >>> allocator
>> speed, because compression efficiency ultimately drives the continuous
>> *rate* at which allocations need to occur. You're trying to optimize a
>> constant coefficient at the expense of a higher-order one, which is a
>> losing proposition.
>
> Actually, there was a slight bug in the zblock code for the 4K page case
> which caused storage inefficiency for small (i.e. well-compressed)
> memory blocks. With that one fixed, the results look a lot brighter for
> zblock:
>
> 1. zblock/zstd/8 cores/make -j32 bzImage
> real 7m28.290s
> user 37m27.055s
> sys 7m18.629s
> Zswap: 221516 kB
> Zswapped: 904104 kB
> zswpin 425424
> zswpout 2011503
> zswpwb 4111
For the sake of completeness I re-ran that test with the bugfix and LZ4
(so, zblock/lz4/8 cores/make -j32 bzImage) and I got:
real 7m44.154s
user 38m26.645s
sys 7m38.302s
zswpin 648108
zswpout 2490449
zswpwb 9499
So there's *no* significant cut in execution time with zstd, even on a
Ryzen 9, and that invalidates your point. Sorry for the earlier confusion;
it was an honest mistake on our side. If zsmalloc hadn't OOMed with LZ4,
we probably would have seen the discrepancy and found the bug earlier.
And on the ARM64 and RISC-V targets we have run the tests on, zstd is
slower than LZ4.
/Igor
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-04-10 7:02 UTC | newest]
Thread overview: 17+ messages
[not found] <1743810988579.7.125720@webmail-backend-production-7b88b644bb-5mmj8>
2025-04-06 7:53 ` [PATCH v2] mm: add zblock allocator Igor Belousov
2025-04-07 9:00 ` Igor Belousov
2025-04-07 15:51 ` Nhat Pham
2025-04-07 16:44 ` Igor Belousov
2025-04-07 17:00 ` Nhat Pham
2025-04-08 9:20 ` Igor Belousov
2025-04-08 19:55 ` Johannes Weiner
2025-04-08 21:11 ` Nhat Pham
2025-04-08 21:38 ` Vitaly Wool
2025-04-08 22:05 ` Nhat Pham
2025-04-08 23:12 ` Vitaly Wool
2025-04-09 17:59 ` Igor Belousov
2025-04-10 7:02 ` Igor Belousov
2025-04-04 19:28 Vitaly Wool
2025-04-04 20:03 ` Johannes Weiner
2025-04-07 15:54 ` Johannes Weiner
2025-04-07 18:26 ` Vitaly Wool