* [PATCH 00/11] Introduce mt_dup() to improve the performance of fork()
@ 2023-07-26 8:09 Peng Zhang
2023-07-26 8:09 ` [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}() Peng Zhang
` (10 more replies)
0 siblings, 11 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
A few weeks ago, Liam and I discussed "Fork & Dup tree + Delete DONT_COPY" in
[1]. Thanks to Liam for the very helpful information there.
I did not use the scheme of linking the leaf nodes, nor the scheme of building
the tree in BFS order. I ended up building the tree in DFS (preorder) order. I
implemented this algorithm and it appears to be very efficient.
Using bench_forking() in lib/test_maple_tree.c to test in user space, the time
to duplicate the maple tree is as follows:
before: 13.52s
after: 2.60s
These numbers are the total time taken to duplicate 80,000 maple trees, each
with 134 entries, and exclude the time consumed by mt_validate() and
mtree_destroy(). The time consumed is reduced by 80.77%.
The performance improvement of fork() can be summarized as follows:
With 23 VMAs, performance improves by about 3%; with 223 VMAs, by about 15%;
and with 4023 VMAs, by about 30%.
See patch[11/11] for details.
In addition, I would like to assist Liam in maintaining the maple tree, which
requires Liam's consent. Going forward, I intend to keep contributing to the
development of the maple tree.
The layout of these patches:
001 - 003: Introduce some internal functions to facilitate the
implementation of mt_dup().
004 - 005: Introduce __mt_dup() and mt_dup(), and their tests.
006: Introduce mas_replace_entry() to efficiently replace an entry.
007 - 009: Follow-up work on introducing these things.
010: Add myself as co-maintainer for maple tree.
011: Use __mt_dup() to duplicate maple tree in dup_mmap().
[1] https://lore.kernel.org/lkml/463899aa-6cbd-f08e-0aca-077b0e4e4475@bytedance.com/
Peng Zhang (11):
maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()
maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end()
maple_tree: Add some helper functions
maple_tree: Introduce interfaces __mt_dup() and mt_dup()
maple_tree: Add test for mt_dup()
maple_tree: Introduce mas_replace_entry() to directly replace an entry
maple_tree: Update the documentation of maple tree
maple_tree: Skip other tests when BENCH is enabled
maple_tree: Update check_forking() and bench_forking()
MAINTAINERS: Add co-maintainer for maple tree
fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
Documentation/core-api/maple_tree.rst | 10 +
MAINTAINERS | 1 +
include/linux/maple_tree.h | 4 +
kernel/fork.c | 35 ++-
lib/maple_tree.c | 389 ++++++++++++++++++++++++--
lib/test_maple_tree.c | 67 ++---
mm/mmap.c | 14 +-
tools/testing/radix-tree/maple.c | 204 ++++++++++++++
8 files changed, 658 insertions(+), 66 deletions(-)
--
2.20.1
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 14:58 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 02/11] maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end() Peng Zhang
` (9 subsequent siblings)
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of non-leaf
nodes without knowing the maximum of the node, so that ascending the tree to
find the maximum can be avoided.
The principle is to introduce a MAPLE_ENODE bit to mark an ENODE. This bit can
never be set in metadata, so we can distinguish an ENODE from metadata.
The nocheck version avoids lockdep complaints in some scenarios where no locks
are held.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 68 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a3d602cfd030..98e4fdf6f4b9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
#define MAPLE_ENODE_TYPE_SHIFT 0x03
/* Bit 2 means a NULL somewhere below */
#define MAPLE_ENODE_NULL 0x04
+/* Bit 7 means this is an ENODE, instead of metadata */
+#define MAPLE_ENODE 0x80
+
+static inline bool slot_is_mte(unsigned long slot)
+{
+ return slot & MAPLE_ENODE;
+}
static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
enum maple_type type)
{
- return (void *)((unsigned long)node |
- (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
+ return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
+ MAPLE_ENODE_NULL | MAPLE_ENODE);
}
static inline void *mte_mk_root(const struct maple_enode *node)
@@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
return NULL;
}
+/*
+ * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
+ * @mt: The maple tree
+ * @node: The maple node
+ * @type: The maple node type
+ *
+ * Uses metadata to find the end of the data when possible without knowing the
+ * node maximum.
+ *
+ * Return: The zero indexed last slot with child.
+ */
+static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
+ struct maple_node *node,
+ enum maple_type type)
+{
+ void __rcu **slots;
+ unsigned long slot;
+
+ slots = ma_slots(node, type);
+ slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
+ if (unlikely(slot_is_mte(slot)))
+ return mt_pivots[type];
+
+ return ma_meta_end(node, type);
+}
+
+/*
+ * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
+ * @node: The maple node
+ * @type: The maple node type
+ *
+ * Uses metadata to find the end of the data when possible without knowing the
+ * node maximum. This is the version of ma_nonleaf_data_end() that does not
+ * check whether the lock is held. This version is designed to avoid lockdep
+ * complaints in some scenarios.
+ *
+ * Return: The zero indexed last slot with child.
+ */
+static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
+ enum maple_type type)
+{
+ void __rcu **slots;
+ unsigned long slot;
+
+ slots = ma_slots(node, type);
+ slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
+ if (unlikely(slot_is_mte(slot)))
+ return mt_pivots[type];
+
+ return ma_meta_end(node, type);
+}
+
+/* See ma_nonleaf_data_end() */
+static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
+ struct maple_enode *enode)
+{
+ return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
+}
+
/*
* ma_data_end() - Find the end of the data in a node.
* @node: The maple node
--
2.20.1
* [PATCH 02/11] maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end()
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
2023-07-26 8:09 ` [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}() Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 8:09 ` [PATCH 03/11] maple_tree: Add some helper functions Peng Zhang
` (8 subsequent siblings)
10 siblings, 0 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Update mt_validate() to validate MAPLE_ENODE and ma_nonleaf_data_end().
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
lib/maple_tree.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 98e4fdf6f4b9..e0e9a87bdb43 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -7130,6 +7130,11 @@ static void mas_validate_child_slot(struct ma_state *mas)
MT_BUG_ON(mas->tree, 1);
}
+ if (!slot_is_mte((unsigned long)child)) {
+ pr_err("Slot is not mte %p[%u]\n", mas_mn(mas), i);
+ MT_BUG_ON(mas->tree, 1);
+ }
+
if (mte_parent_slot(child) != i) {
pr_err("Slot error at %p[%u]: child %p has pslot %u\n",
mas_mn(mas), i, mte_to_node(child),
@@ -7200,6 +7205,13 @@ static void mas_validate_limits(struct ma_state *mas)
MT_BUG_ON(mas->tree, 1);
}
+ if (!mte_is_leaf(mas->node) &&
+ mas_data_end(mas) != mte_nonleaf_data_end(mas->tree, mas->node)) {
+ pr_err("node:%p mas_data_end() != mte_nonleaf_data_end()\n",
+ mas_mn(mas));
+ MT_BUG_ON(mas->tree, 1);
+ }
+
for (i += 1; i < mt_slots[type]; i++) {
void *entry = mas_slot(mas, slots, i);
--
2.20.1
* [PATCH 03/11] maple_tree: Add some helper functions
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
2023-07-26 8:09 ` [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}() Peng Zhang
2023-07-26 8:09 ` [PATCH 02/11] maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end() Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 15:02 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup() Peng Zhang
` (7 subsequent siblings)
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Add some helper functions that take a maple node as a parameter instead of a
maple enode. These functions will be used later.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
lib/maple_tree.c | 71 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 55 insertions(+), 16 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index e0e9a87bdb43..da3a2fb405c0 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -164,6 +164,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
}
+static inline void mt_free_one(struct maple_node *node)
+{
+ kmem_cache_free(maple_node_cache, node);
+}
+
static inline void mt_free_bulk(size_t size, void __rcu **nodes)
{
kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
@@ -432,18 +437,18 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
}
/*
- * mas_parent_type() - Return the maple_type of the parent from the stored
- * parent type.
- * @mas: The maple state
- * @enode: The maple_enode to extract the parent's enum
+ * ma_parent_type() - Return the maple_type of the parent from the stored parent
+ * type.
+ * @mt: The maple tree
+ * @node: The maple_node to extract the parent's enum
* Return: The node->parent maple_type
*/
static inline
-enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
+enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
{
unsigned long p_type;
- p_type = (unsigned long)mte_to_node(enode)->parent;
+ p_type = (unsigned long)node->parent;
if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
return 0;
@@ -451,7 +456,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
p_type &= ~mte_parent_slot_mask(p_type);
switch (p_type) {
case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
- if (mt_is_alloc(mas->tree))
+ if (mt_is_alloc(mt))
return maple_arange_64;
return maple_range_64;
}
@@ -459,6 +464,19 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
return 0;
}
+/*
+ * mas_parent_type() - Return the maple_type of the parent from the stored
+ * parent type.
+ * @mas: The maple state
+ * @enode: The maple_enode to extract the parent's enum
+ * Return: The node->parent maple_type
+ */
+static inline
+enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
+{
+ return ma_parent_type(mas->tree, mte_to_node(enode));
+}
+
/*
* mas_set_parent() - Set the parent node and encode the slot
* @enode: The encoded maple node.
@@ -499,14 +517,14 @@ void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
}
/*
- * mte_parent_slot() - get the parent slot of @enode.
- * @enode: The encoded maple node.
+ * ma_parent_slot() - get the parent slot of @node.
+ * @node: The maple node.
*
- * Return: The slot in the parent node where @enode resides.
+ * Return: The slot in the parent node where @node resides.
*/
-static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
+static inline unsigned int ma_parent_slot(const struct maple_node *node)
{
- unsigned long val = (unsigned long)mte_to_node(enode)->parent;
+ unsigned long val = (unsigned long)node->parent;
if (val & MA_ROOT_PARENT)
return 0;
@@ -519,15 +537,36 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
}
/*
- * mte_parent() - Get the parent of @node.
- * @node: The encoded maple node.
+ * mte_parent_slot() - get the parent slot of @enode.
+ * @enode: The encoded maple node.
+ *
+ * Return: The slot in the parent node where @enode resides.
+ */
+static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
+{
+ return ma_parent_slot(mte_to_node(enode));
+}
+
+/*
+ * ma_parent() - Get the parent of @node.
+ * @node: The maple node.
+ *
+ * Return: The parent maple node.
+ */
+static inline struct maple_node *ma_parent(const struct maple_node *node)
+{
+ return (void *)((unsigned long)(node->parent) & ~MAPLE_NODE_MASK);
+}
+
+/*
+ * mte_parent() - Get the parent of @enode.
+ * @enode: The encoded maple node.
*
* Return: The parent maple node.
*/
static inline struct maple_node *mte_parent(const struct maple_enode *enode)
{
- return (void *)((unsigned long)
- (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
+ return ma_parent(mte_to_node(enode));
}
/*
--
2.20.1
* [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (2 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 03/11] maple_tree: Add some helper functions Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 16:03 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 05/11] maple_tree: Add test for mt_dup() Peng Zhang
` (6 subsequent siblings)
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Introduce interfaces __mt_dup() and mt_dup(), which are used to duplicate a
maple tree. They perform better than traversing the source tree and
reinserting each entry into the new tree one by one.
The difference between __mt_dup() and mt_dup() is that mt_dup() takes the
internal lock itself, while __mt_dup() requires the caller to hold the lock.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
include/linux/maple_tree.h | 3 +
lib/maple_tree.c | 211 +++++++++++++++++++++++++++++++++++++
2 files changed, 214 insertions(+)
diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index c962af188681..229fe78e4c89 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
void *entry, gfp_t gfp);
void *mtree_erase(struct maple_tree *mt, unsigned long index);
+int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+
void mtree_destroy(struct maple_tree *mt);
void __mt_destroy(struct maple_tree *mt);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index da3a2fb405c0..efac6761ae37 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6595,6 +6595,217 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
}
EXPORT_SYMBOL(mtree_erase);
+/*
+ * mt_dup_free() - Free the nodes of an incomplete maple tree.
+ * @mt: The incomplete maple tree
+ * @node: Free nodes from @node
+ *
+ * This function frees all nodes starting from @node in the reverse order of
+ * mt_dup_build(). At this point we don't need to hold the source tree lock.
+ */
+static void mt_dup_free(struct maple_tree *mt, struct maple_node *node)
+{
+ void **slots;
+ unsigned char offset;
+ struct maple_enode *enode;
+ enum maple_type type;
+ unsigned char count = 0, i;
+
+try_ascend:
+ if (ma_is_root(node)) {
+ mt_free_one(node);
+ return;
+ }
+
+ offset = ma_parent_slot(node);
+ type = ma_parent_type(mt, node);
+ node = ma_parent(node);
+ if (!offset)
+ goto free;
+
+ offset--;
+
+descend:
+ slots = (void **)ma_slots(node, type);
+ enode = slots[offset];
+ if (mte_is_leaf(enode))
+ goto free;
+
+ type = mte_node_type(enode);
+ node = mte_to_node(enode);
+ offset = ma_nonleaf_data_end_nocheck(node, type);
+ goto descend;
+
+free:
+ slots = (void **)ma_slots(node, type);
+ count = ma_nonleaf_data_end_nocheck(node, type) + 1;
+ for (i = 0; i < count; i++)
+ ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
+
+ /* Cast to __rcu to avoid sparse checker complaining. */
+ mt_free_bulk(count, (void __rcu **)slots);
+ goto try_ascend;
+}
+
+/*
+ * mt_dup_build() - Build a new maple tree from a source tree
+ * @mt: The source maple tree to copy from
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ * @to_free: Free nodes starting from @to_free if the build fails
+ *
+ * This function builds a new tree in DFS preorder. If it fails due to memory
+ * allocation, @to_free will store the last failed node to free the incomplete
+ * tree. Use mt_dup_free() to free nodes.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated.
+ */
+static inline int mt_dup_build(struct maple_tree *mt, struct maple_tree *new,
+ gfp_t gfp, struct maple_node **to_free)
+{
+ struct maple_enode *enode;
+ struct maple_node *new_node, *new_parent = NULL, *node;
+ enum maple_type type;
+ void __rcu **slots;
+ void **new_slots;
+ unsigned char count, request, i, offset;
+ unsigned long *set_parent;
+ unsigned long new_root;
+
+ mt_init_flags(new, mt->ma_flags);
+ enode = mt_root_locked(mt);
+ if (unlikely(!xa_is_node(enode))) {
+ rcu_assign_pointer(new->ma_root, enode);
+ return 0;
+ }
+
+ new_node = mt_alloc_one(gfp);
+ if (!new_node)
+ return -ENOMEM;
+
+ new_root = (unsigned long)new_node;
+ new_root |= (unsigned long)enode & MAPLE_NODE_MASK;
+
+copy_node:
+ node = mte_to_node(enode);
+ type = mte_node_type(enode);
+ memcpy(new_node, node, sizeof(struct maple_node));
+
+ set_parent = (unsigned long *)&(new_node->parent);
+ *set_parent &= MAPLE_NODE_MASK;
+ *set_parent |= (unsigned long)new_parent;
+ if (ma_is_leaf(type))
+ goto ascend;
+
+ new_slots = (void **)ma_slots(new_node, type);
+ slots = ma_slots(node, type);
+ request = ma_nonleaf_data_end(mt, node, type) + 1;
+ count = mt_alloc_bulk(gfp, request, new_slots);
+ if (!count) {
+ *to_free = new_node;
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < count; i++)
+ ((unsigned long *)new_slots)[i] |=
+ ((unsigned long)mt_slot_locked(mt, slots, i) &
+ MAPLE_NODE_MASK);
+ offset = 0;
+
+descend:
+ new_parent = new_node;
+ enode = mt_slot_locked(mt, slots, offset);
+ new_node = mte_to_node(new_slots[offset]);
+ goto copy_node;
+
+ascend:
+ if (ma_is_root(node)) {
+ new_node = mte_to_node((void *)new_root);
+ new_node->parent = ma_parent_ptr((unsigned long)new |
+ MA_ROOT_PARENT);
+ rcu_assign_pointer(new->ma_root, (void *)new_root);
+ return 0;
+ }
+
+ offset = ma_parent_slot(node);
+ type = ma_parent_type(mt, node);
+ node = ma_parent(node);
+ new_node = ma_parent(new_node);
+ if (offset < ma_nonleaf_data_end(mt, node, type)) {
+ offset++;
+ new_slots = (void **)ma_slots(new_node, type);
+ slots = ma_slots(node, type);
+ goto descend;
+ }
+
+ goto ascend;
+}
+
+/**
+ * __mt_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree using a faster method than traversing
+ * the source tree and inserting entries into the new tree one by one. The user
+ * needs to lock the source tree manually. Before calling this function, @new
+ * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
+ * we may also need to manually set @new's external lock using
+ * mt_set_external_lock().
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated.
+ */
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+ int ret;
+ struct maple_node *to_free = NULL;
+
+ ret = mt_dup_build(mt, new, gfp, &to_free);
+
+ if (unlikely(ret == -ENOMEM)) {
+ if (to_free)
+ mt_dup_free(new, to_free);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(__mt_dup);
+
+/**
+ * mt_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree using a faster method than traversing
+ * the source tree and inserting entries into the new tree one by one. The
+ * function will lock the source tree with an internal lock, and the user does
+ * not need to manually handle the lock. Before calling this function, @new must
+ * be an empty tree or an uninitialized tree. If @mt uses an external lock, we
+ * may also need to manually set @new's external lock using
+ * mt_set_external_lock().
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated.
+ */
+int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+ int ret;
+ struct maple_node *to_free = NULL;
+
+ mtree_lock(mt);
+ ret = mt_dup_build(mt, new, gfp, &to_free);
+ mtree_unlock(mt);
+
+ if (unlikely(ret == -ENOMEM)) {
+ if (to_free)
+ mt_dup_free(new, to_free);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(mt_dup);
+
/**
* __mt_destroy() - Walk and free all nodes of a locked maple tree.
* @mt: The maple tree
--
2.20.1
* [PATCH 05/11] maple_tree: Add test for mt_dup()
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (3 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup() Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 16:06 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry Peng Zhang
` (5 subsequent siblings)
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Add test for mt_dup().
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
1 file changed, 202 insertions(+)
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index e5da1cad70ba..3052e899e5df 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
mt_clear_in_rcu(mt);
}
+/*
+ * Compare two nodes and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_node(struct maple_enode *enode_a,
+ struct maple_enode *enode_b)
+{
+ struct maple_node *node_a, *node_b;
+ struct maple_node a, b;
+ void **slots_a, **slots_b; /* Do not use the rcu tag. */
+ enum maple_type type;
+ int i;
+
+ if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
+ ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
+ pr_err("The lower 8 bits of enode are different.\n");
+ return -1;
+ }
+
+ type = mte_node_type(enode_a);
+ node_a = mte_to_node(enode_a);
+ node_b = mte_to_node(enode_b);
+ a = *node_a;
+ b = *node_b;
+
+ /* Do not compare addresses. */
+ if (ma_is_root(node_a) || ma_is_root(node_b)) {
+ a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+ MA_ROOT_PARENT);
+ b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+ MA_ROOT_PARENT);
+ } else {
+ a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+ MAPLE_NODE_MASK);
+ b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+ MAPLE_NODE_MASK);
+ }
+
+ if (a.parent != b.parent) {
+ pr_err("The lower 8 bits of parents are different. %p %p\n",
+ a.parent, b.parent);
+ return -1;
+ }
+
+ /*
+ * If it is a leaf node, the slots do not contain the node address, and
+ * no special processing of slots is required.
+ */
+ if (ma_is_leaf(type))
+ goto cmp;
+
+ slots_a = ma_slots(&a, type);
+ slots_b = ma_slots(&b, type);
+
+ for (i = 0; i < mt_slots[type]; i++) {
+ if (!slots_a[i] && !slots_b[i])
+ break;
+
+ if (!slots_a[i] || !slots_b[i]) {
+ pr_err("The number of slots is different.\n");
+ return -1;
+ }
+
+ /* Do not compare addresses in slots. */
+ ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
+ ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
+ }
+
+cmp:
+ /*
+ * Compare all contents of two nodes, including parent (except address),
+ * slots (except address), pivots, gaps and metadata.
+ */
+ return memcmp(&a, &b, sizeof(struct maple_node));
+}
+
+/*
+ * Compare two trees and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
+{
+ MA_STATE(mas_a, mt_a, 0, 0);
+ MA_STATE(mas_b, mt_b, 0, 0);
+
+ if (mt_a->ma_flags != mt_b->ma_flags) {
+ pr_err("The flags of the two trees are different.\n");
+ return -1;
+ }
+
+ mas_dfs_preorder(&mas_a);
+ mas_dfs_preorder(&mas_b);
+
+ if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
+ if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
+ pr_err("One is MAS_ROOT and the other is not.\n");
+ return -1;
+ }
+ return 0;
+ }
+
+ while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
+
+ if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
+ pr_err("One is MAS_NONE and the other is not.\n");
+ return -1;
+ }
+
+ if (mas_a.min != mas_b.min ||
+ mas_a.max != mas_b.max) {
+ pr_err("mas->min, mas->max do not match.\n");
+ return -1;
+ }
+
+ if (compare_node(mas_a.node, mas_b.node)) {
+ pr_err("The contents of nodes %p and %p are different.\n",
+ mas_a.node, mas_b.node);
+ mt_dump(mt_a, mt_dump_dec);
+ mt_dump(mt_b, mt_dump_dec);
+ return -1;
+ }
+
+ mas_dfs_preorder(&mas_a);
+ mas_dfs_preorder(&mas_b);
+ }
+
+ return 0;
+}
+
+static noinline void __init check_mt_dup(struct maple_tree *mt)
+{
+ DEFINE_MTREE(new);
+ int i, j, ret, count = 0;
+
+ /* Stored in the root pointer. */
+ mt_init_flags(&tree, 0);
+ mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
+ mt_dup(&tree, &new, GFP_KERNEL);
+ mt_validate(&new);
+ if (compare_tree(&tree, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(&tree);
+ mtree_destroy(&new);
+
+ for (i = 0; i < 1000; i += 3) {
+ if (i & 1)
+ mt_init_flags(&tree, 0);
+ else
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+
+ for (j = 0; j < i; j++) {
+ mtree_store_range(&tree, j * 10, j * 10 + 5,
+ xa_mk_value(j), GFP_KERNEL);
+ }
+
+ ret = mt_dup(&tree, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret != 0);
+ mt_validate(&new);
+ if (compare_tree(&tree, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(&tree);
+ mtree_destroy(&new);
+ }
+
+ /* Test memory allocation failed. */
+ for (i = 0; i < 1000; i += 3) {
+ if (i & 1)
+ mt_init_flags(&tree, 0);
+ else
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+
+ for (j = 0; j < i; j++) {
+ mtree_store_range(&tree, j * 10, j * 10 + 5,
+ xa_mk_value(j), GFP_KERNEL);
+ }
+
+ mt_set_non_kernel(50);
+ ret = mt_dup(&tree, &new, GFP_NOWAIT);
+ mt_set_non_kernel(0);
+ if (ret != 0) {
+ MT_BUG_ON(&new, ret != -ENOMEM);
+ count++;
+ mtree_destroy(&tree);
+ continue;
+ }
+
+ mt_validate(&new);
+ if (compare_tree(&tree, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(&tree);
+ mtree_destroy(&new);
+ }
+
+ /* pr_info("mt_dup() fail %d times\n", count); */
+ BUG_ON(!count);
+}
+
extern void test_kmem_cache_bulk(void);
void farmer_tests(void)
@@ -35904,6 +36102,10 @@ void farmer_tests(void)
check_null_expand(&tree);
mtree_destroy(&tree);
+ mt_init_flags(&tree, 0);
+ check_mt_dup(&tree);
+ mtree_destroy(&tree);
+
/* RCU testing */
mt_init_flags(&tree, 0);
check_erase_testset(&tree);
--
2.20.1
* [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (4 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 05/11] maple_tree: Add test for mt_dup() Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 16:08 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 07/11] maple_tree: Update the documentation of maple tree Peng Zhang
` (4 subsequent siblings)
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
If mas has located a specific entry, it may be necessary to replace this
entry, so introduce mas_replace_entry() to do so. mas_replace_entry() is more
efficient than mas_store*() because it skips many unnecessary checks.
This function should be inline, but that would require moving more functions
into the header file, so it is left as-is for the time being.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
include/linux/maple_tree.h | 1 +
lib/maple_tree.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 229fe78e4c89..a05e9827d761 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -462,6 +462,7 @@ struct ma_wr_state {
void *mas_walk(struct ma_state *mas);
void *mas_store(struct ma_state *mas, void *entry);
+void mas_replace_entry(struct ma_state *mas, void *entry);
void *mas_erase(struct ma_state *mas);
int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
void mas_store_prealloc(struct ma_state *mas, void *entry);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index efac6761ae37..d58572666a00 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
}
EXPORT_SYMBOL_GPL(mas_store);
+/**
+ * mas_replace_entry() - Replace an entry that already exists in the maple tree
+ * @mas: The maple state
+ * @entry: The entry to store
+ *
+ * Note that @mas must already point to an existing entry, and the new entry
+ * must not be NULL. If either condition cannot be guaranteed, use
+ * mas_store*() instead; otherwise an internal error will occur in the maple
+ * tree. This function does not allocate memory, so it cannot fail.
+ */
+void mas_replace_entry(struct ma_state *mas, void *entry)
+{
+ void __rcu **slots;
+
+#ifdef CONFIG_DEBUG_MAPLE_TREE
+ MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
+ MAS_WARN_ON(mas, !entry);
+ MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
+#endif
+
+ slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
+ rcu_assign_pointer(slots[mas->offset], entry);
+}
+EXPORT_SYMBOL_GPL(mas_replace_entry);
+
/**
* mas_store_gfp() - Store a value into the tree.
* @mas: The maple state
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 07/11] maple_tree: Update the documentation of maple tree
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (5 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 8:09 ` [PATCH 08/11] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
` (3 subsequent siblings)
10 siblings, 0 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Document the newly introduced mt_dup() and mas_replace_entry().
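A sketch of the documented duplication API (hypothetical caller code, not
taken from the patch):

```
DEFINE_MTREE(source);
DEFINE_MTREE(copy);
int ret;

/*
 * mt_dup() takes the source tree's internal lock itself;
 * __mt_dup() expects the caller to already hold it.
 */
ret = mt_dup(&source, &copy, GFP_KERNEL);
if (ret)
	return ret;	/* -ENOMEM */
```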
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
Documentation/core-api/maple_tree.rst | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
index 45defcf15da7..a4fa991277c7 100644
--- a/Documentation/core-api/maple_tree.rst
+++ b/Documentation/core-api/maple_tree.rst
@@ -71,6 +71,11 @@ return -EEXIST if the range is not empty.
You can search for an entry from an index upwards by using mt_find().
+If you want to duplicate a tree, you can use mt_dup(). It builds a new
+tree identical to the source tree. Its implementation copies nodes
+directly rather than reinserting entries, so it is much faster than
+traversing the source tree and inserting each entry into the new tree.
+
You can walk each entry within a range by calling mt_for_each(). You must
provide a temporary variable to store a cursor. If you want to walk each
element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
@@ -115,6 +120,7 @@ Takes ma_lock internally:
* mtree_destroy()
* mt_set_in_rcu()
* mt_clear_in_rcu()
+ * mt_dup()
If you want to take advantage of the internal lock to protect the data
structures that you are storing in the Maple Tree, you can call mtree_lock()
@@ -155,6 +161,10 @@ You can set entries using mas_store(). mas_store() will overwrite any entry
with the new entry and return the first existing entry that is overwritten.
The range is passed in as members of the maple state: index and last.
+If you have already located an entry with a function such as mas_find()
+and want to replace it, you can use mas_replace_entry(), which is more
+efficient than mas_store*().
+
You can use mas_erase() to erase an entire range by setting index and
last of the maple state to the desired range to erase. This will erase
the first range that is found in that range, set the maple state index
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 08/11] maple_tree: Skip other tests when BENCH is enabled
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (6 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 07/11] maple_tree: Update the documentation of maple tree Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 8:09 ` [PATCH 09/11] maple_tree: Update check_forking() and bench_forking() Peng Zhang
` (2 subsequent siblings)
10 siblings, 0 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Skip other tests when BENCH is enabled so that performance can be
measured in user space.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
lib/test_maple_tree.c | 8 ++++----
tools/testing/radix-tree/maple.c | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 0674aebd4423..0ec0c6a7c0b5 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -3514,10 +3514,6 @@ static int __init maple_tree_seed(void)
pr_info("\nTEST STARTING\n\n");
- mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
- check_root_expand(&tree);
- mtree_destroy(&tree);
-
#if defined(BENCH_SLOT_STORE)
#define BENCH
mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
@@ -3575,6 +3571,10 @@ static int __init maple_tree_seed(void)
goto skip;
#endif
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+ check_root_expand(&tree);
+ mtree_destroy(&tree);
+
mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
check_iteration(&tree);
mtree_destroy(&tree);
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 3052e899e5df..57e6b0bc5984 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -36140,7 +36140,9 @@ void farmer_tests(void)
void maple_tree_tests(void)
{
+#if !defined(BENCH_FORK)
farmer_tests();
+#endif
maple_tree_seed();
maple_tree_harvest();
}
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 09/11] maple_tree: Update check_forking() and bench_forking()
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (7 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 08/11] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 8:09 ` [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree Peng Zhang
2023-07-26 8:09 ` [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
10 siblings, 0 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Update check_forking() and bench_forking() to use __mt_dup() to
duplicate the maple tree.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
lib/test_maple_tree.c | 59 ++++++++++++++++++-------------------------
1 file changed, 25 insertions(+), 34 deletions(-)
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 0ec0c6a7c0b5..bbdac08927c6 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -1837,7 +1837,7 @@ static noinline void __init check_forking(struct maple_tree *mt)
{
struct maple_tree newmt;
- int i, nr_entries = 134;
+ int i, nr_entries = 134, ret;
void *val;
MA_STATE(mas, mt, 0, 0);
MA_STATE(newmas, mt, 0, 0);
@@ -1847,26 +1847,22 @@ static noinline void __init check_forking(struct maple_tree *mt)
xa_mk_value(i), GFP_KERNEL);
mt_set_non_kernel(99999);
- mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
- newmas.tree = &newmt;
- mas_reset(&newmas);
- mas_reset(&mas);
- mas_lock(&newmas);
- mas.index = 0;
- mas.last = 0;
- if (mas_expected_entries(&newmas, nr_entries)) {
+ mas_lock(&mas);
+
+ ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
+ if (ret) {
pr_err("OOM!");
BUG_ON(1);
}
- rcu_read_lock();
- mas_for_each(&mas, val, ULONG_MAX) {
- newmas.index = mas.index;
- newmas.last = mas.last;
- mas_store(&newmas, val);
+
+ mas_set(&newmas, 0);
+ mas_for_each(&newmas, val, ULONG_MAX) {
+ mas_replace_entry(&newmas, val);
}
- rcu_read_unlock();
+
+ mas_unlock(&mas);
+
mas_destroy(&newmas);
- mas_unlock(&newmas);
mt_validate(&newmt);
mt_set_non_kernel(0);
mtree_destroy(&newmt);
@@ -1974,9 +1970,8 @@ static noinline void __init check_mas_store_gfp(struct maple_tree *mt)
#if defined(BENCH_FORK)
static noinline void __init bench_forking(struct maple_tree *mt)
{
-
struct maple_tree newmt;
- int i, nr_entries = 134, nr_fork = 80000;
+ int i, nr_entries = 134, nr_fork = 80000, ret;
void *val;
MA_STATE(mas, mt, 0, 0);
MA_STATE(newmas, mt, 0, 0);
@@ -1987,26 +1982,22 @@ static noinline void __init bench_forking(struct maple_tree *mt)
for (i = 0; i < nr_fork; i++) {
mt_set_non_kernel(99999);
- mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
- newmas.tree = &newmt;
- mas_reset(&newmas);
- mas_reset(&mas);
- mas.index = 0;
- mas.last = 0;
- rcu_read_lock();
- mas_lock(&newmas);
- if (mas_expected_entries(&newmas, nr_entries)) {
- printk("OOM!");
+
+ mas_lock(&mas);
+ ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
+ if (ret) {
+ pr_err("OOM!");
BUG_ON(1);
}
- mas_for_each(&mas, val, ULONG_MAX) {
- newmas.index = mas.index;
- newmas.last = mas.last;
- mas_store(&newmas, val);
+
+ mas_set(&newmas, 0);
+ mas_for_each(&newmas, val, ULONG_MAX) {
+ mas_replace_entry(&newmas, val);
}
+
+ mas_unlock(&mas);
+
mas_destroy(&newmas);
- mas_unlock(&newmas);
- rcu_read_unlock();
mt_validate(&newmt);
mt_set_non_kernel(0);
mtree_destroy(&newmt);
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (8 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 09/11] maple_tree: Update check_forking() and bench_forking() Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 16:39 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Add myself as co-maintainer for maple tree. I would like to assist
Liam R. Howlett in maintaining maple tree. I will continue to contribute
to the development of maple tree in the future.
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index ddc71b815791..8cfedd492509 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12526,6 +12526,7 @@ F: net/mctp/
MAPLE TREE
M: Liam R. Howlett <Liam.Howlett@oracle.com>
+M: Peng Zhang <zhangpeng.00@bytedance.com>
L: linux-mm@kvack.org
S: Supported
F: Documentation/core-api/maple_tree.rst
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
` (9 preceding siblings ...)
2023-07-26 8:09 ` [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree Peng Zhang
@ 2023-07-26 8:09 ` Peng Zhang
2023-07-26 17:06 ` Liam R. Howlett
10 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-26 8:09 UTC (permalink / raw)
To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang
Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
directly modify the VMA entries in the new maple tree, which yields
better performance. dup_mmap() is used by fork(), so this patch
optimizes fork(). The improvement is proportional to the number of
VMAs.
With this method in place, the optimization in (maple_tree: add a fast
path case in mas_wr_slot_store())[1] no longer takes effect here, though
it remains a worthwhile maple tree optimization in its own right.
The unixbench test suite[2] uses 'spawn' to test fork(). 'spawn' only
has 23 VMAs by default, so I tweaked the benchmark code to use mmap()
to control the number of VMAs, allowing performance to be measured
under different VMA counts.
Insert code like below into 'spawn':
for (int i = 0; i < 200; ++i) {
size_t size = 10 * getpagesize();
void *addr;
if (i & 1) {
addr = mmap(NULL, size, PROT_READ,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
} else {
addr = mmap(NULL, size, PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
if (addr == MAP_FAILED)
...
}
Based on next-20230721, run 'spawn' with 23, 223, and 4023 VMAs, testing
4 times for 30 seconds each, which gives the numbers below. These
numbers are the number of fork() successes in 30s (average of the best
3 out of 4 runs). As a comparison, based on next-20230725, I reverted
[1] and tested it as well. To ensure reliable results, these tests were
run on a physical machine.
23VMAs 223VMAs 4023VMAs
revert [1]: 159104.00 73316.33 6787.00
+0.77% +0.42% +0.28%
next-20230721: 160321.67 73624.67 6806.33
+2.77% +15.42% +29.86%
apply this: 164751.67 84980.33 8838.67
It can be seen that the performance improvement is proportional to
the number of VMAs. With 23 VMAs, performance improves by about 3%,
with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
performance improves by about 30%.
[1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@bytedance.com/
[2] https://github.com/kdlucas/byte-unixbench/tree/master
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
kernel/fork.c | 35 +++++++++++++++++++++++++++--------
mm/mmap.c | 14 ++++++++++++--
2 files changed, 39 insertions(+), 10 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index f81149739eb9..ef80025b62d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
int retval;
unsigned long charge = 0;
LIST_HEAD(uf);
- VMA_ITERATOR(old_vmi, oldmm, 0);
VMA_ITERATOR(vmi, mm, 0);
uprobe_start_dup_mmap();
@@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
goto out;
khugepaged_fork(mm, oldmm);
- retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
- if (retval)
+ /* Use __mt_dup() to efficiently build an identical maple tree. */
+ retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
+ if (unlikely(retval))
goto out;
mt_clear_in_rcu(vmi.mas.tree);
- for_each_vma(old_vmi, mpnt) {
+ for_each_vma(vmi, mpnt) {
struct file *file;
vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+
+ /*
+ * Since the new tree is exactly the same as the old one,
+ * we need to remove the unneeded VMAs.
+ */
+ mas_store(&vmi.mas, NULL);
+
+ /*
+ * Even removing an entry may require memory allocation;
+ * if removal fails, use XA_ZERO_ENTRY to mark the VMA at
+ * which it failed. The XA_ZERO_ENTRY case will be handled
+ * in exit_mmap().
+ */
+ if (unlikely(mas_is_err(&vmi.mas))) {
+ retval = xa_err(vmi.mas.node);
+ mas_reset(&vmi.mas);
+ if (mas_find(&vmi.mas, ULONG_MAX))
+ mas_replace_entry(&vmi.mas,
+ XA_ZERO_ENTRY);
+ goto loop_out;
+ }
+
continue;
}
charge = 0;
@@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
hugetlb_dup_vma_private(tmp);
/* Link the vma into the MT */
- if (vma_iter_bulk_store(&vmi, tmp))
- goto fail_nomem_vmi_store;
+ mas_replace_entry(&vmi.mas, tmp);
mm->map_count++;
if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
uprobe_end_dup_mmap();
return retval;
-fail_nomem_vmi_store:
- unlink_anon_vmas(tmp);
fail_nomem_anon_vma_fork:
mpol_put(vma_policy(tmp));
fail_nomem_policy:
diff --git a/mm/mmap.c b/mm/mmap.c
index bc91d91261ab..5bfba2fb0e39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
arch_exit_mmap(mm);
vma = mas_find(&mas, ULONG_MAX);
- if (!vma) {
+ /*
+ * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
+ * xa_is_zero(vma) may be true.
+ */
+ if (!vma || xa_is_zero(vma)) {
/* Can happen if dup_mmap() received an OOM */
mmap_read_unlock(mm);
return;
@@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
remove_vma(vma, true);
count++;
cond_resched();
- } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+ vma = mas_find(&mas, ULONG_MAX);
+ /*
+ * If xa_is_zero(vma) is true, it means that subsequent VMAs
+ * do not need to be removed. Can happen if dup_mmap() fails to
+ * remove a VMA marked VM_DONTCOPY.
+ */
+ } while (vma != NULL && !xa_is_zero(vma));
BUG_ON(count != mm->map_count);
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()
2023-07-26 8:09 ` [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}() Peng Zhang
@ 2023-07-26 14:58 ` Liam R. Howlett
2023-07-31 9:52 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 14:58 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
> non-leaf nodes without knowing the maximum value of the node, so that
> ascending the tree can be avoided when the node's maximum is not known.
>
> The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
> cannot be used by metadata, so we can distinguish whether it is ENODE or
> metadata.
>
> The nocheck version is to avoid lockdep complaining in some scenarios
> where no locks are held.
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index a3d602cfd030..98e4fdf6f4b9 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
> #define MAPLE_ENODE_TYPE_SHIFT 0x03
> /* Bit 2 means a NULL somewhere below */
> #define MAPLE_ENODE_NULL 0x04
> +/* Bit 7 means this is an ENODE, instead of metadata */
> +#define MAPLE_ENODE 0x80
We were saving this bit for more node types. I don't want to use this
bit for this reason since you could have done BFS to duplicate the tree
using the existing way to find the node end.
Bits are highly valuable and this is the only remaining bit. I had
thought about using this in Feb 2021 to see if there was metadata or
not, but figured a way around it (using the max trick) and thus saved
this bit for potential expansion of node types.
> +
> +static inline bool slot_is_mte(unsigned long slot)
> +{
> + return slot & MAPLE_ENODE;
> +}
>
> static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
> enum maple_type type)
> {
> - return (void *)((unsigned long)node |
> - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
> + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
> + MAPLE_ENODE_NULL | MAPLE_ENODE);
> }
>
> static inline void *mte_mk_root(const struct maple_enode *node)
> @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
> return NULL;
> }
>
> +/*
> + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
> + * @mt: The maple tree
> + * @node: The maple node
> + * @type: The maple node type
> + *
> + * Uses metadata to find the end of the data when possible without knowing the
> + * node maximum.
> + *
> + * Return: The zero indexed last slot with child.
> + */
> +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
> + struct maple_node *node,
> + enum maple_type type)
> +{
> + void __rcu **slots;
> + unsigned long slot;
> +
> + slots = ma_slots(node, type);
> + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
> + if (unlikely(slot_is_mte(slot)))
> + return mt_pivots[type];
> +
> + return ma_meta_end(node, type);
> +}
> +
> +/*
> + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
> + * @node: The maple node
> + * @type: The maple node type
> + *
> + * Uses metadata to find the end of the data when possible without knowing the
> + * node maximum. This is the version of ma_nonleaf_data_end() that does not
> + * check for lock held. This particular version is designed to avoid lockdep
> + * complaining in some scenarios.
> + *
> + * Return: The zero indexed last slot with child.
> + */
> +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
> + enum maple_type type)
> +{
> + void __rcu **slots;
> + unsigned long slot;
> +
> + slots = ma_slots(node, type);
> + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
> + if (unlikely(slot_is_mte(slot)))
> + return mt_pivots[type];
> +
> + return ma_meta_end(node, type);
> +}
> +
> +/* See ma_nonleaf_data_end() */
> +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
> + struct maple_enode *enode)
> +{
> + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
> +}
> +
> /*
> * ma_data_end() - Find the end of the data in a node.
> * @node: The maple node
> --
> 2.20.1
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 03/11] maple_tree: Add some helper functions
2023-07-26 8:09 ` [PATCH 03/11] maple_tree: Add some helper functions Peng Zhang
@ 2023-07-26 15:02 ` Liam R. Howlett
2023-07-26 15:08 ` Matthew Wilcox
2023-07-31 11:40 ` Peng Zhang
0 siblings, 2 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 15:02 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> Add some helper functions so that their parameters are maple node
> instead of maple enode; these functions will be used later.
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> lib/maple_tree.c | 71 +++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index e0e9a87bdb43..da3a2fb405c0 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -164,6 +164,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
> return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
> }
>
> +static inline void mt_free_one(struct maple_node *node)
> +{
> + kmem_cache_free(maple_node_cache, node);
> +}
> +
There is a place in mas_destroy() that could use this if it is added.
> static inline void mt_free_bulk(size_t size, void __rcu **nodes)
> {
> kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> @@ -432,18 +437,18 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
> }
>
> /*
> - * mas_parent_type() - Return the maple_type of the parent from the stored
> - * parent type.
> - * @mas: The maple state
> - * @enode: The maple_enode to extract the parent's enum
> + * ma_parent_type() - Return the maple_type of the parent from the stored parent
> + * type.
> + * @mt: The maple tree
> + * @node: The maple_node to extract the parent's enum
> * Return: The node->parent maple_type
> */
> static inline
> -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
I was trying to keep ma_* prefix to mean the first argument is
maple_node and mt_* to mean maple_tree. I wasn't entirely successful
with this and I do see why you want to use ma_, but maybe reverse the
arguments here?
> {
> unsigned long p_type;
>
> - p_type = (unsigned long)mte_to_node(enode)->parent;
> + p_type = (unsigned long)node->parent;
> if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
> return 0;
>
> @@ -451,7 +456,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> p_type &= ~mte_parent_slot_mask(p_type);
> switch (p_type) {
> case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
> - if (mt_is_alloc(mas->tree))
> + if (mt_is_alloc(mt))
> return maple_arange_64;
> return maple_range_64;
> }
> @@ -459,6 +464,19 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> return 0;
> }
>
> +/*
> + * mas_parent_type() - Return the maple_type of the parent from the stored
> + * parent type.
> + * @mas: The maple state
> + * @enode: The maple_enode to extract the parent's enum
> + * Return: The node->parent maple_type
> + */
> +static inline
> +enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> +{
> + return ma_parent_type(mas->tree, mte_to_node(enode));
> +}
> +
> /*
> * mas_set_parent() - Set the parent node and encode the slot
> * @enode: The encoded maple node.
> @@ -499,14 +517,14 @@ void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
> }
>
> /*
> - * mte_parent_slot() - get the parent slot of @enode.
> - * @enode: The encoded maple node.
> + * ma_parent_slot() - get the parent slot of @node.
> + * @node: The maple node.
> *
> - * Return: The slot in the parent node where @enode resides.
> + * Return: The slot in the parent node where @node resides.
> */
> -static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
> +static inline unsigned int ma_parent_slot(const struct maple_node *node)
> {
> - unsigned long val = (unsigned long)mte_to_node(enode)->parent;
> + unsigned long val = (unsigned long)node->parent;
>
> if (val & MA_ROOT_PARENT)
> return 0;
> @@ -519,15 +537,36 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
> }
>
> /*
> - * mte_parent() - Get the parent of @node.
> - * @node: The encoded maple node.
> + * mte_parent_slot() - get the parent slot of @enode.
> + * @enode: The encoded maple node.
> + *
> + * Return: The slot in the parent node where @enode resides.
> + */
> +static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
> +{
> + return ma_parent_slot(mte_to_node(enode));
> +}
> +
> +/*
> + * ma_parent() - Get the parent of @node.
> + * @node: The maple node.
> + *
> + * Return: The parent maple node.
> + */
> +static inline struct maple_node *ma_parent(const struct maple_node *node)
I had a lot of these helpers before, but they eventually became used so
little that I dropped them.
> +{
> + return (void *)((unsigned long)(node->parent) & ~MAPLE_NODE_MASK);
> +}
> +
> +/*
> + * mte_parent() - Get the parent of @enode.
> + * @enode: The encoded maple node.
> *
> * Return: The parent maple node.
> */
> static inline struct maple_node *mte_parent(const struct maple_enode *enode)
> {
> - return (void *)((unsigned long)
> - (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
> + return ma_parent(mte_to_node(enode));
> }
>
> /*
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 03/11] maple_tree: Add some helper functions
2023-07-26 15:02 ` Liam R. Howlett
@ 2023-07-26 15:08 ` Matthew Wilcox
2023-07-31 11:45 ` Peng Zhang
2023-07-31 11:40 ` Peng Zhang
1 sibling, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2023-07-26 15:08 UTC (permalink / raw)
To: Liam R. Howlett, Peng Zhang, corbet, akpm, brauner, surenb,
michael.christie, peterz, mathieu.desnoyers, npiggin, avagin,
linux-mm, linux-doc, linux-kernel, linux-fsdevel
On Wed, Jul 26, 2023 at 11:02:52AM -0400, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > static inline
> > -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> > +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
>
> I was trying to keep ma_* prefix to mean the first argument is
> maple_node and mt_* to mean maple_tree. I wasn't entirely successful
> with this and I do see why you want to use ma_, but maybe reverse the
> arguments here?
I think your first idea is better. Usually we prefer to order the
arguments by "containing thing" to "contained thing". So always use
(struct address_space *, struct folio *), for example. Or (struct
mm_struct *, struct vm_area_struct *).
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-07-26 8:09 ` [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup() Peng Zhang
@ 2023-07-26 16:03 ` Liam R. Howlett
2023-07-31 12:24 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 16:03 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> Introduce interfaces __mt_dup() and mt_dup(), which are used to
> duplicate a maple tree. Compared with traversing the source tree and
> reinserting entry by entry in the new tree, it has better performance.
> The difference between __mt_dup() and mt_dup() is that mt_dup() holds
> an internal lock.
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> include/linux/maple_tree.h | 3 +
> lib/maple_tree.c | 211 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 214 insertions(+)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index c962af188681..229fe78e4c89 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> void *entry, gfp_t gfp);
> void *mtree_erase(struct maple_tree *mt, unsigned long index);
>
> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +
> void mtree_destroy(struct maple_tree *mt);
> void __mt_destroy(struct maple_tree *mt);
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index da3a2fb405c0..efac6761ae37 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -6595,6 +6595,217 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> }
> EXPORT_SYMBOL(mtree_erase);
>
> +/*
> + * mt_dup_free() - Free the nodes of an incomplete maple tree.
> + * @mt: The incomplete maple tree
> + * @node: Free nodes from @node
> + *
> + * This function frees all nodes starting from @node in the reverse order of
> + * mt_dup_build(). At this point we don't need to hold the source tree lock.
> + */
> +static void mt_dup_free(struct maple_tree *mt, struct maple_node *node)
> +{
> + void **slots;
> + unsigned char offset;
> + struct maple_enode *enode;
> + enum maple_type type;
> + unsigned char count = 0, i;
> +
Can we make these labels inline functions and try to make this a loop?
> +try_ascend:
> + if (ma_is_root(node)) {
> + mt_free_one(node);
> + return;
> + }
> +
> + offset = ma_parent_slot(node);
> + type = ma_parent_type(mt, node);
> + node = ma_parent(node);
> + if (!offset)
> + goto free;
> +
> + offset--;
> +
> +descend:
> + slots = (void **)ma_slots(node, type);
> + enode = slots[offset];
> + if (mte_is_leaf(enode))
> + goto free;
> +
> + type = mte_node_type(enode);
> + node = mte_to_node(enode);
> + offset = ma_nonleaf_data_end_nocheck(node, type);
> + goto descend;
> +
> +free:
> + slots = (void **)ma_slots(node, type);
> + count = ma_nonleaf_data_end_nocheck(node, type) + 1;
> + for (i = 0; i < count; i++)
> + ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> +
> + /* Cast to __rcu to avoid sparse checker complaining. */
> + mt_free_bulk(count, (void __rcu **)slots);
> + goto try_ascend;
> +}
> +
> +/*
> + * mt_dup_build() - Build a new maple tree from a source tree
> + * @mt: The source maple tree to copy from
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + * @to_free: Free nodes starting from @to_free if the build fails
> + *
> + * This function builds a new tree in DFS preorder. If it fails due to memory
> + * allocation, @to_free will store the last failed node to free the incomplete
> + * tree. Use mt_dup_free() to free nodes.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> + */
> +static inline int mt_dup_build(struct maple_tree *mt, struct maple_tree *new,
> + gfp_t gfp, struct maple_node **to_free)
I am trying to change the functions to be two tabs of indent for
arguments from now on. It allows for more to fit on a single line and
still maintains a clear separation between code and argument lists.
> +{
> + struct maple_enode *enode;
> + struct maple_node *new_node, *new_parent = NULL, *node;
> + enum maple_type type;
> + void __rcu **slots;
> + void **new_slots;
> + unsigned char count, request, i, offset;
> + unsigned long *set_parent;
> + unsigned long new_root;
> +
> + mt_init_flags(new, mt->ma_flags);
> + enode = mt_root_locked(mt);
> + if (unlikely(!xa_is_node(enode))) {
> + rcu_assign_pointer(new->ma_root, enode);
> + return 0;
> + }
> +
> + new_node = mt_alloc_one(gfp);
> + if (!new_node)
> + return -ENOMEM;
> +
> + new_root = (unsigned long)new_node;
> + new_root |= (unsigned long)enode & MAPLE_NODE_MASK;
> +
> +copy_node:
Can you make copy_node, descend, and ascend inline functions instead of
the goto jumping please? It's better to have loops than to jump around a
lot. Gotos are good for undoing things and retries, but constructing
loops with them makes the code difficult to follow.
> + node = mte_to_node(enode);
> + type = mte_node_type(enode);
> + memcpy(new_node, node, sizeof(struct maple_node));
> +
> + set_parent = (unsigned long *)&(new_node->parent);
> + *set_parent &= MAPLE_NODE_MASK;
> + *set_parent |= (unsigned long)new_parent;
Maybe make a small inline to set the parent instead of this?
There are some defined helpers for setting the types like
ma_parent_ptr() and ma_enode_ptr() to make casting more type-safe.
> + if (ma_is_leaf(type))
> + goto ascend;
> +
> + new_slots = (void **)ma_slots(new_node, type);
> + slots = ma_slots(node, type);
> + request = ma_nonleaf_data_end(mt, node, type) + 1;
> + count = mt_alloc_bulk(gfp, request, new_slots);
> + if (!count) {
> + *to_free = new_node;
> + return -ENOMEM;
> + }
> +
> + for (i = 0; i < count; i++)
> + ((unsigned long *)new_slots)[i] |=
> + ((unsigned long)mt_slot_locked(mt, slots, i) &
> + MAPLE_NODE_MASK);
> + offset = 0;
> +
> +descend:
> + new_parent = new_node;
> + enode = mt_slot_locked(mt, slots, offset);
> + new_node = mte_to_node(new_slots[offset]);
> + goto copy_node;
> +
> +ascend:
> + if (ma_is_root(node)) {
> + new_node = mte_to_node((void *)new_root);
> + new_node->parent = ma_parent_ptr((unsigned long)new |
> + MA_ROOT_PARENT);
> + rcu_assign_pointer(new->ma_root, (void *)new_root);
> + return 0;
> + }
> +
> + offset = ma_parent_slot(node);
> + type = ma_parent_type(mt, node);
> + node = ma_parent(node);
> + new_node = ma_parent(new_node);
> + if (offset < ma_nonleaf_data_end(mt, node, type)) {
> + offset++;
> + new_slots = (void **)ma_slots(new_node, type);
> + slots = ma_slots(node, type);
> + goto descend;
> + }
> +
> + goto ascend;
> +}
> +
> +/**
> + * __mt_dup(): Duplicate a maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree using a faster method than traversing
> + * the source tree and inserting entries into the new tree one by one. The user
> + * needs to lock the source tree manually. Before calling this function, @new
> + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
> + * we may also need to manually set @new's external lock using
> + * mt_set_external_lock().
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> + */
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
We use mas_ for things that won't handle the locking and pass in a maple
state. Considering the leaves need to be altered once this is returned,
I would expect passing in a maple state should be feasible?
> +{
> + int ret;
> + struct maple_node *to_free = NULL;
> +
> + ret = mt_dup_build(mt, new, gfp, &to_free);
> +
> + if (unlikely(ret == -ENOMEM)) {
On other errors, will the half-constructed tree be returned? Is this
safe?
> + if (to_free)
> + mt_dup_free(new, to_free);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(__mt_dup);
> +
> +/**
> + * mt_dup(): Duplicate a maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree using a faster method than traversing
> + * the source tree and inserting entries into the new tree one by one. The
> + * function will lock the source tree with an internal lock, and the user does
> + * not need to manually handle the lock. Before calling this function, @new must
> + * be an empty tree or an uninitialized tree. If @mt uses an external lock, we
> + * may also need to manually set @new's external lock using
> + * mt_set_external_lock().
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> + */
> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
mtree_ is usually used to indicate that locking is handled.
> +{
> + int ret;
> + struct maple_node *to_free = NULL;
> +
> + mtree_lock(mt);
> + ret = mt_dup_build(mt, new, gfp, &to_free);
> + mtree_unlock(mt);
> +
> + if (unlikely(ret == -ENOMEM)) {
> + if (to_free)
> + mt_dup_free(new, to_free);
Again, is a half-constructed tree safe to return? Since each caller
checks whether to_free is NULL, could that check be moved into
mt_dup_free() instead?
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(mt_dup);
> +
> /**
> * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> * @mt: The maple tree
> --
> 2.20.1
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 05/11] maple_tree: Add test for mt_dup()
2023-07-26 8:09 ` [PATCH 05/11] maple_tree: Add test for mt_dup() Peng Zhang
@ 2023-07-26 16:06 ` Liam R. Howlett
2023-07-31 12:32 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 16:06 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> Add test for mt_dup().
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
> 1 file changed, 202 insertions(+)
>
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index e5da1cad70ba..3052e899e5df 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
> mt_clear_in_rcu(mt);
> }
>
> +/*
> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> + */
> +static int __init compare_node(struct maple_enode *enode_a,
> + struct maple_enode *enode_b)
> +{
> + struct maple_node *node_a, *node_b;
> + struct maple_node a, b;
> + void **slots_a, **slots_b; /* Do not use the rcu tag. */
> + enum maple_type type;
> + int i;
> +
> + if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
> + ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
> + pr_err("The lower 8 bits of enode are different.\n");
> + return -1;
> + }
> +
> + type = mte_node_type(enode_a);
> + node_a = mte_to_node(enode_a);
> + node_b = mte_to_node(enode_b);
> + a = *node_a;
> + b = *node_b;
> +
> + /* Do not compare addresses. */
> + if (ma_is_root(node_a) || ma_is_root(node_b)) {
> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> + MA_ROOT_PARENT);
> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> + MA_ROOT_PARENT);
> + } else {
> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> + MAPLE_NODE_MASK);
> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> + MAPLE_NODE_MASK);
> + }
> +
> + if (a.parent != b.parent) {
> + pr_err("The lower 8 bits of parents are different. %p %p\n",
> + a.parent, b.parent);
> + return -1;
> + }
> +
> + /*
> + * If it is a leaf node, the slots do not contain the node address, and
> + * no special processing of slots is required.
> + */
> + if (ma_is_leaf(type))
> + goto cmp;
> +
> + slots_a = ma_slots(&a, type);
> + slots_b = ma_slots(&b, type);
> +
> + for (i = 0; i < mt_slots[type]; i++) {
> + if (!slots_a[i] && !slots_b[i])
> + break;
> +
> + if (!slots_a[i] || !slots_b[i]) {
> + pr_err("The number of slots is different.\n");
> + return -1;
> + }
> +
> + /* Do not compare addresses in slots. */
> + ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
> + ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
> + }
> +
> +cmp:
> + /*
> + * Compare all contents of two nodes, including parent (except address),
> + * slots (except address), pivots, gaps and metadata.
> + */
> + return memcmp(&a, &b, sizeof(struct maple_node));
> +}
> +
> +/*
> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
> + */
> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
> +{
> + MA_STATE(mas_a, mt_a, 0, 0);
> + MA_STATE(mas_b, mt_b, 0, 0);
> +
> + if (mt_a->ma_flags != mt_b->ma_flags) {
> + pr_err("The flags of the two trees are different.\n");
> + return -1;
> + }
> +
> + mas_dfs_preorder(&mas_a);
> + mas_dfs_preorder(&mas_b);
> +
> + if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
> + if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
> + pr_err("One is MAS_ROOT and the other is not.\n");
> + return -1;
> + }
> + return 0;
> + }
> +
> + while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
> +
> + if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
> + pr_err("One is MAS_NONE and the other is not.\n");
> + return -1;
> + }
> +
> + if (mas_a.min != mas_b.min ||
> + mas_a.max != mas_b.max) {
> + pr_err("mas->min, mas->max do not match.\n");
> + return -1;
> + }
> +
> + if (compare_node(mas_a.node, mas_b.node)) {
> + pr_err("The contents of nodes %p and %p are different.\n",
> + mas_a.node, mas_b.node);
> + mt_dump(mt_a, mt_dump_dec);
> + mt_dump(mt_b, mt_dump_dec);
> + return -1;
> + }
> +
> + mas_dfs_preorder(&mas_a);
> + mas_dfs_preorder(&mas_b);
> + }
> +
> + return 0;
> +}
> +
> +static noinline void __init check_mt_dup(struct maple_tree *mt)
> +{
> + DEFINE_MTREE(new);
> + int i, j, ret, count = 0;
> +
> + /* stored in the root pointer */
> + mt_init_flags(&tree, 0);
> + mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
> + mt_dup(&tree, &new, GFP_KERNEL);
> + mt_validate(&new);
> + if (compare_tree(&tree, &new))
> + MT_BUG_ON(&new, 1);
> +
> + mtree_destroy(&tree);
> + mtree_destroy(&new);
> +
> + for (i = 0; i < 1000; i += 3) {
> + if (i & 1)
> + mt_init_flags(&tree, 0);
> + else
> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> +
> + for (j = 0; j < i; j++) {
> + mtree_store_range(&tree, j * 10, j * 10 + 5,
> + xa_mk_value(j), GFP_KERNEL);
> + }
Storing in this way is probably not checking a full tree. I think it's
important to check full trees/full nodes since you have changes that
detect the metadata.
> +
> + ret = mt_dup(&tree, &new, GFP_KERNEL);
> + MT_BUG_ON(&new, ret != 0);
> + mt_validate(&new);
> + if (compare_tree(&tree, &new))
> + MT_BUG_ON(&new, 1);
> +
> + mtree_destroy(&tree);
> + mtree_destroy(&new);
> + }
> +
> + /* Test memory allocation failed. */
> + for (i = 0; i < 1000; i += 3) {
> + if (i & 1)
> + mt_init_flags(&tree, 0);
> + else
> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> +
> + for (j = 0; j < i; j++) {
> + mtree_store_range(&tree, j * 10, j * 10 + 5,
> + xa_mk_value(j), GFP_KERNEL);
> + }
> +
> + mt_set_non_kernel(50);
It may be worthwhile to allow more or fewer than 50 allocations.
> + ret = mt_dup(&tree, &new, GFP_NOWAIT);
> + mt_set_non_kernel(0);
> + if (ret != 0) {
> + MT_BUG_ON(&new, ret != -ENOMEM);
> + count++;
> + mtree_destroy(&tree);
> + continue;
> + }
> +
> + mt_validate(&new);
> + if (compare_tree(&tree, &new))
> + MT_BUG_ON(&new, 1);
> +
> + mtree_destroy(&tree);
> + mtree_destroy(&new);
> + }
> +
> + /* pr_info("mt_dup() fail %d times\n", count); */
> + BUG_ON(!count);
> +}
> +
> extern void test_kmem_cache_bulk(void);
>
> void farmer_tests(void)
> @@ -35904,6 +36102,10 @@ void farmer_tests(void)
> check_null_expand(&tree);
> mtree_destroy(&tree);
>
> + mt_init_flags(&tree, 0);
> + check_mt_dup(&tree);
> + mtree_destroy(&tree);
> +
> /* RCU testing */
> mt_init_flags(&tree, 0);
> check_erase_testset(&tree);
> --
> 2.20.1
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-07-26 8:09 ` [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry Peng Zhang
@ 2023-07-26 16:08 ` Liam R. Howlett
2023-07-31 12:39 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 16:08 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> If mas has located a specific entry, it may need to replace this
> entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
> will be more efficient than mas_store*() because it doesn't do many
> unnecessary checks.
>
> This function should be inline, but more functions need to be moved to
> the header file, so I didn't do it for the time being.
I am really nervous about having no checks here. I get that this could be
used for duplicating the tree more efficiently, but having a function
that just swaps a value in is very dangerous - especially since it is
decoupled from the tree duplication code.
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> include/linux/maple_tree.h | 1 +
> lib/maple_tree.c | 25 +++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index 229fe78e4c89..a05e9827d761 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -462,6 +462,7 @@ struct ma_wr_state {
>
> void *mas_walk(struct ma_state *mas);
> void *mas_store(struct ma_state *mas, void *entry);
> +void mas_replace_entry(struct ma_state *mas, void *entry);
> void *mas_erase(struct ma_state *mas);
> int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
> void mas_store_prealloc(struct ma_state *mas, void *entry);
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index efac6761ae37..d58572666a00 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
> }
> EXPORT_SYMBOL_GPL(mas_store);
>
> +/**
> + * mas_replace_entry() - Replace an entry that already exists in the maple tree
> + * @mas: The maple state
> + * @entry: The entry to store
> + *
> + * Please note that mas must already locate an existing entry, and the new entry
> + * must not be NULL. If these two points cannot be guaranteed, please use
> + * mas_store*() instead, otherwise it will cause an internal error in the maple
> + * tree. This function does not need to allocate memory, so it must succeed.
> + */
> +void mas_replace_entry(struct ma_state *mas, void *entry)
> +{
> + void __rcu **slots;
> +
> +#ifdef CONFIG_DEBUG_MAPLE_TREE
> + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
> + MAS_WARN_ON(mas, !entry);
> + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
> +#endif
> +
> + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
> + rcu_assign_pointer(slots[mas->offset], entry);
> +}
> +EXPORT_SYMBOL_GPL(mas_replace_entry);
> +
> /**
> * mas_store_gfp() - Store a value into the tree.
> * @mas: The maple state
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree
2023-07-26 8:09 ` [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree Peng Zhang
@ 2023-07-26 16:39 ` Liam R. Howlett
2023-07-31 12:55 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 16:39 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> Add myself as co-maintainer for maple tree. I would like to assist
> Liam R. Howlett in maintaining maple tree. I will continue to contribute
> to the development of maple tree in the future.
Sorry, but no.
I appreciate the patches, bug fixes, and code review but there is no
need for another maintainer for the tree at this time.
Thank you,
Liam
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> MAINTAINERS | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ddc71b815791..8cfedd492509 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -12526,6 +12526,7 @@ F: net/mctp/
>
> MAPLE TREE
> M: Liam R. Howlett <Liam.Howlett@oracle.com>
> +M: Peng Zhang <zhangpeng.00@bytedance.com>
> L: linux-mm@kvack.org
> S: Supported
> F: Documentation/core-api/maple_tree.rst
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
2023-07-26 8:09 ` [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
@ 2023-07-26 17:06 ` Liam R. Howlett
2023-07-31 12:59 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-26 17:06 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> directly modify the entries of VMAs in the new maple tree, which can
> get better performance. dup_mmap() is used by fork(), so this patch
> optimizes fork(). The optimization effect is proportional to the number
> of VMAs.
>
> Due to the introduction of this method, the optimization in
> (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer
> has an effect here, but it is also an optimization of the maple tree.
>
> There is a unixbench test suite[2] where 'spawn' is used to test fork().
> 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a
> bit to use mmap() to control the number of VMAs. Therefore, the
> performance under different numbers of VMAs can be measured.
>
> Insert code like below into 'spawn':
> for (int i = 0; i < 200; ++i) {
> size_t size = 10 * getpagesize();
> void *addr;
>
> if (i & 1) {
> addr = mmap(NULL, size, PROT_READ,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> } else {
> addr = mmap(NULL, size, PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> }
> if (addr == MAP_FAILED)
> ...
> }
>
> Based on next-20230721, use 'spawn' under 23, 203, and 4023 VMAs, test
> 4 times in 30 seconds each time, and get the following numbers. These
> numbers are the number of fork() successes in 30s (average of the best
> 3 out of 4). By the way, based on next-20230725, I reverted [1], and
> tested it together as a comparison. In order to ensure the reliability
> of the test results, these tests were run on a physical machine.
>
> 23VMAs 223VMAs 4023VMAs
> revert [1]: 159104.00 73316.33 6787.00
You can probably remove the revert benchmark from this since there is no
reason to revert the previous change. The change is worthwhile on its
own, so it's better to make the numbers clearer by benchmarking with and
without this series.
>
> +0.77% +0.42% +0.28%
> next-20230721: 160321.67 73624.67 6806.33
>
> +2.77% +15.42% +29.86%
> apply this: 164751.67 84980.33 8838.67
What is the difference between using this patch with mas_replace_entry()
and mas_store_entry()?
>
> It can be seen that the performance improvement is proportional to
> the number of VMAs. With 23 VMAs, performance improves by about 3%,
> with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
> performance improves by about 30%.
>
> [1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@bytedance.com/
> [2] https://github.com/kdlucas/byte-unixbench/tree/master
>
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
> kernel/fork.c | 35 +++++++++++++++++++++++++++--------
> mm/mmap.c | 14 ++++++++++++--
> 2 files changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f81149739eb9..ef80025b62d6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> int retval;
> unsigned long charge = 0;
> LIST_HEAD(uf);
> - VMA_ITERATOR(old_vmi, oldmm, 0);
> VMA_ITERATOR(vmi, mm, 0);
>
> uprobe_start_dup_mmap();
> @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> goto out;
> khugepaged_fork(mm, oldmm);
>
> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> - if (retval)
> + /* Use __mt_dup() to efficiently build an identical maple tree. */
> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> + if (unlikely(retval))
> goto out;
>
> mt_clear_in_rcu(vmi.mas.tree);
> - for_each_vma(old_vmi, mpnt) {
> + for_each_vma(vmi, mpnt) {
> struct file *file;
>
> vma_start_write(mpnt);
> if (mpnt->vm_flags & VM_DONTCOPY) {
> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> +
> + /*
> + * Since the new tree is exactly the same as the old one,
> + * we need to remove the unneeded VMAs.
> + */
> + mas_store(&vmi.mas, NULL);
> +
> + /*
> + * Even removing an entry may require memory allocation,
> + * and if removal fails, we use XA_ZERO_ENTRY to mark
> + * from which VMA it failed. The case of encountering
> + * XA_ZERO_ENTRY will be handled in exit_mmap().
> + */
> + if (unlikely(mas_is_err(&vmi.mas))) {
> + retval = xa_err(vmi.mas.node);
> + mas_reset(&vmi.mas);
> + if (mas_find(&vmi.mas, ULONG_MAX))
> + mas_replace_entry(&vmi.mas,
> + XA_ZERO_ENTRY);
> + goto loop_out;
> + }
> +
> continue;
> }
> charge = 0;
> @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> hugetlb_dup_vma_private(tmp);
>
> /* Link the vma into the MT */
> - if (vma_iter_bulk_store(&vmi, tmp))
> - goto fail_nomem_vmi_store;
> + mas_replace_entry(&vmi.mas, tmp);
>
> mm->map_count++;
> if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> uprobe_end_dup_mmap();
> return retval;
>
> -fail_nomem_vmi_store:
> - unlink_anon_vmas(tmp);
> fail_nomem_anon_vma_fork:
> mpol_put(vma_policy(tmp));
> fail_nomem_policy:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bc91d91261ab..5bfba2fb0e39 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
> arch_exit_mmap(mm);
>
> vma = mas_find(&mas, ULONG_MAX);
> - if (!vma) {
> + /*
> + * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> + * xa_is_zero(vma) may be true.
> + */
> + if (!vma || xa_is_zero(vma)) {
> /* Can happen if dup_mmap() received an OOM */
> mmap_read_unlock(mm);
> return;
> @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
> remove_vma(vma, true);
> count++;
> cond_resched();
> - } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> + vma = mas_find(&mas, ULONG_MAX);
> + /*
> + * If xa_is_zero(vma) is true, it means that subsequent VMAs
> + * do not need to be removed. Can happen if dup_mmap() fails to
> + * remove a VMA marked VM_DONTCOPY.
> + */
> + } while (vma != NULL && !xa_is_zero(vma));
>
> BUG_ON(count != mm->map_count);
>
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()
2023-07-26 14:58 ` Liam R. Howlett
@ 2023-07-31 9:52 ` Peng Zhang
2023-07-31 16:08 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 9:52 UTC (permalink / raw)
To: Liam R. Howlett
Cc: corbet, akpm, willy, brauner, surenb, michael.christie,
mathieu.desnoyers, peterz, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
On 2023/7/26 22:58, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
>> non-leaf nodes without knowing the maximum value of nodes, so that any
>> ascending can be avoided even if the maximum value of nodes is not known.
>>
>> The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
>> cannot be used by metadata, so we can distinguish whether it is ENODE or
>> metadata.
>>
>> The nocheck version is to avoid lockdep complaining in some scenarios
>> where no locks are held.
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 68 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index a3d602cfd030..98e4fdf6f4b9 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
>> #define MAPLE_ENODE_TYPE_SHIFT 0x03
>> /* Bit 2 means a NULL somewhere below */
>> #define MAPLE_ENODE_NULL 0x04
>> +/* Bit 7 means this is an ENODE, instead of metadata */
>> +#define MAPLE_ENODE 0x80
>
> We were saving this bit for more node types. I don't want to use this
> bit for this reason since you could have done BFS to duplicate the tree
> using the existing way to find the node end.
We have reserved 4 bits for the node type. I don't think there will be
more than 16 node types going forward.
Even the DFS that has been implemented can use the existing way to get
the data end. I didn't use it because when walking up the tree, we don't
know the maximum value of the node, and the continuous upward walk will
introduce more overhead, which is what mas_ascend() does. Doing BFS
cannot avoid this problem either.
The reason I don't do BFS is that it has more overhead than DFS. If you
think of a tree as a graph, doing DFS will only walk each edge twice.
What if it is BFS? Since we can't use queues, we can only emulate BFS,
which additionally has to do something like what mas_next_node() does,
introducing more overhead than DFS. Considering only the leaf level, it
already needs to walk each edge twice. So the overhead of BFS is higher
than that of DFS.
>
> Bits are highly valuable and this is the only remaining bit. I had
> thought about using this in Feb 2021 to see if there was metadata or
> not, but figured a way around it (using the max trick) and thus saved
> this bit for potential expansion of node types.
I thought of another way to get the maximum value of a node without
doing any extra upward walk. When doing DFS, we can use a stack to save
the maximum value of ancestor nodes. The stack size can be set to
MAPLE_HEIGHT_MAX. In this way, this bit can be reserved, and there is no
need to do a loop like mas_ascend() in order to get the maximum value.
>
>> +
>> +static inline bool slot_is_mte(unsigned long slot)
>> +{
>> + return slot & MAPLE_ENODE;
>> +}
>>
>> static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
>> enum maple_type type)
>> {
>> - return (void *)((unsigned long)node |
>> - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
>> + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
>> + MAPLE_ENODE_NULL | MAPLE_ENODE);
>> }
>>
>> static inline void *mte_mk_root(const struct maple_enode *node)
>> @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
>> return NULL;
>> }
>>
>> +/*
>> + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
>> + * @mt: The maple tree
>> + * @node: The maple node
>> + * @type: The maple node type
>> + *
>> + * Uses metadata to find the end of the data when possible without knowing the
>> + * node maximum.
>> + *
>> + * Return: The zero indexed last slot with child.
>> + */
>> +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
>> + struct maple_node *node,
>> + enum maple_type type)
>> +{
>> + void __rcu **slots;
>> + unsigned long slot;
>> +
>> + slots = ma_slots(node, type);
>> + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
>> + if (unlikely(slot_is_mte(slot)))
>> + return mt_pivots[type];
>> +
>> + return ma_meta_end(node, type);
>> +}
>> +
>> +/*
>> + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
>> + * @node: The maple node
>> + * @type: The maple node type
>> + *
>> + * Uses metadata to find the end of the data when possible without knowing the
>> + * node maximum. This is the version of ma_nonleaf_data_end() that does not
>> + * check for lock held. This particular version is designed to avoid lockdep
>> + * complaining in some scenarios.
>> + *
>> + * Return: The zero indexed last slot with child.
>> + */
>> +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
>> + enum maple_type type)
>> +{
>> + void __rcu **slots;
>> + unsigned long slot;
>> +
>> + slots = ma_slots(node, type);
>> + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
>> + if (unlikely(slot_is_mte(slot)))
>> + return mt_pivots[type];
>> +
>> + return ma_meta_end(node, type);
>> +}
>> +
>> +/* See ma_nonleaf_data_end() */
>> +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
>> + struct maple_enode *enode)
>> +{
>> + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
>> +}
>> +
>> /*
>> * ma_data_end() - Find the end of the data in a node.
>> * @node: The maple node
>> --
>> 2.20.1
>>
>>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 03/11] maple_tree: Add some helper functions
2023-07-26 15:02 ` Liam R. Howlett
2023-07-26 15:08 ` Matthew Wilcox
@ 2023-07-31 11:40 ` Peng Zhang
1 sibling, 0 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 11:40 UTC (permalink / raw)
To: Liam R. Howlett
Cc: peterz, mathieu.desnoyers, brauner, Peng Zhang, surenb,
linux-kernel, npiggin, linux-fsdevel, linux-doc, akpm, corbet,
willy, linux-mm, avagin, michael.christie
On 2023/7/26 23:02, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> Add some helper functions so that their parameters are maple node
>> instead of maple enode, these functions will be used later.
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> lib/maple_tree.c | 71 +++++++++++++++++++++++++++++++++++++-----------
>> 1 file changed, 55 insertions(+), 16 deletions(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index e0e9a87bdb43..da3a2fb405c0 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -164,6 +164,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>> return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>> }
>>
>> +static inline void mt_free_one(struct maple_node *node)
>> +{
>> + kmem_cache_free(maple_node_cache, node);
>> +}
>> +
>
> There is a place in mas_destroy() that could use this if it is added.
I will make changes accordingly. It's not done here because it doesn't
seem to be relevant to the theme of this patchset.
>
>> static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>> {
>> kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>> @@ -432,18 +437,18 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
>> }
>>
>> /*
>> - * mas_parent_type() - Return the maple_type of the parent from the stored
>> - * parent type.
>> - * @mas: The maple state
>> - * @enode: The maple_enode to extract the parent's enum
>> + * ma_parent_type() - Return the maple_type of the parent from the stored parent
>> + * type.
>> + * @mt: The maple tree
>> + * @node: The maple_node to extract the parent's enum
>> * Return: The node->parent maple_type
>> */
>> static inline
>> -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
>
> I was trying to keep ma_* prefix to mean the first argument is
> maple_node and mt_* to mean maple_tree. I wasn't entirely successful
> with this and I do see why you want to use ma_, but maybe reverse the
> arguments here?
I just think it is redundant to construct a maple enode from
node->parent merely to fit the parameters of the mte_* helpers. The
ma_* variants are introduced to avoid that pointless conversion.
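The conversion being avoided can be sketched in plain C. The model below uses hypothetical names and an assumed 256-byte node alignment, not the real maple tree layout; it only illustrates that an enode is a node pointer with type bits packed into the low bits, so an enode-based helper pays a masking step that a node-based helper skips:

```c
#include <assert.h>
#include <stdlib.h>

#define NODE_MASK 0xffUL	/* low bits carry the node type, as MAPLE_NODE_MASK does */

struct node { struct node *parent; };

/* Encode a node pointer with a type tag in the low bits. */
static inline void *encode(struct node *n, unsigned long type)
{
	return (void *)((unsigned long)n | (type & NODE_MASK));
}

/* Strip the tag to recover the raw node pointer. */
static inline struct node *decode(void *enode)
{
	return (struct node *)((unsigned long)enode & ~NODE_MASK);
}

/* enode-based helper: must decode before touching the node. */
static inline struct node *enode_parent(void *enode)
{
	return (struct node *)((unsigned long)decode(enode)->parent & ~NODE_MASK);
}

/* node-based helper: operates on the raw pointer directly. */
static inline struct node *node_parent(const struct node *n)
{
	return (struct node *)((unsigned long)n->parent & ~NODE_MASK);
}
```

With ma_parent()-style helpers, code that already holds a raw node pointer never has to re-tag it just so a helper can strip the tag again.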
>
>> {
>> unsigned long p_type;
>>
>> - p_type = (unsigned long)mte_to_node(enode)->parent;
>> + p_type = (unsigned long)node->parent;
>> if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
>> return 0;
>>
>> @@ -451,7 +456,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> p_type &= ~mte_parent_slot_mask(p_type);
>> switch (p_type) {
>> case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
>> - if (mt_is_alloc(mas->tree))
>> + if (mt_is_alloc(mt))
>> return maple_arange_64;
>> return maple_range_64;
>> }
>> @@ -459,6 +464,19 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> return 0;
>> }
>>
>> +/*
>> + * mas_parent_type() - Return the maple_type of the parent from the stored
>> + * parent type.
>> + * @mas: The maple state
>> + * @enode: The maple_enode to extract the parent's enum
>> + * Return: The node->parent maple_type
>> + */
>> +static inline
>> +enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> +{
>> + return ma_parent_type(mas->tree, mte_to_node(enode));
>> +}
>> +
>> /*
>> * mas_set_parent() - Set the parent node and encode the slot
>> * @enode: The encoded maple node.
>> @@ -499,14 +517,14 @@ void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>> }
>>
>> /*
>> - * mte_parent_slot() - get the parent slot of @enode.
>> - * @enode: The encoded maple node.
>> + * ma_parent_slot() - get the parent slot of @node.
>> + * @node: The maple node.
>> *
>> - * Return: The slot in the parent node where @enode resides.
>> + * Return: The slot in the parent node where @node resides.
>> */
>> -static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
>> +static inline unsigned int ma_parent_slot(const struct maple_node *node)
>> {
>> - unsigned long val = (unsigned long)mte_to_node(enode)->parent;
>> + unsigned long val = (unsigned long)node->parent;
>>
>> if (val & MA_ROOT_PARENT)
>> return 0;
>> @@ -519,15 +537,36 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
>> }
>>
>> /*
>> - * mte_parent() - Get the parent of @node.
>> - * @node: The encoded maple node.
>> + * mte_parent_slot() - get the parent slot of @enode.
>> + * @enode: The encoded maple node.
>> + *
>> + * Return: The slot in the parent node where @enode resides.
>> + */
>> +static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
>> +{
>> + return ma_parent_slot(mte_to_node(enode));
>> +}
>> +
>> +/*
>> + * ma_parent() - Get the parent of @node.
>> + * @node: The maple node.
>> + *
>> + * Return: The parent maple node.
>> + */
>> +static inline struct maple_node *ma_parent(const struct maple_node *node)
>
> I had a lot of these helpers before, but they eventually became used so
> little that I dropped them.
This was just to avoid constructing a maple enode; it's not a real
problem.
>
>> +{
>> + return (void *)((unsigned long)(node->parent) & ~MAPLE_NODE_MASK);
>> +}
>> +
>> +/*
>> + * mte_parent() - Get the parent of @enode.
>> + * @enode: The encoded maple node.
>> *
>> * Return: The parent maple node.
>> */
>> static inline struct maple_node *mte_parent(const struct maple_enode *enode)
>> {
>> - return (void *)((unsigned long)
>> - (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
>> + return ma_parent(mte_to_node(enode));
>> }
>>
>> /*
>> --
>> 2.20.1
>>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 03/11] maple_tree: Add some helper functions
2023-07-26 15:08 ` Matthew Wilcox
@ 2023-07-31 11:45 ` Peng Zhang
2023-08-11 17:28 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 11:45 UTC (permalink / raw)
To: Matthew Wilcox, Liam R. Howlett
Cc: mathieu.desnoyers, npiggin, Peng Zhang, corbet, akpm,
michael.christie, peterz, linux-mm, linux-kernel, linux-fsdevel,
linux-doc, avagin, surenb, brauner
On 2023/7/26 23:08, Matthew Wilcox wrote:
> On Wed, Jul 26, 2023 at 11:02:52AM -0400, Liam R. Howlett wrote:
>> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>>> static inline
>>> -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>>> +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
>>
>> I was trying to keep ma_* prefix to mean the first argument is
>> maple_node and mt_* to mean maple_tree. I wasn't entirely successful
>> with this and I do see why you want to use ma_, but maybe reverse the
>> arguments here?
>
> I think your first idea is better. Usually we prefer to order the
> arguments by "containing thing" to "contained thing". So always use
> (struct address_space *, struct folio *), for example. Or (struct
> mm_struct *, struct vm_area_struct *).
There is disagreement here, so how should we decide? That said, I'm not
sure the new version will still have this helper.
>
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-07-26 16:03 ` Liam R. Howlett
@ 2023-07-31 12:24 ` Peng Zhang
2023-07-31 16:27 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 12:24 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Peng Zhang, willy, michael.christie, surenb, npiggin, corbet,
mathieu.desnoyers, avagin, linux-fsdevel, linux-kernel, linux-doc,
linux-mm, akpm, brauner, peterz
On 2023/7/27 00:03, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> Introduce interfaces __mt_dup() and mt_dup(), which are used to
>> duplicate a maple tree. Compared with traversing the source tree and
>> reinserting entry by entry in the new tree, it has better performance.
>> The difference between __mt_dup() and mt_dup() is that mt_dup() holds
>> an internal lock.
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> include/linux/maple_tree.h | 3 +
>> lib/maple_tree.c | 211 +++++++++++++++++++++++++++++++++++++
>> 2 files changed, 214 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index c962af188681..229fe78e4c89 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>> void *entry, gfp_t gfp);
>> void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>
>> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +
>> void mtree_destroy(struct maple_tree *mt);
>> void __mt_destroy(struct maple_tree *mt);
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index da3a2fb405c0..efac6761ae37 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6595,6 +6595,217 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>> }
>> EXPORT_SYMBOL(mtree_erase);
>>
>> +/*
>> + * mt_dup_free() - Free the nodes of an incomplete maple tree.
>> + * @mt: The incomplete maple tree
>> + * @node: Free nodes from @node
>> + *
>> + * This function frees all nodes starting from @node in the reverse order of
>> + * mt_dup_build(). At this point we don't need to hold the source tree lock.
>> + */
>> +static void mt_dup_free(struct maple_tree *mt, struct maple_node *node)
>> +{
>> + void **slots;
>> + unsigned char offset;
>> + struct maple_enode *enode;
>> + enum maple_type type;
>> + unsigned char count = 0, i;
>> +
>
> Can we make these labels inline functions and try to make this a loop?
I did it this way just to keep things simple; see the implementation of
walk_tg_tree_from() in sched/core.c for a similar pattern. Rewriting it
with loops and inline functions probably won't simplify anything, but
I'll try it and fall back to this version if it gets more complicated.
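The goto pattern under discussion is a standard stackless preorder DFS: descend to the first child when one exists, otherwise ascend through parent pointers until an unvisited right sibling is found. A generic sketch of the loop form being suggested (simplified fixed-fanout nodes with a stored parent slot; hypothetical types, not the maple tree code):

```c
#include <assert.h>
#include <stddef.h>

#define FANOUT 4

struct tnode {
	struct tnode *parent;
	unsigned char pslot;		/* slot of this node in its parent */
	unsigned char nr;		/* number of children */
	struct tnode *slot[FANOUT];
};

/* Return the next node in DFS preorder, or NULL when the walk is done. */
static struct tnode *preorder_next(struct tnode *root, struct tnode *n)
{
	if (n->nr)			/* descend to the first child */
		return n->slot[0];

	while (n != root) {		/* ascend until a right sibling exists */
		unsigned char s = n->pslot;

		n = n->parent;
		if (s + 1 < n->nr)
			return n->slot[s + 1];
	}
	return NULL;
}
```

The maple tree version additionally has to decode slots and detect leaves, but the control flow is the same descend/ascend skeleton the gotos express.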
>
>> +try_ascend:
>> + if (ma_is_root(node)) {
>> + mt_free_one(node);
>> + return;
>> + }
>> +
>> + offset = ma_parent_slot(node);
>> + type = ma_parent_type(mt, node);
>> + node = ma_parent(node);
>> + if (!offset)
>> + goto free;
>> +
>> + offset--;
>> +
>> +descend:
>> + slots = (void **)ma_slots(node, type);
>> + enode = slots[offset];
>> + if (mte_is_leaf(enode))
>> + goto free;
>> +
>> + type = mte_node_type(enode);
>> + node = mte_to_node(enode);
>> + offset = ma_nonleaf_data_end_nocheck(node, type);
>> + goto descend;
>> +
>> +free:
>> + slots = (void **)ma_slots(node, type);
>> + count = ma_nonleaf_data_end_nocheck(node, type) + 1;
>> + for (i = 0; i < count; i++)
>> + ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>> +
>> + /* Cast to __rcu to avoid sparse checker complaining. */
>> + mt_free_bulk(count, (void __rcu **)slots);
>> + goto try_ascend;
>> +}
>> +
>> +/*
>> + * mt_dup_build() - Build a new maple tree from a source tree
>> + * @mt: The source maple tree to copy from
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + * @to_free: Free nodes starting from @to_free if the build fails
>> + *
>> + * This function builds a new tree in DFS preorder. If it fails due to memory
>> + * allocation, @to_free will store the last failed node to free the incomplete
>> + * tree. Use mt_dup_free() to free nodes.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>> + */
>> +static inline int mt_dup_build(struct maple_tree *mt, struct maple_tree *new,
>> + gfp_t gfp, struct maple_node **to_free)
>
> I am trying to change the functions to be two tabs of indent for
> arguments from now on. It allows for more to fit on a single line and
> still maintains a clear separation between code and argument lists.
I'm not too concerned about code formatting, at least in this patchset.
>
>> +{
>> + struct maple_enode *enode;
>> + struct maple_node *new_node, *new_parent = NULL, *node;
>> + enum maple_type type;
>> + void __rcu **slots;
>> + void **new_slots;
>> + unsigned char count, request, i, offset;
>> + unsigned long *set_parent;
>> + unsigned long new_root;
>> +
>> + mt_init_flags(new, mt->ma_flags);
>> + enode = mt_root_locked(mt);
>> + if (unlikely(!xa_is_node(enode))) {
>> + rcu_assign_pointer(new->ma_root, enode);
>> + return 0;
>> + }
>> +
>> + new_node = mt_alloc_one(gfp);
>> + if (!new_node)
>> + return -ENOMEM;
>> +
>> + new_root = (unsigned long)new_node;
>> + new_root |= (unsigned long)enode & MAPLE_NODE_MASK;
>> +
>> +copy_node:
>
> Can you make copy_node, descend, ascend inline functions instead of the
> goto jumping please? It's better to have loops over jumping around a
> lot. Gotos are good for undoing things and retry, but constructing
> loops with them makes it difficult to follow.
Same as above.
>
>> + node = mte_to_node(enode);
>> + type = mte_node_type(enode);
>> + memcpy(new_node, node, sizeof(struct maple_node));
>> +
>> + set_parent = (unsigned long *)&(new_node->parent);
>> + *set_parent &= MAPLE_NODE_MASK;
>> + *set_parent |= (unsigned long)new_parent;
>
> Maybe make a small inline to set the parent instead of this?
>
> There are some defined helpers for setting the types like
> ma_parent_ptr() and ma_enode_ptr() to make casting more type-safe.
Ok, I'll try to do that.
>
>> + if (ma_is_leaf(type))
>> + goto ascend;
>> +
>> + new_slots = (void **)ma_slots(new_node, type);
>> + slots = ma_slots(node, type);
>> + request = ma_nonleaf_data_end(mt, node, type) + 1;
>> + count = mt_alloc_bulk(gfp, request, new_slots);
>> + if (!count) {
>> + *to_free = new_node;
>> + return -ENOMEM;
>> + }
>> +
>> + for (i = 0; i < count; i++)
>> + ((unsigned long *)new_slots)[i] |=
>> + ((unsigned long)mt_slot_locked(mt, slots, i) &
>> + MAPLE_NODE_MASK);
>> + offset = 0;
>> +
>> +descend:
>> + new_parent = new_node;
>> + enode = mt_slot_locked(mt, slots, offset);
>> + new_node = mte_to_node(new_slots[offset]);
>> + goto copy_node;
>> +
>> +ascend:
>> + if (ma_is_root(node)) {
>> + new_node = mte_to_node((void *)new_root);
>> + new_node->parent = ma_parent_ptr((unsigned long)new |
>> + MA_ROOT_PARENT);
>> + rcu_assign_pointer(new->ma_root, (void *)new_root);
>> + return 0;
>> + }
>> +
>> + offset = ma_parent_slot(node);
>> + type = ma_parent_type(mt, node);
>> + node = ma_parent(node);
>> + new_node = ma_parent(new_node);
>> + if (offset < ma_nonleaf_data_end(mt, node, type)) {
>> + offset++;
>> + new_slots = (void **)ma_slots(new_node, type);
>> + slots = ma_slots(node, type);
>> + goto descend;
>> + }
>> +
>> + goto ascend;
>> +}
>> +
>> +/**
>> + * __mt_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one. The user
>> + * needs to lock the source tree manually. Before calling this function, @new
>> + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
>> + * we may also need to manually set @new's external lock using
>> + * mt_set_external_lock().
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>> + */
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>
> We use mas_ for things that won't handle the locking and pass in a maple
> state. Considering the leaves need to be altered once this is returned,
> I would expect passing in a maple state should be feasible?
But we don't really need a mas here. What do you think the state of the
mas should be when this function returns? Should it point to the first
entry, or the last?
>
>> +{
>> + int ret;
>> + struct maple_node *to_free = NULL;
>> +
>> + ret = mt_dup_build(mt, new, gfp, &to_free);
>> +
>> + if (unlikely(ret == -ENOMEM)) {
>
> On other errors, will the half constructed tree be returned? Is this
> safe?
Of course, mt_dup_free() is carefully designed to handle this.
>
>> + if (to_free)
>> + mt_dup_free(new, to_free);
>> + }
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(__mt_dup);
>> +
>> +/**
>> + * mt_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one. The
>> + * function will lock the source tree with an internal lock, and the user does
>> + * not need to manually handle the lock. Before calling this function, @new must
>> + * be an empty tree or an uninitialized tree. If @mt uses an external lock, we
>> + * may also need to manually set @new's external lock using
>> + * mt_set_external_lock().
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>> + */
>> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>
> mtree_ is usually used to indicate that locking is handled.
Until mtree_* and mt_* are unified, I can't see a clear difference
between them; at least mt_set_in_rcu() and mt_clear_in_rcu() do hold
the lock.
>
>> +{
>> + int ret;
>> + struct maple_node *to_free = NULL;
>> +
>> + mtree_lock(mt);
>> + ret = mt_dup_build(mt, new, gfp, &to_free);
>> + mtree_unlock(mt);
>> +
>> + if (unlikely(ret == -ENOMEM)) {
>> + if (to_free)
>> + mt_dup_free(new, to_free);
>
> Again, is a half constructed tree safe to return? Since each caller
> checks to_free is NULL, could that be in mt_dup_free() instead?
Yes, this check can be put in mt_dup_free().
>
>> + }
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(mt_dup);
>> +
>> /**
>> * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>> * @mt: The maple tree
>> --
>> 2.20.1
>>
>>
* Re: [PATCH 05/11] maple_tree: Add test for mt_dup()
2023-07-26 16:06 ` Liam R. Howlett
@ 2023-07-31 12:32 ` Peng Zhang
2023-07-31 16:41 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 12:32 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Peng Zhang, linux-doc, linux-mm, avagin, npiggin,
mathieu.desnoyers, peterz, michael.christie, surenb, brauner,
willy, corbet, linux-fsdevel, akpm, linux-kernel
On 2023/7/27 00:06, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> Add test for mt_dup().
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
>> 1 file changed, 202 insertions(+)
>>
>> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
>> index e5da1cad70ba..3052e899e5df 100644
>> --- a/tools/testing/radix-tree/maple.c
>> +++ b/tools/testing/radix-tree/maple.c
>> @@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
>> mt_clear_in_rcu(mt);
>> }
>>
>> +/*
>> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
>> + */
>> +static int __init compare_node(struct maple_enode *enode_a,
>> + struct maple_enode *enode_b)
>> +{
>> + struct maple_node *node_a, *node_b;
>> + struct maple_node a, b;
>> + void **slots_a, **slots_b; /* Do not use the rcu tag. */
>> + enum maple_type type;
>> + int i;
>> +
>> + if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
>> + ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
>> + pr_err("The lower 8 bits of enode are different.\n");
>> + return -1;
>> + }
>> +
>> + type = mte_node_type(enode_a);
>> + node_a = mte_to_node(enode_a);
>> + node_b = mte_to_node(enode_b);
>> + a = *node_a;
>> + b = *node_b;
>> +
>> + /* Do not compare addresses. */
>> + if (ma_is_root(node_a) || ma_is_root(node_b)) {
>> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> + MA_ROOT_PARENT);
>> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> + MA_ROOT_PARENT);
>> + } else {
>> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> + MAPLE_NODE_MASK);
>> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> + MAPLE_NODE_MASK);
>> + }
>> +
>> + if (a.parent != b.parent) {
>> + pr_err("The lower 8 bits of parents are different. %p %p\n",
>> + a.parent, b.parent);
>> + return -1;
>> + }
>> +
>> + /*
>> + * If it is a leaf node, the slots do not contain the node address, and
>> + * no special processing of slots is required.
>> + */
>> + if (ma_is_leaf(type))
>> + goto cmp;
>> +
>> + slots_a = ma_slots(&a, type);
>> + slots_b = ma_slots(&b, type);
>> +
>> + for (i = 0; i < mt_slots[type]; i++) {
>> + if (!slots_a[i] && !slots_b[i])
>> + break;
>> +
>> + if (!slots_a[i] || !slots_b[i]) {
>> + pr_err("The number of slots is different.\n");
>> + return -1;
>> + }
>> +
>> + /* Do not compare addresses in slots. */
>> + ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
>> + ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
>> + }
>> +
>> +cmp:
>> + /*
>> + * Compare all contents of two nodes, including parent (except address),
>> + * slots (except address), pivots, gaps and metadata.
>> + */
>> + return memcmp(&a, &b, sizeof(struct maple_node));
>> +}
>> +
>> +/*
>> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
>> + */
>> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
>> +{
>> + MA_STATE(mas_a, mt_a, 0, 0);
>> + MA_STATE(mas_b, mt_b, 0, 0);
>> +
>> + if (mt_a->ma_flags != mt_b->ma_flags) {
>> + pr_err("The flags of the two trees are different.\n");
>> + return -1;
>> + }
>> +
>> + mas_dfs_preorder(&mas_a);
>> + mas_dfs_preorder(&mas_b);
>> +
>> + if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
>> + if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
>> + pr_err("One is MAS_ROOT and the other is not.\n");
>> + return -1;
>> + }
>> + return 0;
>> + }
>> +
>> + while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
>> +
>> + if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
>> + pr_err("One is MAS_NONE and the other is not.\n");
>> + return -1;
>> + }
>> +
>> + if (mas_a.min != mas_b.min ||
>> + mas_a.max != mas_b.max) {
>> + pr_err("mas->min, mas->max do not match.\n");
>> + return -1;
>> + }
>> +
>> + if (compare_node(mas_a.node, mas_b.node)) {
>> + pr_err("The contents of nodes %p and %p are different.\n",
>> + mas_a.node, mas_b.node);
>> + mt_dump(mt_a, mt_dump_dec);
>> + mt_dump(mt_b, mt_dump_dec);
>> + return -1;
>> + }
>> +
>> + mas_dfs_preorder(&mas_a);
>> + mas_dfs_preorder(&mas_b);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static noinline void __init check_mt_dup(struct maple_tree *mt)
>> +{
>> + DEFINE_MTREE(new);
>> + int i, j, ret, count = 0;
>> +
>> + /* stored in the root pointer */
>> + mt_init_flags(&tree, 0);
>> + mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
>> + mt_dup(&tree, &new, GFP_KERNEL);
>> + mt_validate(&new);
>> + if (compare_tree(&tree, &new))
>> + MT_BUG_ON(&new, 1);
>> +
>> + mtree_destroy(&tree);
>> + mtree_destroy(&new);
>> +
>> + for (i = 0; i < 1000; i += 3) {
>> + if (i & 1)
>> + mt_init_flags(&tree, 0);
>> + else
>> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +
>> + for (j = 0; j < i; j++) {
>> + mtree_store_range(&tree, j * 10, j * 10 + 5,
>> + xa_mk_value(j), GFP_KERNEL);
>> + }
>
> Storing in this way is probably not checking a full tree. I think it's
> important to check the full tree/full nodes since you have changes to
> detect the metadata.
I probably won't change how the metadata is checked. But is there an
existing way to construct a full tree? All I can think of is writing
new code to build one.
>
>> +
>> + ret = mt_dup(&tree, &new, GFP_KERNEL);
>> + MT_BUG_ON(&new, ret != 0);
>> + mt_validate(&new);
>> + if (compare_tree(&tree, &new))
>> + MT_BUG_ON(&new, 1);
>> +
>> + mtree_destroy(&tree);
>> + mtree_destroy(&new);
>> + }
>> +
>> + /* Test memory allocation failed. */
>> + for (i = 0; i < 1000; i += 3) {
>> + if (i & 1)
>> + mt_init_flags(&tree, 0);
>> + else
>> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +
>> + for (j = 0; j < i; j++) {
>> + mtree_store_range(&tree, j * 10, j * 10 + 5,
>> + xa_mk_value(j), GFP_KERNEL);
>> + }
>> +
>> + mt_set_non_kernel(50);
>
> It may be worth while allowing more/less than 50 allocations.
I have actually tried other values before. I haven't settled on a good
one yet; a random number in a suitable range might also work well.
>
>> + ret = mt_dup(&tree, &new, GFP_NOWAIT);
>> + mt_set_non_kernel(0);
>> + if (ret != 0) {
>> + MT_BUG_ON(&new, ret != -ENOMEM);
>> + count++;
>> + mtree_destroy(&tree);
>> + continue;
>> + }
>> +
>> + mt_validate(&new);
>> + if (compare_tree(&tree, &new))
>> + MT_BUG_ON(&new, 1);
>> +
>> + mtree_destroy(&tree);
>> + mtree_destroy(&new);
>> + }
>> +
>> + /* pr_info("mt_dup() fail %d times\n", count); */
>> + BUG_ON(!count);
>> +}
>> +
>> extern void test_kmem_cache_bulk(void);
>>
>> void farmer_tests(void)
>> @@ -35904,6 +36102,10 @@ void farmer_tests(void)
>> check_null_expand(&tree);
>> mtree_destroy(&tree);
>>
>> + mt_init_flags(&tree, 0);
>> + check_mt_dup(&tree);
>> + mtree_destroy(&tree);
>> +
>> /* RCU testing */
>> mt_init_flags(&tree, 0);
>> check_erase_testset(&tree);
>> --
>> 2.20.1
>>
>>
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-07-26 16:08 ` Liam R. Howlett
@ 2023-07-31 12:39 ` Peng Zhang
2023-07-31 16:48 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 12:39 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Peng Zhang, avagin, npiggin, mathieu.desnoyers, peterz,
michael.christie, surenb, brauner, willy, akpm, corbet,
linux-kernel, linux-fsdevel, linux-mm, linux-doc
On 2023/7/27 00:08, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> If the mas has located a specific entry, it may be necessary to replace
>> that entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
>> will be more efficient than mas_store*() because it doesn't do many
>> unnecessary checks.
>>
>> This function should be inline, but more functions need to be moved to
>> the header file, so I didn't do it for the time being.
>
> I am really nervous having no checks here. I get that this could be
> used for duplicating the tree more efficiently, but having a function
> that just swaps a value in is very dangerous - especially since it is
> decoupled from the tree duplication code.
I've thought about this, and I feel these preconditions are something
the user should guarantee. If a user is unsure whether they hold,
mas_store() can be used instead. We should still provide this interface
because it performs better.
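The reason mas_replace_entry() can skip allocation entirely is that a one-for-one replacement of a non-NULL entry never changes the shape of the tree, so a single release store into the slot suffices for concurrent readers. A user-space sketch of that contract, using C11 atomics in place of rcu_assign_pointer() (illustrative only, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Replace an existing non-NULL entry in a slot with another non-NULL
 * entry. The node's pivots, gaps, and metadata are untouched, so no
 * allocation or rebalancing can ever be needed.
 */
static void replace_entry(_Atomic(void *) *slot, void *entry)
{
	assert(entry != NULL);			/* NULL would delete, changing shape */
	assert(atomic_load(slot) != NULL);	/* must already hold an entry */

	/* Release ordering plays the role of rcu_assign_pointer(). */
	atomic_store_explicit(slot, entry, memory_order_release);
}
```

The asserts here stand in for the MAS_WARN_ON() checks in the patch: they document the preconditions the caller must guarantee rather than handle violations gracefully.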
>
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> include/linux/maple_tree.h | 1 +
>> lib/maple_tree.c | 25 +++++++++++++++++++++++++
>> 2 files changed, 26 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index 229fe78e4c89..a05e9827d761 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -462,6 +462,7 @@ struct ma_wr_state {
>>
>> void *mas_walk(struct ma_state *mas);
>> void *mas_store(struct ma_state *mas, void *entry);
>> +void mas_replace_entry(struct ma_state *mas, void *entry);
>> void *mas_erase(struct ma_state *mas);
>> int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
>> void mas_store_prealloc(struct ma_state *mas, void *entry);
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index efac6761ae37..d58572666a00 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
>> }
>> EXPORT_SYMBOL_GPL(mas_store);
>>
>> +/**
>> + * mas_replace_entry() - Replace an entry that already exists in the maple tree
>> + * @mas: The maple state
>> + * @entry: The entry to store
>> + *
>> + * Please note that mas must already locate an existing entry, and the new entry
>> + * must not be NULL. If these two points cannot be guaranteed, please use
>> + * mas_store*() instead, otherwise it will cause an internal error in the maple
>> + * tree. This function does not need to allocate memory, so it must succeed.
>> + */
>> +void mas_replace_entry(struct ma_state *mas, void *entry)
>> +{
>> + void __rcu **slots;
>> +
>> +#ifdef CONFIG_DEBUG_MAPLE_TREE
>> + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
>> + MAS_WARN_ON(mas, !entry);
>> + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
>> +#endif
>> +
>> + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
>> + rcu_assign_pointer(slots[mas->offset], entry);
>> +}
>> +EXPORT_SYMBOL_GPL(mas_replace_entry);
>> +
>> /**
>> * mas_store_gfp() - Store a value into the tree.
>> * @mas: The maple state
>> --
>> 2.20.1
>>
* Re: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree
2023-07-26 16:39 ` Liam R. Howlett
@ 2023-07-31 12:55 ` Peng Zhang
2023-07-31 20:55 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 12:55 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-fsdevel, brauner, npiggin, mathieu.desnoyers, peterz,
michael.christie, surenb, willy, akpm, Peng Zhang, linux-doc,
corbet, avagin, linux-mm, linux-kernel
On 2023/7/27 00:39, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> Add myself as co-maintainer for maple tree. I would like to assist
>> Liam R. Howlett in maintaining maple tree. I will continue to contribute
>> to the development of maple tree in the future.
>
> Sorry, but no.
>
> I appreciate the patches, bug fixes, and code review but there is no
> need for another maintainer for the tree at this time.
Could I be added as a reviewer here instead? That would make it easier
for me to review code and see issues reported by the community; at the
moment I don't receive the maple tree problems that get reported. It
would have no effect on your maintenance of the tree.
>
> Thank you,
> Liam
>
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> MAINTAINERS | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index ddc71b815791..8cfedd492509 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -12526,6 +12526,7 @@ F: net/mctp/
>>
>> MAPLE TREE
>> M: Liam R. Howlett <Liam.Howlett@oracle.com>
>> +M: Peng Zhang <zhangpeng.00@bytedance.com>
>> L: linux-mm@kvack.org
>> S: Supported
>> F: Documentation/core-api/maple_tree.rst
>> --
>> 2.20.1
>>
* Re: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
2023-07-26 17:06 ` Liam R. Howlett
@ 2023-07-31 12:59 ` Peng Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Peng Zhang @ 2023-07-31 12:59 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, avagin, npiggin, mathieu.desnoyers, peterz,
michael.christie, surenb, brauner, willy, akpm, linux-fsdevel,
Peng Zhang, corbet, linux-doc, linux-kernel
On 2023/7/27 01:06, Liam R. Howlett wrote:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. dup_mmap() is used by fork(), so this patch
>> optimizes fork(). The optimization effect is proportional to the number
>> of VMAs.
>>
>> Due to the introduction of this method, the optimization in
>> (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer
>> has an effect here, but it is also an optimization of the maple tree.
>>
>> There is a unixbench test suite[2] where 'spawn' is used to test fork().
>> 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a
>> bit to use mmap() to control the number of VMAs. Therefore, the
>> performance under different numbers of VMAs can be measured.
>>
>> Insert code like below into 'spawn':
>> for (int i = 0; i < 200; ++i) {
>> size_t size = 10 * getpagesize();
>> void *addr;
>>
>> if (i & 1) {
>> addr = mmap(NULL, size, PROT_READ,
>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> } else {
>> addr = mmap(NULL, size, PROT_WRITE,
>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> }
>> if (addr == MAP_FAILED)
>> ...
>> }
>>
>> Based on next-20230721, use 'spawn' under 23, 203, and 4023 VMAs, test
>> 4 times in 30 seconds each time, and get the following numbers. These
>> numbers are the number of fork() successes in 30s (average of the best
>> 3 out of 4). By the way, based on next-20230725, I reverted [1], and
>> tested it together as a comparison. In order to ensure the reliability
>> of the test results, these tests were run on a physical machine.
>>
>> 23VMAs 223VMAs 4023VMAs
>> revert [1]: 159104.00 73316.33 6787.00
>
> You can probably remove the revert benchmark from this since there is no
> reason to revert the previous change. The change is worth while on its
> own, so it's better to have the numbers more clear by having with and
> without this series.
I will remove it.
>
>>
>> +0.77% +0.42% +0.28%
>> next-20230721: 160321.67 73624.67 6806.33
>>
>> +2.77% +15.42% +29.86%
>> apply this: 164751.67 84980.33 8838.67
>
> What is the difference between using this patch with mas_replace_entry()
> and mas_store_entry()?
I haven't compared them yet; I will when I have time. The comparison
could be done by simulating fork() in user space.
>
>>
>> It can be seen that the performance improvement is proportional to
>> the number of VMAs. With 23 VMAs, performance improves by about 3%,
>> with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
>> performance improves by about 30%.
>>
>> [1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@bytedance.com/
>> [2] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>> kernel/fork.c | 35 +++++++++++++++++++++++++++--------
>> mm/mmap.c | 14 ++++++++++++--
>> 2 files changed, 39 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f81149739eb9..ef80025b62d6 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> int retval;
>> unsigned long charge = 0;
>> LIST_HEAD(uf);
>> - VMA_ITERATOR(old_vmi, oldmm, 0);
>> VMA_ITERATOR(vmi, mm, 0);
>>
>> uprobe_start_dup_mmap();
>> @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> goto out;
>> khugepaged_fork(mm, oldmm);
>>
>> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> - if (retval)
>> + /* Use __mt_dup() to efficiently build an identical maple tree. */
>> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
>> + if (unlikely(retval))
>> goto out;
>>
>> mt_clear_in_rcu(vmi.mas.tree);
>> - for_each_vma(old_vmi, mpnt) {
>> + for_each_vma(vmi, mpnt) {
>> struct file *file;
>>
>> vma_start_write(mpnt);
>> if (mpnt->vm_flags & VM_DONTCOPY) {
>> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> + /*
>> + * Since the new tree is exactly the same as the old one,
>> + * we need to remove the unneeded VMAs.
>> + */
>> + mas_store(&vmi.mas, NULL);
>> +
>> + /*
>> + * Even removing an entry may require memory allocation,
>> + * and if removal fails, we use XA_ZERO_ENTRY to mark
>> + * from which VMA it failed. The case of encountering
>> + * XA_ZERO_ENTRY will be handled in exit_mmap().
>> + */
>> + if (unlikely(mas_is_err(&vmi.mas))) {
>> + retval = xa_err(vmi.mas.node);
>> + mas_reset(&vmi.mas);
>> + if (mas_find(&vmi.mas, ULONG_MAX))
>> + mas_replace_entry(&vmi.mas,
>> + XA_ZERO_ENTRY);
>> + goto loop_out;
>> + }
>> +
>> continue;
>> }
>> charge = 0;
>> @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> hugetlb_dup_vma_private(tmp);
>>
>> /* Link the vma into the MT */
>> - if (vma_iter_bulk_store(&vmi, tmp))
>> - goto fail_nomem_vmi_store;
>> + mas_replace_entry(&vmi.mas, tmp);
>>
>> mm->map_count++;
>> if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> uprobe_end_dup_mmap();
>> return retval;
>>
>> -fail_nomem_vmi_store:
>> - unlink_anon_vmas(tmp);
>> fail_nomem_anon_vma_fork:
>> mpol_put(vma_policy(tmp));
>> fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index bc91d91261ab..5bfba2fb0e39 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
>> arch_exit_mmap(mm);
>>
>> vma = mas_find(&mas, ULONG_MAX);
>> - if (!vma) {
>> + /*
>> + * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> + * xa_is_zero(vma) may be true.
>> + */
>> + if (!vma || xa_is_zero(vma)) {
>> /* Can happen if dup_mmap() received an OOM */
>> mmap_read_unlock(mm);
>> return;
>> @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
>> remove_vma(vma, true);
>> count++;
>> cond_resched();
>> - } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> + vma = mas_find(&mas, ULONG_MAX);
>> + /*
>> + * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> + * do not need to be removed. Can happen if dup_mmap() fails to
>> + * remove a VMA marked VM_DONTCOPY.
>> + */
>> + } while (vma != NULL && !xa_is_zero(vma));
>>
>> BUG_ON(count != mm->map_count);
>>
>> --
>> 2.20.1
>>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()
2023-07-31 9:52 ` Peng Zhang
@ 2023-07-31 16:08 ` Liam R. Howlett
0 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-31 16:08 UTC (permalink / raw)
To: Peng Zhang
Cc: corbet, akpm, willy, brauner, surenb, michael.christie,
mathieu.desnoyers, peterz, npiggin, avagin, linux-mm, linux-doc,
linux-kernel, linux-fsdevel
* Peng Zhang <zhangpeng.00@bytedance.com> [230731 05:53]:
>
>
> > On 2023/7/26 22:58, Liam R. Howlett wrote:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
> > > non-leaf nodes without knowing the maximum value of nodes, so that any
> > > ascending can be avoided even if the maximum value of nodes is not known.
> > >
> > > The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
> > > cannot be used by metadata, so we can distinguish whether it is ENODE or
> > > metadata.
> > >
> > > The nocheck version is to avoid lockdep complaining in some scenarios
> > > where no locks are held.
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > > lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
> > > 1 file changed, 68 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index a3d602cfd030..98e4fdf6f4b9 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
> > > #define MAPLE_ENODE_TYPE_SHIFT 0x03
> > > /* Bit 2 means a NULL somewhere below */
> > > #define MAPLE_ENODE_NULL 0x04
> > > +/* Bit 7 means this is an ENODE, instead of metadata */
> > > +#define MAPLE_ENODE 0x80
> >
> > We were saving this bit for more node types. I don't want to use this
> > bit for this reason since you could have done BFS to duplicate the tree
> > using the existing way to find the node end.
> We have reserved 4 bits for the node type. I don't think there will be
> more than 16 node types going forward.
We aren't using this bit for the metadata validation. We have plans for
the tree to be used elsewhere and I'd like to keep this bit in case we
need it.
>
> Even the DFS that has been implemented can use the existing way to get
> the data end.
Please try using it and see the impact on your performance testing.
Think about how often a node is full and how often a parent node is
full, and so on. As I brought up during the review of your testing
code, you have not actually filled the trees during construction, so
there will not be more than one walk up, and I'm not even sure you
will get even one.
The nodes are also going to be in CPU cache, so even when you do need to
walk up multiple times, it's not going to be a huge performance
consideration.
Furthermore, the maple allocation nodes store metadata outside of the
data; there is extra space in those nodes that we always use. So for
fork, all this does is make walking up to the parent of full leaves
faster, but since it isn't used for leaves there is no benefit in that
case.
> I didn't use it because, when walking up the tree, we don't know the
> maximum value of the node, and the repeated upward walk introduces
> more overhead; this is what mas_ascend() does. Doing BFS cannot avoid
> this problem either.
>
> The reason I don't do BFS is that it has more overhead than DFS. If
> you think of the tree as a graph, DFS walks each edge exactly twice.
> What about BFS? Since we can't use a queue, we can only emulate BFS,
> which requires extra work similar to what mas_next_node() does, and
> that introduces more overhead than DFS. Even considering only the
> leaf level, BFS still walks each edge twice, so its total overhead is
> higher than that of DFS.
Okay, thanks.
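For reference, the DFS-preorder duplication under discussion can be modeled in plain user-space C on a toy n-ary tree. This is only a sketch of the traversal order (struct tnode and its helpers are invented here, not the kernel's maple_node), showing how each edge is walked once down and once up, with no recursion and no queue:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NSLOTS 4

/* Toy n-ary node; illustrative only, not the kernel's maple_node. */
struct tnode {
	int val;
	int nr;				/* children in use */
	int pslot;			/* index in parent->slot[] */
	struct tnode *parent;
	struct tnode *slot[NSLOTS];
};

/*
 * Duplicate the tree rooted at @src in DFS preorder without recursion
 * or a queue: copy a node, descend to its first child, and when a
 * subtree is finished ascend via parent pointers to the next sibling.
 * Every edge is walked exactly twice (once down, once up).
 */
static struct tnode *dup_tree(const struct tnode *src)
{
	const struct tnode *node = src;
	struct tnode *new_parent = NULL, *new_root = NULL;

	for (;;) {
		struct tnode *copy = malloc(sizeof(*copy));

		memcpy(copy, node, sizeof(*copy));
		copy->parent = new_parent;
		if (new_parent)
			new_parent->slot[node->pslot] = copy;
		else
			new_root = copy;

		if (node->nr) {			/* descend to first child */
			new_parent = copy;
			node = node->slot[0];
			continue;
		}
		/* ascend past fully copied subtrees */
		while (node != src && node->pslot + 1 == node->parent->nr) {
			node = node->parent;
			new_parent = new_parent->parent;
		}
		if (node == src)
			return new_root;
		node = node->parent->slot[node->pslot + 1];
	}
}

static struct tnode *mk(int val)
{
	struct tnode *n = calloc(1, sizeof(*n));

	n->val = val;
	return n;
}

static void attach(struct tnode *p, struct tnode *c)
{
	c->parent = p;
	c->pslot = p->nr;
	p->slot[p->nr++] = c;
}

/* Equal values and shape, but every node must be a distinct copy. */
static int same(const struct tnode *a, const struct tnode *b)
{
	int i;

	if (a == b || a->val != b->val || a->nr != b->nr)
		return 0;
	for (i = 0; i < a->nr; i++)
		if (!same(a->slot[i], b->slot[i]))
			return 0;
	return 1;
}

/* Build a small 3-level tree, duplicate it, verify the copy (leaked). */
static int dup_demo(void)
{
	struct tnode *root = mk(1), *a = mk(2), *b = mk(3);

	attach(root, a);
	attach(root, b);
	attach(a, mk(4));
	attach(a, mk(5));
	attach(b, mk(6));
	return same(root, dup_tree(root));
}
```

The copied node initially carries the source's child pointers (from the memcpy), and each one is overwritten as that child is itself copied, mirroring how mt_dup_build() fixes up slots as it descends.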
> >
> > Bits are highly valuable and this is the only remaining bit. I had
> > thought about using this in Feb 2021 to see if there was metadata or
> > not, but figured a way around it (using the max trick) and thus saved
> > this bit for potential expansion of node types.
> I thought of another way to get the maximum value of a node without
> doing any extra upward walk. When doing DFS, we can use a stack to save
> the maximum value of ancestor nodes. The stack size can be set to
> MAPLE_HEIGHT_MAX. In this way, this bit can be reserved, and there is no
> need to do a loop like mas_ascend() in order to get the maximum value.
You want an array of 31 unsigned longs to keep track of the max of this
node? Please don't do that.
> >
> > > +
> > > +static inline bool slot_is_mte(unsigned long slot)
> > > +{
> > > + return slot & MAPLE_ENODE;
> > > +}
> > > static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
> > > enum maple_type type)
> > > {
> > > - return (void *)((unsigned long)node |
> > > - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
> > > + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
> > > + MAPLE_ENODE_NULL | MAPLE_ENODE);
> > > }
> > > static inline void *mte_mk_root(const struct maple_enode *node)
> > > @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
> > > return NULL;
> > > }
> > > +/*
> > > + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
> > > + * @mt: The maple tree
> > > + * @node: The maple node
> > > + * @type: The maple node type
> > > + *
> > > + * Uses metadata to find the end of the data when possible without knowing the
> > > + * node maximum.
> > > + *
> > > + * Return: The zero indexed last slot with child.
> > > + */
> > > +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
> > > + struct maple_node *node,
> > > + enum maple_type type)
> > > +{
> > > + void __rcu **slots;
> > > + unsigned long slot;
> > > +
> > > + slots = ma_slots(node, type);
> > > + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
> > > + if (unlikely(slot_is_mte(slot)))
> > > + return mt_pivots[type];
> > > +
> > > + return ma_meta_end(node, type);
> > > +}
> > > +
> > > +/*
> > > + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
> > > + * @node: The maple node
> > > + * @type: The maple node type
> > > + *
> > > + * Uses metadata to find the end of the data when possible without knowing the
> > > + * node maximum. This is the version of ma_nonleaf_data_end() that does not
> > > + * check for lock held. This particular version is designed to avoid lockdep
> > > + * complaining in some scenarios.
> > > + *
> > > + * Return: The zero indexed last slot with child.
> > > + */
> > > +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
> > > + enum maple_type type)
> > > +{
> > > + void __rcu **slots;
> > > + unsigned long slot;
> > > +
> > > + slots = ma_slots(node, type);
> > > + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
> > > + if (unlikely(slot_is_mte(slot)))
> > > + return mt_pivots[type];
> > > +
> > > + return ma_meta_end(node, type);
> > > +}
> > > +
> > > +/* See ma_nonleaf_data_end() */
> > > +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
> > > + struct maple_enode *enode)
> > > +{
> > > + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
> > > +}
> > > +
> > > /*
> > > * ma_data_end() - Find the end of the data in a node.
> > > * @node: The maple node
> > > --
> > > 2.20.1
> > >
> > >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-07-31 12:24 ` Peng Zhang
@ 2023-07-31 16:27 ` Liam R. Howlett
2023-08-16 13:41 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-31 16:27 UTC (permalink / raw)
To: Peng Zhang
Cc: willy, michael.christie, surenb, npiggin, corbet,
mathieu.desnoyers, avagin, linux-fsdevel, linux-kernel, linux-doc,
linux-mm, akpm, brauner, peterz
* Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:24]:
>
>
> > On 2023/7/27 00:03, Liam R. Howlett wrote:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > Introduce interfaces __mt_dup() and mt_dup(), which are used to
> > > duplicate a maple tree. Compared with traversing the source tree and
> > > reinserting entry by entry in the new tree, it has better performance.
> > > The difference between __mt_dup() and mt_dup() is that mt_dup() holds
> > > an internal lock.
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > > include/linux/maple_tree.h | 3 +
> > > lib/maple_tree.c | 211 +++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 214 insertions(+)
> > >
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index c962af188681..229fe78e4c89 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > > void *entry, gfp_t gfp);
> > > void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +
> > > void mtree_destroy(struct maple_tree *mt);
> > > void __mt_destroy(struct maple_tree *mt);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index da3a2fb405c0..efac6761ae37 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -6595,6 +6595,217 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > > }
> > > EXPORT_SYMBOL(mtree_erase);
> > > +/*
> > > + * mt_dup_free() - Free the nodes of an incomplete maple tree.
> > > + * @mt: The incomplete maple tree
> > > + * @node: Free nodes from @node
> > > + *
> > > + * This function frees all nodes starting from @node in the reverse order of
> > > + * mt_dup_build(). At this point we don't need to hold the source tree lock.
> > > + */
> > > +static void mt_dup_free(struct maple_tree *mt, struct maple_node *node)
> > > +{
> > > + void **slots;
> > > + unsigned char offset;
> > > + struct maple_enode *enode;
> > > + enum maple_type type;
> > > + unsigned char count = 0, i;
> > > +
> >
> > Can we make these labels inline functions and try to make this a loop?
> I did this just to keep things simple; see the implementation of
> walk_tg_tree_from() in kernel/sched/core.c. Using loops and inline
> functions probably won't simplify things, but I'll try it and give up
> if it complicates the code.
Thanks, I'd like to try and simplify the code instead of adding goto
label loops. The code you are referencing is from 2008 and goto loops
are not common.
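As a sketch of the loop-plus-helpers structure being requested, a bottom-up free in the spirit of mt_dup_free() can be written without goto labels on a toy n-ary tree (struct tnode and all helpers here are illustrative inventions, not kernel code):

```c
#include <assert.h>
#include <stdlib.h>

#define NSLOTS 4

/* Toy n-ary node; illustrative only, not the kernel's maple_node. */
struct tnode {
	int nr;				/* children in use */
	int pslot;			/* index in parent->slot[] */
	struct tnode *parent;
	struct tnode *slot[NSLOTS];
};

static int freed;			/* counts free_one() calls */

static void free_one(struct tnode *n)
{
	freed++;
	free(n);
}

/* The "descend" step: find the leftmost leaf below @n. */
static struct tnode *leftmost_leaf(struct tnode *n)
{
	while (n->nr)
		n = n->slot[0];
	return n;
}

/*
 * Free the whole tree bottom-up in a single loop: free the current
 * node, then either descend into the next sibling's subtree or ascend
 * to the parent, with no goto labels.
 */
static void free_tree(struct tnode *root)
{
	struct tnode *n = leftmost_leaf(root);

	while (n) {
		struct tnode *parent = n->parent;
		int pslot = n->pslot;

		free_one(n);
		if (!parent)
			break;			/* root was freed */
		if (pslot + 1 < parent->nr)	/* next sibling's subtree */
			n = leftmost_leaf(parent->slot[pslot + 1]);
		else
			n = parent;		/* subtree done, ascend */
	}
}

static struct tnode *mk(void)
{
	return calloc(1, sizeof(struct tnode));
}

static void attach(struct tnode *p, struct tnode *c)
{
	c->parent = p;
	c->pslot = p->nr;
	p->slot[p->nr++] = c;
}

/* Build 6 nodes, free them all, report whether all 6 were freed. */
static int free_demo(void)
{
	struct tnode *root = mk(), *a = mk(), *b = mk();

	attach(root, a);
	attach(root, b);
	attach(a, mk());
	attach(a, mk());
	attach(b, mk());
	freed = 0;
	free_tree(root);
	return freed == 6;
}
```

The parent pointer and slot index are captured before freeing a node, so a parent is only ever touched while still live; each child is freed before its parent, matching the reverse-of-build order mt_dup_free() needs.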
> >
> > > +try_ascend:
> > > + if (ma_is_root(node)) {
> > > + mt_free_one(node);
> > > + return;
> > > + }
> > > +
> > > + offset = ma_parent_slot(node);
> > > + type = ma_parent_type(mt, node);
> > > + node = ma_parent(node);
> > > + if (!offset)
> > > + goto free;
> > > +
> > > + offset--;
> > > +
> > > +descend:
> > > + slots = (void **)ma_slots(node, type);
> > > + enode = slots[offset];
> > > + if (mte_is_leaf(enode))
> > > + goto free;
> > > +
> > > + type = mte_node_type(enode);
> > > + node = mte_to_node(enode);
> > > + offset = ma_nonleaf_data_end_nocheck(node, type);
> > > + goto descend;
> > > +
> > > +free:
> > > + slots = (void **)ma_slots(node, type);
> > > + count = ma_nonleaf_data_end_nocheck(node, type) + 1;
> > > + for (i = 0; i < count; i++)
> > > + ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > +
> > > + /* Cast to __rcu to avoid sparse checker complaining. */
> > > + mt_free_bulk(count, (void __rcu **)slots);
> > > + goto try_ascend;
> > > +}
> > > +
> > > +/*
> > > + * mt_dup_build() - Build a new maple tree from a source tree
> > > + * @mt: The source maple tree to copy from
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + * @to_free: Free nodes starting from @to_free if the build fails
> > > + *
> > > + * This function builds a new tree in DFS preorder. If it fails due to memory
> > > + * allocation, @to_free will store the last failed node to free the incomplete
> > > + * tree. Use mt_dup_free() to free nodes.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> > > + */
> > > +static inline int mt_dup_build(struct maple_tree *mt, struct maple_tree *new,
> > > + gfp_t gfp, struct maple_node **to_free)
> >
> > I am trying to change the functions to be two tabs of indent for
> > arguments from now on. It allows for more to fit on a single line and
> > still maintains a clear separation between code and argument lists.
> I'm not too concerned about code formatting... at least not in this
> patchset.
I have a mess of it in the tree and wanted to communicate my desire to
shift to using two tabs for extra arguments in the future.
> >
> > > +{
> > > + struct maple_enode *enode;
> > > + struct maple_node *new_node, *new_parent = NULL, *node;
> > > + enum maple_type type;
> > > + void __rcu **slots;
> > > + void **new_slots;
> > > + unsigned char count, request, i, offset;
> > > + unsigned long *set_parent;
> > > + unsigned long new_root;
> > > +
> > > + mt_init_flags(new, mt->ma_flags);
> > > + enode = mt_root_locked(mt);
> > > + if (unlikely(!xa_is_node(enode))) {
> > > + rcu_assign_pointer(new->ma_root, enode);
> > > + return 0;
> > > + }
> > > +
> > > + new_node = mt_alloc_one(gfp);
> > > + if (!new_node)
> > > + return -ENOMEM;
> > > +
> > > + new_root = (unsigned long)new_node;
> > > + new_root |= (unsigned long)enode & MAPLE_NODE_MASK;
> > > +
> > > +copy_node:
> >
> > Can you make copy_node, descend, ascend inline functions instead of the
> > goto jumping please? It's better to have loops over jumping around a
> > lot. Gotos are good for undoing things and retry, but constructing
> > loops with them makes it difficult to follow.
> Same as above.
> >
> > > + node = mte_to_node(enode);
> > > + type = mte_node_type(enode);
> > > + memcpy(new_node, node, sizeof(struct maple_node));
> > > +
> > > + set_parent = (unsigned long *)&(new_node->parent);
> > > + *set_parent &= MAPLE_NODE_MASK;
> > > + *set_parent |= (unsigned long)new_parent;
> >
> > Maybe make a small inline to set the parent instead of this?
> >
> > There are some defined helpers for setting the types like
> > ma_parent_ptr() and ma_enode_ptr() to make casting more type-safe.
> Ok, I'll try to do that.
> >
> > > + if (ma_is_leaf(type))
> > > + goto ascend;
> > > +
> > > + new_slots = (void **)ma_slots(new_node, type);
> > > + slots = ma_slots(node, type);
> > > + request = ma_nonleaf_data_end(mt, node, type) + 1;
> > > + count = mt_alloc_bulk(gfp, request, new_slots);
> > > + if (!count) {
> > > + *to_free = new_node;
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + for (i = 0; i < count; i++)
> > > + ((unsigned long *)new_slots)[i] |=
> > > + ((unsigned long)mt_slot_locked(mt, slots, i) &
> > > + MAPLE_NODE_MASK);
> > > + offset = 0;
> > > +
> > > +descend:
> > > + new_parent = new_node;
> > > + enode = mt_slot_locked(mt, slots, offset);
> > > + new_node = mte_to_node(new_slots[offset]);
> > > + goto copy_node;
> > > +
> > > +ascend:
> > > + if (ma_is_root(node)) {
> > > + new_node = mte_to_node((void *)new_root);
> > > + new_node->parent = ma_parent_ptr((unsigned long)new |
> > > + MA_ROOT_PARENT);
> > > + rcu_assign_pointer(new->ma_root, (void *)new_root);
> > > + return 0;
> > > + }
> > > +
> > > + offset = ma_parent_slot(node);
> > > + type = ma_parent_type(mt, node);
> > > + node = ma_parent(node);
> > > + new_node = ma_parent(new_node);
> > > + if (offset < ma_nonleaf_data_end(mt, node, type)) {
> > > + offset++;
> > > + new_slots = (void **)ma_slots(new_node, type);
> > > + slots = ma_slots(node, type);
> > > + goto descend;
> > > + }
> > > +
> > > + goto ascend;
> > > +}
> > > +
> > > +/**
> > > + * __mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one. The user
> > > + * needs to lock the source tree manually. Before calling this function, @new
> > > + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
> > > + * we may also need to manually set @new's external lock using
> > > + * mt_set_external_lock().
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> > > + */
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> >
> > We use mas_ for things that won't handle the locking and pass in a maple
> > state. Considering the leaves need to be altered once this is returned,
> > I would expect passing in a maple state should be feasible?
> But we don't really need mas here. What do you think the state of mas
> should be when this function returns? Make it point to the first entry,
> or the last entry?
I would write it to point to the first element so that the call to
replace the first element can just do that without an extra walk and
document the maple state end point.
> >
> > > +{
> > > + int ret;
> > > + struct maple_node *to_free = NULL;
> > > +
> > > + ret = mt_dup_build(mt, new, gfp, &to_free);
> > > +
> > > + if (unlikely(ret == -ENOMEM)) {
> >
> > On other errors, will the half constructed tree be returned? Is this
> > safe?
> Of course, mt_dup_free() is carefully designed to handle this.
> >
> > > + if (to_free)
> > > + mt_dup_free(new, to_free);
> > > + }
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(__mt_dup);
> > > +
> > > +/**
> > > + * mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one. The
> > > + * function will lock the source tree with an internal lock, and the user does
> > > + * not need to manually handle the lock. Before calling this function, @new must
> > > + * be an empty tree or an uninitialized tree. If @mt uses an external lock, we
> > > + * may also need to manually set @new's external lock using
> > > + * mt_set_external_lock().
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> > > + */
> > > +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> >
> > mtree_ is usually used to indicate locking is handled.
> Before unifying mtree_* and mt_*, I don't think I can see any difference
> between them. At least mt_set_in_rcu() and mt_clear_in_rcu() will hold
> the lock.
Fair enough. I was thinking this closely matches __mt_destroy() and
mtree_destroy(). We could be consistent in our inconsistency, at least.
> >
> > > +{
> > > + int ret;
> > > + struct maple_node *to_free = NULL;
> > > +
> > > + mtree_lock(mt);
> > > + ret = mt_dup_build(mt, new, gfp, &to_free);
> > > + mtree_unlock(mt);
> > > +
> > > + if (unlikely(ret == -ENOMEM)) {
> > > + if (to_free)
> > > + mt_dup_free(new, to_free);
> >
> > Again, is a half constructed tree safe to return? Since each caller
> > checks to_free is NULL, could that be in mt_dup_free() instead?
> Yes, this check can be put in mt_dup_free().
> >
> > > + }
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(mt_dup);
> > > +
> > > /**
> > > * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> > > * @mt: The maple tree
> > > --
> > > 2.20.1
> > >
> > >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 05/11] maple_tree: Add test for mt_dup()
2023-07-31 12:32 ` Peng Zhang
@ 2023-07-31 16:41 ` Liam R. Howlett
0 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-31 16:41 UTC (permalink / raw)
To: Peng Zhang
Cc: linux-doc, linux-mm, avagin, npiggin, mathieu.desnoyers, peterz,
michael.christie, surenb, brauner, willy, corbet, linux-fsdevel,
akpm, linux-kernel
* Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:32]:
>
>
> > On 2023/7/27 00:06, Liam R. Howlett wrote:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > Add test for mt_dup().
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > > tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
> > > 1 file changed, 202 insertions(+)
> > >
> > > diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> > > index e5da1cad70ba..3052e899e5df 100644
> > > --- a/tools/testing/radix-tree/maple.c
> > > +++ b/tools/testing/radix-tree/maple.c
> > > @@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
> > > mt_clear_in_rcu(mt);
> > > }
> > > +/*
> > > + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> > > + */
> > > +static int __init compare_node(struct maple_enode *enode_a,
> > > + struct maple_enode *enode_b)
> > > +{
> > > + struct maple_node *node_a, *node_b;
> > > + struct maple_node a, b;
> > > + void **slots_a, **slots_b; /* Do not use the rcu tag. */
> > > + enum maple_type type;
> > > + int i;
> > > +
> > > + if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
> > > + ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
> > > + pr_err("The lower 8 bits of enode are different.\n");
> > > + return -1;
> > > + }
> > > +
> > > + type = mte_node_type(enode_a);
> > > + node_a = mte_to_node(enode_a);
> > > + node_b = mte_to_node(enode_b);
> > > + a = *node_a;
> > > + b = *node_b;
> > > +
> > > + /* Do not compare addresses. */
> > > + if (ma_is_root(node_a) || ma_is_root(node_b)) {
> > > + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> > > + MA_ROOT_PARENT);
> > > + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> > > + MA_ROOT_PARENT);
> > > + } else {
> > > + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> > > + MAPLE_NODE_MASK);
> > > + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> > > + MAPLE_NODE_MASK);
> > > + }
> > > +
> > > + if (a.parent != b.parent) {
> > > + pr_err("The lower 8 bits of parents are different. %p %p\n",
> > > + a.parent, b.parent);
> > > + return -1;
> > > + }
> > > +
> > > + /*
> > > + * If it is a leaf node, the slots do not contain the node address, and
> > > + * no special processing of slots is required.
> > > + */
> > > + if (ma_is_leaf(type))
> > > + goto cmp;
> > > +
> > > + slots_a = ma_slots(&a, type);
> > > + slots_b = ma_slots(&b, type);
> > > +
> > > + for (i = 0; i < mt_slots[type]; i++) {
> > > + if (!slots_a[i] && !slots_b[i])
> > > + break;
> > > +
> > > + if (!slots_a[i] || !slots_b[i]) {
> > > + pr_err("The number of slots is different.\n");
> > > + return -1;
> > > + }
> > > +
> > > + /* Do not compare addresses in slots. */
> > > + ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
> > > + ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
> > > + }
> > > +
> > > +cmp:
> > > + /*
> > > + * Compare all contents of two nodes, including parent (except address),
> > > + * slots (except address), pivots, gaps and metadata.
> > > + */
> > > + return memcmp(&a, &b, sizeof(struct maple_node));
> > > +}
> > > +
> > > +/*
> > > + * Compare two trees and return 0 if they are the same, non-zero otherwise.
> > > + */
> > > +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
> > > +{
> > > + MA_STATE(mas_a, mt_a, 0, 0);
> > > + MA_STATE(mas_b, mt_b, 0, 0);
> > > +
> > > + if (mt_a->ma_flags != mt_b->ma_flags) {
> > > + pr_err("The flags of the two trees are different.\n");
> > > + return -1;
> > > + }
> > > +
> > > + mas_dfs_preorder(&mas_a);
> > > + mas_dfs_preorder(&mas_b);
> > > +
> > > + if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
> > > + if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
> > > + pr_err("One is MAS_ROOT and the other is not.\n");
> > > + return -1;
> > > + }
> > > + return 0;
> > > + }
> > > +
> > > + while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
> > > +
> > > + if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
> > > + pr_err("One is MAS_NONE and the other is not.\n");
> > > + return -1;
> > > + }
> > > +
> > > + if (mas_a.min != mas_b.min ||
> > > + mas_a.max != mas_b.max) {
> > > + pr_err("mas->min, mas->max do not match.\n");
> > > + return -1;
> > > + }
> > > +
> > > + if (compare_node(mas_a.node, mas_b.node)) {
> > > + pr_err("The contents of nodes %p and %p are different.\n",
> > > + mas_a.node, mas_b.node);
> > > + mt_dump(mt_a, mt_dump_dec);
> > > + mt_dump(mt_b, mt_dump_dec);
> > > + return -1;
> > > + }
> > > +
> > > + mas_dfs_preorder(&mas_a);
> > > + mas_dfs_preorder(&mas_b);
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static noinline void __init check_mt_dup(struct maple_tree *mt)
> > > +{
> > > + DEFINE_MTREE(new);
> > > + int i, j, ret, count = 0;
> > > +
> > > + /* stored in the root pointer */
> > > + mt_init_flags(&tree, 0);
> > > + mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
> > > + mt_dup(&tree, &new, GFP_KERNEL);
> > > + mt_validate(&new);
> > > + if (compare_tree(&tree, &new))
> > > + MT_BUG_ON(&new, 1);
> > > +
> > > + mtree_destroy(&tree);
> > > + mtree_destroy(&new);
> > > +
> > > + for (i = 0; i < 1000; i += 3) {
> > > + if (i & 1)
> > > + mt_init_flags(&tree, 0);
> > > + else
> > > + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> > > +
> > > + for (j = 0; j < i; j++) {
> > > + mtree_store_range(&tree, j * 10, j * 10 + 5,
> > > + xa_mk_value(j), GFP_KERNEL);
> > > + }
> >
> > Storing in this way is probably not checking a full tree. I think it's
> > important to check the full tree/full nodes since you have changes to
> > detect the metadata.
> I probably won't change the way I check metadata.
What I am telling you is that you haven't tested your new code for the
metadata of full nodes with this testcase, or have I missed something?
If it's not tested here, are there other testcases that cover the new
code?
> But is there a way to construct a full tree? All I can think of is
> writing new code to construct one.
Normally, what I do is create a tree in a loop like you have done above
and then store entries over portions of the existing ranges to fill out
the nodes until they are full. check_ranges() in lib/test_maple_tree.c
might be of help.
> >
> > > +
> > > + ret = mt_dup(&tree, &new, GFP_KERNEL);
> > > + MT_BUG_ON(&new, ret != 0);
> > > + mt_validate(&new);
> > > + if (compare_tree(&tree, &new))
> > > + MT_BUG_ON(&new, 1);
> > > +
> > > + mtree_destroy(&tree);
> > > + mtree_destroy(&new);
> > > + }
> > > +
> > > + /* Test memory allocation failed. */
> > > + for (i = 0; i < 1000; i += 3) {
> > > + if (i & 1)
> > > + mt_init_flags(&tree, 0);
> > > + else
> > > + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> > > +
> > > + for (j = 0; j < i; j++) {
> > > + mtree_store_range(&tree, j * 10, j * 10 + 5,
> > > + xa_mk_value(j), GFP_KERNEL);
> > > + }
> > > +
> > > + mt_set_non_kernel(50);
> >
> > It may be worth while allowing more/less than 50 allocations.
> Actually I have used other values before. I haven't thought of a good
> value yet; probably a random number in a suitable range would be nice
> too.
Random numbers are difficult to recreate, so it might be best to limit
that to the userspace tools/testing/radix-tree/maple.c tests and print
the random number for reproducibility.
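One way to keep such a randomized test reproducible is to derive the random value from a single printed seed. A minimal user-space sketch (pick_alloc_limit() is a hypothetical helper, and rand()/srand() stand in for whatever PRNG the test harness actually uses):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Pick a random allocation limit for a failure-injection run, but make
 * the run reproducible: the seed fully determines the value, and it is
 * printed so a failing run can be replayed. Hypothetical helper, not
 * an existing test API.
 */
static unsigned int pick_alloc_limit(unsigned int seed, unsigned int max)
{
	srand(seed);
	printf("maple: failure-injection seed = %u\n", seed);
	return (unsigned int)rand() % max + 1;	/* 1..max allocations */
}
```

A failing run's printed seed can then be fed back in (e.g. via an environment variable) to reproduce the exact same mt_set_non_kernel() limit.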
>
> >
> > > + ret = mt_dup(&tree, &new, GFP_NOWAIT);
> > > + mt_set_non_kernel(0);
> > > + if (ret != 0) {
> > > + MT_BUG_ON(&new, ret != -ENOMEM);
> > > + count++;
> > > + mtree_destroy(&tree);
> > > + continue;
> > > + }
> > > +
> > > + mt_validate(&new);
> > > + if (compare_tree(&tree, &new))
> > > + MT_BUG_ON(&new, 1);
> > > +
> > > + mtree_destroy(&tree);
> > > + mtree_destroy(&new);
> > > + }
> > > +
> > > + /* pr_info("mt_dup() fail %d times\n", count); */
> > > + BUG_ON(!count);
> > > +}
> > > +
> > > extern void test_kmem_cache_bulk(void);
> > > void farmer_tests(void)
> > > @@ -35904,6 +36102,10 @@ void farmer_tests(void)
> > > check_null_expand(&tree);
> > > mtree_destroy(&tree);
> > > + mt_init_flags(&tree, 0);
> > > + check_mt_dup(&tree);
> > > + mtree_destroy(&tree);
> > > +
> > > /* RCU testing */
> > > mt_init_flags(&tree, 0);
> > > check_erase_testset(&tree);
> > > --
> > > 2.20.1
> > >
> > >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-07-31 12:39 ` Peng Zhang
@ 2023-07-31 16:48 ` Liam R. Howlett
2023-08-16 13:11 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-31 16:48 UTC (permalink / raw)
To: Peng Zhang
Cc: avagin, npiggin, mathieu.desnoyers, peterz, michael.christie,
surenb, brauner, willy, akpm, corbet, linux-kernel, linux-fsdevel,
linux-mm, linux-doc
* Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:39]:
>
>
> 在 2023/7/27 00:08, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > If mas has located a specific entry, it may need to be replaced,
> > > so introduce mas_replace_entry() to do this. mas_replace_entry()
> > > will be more efficient than mas_store*() because it doesn't do many
> > > unnecessary checks.
> > >
> > > This function should be inline, but more functions need to be moved to
> > > the header file, so I didn't do it for the time being.
> >
> > I am really nervous having no checks here. I get that this could be
> > used for duplicating the tree more efficiently, but having a function
> > that just swaps a value in is very dangerous - especially since it is
> > decoupled from the tree duplication code.
> I've thought about this, and I feel like this is something the user
> should guarantee. If the user is not sure whether to use it,
> mas_store() can be used instead.
Documentation often isn't up to date, and is even more rarely read.
mas_replace_entry() does not give a hint of a requirement for a specific
state to the mas. This is not acceptable.
The description of the function also doesn't say anything about a
requirement of the maple state, just that it replaces an already
existing entry. You have to read the notes to find out that 'mas must
already locate an existing entry'.
>And we should provide this interface
> because it has better performance.
How much better is the performance? There's always a trade-off, but
without numbers, this is hard to justify.
> >
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > > include/linux/maple_tree.h | 1 +
> > > lib/maple_tree.c | 25 +++++++++++++++++++++++++
> > > 2 files changed, 26 insertions(+)
> > >
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index 229fe78e4c89..a05e9827d761 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -462,6 +462,7 @@ struct ma_wr_state {
> > > void *mas_walk(struct ma_state *mas);
> > > void *mas_store(struct ma_state *mas, void *entry);
> > > +void mas_replace_entry(struct ma_state *mas, void *entry);
> > > void *mas_erase(struct ma_state *mas);
> > > int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
> > > void mas_store_prealloc(struct ma_state *mas, void *entry);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index efac6761ae37..d58572666a00 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
> > > }
> > > EXPORT_SYMBOL_GPL(mas_store);
> > > +/**
> > > + * mas_replace_entry() - Replace an entry that already exists in the maple tree
> > > + * @mas: The maple state
> > > + * @entry: The entry to store
> > > + *
> > > + * Please note that mas must already locate an existing entry, and the new entry
> > > + * must not be NULL. If these two points cannot be guaranteed, please use
> > > + * mas_store*() instead, otherwise it will cause an internal error in the maple
> > > + * tree. This function does not need to allocate memory, so it must succeed.
> > > + */
> > > +void mas_replace_entry(struct ma_state *mas, void *entry)
> > > +{
> > > + void __rcu **slots;
> > > +
> > > +#ifdef CONFIG_DEBUG_MAPLE_TREE
> > > + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
> > > + MAS_WARN_ON(mas, !entry);
> > > + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
> > > +#endif
> > > +
> > > + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
> > > + rcu_assign_pointer(slots[mas->offset], entry);
> > > +}
> > > +EXPORT_SYMBOL_GPL(mas_replace_entry);
> > > +
> > > /**
> > > * mas_store_gfp() - Store a value into the tree.
> > > * @mas: The maple state
> > > --
> > > 2.20.1
> > >
* Re: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree
2023-07-31 12:55 ` Peng Zhang
@ 2023-07-31 20:55 ` Liam R. Howlett
0 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-07-31 20:55 UTC (permalink / raw)
To: Peng Zhang
Cc: linux-fsdevel, brauner, npiggin, mathieu.desnoyers, peterz,
michael.christie, surenb, willy, akpm, linux-doc, corbet, avagin,
linux-mm, linux-kernel
* Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:55]:
>
>
> 在 2023/7/27 00:39, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > Add myself as co-maintainer for maple tree. I would like to assist
> > > Liam R. Howlett in maintaining maple tree. I will continue to contribute
> > > to the development of maple tree in the future.
> >
> > Sorry, but no.
> >
> > I appreciate the patches, bug fixes, and code review but there is no
> > need for another maintainer for the tree at this time.
> So can I add myself as a reviewer here? That would make it convenient for
> me to review the code and keep track of problems reported by the community.
> Usually I don't receive maple tree issues reported by the community. It has
> no effect on your maintenance of it.
Although you are on the path to becoming a reviewer in the MAINTAINERS
file, that file is not a mailing list and shouldn't be treated as such.
You should receive all maple tree issues reported by the community by
using the mailing lists.
You do not need to have an entry in that file to review patches and to
be added as a reviewer of a particular patch; just keep doing what you
have done and send out R-b responses.
I am happy to get reviews by you and will be sure to pick up any R-b
from you for the commits between revisions. Please also review other
people's patches on the mailing lists.
Thanks,
Liam
> >
> > >
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > > MAINTAINERS | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index ddc71b815791..8cfedd492509 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -12526,6 +12526,7 @@ F: net/mctp/
> > > MAPLE TREE
> > > M: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > +M: Peng Zhang <zhangpeng.00@bytedance.com>
> > > L: linux-mm@kvack.org
> > > S: Supported
> > > F: Documentation/core-api/maple_tree.rst
> > > --
> > > 2.20.1
> > >
* Re: [PATCH 03/11] maple_tree: Add some helper functions
2023-07-31 11:45 ` Peng Zhang
@ 2023-08-11 17:28 ` Liam R. Howlett
0 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-08-11 17:28 UTC (permalink / raw)
To: Peng Zhang
Cc: Matthew Wilcox, mathieu.desnoyers, npiggin, corbet, akpm,
michael.christie, peterz, linux-mm, linux-kernel, linux-fsdevel,
linux-doc, avagin, surenb, brauner
* Peng Zhang <zhangpeng.00@bytedance.com> [230731 07:45]:
>
>
> 在 2023/7/26 23:08, Matthew Wilcox 写道:
> > On Wed, Jul 26, 2023 at 11:02:52AM -0400, Liam R. Howlett wrote:
> > > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > > static inline
> > > > -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> > > > +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
> > >
> > > I was trying to keep ma_* prefix to mean the first argument is
> > > maple_node and mt_* to mean maple_tree. I wasn't entirely successful
> > > with this and I do see why you want to use ma_, but maybe reverse the
> > > arguments here?
> >
> > I think your first idea is better. Usually we prefer to order the
> > arguments by "containing thing" to "contained thing". So always use
> > (struct address_space *, struct folio *), for example. Or (struct
> > mm_struct *, struct vm_area_struct *).
> There are disagreements here, so how do we decide? In any case, I don't
> know if the new version still has this helper.
Please keep the maple tree as the first argument and use
mt_parent_type as the name, if you still need it.
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-07-31 16:48 ` Liam R. Howlett
@ 2023-08-16 13:11 ` Peng Zhang
2023-08-16 17:40 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-08-16 13:11 UTC (permalink / raw)
To: Liam R. Howlett, Peng Zhang, avagin, npiggin, mathieu.desnoyers,
peterz, michael.christie, surenb, brauner, willy, akpm, corbet,
linux-kernel, linux-fsdevel, linux-mm, linux-doc
在 2023/8/1 00:48, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:39]:
>>
>>
>> 在 2023/7/27 00:08, Liam R. Howlett 写道:
>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>>>> If mas has located a specific entry, it may need to be replaced,
>>>> so introduce mas_replace_entry() to do this. mas_replace_entry()
>>>> will be more efficient than mas_store*() because it doesn't do many
>>>> unnecessary checks.
>>>>
>>>> This function should be inline, but more functions need to be moved to
>>>> the header file, so I didn't do it for the time being.
>>>
>>> I am really nervous having no checks here. I get that this could be
>>> used for duplicating the tree more efficiently, but having a function
>>> that just swaps a value in is very dangerous - especially since it is
>>> decoupled from the tree duplication code.
>> I've thought about this, and I feel like this is something the user
>> should be guaranteed. If the user is not sure whether to use it,
>> mas_store() can be used instead.
>
> Documentation often isn't up to date and even more rarely read.
> mas_replace_entry() does not give a hint of a requirement for a specific
> state to the mas. This is not acceptable.
>
> The description of the function also doesn't say anything about a
> requirement of the maple state, just that it replaces an already
> existing entry. You have to read the notes to find out that 'mas must
> already locate an existing entry'.
>
>> And we should provide this interface
>> because it has better performance.
>
> How much better is the performance? There's always a trade-off, but
> without numbers, this is hard to justify.
I have implemented a new version of this patchset, and I will post it
soon.
I tested the benefits of mas_replace_entry() in userspace.
The test code is attached at the end.
Run three times:
mas_replace_entry(): 2.7613050s 2.7120030s 2.7274200s
mas_store(): 3.8451260s 3.8113200s 3.9334160s
Using mas_store() reduces the performance of duplicating VMAs by about
41%.
So I think mas_replace_entry() is necessary. We can describe it in more
detail in the documentation to prevent users from misusing it.
static noinline void __init bench_forking(struct maple_tree *mt)
{
struct maple_tree newmt;
int i, nr_entries = 134, nr_fork = 80000, ret;
void *val;
MA_STATE(mas, mt, 0, 0);
MA_STATE(newmas, &newmt, 0, 0);
clock_t start;
clock_t end;
double cpu_time_used = 0;
for (i = 0; i <= nr_entries; i++)
mtree_store_range(mt, i*10, i*10 + 5,
xa_mk_value(i), GFP_KERNEL);
for (i = 0; i < nr_fork; i++) {
mt_set_non_kernel(99999);
start = clock();
mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
mas_lock(&newmas);
mas_lock(&mas);
ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
if (ret) {
pr_err("OOM!");
BUG_ON(1);
}
mas_set(&newmas, 0);
mas_for_each(&newmas, val, ULONG_MAX) {
mas_replace_entry(&newmas, val);
}
mas_unlock(&mas);
mas_unlock(&newmas);
end = clock();
cpu_time_used += ((double) (end - start));
mas_destroy(&newmas);
mt_validate(&newmt);
mt_set_non_kernel(0);
mtree_destroy(&newmt);
}
printf("time consumption:%.7fs\n", cpu_time_used / CLOCKS_PER_SEC);
}
>
>>>
>>>>
>>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>>> ---
>>>> include/linux/maple_tree.h | 1 +
>>>> lib/maple_tree.c | 25 +++++++++++++++++++++++++
>>>> 2 files changed, 26 insertions(+)
>>>>
>>>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>>>> index 229fe78e4c89..a05e9827d761 100644
>>>> --- a/include/linux/maple_tree.h
>>>> +++ b/include/linux/maple_tree.h
>>>> @@ -462,6 +462,7 @@ struct ma_wr_state {
>>>> void *mas_walk(struct ma_state *mas);
>>>> void *mas_store(struct ma_state *mas, void *entry);
>>>> +void mas_replace_entry(struct ma_state *mas, void *entry);
>>>> void *mas_erase(struct ma_state *mas);
>>>> int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
>>>> void mas_store_prealloc(struct ma_state *mas, void *entry);
>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>> index efac6761ae37..d58572666a00 100644
>>>> --- a/lib/maple_tree.c
>>>> +++ b/lib/maple_tree.c
>>>> @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
>>>> }
>>>> EXPORT_SYMBOL_GPL(mas_store);
>>>> +/**
>>>> + * mas_replace_entry() - Replace an entry that already exists in the maple tree
>>>> + * @mas: The maple state
>>>> + * @entry: The entry to store
>>>> + *
>>>> + * Please note that mas must already locate an existing entry, and the new entry
>>>> + * must not be NULL. If these two points cannot be guaranteed, please use
>>>> + * mas_store*() instead, otherwise it will cause an internal error in the maple
>>>> + * tree. This function does not need to allocate memory, so it must succeed.
>>>> + */
>>>> +void mas_replace_entry(struct ma_state *mas, void *entry)
>>>> +{
>>>> + void __rcu **slots;
>>>> +
>>>> +#ifdef CONFIG_DEBUG_MAPLE_TREE
>>>> + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
>>>> + MAS_WARN_ON(mas, !entry);
>>>> + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
>>>> +#endif
>>>> +
>>>> + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
>>>> + rcu_assign_pointer(slots[mas->offset], entry);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(mas_replace_entry);
>>>> +
>>>> /**
>>>> * mas_store_gfp() - Store a value into the tree.
>>>> * @mas: The maple state
>>>> --
>>>> 2.20.1
>>>>
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-07-31 16:27 ` Liam R. Howlett
@ 2023-08-16 13:41 ` Peng Zhang
2023-08-16 18:30 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-08-16 13:41 UTC (permalink / raw)
To: Liam R. Howlett, Peng Zhang, willy, michael.christie, surenb,
npiggin, corbet, mathieu.desnoyers, avagin, linux-fsdevel,
linux-kernel, linux-doc, linux-mm, akpm, brauner, peterz
在 2023/8/1 00:27, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:24]:
>>
>>
>> 在 2023/7/27 00:03, Liam R. Howlett 写道:
>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>>>> Introduce interfaces __mt_dup() and mt_dup(), which are used to
>>>> duplicate a maple tree. Compared with traversing the source tree and
>>>> reinserting entry by entry in the new tree, it has better performance.
>>>> The difference between __mt_dup() and mt_dup() is that mt_dup() holds
>>>> an internal lock.
>>>>
>>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>>> ---
>>>> include/linux/maple_tree.h | 3 +
>>>> lib/maple_tree.c | 211 +++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 214 insertions(+)
>>>>
>>>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>>>> index c962af188681..229fe78e4c89 100644
>>>> --- a/include/linux/maple_tree.h
>>>> +++ b/include/linux/maple_tree.h
>>>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>>> void *entry, gfp_t gfp);
>>>> void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>>> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>>>> +
>>>> void mtree_destroy(struct maple_tree *mt);
>>>> void __mt_destroy(struct maple_tree *mt);
>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>> index da3a2fb405c0..efac6761ae37 100644
>>>> --- a/lib/maple_tree.c
>>>> +++ b/lib/maple_tree.c
>>>> @@ -6595,6 +6595,217 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>>> }
>>>> EXPORT_SYMBOL(mtree_erase);
>>>> +/*
>>>> + * mt_dup_free() - Free the nodes of a incomplete maple tree.
>>>> + * @mt: The incomplete maple tree
>>>> + * @node: Free nodes from @node
>>>> + *
>>>> + * This function frees all nodes starting from @node in the reverse order of
>>>> + * mt_dup_build(). At this point we don't need to hold the source tree lock.
>>>> + */
>>>> +static void mt_dup_free(struct maple_tree *mt, struct maple_node *node)
>>>> +{
>>>> + void **slots;
>>>> + unsigned char offset;
>>>> + struct maple_enode *enode;
>>>> + enum maple_type type;
>>>> + unsigned char count = 0, i;
>>>> +
>>>
>>> Can we make these labels inline functions and try to make this a loop?
>> I did this just to make things easier. Refer to the implementation of
>> walk_tg_tree_from() in sched/core.c. Using some loops and inline
>> functions probably doesn't simplify things. I'll try to do that and give
>> up if it complicates things.
>
> Thanks, I'd like to try and simplify the code instead of adding goto
> label loops. The code you are referencing is from 2008 and goto loops
> are not common.
>
>>>
>>>> +try_ascend:
>>>> + if (ma_is_root(node)) {
>>>> + mt_free_one(node);
>>>> + return;
>>>> + }
>>>> +
>>>> + offset = ma_parent_slot(node);
>>>> + type = ma_parent_type(mt, node);
>>>> + node = ma_parent(node);
>>>> + if (!offset)
>>>> + goto free;
>>>> +
>>>> + offset--;
>>>> +
>>>> +descend:
>>>> + slots = (void **)ma_slots(node, type);
>>>> + enode = slots[offset];
>>>> + if (mte_is_leaf(enode))
>>>> + goto free;
>>>> +
>>>> + type = mte_node_type(enode);
>>>> + node = mte_to_node(enode);
>>>> + offset = ma_nonleaf_data_end_nocheck(node, type);
>>>> + goto descend;
>>>> +
>>>> +free:
>>>> + slots = (void **)ma_slots(node, type);
>>>> + count = ma_nonleaf_data_end_nocheck(node, type) + 1;
>>>> + for (i = 0; i < count; i++)
>>>> + ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>>>> +
>>>> + /* Cast to __rcu to avoid sparse checker complaining. */
>>>> + mt_free_bulk(count, (void __rcu **)slots);
>>>> + goto try_ascend;
>>>> +}
>>>> +
>>>> +/*
>>>> + * mt_dup_build() - Build a new maple tree from a source tree
>>>> + * @mt: The source maple tree to copy from
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + * @to_free: Free nodes starting from @to_free if the build fails
>>>> + *
>>>> + * This function builds a new tree in DFS preorder. If it fails due to memory
>>>> + * allocation, @to_free will store the last failed node to free the incomplete
>>>> + * tree. Use mt_dup_free() to free nodes.
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>> + */
>>>> +static inline int mt_dup_build(struct maple_tree *mt, struct maple_tree *new,
>>>> + gfp_t gfp, struct maple_node **to_free)
>>>
>>> I am trying to change the functions to be two tabs of indent for
>>> arguments from now on. It allows for more to fit on a single line and
>>> still maintains a clear separation between code and argument lists.
>> I'm not too concerned about code formatting... at least in this
>> patchset.
>
> I have a mess of it in the tree and wanted to communicate my desire to
> shift to using two tabs for extra arguments in the future.
>
>>>
>>>> +{
>>>> + struct maple_enode *enode;
>>>> + struct maple_node *new_node, *new_parent = NULL, *node;
>>>> + enum maple_type type;
>>>> + void __rcu **slots;
>>>> + void **new_slots;
>>>> + unsigned char count, request, i, offset;
>>>> + unsigned long *set_parent;
>>>> + unsigned long new_root;
>>>> +
>>>> + mt_init_flags(new, mt->ma_flags);
>>>> + enode = mt_root_locked(mt);
>>>> + if (unlikely(!xa_is_node(enode))) {
>>>> + rcu_assign_pointer(new->ma_root, enode);
>>>> + return 0;
>>>> + }
>>>> +
>>>> + new_node = mt_alloc_one(gfp);
>>>> + if (!new_node)
>>>> + return -ENOMEM;
>>>> +
>>>> + new_root = (unsigned long)new_node;
>>>> + new_root |= (unsigned long)enode & MAPLE_NODE_MASK;
>>>> +
>>>> +copy_node:
>>>
>>> Can you make copy_node, descend, ascend inline functions instead of the
>>> goto jumping please? It's better to have loops over jumping around a
>>> lot. Gotos are good for undoing things and retry, but constructing
>>> loops with them makes it difficult to follow.
>> Same as above.
>>>
>>>> + node = mte_to_node(enode);
>>>> + type = mte_node_type(enode);
>>>> + memcpy(new_node, node, sizeof(struct maple_node));
>>>> +
>>>> + set_parent = (unsigned long *)&(new_node->parent);
>>>> + *set_parent &= MAPLE_NODE_MASK;
>>>> + *set_parent |= (unsigned long)new_parent;
>>>
>>> Maybe make a small inline to set the parent instead of this?
>>>
>>> There are some defined helpers for setting the types like
>>> ma_parent_ptr() and ma_enode_ptr() to make casting more type-safe.
>> Ok, I'll try to do that.
>>>
>>>> + if (ma_is_leaf(type))
>>>> + goto ascend;
>>>> +
>>>> + new_slots = (void **)ma_slots(new_node, type);
>>>> + slots = ma_slots(node, type);
>>>> + request = ma_nonleaf_data_end(mt, node, type) + 1;
>>>> + count = mt_alloc_bulk(gfp, request, new_slots);
>>>> + if (!count) {
>>>> + *to_free = new_node;
>>>> + return -ENOMEM;
>>>> + }
>>>> +
>>>> + for (i = 0; i < count; i++)
>>>> + ((unsigned long *)new_slots)[i] |=
>>>> + ((unsigned long)mt_slot_locked(mt, slots, i) &
>>>> + MAPLE_NODE_MASK);
>>>> + offset = 0;
>>>> +
>>>> +descend:
>>>> + new_parent = new_node;
>>>> + enode = mt_slot_locked(mt, slots, offset);
>>>> + new_node = mte_to_node(new_slots[offset]);
>>>> + goto copy_node;
>>>> +
>>>> +ascend:
>>>> + if (ma_is_root(node)) {
>>>> + new_node = mte_to_node((void *)new_root);
>>>> + new_node->parent = ma_parent_ptr((unsigned long)new |
>>>> + MA_ROOT_PARENT);
>>>> + rcu_assign_pointer(new->ma_root, (void *)new_root);
>>>> + return 0;
>>>> + }
>>>> +
>>>> + offset = ma_parent_slot(node);
>>>> + type = ma_parent_type(mt, node);
>>>> + node = ma_parent(node);
>>>> + new_node = ma_parent(new_node);
>>>> + if (offset < ma_nonleaf_data_end(mt, node, type)) {
>>>> + offset++;
>>>> + new_slots = (void **)ma_slots(new_node, type);
>>>> + slots = ma_slots(node, type);
>>>> + goto descend;
>>>> + }
>>>> +
>>>> + goto ascend;
>>>> +}
>>>> +
>>>> +/**
>>>> + * __mt_dup(): Duplicate a maple tree
>>>> + * @mt: The source maple tree
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + *
>>>> + * This function duplicates a maple tree using a faster method than traversing
>>>> + * the source tree and inserting entries into the new tree one by one. The user
>>>> + * needs to lock the source tree manually. Before calling this function, @new
>>>> + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
>>>> + * we may also need to manually set @new's external lock using
>>>> + * mt_set_external_lock().
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>> + */
>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>
>>> We use mas_ for things that won't handle the locking and pass in a maple
>>> state. Considering the leaves need to be altered once this is returned,
>>> I would expect passing in a maple state should be feasible?
>> But we don't really need mas here. What do you think the state of mas
>> should be when this function returns? Make it point to the first entry,
>> or the last entry?
>
> I would write it to point to the first element so that the call to
> replace the first element can just do that without an extra walk and
> document the maple state end point.
Unfortunately, this does not seem to be convenient. Users usually use
mas_for_each() to replace elements. If we set mas to the first element,
the first call to mas_find() in mas_for_each() will get the next
element.
There may also be other scenarios where the user does not necessarily
have to replace every element.
Finally, getting the first element in __mt_dup() requires an additional
check to determine whether the first element has already been recorded. Such
a check will be performed at each leaf node, which is unnecessary
overhead.
Of course, the first reason is the main reason, which prevents us from
using mas_for_each(). So I don't want to record the first element.
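The mas_for_each() hazard described above can be seen with a plain-array analogy. This is illustrative code only, not maple-tree internals: an iterator built on a "find next" primitive never visits the element the state is already positioned on.

```c
#include <stddef.h>

/* Toy iterator, for illustration only: like mas_find() inside
 * mas_for_each(), next() always advances before returning, so an
 * element the state is already positioned on is never visited. */
struct toy_iter {
	const int *arr;
	int n;
	int pos;	/* index of the element currently pointed at */
};

static const int *toy_next(struct toy_iter *it)
{
	if (it->pos + 1 >= it->n)
		return NULL;
	return &it->arr[++it->pos];
}

static int toy_count_from(const int *arr, int n, int start_pos)
{
	struct toy_iter it = { .arr = arr, .n = n, .pos = start_pos };
	int count = 0;

	while (toy_next(&it))
		count++;	/* the element at start_pos is skipped */
	return count;
}
```

Starting the loop positioned on element 0 visits only n - 1 elements, while starting "before the first element" visits all n, which is why leaving the state unpositioned after __mt_dup() composes cleanly with mas_for_each().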
>
>>>
>>>> +{
>>>> + int ret;
>>>> + struct maple_node *to_free = NULL;
>>>> +
>>>> + ret = mt_dup_build(mt, new, gfp, &to_free);
>>>> +
>>>> + if (unlikely(ret == -ENOMEM)) {
>>>
>>> On other errors, will the half constructed tree be returned? Is this
>>> safe?
>> Of course, mt_dup_free() is carefully designed to handle this.
>>>
>>>> + if (to_free)
>>>> + mt_dup_free(new, to_free);
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(__mt_dup);
>>>> +
>>>> +/**
>>>> + * mt_dup(): Duplicate a maple tree
>>>> + * @mt: The source maple tree
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + *
>>>> + * This function duplicates a maple tree using a faster method than traversing
>>>> + * the source tree and inserting entries into the new tree one by one. The
>>>> + * function will lock the source tree with an internal lock, and the user does
>>>> + * not need to manually handle the lock. Before calling this function, @new must
>>>> + * be an empty tree or an uninitialized tree. If @mt uses an external lock, we
>>>> + * may also need to manually set @new's external lock using
>>>> + * mt_set_external_lock().
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>> + */
>>>> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>
>>> mtree_ ususually used to indicate locking is handled.
>> Before unifying mtree_* and mt_*, I don't think I can see any difference
>> between them. At least mt_set_in_rcu() and mt_clear_in_rcu() will hold
>> the lock.
>
> Fair enough. I was thinking this closely matches __mt_destroy() and
> mtree_destroy(). We could be consistent in our inconsistency, at least.
>
>>>
>>>> +{
>>>> + int ret;
>>>> + struct maple_node *to_free = NULL;
>>>> +
>>>> + mtree_lock(mt);
>>>> + ret = mt_dup_build(mt, new, gfp, &to_free);
>>>> + mtree_unlock(mt);
>>>> +
>>>> + if (unlikely(ret == -ENOMEM)) {
>>>> + if (to_free)
>>>> + mt_dup_free(new, to_free);
>>>
>>> Again, is a half constructed tree safe to return? Since each caller
>>> checks to_free is NULL, could that be in mt_dup_free() instead?
>> Yes, this check can be put in mt_dup_free().
>>>
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(mt_dup);
>>>> +
>>>> /**
>>>> * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>>> * @mt: The maple tree
>>>> --
>>>> 2.20.1
>>>>
>>>>
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-08-16 13:11 ` Peng Zhang
@ 2023-08-16 17:40 ` Liam R. Howlett
2023-08-18 9:39 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-08-16 17:40 UTC (permalink / raw)
To: Peng Zhang
Cc: avagin, npiggin, mathieu.desnoyers, peterz, michael.christie,
surenb, brauner, willy, akpm, corbet, linux-kernel, linux-fsdevel,
linux-mm, linux-doc
* Peng Zhang <zhangpeng.00@bytedance.com> [230816 09:11]:
>
>
> 在 2023/8/1 00:48, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:39]:
> > >
> > >
> > > 在 2023/7/27 00:08, Liam R. Howlett 写道:
> > > > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > > > If mas has located a specific entry, it may need to be replaced,
> > > > > so introduce mas_replace_entry() to do this. mas_replace_entry()
> > > > > will be more efficient than mas_store*() because it doesn't do many
> > > > > unnecessary checks.
> > > > >
> > > > > This function should be inline, but more functions need to be moved to
> > > > > the header file, so I didn't do it for the time being.
> > > >
> > > > I am really nervous having no checks here. I get that this could be
> > > > used for duplicating the tree more efficiently, but having a function
> > > > that just swaps a value in is very dangerous - especially since it is
> > > > decoupled from the tree duplication code.
> > > I've thought about this, and I feel like this is something the user
> > > should guarantee. If the user is not sure whether to use it,
> > > mas_store() can be used instead.
> >
> > Documentation often isn't up to date and even more rarely read.
> > mas_replace_entry() does not give a hint of a requirement for a specific
> > state to the mas. This is not acceptable.
> >
> > The description of the function also doesn't say anything about a
> > requirement of the maple state, just that it replaces an already
> > existing entry. You have to read the notes to find out that 'mas must
> > already locate an existing entry'.
> >
> > > And we should provide this interface
> > > because it has better performance.
> >
> > How much better is the performance? There's always a trade-off, but
> > without numbers, this is hard to justify.
> I have implemented a new version of this patchset, and I will post it
> soon.
>
> I tested the benefits of mas_replace_entry() in userspace.
> The test code is attached at the end.
>
> Run three times:
> mas_replace_entry(): 2.7613050s 2.7120030s 2.7274200s
> mas_store(): 3.8451260s 3.8113200s 3.9334160s
This runtime is too short; we should increase the number of elements or
loops until it is over 10 seconds. This will make the setup time
and other variance less significant, and we can use the command run time
as a rough estimate of performance. IIRC 134 was picked as a rough
estimate of an average task's size, so maybe increase the loops.
I understand the numbers here are from clock recordings to demonstrate
the significance of your change.
>
> Using mas_store() reduces the performance of duplicating VMAs by about
> 41%.
>
> So I think mas_replace_entry() is necessary. We can describe it in more
> detail in the documentation to prevent users from misusing it.
I think something is necessary for a quicker replacement, yes. I don't
want to go as far as you did with the lack of checking.
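A middle ground between the unchecked swap and a full mas_store() might look like the sketch below. This is a standalone analogy over a plain slot array; the function name, the fixed NSLOTS bound, and the error convention are all invented for illustration, not proposed maple-tree API.

```c
#include <stddef.h>

#define NSLOTS 16	/* stand-in for mt_slots[mte_node_type(...)] */

/* Refuse exactly the states an unchecked replacement would corrupt:
 * a NULL new entry, an out-of-range offset, and an empty slot (no
 * existing entry to replace). On failure the caller can fall back
 * to a full store path instead of silently corrupting the node. */
static int checked_replace(void **slots, unsigned int offset, void *entry)
{
	if (entry == NULL)
		return -1;	/* would plant an unexpected NULL */
	if (offset >= NSLOTS)
		return -1;	/* state does not point into this node */
	if (slots[offset] == NULL)
		return -1;	/* nothing to replace; needs a real store */
	slots[offset] = entry;
	return 0;
}
```

The checks are cheap enough to keep unconditionally, while still skipping the range-fitting and rebalancing work that makes mas_store() slower for the pure replacement case.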
>
>
> static noinline void __init bench_forking(struct maple_tree *mt)
> {
> struct maple_tree newmt;
> int i, nr_entries = 134, nr_fork = 80000, ret;
> void *val;
> MA_STATE(mas, mt, 0, 0);
> MA_STATE(newmas, &newmt, 0, 0);
> clock_t start;
> clock_t end;
> double cpu_time_used = 0;
>
> for (i = 0; i <= nr_entries; i++)
> mtree_store_range(mt, i*10, i*10 + 5,
> xa_mk_value(i), GFP_KERNEL);
>
> for (i = 0; i < nr_fork; i++) {
> mt_set_non_kernel(99999);
>
> start = clock();
> mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
> mas_lock(&newmas);
> mas_lock(&mas);
> ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
> if (ret) {
> pr_err("OOM!");
> BUG_ON(1);
> }
>
> mas_set(&newmas, 0);
> mas_for_each(&newmas, val, ULONG_MAX) {
> mas_replace_entry(&newmas, val);
> }
>
> mas_unlock(&mas);
> mas_unlock(&newmas);
> end = clock();
> cpu_time_used += ((double) (end - start));
>
> mas_destroy(&newmas);
> mt_validate(&newmt);
> mt_set_non_kernel(0);
> mtree_destroy(&newmt);
> }
> printf("time consumption:%.7fs\n", cpu_time_used / CLOCKS_PER_SEC);
> }
>
>
> >
> > > >
> > > > >
> > > > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > > > ---
> > > > > include/linux/maple_tree.h | 1 +
> > > > > lib/maple_tree.c | 25 +++++++++++++++++++++++++
> > > > > 2 files changed, 26 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > > > index 229fe78e4c89..a05e9827d761 100644
> > > > > --- a/include/linux/maple_tree.h
> > > > > +++ b/include/linux/maple_tree.h
> > > > > @@ -462,6 +462,7 @@ struct ma_wr_state {
> > > > > void *mas_walk(struct ma_state *mas);
> > > > > void *mas_store(struct ma_state *mas, void *entry);
> > > > > +void mas_replace_entry(struct ma_state *mas, void *entry);
> > > > > void *mas_erase(struct ma_state *mas);
> > > > > int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
> > > > > void mas_store_prealloc(struct ma_state *mas, void *entry);
> > > > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > > > index efac6761ae37..d58572666a00 100644
> > > > > --- a/lib/maple_tree.c
> > > > > +++ b/lib/maple_tree.c
> > > > > @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(mas_store);
> > > > > +/**
> > > > > + * mas_replace_entry() - Replace an entry that already exists in the maple tree
> > > > > + * @mas: The maple state
> > > > > + * @entry: The entry to store
> > > > > + *
> > > > > + * Please note that mas must already locate an existing entry, and the new entry
> > > > > + * must not be NULL. If these two points cannot be guaranteed, please use
> > > > > + * mas_store*() instead, otherwise it will cause an internal error in the maple
> > > > > + * tree. This function does not need to allocate memory, so it must succeed.
> > > > > + */
> > > > > +void mas_replace_entry(struct ma_state *mas, void *entry)
> > > > > +{
> > > > > + void __rcu **slots;
> > > > > +
> > > > > +#ifdef CONFIG_DEBUG_MAPLE_TREE
CONFIG_DEBUG_MAPLE_TREE is not necessary; MAS_WARN_ON() will be compiled
out if it's not set.
> > > > > + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
> > > > > + MAS_WARN_ON(mas, !entry);
> > > > > + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
> > > > > +#endif
> > > > > +
> > > > > + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
> > > > > + rcu_assign_pointer(slots[mas->offset], entry);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(mas_replace_entry);
> > > > > +
> > > > > /**
> > > > > * mas_store_gfp() - Store a value into the tree.
> > > > > * @mas: The maple state
> > > > > --
> > > > > 2.20.1
> > > > >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-08-16 13:41 ` Peng Zhang
@ 2023-08-16 18:30 ` Liam R. Howlett
2023-08-18 11:53 ` Peng Zhang
0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2023-08-16 18:30 UTC (permalink / raw)
To: Peng Zhang
Cc: willy, michael.christie, surenb, npiggin, corbet,
mathieu.desnoyers, avagin, linux-fsdevel, linux-kernel, linux-doc,
linux-mm, akpm, brauner, peterz
* Peng Zhang <zhangpeng.00@bytedance.com> [230816 09:42]:
>
>
...
> > > > > +/**
> > > > > + * __mt_dup(): Duplicate a maple tree
> > > > > + * @mt: The source maple tree
> > > > > + * @new: The new maple tree
> > > > > + * @gfp: The GFP_FLAGS to use for allocations
> > > > > + *
> > > > > + * This function duplicates a maple tree using a faster method than traversing
> > > > > + * the source tree and inserting entries into the new tree one by one. The user
> > > > > + * needs to lock the source tree manually. Before calling this function, @new
> > > > > + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
> > > > > + * we may also need to manually set @new's external lock using
> > > > > + * mt_set_external_lock().
> > > > > + *
> > > > > + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> > > > > + */
> > > > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > >
> > > > We use mas_ for things that won't handle the locking and pass in a maple
> > > > state. Considering the leaves need to be altered once this is returned,
> > > > I would expect passing in a maple state should be feasible?
> > > But we don't really need mas here. What do you think the state of mas
> > > should be when this function returns? Make it point to the first entry,
> > > or the last entry?
> >
> > I would write it to point to the first element so that the call to
> > replace the first element can just do that without an extra walk and
> > document the maple state end point.
> Unfortunately, this does not seem to be convenient. Users usually use
> mas_for_each() to replace elements. If we set mas to the first element,
> the first call to mas_find() in mas_for_each() will get the next
> element.
This sounds like the need for another iterator specifically for
duplicating.
>
> There may also be other scenarios where the user does not necessarily
> have to replace every element.
Do you mean a limit or elements that need to be skipped? We could have
a limit on the iteration.
>
> Finally, getting the first element in __mt_dup() requires an additional
> check of whether the first element has already been recorded. Such
> a check will be performed at each leaf node, which is unnecessary
> overhead.
>
> Of course, the first reason is the main reason, which prevents us from
> using mas_for_each(). So I don't want to record the first element.
I don't like the interface because it can easily be misunderstood and
used incorrectly. I don't know how to make a cleaner interface, but
I've gone through a few thoughts:
The first was to hide _all of it_ in a new iterator:
mas_dup_each(old, new, old_entry) {
if (don't_dup(old_entry)) {
mas_erase(new);
continue;
}
mas_dup_insert(new, new_entry);
}
This iterator would check if mas_is_start(old) and dup the tree in that
event. Leave both maple states pointing to the first element and set
old_entry. I don't know how to handle the failure in duplicating the
tree in this case - I guess we could return old_entry = NULL and check
if mas_is_err(old) after the loop. Do you see a problem with this?
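To make the error-handling question concrete, here is a stand-alone toy model of that pattern (the names and the state struct are hypothetical, not the maple tree API): the iterator does the bulk duplication on first use, and a failure surfaces as zero iterations plus an error flag the caller must test after the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Toy state standing in for a maple state; not the in-tree layout. */
struct toy_state {
	void **slots;  /* entries of the "tree" */
	int nr;        /* number of entries */
	int idx;       /* iteration cursor, -1 == not started */
	int error;     /* set when the up-front duplication fails */
};

/* Pretend to duplicate the tree; fail when asked to simulate OOM. */
static int toy_dup(struct toy_state *st, int simulate_oom)
{
	if (simulate_oom) {
		st->error = -12;  /* -ENOMEM */
		return st->error;
	}
	st->idx = 0;  /* "duplication" done; point at the first element */
	return 0;
}

/* Return the next entry, triggering the duplication on first use. */
static void *toy_dup_next(struct toy_state *st, int simulate_oom)
{
	if (st->idx < 0 && toy_dup(st, simulate_oom))
		return NULL;  /* caller must test st->error after the loop */
	if (st->idx >= st->nr)
		return NULL;  /* normal end of iteration */
	return st->slots[st->idx++];
}

#define toy_dup_each(st, entry, oom) \
	while (((entry) = toy_dup_next((st), (oom))) != NULL)
```

The awkward part shows up exactly as described: nothing forces the caller to check the error flag once the loop exits.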
The second idea was an init of the old tree. This is closest to what you
have:
if (mas_dup_init(old, new))
goto -ENOMEM;
mas_dup_each(old, new) {
if (don't_dup(old_entry)) {
mas_erase(new);
continue;
}
mas_dup_insert(new, new_entry);
}
This would duplicate the tree at the start and leave both pointing at
the first element so that mas_dup_each() could start on that element.
Each subsequent call would go to the next element in both maple states.
It sounds like you don't want this for performance reasons? Although
looking at mas_find() today, I think this could still work since we are
checking the maple state for a lot.
Both ideas could be even faster than what you have if we handle the
special cases of mas_is_none()/mas_is_ptr() in a smarter way, because we
don't need to worry about the entry point of the maple state as much as
we do with mas_find()/mas_for_each(). I mean, is it possible to
get to a mas_is_none() or mas_is_ptr() on duplicating a tree? How do we
handle these users?
Both ideas still suffer from someone saying "Gee, that {insert function
name here} is used in the forking code, so I can totally use that in my
code because that's how it works!" and finding out it works for the limited
testing they do. Then it fails later and the emails start flying.
I almost think we should do something like this on insert:
void mas_dup_insert(old, new, new_entry) {
WARN_ON_ONCE(old == new);
WARN_ON_ONCE(old->index != new->index);
WARN_ON_ONCE(old->last != new->last);
...
}
This would at least _require_ someone to have two maple states and
hopefully think twice on using it where it should not be used.
The bottom line is that this code is close to what we need to make
forking better, but I fear the misuse of the interface.
Something else to think about:
In the work items for the Maple Tree, there is a plan to have an enum to
specify the type of write that is going to happen. The idea was for
mas_preallocate() to set this type of write so we can just go right to
the correct function. We could use that here and set the maple state
write type to a direct replacement.
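A stand-alone sketch of what that enum dispatch could look like (all names here are illustrative, not the planned maple tree API): the state records the kind of write so the store path can jump straight to a direct replacement and skip the general checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical write-type classification carried in the state. */
enum toy_write_type {
	TOY_WR_INVALID,  /* must be classified before storing */
	TOY_WR_REPLACE,  /* slot-for-slot replacement, no allocation */
	TOY_WR_GENERAL,  /* full store path: splits, rebalance, etc. */
};

/* Toy stand-in for a maple state; not the in-tree layout. */
struct toy_mas {
	enum toy_write_type wr_type;
	void **slot;  /* the slot the write targets */
};

static int general_stores;  /* counts falls into the slow path */

/* Dispatch directly on the pre-classified write type. */
static void toy_store(struct toy_mas *mas, void *entry)
{
	switch (mas->wr_type) {
	case TOY_WR_REPLACE:
		*mas->slot = entry;  /* direct write, cannot fail */
		break;
	default:
		general_stores++;    /* stand-in for the checked path */
		*mas->slot = entry;
		break;
	}
}
```

The duplicate-then-replace user would set TOY_WR_REPLACE once up front, so every store in the loop takes the cheap branch.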
Thanks,
Liam
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-08-16 17:40 ` Liam R. Howlett
@ 2023-08-18 9:39 ` Peng Zhang
2023-08-18 16:15 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-08-18 9:39 UTC (permalink / raw)
To: Liam R. Howlett, Peng Zhang, avagin, npiggin, mathieu.desnoyers,
peterz, michael.christie, surenb, brauner, willy, akpm, corbet,
linux-kernel, linux-fsdevel, linux-mm, linux-doc
在 2023/8/17 01:40, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230816 09:11]:
>>
>>
>> 在 2023/8/1 00:48, Liam R. Howlett 写道:
>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:39]:
>>>>
>>>>
>>>> 在 2023/7/27 00:08, Liam R. Howlett 写道:
>>>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>>>>>> If mas has located a specific entry, it may be necessary to replace this
>>>>>> entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
>>>>>> will be more efficient than mas_store*() because it doesn't do many
>>>>>> unnecessary checks.
>>>>>>
>>>>>> This function should be inline, but more functions need to be moved to
>>>>>> the header file, so I didn't do it for the time being.
>>>>>
>>>>> I am really nervous having no checks here. I get that this could be
>>>>> used for duplicating the tree more efficiently, but having a function
>>>>> that just swaps a value in is very dangerous - especially since it is
>>>>> decoupled from the tree duplication code.
>>>> I've thought about this, and I feel like this is something the user
>>>> should be guaranteed. If the user is not sure whether to use it,
>>>> mas_store() can be used instead.
>>>
>>> Documentation often isn't up to date and even more rarely read.
>>> mas_replace_entry() does not give a hint of a requirement for a specific
>>> state to the mas. This is not acceptable.
>>>
>>> The description of the function also doesn't say anything about a
>>> requirement of the maple state, just that it replaces an already
>>> existing entry. You have to read the notes to find out that 'mas must
>>> already locate an existing entry'.
>>>
>>>> And we should provide this interface
>>>> because it has better performance.
>>>
>>> How much better is the performance? There's always a trade off but
>>> without numbers, this is hard to justify.
>> I have implemented a new version of this patchset, and I will post it
>> soon.
>>
>> I tested the benefits of mas_replace_entry() in userspace.
>> The test code is attached at the end.
>>
>> Run three times:
>> mas_replace_entry(): 2.7613050s 2.7120030s 2.7274200s
>> mas_store(): 3.8451260s 3.8113200s 3.9334160s
>
> This runtime is too short, we should increase the number of elements or
> loops until it is over 10 seconds. This will make the setup time
> and other variances less significant and we can use the command run time
> as a rough estimate of performance. IIRC 134 was picked for a rough
> estimate of an average task size so maybe increase the loops.
I changed nr_entries to 1000, and the measured numbers are as follows:
mas_replace_entry(): 20.0375820s
mas_store(): 28.6175720s
It can be seen that mas_store() is still nearly 40% slower.
>
> I understand the numbers here are from clock recordings to demonstrate
> the significance of your change.
>
>>
>> Using mas_store() reduces the performance of duplicating VMAs by about
>> 41%.
>>
>> So I think mas_replace_entry() is necessary. We can describe it in more
>> detail in the documentation to prevent users from misusing it.
>
> I think something is necessary for a quicker replacement, yes. I don't
> want to go as far as you did with the lack of checking.
>
>>
>>
>> static noinline void __init bench_forking(struct maple_tree *mt)
>> {
>> struct maple_tree newmt;
>> int i, nr_entries = 134, nr_fork = 80000, ret;
>> void *val;
>> MA_STATE(mas, mt, 0, 0);
>> MA_STATE(newmas, &newmt, 0, 0);
>> clock_t start;
>> clock_t end;
>> double cpu_time_used = 0;
>>
>> for (i = 0; i <= nr_entries; i++)
>> mtree_store_range(mt, i*10, i*10 + 5,
>> xa_mk_value(i), GFP_KERNEL);
>>
>> for (i = 0; i < nr_fork; i++) {
>> mt_set_non_kernel(99999);
>>
>> start = clock();
>> mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
>> mas_lock(&newmas);
>> mas_lock(&mas);
>> ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
>> if (ret) {
>> pr_err("OOM!");
>> BUG_ON(1);
>> }
>>
>> mas_set(&newmas, 0);
>> mas_for_each(&newmas, val, ULONG_MAX) {
>> mas_replace_entry(&newmas, val);
>> }
>>
>> mas_unlock(&mas);
>> mas_unlock(&newmas);
>> end = clock();
>> cpu_time_used += ((double) (end - start));
>>
>> mas_destroy(&newmas);
>> mt_validate(&newmt);
>> mt_set_non_kernel(0);
>> mtree_destroy(&newmt);
>> }
>> printf("time consumption:%.7fs\n", cpu_time_used / CLOCKS_PER_SEC);
>> }
>>
>>
>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>>>>> ---
>>>>>> include/linux/maple_tree.h | 1 +
>>>>>> lib/maple_tree.c | 25 +++++++++++++++++++++++++
>>>>>> 2 files changed, 26 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>>>>>> index 229fe78e4c89..a05e9827d761 100644
>>>>>> --- a/include/linux/maple_tree.h
>>>>>> +++ b/include/linux/maple_tree.h
>>>>>> @@ -462,6 +462,7 @@ struct ma_wr_state {
>>>>>> void *mas_walk(struct ma_state *mas);
>>>>>> void *mas_store(struct ma_state *mas, void *entry);
>>>>>> +void mas_replace_entry(struct ma_state *mas, void *entry);
>>>>>> void *mas_erase(struct ma_state *mas);
>>>>>> int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
>>>>>> void mas_store_prealloc(struct ma_state *mas, void *entry);
>>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>>> index efac6761ae37..d58572666a00 100644
>>>>>> --- a/lib/maple_tree.c
>>>>>> +++ b/lib/maple_tree.c
>>>>>> @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(mas_store);
>>>>>> +/**
>>>>>> + * mas_replace_entry() - Replace an entry that already exists in the maple tree
>>>>>> + * @mas: The maple state
>>>>>> + * @entry: The entry to store
>>>>>> + *
>>>>>> + * Please note that mas must already locate an existing entry, and the new entry
>>>>>> + * must not be NULL. If these two points cannot be guaranteed, please use
>>>>>> + * mas_store*() instead, otherwise it will cause an internal error in the maple
>>>>>> + * tree. This function does not need to allocate memory, so it must succeed.
>>>>>> + */
>>>>>> +void mas_replace_entry(struct ma_state *mas, void *entry)
>>>>>> +{
>>>>>> + void __rcu **slots;
>>>>>> +
>>>>>> +#ifdef CONFIG_DEBUG_MAPLE_TREE
>
> CONFIG_DEBUG_MAPLE_TREE is not necessary; MAS_WARN_ON() will be compiled
> out if it's not set.
>
>>>>>> + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
>>>>>> + MAS_WARN_ON(mas, !entry);
>>>>>> + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
>>>>>> +#endif
>>>>>> +
>>>>>> + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
>>>>>> + rcu_assign_pointer(slots[mas->offset], entry);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(mas_replace_entry);
>>>>>> +
>>>>>> /**
>>>>>> * mas_store_gfp() - Store a value into the tree.
>>>>>> * @mas: The maple state
>>>>>> --
>>>>>> 2.20.1
>>>>>>
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-08-16 18:30 ` Liam R. Howlett
@ 2023-08-18 11:53 ` Peng Zhang
2023-08-18 16:13 ` Liam R. Howlett
0 siblings, 1 reply; 42+ messages in thread
From: Peng Zhang @ 2023-08-18 11:53 UTC (permalink / raw)
To: Liam R. Howlett, Peng Zhang, willy, michael.christie, surenb,
npiggin, corbet, mathieu.desnoyers, avagin, linux-fsdevel,
linux-kernel, linux-doc, linux-mm, akpm, brauner, peterz
在 2023/8/17 02:30, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230816 09:42]:
>>
>>
> ...
>
>>>>>> +/**
>>>>>> + * __mt_dup(): Duplicate a maple tree
>>>>>> + * @mt: The source maple tree
>>>>>> + * @new: The new maple tree
>>>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>>>> + *
>>>>>> + * This function duplicates a maple tree using a faster method than traversing
>>>>>> + * the source tree and inserting entries into the new tree one by one. The user
>>>>>> + * needs to lock the source tree manually. Before calling this function, @new
>>>>>> + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
>>>>>> + * we may also need to manually set @new's external lock using
>>>>>> + * mt_set_external_lock().
>>>>>> + *
>>>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>>>> + */
>>>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>>>
>>>>> We use mas_ for things that won't handle the locking and pass in a maple
>>>>> state. Considering the leaves need to be altered once this is returned,
>>>>> I would expect passing in a maple state should be feasible?
>>>> But we don't really need mas here. What do you think the state of mas
>>>> should be when this function returns? Make it point to the first entry,
>>>> or the last entry?
>>>
>>> I would write it to point to the first element so that the call to
>>> replace the first element can just do that without an extra walk and
>>> document the maple state end point.
>> Unfortunately, this does not seem to be convenient. Users usually use
>> mas_for_each() to replace elements. If we set mas to the first element,
>> the first call to mas_find() in mas_for_each() will get the next
>> element.
>
> This sounds like the need for another iterator specifically for
> duplicating.
>
>>
>> There may also be other scenarios where the user does not necessarily
>> have to replace every element.
>
> Do you mean a limit or elements that need to be skipped? We could have
> a limit on the iteration.
>
>>
>> Finally, getting the first element in __mt_dup() requires an additional
>> check of whether the first element has already been recorded. Such
>> a check will be performed at each leaf node, which is unnecessary
>> overhead.
>>
>> Of course, the first reason is the main reason, which prevents us from
>> using mas_for_each(). So I don't want to record the first element.
>
>
> I don't like the interface because it can easily be misunderstood and
> used incorrectly. I don't know how to make a cleaner interface, but
> I've gone through a few thoughts:
>
> The first was to hide _all of it_ in a new iterator:
> mas_dup_each(old, new, old_entry) {
> if (don't_dup(old_entry)) {
> mas_erase(new);
> continue;
> }
>
> mas_dup_insert(new, new_entry);
> }
>
> This iterator would check if mas_is_start(old) and dup the tree in that
> event. Leave both maple states pointing to the first element and set
> old_entry. I don't know how to handle the failure in duplicating the
> tree in this case - I guess we could return old_entry = NULL and check
> if mas_is_err(old) after the loop. Do you see a problem with this?
This interface looks OK. But handling the failure case is tricky.
>
>
> The second idea was an init of the old tree. This is closest to what you
> have:
>
> if (mas_dup_init(old, new))
> goto -ENOMEM;
>
> mas_dup_each(old, new) {
> if (don't_dup(old_entry)) {
> mas_erase(new);
> continue;
> }
>
> mas_dup_insert(new, new_entry);
> }
I think this interface could be better.
>
> This would duplicate the tree at the start and leave both pointing at
> the first element so that mas_dup_each() could start on that element.
> Each subsequent call would go to the next element in both maple states.
Every element of the new tree is the same as the old tree, and we don't
need to maintain the mas of the old tree. It is enough to maintain the
mas of the new tree when traversing.
> It sounds like you don't want this for performance reasons? Although
I mean I don't want to record the first element while duplicating. But
we can get the first element after the duplicate completes. This can
also still be within the implementation of the interface.
> looking at mas_find() today, I think this could still work since we are
> checking the maple state for a lot.
Yes, mas_find() does a whole bunch of checks.
>
> Both ideas could be even faster than what you have if we handle the
> special cases of mas_is_none()/mas_is_ptr() in a smarter way, because we
> don't need to worry about the entry point of the maple state as much as
> we do with mas_find()/mas_for_each(). I mean, is it possible to
> get to a mas_is_none() or mas_is_ptr() on duplicating a tree? How do we
> handle these users?
The check for mas_is_none() or mas_is_ptr() in mas_find() is really not
worth it if we hold the lock. There doesn't seem to be a good way around
mas_is_ptr() since it needs to enter the loop once. mas_is_none() can be
solved because it does not enter the loop; we can use it as a condition
to enter the loop.
If we avoid mas_find() to skip its internal checks, I have to figure out
how to handle mas_is_ptr() properly.
>
> Both ideas still suffer from someone saying "Gee, that {insert function
> name here} is used in the forking code, so I can totally use that in my
> code because that's how it works!" and finding out it works for the limited
> testing they do. Then it fails later and the emails start flying.
>
>
> I almost think we should do something like this on insert:
>
> void mas_dup_insert(old, new, new_entry) {
> WARN_ON_ONCE(old == new);
> WARN_ON_ONCE(old->index != new->index);
> WARN_ON_ONCE(old->last != new->last);
> ...
> }
Maintaining the old mas doesn't feel worth it. If we do this, we have to
traverse the old tree one more time.
>
> This would at least _require_ someone to have two maple states and
> hopefully think twice on using it where it should not be used.
>
> The bottom line is that this code is close to what we need to make
> forking better, but I fear the misuse of the interface.
>
> Something else to think about:
> In the work items for the Maple Tree, there is a plan to have an enum to
> specify the type of write that is going to happen. The idea was for
> mas_preallocate() to set this type of write so we can just go right to
> the correct function. We could use that here and set the maple state
> write type to a direct replacement.
This can be the next step. We can do without it for now.
>
> Thanks,
> Liam
>
* Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
2023-08-18 11:53 ` Peng Zhang
@ 2023-08-18 16:13 ` Liam R. Howlett
0 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-08-18 16:13 UTC (permalink / raw)
To: Peng Zhang
Cc: willy, michael.christie, surenb, npiggin, corbet,
mathieu.desnoyers, avagin, linux-fsdevel, linux-kernel, linux-doc,
linux-mm, akpm, brauner, peterz, maple-tree
* Peng Zhang <zhangpeng.00@bytedance.com> [230818 07:54]:
>
>
> 在 2023/8/17 02:30, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230816 09:42]:
> > >
> > >
> > ...
> >
> > > > > > > +/**
> > > > > > > + * __mt_dup(): Duplicate a maple tree
> > > > > > > + * @mt: The source maple tree
> > > > > > > + * @new: The new maple tree
> > > > > > > + * @gfp: The GFP_FLAGS to use for allocations
> > > > > > > + *
> > > > > > > + * This function duplicates a maple tree using a faster method than traversing
> > > > > > > + * the source tree and inserting entries into the new tree one by one. The user
> > > > > > > + * needs to lock the source tree manually. Before calling this function, @new
> > > > > > > + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
> > > > > > > + * we may also need to manually set @new's external lock using
> > > > > > > + * mt_set_external_lock().
> > > > > > > + *
> > > > > > > + * Return: 0 on success, -ENOMEM if memory could not be allocated.
> > > > > > > + */
> > > > > > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > > > >
> > > > > > We use mas_ for things that won't handle the locking and pass in a maple
> > > > > > state. Considering the leaves need to be altered once this is returned,
> > > > > > I would expect passing in a maple state should be feasible?
> > > > > But we don't really need mas here. What do you think the state of mas
> > > > > should be when this function returns? Make it point to the first entry,
> > > > > or the last entry?
> > > >
> > > > I would write it to point to the first element so that the call to
> > > > replace the first element can just do that without an extra walk and
> > > > document the maple state end point.
> > > Unfortunately, this does not seem to be convenient. Users usually use
> > > mas_for_each() to replace elements. If we set mas to the first element,
> > > the first call to mas_find() in mas_for_each() will get the next
> > > element.
> >
> > This sounds like the need for another iterator specifically for
> > duplicating.
> >
> > >
> > > There may also be other scenarios where the user does not necessarily
> > > have to replace every element.
> >
> > Do you mean a limit or elements that need to be skipped? We could have
> > a limit on the iteration.
> >
> > >
> > > Finally, getting the first element in __mt_dup() requires an additional
> > > check of whether the first element has already been recorded. Such
> > > a check will be performed at each leaf node, which is unnecessary
> > > overhead.
> > >
> > > Of course, the first reason is the main reason, which prevents us from
> > > using mas_for_each(). So I don't want to record the first element.
> >
> >
> > I don't like the interface because it can easily be misunderstood and
> > used incorrectly. I don't know how to make a cleaner interface, but
> > I've gone through a few thoughts:
> >
> > The first was to hide _all of it_ in a new iterator:
> > mas_dup_each(old, new, old_entry) {
> > if (don't_dup(old_entry)) {
> > mas_erase(new);
> > continue;
> > }
> >
> > mas_dup_insert(new, new_entry);
> > }
> >
> > This iterator would check if mas_is_start(old) and dup the tree in that
> > event. Leave both maple states pointing to the first element and set
> > old_entry. I don't know how to handle the failure in duplicating the
> > tree in this case - I guess we could return old_entry = NULL and check
> > if mas_is_err(old) after the loop. Do you see a problem with this?
> This interface looks OK. But handling the failure case is tricky.
I think it's awkward, but not tricky; possibly error prone to users. I
don't like this solution because of the error handling. I'm just
stating some options I ruled out in hopes you see some way of improving
them.
Maybe we name the check something like mas_dup_complete(old, new) and
all it does is check for an error? It makes it obvious that it's
necessary but doesn't avoid people leaving it out.
> >
> >
> > The second idea was an init of the old tree. This is closest to what you
> > have:
> >
> > if (mas_dup_init(old, new))
> > goto -ENOMEM;
> >
> > mas_dup_each(old, new) {
> > if (don't_dup(old_entry)) {
> > mas_erase(new);
> > continue;
> > }
> >
> > mas_dup_insert(new, new_entry);
> > }
> I think this interface could be better.
> >
> > This would duplicate the tree at the start and leave both pointing at
> > the first element so that mas_dup_each() could start on that element.
> > Each subsequent call would go to the next element in both maple states.
> Every element of the new tree is the same as the old tree, and we don't
> need to maintain the mas of the old tree. It is enough to maintain the
> mas of the new tree when traversing.
Okay, and using DFS means we can't stop one level before the leaves
during duplication since deletes may not function.
>
> > It sounds like you don't want this for performance reasons? Although
> I mean I don't want to record the first element while duplicating. But
> we can get the first element after the duplicate completes. This can
> also still be within the implementation of the interface.
>
> > looking at mas_find() today, I think this could still work since we are
> > checking the maple state for a lot.
> Yes, mas_find() does a whole bunch of checks.
> >
> > Both ideas could be even faster than what you have if we handle the
> > special cases of mas_is_none()/mas_is_ptr() in a smarter way, because we
> > don't need to worry about the entry point of the maple state as much as
> > we do with mas_find()/mas_for_each(). I mean, is it possible to
> > get to a mas_is_none() or mas_is_ptr() on duplicating a tree? How do we
> > handle these users?
> The check for mas_is_none() or mas_is_ptr() in mas_find() is really not
> worth it if we hold the lock. There doesn't seem to be a good way around
> mas_is_ptr() since it needs to enter the loop once. mas_is_none() can be
> solved because it does not enter the loop; we can use it as a condition
> to enter the loop.
So do you think it is worth making a new iterator then? One that
requires the write lock so some work can be avoided?
>
> If we avoid mas_find() to skip its internal checks, I have to figure out
> how to handle mas_is_ptr() properly.
> >
> > Both ideas still suffer from someone saying "Gee, that {insert function
> > name here} is used in the forking code, so I can totally use that in my
> > code because that's how it works!" and finding out it works for the limited
> > testing they do. Then it fails later and the emails start flying.
> >
> >
> > I almost think we should do something like this on insert:
> >
> > void mas_dup_insert(old, new, new_entry) {
> > WARN_ON_ONCE(old == new);
> > WARN_ON_ONCE(old->index != new->index);
> > WARN_ON_ONCE(old->last != new->last);
> > ...
> > }
> Maintaining the old mas doesn't feel worth it. If we do this, we have to
> traverse the old tree one more time.
Maybe only when debug is enabled? We should at least keep the first
check.
I'm not committed to this interface either. Do you have an idea that
works better?
> >
> > This would at least _require_ someone to have two maple states and
> > hopefully think twice on using it where it should not be used.
> >
> > The bottom line is that this code is close to what we need to make
> > forking better, but I fear the misuse of the interface.
> >
> > Something else to think about:
> > In the work items for the Maple Tree, there is a plan to have an enum to
> > specify the type of write that is going to happen. The idea was for
> > mas_preallocate() to set this type of write so we can just go right to
> > the correct function. We could use that here and set the maple state
> > write type to a direct replacement.
> This can be the next step. We can do without it for now.
Agreed, but your current replacement is very close to what we have for
direct replacement after all the checks in the current write path.
I'm wondering if we have an enum in the maple state to just go to that
store operation, then these two interfaces could share a lot of the
code.
But we should leave it for later, I was just pointing it out.
Thanks,
Liam
* Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry
2023-08-18 9:39 ` Peng Zhang
@ 2023-08-18 16:15 ` Liam R. Howlett
0 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2023-08-18 16:15 UTC (permalink / raw)
To: Peng Zhang
Cc: avagin, npiggin, mathieu.desnoyers, peterz, michael.christie,
surenb, brauner, willy, akpm, corbet, linux-kernel, linux-fsdevel,
linux-mm, linux-doc
* Peng Zhang <zhangpeng.00@bytedance.com> [230818 05:40]:
>
>
> 在 2023/8/17 01:40, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230816 09:11]:
> > >
> > >
> > > 在 2023/8/1 00:48, Liam R. Howlett 写道:
> > > > * Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:39]:
> > > > >
> > > > >
> > > > > 在 2023/7/27 00:08, Liam R. Howlett 写道:
> > > > > > * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
> > > > > > > If mas has located a specific entry, it may be necessary to replace this
> > > > > > > entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
> > > > > > > will be more efficient than mas_store*() because it doesn't do many
> > > > > > > unnecessary checks.
> > > > > > >
> > > > > > > This function should be inline, but more functions need to be moved to
> > > > > > > the header file, so I didn't do it for the time being.
> > > > > >
> > > > > > I am really nervous about having no checks here. I get that this could be
> > > > > > used for duplicating the tree more efficiently, but having a function
> > > > > > that just swaps a value in is very dangerous - especially since it is
> > > > > > decoupled from the tree duplication code.
> > > > > I've thought about this, and I feel this is something the user
> > > > > should guarantee. If the user is not sure whether it applies,
> > > > > mas_store() can be used instead.
> > > >
> > > > Documentation often isn't up to date and even more rarely read.
> > > > mas_replace_entry() does not give a hint of a requirement for a specific
> > > > state to the mas. This is not acceptable.
> > > >
> > > > The description of the function also doesn't say anything about a
> > > > requirement of the maple state, just that it replaces an already
> > > > existing entry. You have to read the notes to find out that 'mas must
> > > > already locate an existing entry'.
> > > >
> > > > > And we should provide this interface
> > > > > because it has better performance.
> > > >
> > > > How much better is the performance? There's always a trade off but
> > > > without numbers, this is hard to justify.
> > > I have implemented a new version of this patchset, and I will post it
> > > soon.
> > >
> > > I tested the benefits of mas_replace_entry() in userspace.
> > > The test code is attached at the end.
> > >
> > > Run three times:
> > > mas_replace_entry(): 2.7613050s 2.7120030s 2.7274200s
> > > mas_store(): 3.8451260s 3.8113200s 3.9334160s
> >
> > This runtime is too short; we should increase the number of elements or
> > loops until it is over 10 seconds. This will make the setup time
> > and other variances less significant, and we can use the command run time
> > as a rough estimate of performance. IIRC 134 was picked as a rough
> > estimate of an average task size, so maybe increase the loops.
> I changed nr_entries to 1000, and the measured numbers are as follows:
> mas_replace_entry(): 20.0375820s
> mas_store(): 28.6175720s
> It can be seen that mas_store() is still about 43% slower.
To be clear, I didn't doubt your numbers or want you to rerun the
benchmark. I was just saying we should increase the loops now that the
tree is faster. That should let you see the benefits without needing
clock counts, although those will always be more accurate.
Thanks,
Liam
Thread overview: 42+ messages (newest: 2023-08-18 16:17 UTC); links below jump to the message on this page.
2023-07-26 8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
2023-07-26 8:09 ` [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}() Peng Zhang
2023-07-26 14:58 ` Liam R. Howlett
2023-07-31 9:52 ` Peng Zhang
2023-07-31 16:08 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 02/11] maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end() Peng Zhang
2023-07-26 8:09 ` [PATCH 03/11] maple_tree: Add some helper functions Peng Zhang
2023-07-26 15:02 ` Liam R. Howlett
2023-07-26 15:08 ` Matthew Wilcox
2023-07-31 11:45 ` Peng Zhang
2023-08-11 17:28 ` Liam R. Howlett
2023-07-31 11:40 ` Peng Zhang
2023-07-26 8:09 ` [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup() Peng Zhang
2023-07-26 16:03 ` Liam R. Howlett
2023-07-31 12:24 ` Peng Zhang
2023-07-31 16:27 ` Liam R. Howlett
2023-08-16 13:41 ` Peng Zhang
2023-08-16 18:30 ` Liam R. Howlett
2023-08-18 11:53 ` Peng Zhang
2023-08-18 16:13 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 05/11] maple_tree: Add test for mt_dup() Peng Zhang
2023-07-26 16:06 ` Liam R. Howlett
2023-07-31 12:32 ` Peng Zhang
2023-07-31 16:41 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry Peng Zhang
2023-07-26 16:08 ` Liam R. Howlett
2023-07-31 12:39 ` Peng Zhang
2023-07-31 16:48 ` Liam R. Howlett
2023-08-16 13:11 ` Peng Zhang
2023-08-16 17:40 ` Liam R. Howlett
2023-08-18 9:39 ` Peng Zhang
2023-08-18 16:15 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 07/11] maple_tree: Update the documentation of maple tree Peng Zhang
2023-07-26 8:09 ` [PATCH 08/11] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
2023-07-26 8:09 ` [PATCH 09/11] maple_tree: Update check_forking() and bench_forking() Peng Zhang
2023-07-26 8:09 ` [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree Peng Zhang
2023-07-26 16:39 ` Liam R. Howlett
2023-07-31 12:55 ` Peng Zhang
2023-07-31 20:55 ` Liam R. Howlett
2023-07-26 8:09 ` [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
2023-07-26 17:06 ` Liam R. Howlett
2023-07-31 12:59 ` Peng Zhang