* Maple Tree Work
@ 2023-07-07 16:38 Liam R. Howlett
2023-07-12 11:49 ` Peng Zhang
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Liam R. Howlett @ 2023-07-07 16:38 UTC (permalink / raw)
To: maple-tree, linux-kernel; +Cc: Peng Zhang, Danilo Krummrich
Hello,
I have received enquiries about the status of features for the maple
tree as well as requests for work. I've decided to track this on the
maple tree mailing list [1] for better coordination and open discussion.
Please keep all discussions on the maple tree mailing list.
I'll send out an updated list monthly, with names against the items
people are working on, to avoid duplicated effort - so please don't
start on something that already has a name next to it.
If you want to work on something then please respond to this email so
everyone knows and I can add your name to the item. Feel free to ask
questions to clarify the feature or discuss directions.
Likewise, if you can no longer work on an item, let me know so that
others can pick it up.
If you have any ideas, then please email the list. We can discuss it
and I can add it to the list.
Please sign up to the maple tree mailing list [1] to get these emails.
Features:
- Better preallocation calculations - Liam R. Howlett 2023/07/07
Without tracking the tree status on the walk down, we can do a
partial walk down to figure out the type of write and which
'worst case' the write will cause. Using this "write type"
information, the preallocations can be a better estimate. v2 was
sent to the mailing list [2].
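As a rough illustration of how a "write type" bounds the allocation count, consider the sketch below. The enum, helper, and per-level constants are invented for the example - the real maple tree accounting differs - but the shape of the idea is the same: classify the write on a partial walk, then reserve only what that class can possibly need.

```c
#include <assert.h>

/*
 * Illustrative sketch only: classify the write found on a partial walk
 * down, then bound the number of new nodes by the worst case that kind
 * of write can cause.  Names and constants are made up.
 */
enum wr_type { WR_APPEND, WR_SLOT_REPLACE, WR_SPLIT, WR_REBALANCE };

static int wr_worst_case_nodes(enum wr_type type, int height)
{
	switch (type) {
	case WR_APPEND:		/* fits into free slots of one node */
	case WR_SLOT_REPLACE:	/* overwrites a single slot */
		return 1;	/* one replacement node at most */
	case WR_SPLIT:		/* a split may propagate to the root */
		return 1 + 2 * height;
	case WR_REBALANCE:	/* may also touch a sibling per level */
	default:
		return 1 + 3 * height;
	}
}
```

Knowing a write is an append, for instance, would let the preallocation reserve a single node instead of a full worst-case chain.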
- mas_store_gfp() align with mas_prealloc()
Right now mas_store_gfp() allocates once it figures out what to
do, and it does the worst-case allocations. This is inefficient
and can easily be improved by doing more of what
mas_prealloc()/mas_store_prealloc() does - or at least will do
once the 'Better preallocation calculations' work lands.
- Tracking tree status on walks down
Track in the maple state where (at which height in the tree) the
last node with the minimum number of entries, space for 2 slots,
or space for 3 slots occurred. This will allow a better estimate
of how many nodes are needed for a write. This can be complicated
by next/prev walking, etc. It has to be done in read mode since a
walk may have been done before deciding to write - see
mmap_region(), for example.
- Store type (enum for store type?)
Extending "Tracking tree status on walks down", the
information can then be used to determine which operation is
needed during a mas_prealloc(), etc. The operation can then be
stored in the maple state so it can be continued in
mas_store_prealloc(). Obviously, this would work out well if we
have mas_store_gfp() using the same functionality, as mentioned
above.
- Full count/flag & Dense nodes
A bit is reserved in the node pointer to indicate that there
are no NULLs in the leaves below. This feature is mainly for
non-alloc trees. We can then know there is at least one
singleton that can be stored below this entry.
This is coupled with restoring Dense nodes as a potential node
type. The tricky part is deciding when to switch to/from dense
nodes (all entries in dense nodes must be singletons!). See
mte_set_full(), mte_clear_full(), and mte_has_null(), which use
MAPLE_ENODE_NULL.
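The reserved-bit idea can be pictured with a toy encoded-pointer helper. The bit value and helper names below are illustrative only - see the real mte_set_full()/mte_clear_full()/mte_has_null() for the actual encoding - but they show how node alignment frees a low bit to carry "no NULLs below" information.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: an encoded node pointer ("enode") keeps a low
 * bit free because nodes are aligned, and that bit records "no NULLs
 * anywhere in the subtree below this entry".
 */
#define ENODE_FULL_BIT 0x04UL	/* assumed free due to node alignment */

static unsigned long enode_set_full(unsigned long enode)
{
	return enode | ENODE_FULL_BIT;
}

static unsigned long enode_clear_full(unsigned long enode)
{
	return enode & ~ENODE_FULL_BIT;
}

static bool enode_has_null(unsigned long enode)
{
	/* Bit clear means a NULL may exist below this entry. */
	return !(enode & ENODE_FULL_BIT);
}
```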
- Fork & Dup tree + Delete DONT_COPY
This is to optimize dup_mmap() in kernel/fork.c, but other
users may want faster duplications of a tree.
This should be faster than building a tree in bulk mode. The
idea is to just copy each node and replace the pointers in,
probably, a BFS order. Once the leaves are reached, the VMAs
will be replaced by the copies made in fork, unless DONT_COPY is
set, in which case the VMA will be deleted from the copied tree.
DONT_COPY is not common and since the tree isn't visible, all
nodes should be available for reuse (no RCU worries).
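The copy-and-replace idea could look roughly like this on a toy fixed-fanout node. All names here are invented, recursion stands in for the BFS order, and the pivots/gaps/metadata of real maple nodes are exactly what memcpy() would carry along for free.

```c
#include <stdlib.h>
#include <string.h>

#define FANOUT 4

struct toy_node {
	int leaf;
	void *slot[FANOUT];	/* children, or entries in a leaf */
};

/* Copy a node wholesale, then swap each child pointer for a copy. */
static struct toy_node *dup_subtree(const struct toy_node *src)
{
	struct toy_node *dst = malloc(sizeof(*dst));
	int i;

	memcpy(dst, src, sizeof(*dst));	/* pivots/gaps would come along */
	if (!src->leaf)
		for (i = 0; i < FANOUT; i++)
			if (src->slot[i])
				dst->slot[i] = dup_subtree(src->slot[i]);
	return dst;
}

/* Duplicate a two-level toy tree and check the copy shares no nodes. */
static int dup_demo(void)
{
	struct toy_node leaf = { .leaf = 1 };
	struct toy_node root = { .leaf = 0 };
	struct toy_node *copy;

	leaf.slot[0] = (void *)0x1;	/* stand-ins for VMA pointers */
	leaf.slot[1] = (void *)0x2;
	root.slot[0] = &leaf;
	copy = dup_subtree(&root);

	return copy != &root && copy->slot[0] != (void *)&leaf &&
	       ((struct toy_node *)copy->slot[0])->slot[1] == (void *)0x2;
}
```

In the dup_mmap() case, a second pass over the copied leaves would then swap each entry for the forked VMA, or delete it if DONT_COPY is set.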
- Push reuse parent
During an insert, new nodes can be "pushed" left or right -
see mas_push_data(). On a left push, it may be possible to
reuse the node by making the write like an 'append'. This may
also be possible during a split operation when the
write extends the last slot. This needs examining to ensure RCU
safety.
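A sketch of the append test: the field names below are invented (the real check lives around mas_push_data() and the node metadata), but the condition is the same - the node can be reused in place when the new range starts exactly where the last pivot ends and a slot is still free.

```c
#include <stdbool.h>

/* Illustrative only: can this write be treated as an append? */
#define APPEND_SLOTS 4

struct append_node {
	unsigned long pivot[APPEND_SLOTS];
	int end;			/* index of the last used slot */
};

static bool can_append(const struct append_node *n, unsigned long first)
{
	return n->end + 1 < APPEND_SLOTS && first == n->pivot[n->end] + 1;
}

static bool append_demo(void)
{
	struct append_node n = { .pivot = { 10, 20 }, .end = 1 };

	/* 21 directly extends the last pivot; 25 does not. */
	return can_append(&n, 21) && !can_append(&n, 25);
}
```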
Larger Features:
- 64 bit stores on 32 bit arch
A number of users want to use the maple tree but cannot, because
they need a storage medium that supports 64 bit stores on 32 bit
architectures.
- wr_mas with alloc instead of mas
Internally, we have a write maple state, but the maple state
currently holds the allocations. I'm not sure if this is worth
it, and it will probably need "Tracking tree status on walks
down" so that preallocations/allocations can be accurate.
- Big Dense Node Type
There are various node types that could be added to the maple
tree. One of the most useful is probably a 'big dense' with the
node being a 4k block of singletons. This would have to come
after the dense node type as dense enables more users and helps
scope this work.
- Judy Array
The Judy Array is another fancy data structure. Mathieu
Desnoyers has enquired and spoken to us about incorporating judy
arrays as a node type, or about other features from the judy
array. This is in the early thought stages.
Please note that any new features will need test cases added to ensure
they function correctly.
[1] https://lists.infradead.org/mailman/listinfo/maple-tree
[2] http://lists.infradead.org/pipermail/maple-tree/2023-June/002599.html
^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Maple Tree Work
  2023-07-07 16:38 Maple Tree Work Liam R. Howlett
@ 2023-07-12 11:49 ` Peng Zhang
  2023-07-12 15:27   ` Liam R. Howlett
  2023-07-13  9:58   ` Peng Zhang
  2025-07-29 12:03 ` Danilo Krummrich
  2 siblings, 1 reply; 12+ messages in thread
From: Peng Zhang @ 2023-07-12 11:49 UTC (permalink / raw)
To: Liam R. Howlett; +Cc: Peng Zhang, Danilo Krummrich, maple-tree, linux-kernel

On 2023/7/8 00:38, Liam R. Howlett wrote:
> - Fork & Dup tree + Delete DONT_COPY
> This is to optimize dup_mmap() in kernel/fork.c, but other
> users may want faster duplications of a tree.
> This should be faster than building a tree in bulk mode. The
> idea is to just copy each node and replace the pointers in,
> probably, a BFS order. Once the leaves are reached, the VMAs
> will be replaced by the copies made in fork, unless DONT_COPY is
> set, in which case the VMA will be deleted from the copied tree.
> DONT_COPY is not common and since the tree isn't visible, all
> nodes should be available for reuse (no RCU worries).
If DONT_COPY is set, this method will be complicated, because the gaps
adjacent to it need to be merged, and the gaps of all ancestor nodes
need to be updated.

I have another idea to build a tree, if inserted in order, we only
insert at the leaf node. All leaf nodes are connected using a linked
list. In the end we get a linked list with only leaf nodes. Then we
construct non-leaf nodes layer by layer from bottom to top. I think
this is also faster than bulk mode. Another advantage of this method
is that we are applicable to more scenarios, do not need the original
tree, only need to know the ranges inserted in order. I don't know
how fast this method is, so we can discuss it.

^ permalink raw reply	[flat|nested] 12+ messages in thread
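For concreteness, the layer-by-layer construction described above can be sketched on a toy node (all names invented; the pivot and gap bookkeeping a real implementation needs is elided): each pass groups up to a fanout's worth of children under new parents until a single root remains.

```c
#include <assert.h>
#include <stdlib.h>

/* A toy node that only tracks a sibling link and its children count. */
struct level_node {
	struct level_node *next;
	int nr;			/* children placed under this node */
};

/* Group up to 'fanout' children per parent; return the new level. */
static struct level_node *build_level(struct level_node *child, int fanout)
{
	struct level_node *head = NULL, **tail = &head;

	while (child) {
		struct level_node *parent = calloc(1, sizeof(*parent));
		int i;

		for (i = 0; i < fanout && child; i++, child = child->next)
			parent->nr++;	/* real code: fill slot[i], pivot[i] */
		*tail = parent;
		tail = &parent->next;
	}
	return head;
}

/* How many bottom-up passes until a single root remains? */
static int levels_for(int nleaves, int fanout)
{
	struct level_node *level = NULL, *l;
	int i, passes = 0;

	for (i = 0; i < nleaves; i++) {
		l = calloc(1, sizeof(*l));
		l->next = level;
		level = l;
	}
	while (level->next) {
		level = build_level(level, fanout);
		passes++;
	}
	return passes;
}
```

Note the last parent of each level can end up underfull (nr as low as 1), which is exactly the trailing-node rebalancing concern discussed in the replies below.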
* Re: Maple Tree Work
  2023-07-12 11:49 ` Peng Zhang
@ 2023-07-12 15:27   ` Liam R. Howlett
  2023-07-13  8:05     ` Peng Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Liam R. Howlett @ 2023-07-12 15:27 UTC (permalink / raw)
To: Peng Zhang; +Cc: maple-tree, linux-kernel, linux-mm

Dropping Danilo from the Cc..

I was asked to add linux-mm to the list, so I did that as well.

If anyone else is interested in seeing the full list, it's on lkml [1] and
the maple-tree list [2].

Thank you Peng for looking at the list and taking time to think about
the items.

* Peng Zhang <zhangpeng.00@bytedance.com> [230712 07:49]:
>
>
> On 2023/7/8 00:38, Liam R. Howlett wrote:
> > - Fork & Dup tree + Delete DONT_COPY
> > This is to optimize dup_mmap() in kernel/fork.c, but other
> > users may want faster duplications of a tree.
> > This should be faster than building a tree in bulk mode. The
> > idea is to just copy each node and replace the pointers in,
> > probably, a BFS order. Once the leaves are reached, the VMAs
> > will be replaced by the copies made in fork, unless DONT_COPY is
> > set, in which case the VMA will be deleted from the copied tree.
> > DONT_COPY is not common and since the tree isn't visible, all
> > nodes should be available for reuse (no RCU worries).
> If DONT_COPY is set, this method will be complicated, because the gaps
> adjacent to it need to be merged, and the gaps of all ancestor nodes need to
> be updated.

My understanding is that this is a rare event; there aren't many VMAs
marked this way.  The store operation already does all the necessary
work for deleting an entry.  The gap tracking is also updated, and that
would only happen if the new gap is larger.  Are you concerned about the
performance/runtime of handling the DONT_COPY in this way?

>
> I have another idea to build a tree, if inserted in order, we only
> insert at the leaf node. All leaf nodes are connected using a linked
> list. In the end we get a linked list with only leaf nodes. Then we
> construct non-leaf nodes layer by layer from bottom to top. I think
> this is also faster than bulk mode. Another advantage of this method
> is that we are applicable to more scenarios, do not need the original
> tree, only need to know the ranges inserted in order. I don't know
> how fast this method is, so we can discuss it.

What is the advantage of a linked list over just building the tree as
you go?  Considering the non-leaf nodes are just a list of nodes
already, and you will have to do the same work of setting pivots,
allocating nodes, and filling them after you have the linked list.

What work do you avoid that would make a linked list faster than bulk
insert or a tree copy/replace VMAs?  I was thinking that we could avoid
a lot of the work involved with splitting/pushing and the parent
construction by using memcpy of each node, replace each slot (and
parent) with a new memcpy of the mirrored tree, then have a minimum
amount of modifications to delete the DONT_COPY during the VMA
replacement.  BFS copy would ensure we wouldn't modify the source tree
during VMA replacement and deletion (DONT_COPY).  So the rebalancing (in
bulk insert), pivot calculations, metadata, and gaps are (mostly) saved
by using memcpy.

From what I understand from your linked list idea, we would need to
construct the child node by examining each entry and know that a certain
entry is a DONT_COPY (ie: VMA logic would be needed in the maple tree
code or a callback?).  We really can't have VMA logic in the maple tree
code, so we could do some sort of loop through the VMAs to add the entry
to the list if desired.

Once we have a linked list, we still need to figure out each pivot for
the parent (I guess we won't use the last slot so there is a pivot to
check?), and each max gap in each child to construct the upper layers
of the tree.  Is this correct?

I guess we would still need to adjust the last node to ensure sufficient
entries as well, so as we add items we may need to rebalance the last
leaf with the second-last leaf.

The bulk store currently adjusts the split to favour splitting
left, could you get the same result by strictly filling the nodes?  This
would have to have special handling to rebalance the last one - which we
have a pretty good idea of when it's going to happen as we have a count
(and the DONT_COPY is rare).

Are you thinking you could compact the tree at the same time to gain
efficiency?

What would you consider a sufficient packed tree?  It's best to keep
some space around for expansion/contraction.  This works out since, I
think, we would need to keep that last slot free so we have a pivot to
check in your linked list plan.  Initial development with strict
split/join rules resulted in a 'jittering' of nodes as the number of
entries in a node shifted just above/below the threshold so we relaxed
the merging of nodes to avoid such a scenario.

Let me know if you would like me to put your name beside the Fork & Dup
Tree item in the list of work.

[1] https://lore.kernel.org/lkml/20230707163815.ns4kdz7iut5octjv@revolver/
[2] https://lists.infradead.org/mailman/listinfo/maple-tree

Thanks,
Liam

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Maple Tree Work
  2023-07-12 15:27 ` Liam R. Howlett
@ 2023-07-13  8:05   ` Peng Zhang
  2023-07-13 15:36     ` Liam R. Howlett
  0 siblings, 1 reply; 12+ messages in thread
From: Peng Zhang @ 2023-07-13 8:05 UTC (permalink / raw)
To: Liam R. Howlett; +Cc: Peng Zhang, maple-tree, linux-kernel, linux-mm

On 2023/7/12 23:27, Liam R. Howlett wrote:
> Dropping Danilo from the Cc..
>
> I was asked to add linux-mm to the list, so I did that as well.
>
> If anyone else is interested in seeing the full list, it's on lkml [1] and
> the maple-tree list [2].
>
> Thank you Peng for looking at the list and taking time to think about
> the items.
>
> * Peng Zhang <zhangpeng.00@bytedance.com> [230712 07:49]:
>>
>>
>> On 2023/7/8 00:38, Liam R. Howlett wrote:
>>> - Fork & Dup tree + Delete DONT_COPY
>>> This is to optimize dup_mmap() in kernel/fork.c, but other
>>> users may want faster duplications of a tree.
>>> This should be faster than building a tree in bulk mode. The
>>> idea is to just copy each node and replace the pointers in,
>>> probably, a BFS order. Once the leaves are reached, the VMAs
>>> will be replaced by the copies made in fork, unless DONT_COPY is
>>> set, in which case the VMA will be deleted from the copied tree.
>>> DONT_COPY is not common and since the tree isn't visible, all
>>> nodes should be available for reuse (no RCU worries).
>> If DONT_COPY is set, this method will be complicated, because the gaps
>> adjacent to it need to be merged, and the gaps of all ancestor nodes need to
>> be updated.
>
> My understanding is that this is a rare event; there aren't many VMAs
> marked this way.  The store operation already does all the necessary
> work for deleting an entry.  The gap tracking is also updated, and that
> would only happen if the new gap is larger.  Are you concerned about the
> performance/runtime of handling the DONT_COPY in this way?
Yes, if no DONT_COPY is set, copying all nodes and replacing the
pointers must be the fastest way. I was just thinking if there is
a faster way if DONT_COPY exists. Using the store operation to
delete unnecessary entries will cause additional overhead sometimes.
To give an example:

Suppose a leaf node layout of the maple tree is as follows:
[VMA1][VMA2][NULL][VMA3]

If VMA2 has DONT_COPY set, we need to change the node layout as
follows to delete it:
[VMA1'][NULL][VMA3']

At least we need to copy this node to replace it to make this
delete operation, even need to rebalance. However, this is a
rare case. In most cases, there is no gap between VMAs, so it
will not cause changes in the node layout.
>
>>
>> I have another idea to build a tree, if inserted in order, we only
>> insert at the leaf node. All leaf nodes are connected using a linked
>> list. In the end we get a linked list with only leaf nodes. Then we
>> construct non-leaf nodes layer by layer from bottom to top. I think
>> this is also faster than bulk mode. Another advantage of this method
>> is that we are applicable to more scenarios, do not need the original
>> tree, only need to know the ranges inserted in order. I don't know
>> how fast this method is, so we can discuss it.
>
> What is the advantage of a linked list over just building the tree as
> you go?  Considering the non-leaf nodes are just a list of nodes
> already, and you will have to do the same work of setting pivots,
> allocating nodes, and filling them after you have the linked list.
>
> What work do you avoid that would make a linked list faster than bulk
> insert or a tree copy/replace VMAs?  I was thinking that we could avoid
> a lot of the work involved with splitting/pushing and the parent
> construction by using memcpy of each node, replace each slot (and
> parent) with a new memcpy of the mirrored tree, then have a minimum
> amount of modifications to delete the DONT_COPY during the VMA
> replacement.  BFS copy would ensure we wouldn't modify the source tree
> during VMA replacement and deletion (DONT_COPY).  So the rebalancing (in
> bulk insert), pivot calculations, metadata, and gaps are (mostly) saved
> by using memcpy.
Your analysis is correct.
>
> From what I understand from your linked list idea, we would need to
> construct the child node by examining each entry and know that a certain
> entry is a DONT_COPY (ie: VMA logic would be needed in the maple tree
> code or a callback?).  We really can't have VMA logic in the maple tree
> code, so we could do some sort of loop through the VMAs to add the entry
> to the list if desired.
>
> Once we have a linked list, we still need to figure out each pivot for
> the parent (I guess we won't use the last slot so there is a pivot to
> check?), and each max gap in each child to construct the upper layers
> of the tree.  Is this correct?
>
> I guess we would still need to adjust the last node to ensure sufficient
> entries as well, so as we add items we may need to rebalance the last
> leaf with the second-last leaf.
Yes, the last two leaves need to check to make sure they have enough
items.
>
> The bulk store currently adjusts the split to favour splitting
> left, could you get the same result by strictly filling the nodes?  This
> would have to have special handling to rebalance the last one - which we
> have a pretty good idea of when it's going to happen as we have a count
> (and the DONT_COPY is rare).
>
> Are you thinking you could compact the tree at the same time to gain
> efficiency?
>
> What would you consider a sufficient packed tree?  It's best to keep
> some space around for expansion/contraction.  This works out since, I
> think, we would need to keep that last slot free so we have a pivot to
> check in your linked list plan.  Initial development with strict
> split/join rules resulted in a 'jittering' of nodes as the number of
> entries in a node shifted just above/below the threshold so we relaxed
> the merging of nodes to avoid such a scenario.
We can control the number of entries of nodes, for example, let this
number be (min + max)/2, so as to avoid making a node too 'dense' or
too 'loose'.
>
> Let me know if you would like me to put your name beside the Fork & Dup
> Tree item in the list of work.
You can put my name on this one and I'll do it.

I use the method of copying all nodes, so I will implement an interface
to get a mirrored tree.

But I can't think of a good way to replace old VMAs, it can't be done
during the copy tree process, because maybe some VMAs have DONT_COPY
flag. It seems that we can only scan all VMAs in the source tree again
to update the new tree. We have already traversed the source tree once
in the process of copying the tree. Is there any way to avoid traversing
it again?
>
> [1] https://lore.kernel.org/lkml/20230707163815.ns4kdz7iut5octjv@revolver/
> [2] https://lists.infradead.org/mailman/listinfo/maple-tree
>
> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Maple Tree Work
  2023-07-13  8:05 ` Peng Zhang
@ 2023-07-13 15:36   ` Liam R. Howlett
  2023-07-14 12:32     ` Peng Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Liam R. Howlett @ 2023-07-13 15:36 UTC (permalink / raw)
To: Peng Zhang; +Cc: maple-tree, linux-kernel, linux-mm

* Peng Zhang <zhangpeng.00@bytedance.com> [230713 04:05]:
>
>
> On 2023/7/12 23:27, Liam R. Howlett wrote:
> > Dropping Danilo from the Cc..
> >
> > I was asked to add linux-mm to the list, so I did that as well.
> >
> > If anyone else is interested in seeing the full list, it's on lkml [1] and
> > the maple-tree list [2].
> >
> > Thank you Peng for looking at the list and taking time to think about
> > the items.
> >
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230712 07:49]:
> > >
> > >
> > > On 2023/7/8 00:38, Liam R. Howlett wrote:
> > > > - Fork & Dup tree + Delete DONT_COPY
> > > > This is to optimize dup_mmap() in kernel/fork.c, but other
> > > > users may want faster duplications of a tree.
> > > > This should be faster than building a tree in bulk mode. The
> > > > idea is to just copy each node and replace the pointers in,
> > > > probably, a BFS order. Once the leaves are reached, the VMAs
> > > > will be replaced by the copies made in fork, unless DONT_COPY is
> > > > set, in which case the VMA will be deleted from the copied tree.
> > > > DONT_COPY is not common and since the tree isn't visible, all
> > > > nodes should be available for reuse (no RCU worries).
> > > If DONT_COPY is set, this method will be complicated, because the gaps
> > > adjacent to it need to be merged, and the gaps of all ancestor nodes need to
> > > be updated.
> >
> > My understanding is that this is a rare event; there aren't many VMAs
> > marked this way.  The store operation already does all the necessary
> > work for deleting an entry.  The gap tracking is also updated, and that
> > would only happen if the new gap is larger.  Are you concerned about the
> > performance/runtime of handling the DONT_COPY in this way?
> Yes, if no DONT_COPY is set, copying all nodes and replacing the
> pointers must be the fastest way. I was just thinking if there is
> a faster way if DONT_COPY exists. Using the store operation to
> delete unnecessary entries will cause additional overhead sometimes.
> To give an example:
>
> Suppose a leaf node layout of the maple tree is as follows:
> [VMA1][VMA2][NULL][VMA3]
>
> If VMA2 has DONT_COPY set, we need to change the node layout as
> follows to delete it:
> [VMA1'][NULL][VMA3']
>
> At least we need to copy this node to replace it to make this
> delete operation, even need to rebalance. However, this is a
> rare case. In most cases, there is no gap between VMAs, so it
> will not cause changes in the node layout.

Remember that at this point, there is no readers so we could edit the
node without a copy.  It would require new code, but it's just moving
data left..  The bigger worry is a rebalance, as you said, and that can
get complicated.  We know we have (more than) enough room to store the
data, but editing in place isn't done in this version of the code.  We
do allow for node re-use by pushing back onto the mas->alloc, so the
data requirements won't be too high.  Without any readers, the parent
pivots could be changed directly between two leaves.

> >
> > >
> > > I have another idea to build a tree, if inserted in order, we only
> > > insert at the leaf node. All leaf nodes are connected using a linked
> > > list. In the end we get a linked list with only leaf nodes. Then we
> > > construct non-leaf nodes layer by layer from bottom to top. I think
> > > this is also faster than bulk mode. Another advantage of this method
> > > is that we are applicable to more scenarios, do not need the original
> > > tree, only need to know the ranges inserted in order. I don't know
> > > how fast this method is, so we can discuss it.
> >
> > What is the advantage of a linked list over just building the tree as
> > you go?  Considering the non-leaf nodes are just a list of nodes
> > already, and you will have to do the same work of setting pivots,
> > allocating nodes, and filling them after you have the linked list.
> >
> > What work do you avoid that would make a linked list faster than bulk
> > insert or a tree copy/replace VMAs?  I was thinking that we could avoid
> > a lot of the work involved with splitting/pushing and the parent
> > construction by using memcpy of each node, replace each slot (and
> > parent) with a new memcpy of the mirrored tree, then have a minimum
> > amount of modifications to delete the DONT_COPY during the VMA
> > replacement.  BFS copy would ensure we wouldn't modify the source tree
> > during VMA replacement and deletion (DONT_COPY).  So the rebalancing (in
> > bulk insert), pivot calculations, metadata, and gaps are (mostly) saved
> > by using memcpy.
> Your analysis is correct.
> >
> > From what I understand from your linked list idea, we would need to
> > construct the child node by examining each entry and know that a certain
> > entry is a DONT_COPY (ie: VMA logic would be needed in the maple tree
> > code or a callback?).  We really can't have VMA logic in the maple tree
> > code, so we could do some sort of loop through the VMAs to add the entry
> > to the list if desired.
> >
> > Once we have a linked list, we still need to figure out each pivot for
> > the parent (I guess we won't use the last slot so there is a pivot to
> > check?), and each max gap in each child to construct the upper layers
> > of the tree.  Is this correct?
> >
> > I guess we would still need to adjust the last node to ensure sufficient
> > entries as well, so as we add items we may need to rebalance the last
> > leaf with the second-last leaf.
> Yes, the last two leaves need to check to make sure they have enough
> items.
> >
> > The bulk store currently adjusts the split to favour splitting
> > left, could you get the same result by strictly filling the nodes?  This
> > would have to have special handling to rebalance the last one - which we
> > have a pretty good idea of when it's going to happen as we have a count
> > (and the DONT_COPY is rare).
> >
> > Are you thinking you could compact the tree at the same time to gain
> > efficiency?
> >
> > What would you consider a sufficient packed tree?  It's best to keep
> > some space around for expansion/contraction.  This works out since, I
> > think, we would need to keep that last slot free so we have a pivot to
> > check in your linked list plan.  Initial development with strict
> > split/join rules resulted in a 'jittering' of nodes as the number of
> > entries in a node shifted just above/below the threshold so we relaxed
> > the merging of nodes to avoid such a scenario.
> We can control the number of entries of nodes, for example, let this
> number be (min + max)/2, so as to avoid making a node too 'dense' or
> too 'loose'.

By the way, in the VMA case, we also know the number of VMAs in the
tree.  Unfortunately, we don't know how many are DONT_COPY VMAs.  I
wonder if it would be worth while to balance each VMA with its neighbour
during this operation, at least within one tree level?  It would reduce
the possibility of a larger rebalance on a DONT_COPY.  It's probably not
worth it because it would slow down our fast path.

> >
> > Let me know if you would like me to put your name beside the Fork & Dup
> > Tree item in the list of work.
> You can put my name on this one and I'll do it.

Sounds good, thanks!

> I use the method of copying all nodes, so I will implement an interface
> to get a mirrored tree.
>
> But I can't think of a good way to replace old VMAs, it can't be done
> during the copy tree process, because maybe some VMAs have DONT_COPY
> flag.

I think it could be done during the copy, instead of a 'replace' it
would be a 'delete'.  I think this is why we need to use a BFS-like
duplication.  So once we reach the leaves, then we can modify the tree
knowing that the above state has already been copied and so it's going
to be altering a copy of the data and so we are at a point where it can
be mutated.  You could detect that the lock-step next/insert is out of
sync and delete the range between the old_mas and the new_mas.

> It seems that we can only scan all VMAs in the source tree again
> to update the new tree. We have already traversed the source tree once
> in the process of copying the tree. Is there any way to avoid traversing
> it again?

Well, we haven't visited the VMAs in the copy, but we have visited the
leaf nodes with all the pointers to the VMAs.  I get what you are saying
though, we will have to duplicate the leaves then re-visit the leaves to
replace the VMAs.  I am not sure we can avoid this since a rebalance may
occur, and it would be very tricky to rebalance with old and new data in
the tree - it's probably best to just revisit, at least to begin with.

Depending on how you implement it, you could make the copying of the
tree end on the first leaf node by using the height of the tree to
figure out which way to sweep (left to right or right to left) on the
first level.  Not strictly BFS, but you could end the maple state in the
correct location to start replacing the VMAs.  Would that work?  We also
have a reverse iterator, so we could just run through the tree from the
right-most node to the start.

I was thinking that we would make the first 'duplication store' detect
an empty tree and duplicate the tree, ending on the left-most leaf and
then replace the first entry (and possibly delete up to the first
store).  Each subsequent store would do the same.  We would need a
'duplication complete' that would remove any entries beyond the last
store and rebalance, if necessary.

Feel free to use some or none of the ideas.  I hope some of this helps
with what you are trying to work out.  Let me know if you have any more
questions.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Maple Tree Work 2023-07-13 15:36 ` Liam R. Howlett @ 2023-07-14 12:32 ` Peng Zhang 0 siblings, 0 replies; 12+ messages in thread From: Peng Zhang @ 2023-07-14 12:32 UTC (permalink / raw) To: Liam R. Howlett; +Cc: Peng Zhang, maple-tree, linux-kernel, linux-mm 在 2023/7/13 23:36, Liam R. Howlett 写道: > * Peng Zhang <zhangpeng.00@bytedance.com> [230713 04:05]: >> >> >> 在 2023/7/12 23:27, Liam R. Howlett 写道: >>> Dropping Danilo from the Cc.. >>> >>> I was asked to add linux-mm to the list, so I did that as well. >>> >>> If anyone else is interested in seeing the full list, it's on lkml [1] and >>> the maple-tree list [2]. >>> >>> Thank you Peng for looking at the list and taking time to think about >>> the items. >>> >>> * Peng Zhang <zhangpeng.00@bytedance.com> [230712 07:49]: >>>> >>>> >>>> 在 2023/7/8 00:38, Liam R. Howlett 写道: >>>>> - Fork & Dup tree + Delete DONT_COPY >>>>> This is to optimize dup_mmap() in kernel/fork.c, but other >>>>> users may want faster duplications of a tree. >>>>> This should be faster than building a tree in bulk mode. The >>>>> idea is to just copy each node and replace the pointers in, >>>>> probably, a BFS order. Once the leaves are reached, the VMAs >>>>> will be replaced by the copies made in fork, unless DONT_COPY is >>>>> set, in which case the VMA will be deleted from the copied tree. >>>>> DONT_COPY is not common and since the tree isn't visible, all >>>>> nodes should be available for reuse (no RCU worries). >>>> If DONT_COPY is set, this method will be complicated, because the gaps >>>> adjacent to it need to be merged, and the gaps of all ancestor nodes need to >>>> be updated. >>> >>> My understanding is that this is a rare event; there aren't many VMAs >>> marked this way. The store operation already does all the necessary >>> work for deleting an entry. The gap tracking is also updated, and that >>> would only happen if the new gap is larger. 
Are you concerned about the >>> performance/runtime of handling the DONT_COPY in this way? >> Yes, if no DONT_COPY is set, copying all nodes and replacing the >> pointers must be the fastest way. I was just thinking if there is >> a faster way if DONT_COPY exists. Using the store operation to >> delete unnecessary entries will cause additional overhead sometimes. >> To give an example: >> >> Suppose a leaf node layout of the maple tree is as follows: >> [VMA1][VMA2][NULL][VMA3] >> >> If VMA2 has DONT_COPY set, we need to change the node layout as >> follows to delete it: >> [VMA1'][NULL][VMA3'] >> >> At least we need to copy this node to replace it to make this >> delete operation, even need to rebalance. However, this is a >> rare case. In most cases, there is no gap between VMAs, so it >> will not cause changes in the node layout. > > Remember that at this point, there is no readers so we could edit the > node without a copy. It would require new code, but it's just moving > data left.. The bigger worry is a rebalance, as you said, and that can > get complicated. We know we have (more than) enough room to store the > data, but editing in place isn't done in this version of the code. We > do allow for node re-use by pushing back onto the mas->alloc, so the > data requirements won't be too high. Without any readers, the parent > pivots could be changed directly between two leaves. > >>> >>>> >>>> I have another idea to build a tree, if inserted in order, we only >>>> insert at the leaf node. All leaf nodes are connected using a linked >>>> list. In the end we get a linked list with only leaf nodes. Then we >>>> construct non-leaf nodes layer by layer from bottom to top. I think >>>> this is also faster than bulk mode. Another advantage of this method >>>> is that we are applicable to more scenarios, do not need the original >>>> tree, only need to know the ranges inserted in order. I don't know >>>> how fast this method is, so we can discuss it. 
>>>
>>> What is the advantage of a linked list over just building the tree
>>> as you go?  Considering the non-leaf nodes are just a list of nodes
>>> already, you will have to do the same work of setting pivots,
>>> allocating nodes, and filling them after you have the linked list.
>>>
>>> What work do you avoid that would make a linked list faster than
>>> bulk insert or a tree copy/replace of VMAs?  I was thinking that we
>>> could avoid a lot of the work involved with splitting/pushing and
>>> the parent construction by using memcpy of each node, replacing each
>>> slot (and parent) with a new memcpy of the mirrored tree, and then
>>> have a minimum amount of modifications to delete the DONT_COPY
>>> entries during the VMA replacement.  A BFS copy would ensure we
>>> wouldn't modify the source tree during VMA replacement and deletion
>>> (DONT_COPY).  So the rebalancing (in bulk insert), pivot
>>> calculations, metadata, and gaps are (mostly) saved by using memcpy.
>> Your analysis is correct.
>>>
>>> From what I understand of your linked list idea, we would need to
>>> construct the child node by examining each entry and knowing that a
>>> certain entry is a DONT_COPY (ie: VMA logic would be needed in the
>>> maple tree code, or a callback?).  We really can't have VMA logic in
>>> the maple tree code, so we could do some sort of loop through the
>>> VMAs to add the entry to the list if desired.
>>>
>>> Once we have a linked list, we still need to figure out each pivot
>>> for the parent (I guess we won't use the last slot so there is a
>>> pivot to check?), and each max gap in each child, to construct the
>>> upper layers of the tree.  Is this correct?
>>>
>>> I guess we would still need to adjust the last node to ensure
>>> sufficient entries as well, so as we add items we may need to
>>> rebalance the last leaf with the second-last leaf.
>> Yes, the last two leaves need to be checked to make sure they have
>> enough items.
>>>
>>> The bulk store currently adjusts the split to favour splitting left;
>>> could you get the same result by strictly filling the nodes?  This
>>> would have to have special handling to rebalance the last one -
>>> which we have a pretty good idea of when it's going to happen, as we
>>> have a count (and the DONT_COPY is rare).
>>>
>>> Are you thinking you could compact the tree at the same time to gain
>>> efficiency?
>>>
>>> What would you consider a sufficiently packed tree?  It's best to
>>> keep some space around for expansion/contraction.  This works out
>>> since, I think, we would need to keep that last slot free so we have
>>> a pivot to check in your linked list plan.  Initial development with
>>> strict split/join rules resulted in a 'jittering' of nodes as the
>>> number of entries in a node shifted just above/below the threshold,
>>> so we relaxed the merging of nodes to avoid such a scenario.
>> We can control the number of entries in a node, for example letting
>> this number be (min + max)/2, so as to avoid making a node too
>> 'dense' or too 'loose'.
>
> By the way, in the VMA case, we also know the number of VMAs in the
> tree.  Unfortunately, we don't know how many are DONT_COPY VMAs.  I
> wonder if it would be worthwhile to balance each VMA with its
> neighbour during this operation, at least within one tree level?  It
> would reduce the possibility of a larger rebalance on a DONT_COPY.
> It's probably not worth it because it would slow down our fast path.
>
>>>
>>> Let me know if you would like me to put your name beside the Fork &
>>> Dup Tree item in the list of work.
>> You can put my name on this one and I'll do it.
>
> Sounds good, thanks!
>
>> I use the method of copying all nodes, so I will implement an
>> interface to get a mirrored tree.
>>
>> But I can't think of a good way to replace old VMAs; it can't be done
>> during the copy tree process, because some VMAs may have the
>> DONT_COPY flag.
>
> I think it could be done during the copy; instead of a 'replace' it
> would be a 'delete'.  I think this is why we need to use a BFS-like
> duplication.  Once we reach the leaves, we can modify the tree knowing
> that the state above has already been copied, so we would be altering
> a copy of the data at a point where it can be mutated.  You could
> detect that the lock-step next/insert is out of sync and delete the
> range between the old_mas and the new_mas.
>
>> It seems that we can only scan all VMAs in the source tree again to
>> update the new tree.  We have already traversed the source tree once
>> in the process of copying the tree.  Is there any way to avoid
>> traversing it again?
>
> Well, we haven't visited the VMAs in the copy, but we have visited the
> leaf nodes with all the pointers to the VMAs.  I get what you are
> saying, though: we will have to duplicate the leaves and then re-visit
> them to replace the VMAs.  I am not sure we can avoid this since a
> rebalance may occur, and it would be very tricky to rebalance with old
> and new data in the tree - it's probably best to just revisit, at
> least to begin with.
>
> Depending on how you implement it, you could make the copying of the
> tree end on the first leaf node by using the height of the tree to
> figure out which way to sweep (left to right or right to left) on the
> first level.  Not strictly BFS, but you could end the maple state in
> the correct location to start replacing the VMAs.  Would that work?
>
> We also have a reverse iterator, so we could just run through the tree
> from the right-most node to the start.
>
> I was thinking that we would make the first 'duplication store' detect
> an empty tree and duplicate the tree, ending on the left-most leaf and
> then replacing the first entry (and possibly deleting up to the first
> store).  Each subsequent store would do the same.
> We would need a 'duplication complete' that would remove any entries
> beyond the last store and rebalance, if necessary.
>
> Feel free to use some or none of the ideas.  I hope some of this helps
> with what you are trying to work out.  Let me know if you have any
> more questions.

Thank you for providing so much information; I am working on this
feature.

Thanks,
Peng

> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: Maple Tree Work
  2023-07-07 16:38 Maple Tree Work Liam R. Howlett
  2023-07-12 11:49 ` Peng Zhang
@ 2023-07-13  9:58 ` Peng Zhang
  2023-07-13 15:52   ` Liam R. Howlett
  2023-07-14 12:57   ` Matthew Wilcox
  2025-07-29 12:03 ` Danilo Krummrich
  2 siblings, 2 replies; 12+ messages in thread
From: Peng Zhang @ 2023-07-13 9:58 UTC (permalink / raw)
  To: Liam R. Howlett; +Cc: Peng Zhang, maple-tree, linux-kernel

I have a question I want to discuss here.  I noticed that the maple
tree interface has three different prefixes, namely mtree_*, mt_*, and
mas_*.  I am curious why the interfaces prefixed with mtree_* and mt_*
cannot be unified.  I think they could all be changed to mtree_* to
avoid having two different prefixes.
* Re: Maple Tree Work
  2023-07-13  9:58 ` Peng Zhang
@ 2023-07-13 15:52   ` Liam R. Howlett
  2023-07-14 12:57   ` Matthew Wilcox
  1 sibling, 0 replies; 12+ messages in thread
From: Liam R. Howlett @ 2023-07-13 15:52 UTC (permalink / raw)
  To: Peng Zhang; +Cc: maple-tree, linux-kernel

* Peng Zhang <zhangpeng.00@bytedance.com> [230713 05:58]:
> I have a question I want to discuss here.  I noticed that the
> interface of maple tree has three different prefixes, namely mtree_*,
> mt_*, mas_*.  I am curious why the interfaces prefixed with mtree_*
> and mt_* cannot be unified?  I think they can be changed to mtree_* to
> avoid two different prefixes.

I am sure there was a reason.  Looking at the documentation, it may
have been to indicate rcu/write locking, but the entire interface
doesn't fit that reasoning.  We can probably align the mt_ and mtree_
interfaces to all be mt_.  If we do, and we probably should, we'll have
to change other places in the kernel and the documentation.

The mas_ prefix is the advanced interface, so locking isn't handled,
and internally the mas_ indicates the first argument is a struct
ma_state pointer.

I can add this to the list as well.

Thanks,
Liam
* Re: Maple Tree Work
  2023-07-13  9:58 ` Peng Zhang
  2023-07-13 15:52   ` Liam R. Howlett
@ 2023-07-14 12:57   ` Matthew Wilcox
  2023-07-14 13:32     ` Peng Zhang
  1 sibling, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2023-07-14 12:57 UTC (permalink / raw)
  To: Peng Zhang; +Cc: Liam R. Howlett, maple-tree, linux-kernel

On Thu, Jul 13, 2023 at 05:58:13PM +0800, Peng Zhang wrote:
> I have a question I want to discuss here.  I noticed that the
> interface of maple tree has three different prefixes, namely mtree_*,
> mt_*, mas_*.  I am curious why the interfaces prefixed with mtree_*
> and mt_* cannot be unified?  I think they can be changed to mtree_* to
> avoid two different prefixes.

I haven't worried about this too much.  The long-term goal is to use
the maple tree data structure to replace the radix tree data structure
underlying the xarray, and to use the xarray API to access the maple
tree.  The xarray API will need some enhancements to make this work,
but churning the maple tree API doesn't seem like a good use of effort.
* Re: Maple Tree Work
  2023-07-14 12:57   ` Matthew Wilcox
@ 2023-07-14 13:32     ` Peng Zhang
  0 siblings, 0 replies; 12+ messages in thread
From: Peng Zhang @ 2023-07-14 13:32 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Liam R. Howlett, maple-tree, linux-kernel, Peng Zhang

On 2023/7/14 20:57, Matthew Wilcox wrote:
> On Thu, Jul 13, 2023 at 05:58:13PM +0800, Peng Zhang wrote:
>> I have a question I want to discuss here.  I noticed that the
>> interface of maple tree has three different prefixes, namely mtree_*,
>> mt_*, mas_*.  I am curious why the interfaces prefixed with mtree_*
>> and mt_* cannot be unified?  I think they can be changed to mtree_*
>> to avoid two different prefixes.
>
> I haven't worried about this too much.  The long-term goal is to use
> the maple tree data structure to replace the radix tree data structure
> underlying the xarray and use the xarray API to access the maple tree.
> The xarray API will need some enhancements to make this work, but
> churning the maple tree API doesn't seem like a good use of effort.

If there are two prefixes, mt_ and mtree_, I will be confused about
which prefix to use when adding a new API.  Users will also be confused
when they see the two prefixes mt_ and mtree_.  However, renaming the
API is not a good thing either.
* Re: Maple Tree Work
  2023-07-07 16:38 Maple Tree Work Liam R. Howlett
  2023-07-12 11:49 ` Peng Zhang
  2023-07-13  9:58 ` Peng Zhang
@ 2025-07-29 12:03 ` Danilo Krummrich
  2025-08-06 18:53   ` Liam R. Howlett
  2 siblings, 1 reply; 12+ messages in thread
From: Danilo Krummrich @ 2025-07-29 12:03 UTC (permalink / raw)
  To: Liam R. Howlett; +Cc: maple-tree, linux-kernel, Peng Zhang, Alice Ryhl

(Cc: Alice)

On 7/7/23 6:38 PM, Liam R. Howlett wrote:
> Larger Features:
> - 64 bit stores on 32 bit arch
>         A number of users want to use the maple tree, but cannot
>         because they need a storage medium that supports 64 bit stores
>         on 32 bit arch.

Has there been more thought (or even progress) on this?  Managing GPU
virtual address spaces is still an interesting candidate for this, as
they may be larger than 32 bit even on 32 bit architectures.

- Danilo
* Re: Maple Tree Work
  2025-07-29 12:03 ` Danilo Krummrich
@ 2025-08-06 18:53   ` Liam R. Howlett
  0 siblings, 0 replies; 12+ messages in thread
From: Liam R. Howlett @ 2025-08-06 18:53 UTC (permalink / raw)
  To: Danilo Krummrich; +Cc: maple-tree, linux-kernel, Peng Zhang, Alice Ryhl

* Danilo Krummrich <dakr@kernel.org> [250729 08:03]:
> (Cc: Alice)
>
> On 7/7/23 6:38 PM, Liam R. Howlett wrote:
> > Larger Features:
> > - 64 bit stores on 32 bit arch
> >         A number of users want to use the maple tree, but cannot
> >         because they need a storage medium that supports 64 bit
> >         stores on 32 bit arch.
>
> Has there been more thought (or even progress) on this?  Managing GPU
> virtual address spaces is still an interesting candidate for this, as
> they may be larger than 32 bit even on 32 bit architectures.

Yes, thought has been given, but I've been working on removing the big
node to get mixed node types going.  I think I have someone I can get
working on this now.

Thanks,
Liam
end of thread, other threads:[~2025-08-06 18:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2023-07-07 16:38 Maple Tree Work Liam R. Howlett
2023-07-12 11:49 ` Peng Zhang
2023-07-12 15:27   ` Liam R. Howlett
2023-07-13  8:05     ` Peng Zhang
2023-07-13 15:36       ` Liam R. Howlett
2023-07-14 12:32         ` Peng Zhang
2023-07-13  9:58 ` Peng Zhang
2023-07-13 15:52   ` Liam R. Howlett
2023-07-14 12:57   ` Matthew Wilcox
2023-07-14 13:32     ` Peng Zhang
2025-07-29 12:03 ` Danilo Krummrich
2025-08-06 18:53   ` Liam R. Howlett