From: Bert Karwatzki <spasswolf@web.de>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <Liam.Howlett@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
spassowlf@web.de
Subject: Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
Date: Sat, 05 Oct 2024 02:56:01 +0200
Message-ID: <088a3541b85b783ef68337bd4bb790d62f200dfa.camel@web.de>
In-Reply-To: <3b83746d-b73e-4b89-9b74-5aba5a827f45@lucifer.local>
On Friday, 04.10.2024 at 23:41 +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 04, 2024 at 11:35:44AM +0200, Bert Karwatzki wrote:
> > Here's the log produced by this kernel:
> >
> > c9e7f76815d3 (HEAD -> maple_tree_debug_4) hack: set of info stuff v5
> > 7e3bb072761a mm: correct error handling in mmap_region()
> > 77df9e4bb222 (tag: next-20241001, origin/master, origin/HEAD, master) Add linux-next specific files for 20241001
> >
> > Again it took two attempts to trigger the bug.
> >
> > Bert Karwatzki
> >
>
> Sending an updated, cleaned up version of the patch with a lot of
> explanation. This is functionally identical to the v3 fix I already sent so
> you can try that or this to confirm it resolves your issue.
>
> If you are able to do so, I can submit this to upstream for a hotfix. If
> not, well then back to the drawing board and I'd be very shocked :)
>
> I have been able to reproduce the issue locally in our userland testing
> suite entirely consistently, and this patch resolves the issue and also
> continues to pass all maple tree unit tests.
>
> Again thank you so much for all your help - I hope you are able to find a
> spare moment to quickly give this one a try and confirm whether it does
> indeed address the problem you've reported.
>
> Thanks, Lorenzo
>
> ----8<----
> From 126d65bd9839cd3ec941007872b357e27fd56066 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Fri, 4 Oct 2024 15:18:58 +0100
> Subject: [PATCH] maple_tree: correct tree corruption on spanning store
>
> Writing a data range into a maple tree may involve overwriting a number of
> existing entries that span across more than one node. Doing so invokes a
> 'spanning' store.
>
> Performing a spanning store across two leaf nodes in a maple tree in which
> entries are overwritten is achieved by first initialising a 'big' node,
> which will store the coalesced entries between the two nodes comprising
> entries prior to the newly stored entry, the newly stored entry, and
> subsequent entries.
>
> This 'big node' is then merged back into the tree and the tree is
> rebalanced, replacing the entries across the spanned nodes with those
> contained in the big node.
>
> The operation is performed in mas_wr_spanning_store() which starts by
> establishing two maple tree state objects ('mas' objects) on the left of
> the range and on the right (l_mas and r_mas respectively).
>
> l_mas traverses to the beginning of the range to be stored in order to copy
> the data BEFORE the requested store into the big node.
>
> We then insert our new entry immediately afterwards (both the left copy and
> the storing of the new entry are combined and performed by
> mas_store_b_node()).
>
> r_mas traverses to the populated slot immediately after, in order to copy
> the data AFTER the requested store into the big node.
>
> This copy of the right-hand node is performed by mas_mab_cp() as long as
> r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end.
>
> We traverse r_mas to this position in mas_wr_node_walk() using a simple
> loop:
>
> 	while (offset < count && mas->index > wr_mas->pivots[offset])
> 		offset++;
>
> Note here that count is determined to be the (inclusive) index of the last
> slot containing data in the node as determined by ma_data_end().
>
> This means that even in searching for mas->index, which will have been set
> to one plus the end of the target range in order to traverse to the next
> slot in mas_wr_spanning_store(), we will terminate the iteration at the end
> of the node range even if this condition is not met due to the offset <
> count condition.
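
To make the termination behaviour described above concrete, here is a
minimal user-space sketch of that loop. It is not kernel code:
toy_node_walk() and the pivot values used with it are made up for
illustration, and the node is reduced to a bare array of pivots.

```c
#include <stddef.h>

/*
 * Toy stand-in for the walk loop quoted in the commit message.
 * 'count' is the inclusive offset of the last populated slot, so the
 * pivots array must hold at least count + 1 entries.
 */
static unsigned char toy_node_walk(const unsigned long *pivots,
				   unsigned char count, unsigned long index)
{
	unsigned char offset = 0;

	/* Stops at the first pivot >= index, or at 'count' regardless. */
	while (offset < count && index > pivots[offset])
		offset++;

	return offset;
}
```

With pivots { 0x0fff, 0x1fff, 0x2fff } and count = 2, an index of 0x1800
stops at offset 1, where index <= pivot holds. An index of 0x3000 stops at
offset 2 even though it is larger than every pivot in the node, so the
returned offset alone does not tell the caller whether the index was found.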
>
> The fact this right hand node contains the end of the range being stored is
> why we are traversing it, and this loop is why we appear to discover a
> viable range within the right node to copy to the big one.
>
> However, if the node that r_mas traverses contains a pivot EQUAL to the end
> of the range being stored, and this is the LAST pivot contained within the
> node, something unexpected happens:
>
> 1. The l_mas traversal copy and insertion of the new entry in the big node
> is performed via mas_store_b_node() correctly.
>
> 2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is
> set to the offset of the entry equal to the end of the range we store.
>
> 3. We therefore copy this DUPLICATE of the final pivot into the big node,
> and insert this DUPLICATE entry, alongside its invalid slot entry
> immediately after the newly inserted entry.
>
> 4. The big node containing this duplicate is inserted into the tree which
> is rebalanced, and therefore the maple tree becomes corrupted.
>
> Note that if the right hand node had one or more entries with pivots of
> greater value than the end of the stored range, this would not happen. If
> it contained entries with pivots of lesser value it would not be the right
> node in this spanning store.
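
The four steps above can be modelled in user space roughly as follows.
This is only a sketch under heavy assumptions: struct toy_big_node,
toy_spanning_store() and friends are hypothetical names, only pivots are
tracked (no slots, no rebalancing), and "corruption" is reduced to the
pivots in the big node no longer being strictly increasing.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BIG_SLOTS 16

/* Toy stand-in for the big node; only pivots are modelled. */
struct toy_big_node {
	unsigned long pivot[TOY_BIG_SLOTS];
	size_t count;
};

static void toy_append(struct toy_big_node *b, unsigned long pivot)
{
	if (b->count < TOY_BIG_SLOTS)
		b->pivot[b->count++] = pivot;
}

/* Pivots within a node are expected to be strictly increasing. */
static bool toy_pivots_valid(const struct toy_big_node *b)
{
	for (size_t i = 1; i < b->count; i++)
		if (b->pivot[i] <= b->pivot[i - 1])
			return false;
	return true;
}

/*
 * Model of the unpatched flow for a store ending at 'last': keep
 * 'left_keep' pivots from the left node, add the new entry, walk the
 * right node for last + 1, then copy right[r_offset..r_end].
 */
static struct toy_big_node toy_spanning_store(const unsigned long *left,
					       size_t left_keep,
					       unsigned long last,
					       const unsigned long *right,
					       size_t r_end)
{
	struct toy_big_node b = { .count = 0 };
	size_t r_offset = 0;

	for (size_t i = 0; i < left_keep; i++)
		toy_append(&b, left[i]);
	toy_append(&b, last);

	/* Same shape as the walk loop in the commit message. */
	while (r_offset < r_end && last + 1 > right[r_offset])
		r_offset++;

	/* Unpatched copy condition; it always holds in this model. */
	if (r_offset <= r_end)
		for (size_t i = r_offset; i <= r_end; i++)
			toy_append(&b, right[i]);

	return b;
}
```

Storing a range ending at 0x3fff against a right node with pivots
{ 0x2fff, 0x3fff } yields a big node ending in 0x3fff, 0x3fff, which
toy_pivots_valid() rejects. Add a further pivot 0x4fff to the right node
and the copy starts at that entry instead, giving a valid result.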
>
> This appears to have been at risk of happening throughout the maple tree's
> history, however it seemed significantly less likely to occur until
> recently.
>
> The balancing of the tree seems to have made it unlikely that you would
> happen to perform a store that both spans two nodes AND would overwrite
> precisely the entry with the largest pivot in the right-hand node which
> contains no further larger pivots.
>
> The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
> in mmap_region()") seems to have made the probability of this event much
> more likely.
>
> Previous to this change, MAP_FIXED mappings which were overwritten would
> first be cleared before any subsequent store or, importantly, merge of
> surrounding entries would be performed.
>
> After this change, this is no longer the case, and this means that, in the
> worst case, a number of entries might be overwritten in combination with a
> merge (and subsequent overwriting expansion) between both the prior entry
> AND a subsequent entry.
>
> The motivation for this change arose from Bert Karwatzki's report of
> encountering mm instability after the release of kernel v6.12-rc1 which,
> after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
> options, was identified as maple tree corruption.
>
> After Bert very generously provided his time and ability to reproduce this
> event consistently, I was able to finally identify that the issue discussed
> in this commit message was occurring for him.
>
> The solution implemented in this patch is:
>
> 1. Adjust mas_wr_walk_index() to return a boolean value indicating whether
> the containing node is actually populated with entries possessing pivots
> equal to or greater than mas->index.
>
> 2. When traversing the right node in mas_wr_spanning_store(), use this
> value to determine whether to try to copy from the right node - if it is
> not populated, then do not do so.
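
As a rough illustration of the two-part fix described above, the check can
be sketched in user space like this. The names toy_walk_index() and
toy_should_copy_right() are hypothetical, the node is again just an array
of pivots, and the zero-pivot case is copied from the patch as-is without
modelling why the kernel treats it that way.

```c
#include <stdbool.h>

/*
 * Walk to the offset for 'index' and report whether the node really has
 * an entry whose pivot is >= index. 'count' is the inclusive offset of
 * the last populated slot, so pivots must hold count + 1 entries.
 */
static bool toy_walk_index(const unsigned long *pivots, unsigned char count,
			   unsigned long index, unsigned char *offset)
{
	unsigned char off = 0;

	while (off < count && index > pivots[off])
		off++;

	*offset = off;
	/* Mirrors the return expression added by the patch. */
	return pivots[off] == 0 || index <= pivots[off];
}

/* Decide whether the right node should be copied into the big node. */
static bool toy_should_copy_right(const unsigned long *pivots,
				  unsigned char r_end, unsigned long last)
{
	unsigned char r_offset;
	bool r_populated = toy_walk_index(pivots, r_end, last + 1, &r_offset);

	/* Patched condition: r_populated && r_mas.offset <= r_mas.end. */
	return r_populated && r_offset <= r_end;
}
```

For a store ending at 0x3fff, a right node with pivots { 0x2fff, 0x3fff }
is reported as not populated and the copy is skipped, whereas a right node
that also holds 0x4fff is still copied from that entry onwards.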
>
> This passes all maple tree unit tests and resolves the reported bug.
> ---
> lib/maple_tree.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 37abf0fe380b..e6f0da908ba7 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
>
> /*
> * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
> + * If @mas->index cannot be found within the containing
> + * node, we traverse to the last entry in the node.
> * @wr_mas: The maple write state
> *
> * Uses mas_slot_locked() and does not need to worry about dead nodes.
> @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
> return true;
> }
>
> +/*
> + * Traverse the maple tree until the offset of mas->index is reached.
> + *
> + * Return: Is this node actually populated with entries possessing pivots equal
> + * to or greater than mas->index?
> + */
> static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
> {
> struct ma_state *mas = wr_mas->mas;
> @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
> mas_wr_walk_descend(wr_mas);
> wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
> mas->offset);
> - if (ma_is_leaf(wr_mas->type))
> - return true;
> + if (ma_is_leaf(wr_mas->type)) {
> + unsigned long pivot = wr_mas->pivots[mas->offset];
> +
> + return pivot == 0 || mas->index <= pivot;
> + }
> mas_wr_walk_traverse(wr_mas);
>
> }
> @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> struct maple_big_node b_node;
> struct ma_state *mas;
> unsigned char height;
> + bool r_populated;
>
> /* Left and Right side of spanning store */
> MA_STATE(l_mas, NULL, 0, 0);
> @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> r_mas.last++;
>
> r_mas.index = r_mas.last;
> - mas_wr_walk_index(&r_wr_mas);
> + r_populated = mas_wr_walk_index(&r_wr_mas);
> r_mas.last = r_mas.index = mas->last;
>
> /* Set up left side. */
> @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> /* Copy l_mas and store the value in b_node. */
> mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
> /* Copy r_mas into b_node. */
> - if (r_mas.offset <= r_mas.end)
> + if (r_populated && r_mas.offset <= r_mas.end)
> mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
> &b_node, b_node.b_end + 1);
> else
> --
> 2.46.2
I just tested this and it passed ten tests (i.e. upgrading the Proton version
in Steam) in a row.
Bert Karwatzki