linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [mm] VMA merging behavior wrt anon_vma has been slightly broken since Linux 3.0 (in a non-dangerous way)
@ 2023-08-15 19:43 Jann Horn
  2023-08-17 13:42 ` Liam R. Howlett
  0 siblings, 1 reply; 3+ messages in thread
From: Jann Horn @ 2023-08-15 19:43 UTC (permalink / raw)
  To: Linux-MM, Andrew Morton, Shaohua Li
  Cc: kernel list, Peter Zijlstra, Liam Howlett

Hi!

I think VMA merging was accidentally nerfed a bit by commit
965f55dea0e3 ("mmap: avoid merging cloned VMAs"), which landed in
Linux 3.0 - essentially, that commit makes it impossible to merge a
VMA with an anon_vma into an adjacent VMA that does not have an
anon_vma. (But the other direction works.)


is_mergeable_anon_vma() is defined as:

```
static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1,
                 struct anon_vma *anon_vma2, struct vm_area_struct *vma)
{
        /*
         * The list_is_singular() test is to avoid merging VMA cloned from
         * parents. This can improve scalability caused by anon_vma lock.
         */
        if ((!anon_vma1 || !anon_vma2) && (!vma ||
                list_is_singular(&vma->anon_vma_chain)))
                return true;
        return anon_vma1 == anon_vma2;
}
```

If this function is called with a non-NULL vma pointer (which is
almost always the case, except when checking for whether it's possible
to merge in both directions at the same time), and one of the two
anon_vmas is non-NULL, this returns
list_is_singular(&vma->anon_vma_chain). I believe that
list_is_singular() call is supposed to check whether the
anon_vma_chain contains *more than one* element, but it actually also
fails if the anon_vma_chain contains zero elements.

This means that the dup_anon_vma() calls in vma_merge() are
effectively all no-ops because they are never called with a target
that does not have an anon_vma and a source that has an anon_vma.

I think this is unintentional - though I guess this unintentional
refusal to merge VMAs this way also lowers the complexity of what can
happen in the VMA merging logic. So I think the right fix here is to
make this kind of merging possible again by changing
"list_is_singular(&vma->anon_vma_chain)" to
"list_empty(&vma->anon_vma_chain) ||
list_is_singular(&vma->anon_vma_chain)", but my security hat makes me
say that I'd also be happy if the unintentional breakage stayed this
way it is now.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [mm] VMA merging behavior wrt anon_vma has been slightly broken since Linux 3.0 (in a non-dangerous way)
  2023-08-15 19:43 [mm] VMA merging behavior wrt anon_vma has been slightly broken since Linux 3.0 (in a non-dangerous way) Jann Horn
@ 2023-08-17 13:42 ` Liam R. Howlett
  2023-08-17 16:21   ` Jann Horn
  0 siblings, 1 reply; 3+ messages in thread
From: Liam R. Howlett @ 2023-08-17 13:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: Linux-MM, Andrew Morton, Shaohua Li, kernel list, Peter Zijlstra

* Jann Horn <jannh@google.com> [230815 15:44]:
> Hi!
> 
> I think VMA merging was accidentally nerfed a bit by commit
> 965f55dea0e3 ("mmap: avoid merging cloned VMAs"), which landed in
> Linux 3.0 - essentially, that commit makes it impossible to merge a
> VMA with an anon_vma into an adjacent VMA that does not have an
> anon_vma. (But the other direction works.)
> 
> 
> is_mergeable_anon_vma() is defined as:
> 
> ```
> static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1,
>                  struct anon_vma *anon_vma2, struct vm_area_struct *vma)
> {
>         /*
>          * The list_is_singular() test is to avoid merging VMA cloned from
>          * parents. This can improve scalability caused by anon_vma lock.
>          */
>         if ((!anon_vma1 || !anon_vma2) && (!vma ||
>                 list_is_singular(&vma->anon_vma_chain)))
>                 return true;
>         return anon_vma1 == anon_vma2;
> }
> ```
> 
> If this function is called with a non-NULL vma pointer (which is
> almost always the case, except when checking for whether it's possible
> to merge in both directions at the same time),

You are talking about case 1 & 6 here?  To get here merge_prev and
merge_next must be set.. which means can_vma_merge_after() and
can_vma_merge_before() must succeed.. which means
is_mergeable_anon_vma() returned true with both prev and next being
passed through as "vma".  So, I think, even that case suffers the same
issue?

That is, we won't have merge_prev == true if prev has an empty
anon_vma_chain.  The same is true for merge_next.

>and one of the two
> anon_vmas is non-NULL, this returns
> list_is_singular(&vma->anon_vma_chain). I believe that
> list_is_singular() call is supposed to check whether the
> anon_vma_chain contains *more than one* element, but it actually also
> fails if the anon_vma_chain contains zero elements.
> 
> This means that the dup_anon_vma() calls in vma_merge() are
> effectively all no-ops because they are never called with a target
> that does not have an anon_vma and a source that has an anon_vma.
> 
> I think this is unintentional - though I guess this unintentional
> refusal to merge VMAs this way also lowers the complexity of what can
> happen in the VMA merging logic. So I think the right fix here is to
> make this kind of merging possible again by changing
> "list_is_singular(&vma->anon_vma_chain)" to
> "list_empty(&vma->anon_vma_chain) ||
> list_is_singular(&vma->anon_vma_chain)", but my security hat makes me
> say that I'd also be happy if the unintentional breakage stayed this
> way it is now.

The commit message for the offending change says
find_mergeable_anon_vma() already considers merging these, so it may not
be as nerfed as it looks?

From what I understand the merging is an optimisation and, from the
commit message,  the change was to increase scalability, so this shifts
to using more memory to gain scalability on the anon_vma lock.  Making
this change will shift back to (maybe?) less memory for more pressure on
that lock then?  I am hesitant to suggest un-nerfing it, but it
shouldn't be left as it is since the code is unclear on what is
happening.

Thanks,
Liam



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [mm] VMA merging behavior wrt anon_vma has been slightly broken since Linux 3.0 (in a non-dangerous way)
  2023-08-17 13:42 ` Liam R. Howlett
@ 2023-08-17 16:21   ` Jann Horn
  0 siblings, 0 replies; 3+ messages in thread
From: Jann Horn @ 2023-08-17 16:21 UTC (permalink / raw)
  To: Liam R. Howlett, Jann Horn, Linux-MM, Andrew Morton, Shaohua Li,
	kernel list, Peter Zijlstra

On Thu, Aug 17, 2023 at 3:42 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> * Jann Horn <jannh@google.com> [230815 15:44]:
> > Hi!
> >
> > I think VMA merging was accidentally nerfed a bit by commit
> > 965f55dea0e3 ("mmap: avoid merging cloned VMAs"), which landed in
> > Linux 3.0 - essentially, that commit makes it impossible to merge a
> > VMA with an anon_vma into an adjacent VMA that does not have an
> > anon_vma. (But the other direction works.)
> >
> >
> > is_mergeable_anon_vma() is defined as:
> >
> > ```
> > static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1,
> >                  struct anon_vma *anon_vma2, struct vm_area_struct *vma)
> > {
> >         /*
> >          * The list_is_singular() test is to avoid merging VMA cloned from
> >          * parents. This can improve scalability caused by anon_vma lock.
> >          */
> >         if ((!anon_vma1 || !anon_vma2) && (!vma ||
> >                 list_is_singular(&vma->anon_vma_chain)))
> >                 return true;
> >         return anon_vma1 == anon_vma2;
> > }
> > ```
> >
> > If this function is called with a non-NULL vma pointer (which is
> > almost always the case, except when checking for whether it's possible
> > to merge in both directions at the same time),
>
> You are talking about case 1 & 6 here?

Yeah, about the third call to is_mergeable_anon_vma() in those cases.

> To get here merge_prev and
> merge_next must be set.. which means can_vma_merge_after() and
> can_vma_merge_before() must succeed.. which means
> is_mergeable_anon_vma() returned true with both prev and next being
> passed through as "vma".  So, I think, even that case suffers the same
> issue?

Right, my intent was to talk about individual callers of vma_merge().

> That is, we won't have merge_prev == true if prev has an empty
> anon_vma_chain.  The same is true for merge_next.
>
> >and one of the two
> > anon_vmas is non-NULL, this returns
> > list_is_singular(&vma->anon_vma_chain). I believe that
> > list_is_singular() call is supposed to check whether the
> > anon_vma_chain contains *more than one* element, but it actually also
> > fails if the anon_vma_chain contains zero elements.
> >
> > This means that the dup_anon_vma() calls in vma_merge() are
> > effectively all no-ops because they are never called with a target
> > that does not have an anon_vma and a source that has an anon_vma.
> >
> > I think this is unintentional - though I guess this unintentional
> > refusal to merge VMAs this way also lowers the complexity of what can
> > happen in the VMA merging logic. So I think the right fix here is to
> > make this kind of merging possible again by changing
> > "list_is_singular(&vma->anon_vma_chain)" to
> > "list_empty(&vma->anon_vma_chain) ||
> > list_is_singular(&vma->anon_vma_chain)", but my security hat makes me
> > say that I'd also be happy if the unintentional breakage stayed this
> > way it is now.
>
> The commit message for the offending change says
> find_mergeable_anon_vma() already considers merging these, so it may not
> be as nerfed as it looks?

Ah, right, that should reduce the impact a lot.

> From what I understand the merging is an optimisation and, from the
> commit message,  the change was to increase scalability

Yes, my understanding is that it was to increase scalability by
avoiding merging VMAs when that would cause more use of an anon_vma
that is shared with other processes.

 - An empty anon_vma_chain means we have no anon_vma.
 - An anon_vma_chain with a single element means we have an anon_vma
that is private to our process.
 - An anon_vma_chain with more than one element is likely shared with
other processes.

My understanding is that the scalability issue probably comes mainly
from having multiple processes hitting the same anon_vma lock during
operations under the mmap lock in write mode; that wouldn't be a
problem for same-process sharing. There might also be some scalability
impact from this kind of situation, although probably less because
AFAIU rmap walks aren't needed for much hotpath stuff:

 - Process P1 creates a VMA V1, covering range A1-A2, with a new anon_vma AV1
 - Process P1 forks a child process P2
   - P2 inherits a copy of V1 called V1', with two anon_vmas:
     - AV1 (inherited)
     - AV2 (used for new pages)
 - P1 extends VMA V1, now it covers range A1-A3
 - let's hypothetically assume that P2 extends VMA V1', now also
covering range A1-A3
 - P1 inserts a page somewhere in the range A2-A3, it gets attached to AV1
 - now if we have to do an rmap walk on the page, this unnecessarily
requires walking the page tables of both P1 and P2

Basically merging VMAs with anon_vma_chains of length >1 could easily
lead to some kind of false sharing for rmap walks of pages mapped in
the parent, which is probably undesirable especially in cases where a
process forks off a lot of children. This sort of situation is
probably harder to trigger in a single-process scenario with a
length-1 AVC, since you'd have to be splitting VMAs and moving them
around with mremap() and extending them with mremap() to have
overlapping ranges in the interval tree.

My understanding is that essentially, in a process that is not going
to fork (or constantly move VMAs around), it would be fine to have
more or less a single anon_vma for all anonymous VMAs; that just means
rmap walks will have to do O(log(number_of_vmas)) steps through the
anon_vma interval tree. In a process that forks, doing so would mean
that operations like VMA expansion and unmapping on inherited VMAs in
all the children would involve taking a shared lock - that'd probably
still be mostly fine, I guess? But if children kept merging new VMAs
into the parent's anon_vma, all their VMA modification operations
would all hit the same lock, which would be bad.

(By the way, we also depend on not merging VMAs with more than one
anon_vma_chain entry for correctness/security, see commit 2555283eb40d
"mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse".)

> so this shifts
> to using more memory to gain scalability on the anon_vma lock.  Making
> this change will shift back to (maybe?) less memory for more pressure on
> that lock then?  I am hesitant to suggest un-nerfing it, but it
> shouldn't be left as it is since the code is unclear on what is
> happening.


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-08-17 16:22 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-15 19:43 [mm] VMA merging behavior wrt anon_vma has been slightly broken since Linux 3.0 (in a non-dangerous way) Jann Horn
2023-08-17 13:42 ` Liam R. Howlett
2023-08-17 16:21   ` Jann Horn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).