All of lore.kernel.org
 help / color / mirror / Atom feed
From: Carlos Llamas <cmllamas@google.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Barry Song <v-songbaohua@oppo.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Tangquan Zheng <zhengtangquan@oppo.com>
Subject: Re: [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration
Date: Tue, 3 Sep 2024 13:55:44 +0000	[thread overview]
Message-ID: <ZtcVYBUNWGow40pX@google.com> (raw)
In-Reply-To: <CAGsJ_4xr-HvqKdh=Q=sVKM+hM+VS1Cf4gqPvq9vDtnQSBO9X=A@mail.gmail.com>

On Tue, Sep 03, 2024 at 07:45:12PM +0800, Barry Song wrote:
> On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > faults. Performing memory allocation while holding this lock may trigger
> > > direct reclamation, leading to others being queued in the rwsem for an
> > > extended period.
> > > We've observed that the allocation can sometimes take more than 300ms,
> > > significantly blocking other threads. The user interface sometimes
> > > becomes less responsive as a result. To prevent this, let's move the
> > > allocation outside of the write lock.

Thanks for you patch Barry. So, we are aware of this contention and I've
been working on a fix for it. See more about this below.

> >
> > I suspect concurrent allocators make things better wrt response, cutting
> > alloc latency down to 10ms for instance in your scenario. Feel free to
> > show figures given Tangquan's 48-hour profiling.
> 
> Likely.
> 
> Concurrent allocators are quite common in PFs which occur
> in the same PTE. whoever gets PTL sets PTE, others free the allocated
> pages.
> 
> >
> > > A potential side effect could be an extra alloc_page() for the second
> > > thread executing binder_install_single_page() while the first thread
> > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > negligible.

This is not negligible. In fact, it is the exact reason for the page
allocation to be done with the mmap sem. If the first thread sleeps on
vm_insert_page(), then binder gets into a bad state of multiple threads
trying to reclaim pages that won't really be used. Memory pressure goes
from bad to worst pretty quick.

FWIW, I believe this was first talked about here:
https://lore.kernel.org/all/ZWmNpxPXZSxdmDE1@google.com/


> > > On the other hand, holding a write lock without making any VMA
> > > modifications appears questionable and likely incorrect. While this
> > > patch focuses on reducing the lock duration, future updates may aim
> > > to eliminate the write lock entirely.
> >
> > If spin, better not before taking a look at vm_insert_page().
> 
> I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
> currently in the
> testing queue. At the moment, alloc->spin is in place, but I'm not
> entirely convinced
> it's the best replacement for the write lock. Let's wait for
> Tangquan's test results.
> 
> Patch 2 is detailed below, but it has only passed the build-test phase
> so far, so
> its result is uncertain. I'm sharing it early in case you find it
> interesting. And I
> am not convinced Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in
> race with munmap()") is a correct fix to really avoid all UAF of alloc->vma.
> 
> [PATCH]  binder_alloc: Don't use mmap_write_lock for installing page
> 
> Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
> munmap()") uses the mmap_rwsem write lock to protect against a race
> condition with munmap, where the vma is detached by the write lock,
> but pages are zapped by the read lock. This approach is extremely
> expensive for the system, though perhaps less so for binder itself,
> as the write lock can block all other operations.
> 
> As an alternative, we could hold only the read lock and re-check
> that the vma hasn't been detached. To protect simultaneous page
> installation, we could use alloc->lock instead.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index f20074e23a7c..a2281dfacbbc 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -228,24 +228,17 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>                 return -ESRCH;
> 
>         /*
> -        * Don't allocate page in mmap_write_lock, this can block
> -        * mmap_rwsem for a long time; Meanwhile, allocation failure
> -        * doesn't necessarily need to return -ENOMEM, if lru_page
> -        * has been installed, we can still return 0(success).
> +        * Allocation failure doesn't necessarily need to return -ENOMEM,
> +        * if lru_page has been installed, we can still return 0(success).
> +        * So, defer the !page check until after binder_get_installed_page()
> +        * is completed.
>          */
>         page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> 
> -       /*
> -        * Protected with mmap_sem in write mode as multiple tasks
> -        * might race to install the same page.
> -        */
> -       mmap_write_lock(alloc->mm);
> -       if (binder_get_installed_page(lru_page)) {
> -               ret = 1;
> -               goto out;
> -       }
> +       mmap_read_lock(alloc->mm);
> 
> -       if (!alloc->vma) {
> +       /* vma might have been dropped or deattached */
> +       if (!alloc->vma || !find_vma(alloc->mm, addr)) {
>                 pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
>                 ret = -ESRCH;
>                 goto out;
> @@ -257,18 +250,27 @@ static int binder_install_single_page(struct
> binder_alloc *alloc,
>                 goto out;
>         }
> 
> +       spin_lock(&alloc->lock);

You can't hold a spinlock and then call vm_insert_page().

> +       if (binder_get_installed_page(lru_page)) {
> +               spin_unlock(&alloc->lock);
> +               ret = 1;
> +               goto out;
> +       }
> +
>         ret = vm_insert_page(alloc->vma, addr, page);
>         if (ret) {
>                 pr_err("%d: %s failed to insert page at offset %lx with %d\n",
>                        alloc->pid, __func__, addr - alloc->buffer, ret);
> +               spin_unlock(&alloc->lock);
>                 ret = -ENOMEM;
>                 goto out;
>         }
> 
>         /* Mark page installation complete and safe to use */
>         binder_set_installed_page(lru_page, page);
> +       spin_unlock(&alloc->lock);
>  out:
> -       mmap_write_unlock(alloc->mm);
> +       mmap_read_unlock(alloc->mm);
>         mmput_async(alloc->mm);
>         if (ret && page)
>                 __free_page(page);
> --
> 2.39.3 (Apple Git-146)


Sorry, but as I mentioned, I've been working on fixing this contention
by supporting concurrent "faults" in binder_install_single_page(). This
is the appropriate fix. I should be sending a patch soon after working
out the conflicts with the shrinker's callback.

Thanks,
--
Carlos Llamas


  reply	other threads:[~2024-09-03 13:55 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-02 22:50 [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration Barry Song
2024-09-03  8:57 ` Greg Kroah-Hartman
2024-09-03  9:07   ` Barry Song
2024-09-03 10:04     ` Greg Kroah-Hartman
2024-09-03 11:01       ` Barry Song
2024-09-03 11:01 ` Hillf Danton
2024-09-03 11:45   ` Barry Song
2024-09-03 13:55     ` Carlos Llamas [this message]
2024-09-03 21:10       ` Barry Song
2024-09-03 22:23         ` Carlos Llamas
2024-09-03 22:43           ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZtcVYBUNWGow40pX@google.com \
    --to=cmllamas@google.com \
    --cc=21cnbao@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=v-songbaohua@oppo.com \
    --cc=zhengtangquan@oppo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.