BPF List
 help / color / mirror / Atom feed
* [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation
@ 2025-01-24 19:56 Andrii Nakryiko
  2025-01-27 15:45 ` Stanislav Fomichev
  2025-01-27 22:27 ` Alexei Starovoitov
  0 siblings, 2 replies; 5+ messages in thread
From: Andrii Nakryiko @ 2025-01-24 19:56 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau
  Cc: andrii, kernel-team, syzbot+4dc041c686b7c816a71e

We use map->freeze_mutex to prevent races between map_freeze() and
memory mapping BPF map contents with writable permissions. The way we
naively do this means we'll hold freeze_mutex for entire duration of all
the mm and VMA manipulations, which is completely unnecessary. This can
potentially also lead to deadlocks, as reported by syzbot in [0].

So, instead, hold freeze_mutex only during writeability checks, bump
(proactively) "write active" count for the map, unlock the mutex and
proceed with mmap logic. And only if something went wrong during mmap
logic, then undo that "write active" counter increment.

Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
mmaping, because we also have a logic that unsets VM_MAYWRITE
forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on
for read-only mmaping, but it won't be afterwards. VM_WRITE is
a consistent way to detect writable mmaping in our implementation.

  [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/

Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/syscall.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0daf098e3207..0d5b39e99770 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
 static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	struct bpf_map *map = filp->private_data;
-	int err;
+	int err = 0;
 
 	if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record))
 		return -ENOTSUPP;
@@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 			err = -EACCES;
 			goto out;
 		}
+		bpf_map_write_active_inc(map);
 	}
+out:
+	mutex_unlock(&map->freeze_mutex);
+	if (err)
+		return err;
 
 	/* set default open/close callbacks */
 	vma->vm_ops = &bpf_map_default_vmops;
@@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
 		vm_flags_clear(vma, VM_MAYWRITE);
 
 	err = map->ops->map_mmap(map, vma);
-	if (err)
-		goto out;
+	if (err) {
+		if (vma->vm_flags & VM_WRITE) {
+			mutex_lock(&map->freeze_mutex);
+			bpf_map_write_active_dec(map);
+			mutex_unlock(&map->freeze_mutex);
+		}
+	}
 
-	if (vma->vm_flags & VM_MAYWRITE)
-		bpf_map_write_active_inc(map);
-out:
-	mutex_unlock(&map->freeze_mutex);
 	return err;
 }
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation
  2025-01-24 19:56 [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation Andrii Nakryiko
@ 2025-01-27 15:45 ` Stanislav Fomichev
  2025-01-27 22:27 ` Alexei Starovoitov
  1 sibling, 0 replies; 5+ messages in thread
From: Stanislav Fomichev @ 2025-01-27 15:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, ast, daniel, martin.lau, kernel-team,
	syzbot+4dc041c686b7c816a71e

On 01/24, Andrii Nakryiko wrote:
> We use map->freeze_mutex to prevent races between map_freeze() and
> memory mapping BPF map contents with writable permissions. The way we
> naively do this means we'll hold freeze_mutex for entire duration of all
> the mm and VMA manipulations, which is completely unnecessary. This can
> potentially also lead to deadlocks, as reported by syzbot in [0].
> 
> So, instead, hold freeze_mutex only during writeability checks, bump
> (proactively) "write active" count for the map, unlock the mutex and
> proceed with mmap logic. And only if something went wrong during mmap
> logic, then undo that "write active" counter increment.
> 
> Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
> mmaping, because we also have a logic that unsets VM_MAYWRITE
> forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on
> for read-only mmaping, but it won't be afterwards. VM_WRITE is
> a consistent way to detect writable mmaping in our implementation.
> 
>   [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
> 
> Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
> Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation
  2025-01-24 19:56 [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation Andrii Nakryiko
  2025-01-27 15:45 ` Stanislav Fomichev
@ 2025-01-27 22:27 ` Alexei Starovoitov
  2025-01-27 23:18   ` Andrii Nakryiko
  1 sibling, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2025-01-27 22:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kernel Team, syzbot+4dc041c686b7c816a71e

On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> We use map->freeze_mutex to prevent races between map_freeze() and
> memory mapping BPF map contents with writable permissions. The way we
> naively do this means we'll hold freeze_mutex for entire duration of all
> the mm and VMA manipulations, which is completely unnecessary. This can
> potentially also lead to deadlocks, as reported by syzbot in [0].
>
> So, instead, hold freeze_mutex only during writeability checks, bump
> (proactively) "write active" count for the map, unlock the mutex and
> proceed with mmap logic. And only if something went wrong during mmap
> logic, then undo that "write active" counter increment.
>
> Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
> mmaping, because we also have a logic that unsets VM_MAYWRITE
> forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on
> for read-only mmaping, but it won't be afterwards. VM_WRITE is
> a consistent way to detect writable mmaping in our implementation.

bpf_map_mmap_open/bpf_map_mmap_close use VM_MAYWRITE,

Do they need to change as well?

>   [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
>
> Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
> Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/syscall.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0daf098e3207..0d5b39e99770 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
>  static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>  {
>         struct bpf_map *map = filp->private_data;
> -       int err;
> +       int err = 0;
>
>         if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record))
>                 return -ENOTSUPP;
> @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>                         err = -EACCES;
>                         goto out;
>                 }
> +               bpf_map_write_active_inc(map);
>         }
> +out:
> +       mutex_unlock(&map->freeze_mutex);
> +       if (err)
> +               return err;
>
>         /* set default open/close callbacks */
>         vma->vm_ops = &bpf_map_default_vmops;
> @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
>                 vm_flags_clear(vma, VM_MAYWRITE);
>
>         err = map->ops->map_mmap(map, vma);
> -       if (err)
> -               goto out;
> +       if (err) {
> +               if (vma->vm_flags & VM_WRITE) {
> +                       mutex_lock(&map->freeze_mutex);
> +                       bpf_map_write_active_dec(map);
> +                       mutex_unlock(&map->freeze_mutex);

Extra lock/unlock looks unnecessary.

This functiona and map_freeze() need to see frozen and write_active coherent,
but write_active_dec looks like without mutex.
It's atomic64_dec.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation
  2025-01-27 22:27 ` Alexei Starovoitov
@ 2025-01-27 23:18   ` Andrii Nakryiko
  2025-01-28  0:47     ` Alexei Starovoitov
  0 siblings, 1 reply; 5+ messages in thread
From: Andrii Nakryiko @ 2025-01-27 23:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team, syzbot+4dc041c686b7c816a71e

On Mon, Jan 27, 2025 at 2:27 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > We use map->freeze_mutex to prevent races between map_freeze() and
> > memory mapping BPF map contents with writable permissions. The way we
> > naively do this means we'll hold freeze_mutex for entire duration of all
> > the mm and VMA manipulations, which is completely unnecessary. This can
> > potentially also lead to deadlocks, as reported by syzbot in [0].
> >
> > So, instead, hold freeze_mutex only during writeability checks, bump
> > (proactively) "write active" count for the map, unlock the mutex and
> > proceed with mmap logic. And only if something went wrong during mmap
> > logic, then undo that "write active" counter increment.
> >
> > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
> > mmaping, because we also have a logic that unsets VM_MAYWRITE
> > forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on
> > for read-only mmaping, but it won't be afterwards. VM_WRITE is
> > a consistent way to detect writable mmaping in our implementation.
>
> bpf_map_mmap_open/bpf_map_mmap_open use VM_MAYWRITE,
>
> Do they need to change as well?

So I didn't want to elaborate too much on this (because of
verboseness), but it is indeed non-obvious (I was confused by this for
a bit while working on the patch).

We have this piece of logic in the middle of bpf_map_mmap():

if (!(vma->vm_flags & VM_WRITE))
    vm_flags_clear(vma, VM_MAYWRITE);

After this point, VM_WRITE and VM_MAYWRITE are equivalent (when it
comes to BPF maps mmap-ing). I.e., if we have writable mapping, we'll
have both VM_WRITE and VM_MAYWRITE; if we have read-only mapping, we
won't have either. We can't have any other mix of those two.

bpf_map_write_active_inc() used to happen after this point, and so we
were checking VM_MAYWRITE, but I had to move
bpf_map_write_active_inc() before that point, so I switched to
VM_WRITE check.

bpf_map_mmap_open/bpf_map_mmap_close happen after this
vm_flags_clear(vma, VM_MAYWRITE), so whether they use VM_MAYWRITE or
VM_WRITE doesn't matter. So they should be fine as is.

It is confusing, though, I agree. So maybe we should just normalize
all the checks to VM_WRITE and leave a comment that MAYWRITE and WRITE
are coupled with our custom mmaping logic?

>
> >   [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
> >
> > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
> > Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/bpf/syscall.c | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 0daf098e3207..0d5b39e99770 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
> >  static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >  {
> >         struct bpf_map *map = filp->private_data;
> > -       int err;
> > +       int err = 0;
> >
> >         if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record))
> >                 return -ENOTSUPP;
> > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >                         err = -EACCES;
> >                         goto out;
> >                 }
> > +               bpf_map_write_active_inc(map);
> >         }
> > +out:
> > +       mutex_unlock(&map->freeze_mutex);
> > +       if (err)
> > +               return err;
> >
> >         /* set default open/close callbacks */
> >         vma->vm_ops = &bpf_map_default_vmops;
> > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >                 vm_flags_clear(vma, VM_MAYWRITE);
> >
> >         err = map->ops->map_mmap(map, vma);
> > -       if (err)
> > -               goto out;
> > +       if (err) {
> > +               if (vma->vm_flags & VM_WRITE) {
> > +                       mutex_lock(&map->freeze_mutex);
> > +                       bpf_map_write_active_dec(map);
> > +                       mutex_unlock(&map->freeze_mutex);
>
> Extra lock/unlock looks unnecessary.
>
> This functiona and map_freeze() need to see frozen and write_active coherent,
> but write_active_dec looks like without mutex.
> It's atomic64_dec.

Yep, I think you are right. I wanted a no-brainer change and not
having to think about any memory ordering effects or anything like
that. But seeing bpf_map_is_rdonly() checks this without any lock
anyways, I think we should be fine. I can drop this lock/unlock for
v2.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation
  2025-01-27 23:18   ` Andrii Nakryiko
@ 2025-01-28  0:47     ` Alexei Starovoitov
  0 siblings, 0 replies; 5+ messages in thread
From: Alexei Starovoitov @ 2025-01-28  0:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team, syzbot+4dc041c686b7c816a71e

On Mon, Jan 27, 2025 at 3:18 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jan 27, 2025 at 2:27 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > We use map->freeze_mutex to prevent races between map_freeze() and
> > > memory mapping BPF map contents with writable permissions. The way we
> > > naively do this means we'll hold freeze_mutex for entire duration of all
> > > the mm and VMA manipulations, which is completely unnecessary. This can
> > > potentially also lead to deadlocks, as reported by syzbot in [0].
> > >
> > > So, instead, hold freeze_mutex only during writeability checks, bump
> > > (proactively) "write active" count for the map, unlock the mutex and
> > > proceed with mmap logic. And only if something went wrong during mmap
> > > logic, then undo that "write active" counter increment.
> > >
> > > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
> > > mmaping, because we also have a logic that unsets VM_MAYWRITE
> > > forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on
> > > for read-only mmaping, but it won't be afterwards. VM_WRITE is
> > > a consistent way to detect writable mmaping in our implementation.
> >
> > bpf_map_mmap_open/bpf_map_mmap_open use VM_MAYWRITE,
> >
> > Do they need to change as well?
>
> So I didn't want to elaborate too much on this (because of
> verboseness), but it is indeed non-obvious (I was confused by this for
> a bit while working on the patch).
>
> We have this piece of logic in the middle of bpf_map_mmap():
>
> if (!(vma->vm_flags & VM_WRITE))
>     vm_flags_clear(vma, VM_MAYWRITE);
>
> After this point, VM_WRITE and VM_MAYWRITE are equivalent (when it
> comes to BPF maps mmap-ing). I.e., if we have writable mapping, we'll
> have both VM_WRITE and VM_MAYWRITE; if we have read-only mapping, we
> won't have either. We can't have any other mix of those two.
>
> bpf_map_write_active_inc() used to happen after this point, and so we
> were checking VM_MAYWRITE, but I had to move
> bpf_map_write_active_inc() before that point, so I switched to
> VM_WRITE check.
>
> bpf_map_mmap_open/bpf_map_mmap_close happen after this
> vm_flags_clear(vma, VM_MAYWRITE), so whether they use VM_MAYWRITE or
> VM_WRITE doesn't matter. So they should be fine as is.

I see. Yeah. I think this analysis is correct.
It seems to be fine as-is.

> It is confusing, though, I agree. So maybe we should just normalize
> all the checks to VM_WRITE and leave a comment that MAYWRITE and WRITE
> are coupled with our custom mmaping logic?

Yeah. That would be nice.

> >
> > >   [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/
> > >
> > > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
> > > Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  kernel/bpf/syscall.c | 20 +++++++++++++-------
> > >  1 file changed, 13 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 0daf098e3207..0d5b39e99770 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
> > >  static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >  {
> > >         struct bpf_map *map = filp->private_data;
> > > -       int err;
> > > +       int err = 0;
> > >
> > >         if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record))
> > >                 return -ENOTSUPP;
> > > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >                         err = -EACCES;
> > >                         goto out;
> > >                 }
> > > +               bpf_map_write_active_inc(map);
> > >         }
> > > +out:
> > > +       mutex_unlock(&map->freeze_mutex);
> > > +       if (err)
> > > +               return err;
> > >
> > >         /* set default open/close callbacks */
> > >         vma->vm_ops = &bpf_map_default_vmops;
> > > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >                 vm_flags_clear(vma, VM_MAYWRITE);
> > >
> > >         err = map->ops->map_mmap(map, vma);
> > > -       if (err)
> > > -               goto out;
> > > +       if (err) {
> > > +               if (vma->vm_flags & VM_WRITE) {
> > > +                       mutex_lock(&map->freeze_mutex);
> > > +                       bpf_map_write_active_dec(map);
> > > +                       mutex_unlock(&map->freeze_mutex);
> >
> > Extra lock/unlock looks unnecessary.
> >
> > This functiona and map_freeze() need to see frozen and write_active coherent,
> > but write_active_dec looks like without mutex.
> > It's atomic64_dec.
>
> Yep, I think you are right. I wanted a no-brainer change and not
> having to think about any memory ordering effects or anything like
> that. But seeing bpf_map_is_rdonly() checks this without any lock
> anyways, I think we should be fine. I can drop this lock/unlock for
> v2.

Not only that. map_update/delete do it as well.
So extra mutex_lock provokes questions like mine :)
So pls remove.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-01-28  0:47 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-24 19:56 [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation Andrii Nakryiko
2025-01-27 15:45 ` Stanislav Fomichev
2025-01-27 22:27 ` Alexei Starovoitov
2025-01-27 23:18   ` Andrii Nakryiko
2025-01-28  0:47     ` Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox