Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v4 2/6] bpf: introduce BPF dispatcher
From: Alexei Starovoitov @ 2022-08-15 15:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Björn Töpel, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Björn Töpel, bpf, Magnus Karlsson,
	Karlsson, Magnus, Jonathan Lemon, Edward Cree,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	Andrii Nakryiko, LKML, Linus Torvalds, Christoph Hellwig,
	Peter Zijlstra, Josh Poimboeuf, Thomas Gleixner, Andrew Morton
In-Reply-To: <20220815111658.58d75672@gandalf.local.home>

On Mon, Aug 15, 2022 at 8:16 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 15 Aug 2022 07:31:23 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >
> > > When I heard that ftrace was broken by BPF I thought it was something
> > > unique they were doing, but unfortunately, I didn't investigate what they
> > > were doing at the time.
> >
> > ftrace is still broken and refusing to accept the fact doesn't make it
> > non-broken.
>
> I extended Jiri's patch to make it work again.
>
> >
> > > Then they started sending me patches to hide fentry locations from ftrace.
> > > And even telling me that fentry != ftrace
> >
> > It sounds that you've invented nop5 and kernel's ability
> > to replace nop5 with a jump or call.
>
> Actually I did invent it.
>
>    https://lore.kernel.org/lkml/20080210072109.GR4100@elte.hu/
>
>
> I'm the one that introduced the code to convert mcount into the 5 byte nop,
> and did the research and development to make it work at run time. I had one
> hiccup along the way that caused the e1000e network card breakage.
>
> The "daemon" approach was horrible, and then I created the recordmcount.pl
> perl script to accomplish the same thing at compile time.
>
> > ftrace should really stop trying to own all of the kernel text rewrites.
> > It's in the way. Like this case.
>
> It's not trying to own all modifications (static_calls is not ftrace). But
> the code at the start of functions with fentry does belong to it.
>
> >
> > >    https://lore.kernel.org/all/CAADnVQJTT7h3MniVqdBEU=eLwvJhEKNLSjbUAK4sOrhN=zggCQ@mail.gmail.com/
> > >
> > > Even though fentry was created for ftrace
> > >
> > >    https://lore.kernel.org/lkml/1258720459.22249.1018.camel@gandalf.stny.rr.com/
> > >
> > > and all the work with fentry was for the ftrace infrastructure. Ftrace
> > > takes a lot of care for security and use cases for other users (like
> > > live kernel patching). But BPF has the NIH syndrome, and likes to own
> > > everything and recreate the wheel so that they have full control.
> > >
> > > > of the trampoline. One dispatcher instance currently supports up to 64
> > > > dispatch points. A user creates a dispatcher with its corresponding
> > > > trampoline with the DEFINE_BPF_DISPATCHER macro.
> > >
> > > Anyway, this patch just looks like a re-implementation of static_calls:
> >
> > It was implemented long before static_calls made it to the kernel
> > and it's different. Please do your home work.
>
> Long before? This code made it into the kernel in Dec 2019. Yes static calls
> finally made it into the kernel in 2020, but it was first introduced in
> October 2018:
>
>   https://lore.kernel.org/all/20181006015110.653946300@goodmis.org/
>
> If you had Cc'd us on this patch, we could have collaborated and come up
> with something that would have worked for you.
>
> It took time to get in because we don't just push our features in, we make
> sure that they are generic and work for others, and is secure and robust.
>
> I sent a proof of concept, Josh took over, Linus had issues, and finally
> Peter pushed it through the gate. It's a long process, but we don't break
> others code while doing it!

Replied in the other thread. Let's stick to one thread please.

^ permalink raw reply

* Re: [PATCH bpf-next v4 2/6] bpf: introduce BPF dispatcher
From: Steven Rostedt @ 2022-08-15 15:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Björn Töpel, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Björn Töpel, bpf, Magnus Karlsson,
	Karlsson, Magnus, Jonathan Lemon, Edward Cree,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	Andrii Nakryiko, LKML, Linus Torvalds, Christoph Hellwig,
	Peter Zijlstra, Josh Poimboeuf, Thomas Gleixner, Andrew Morton
In-Reply-To: <20220815111658.58d75672@gandalf.local.home>

On Mon, 15 Aug 2022 11:16:58 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Actually I did invent it.
> 
>    https://lore.kernel.org/lkml/20080210072109.GR4100@elte.hu/

And the next patch replaced the jmps with nops. We kept this as separate
patches for debugging reasons.

 commit dfa60aba04dae78 ("ftrace: use nops instead of jmp")

-- Steve

^ permalink raw reply

* Re: [PATCH stable 4.9.x] Revert "net: usb: ax88179_178a needs FLAG_SEND_ZLP"
From: Greg Kroah-Hartman @ 2022-08-15 15:31 UTC (permalink / raw)
  To: Jose Alonso; +Cc: David S . Miller, netdev, stable, Ronald Wahl, Jakub Kicinski
In-Reply-To: <20220815151912.319147-1-joalonsof@gmail.com>

On Mon, Aug 15, 2022 at 12:19:12PM -0300, Jose Alonso wrote:
> commit 6fd2c17fb6e02a8c0ab51df1cfec82ce96b8e83d upstream.
> 
> This reverts commit 36a15e1cb134c0395261ba1940762703f778438c.
> 
> The usage of FLAG_SEND_ZLP causes problems to other firmware/hardware
> versions that have no issues.
> 
> The FLAG_SEND_ZLP is not safe to use in this context.
> See:
> https://patchwork.ozlabs.org/project/netdev/patch/1270599787.8900.8.camel@Linuxdev4-laptop/#118378
> The original problem needs another way to solve.
> 
> Fixes: 36a15e1cb134 ("net: usb: ax88179_178a needs FLAG_SEND_ZLP")
> Cc: stable@vger.kernel.org
> Reported-by: Ronald Wahl <ronald.wahl@raritan.com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216327
> Link: https://bugs.archlinux.org/task/75491
> Signed-off-by: Jose Alonso <joalonsof@gmail.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  drivers/net/usb/ax88179_178a.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)

All now queued up, thanks.

greg k-h

^ permalink raw reply

* Re: [PATCHv2 bpf-next] libbpf: making bpf_prog_load() ignore name if kernel doesn't support
From: Quentin Monnet @ 2022-08-15 15:36 UTC (permalink / raw)
  To: Hangbin Liu, netdev
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf
In-Reply-To: <20220813000936.6464-1-liuhangbin@gmail.com>

On 13/08/2022 01:09, Hangbin Liu wrote:
> Similar with commit 10b62d6a38f7 ("libbpf: Add names for auxiliary maps"),
> let's make bpf_prog_load() also ignore name if kernel doesn't support
> program name.
> 
> To achieve this, we need to call sys_bpf_prog_load() directly in
> probe_kern_prog_name() to avoid circular dependency. sys_bpf_prog_load()
> also need to be exported in the libbpf_internal.h file.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
> v2: move sys_bpf_prog_load definition to libbpf_internal.h. memset attr
>     to 0 specifically to aviod padding.
> ---
>  tools/lib/bpf/bpf.c             |  6 ++----
>  tools/lib/bpf/libbpf.c          | 12 ++++++++++--
>  tools/lib/bpf/libbpf_internal.h |  3 +++
>  3 files changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 6a96e665dc5d..575867d69496 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -84,9 +84,7 @@ static inline int sys_bpf_fd(enum bpf_cmd cmd, union bpf_attr *attr,
>  	return ensure_good_fd(fd);
>  }
>  
> -#define PROG_LOAD_ATTEMPTS 5
> -
> -static inline int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts)
> +int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts)
>  {
>  	int fd;
>  
> @@ -263,7 +261,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
>  	attr.prog_ifindex = OPTS_GET(opts, prog_ifindex, 0);
>  	attr.kern_version = OPTS_GET(opts, kern_version, 0);
>  
> -	if (prog_name)
> +	if (prog_name && kernel_supports(NULL, FEAT_PROG_NAME))
>  		libbpf_strlcpy(attr.prog_name, prog_name, sizeof(attr.prog_name));
>  	attr.license = ptr_to_u64(license);
>  
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 3f01f5cd8a4c..4a351897bdcc 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -4419,10 +4419,18 @@ static int probe_kern_prog_name(void)
>  		BPF_MOV64_IMM(BPF_REG_0, 0),
>  		BPF_EXIT_INSN(),
>  	};
> -	int ret, insn_cnt = ARRAY_SIZE(insns);
> +	union bpf_attr attr;
> +	int ret;
> +
> +	memset(&attr, 0, sizeof(attr));
> +	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> +	attr.license = ptr_to_u64("GPL");
> +	attr.insns = ptr_to_u64(insns);
> +	attr.insn_cnt = (__u32)ARRAY_SIZE(insns);
> +	libbpf_strlcpy(attr.prog_name, "test", sizeof(attr.prog_name));
>  
>  	/* make sure loading with name works */
> -	ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, "test", "GPL", insns, insn_cnt, NULL);
> +	ret = sys_bpf_prog_load(&attr, sizeof(attr), PROG_LOAD_ATTEMPTS);
>  	return probe_fd(ret);
>  }
>  
> diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
> index 4135ae0a2bc3..377642ff51fc 100644
> --- a/tools/lib/bpf/libbpf_internal.h
> +++ b/tools/lib/bpf/libbpf_internal.h
> @@ -573,4 +573,7 @@ static inline bool is_pow_of_2(size_t x)
>  	return x && (x & (x - 1)) == 0;
>  }
>  
> +#define PROG_LOAD_ATTEMPTS 5
> +int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts);
> +
>  #endif /* __LIBBPF_LIBBPF_INTERNAL_H */

Looks good to me, thanks!

Acked-by: Quentin Monnet <quentin@isovalent.com>

^ permalink raw reply

* Re: [PATCH bpf] xsk: fix corrupted packets for XDP_SHARED_UMEM
From: patchwork-bot+netdevbpf @ 2022-08-15 15:36 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: magnus.karlsson, bjorn, ast, daniel, netdev, maciej.fijalkowski,
	jonathan.lemon, bpf, alasdair.mcwilliam, dnevil
In-Reply-To: <20220812113259.531-1-magnus.karlsson@gmail.com>

Hello:

This patch was applied to bpf/bpf.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Fri, 12 Aug 2022 13:32:59 +0200 you wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Fix an issue in XDP_SHARED_UMEM mode together with aligned mode were
> packets are corrupted for the second and any further sockets bound to
> the same umem. In other words, this does not affect the first socket
> bound to the umem. The culprit for this bug is that the initialization
> of the DMA addresses for the pre-populated xsk buffer pool entries was
> not performed for any socket but the first one bound to the umem. Only
> the linear array of DMA addresses was populated. Fix this by
> populating the DMA addresses in the xsk buffer pool for every socket
> bound to the same umem.
> 
> [...]

Here is the summary with links:
  - [bpf] xsk: fix corrupted packets for XDP_SHARED_UMEM
    https://git.kernel.org/bpf/bpf/c/58ca14ed98c8

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: Patch "skbuff: don't mix ubuf_info from different sources" has been added to the 5.19-stable tree
From: Jakub Kicinski @ 2022-08-15 15:37 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Pavel Begunkov, stable-commits, David S. Miller, Eric Dumazet,
	Paolo Abeni, netdev@vger.kernel.org
In-Reply-To: <4c748c4f-e3a6-8325-fa34-ad6f056830b9@gmail.com>

On Mon, 15 Aug 2022 13:30:26 +0100 Pavel Begunkov wrote:
> On 8/13/22 21:29, Sasha Levin wrote:
> > This is a note to let you know that I've just added the patch titled
> > 
> >      skbuff: don't mix ubuf_info from different sources
> > 
> > to the 5.19-stable tree which can be found at:
> >      http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary
> > 
> > The filename of the patch is:
> >       skbuff-don-t-mix-ubuf_info-from-different-sources.patch
> > and it can be found in the queue-5.19 subdirectory.
> > 
> > If you, or anyone else, feels it should not be added to the stable tree,
> > please let <stable@vger.kernel.org> know about it.  
> 
> It doesn't hurt but we don't need it in 5.19, added only because of
> 5.20 io_uring zerocopy send work.

Concerning, I already said this doesn't need to be backported,
something's not working here, Sasha...

https://lore.kernel.org/r/20220808114451.78686a5b@kernel.org/

^ permalink raw reply

* Re: [PATCH bpf-next v4 2/6] bpf: introduce BPF dispatcher
From: Steven Rostedt @ 2022-08-15 15:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Björn Töpel, Network Development, Alexei Starovoitov,
	Daniel Borkmann, Björn Töpel, bpf, Magnus Karlsson,
	Karlsson, Magnus, Jonathan Lemon, Edward Cree,
	Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	Andrii Nakryiko, LKML, Linus Torvalds, Christoph Hellwig,
	Peter Zijlstra, Josh Poimboeuf, Thomas Gleixner, Andrew Morton
In-Reply-To: <20220815111658.58d75672@gandalf.local.home>

On Mon, 15 Aug 2022 11:16:58 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > It sounds that you've invented nop5 and kernel's ability
> > to replace nop5 with a jump or call.  
> 
> Actually I did invent it.
> 
>    https://lore.kernel.org/lkml/20080210072109.GR4100@elte.hu/
> 
> 
> I'm the one that introduced the code to convert mcount into the 5 byte nop,
> and did the research and development to make it work at run time. I had one
> hiccup along the way that caused the e1000e network card breakage.
> 
> The "daemon" approach was horrible, and then I created the recordmcount.pl
> perl script to accomplish the same thing at compile time.

I guess you were not paying attention to my talk at the Kernel Recipes I
invited you to in 2019. The talk I gave was the history of how fentry came
about.

  https://kernel-recipes.org/en/2019/talks/ftrace-where-modifying-a-running-kernel-all-started/

 ;-)

-- Steve

^ permalink raw reply

* Re: [PATCH v4 0/4] Introduce security_create_user_ns()
From: Serge E. Hallyn @ 2022-08-15 15:41 UTC (permalink / raw)
  To: Paul Moore
  Cc: Serge E. Hallyn, Eric W. Biederman, Frederick Lawler, kpsingh,
	revest, jackmanb, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, jmorris, stephen.smalley.work, eparis, shuah,
	brauner, casey, bpf, linux-security-module, selinux,
	linux-kselftest, linux-kernel, netdev, kernel-team, cgzones, karl
In-Reply-To: <CAHC9VhRSCXCM51xpOT95G_WVi=UQ44gNV=uvvG23p8wn16uYSA@mail.gmail.com>

On Sun, Aug 14, 2022 at 10:32:51PM -0400, Paul Moore wrote:
> On Sun, Aug 14, 2022 at 11:55 AM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Mon, Aug 08, 2022 at 03:16:16PM -0400, Paul Moore wrote:
> > > On Mon, Aug 8, 2022 at 2:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > > Paul Moore <paul@paul-moore.com> writes:
> > > > > On Mon, Aug 1, 2022 at 10:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > > >> Frederick Lawler <fred@cloudflare.com> writes:
> > > > >>
> > > > >> > While creating a LSM BPF MAC policy to block user namespace creation, we
> > > > >> > used the LSM cred_prepare hook because that is the closest hook to prevent
> > > > >> > a call to create_user_ns().
> > > > >>
> > > > >> Re-nack for all of the same reasons.
> > > > >> AKA This can only break the users of the user namespace.
> > > > >>
> > > > >> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > >>
> > > > >> You aren't fixing what your problem you are papering over it by denying
> > > > >> access to the user namespace.
> > > > >>
> > > > >> Nack Nack Nack.
> > > > >>
> > > > >> Stop.
> > > > >>
> > > > >> Go back to the drawing board.
> > > > >>
> > > > >> Do not pass go.
> > > > >>
> > > > >> Do not collect $200.
> > > > >
> > > > > If you want us to take your comments seriously Eric, you need to
> > > > > provide the list with some constructive feedback that would allow
> > > > > Frederick to move forward with a solution to the use case that has
> > > > > been proposed.  You response above may be many things, but it is
> > > > > certainly not that.
> > > >
> > > > I did provide constructive feedback.  My feedback to his problem
> > > > was to address the real problem of bugs in the kernel.
> > >
> > > We've heard from several people who have use cases which require
> > > adding LSM-level access controls and observability to user namespace
> > > creation.  This is the problem we are trying to solve here; if you do
> > > not like the approach proposed in this patchset please suggest another
> > > implementation that allows LSMs visibility into user namespace
> > > creation.
> >
> > Regarding the observability - can someone concisely lay out why just
> > auditing userns creation would not suffice?  Userspace could decide
> > what to report based on whether the creating user_ns == /proc/1/ns/user...
> 
> One of the selling points of the BPF LSM is that it allows for various
> different ways of reporting and logging beyond audit.  However, even
> if it was limited to just audit I believe that provides some useful
> justification as auditing fork()/clone() isn't quite the same and
> could be difficult to do at scale in some configurations.  I haven't
> personally added a BPF LSM program to the kernel so I can't speak to
> the details on what is possible, but I'm sure others on the To/CC line
> could help provide more information if that is important to you.
> 
> > Regarding limiting the tweaking of otherwise-privileged code by
> > unprivileged users, i wonder whether we could instead add smarts to
> > ns_capable().
> 
> The existing security_capable() hook is eventually called by ns_capable():
> 
>   ns_capable()
>     ns_capable_common()
>       security_capable(const struct cred *cred,
>                        struct user_namespace *ns,
>                        int cap,
>                        unsigned int opts);
> 
> ... I'm not sure what additional smarts would be useful here?

Oh - i wasn't necessarily thinking of an LSM.  I was picturing a
sysctl next to unprivileged_userns_clone.  But you're right, looks
like an LSM could already do this.  Of course, there's an issue early
on in that the root user in the new namespace couldn't setuid, so
the uid mapping is still limited.  So this idea probably isn't worth
the characters we've typed about it so far, sorry.

> [side note: SELinux does actually distinguish between capability
> checks in the initial user namespace vs child namespaces]
> 
> > Point being, uid mapping would still work, but we'd
> > break the "privileged against resources you own" part of user
> > namespaces.  I would want it to default to allow, but then when a
> > 0-day is found which requires reaching ns_capable() code, admins
> > could easily prevent exploitation until reboot from a fixed kernel.
> 
> That assumes that everything you care about is behind a capability
> check, which is probably going to be correct in a lot of the cases,
> but I think it would be a mistake to assume that is always going to be
> true.

I might be thinking wrongly, but if it's not behind a capability check,
then it seems to me it's not an exploit that can only be reached by
becoming root in a user namespace, which means disabling user namespace
creation by unprivileged users will not stop the attack.

^ permalink raw reply

* Re: upstream kernel crashes
From: Michael S. Tsirkin @ 2022-08-15 15:40 UTC (permalink / raw)
  To: Xuan Zhuo, Jason Wang, Andres Freund, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, linux-kernel, Greg KH, c
In-Reply-To: <FCDC5DDE-3CDD-4B8A-916F-CA7D87B547CE@anarazel.de>

On Mon, Aug 15, 2022 at 01:34:41AM -0700, Andres Freund wrote:
> Hi, 
> 
> On August 15, 2022 1:28:29 AM PDT, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >On Mon, Aug 15, 2022 at 01:15:27AM -0700, Andres Freund wrote:
> >> Hi,
> >> 
> >> On 2022-08-15 03:51:34 -0400, Michael S. Tsirkin wrote:
> >> > It is possible that GCP gets confused if ring size is smaller than the
> >> > device maximum simply because no one did it in the past.
> >> > 
> >> > So I pushed just the revert of 762faee5a267 to the test branch.
> >> > Could you give it a spin?
> >> 
> >> Seems to fix the issue, at least to the extent I can determine at 1am... :)
> >> 
> >> Greetings,
> >> 
> >> Andres Freund
> >
> >So you tested this:
> >
> >commit 13df5a7eaeb22561d39354b576bc98a7e2c389f9 (HEAD, kernel.org/test)
> >Author: Michael S. Tsirkin <mst@redhat.com>
> >Date:   Mon Aug 15 03:44:38 2022 -0400
> >
> >    Revert "virtio_net: set the default max ring size by find_vqs()"
> >    
> >    This reverts commit 762faee5a2678559d3dc09d95f8f2c54cd0466a7.
> >    
> >    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> >and it fixes both issues right? No crashes no networking issue?
> 
> Correct. I only did limited testing, but it's survived far longer / more reboots than anything since the commit.
> 
> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.


OK so this gives us a quick revert as a solution for now.
Next, I would appreciate it if you just try this simple hack.
If it crashes we either have a long standing problem in virtio
code or more likely a gcp bug where it can't handle smaller
rings than what device requestes.
Thanks!

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index f7965c5dd36b..bdd5f481570b 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -314,6 +314,9 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 	if (!size || size > num)
 		size = num;
 
+	if (size > 1024)
+		size = 1024;
+
 	if (size & (size - 1)) {
 		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", size);
 		return ERR_PTR(-EINVAL);


-- 
MST


^ permalink raw reply related

* Re: [PATCH 2/2] vDPA: conditionally read fields in virtio-net dev
From: Michael S. Tsirkin @ 2022-08-15 15:52 UTC (permalink / raw)
  To: Zhu Lingshan
  Cc: jasowang, virtualization, netdev, kvm, parav, xieyongji,
	gautam.dawar
In-Reply-To: <20220815092638.504528-3-lingshan.zhu@intel.com>

On Mon, Aug 15, 2022 at 05:26:38PM +0800, Zhu Lingshan wrote:
> Some fields of virtio-net device config space are
> conditional on the feature bits, the spec says:
> 
> "The mac address field always exists
> (though is only valid if VIRTIO_NET_F_MAC is set)"
> 
> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
> or VIRTIO_NET_F_RSS is set"
> 
> "mtu only exists if VIRTIO_NET_F_MTU is set"
> 
> so we should read MTU, MAC and MQ in the device config
> space only when these feature bits are offered.
> 
> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
> not set, the virtio device should have
> one queue pair as default value, so when userspace querying queue pair numbers,
> it should return mq=1 than zero.
> 
> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
> MTU from the device config sapce.
> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet Networks>
> says:"The minimum length of the data field of a packet sent over an
> Ethernet is 1500 octets, thus the maximum length of an IP datagram
> sent over an Ethernet is 1500 octets.  Implementations are encouraged
> to support full-length packets"
> 
> virtio spec says:"The virtio network device is a virtual ethernet card",
> so the default MTU value should be 1500 for virtio-net.
> 
> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
> the configuration space mac entry indicates the “physical” address
> of the network card, otherwise the driver would typically
> generate a random local MAC address." So there is no
> default MAC address if VIRTIO_NET_F_MAC not set.
> 
> This commits introduces functions vdpa_dev_net_mtu_config_fill()
> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
> It also fixes vdpa_dev_net_mq_config_fill() to report correct
> MQ when _F_MQ is not present.
> 
> These functions should check devices features than driver
> features, and struct vdpa_device is not needed as a parameter
> 
> The test & userspace tool output:
> 
> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
> and VIRTIO_NET_F_MAC can be mask out by hardcode.
> 
> However, it is challenging to "disable" the related fields
> in the HW device config space, so let's just assume the values
> are meaningless if the feature bits are not set.
> 
> Before this change, when feature bits for RSS, MQ, MTU and MAC
> are not set, iproute2 output:
> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>   negotiated_features

where does it get 1500? what if there's e.g. 0 in the mtu field?

> without this commit, function vdpa_dev_net_config_fill()
> reads all config space fields unconditionally, so let's
> assume the MAC and MTU are meaningless, and it checks
> MQ with driver_features, so we don't see max_vq_pairs.
> 
> After applying this commit, when feature bits for
> MQ, RSS, MAC and MTU are not set,iproute2 output:
> $vdpa dev config show vdpa0
> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>   negotiated_features
> 
> As explained above:
> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
> and there is no default value for MAC. It shows
> max_vq_paris = 1 because even without MQ feature,
> a functional virtio-net must have one queue pair.
> mtu = 1500 is the default value as ethernet
> required.
> 
> This commit also add supplementary comments for
> __virtio16_to_cpu(true, xxx) operations in
> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>  drivers/vdpa/vdpa.c | 60 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 47 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index efb55a06e961..a74660b98979 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct netlink_callba
>  	return msg->len;
>  }
>  
> -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
> -				       struct sk_buff *msg, u64 features,
> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 features,
>  				       const struct virtio_net_config *config)
>  {
>  	u16 val_u16;
>  
> -	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
> -		return 0;
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
> +	    (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
> +		val_u16 = 1;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
>  
> -	val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>  	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>  }
>  
> +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	u16 val_u16;
> +
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> +		val_u16 = 1500;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> +
> +	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
> +}
> +
> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
> +		return 0;
> +	else
> +		return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
> +				sizeof(config->mac), config->mac);
> +}
> +
> +
>  static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
>  {
>  	struct virtio_net_config config = {};
> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>  
>  	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>  
> -	if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, sizeof(config.mac),
> -		    config.mac))
> -		return -EMSGSIZE;
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
> +	 */
> +	val_u16 = __virtio16_to_cpu(true, config.status);
>  
>  	val_u16 = __virtio16_to_cpu(true, config.status);
>  	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>  		return -EMSGSIZE;
>  
> -	val_u16 = __virtio16_to_cpu(true, config.mtu);
> -	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
> -		return -EMSGSIZE;
> -
>  	features_driver = vdev->config->get_driver_features(vdev);
>  	if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
>  			      VDPA_ATTR_PAD))
> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>  			      VDPA_ATTR_PAD))
>  		return -EMSGSIZE;
>  
> -	return vdpa_dev_net_mq_config_fill(vdev, msg, features_driver, &config);
> +	if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>  }
>  
>  static int
> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct vdpa_device *vdev, struct sk_buff *msg,
>  	}
>  	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>  
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
> +	 */
> +
>  	max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>  	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>  		return -EMSGSIZE;
> -- 
> 2.31.1


^ permalink raw reply

* [PATCH v5 0/4] Introduce security_create_user_ns()
From: Frederick Lawler @ 2022-08-15 16:20 UTC (permalink / raw)
  To: kpsingh, revest, jackmanb, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, jmorris, serge, paul,
	stephen.smalley.work, eparis, shuah, brauner, casey, ebiederm,
	bpf, linux-security-module, selinux, linux-kselftest
  Cc: linux-kernel, netdev, kernel-team, cgzones, karl, tixxdz,
	Frederick Lawler

While user namespaces do not make the kernel more vulnerable, they are however
used to initiate exploits. Some users do not want to block namespace creation
for the entirety of the system, which some distributions provide. Instead, we
needed a way to have some applications be blocked, and others allowed. This is
not possible with those tools. Managing hierarchies also did not fit our case
because we're determining which tasks are allowed based on their attributes.

While exploring a solution, we first leveraged the LSM cred_prepare hook
because that is the closest hook to prevent a call to create_user_ns().

The calls look something like this:

    cred = prepare_creds()
        security_prepare_creds()
            call_int_hook(cred_prepare, ...
    if (cred)
        create_user_ns(cred)

We noticed that error codes were not propagated from this hook and
introduced a patch [1] to propagate those errors.

The discussion notes that security_prepare_creds() is not appropriate for
MAC policies, and instead the hook is meant for LSM authors to prepare
credentials for mutation. [2]

Additionally, cred_prepare hook is not without problems. Handling the clone3
case is a bit more tricky due to the user space pointer passed to it. This
makes checking the syscall subject to a possible TOCTTOU attack.

Ultimately, we concluded that a better course of action is to introduce
a new security hook for LSM authors. [3]

This patch set first introduces a new security_create_user_ns() function
and userns_create LSM hook, then marks the hook as sleepable in BPF. The
following patches after include a BPF test and a patch for an SELinux
implementation.

We want to encourage use of user namespaces, and also cater the needs
of users/administrators to observe and/or control access. There is no
expectation of an impact on user space applications because access control 
is opt-in, and users wishing to observe within a LSM context 


Links:
1. https://lore.kernel.org/all/20220608150942.776446-1-fred@cloudflare.com/
2. https://lore.kernel.org/all/87y1xzyhub.fsf@email.froward.int.ebiederm.org/
3. https://lore.kernel.org/all/9fe9cd9f-1ded-a179-8ded-5fde8960a586@cloudflare.com/

Past discussions:
V4: https://lore.kernel.org/all/20220801180146.1157914-1-fred@cloudflare.com/
V3: https://lore.kernel.org/all/20220721172808.585539-1-fred@cloudflare.com/
V2: https://lore.kernel.org/all/20220707223228.1940249-1-fred@cloudflare.com/
V1: https://lore.kernel.org/all/20220621233939.993579-1-fred@cloudflare.com/

Changes since v4:
- Update commit description
- Update cover letter
Changes since v3:
- Explicitly set CAP_SYS_ADMIN to test namespace is created given
  permission
- Simplify BPF test to use sleepable hook only
- Prefer unshare() over clone() for tests
Changes since v2:
- Rename create_user_ns hook to userns_create
- Use user_namespace as an object opposed to a generic namespace object
- s/domB_t/domA_t in commit message
Changes since v1:
- Add selftests/bpf: Add tests verifying bpf lsm create_user_ns hook patch
- Add selinux: Implement create_user_ns hook patch
- Change function signature of security_create_user_ns() to only take
  struct cred
- Move security_create_user_ns() call after id mapping check in
  create_user_ns()
- Update documentation to reflect changes

Frederick Lawler (4):
  security, lsm: Introduce security_create_user_ns()
  bpf-lsm: Make bpf_lsm_userns_create() sleepable
  selftests/bpf: Add tests verifying bpf lsm userns_create hook
  selinux: Implement userns_create hook

 include/linux/lsm_hook_defs.h                 |   1 +
 include/linux/lsm_hooks.h                     |   4 +
 include/linux/security.h                      |   6 ++
 kernel/bpf/bpf_lsm.c                          |   1 +
 kernel/user_namespace.c                       |   5 +
 security/security.c                           |   5 +
 security/selinux/hooks.c                      |   9 ++
 security/selinux/include/classmap.h           |   2 +
 .../selftests/bpf/prog_tests/deny_namespace.c | 102 ++++++++++++++++++
 .../selftests/bpf/progs/test_deny_namespace.c |  33 ++++++
 10 files changed, 168 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/deny_namespace.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_deny_namespace.c

-- 
2.30.2


^ permalink raw reply

* [PATCH v5 1/4] security, lsm: Introduce security_create_user_ns()
From: Frederick Lawler @ 2022-08-15 16:20 UTC (permalink / raw)
  To: kpsingh, revest, jackmanb, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, jmorris, serge, paul,
	stephen.smalley.work, eparis, shuah, brauner, casey, ebiederm,
	bpf, linux-security-module, selinux, linux-kselftest
  Cc: linux-kernel, netdev, kernel-team, cgzones, karl, tixxdz,
	Frederick Lawler
In-Reply-To: <20220815162028.926858-1-fred@cloudflare.com>

User namespaces are an effective tool to allow programs to run with
permission without requiring the need for a program to run as root. User
namespaces may also be used as a sandboxing technique. However, attackers
sometimes leverage user namespaces as an initial attack vector to perform
some exploit. [1,2,3]

While it is not the unprivileged user namespace functionality, which
causes the kernel to be exploitable, users/administrators might want to
more granularly limit or at least monitor how various processes use this
functionality, while vulnerable kernel subsystems are being patched.

Preventing user namespace already creation comes in a few of forms in
order of granularity:

        1. /proc/sys/user/max_user_namespaces sysctl
        2. Distro specific patch(es)
        3. CONFIG_USER_NS

To block a task based on its attributes, the LSM hook cred_prepare is a
decent candidate for use because it provides more granular control, and
it is called before create_user_ns():

        cred = prepare_creds()
                security_prepare_creds()
                        call_int_hook(cred_prepare, ...
        if (cred)
                create_user_ns(cred)

Since security_prepare_creds() is meant for LSMs to copy and prepare
credentials, access control is an unintended use of the hook. [4]
Further, security_prepare_creds() will always return a ENOMEM if the
hook returns any non-zero error code.

This hook also does not handle the clone3 case which requires us to
access a user space pointer to know if we're in the CLONE_NEW_USER
call path which may be subject to a TOCTTOU attack.

Lastly, cred_prepare is called in many call paths, and a targeted hook
further limits the frequency of calls which is a beneficial outcome.
Therefore introduce a new function security_create_user_ns() with an
accompanying userns_create LSM hook.

With the new userns_create hook, users will have more control over the
observability and access control over user namespace creation. Users
should expect that normal operation of user namespaces will behave as
usual, and only be impacted when controls are implemented by users or
administrators.

This hook takes the prepared creds for LSM authors to write policy
against. On success, the new namespace is applied to credentials,
otherwise an error is returned.

Links:
1. https://nvd.nist.gov/vuln/detail/CVE-2022-0492
2. https://nvd.nist.gov/vuln/detail/CVE-2022-25636
3. https://nvd.nist.gov/vuln/detail/CVE-2022-34918
4. https://lore.kernel.org/all/1c4b1c0d-12f6-6e9e-a6a3-cdce7418110c@schaufler-ca.com/

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Frederick Lawler <fred@cloudflare.com>

---
Changes since v4:
- Update commit description
Changes since v3:
- No changes
Changes since v2:
- Rename create_user_ns hook to userns_create
Changes since v1:
- Changed commit wording
- Moved execution to be after id mapping check
- Changed signature to only accept a const struct cred *
---
 include/linux/lsm_hook_defs.h | 1 +
 include/linux/lsm_hooks.h     | 4 ++++
 include/linux/security.h      | 6 ++++++
 kernel/user_namespace.c       | 5 +++++
 security/security.c           | 5 +++++
 5 files changed, 21 insertions(+)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 806448173033..aa7272e83626 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -224,6 +224,7 @@ LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
 	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
 LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,
 	 struct inode *inode)
+LSM_HOOK(int, 0, userns_create, const struct cred *cred)
 LSM_HOOK(int, 0, ipc_permission, struct kern_ipc_perm *ipcp, short flag)
 LSM_HOOK(void, LSM_RET_VOID, ipc_getsecid, struct kern_ipc_perm *ipcp,
 	 u32 *secid)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 84a0d7e02176..2e11a2a22ed1 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -806,6 +806,10 @@
  *	security attributes, e.g. for /proc/pid inodes.
  *	@p contains the task_struct for the task.
  *	@inode contains the inode structure for the inode.
+ * @userns_create:
+ *	Check permission prior to creating a new user namespace.
+ *	@cred points to prepared creds.
+ *	Return 0 if successful, otherwise < 0 error code.
  *
  * Security hooks for Netlink messaging.
  *
diff --git a/include/linux/security.h b/include/linux/security.h
index 1bc362cb413f..767802fe9bfa 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -437,6 +437,7 @@ int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
 int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
 void security_task_to_inode(struct task_struct *p, struct inode *inode);
+int security_create_user_ns(const struct cred *cred);
 int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag);
 void security_ipc_getsecid(struct kern_ipc_perm *ipcp, u32 *secid);
 int security_msg_msg_alloc(struct msg_msg *msg);
@@ -1194,6 +1195,11 @@ static inline int security_task_prctl(int option, unsigned long arg2,
 static inline void security_task_to_inode(struct task_struct *p, struct inode *inode)
 { }
 
+static inline int security_create_user_ns(const struct cred *cred)
+{
+	return 0;
+}
+
 static inline int security_ipc_permission(struct kern_ipc_perm *ipcp,
 					  short flag)
 {
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 5481ba44a8d6..3f464bbda0e9 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
 #include <linux/highuid.h>
 #include <linux/cred.h>
 #include <linux/securebits.h>
+#include <linux/security.h>
 #include <linux/keyctl.h>
 #include <linux/key-type.h>
 #include <keys/user-type.h>
@@ -113,6 +114,10 @@ int create_user_ns(struct cred *new)
 	    !kgid_has_mapping(parent_ns, group))
 		goto fail_dec;
 
+	ret = security_create_user_ns(new);
+	if (ret < 0)
+		goto fail_dec;
+
 	ret = -ENOMEM;
 	ns = kmem_cache_zalloc(user_ns_cachep, GFP_KERNEL);
 	if (!ns)
diff --git a/security/security.c b/security/security.c
index 14d30fec8a00..1e60c4b570ec 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1909,6 +1909,11 @@ void security_task_to_inode(struct task_struct *p, struct inode *inode)
 	call_void_hook(task_to_inode, p, inode);
 }
 
+int security_create_user_ns(const struct cred *cred)
+{
+	return call_int_hook(userns_create, 0, cred);
+}
+
 int security_ipc_permission(struct kern_ipc_perm *ipcp, short flag)
 {
 	return call_int_hook(ipc_permission, 0, ipcp, flag);
-- 
2.30.2


^ permalink raw reply related

* [PATCH v5 2/4] bpf-lsm: Make bpf_lsm_userns_create() sleepable
From: Frederick Lawler @ 2022-08-15 16:20 UTC (permalink / raw)
  To: kpsingh, revest, jackmanb, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, jmorris, serge, paul,
	stephen.smalley.work, eparis, shuah, brauner, casey, ebiederm,
	bpf, linux-security-module, selinux, linux-kselftest
  Cc: linux-kernel, netdev, kernel-team, cgzones, karl, tixxdz,
	Frederick Lawler
In-Reply-To: <20220815162028.926858-1-fred@cloudflare.com>

Users may want to audit calls to security_create_user_ns() and access
user space memory. Also create_user_ns() runs without
pagefault_disabled(). Therefore, make bpf_lsm_userns_create() sleepable
for mandatory access control policies.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Frederick Lawler <fred@cloudflare.com>

---
Changes since v4:
- None
Changes since v3:
- None
Changes since v2:
- Rename create_user_ns hook to userns_create
Changes since v1:
- None
---
 kernel/bpf/bpf_lsm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index fa71d58b7ded..761998fda762 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -335,6 +335,7 @@ BTF_ID(func, bpf_lsm_task_getsecid_obj)
 BTF_ID(func, bpf_lsm_task_prctl)
 BTF_ID(func, bpf_lsm_task_setscheduler)
 BTF_ID(func, bpf_lsm_task_to_inode)
+BTF_ID(func, bpf_lsm_userns_create)
 BTF_SET_END(sleepable_lsm_hooks)
 
 bool bpf_lsm_is_sleepable_hook(u32 btf_id)
-- 
2.30.2


^ permalink raw reply related

* [PATCH v5 3/4] selftests/bpf: Add tests verifying bpf lsm userns_create hook
From: Frederick Lawler @ 2022-08-15 16:20 UTC (permalink / raw)
  To: kpsingh, revest, jackmanb, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, jmorris, serge, paul,
	stephen.smalley.work, eparis, shuah, brauner, casey, ebiederm,
	bpf, linux-security-module, selinux, linux-kselftest
  Cc: linux-kernel, netdev, kernel-team, cgzones, karl, tixxdz,
	Frederick Lawler
In-Reply-To: <20220815162028.926858-1-fred@cloudflare.com>

The LSM hook userns_create was introduced to provide LSM's an
opportunity to block or allow unprivileged user namespace creation. This
test serves two purposes: it provides a test eBPF implementation, and
tests the hook successfully blocks or allows user namespace creation.

This tests 3 cases:

        1. Unattached bpf program does not block unpriv user namespace
           creation.
        2. Attached bpf program allows user namespace creation given
           CAP_SYS_ADMIN privileges.
        3. Attached bpf program denies user namespace creation for a
           user without CAP_SYS_ADMIN.

Acked-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Frederick Lawler <fred@cloudflare.com>

---
The generic deny_namespace file name is used for future namespace
expansion. I didn't want to limit these files to just the create_user_ns
hook.
Changes since v4:
- None
Changes since v3:
- Explicitly set CAP_SYS_ADMIN to test namespace is created given
  permission
- Simplify BPF test to use sleepable hook only
- Prefer unshare() over clone() for tests
Changes since v2:
- Rename create_user_ns hook to userns_create
Changes since v1:
- Introduce this patch
---
 .../selftests/bpf/prog_tests/deny_namespace.c | 102 ++++++++++++++++++
 .../selftests/bpf/progs/test_deny_namespace.c |  33 ++++++
 2 files changed, 135 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/deny_namespace.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_deny_namespace.c

diff --git a/tools/testing/selftests/bpf/prog_tests/deny_namespace.c b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
new file mode 100644
index 000000000000..1bc6241b755b
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/deny_namespace.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <test_progs.h>
+#include "test_deny_namespace.skel.h"
+#include <sched.h>
+#include "cap_helpers.h"
+#include <stdio.h>
+
+static int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+/* negative return value -> some internal error
+ * positive return value -> userns creation failed
+ * 0                     -> userns creation succeeded
+ */
+static int create_user_ns(void)
+{
+	pid_t pid;
+
+	pid = fork();
+	if (pid < 0)
+		return -1;
+
+	if (pid == 0) {
+		if (unshare(CLONE_NEWUSER))
+			_exit(EXIT_FAILURE);
+		_exit(EXIT_SUCCESS);
+	}
+
+	return wait_for_pid(pid);
+}
+
+static void test_userns_create_bpf(void)
+{
+	__u32 cap_mask = 1ULL << CAP_SYS_ADMIN;
+	__u64 old_caps = 0;
+
+	cap_enable_effective(cap_mask, &old_caps);
+
+	ASSERT_OK(create_user_ns(), "priv new user ns");
+
+	cap_disable_effective(cap_mask, &old_caps);
+
+	ASSERT_EQ(create_user_ns(), EPERM, "unpriv new user ns");
+
+	if (cap_mask & old_caps)
+		cap_enable_effective(cap_mask, NULL);
+}
+
+static void test_unpriv_userns_create_no_bpf(void)
+{
+	__u32 cap_mask = 1ULL << CAP_SYS_ADMIN;
+	__u64 old_caps = 0;
+
+	cap_disable_effective(cap_mask, &old_caps);
+
+	ASSERT_OK(create_user_ns(), "no-bpf unpriv new user ns");
+
+	if (cap_mask & old_caps)
+		cap_enable_effective(cap_mask, NULL);
+}
+
+void test_deny_namespace(void)
+{
+	struct test_deny_namespace *skel = NULL;
+	int err;
+
+	if (test__start_subtest("unpriv_userns_create_no_bpf"))
+		test_unpriv_userns_create_no_bpf();
+
+	skel = test_deny_namespace__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel load"))
+		goto close_prog;
+
+	err = test_deny_namespace__attach(skel);
+	if (!ASSERT_OK(err, "attach"))
+		goto close_prog;
+
+	if (test__start_subtest("userns_create_bpf"))
+		test_userns_create_bpf();
+
+	test_deny_namespace__detach(skel);
+
+close_prog:
+	test_deny_namespace__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_deny_namespace.c b/tools/testing/selftests/bpf/progs/test_deny_namespace.c
new file mode 100644
index 000000000000..09ad5a4ebd1f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_deny_namespace.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+#include <linux/capability.h>
+
+struct kernel_cap_struct {
+	__u32 cap[_LINUX_CAPABILITY_U32S_3];
+} __attribute__((preserve_access_index));
+
+struct cred {
+	struct kernel_cap_struct cap_effective;
+} __attribute__((preserve_access_index));
+
+char _license[] SEC("license") = "GPL";
+
+SEC("lsm.s/userns_create")
+int BPF_PROG(test_userns_create, const struct cred *cred, int ret)
+{
+	struct kernel_cap_struct caps = cred->cap_effective;
+	int cap_index = CAP_TO_INDEX(CAP_SYS_ADMIN);
+	__u32 cap_mask = CAP_TO_MASK(CAP_SYS_ADMIN);
+
+	if (ret)
+		return 0;
+
+	ret = -EPERM;
+	if (caps.cap[cap_index] & cap_mask)
+		return 0;
+
+	return -EPERM;
+}
-- 
2.30.2


^ permalink raw reply related

* [PATCH v5 4/4] selinux: Implement userns_create hook
From: Frederick Lawler @ 2022-08-15 16:20 UTC (permalink / raw)
  To: kpsingh, revest, jackmanb, ast, daniel, andrii, kafai,
	songliubraving, yhs, john.fastabend, jmorris, serge, paul,
	stephen.smalley.work, eparis, shuah, brauner, casey, ebiederm,
	bpf, linux-security-module, selinux, linux-kselftest
  Cc: linux-kernel, netdev, kernel-team, cgzones, karl, tixxdz,
	Frederick Lawler
In-Reply-To: <20220815162028.926858-1-fred@cloudflare.com>

Unprivileged user namespace creation is an intended feature to enable
sandboxing, however this feature is often used to as an initial step to
perform a privilege escalation attack.

This patch implements a new user_namespace { create } access control
permission to restrict which domains allow or deny user namespace
creation. This is necessary for system administrators to quickly protect
their systems while waiting for vulnerability patches to be applied.

This permission can be used in the following way:

        allow domA_t domA_t : user_namespace { create };

Signed-off-by: Frederick Lawler <fred@cloudflare.com>

---
Changes since v4:
- None
Changes since v3:
- None
Changes since v2:
- Rename create_user_ns hook to userns_create
- Use user_namespace as an object opposed to a generic namespace object
- s/domB_t/domA_t in commit message
Changes since v1:
- Introduce this patch
---
 security/selinux/hooks.c            | 9 +++++++++
 security/selinux/include/classmap.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 79573504783b..b9f1078450b3 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4221,6 +4221,14 @@ static void selinux_task_to_inode(struct task_struct *p,
 	spin_unlock(&isec->lock);
 }
 
+static int selinux_userns_create(const struct cred *cred)
+{
+	u32 sid = current_sid();
+
+	return avc_has_perm(&selinux_state, sid, sid, SECCLASS_USER_NAMESPACE,
+						USER_NAMESPACE__CREATE, NULL);
+}
+
 /* Returns error only if unable to parse addresses */
 static int selinux_parse_skb_ipv4(struct sk_buff *skb,
 			struct common_audit_data *ad, u8 *proto)
@@ -7111,6 +7119,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(task_movememory, selinux_task_movememory),
 	LSM_HOOK_INIT(task_kill, selinux_task_kill),
 	LSM_HOOK_INIT(task_to_inode, selinux_task_to_inode),
+	LSM_HOOK_INIT(userns_create, selinux_userns_create),
 
 	LSM_HOOK_INIT(ipc_permission, selinux_ipc_permission),
 	LSM_HOOK_INIT(ipc_getsecid, selinux_ipc_getsecid),
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index ff757ae5f253..0bff55bb9cde 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -254,6 +254,8 @@ const struct security_class_mapping secclass_map[] = {
 	  { COMMON_FILE_PERMS, NULL } },
 	{ "io_uring",
 	  { "override_creds", "sqpoll", NULL } },
+	{ "user_namespace",
+	  { "create", NULL } },
 	{ NULL }
   };
 
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH v4 0/4] Introduce security_create_user_ns()
From: Paul Moore @ 2022-08-15 16:24 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Frederick Lawler, kpsingh, revest, jackmanb,
	ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	jmorris, stephen.smalley.work, eparis, shuah, brauner, casey, bpf,
	linux-security-module, selinux, linux-kselftest, linux-kernel,
	netdev, kernel-team, cgzones, karl
In-Reply-To: <20220815154102.GA20944@mail.hallyn.com>

On Mon, Aug 15, 2022 at 11:41 AM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Sun, Aug 14, 2022 at 10:32:51PM -0400, Paul Moore wrote:
> > On Sun, Aug 14, 2022 at 11:55 AM Serge E. Hallyn <serge@hallyn.com> wrote:
> > > On Mon, Aug 08, 2022 at 03:16:16PM -0400, Paul Moore wrote:
> > > > On Mon, Aug 8, 2022 at 2:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > > > Paul Moore <paul@paul-moore.com> writes:
> > > > > > On Mon, Aug 1, 2022 at 10:56 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > > > >> Frederick Lawler <fred@cloudflare.com> writes:
> > > > > >>
> > > > > >> > While creating a LSM BPF MAC policy to block user namespace creation, we
> > > > > >> > used the LSM cred_prepare hook because that is the closest hook to prevent
> > > > > >> > a call to create_user_ns().
> > > > > >>
> > > > > >> Re-nack for all of the same reasons.
> > > > > >> AKA This can only break the users of the user namespace.
> > > > > >>
> > > > > >> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > > >>
> > > > > >> You aren't fixing what your problem you are papering over it by denying
> > > > > >> access to the user namespace.
> > > > > >>
> > > > > >> Nack Nack Nack.
> > > > > >>
> > > > > >> Stop.
> > > > > >>
> > > > > >> Go back to the drawing board.
> > > > > >>
> > > > > >> Do not pass go.
> > > > > >>
> > > > > >> Do not collect $200.
> > > > > >
> > > > > > If you want us to take your comments seriously Eric, you need to
> > > > > > provide the list with some constructive feedback that would allow
> > > > > > Frederick to move forward with a solution to the use case that has
> > > > > > been proposed.  You response above may be many things, but it is
> > > > > > certainly not that.
> > > > >
> > > > > I did provide constructive feedback.  My feedback to his problem
> > > > > was to address the real problem of bugs in the kernel.
> > > >
> > > > We've heard from several people who have use cases which require
> > > > adding LSM-level access controls and observability to user namespace
> > > > creation.  This is the problem we are trying to solve here; if you do
> > > > not like the approach proposed in this patchset please suggest another
> > > > implementation that allows LSMs visibility into user namespace
> > > > creation.
> > >
> > > Regarding the observability - can someone concisely lay out why just
> > > auditing userns creation would not suffice?  Userspace could decide
> > > what to report based on whether the creating user_ns == /proc/1/ns/user...
> >
> > One of the selling points of the BPF LSM is that it allows for various
> > different ways of reporting and logging beyond audit.  However, even
> > if it was limited to just audit I believe that provides some useful
> > justification as auditing fork()/clone() isn't quite the same and
> > could be difficult to do at scale in some configurations.  I haven't
> > personally added a BPF LSM program to the kernel so I can't speak to
> > the details on what is possible, but I'm sure others on the To/CC line
> > could help provide more information if that is important to you.
> >
> > > Regarding limiting the tweaking of otherwise-privileged code by
> > > unprivileged users, i wonder whether we could instead add smarts to
> > > ns_capable().
> >
> > The existing security_capable() hook is eventually called by ns_capable():
> >
> >   ns_capable()
> >     ns_capable_common()
> >       security_capable(const struct cred *cred,
> >                        struct user_namespace *ns,
> >                        int cap,
> >                        unsigned int opts);
> >
> > ... I'm not sure what additional smarts would be useful here?
>
> Oh - i wasn't necessarily thinking of an LSM.  I was picturing a
> sysctl next to unprivileged_userns_clone.  But you're right, looks
> like an LSM could already do this.  Of course, there's an issue early
> on in that the root user in the new namespace couldn't setuid, so
> the uid mapping is still limited.  So this idea probably isn't worth
> the characters we've typed about it so far, sorry.

No harm, no foul.  This thread has already reached record lows with
respect to usefulness-vs-characters ratio, a few more isn't going to
hurt anything ;)

> > [side note: SELinux does actually distinguish between capability
> > checks in the initial user namespace vs child namespaces]
> >
> > > Point being, uid mapping would still work, but we'd
> > > break the "privileged against resources you own" part of user
> > > namespaces.  I would want it to default to allow, but then when a
> > > 0-day is found which requires reaching ns_capable() code, admins
> > > could easily prevent exploitation until reboot from a fixed kernel.
> >
> > That assumes that everything you care about is behind a capability
> > check, which is probably going to be correct in a lot of the cases,
> > but I think it would be a mistake to assume that is always going to be
> > true.
>
> I might be thinking wrongly, but if it's not behind a capability check,
> then it seems to me it's not an exploit that can only be reached by
> becoming root in a user namespace, which means disabling user namespace
> creation by unprivileged users will not stop the attack.

I was primarily thinking about two things: subj/obj relationships
which are really not addressed with capability checks, and unrelated
problems which aren't the fault of the user namespace but could be
somehow made easier through some of the unique situations offered by
user namespaces.  There are exploits that often require chaining
together multiple "things" to trigger the necessary flaw, and
sometimes the most immediate way to stop such an attack is to apply
additional controls to one of these intermediate steps.  Frekerick's
work puts the necessary infrastructure in place so we can do that with
user namespaces.

-- 
paul-moore.com

^ permalink raw reply

* Re: upstream kernel crashes
From: Andres Freund @ 2022-08-15 16:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, linux-kernel, Greg KH, c
In-Reply-To: <20220815113729-mutt-send-email-mst@kernel.org>

Hi,

On 2022-08-15 11:40:59 -0400, Michael S. Tsirkin wrote:
> OK so this gives us a quick revert as a solution for now.
> Next, I would appreciate it if you just try this simple hack.
> If it crashes we either have a long standing problem in virtio
> code or more likely a gcp bug where it can't handle smaller
> rings than what device requestes.
> Thanks!

I applied the below and the problem persists.

> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index f7965c5dd36b..bdd5f481570b 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -314,6 +314,9 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>  	if (!size || size > num)
>  		size = num;
>  
> +	if (size > 1024)
> +		size = 1024;
> +
>  	if (size & (size - 1)) {
>  		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", size);
>  		return ERR_PTR(-EINVAL);
> 
> 

[    1.165162] virtio_net virtio1 enp0s4: renamed from eth0
[    1.177815] general protection fault, probably for non-canonical address 0xffff000000000400: 0000 [#1] PREEMPT SMP PTI
[    1.179565] CPU: 1 PID: 125 Comm: systemd-udevd Not tainted 6.0.0-rc1-bisect14-dirty #14
[    1.180785] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/29/2022
[    1.182475] RIP: 0010:__kmalloc_node_track_caller+0x19e/0x380
[    1.183365] Code: 2b 04 25 28 00 00 00 0f 85 f8 01 00 00 48 83 c4 18 48 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 8b 4d 28 48 8b 7d 00 <48> 8b 1c 08 48 8d 4a 40 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 0b ff
[    1.186208] RSP: 0018:ffff9c470021b860 EFLAGS: 00010246
[    1.187194] RAX: ffff000000000000 RBX: 00000000000928c0 RCX: 0000000000000400
[    1.188634] RDX: 0000000000005781 RSI: 00000000000928c0 RDI: 000000000002e0f0
[    1.190177] RBP: ffff908380042c00 R08: 0000000000000600 R09: ffff908380b665e4
[    1.191256] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000928c0
[    1.192269] R13: 0000000000000740 R14: 00000000ffffffff R15: 0000000000000000
[    1.193368] FS:  00007f746702a8c0(0000) GS:ffff9084b7d00000(0000) knlGS:0000000000000000
[    1.194846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.195661] CR2: 00007ffc010df980 CR3: 0000000103826005 CR4: 00000000003706e0
[    1.196912] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.198216] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.199367] Call Trace:
[    1.199815]  <TASK>
[    1.200138]  ? netlink_trim+0x85/0xb0
[    1.200754]  pskb_expand_head+0x92/0x340
[    1.202512]  netlink_trim+0x85/0xb0
[    1.203069]  netlink_unicast+0x54/0x390
[    1.203630]  rtnl_getlink+0x366/0x410
[    1.204155]  ? __d_alloc+0x24/0x1d0
[    1.204668]  rtnetlink_rcv_msg+0x146/0x3b0
[    1.205256]  ? _raw_spin_unlock+0xd/0x30
[    1.205867]  ? __d_add+0xf2/0x1b0
[    1.206600]  ? rtnl_calcit.isra.0+0x130/0x130
[    1.207221]  netlink_rcv_skb+0x49/0xf0
[    1.207904]  netlink_unicast+0x23a/0x390
[    1.208585]  netlink_sendmsg+0x23b/0x4b0
[    1.209203]  sock_sendmsg+0x57/0x60
[    1.210118]  __sys_sendto+0x117/0x170
[    1.210694]  ? __wake_up_common_lock+0x83/0xc0
[    1.211420]  __x64_sys_sendto+0x1b/0x30
[    1.211992]  do_syscall_64+0x37/0x90
[    1.212497]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[    1.213407] RIP: 0033:0x7f74677404e6
[    1.213973] Code: 69 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 41 54 48 83 ec 30 44 89 4c 24 2c 4c
[    1.217098] RSP: 002b:00007ffc010daa78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[    1.219539] RAX: ffffffffffffffda RBX: 000000000011bc98 RCX: 00007f74677404e6
[    1.220552] RDX: 0000000000000020 RSI: 0000563160679570 RDI: 0000000000000005
[    1.222378] RBP: 00005631606796b0 R08: 00007ffc010daaf0 R09: 0000000000000080
[    1.223692] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[    1.224793] R13: 0000000000000000 R14: 0000000000000000 R15: 00005631606794b0
[    1.226228]  </TASK>
[    1.226775] Modules linked in:
[    1.227414] ---[ end trace 0000000000000000 ]---

Greetings,

Andres Freund

^ permalink raw reply

* net-next is OPEN
From: Jakub Kicinski @ 2022-08-15 16:49 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: David Miller, Paolo Abeni, Eric Dumazet

Hi,

rc1 is out, net-next is open, again.

^ permalink raw reply

* Re: upstream kernel crashes
From: Michael S. Tsirkin @ 2022-08-15 16:50 UTC (permalink / raw)
  To: Andres Freund
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, linux-kernel, Greg KH, c
In-Reply-To: <20220815164503.jsoezxcm6q4u2b6j@awork3.anarazel.de>

On Mon, Aug 15, 2022 at 09:45:03AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-08-15 11:40:59 -0400, Michael S. Tsirkin wrote:
> > OK so this gives us a quick revert as a solution for now.
> > Next, I would appreciate it if you just try this simple hack.
> > If it crashes we either have a long standing problem in virtio
> > code or more likely a gcp bug where it can't handle smaller
> > rings than what device requestes.
> > Thanks!
> 
> I applied the below and the problem persists.
> 
> > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > index f7965c5dd36b..bdd5f481570b 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -314,6 +314,9 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
> >  	if (!size || size > num)
> >  		size = num;
> >  
> > +	if (size > 1024)
> > +		size = 1024;
> > +
> >  	if (size & (size - 1)) {
> >  		dev_warn(&vp_dev->pci_dev->dev, "bad queue size %u", size);
> >  		return ERR_PTR(-EINVAL);
> > 
> > 
> 
> [    1.165162] virtio_net virtio1 enp0s4: renamed from eth0
> [    1.177815] general protection fault, probably for non-canonical address 0xffff000000000400: 0000 [#1] PREEMPT SMP PTI
> [    1.179565] CPU: 1 PID: 125 Comm: systemd-udevd Not tainted 6.0.0-rc1-bisect14-dirty #14
> [    1.180785] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/29/2022
> [    1.182475] RIP: 0010:__kmalloc_node_track_caller+0x19e/0x380
> [    1.183365] Code: 2b 04 25 28 00 00 00 0f 85 f8 01 00 00 48 83 c4 18 48 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 8b 4d 28 48 8b 7d 00 <48> 8b 1c 08 48 8d 4a 40 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 0b ff
> [    1.186208] RSP: 0018:ffff9c470021b860 EFLAGS: 00010246
> [    1.187194] RAX: ffff000000000000 RBX: 00000000000928c0 RCX: 0000000000000400
> [    1.188634] RDX: 0000000000005781 RSI: 00000000000928c0 RDI: 000000000002e0f0
> [    1.190177] RBP: ffff908380042c00 R08: 0000000000000600 R09: ffff908380b665e4
> [    1.191256] R10: 0000000000000003 R11: 0000000000000002 R12: 00000000000928c0
> [    1.192269] R13: 0000000000000740 R14: 00000000ffffffff R15: 0000000000000000
> [    1.193368] FS:  00007f746702a8c0(0000) GS:ffff9084b7d00000(0000) knlGS:0000000000000000
> [    1.194846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.195661] CR2: 00007ffc010df980 CR3: 0000000103826005 CR4: 00000000003706e0
> [    1.196912] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    1.198216] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    1.199367] Call Trace:
> [    1.199815]  <TASK>
> [    1.200138]  ? netlink_trim+0x85/0xb0
> [    1.200754]  pskb_expand_head+0x92/0x340
> [    1.202512]  netlink_trim+0x85/0xb0
> [    1.203069]  netlink_unicast+0x54/0x390
> [    1.203630]  rtnl_getlink+0x366/0x410
> [    1.204155]  ? __d_alloc+0x24/0x1d0
> [    1.204668]  rtnetlink_rcv_msg+0x146/0x3b0
> [    1.205256]  ? _raw_spin_unlock+0xd/0x30
> [    1.205867]  ? __d_add+0xf2/0x1b0
> [    1.206600]  ? rtnl_calcit.isra.0+0x130/0x130
> [    1.207221]  netlink_rcv_skb+0x49/0xf0
> [    1.207904]  netlink_unicast+0x23a/0x390
> [    1.208585]  netlink_sendmsg+0x23b/0x4b0
> [    1.209203]  sock_sendmsg+0x57/0x60
> [    1.210118]  __sys_sendto+0x117/0x170
> [    1.210694]  ? __wake_up_common_lock+0x83/0xc0
> [    1.211420]  __x64_sys_sendto+0x1b/0x30
> [    1.211992]  do_syscall_64+0x37/0x90
> [    1.212497]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [    1.213407] RIP: 0033:0x7f74677404e6
> [    1.213973] Code: 69 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 41 54 48 83 ec 30 44 89 4c 24 2c 4c
> [    1.217098] RSP: 002b:00007ffc010daa78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
> [    1.219539] RAX: ffffffffffffffda RBX: 000000000011bc98 RCX: 00007f74677404e6
> [    1.220552] RDX: 0000000000000020 RSI: 0000563160679570 RDI: 0000000000000005
> [    1.222378] RBP: 00005631606796b0 R08: 00007ffc010daaf0 R09: 0000000000000080
> [    1.223692] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [    1.224793] R13: 0000000000000000 R14: 0000000000000000 R15: 00005631606794b0
> [    1.226228]  </TASK>
> [    1.226775] Modules linked in:
> [    1.227414] ---[ end trace 0000000000000000 ]---
> 
> Greetings,
> 
> Andres Freund

Okay! And just to be 100% sure, can you try the following on top of 5.19:


diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 623906b4996c..6f4e54a618bc 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -208,6 +208,9 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
 		return ERR_PTR(-EINVAL);
 	}
 
+	if (num > 1024)
+		num = 1024;
+
 	info->msix_vector = msix_vec;
 
 	/* create the vring */

-- 
MST


^ permalink raw reply related

* Re: [PATCH v1 0/3] Bring back driver_deferred_probe_check_state() for now
From: Luca Weiss @ 2022-08-15 16:57 UTC (permalink / raw)
  To: Tony Lindgren, Saravana Kannan
  Cc: Greg Kroah-Hartman, Rafael J. Wysocki, Kevin Hilman, Ulf Hansson,
	Pavel Machek, Len Brown, Andrew Lunn, Heiner Kallweit,
	Russell King, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, naresh.kamboju, kernel-team, linux-kernel, linux-pm,
	netdev
In-Reply-To: <Yvonn9C/AFcRUefV@atomide.com>

On Mon Aug 15, 2022 at 1:01 PM CEST, Tony Lindgren wrote:
> * Saravana Kannan <saravanak@google.com> [700101 02:00]:
> > More fixes/changes are needed before driver_deferred_probe_check_state()
> > can be deleted. So, bring it back for now.
> > 
> > Greg,
> > 
> > Can we get this into 5.19? If not, it might not be worth picking up this
> > series. I could just do the other/more fixes in time for 5.20.
>
> Yes please pick this as fixes for v6.0-rc series, it fixes booting for
> me. I've replied with fixes tags for the two patches that were causing
> regressions for me.
>

Hi,

for me Patch 1+3 fix display probe on Qualcomm SM6350 (although display
for this SoC isn't upstream yet, there are lots of other SoCs with very
similar setup).

Probe for DPU silently fails, with CONFIG_DEBUG_DRIVER=y we get this:

msm-mdss ae00000.mdss: __genpd_dev_pm_attach() failed to find PM domain: -2

While I'm not familiar with the specifics of fw_devlink, the dtsi has
power-domains = <&dispcc MDSS_GDSC> for this node but it doesn't pick
that up for some reason.

We can also see that a bit later dispcc finally probes.

Regards
Luca

> Regards,
>
> Tony

^ permalink raw reply

* Re: [PATCH v3 6/9] arm64: bcmbca: Make BCM4908 drivers depend on ARCH_BCMBCA
From: Florian Fainelli @ 2022-08-15 16:57 UTC (permalink / raw)
  To: William Zhang, Linux ARM List
  Cc: Broadcom Kernel List, joel.peshkin, dan.beygelman, f.fainelli,
	krzysztof.kozlowski, rafal, Guenter Roeck, Bjorn Helgaas,
	Wolfram Sang, Philipp Zabel, Miquel Raynal, David S. Miller,
	Rob Herring, Kishon Vijay Abraham I, Vinod Koul, Linus Walleij,
	Greg Kroah-Hartman, Wim Van Sebroeck,
	open list:I2C SUBSYSTEM HOST DRIVERS, open list,
	open list:MEMORY TECHNOLOGY DEVICES (MTD),
	open list:NETWORKING DRIVERS,
	open list:PCI NATIVE HOST BRIDGE AND ENDPOINT DRIVERS,
	open list:GENERIC PHY FRAMEWORK, open list:PIN CONTROL SUBSYSTEM,
	open list:BROADCOM BMIPS MIPS ARCHITECTURE,
	open list:SERIAL DRIVERS, open list:WATCHDOG DEVICE DRIVERS
In-Reply-To: <20220803175455.47638-7-william.zhang@broadcom.com>

On 8/3/22 10:54, William Zhang wrote:
> With Broadcom Broadband arch ARCH_BCMBCA supported in the kernel, this
> patch series migrate the ARCH_BCM4908 symbol to ARCH_BCMBCA. Hence
> replace ARCH_BCM4908 with ARCH_BCMBCA in subsystem Kconfig files.
> 
> Signed-off-by: William Zhang <william.zhang@broadcom.com>
> Acked-by: Guenter Roeck <linux@roeck-us.net> (for watchdog)
> Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
> Acked-by: Wolfram Sang <wsa@kernel.org> (for i2c)
> Acked-by: Philipp Zabel <p.zabel@pengutronix.de> (for reset)

Applied to https://github.com/Broadcom/stblinux/commits/drivers/next, 
thanks!
-- 
Florian

^ permalink raw reply

* [PATCH net-next] net: phy: broadcom: Implement suspend/resume for AC131 and BCM5241
From: Florian Fainelli @ 2022-08-15 17:43 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Broadcom internal kernel review list,
	Andrew Lunn, Heiner Kallweit, Russell King, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, open list

Implement the suspend/resume procedure for the Broadcom AC131 and BCM5241 type
of PHYs (10/100 only) by entering the standard power down followed by the
proprietary standby mode in the auxiliary mode 4 shadow register. On resume,
the PHY software reset is enough to make it come out of standby mode so we can
utilize brcm_fet_config_init() as the resume hook.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/broadcom.c | 48 ++++++++++++++++++++++++++++++++++++++
 include/linux/brcmphy.h    |  1 +
 2 files changed, 49 insertions(+)

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 31fbcdddc9ad..739348b3f135 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -766,6 +766,50 @@ static irqreturn_t brcm_fet_handle_interrupt(struct phy_device *phydev)
 	return IRQ_HANDLED;
 }
 
+static int brcm_fet_suspend(struct phy_device *phydev)
+{
+	int reg, err, err2, brcmtest;
+
+	/* We cannot use a read/modify/write here otherwise the PHY continues
+	 * to drive LEDs which defeats the purpose of low power mode.
+	 */
+	err = phy_write(phydev, MII_BMCR, BMCR_PDOWN);
+	if (err < 0)
+		return err;
+
+	/* Enable shadow register access */
+	brcmtest = phy_read(phydev, MII_BRCM_FET_BRCMTEST);
+	if (brcmtest < 0)
+		return brcmtest;
+
+	reg = brcmtest | MII_BRCM_FET_BT_SRE;
+
+	err = phy_write(phydev, MII_BRCM_FET_BRCMTEST, reg);
+	if (err < 0)
+		return err;
+
+	/* Set standby mode */
+	reg = phy_read(phydev, MII_BRCM_FET_SHDW_AUXMODE4);
+	if (reg < 0) {
+		err = reg;
+		goto done;
+	}
+
+	reg |= MII_BRCM_FET_SHDW_AM4_STANDBY;
+
+	err = phy_write(phydev, MII_BRCM_FET_SHDW_AUXMODE4, reg);
+	if (err < 0)
+		goto done;
+
+done:
+	/* Disable shadow register access */
+	err2 = phy_write(phydev, MII_BRCM_FET_BRCMTEST, brcmtest);
+	if (!err)
+		err = err2;
+
+	return err;
+}
+
 static int bcm54xx_phy_probe(struct phy_device *phydev)
 {
 	struct bcm54xx_phy_priv *priv;
@@ -1033,6 +1077,8 @@ static struct phy_driver broadcom_drivers[] = {
 	.config_init	= brcm_fet_config_init,
 	.config_intr	= brcm_fet_config_intr,
 	.handle_interrupt = brcm_fet_handle_interrupt,
+	.suspend	= brcm_fet_suspend,
+	.resume		= brcm_fet_config_init,
 }, {
 	.phy_id		= PHY_ID_BCM5241,
 	.phy_id_mask	= 0xfffffff0,
@@ -1041,6 +1087,8 @@ static struct phy_driver broadcom_drivers[] = {
 	.config_init	= brcm_fet_config_init,
 	.config_intr	= brcm_fet_config_intr,
 	.handle_interrupt = brcm_fet_handle_interrupt,
+	.suspend	= brcm_fet_suspend,
+	.resume		= brcm_fet_config_init,
 }, {
 	.phy_id		= PHY_ID_BCM5395,
 	.phy_id_mask	= 0xfffffff0,
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index 6ff567ece34a..9e77165f3ef6 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -293,6 +293,7 @@
 #define MII_BRCM_FET_SHDW_MC_FAME	0x4000	/* Force Auto MDIX enable */
 
 #define MII_BRCM_FET_SHDW_AUXMODE4	0x1a	/* Auxiliary mode 4 */
+#define MII_BRCM_FET_SHDW_AM4_STANDBY	0x0008	/* Standby enable */
 #define MII_BRCM_FET_SHDW_AM4_LED_MASK	0x0003
 #define MII_BRCM_FET_SHDW_AM4_LED_MODE1 0x0001
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net 1/1] net_sched: cls_route: disallow handle of 0
From: Jakub Kicinski @ 2022-08-15 17:44 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: davem, edumazet, pabeni, netdev, xiyou.wangcong, jiri, kuznet,
	cascardo, linux-distros, security, stephen, dsahern, gregkh
In-Reply-To: <20220814112758.3088655-1-jhs@mojatatu.com>

On Sun, 14 Aug 2022 11:27:58 +0000 Jamal Hadi Salim wrote:
> Follows up on:
> https://lore.kernel.org/all/20220809170518.164662-1-cascardo@canonical.com/
> 
> handle of 0 implies from/to of universe realm which is not very
> sensible.

Heh, I was gonna say, but then you acked the other fix :)

^ permalink raw reply

* Re: upstream kernel crashes
From: Andres Freund @ 2022-08-15 17:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Xuan Zhuo, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, virtualization, netdev,
	Linus Torvalds, Jens Axboe, James Bottomley, Martin K. Petersen,
	Guenter Roeck, linux-kernel, Greg KH, c
In-Reply-To: <20220815124748-mutt-send-email-mst@kernel.org>

Hi,

On 2022-08-15 12:50:52 -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 15, 2022 at 09:45:03AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2022-08-15 11:40:59 -0400, Michael S. Tsirkin wrote:
> > > OK so this gives us a quick revert as a solution for now.
> > > Next, I would appreciate it if you just try this simple hack.
> > > If it crashes we either have a long standing problem in virtio
> > > code or more likely a gcp bug where it can't handle smaller
> > > rings than what device requestes.
> > > Thanks!
> > 
> > I applied the below and the problem persists.
> > 
> > [...]
>
> Okay!

Just checking - I applied and tested this atop 6.0-rc1, correct? Or did you
want me to test it with the 762faee5a267 reverted? I guess what you're trying
to test if a smaller queue than what's requested you'd want to do so without
the problematic patch applied...


> And just to be 100% sure, can you try the following on top of 5.19:

> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 623906b4996c..6f4e54a618bc 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -208,6 +208,9 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> +	if (num > 1024)
> +		num = 1024;
> +
>  	info->msix_vector = msix_vec;
>  
>  	/* create the vring */
> 
> -- 

Either way, I did this, and there are no issues that I could observe. No
oopses, no broken networking. But:

To make sure it does something I added a debugging printk - which doesn't show
up. I assume this is at a point at least earlyprintk should work (which I see
getting enabled via serial)?

Greetings,

Andres Freund

^ permalink raw reply

* [PATCH net-next 0/2] net: dsa: bcm_sf2: Utilize PHYLINK for all ports
From: Florian Fainelli @ 2022-08-15 17:50 UTC (permalink / raw)
  To: netdev
  Cc: Florian Fainelli, Andrew Lunn, Vivien Didelot, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	open list

Hi all,

This patch series has the bcm_sf2 driver utilize PHYLINK to configure
the CPU port link parameters to unify the configuration and pave the way
for DSA to utilize PHYLINK for all ports in the future.

Tested on BCM7445 and BCM7278

Florian Fainelli (2):
  net: dsa: bcm_sf2: Introduce helper for port override offset
  net: dsa: bcm_sf2: Have PHYLINK configure CPU/IMP port(s)

 drivers/net/dsa/bcm_sf2.c | 130 ++++++++++++++++++--------------------
 1 file changed, 63 insertions(+), 67 deletions(-)

-- 
2.25.1


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox