Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it to user-mode
From: longli @ 2023-11-08 22:56 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-hyperv, netdev, linux-kernel
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When a VF is being exposed form the kernel, it should be marked as "slave"
before exposing to the user-mode. The VF is not usable without netvsc running
as master. The user-mode should never see a VF without the "slave" flag.

An example of a user-mode program depending on this flag is cloud-init
(https://github.com/canonical/cloud-init/blob/19.3/cloudinit/net/__init__.py)
When scanning interfaces, it checks on if this interface has a master to
decide if it should be configured. There are other user-mode programs perform
similar checks.

This commit moves the code of setting the slave flag to the time before VF is
exposed to user-mode.

Signed-off-by: Long Li <longli@microsoft.com>
---

Change since v1:
Use a new function to handle NETDEV_POST_INIT.

Change since v2:
Add "net" in subject. Add more details on the user-mode program behavior.

Change since v3:
Change target to net-next.

 drivers/net/hyperv/netvsc_drv.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ec77fb9dcf89..fdad58dcc6a8 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2206,9 +2206,6 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
 		goto upper_link_failed;
 	}
 
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
-
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
 	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
@@ -2320,11 +2317,9 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 	 */
 	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
 		ndev = hv_get_drvdata(ndev_ctx->device_ctx);
-		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
-			netdev_notice(vf_netdev,
-				      "falling back to mac addr based matching\n");
+		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr) ||
+		    ether_addr_equal(vf_netdev->dev_addr, ndev->perm_addr))
 			return ndev;
-		}
 	}
 
 	netdev_notice(vf_netdev,
@@ -2332,6 +2327,19 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 	return NULL;
 }
 
+static int netvsc_prepare_slave(struct net_device *vf_netdev)
+{
+	struct net_device *ndev;
+
+	ndev = get_netvsc_byslot(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
+
+	/* set slave flag before open to prevent IPv6 addrconf */
+	vf_netdev->flags |= IFF_SLAVE;
+	return NOTIFY_DONE;
+}
+
 static int netvsc_register_vf(struct net_device *vf_netdev)
 {
 	struct net_device_context *net_device_ctx;
@@ -2753,6 +2761,8 @@ static int netvsc_netdev_event(struct notifier_block *this,
 		return NOTIFY_DONE;
 
 	switch (event) {
+	case NETDEV_POST_INIT:
+		return netvsc_prepare_slave(event_dev);
 	case NETDEV_REGISTER:
 		return netvsc_register_vf(event_dev);
 	case NETDEV_UNREGISTER:
-- 
2.34.1


^ permalink raw reply related

* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Toke Høiland-Jørgensen @ 2023-11-08 23:10 UTC (permalink / raw)
  To: Nelson, Shannon, Jesper Dangaard Brouer, David Ahern,
	Jakub Kicinski, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko
In-Reply-To: <fa95d5d0-35c0-497e-aea8-a35f9f6304f4@amd.com>

"Nelson, Shannon" <shannon.nelson@amd.com> writes:

> On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote:
>> 
>> "Nelson, Shannon" <shannon.nelson@amd.com> writes:
>> 
>>> While testing new code to support XDP in the ionic driver we found that
>>> we could panic the kernel by running a bind/unbind loop on the target
>>> interface of an xdp_redirect action.  Obviously this is a stress test
>>> that is abusing the system, but it does point to a window of opportunity
>>> in bq_enqueue() and bq_xmit_all().  I believe that while the validity of
>>> the target interface has been checked in __xdp_enqueue(), the interface
>>> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
>>> use the interface.  There is no locking or reference taken on the
>>> interface to hold it in place before the target’s ndo_xdp_xmit() is called.
>>>
>>> Below is a stack trace that our tester captured while running our test
>>> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
>>> non-upstream kernel.  But if you look at the current upstream code in
>>> kernel/bpf/devmap.c I think you can see what we ran into.
>>>
>>> Other than telling users to not abuse the system with a bind/unbind
>>> loop, is there something we can do to limit the potential pain here?
>>> Without knowing what interfaces might be targeted by the users’ XDP
>>> programs, is there a step the originating driver can do to take
>>> precautions?  Did we simply miss a step in the driver, or is this an
>>> actual problem in the devmap code?
>> 
>> Sounds like a driver bug :)
>
> Entirely possible, wouldn't be our first ... :-)
>
>> 
>> The XDP redirect flow guarantees that all outstanding packets are
>> flushed within a single NAPI cycle, as documented here:
>> https://docs.kernel.org/bpf/redirect.html
>> 
>> So basically, the driver should be doing a two-step teardown: remove
>> global visibility of the resource in question, wait for all concurrent
>> users to finish, and *then* free the data structure. This corresponds to
>> the usual RCU protection: resources should be kept alive until all
>> concurrent RCU critical sections have exited on all CPUs. So if your
>> driver is removing an interface's data structure without waiting for
>> concurrent NAPI cycles to finish, that's a bug in the driver.
>> 
>> This kind of thing is what the synchronize_net() function is for; for a
>> usage example, see veth_napi_del_range(). My guess would be that you're
>> missing this as part of your driver teardown flow?
>
> Essentially, the first thing we do in the remove function is to call 
> unregister_netdev(), which has synchronize_net() in the path, so I don't 
> think this is missing from our scenario, but thanks for the hint, I'll 
> keep this in mind.  I do see there are a couple of net drivers that are 
> more aggressive about calling it directly in some other parts of the 
> logic - I don't think that has a bearing on this issue, but I'll keep it 
> in mind.

Hmm, right, in fact unregister_netdev() has two such synchronize_net()
calls. The XDP queue is only guaranteed to be flushed after the second
one of those, though, and there's an 'ndo_uninit()' callback in-between
them. So I don't suppose your driver implements that ndo and does
something there that could cause the crash you're seeing?

Otherwise, the one thing I can think of is that maybe it can be related
to the fact that synchronize_net() turns into a
synchronize_rcu_expedited() if the rtnl lock is held (which it is in
this case if you're calling the parameter-less unregister_netdev()). I'm
not quite sure I grok the expedited wait thing, but it should be pretty
easy to check if this is the cause by making a change like the one below
and seeing if the issue goes away.

-Toke

diff --git a/net/core/dev.c b/net/core/dev.c
index e28a18e7069b..1a035a5f0b0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10932,7 +10932,7 @@ void synchronize_net(void)
 {
        might_sleep();
        if (rtnl_is_locked())
-               synchronize_rcu_expedited();
+               synchronize_rcu();
        else
                synchronize_rcu();
 }


^ permalink raw reply related

* Re: [PATCH 03/22] [RESEND] kprobes: unify kprobes_exceptions_nofify() prototypes
From: Masami Hiramatsu @ 2023-11-08 23:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Morton, linux-kernel, Masahiro Yamada, linux-kbuild,
	Arnd Bergmann, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Guo Ren, Peter Zijlstra, Ard Biesheuvel,
	Huacai Chen, Greg Ungerer, Michal Simek, Thomas Bogendoerfer,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Geoff Levand, Palmer Dabbelt, Heiko Carstens,
	John Paul Adrian Glaubitz, David S. Miller, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, x86, Helge Deller, Sudip Mukherjee,
	Greg Kroah-Hartman, Timur Tabi, Kent Overstreet, David Woodhouse,
	Naveen N. Rao, Anil S Keshavamurthy, Kees Cook, Vincenzo Frascino,
	Juri Lelli, Vincent Guittot, Nathan Chancellor, Nick Desaulniers,
	Nicolas Schier, Al Viro, Uwe Kleine-König, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-trace-kernel, linux-csky,
	loongarch, linux-m68k, linux-mips, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, netdev, linux-parisc, linux-usb,
	linux-fbdev, dri-devel, linux-bcachefs, linux-mtd
In-Reply-To: <20231108125843.3806765-4-arnd@kernel.org>

On Wed,  8 Nov 2023 13:58:24 +0100
Arnd Bergmann <arnd@kernel.org> wrote:

> From: Arnd Bergmann <arnd@arndb.de>
> 
> Most architectures that support kprobes declare this function in their
> own asm/kprobes.h header and provide an override, but some are missing
> the prototype, which causes a warning for the __weak stub implementation:
> 
> kernel/kprobes.c:1865:12: error: no previous prototype for 'kprobe_exceptions_notify' [-Werror=missing-prototypes]
>  1865 | int __weak kprobe_exceptions_notify(struct notifier_block *self,
> 
> Move the prototype into linux/kprobes.h so it is visible to all
> the definitions.

Thanks, let me pick this to linux-trace tree.

> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  arch/arc/include/asm/kprobes.h     | 3 ---
>  arch/arm/include/asm/kprobes.h     | 2 --
>  arch/arm64/include/asm/kprobes.h   | 2 --
>  arch/mips/include/asm/kprobes.h    | 2 --
>  arch/powerpc/include/asm/kprobes.h | 2 --
>  arch/s390/include/asm/kprobes.h    | 2 --
>  arch/sh/include/asm/kprobes.h      | 2 --
>  arch/sparc/include/asm/kprobes.h   | 2 --
>  arch/x86/include/asm/kprobes.h     | 2 --
>  include/linux/kprobes.h            | 4 ++++
>  10 files changed, 4 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arc/include/asm/kprobes.h b/arch/arc/include/asm/kprobes.h
> index de1566e32cb8..68e8301c0df2 100644
> --- a/arch/arc/include/asm/kprobes.h
> +++ b/arch/arc/include/asm/kprobes.h
> @@ -32,9 +32,6 @@ struct kprobe;
>  
>  void arch_remove_kprobe(struct kprobe *p);
>  
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -			     unsigned long val, void *data);
> -
>  struct prev_kprobe {
>  	struct kprobe *kp;
>  	unsigned long status;
> diff --git a/arch/arm/include/asm/kprobes.h b/arch/arm/include/asm/kprobes.h
> index e26a278d301a..5b8dbf1b0be4 100644
> --- a/arch/arm/include/asm/kprobes.h
> +++ b/arch/arm/include/asm/kprobes.h
> @@ -40,8 +40,6 @@ struct kprobe_ctlblk {
>  
>  void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -			     unsigned long val, void *data);
>  
>  /* optinsn template addresses */
>  extern __visible kprobe_opcode_t optprobe_template_entry[];
> diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h
> index 05cd82eeca13..be7a3680dadf 100644
> --- a/arch/arm64/include/asm/kprobes.h
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -37,8 +37,6 @@ struct kprobe_ctlblk {
>  
>  void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -			     unsigned long val, void *data);
>  void __kretprobe_trampoline(void);
>  void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
>  
> diff --git a/arch/mips/include/asm/kprobes.h b/arch/mips/include/asm/kprobes.h
> index 68b1e5d458cf..bc27d99c9436 100644
> --- a/arch/mips/include/asm/kprobes.h
> +++ b/arch/mips/include/asm/kprobes.h
> @@ -71,8 +71,6 @@ struct kprobe_ctlblk {
>  	struct prev_kprobe prev_kprobe;
>  };
>  
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> -				    unsigned long val, void *data);
>  
>  #endif /* CONFIG_KPROBES */
>  #endif /* _ASM_KPROBES_H */
> diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
> index c8e4b4fd4e33..4525a9c68260 100644
> --- a/arch/powerpc/include/asm/kprobes.h
> +++ b/arch/powerpc/include/asm/kprobes.h
> @@ -84,8 +84,6 @@ struct arch_optimized_insn {
>  	kprobe_opcode_t *insn;
>  };
>  
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> -					unsigned long val, void *data);
>  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
>  extern int kprobe_handler(struct pt_regs *regs);
>  extern int kprobe_post_handler(struct pt_regs *regs);
> diff --git a/arch/s390/include/asm/kprobes.h b/arch/s390/include/asm/kprobes.h
> index 21b9e5290c04..01f1682a73b7 100644
> --- a/arch/s390/include/asm/kprobes.h
> +++ b/arch/s390/include/asm/kprobes.h
> @@ -73,8 +73,6 @@ struct kprobe_ctlblk {
>  void arch_remove_kprobe(struct kprobe *p);
>  
>  int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -	unsigned long val, void *data);
>  
>  #define flush_insn_slot(p)	do { } while (0)
>  
> diff --git a/arch/sh/include/asm/kprobes.h b/arch/sh/include/asm/kprobes.h
> index eeba83e0a7d2..65d4c3316a5b 100644
> --- a/arch/sh/include/asm/kprobes.h
> +++ b/arch/sh/include/asm/kprobes.h
> @@ -46,8 +46,6 @@ struct kprobe_ctlblk {
>  };
>  
>  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> -				    unsigned long val, void *data);
>  extern int kprobe_handle_illslot(unsigned long pc);
>  #else
>  
> diff --git a/arch/sparc/include/asm/kprobes.h b/arch/sparc/include/asm/kprobes.h
> index 06c2bc767ef7..aec742cd898f 100644
> --- a/arch/sparc/include/asm/kprobes.h
> +++ b/arch/sparc/include/asm/kprobes.h
> @@ -47,8 +47,6 @@ struct kprobe_ctlblk {
>  	struct prev_kprobe prev_kprobe;
>  };
>  
> -int kprobe_exceptions_notify(struct notifier_block *self,
> -			     unsigned long val, void *data);
>  int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
>  asmlinkage void __kprobes kprobe_trap(unsigned long trap_level,
>  				      struct pt_regs *regs);
> diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
> index a2e9317aad49..5939694dfb28 100644
> --- a/arch/x86/include/asm/kprobes.h
> +++ b/arch/x86/include/asm/kprobes.h
> @@ -113,8 +113,6 @@ struct kprobe_ctlblk {
>  };
>  
>  extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
> -extern int kprobe_exceptions_notify(struct notifier_block *self,
> -				    unsigned long val, void *data);
>  extern int kprobe_int3_handler(struct pt_regs *regs);
>  
>  #else
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 365eb092e9c4..ab1da3142b06 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -445,6 +445,10 @@ int kprobe_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
>  
>  int arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
>  			    char *type, char *sym);
> +
> +int kprobe_exceptions_notify(struct notifier_block *self,
> +			     unsigned long val, void *data);
> +
>  #else /* !CONFIG_KPROBES: */
>  
>  static inline int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
> -- 
> 2.39.2
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 net 2/7] net/sched: taprio: fix cycle time adjustment for next entry
From: Vladimir Oltean @ 2023-11-08 23:20 UTC (permalink / raw)
  To: Faizal Rahim
  Cc: Vinicius Costa Gomes, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, linux-kernel
In-Reply-To: <20231107112023.676016-3-faizal.abdul.rahim@linux.intel.com>

On Tue, Nov 07, 2023 at 06:20:18AM -0500, Faizal Rahim wrote:
> According to IEEE Std. 802.1Q-2018 section Q.5 CycleTimeExtension:
> "the Cycle Time Extension variable allows this extension of the last old
> cycle to be done in a defined way. If the last complete old cycle would
> normally end less than OperCycleTimeExtension nanoseconds before the new
> base time, then the last complete cycle before AdminBaseTime is reached
> is extended so that it ends at AdminBaseTime."

So far, so good.

> The current taprio implementation does not extend the last old cycle in
> the defined manner specified in the Qbv Spec. This is part of the fix
> covered in this patch.

In the discussion on v1, I said that prior to commit a1e6ad30fa19
("net/sched: taprio: calculate guard band against actual TC gate close
time"), the last entry's next->close_time actually used to include the
oper schedule's correction, but it no longer does. This points to a
regression, and not to something that was never there. Am I wrong?

> Here are the changes made:
> 
> 1. A new API, get_cycle_time_correction(), has been added to return the

I would call an "API" an interface between two distinct software layers,
based on an agreed-upon contract. Not a function in sch_taprio.c which
is called by another function in sch_taprio.c.

> correction value. If it returns a non-initialize value, it indicates
> changes required for the next entry schedule, and upon the completion
> of the next entry's duration, entries will be loaded from the new admin
> schedule.

This paragraph doesn't really help. It gets the reader lost in
irrelevant details which are actually not that hard to deduce from the
code with some good naming. Actually I find it poor naming to say
"non-initialize value" when what you mean is "!= INIT_CYCLE_TIME_CORRECTION".
I think I would name this a "specific" or "valid" cycle correction, when
it takes a value different from CYCLE_TIME_CORRECTION_UNSPEC.

> 
> 2. Store correction values in cycle_time_correction:
> a) Positive correction value/extension
> We calculate the correction between the last operational cycle and the
> new admin base time. Note that for positive correction to take place,
> the next entry should be the last entry from oper and the new admin base
> time falls within the next cycle time of old oper.
> 
> b) Negative correction value
> The new admin base time starts earlier than the next entry's end time.
> 
> c) Zero correction value
> The new admin base time aligns exactly with the old cycle.
> 
> 3. When cycle_time_correction is set to a non-initialized value, it is
> used to:
> a) Indicate that cycle correction is active to trigger adjustments in
> affected fields like interval and cycle_time. A new API,
> cycle_corr_active(), has been created to help with this purpose.

Again, it's exaggerated to call this an "API".
Although, what you can do is provide a kerneldoc-style comment above the
functions which you wish to describe, and remove this explanation from
the commit message.

> 
> b) Transition to the new admin schedule at the beginning of
> advance_sched(). A new API, should_change_sched(), has been introduced
> for this purpose.

This should have been done since patch 1, not here.

> 4. Remove the previous definition of should_change_scheds() API. A new
> should_change_sched() API has been introduced, but it serves a
> completely different purpose than the one that was removed.

So don't name it the same!

> 
> Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
> ---
>  net/sched/sch_taprio.c | 105 +++++++++++++++++++++++++++--------------
>  1 file changed, 70 insertions(+), 35 deletions(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index dee103647823..ed32654b46f5 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -259,6 +259,14 @@ static int duration_to_length(struct taprio_sched *q, u64 duration)
>  	return div_u64(duration * PSEC_PER_NSEC, atomic64_read(&q->picos_per_byte));
>  }
>  
> +static bool cycle_corr_active(s64 cycle_time_correction)
> +{
> +	if (cycle_time_correction == INIT_CYCLE_TIME_CORRECTION)
> +		return false;
> +	else
> +		return true;
> +}
> +

Could look like this:

static bool cycle_corr_active(struct sched_gate_list *oper)
{
	return oper->cycle_time_correction != INIT_CYCLE_TIME_CORRECTION;
}

>  /* Sets sched->max_sdu[] and sched->max_frm_len[] to the minimum between the
>   * q->max_sdu[] requested by the user and the max_sdu dynamically determined by
>   * the maximum open gate durations at the given link speed.
> @@ -888,38 +896,59 @@ static bool should_restart_cycle(const struct sched_gate_list *oper,
>  	return false;
>  }
>  
> -static bool should_change_schedules(const struct sched_gate_list *admin,
> -				    const struct sched_gate_list *oper,
> -				    ktime_t end_time)
> +static bool should_change_sched(struct sched_gate_list *oper)
>  {
> -	ktime_t next_base_time, extension_time;
> +	bool change_to_admin_sched = false;
>  
> -	if (!admin)
> -		return false;
> +	if (oper->cycle_time_correction != INIT_CYCLE_TIME_CORRECTION) {
> +		/* The recent entry ran is the last one from oper */
> +		change_to_admin_sched = true;
> +		oper->cycle_time_correction = INIT_CYCLE_TIME_CORRECTION;

Don't make a function called should_change_sched() stateful. Don't make
this function reset the value of oper->cycle_time_correction, since that
practically equates with actually starting the schedule change.

The oper->cycle_time_correction assignment actually belongs to
switch_schedules(), in my opinion.

And if you make should_change_sched() stateless, then surprise-surprise,
it will contain the exact same logic, and return the exact same thing,
as cycle_corr_active(). I think that naming this single function
sched_change_pending() and providing a kerneldoc comment as to why it is
implemented the way it is should be sufficient.

> +	}
>  
> -	next_base_time = sched_base_time(admin);
> +	return change_to_admin_sched;
> +}
>  
> -	/* This is the simple case, the end_time would fall after
> -	 * the next schedule base_time.
> -	 */
> -	if (ktime_compare(next_base_time, end_time) <= 0)
> +static bool should_extend_cycle(const struct sched_gate_list *oper,
> +				ktime_t new_base_time,
> +				ktime_t entry_end_time,
> +				const struct sched_entry *entry)
> +{
> +	ktime_t next_cycle_end_time = ktime_add_ns(oper->cycle_end_time,
> +						   oper->cycle_time);
> +	bool extension_supported = oper->cycle_time_extension > 0 ? true : false;

"? true : false" is redundant. Just "extension_supported = oper->cycle_time_extension > 0"
is enough.

> +	s64 extension_limit = oper->cycle_time_extension;
> +
> +	if (extension_supported &&
> +	    list_is_last(&entry->list, &oper->entries) &&
> +	    ktime_before(new_base_time, next_cycle_end_time) &&
> +	    ktime_sub(new_base_time, entry_end_time) < extension_limit)
>  		return true;
> +	else
> +		return false;

Style nitpick:

	return extension_supported &&
	       list_is_last(&entry->list, &oper->entries) &&
	       ktime_before(new_base_time, next_cycle_end_time) &&
	       ktime_sub(new_base_time, entry_end_time) < extension_limit;

> +}
>  
> -	/* This is the cycle_time_extension case, if the end_time
> -	 * plus the amount that can be extended would fall after the
> -	 * next schedule base_time, we can extend the current schedule
> -	 * for that amount.
> -	 */
> -	extension_time = ktime_add_ns(end_time, oper->cycle_time_extension);
> +static s64 get_cycle_time_correction(const struct sched_gate_list *oper,
> +				     ktime_t new_base_time,
> +				     ktime_t entry_end_time,
> +				     const struct sched_entry *entry)
> +{
> +	s64 correction = INIT_CYCLE_TIME_CORRECTION;
>  
> -	/* FIXME: the IEEE 802.1Q-2018 Specification isn't clear about
> -	 * how precisely the extension should be made. So after
> -	 * conformance testing, this logic may change.
> -	 */
> -	if (ktime_compare(next_base_time, extension_time) <= 0)
> -		return true;
> +	if (!entry || !oper)
> +		return correction;

This function is called as follows:

	oper->cycle_time_correction = get_cycle_time_correction(oper, ...);

So, "oper" cannot be NULL if we dereference "oper" in the left hand side
of the assignment and expect the kernel not to crash, no?

"entry" - assigned from the "next" variable in advance_sched() - should
not be NULL either, from the way in which it is assigned.

>  
> -	return false;
> +	if (ktime_compare(new_base_time, entry_end_time) <= 0) {
> +		/* negative or zero correction */

At least for me, it would be helpful if you could transplant the
explanation from the commit message ("The new admin base time starts
earlier than the next entry's end time") into an expanded comment here.
I am easily confused about the "ktime_compare(a, b) <= 0" construction.

> +		correction = ktime_sub(new_base_time, entry_end_time);
> +	} else if (ktime_after(new_base_time, entry_end_time) &&
> +		   should_extend_cycle(oper, new_base_time,
> +				       entry_end_time, entry)) {
> +		/* positive correction */

Same here - move the explanation from the commit message to the comment,
please.

> +		correction = ktime_sub(new_base_time, entry_end_time);
> +	}
> +
> +	return correction;
>  }
>  
>  static enum hrtimer_restart advance_sched(struct hrtimer *timer)
> @@ -942,10 +971,8 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  	admin = rcu_dereference_protected(q->admin_sched,
>  					  lockdep_is_held(&q->current_entry_lock));
>  
> -	if (!oper || oper->cycle_time_correction != INIT_CYCLE_TIME_CORRECTION) {
> -		oper->cycle_time_correction = INIT_CYCLE_TIME_CORRECTION;
> +	if (!oper || should_change_sched(oper))
>  		switch_schedules(q, &admin, &oper);
> -	}
>  
>  	/* This can happen in two cases: 1. this is the very first run
>  	 * of this function (i.e. we weren't running any schedule
> @@ -972,6 +999,22 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  	end_time = ktime_add_ns(entry->end_time, next->interval);
>  	end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
>  
> +	if (admin) {
> +		ktime_t new_base_time = sched_base_time(admin);
> +
> +		oper->cycle_time_correction =
> +			get_cycle_time_correction(oper, new_base_time,
> +						  end_time, next);
> +
> +		if (cycle_corr_active(oper->cycle_time_correction)) {
> +			/* The next entry is the last entry we will run from
> +			 * oper, subsequent ones will take from the new admin
> +			 */
> +			oper->cycle_end_time = new_base_time;
> +			end_time = new_base_time;
> +		}
> +	}
> +
>  	for (tc = 0; tc < num_tc; tc++) {
>  		if (next->gate_duration[tc] == oper->cycle_time)
>  			next->gate_close_time[tc] = KTIME_MAX;
> @@ -980,14 +1023,6 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  								 next->gate_duration[tc]);
>  	}
>  
> -	if (should_change_schedules(admin, oper, end_time)) {
> -		/* Set things so the next time this runs, the new
> -		 * schedule runs.
> -		 */
> -		end_time = sched_base_time(admin);
> -		oper->cycle_time_correction = 0;
> -	}
> -
>  	next->end_time = end_time;
>  	taprio_set_budgets(q, oper, next);
>  
> -- 
> 2.25.1
>

^ permalink raw reply

* Re: [PATCH net v2] net: set SOCK_RCU_FREE before inserting socket into hashtable
From: Kuniyuki Iwashima @ 2023-11-08 23:30 UTC (permalink / raw)
  To: sdf; +Cc: davem, edumazet, kuba, netdev, pabeni, kuniyu
In-Reply-To: <20231108211325.18938-1-sdf@google.com>

From: Stanislav Fomichev <sdf@google.com>
Date: Wed,  8 Nov 2023 13:13:25 -0800
> We've started to see the following kernel traces:
> 
>  WARNING: CPU: 83 PID: 0 at net/core/filter.c:6641 sk_lookup+0x1bd/0x1d0
> 
>  Call Trace:
>   <IRQ>
>   __bpf_skc_lookup+0x10d/0x120
>   bpf_sk_lookup+0x48/0xd0
>   bpf_sk_lookup_tcp+0x19/0x20
>   bpf_prog_<redacted>+0x37c/0x16a3
>   cls_bpf_classify+0x205/0x2e0
>   tcf_classify+0x92/0x160
>   __netif_receive_skb_core+0xe52/0xf10
>   __netif_receive_skb_list_core+0x96/0x2b0
>   napi_complete_done+0x7b5/0xb70
>   <redacted>_poll+0x94/0xb0
>   net_rx_action+0x163/0x1d70
>   __do_softirq+0xdc/0x32e
>   asm_call_irq_on_stack+0x12/0x20
>   </IRQ>
>   do_softirq_own_stack+0x36/0x50
>   do_softirq+0x44/0x70
> 
> __inet_hash can race with lockless (rcu) readers on the other cpus:
> 
>   __inet_hash
>     __sk_nulls_add_node_rcu
>     <- (bpf triggers here)
>     sock_set_flag(SOCK_RCU_FREE)
> 
> Let's move the SOCK_RCU_FREE part up a bit, before we are inserting
> the socket into hashtables. Note, that the race is really harmless;
> the bpf callers are handling this situation (where listener socket
> doesn't have SOCK_RCU_FREE set) correctly, so the only
> annoyance is a WARN_ONCE.
> 
> More details from Eric regarding SOCK_RCU_FREE timeline:
> 
> Commit 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under
> synflood") added SOCK_RCU_FREE. At that time, the precise location of
> sock_set_flag(sk, SOCK_RCU_FREE) did not matter, because the thread calling
> __inet_hash() owns a reference on sk. SOCK_RCU_FREE was only tested
> at dismantle time.
> 
> Commit 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> started checking SOCK_RCU_FREE _after_ the lookup to infer whether
> the refcount has been taken care of.
> 
> Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>


> ---
>  net/ipv4/inet_hashtables.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 598c1b114d2c..a532f749e477 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -751,12 +751,12 @@ int __inet_hash(struct sock *sk, struct sock *osk)
>  		if (err)
>  			goto unlock;
>  	}
> +	sock_set_flag(sk, SOCK_RCU_FREE);
>  	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
>  		sk->sk_family == AF_INET6)
>  		__sk_nulls_add_node_tail_rcu(sk, &ilb2->nulls_head);
>  	else
>  		__sk_nulls_add_node_rcu(sk, &ilb2->nulls_head);
> -	sock_set_flag(sk, SOCK_RCU_FREE);
>  	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
>  unlock:
>  	spin_unlock(&ilb2->lock);
> -- 
> 2.42.0.869.gea05f2083d-goog


^ permalink raw reply

* Re: [PATCH v2 net 3/7] net/sched: taprio: update impacted fields during cycle time adjustment
From: Vladimir Oltean @ 2023-11-08 23:41 UTC (permalink / raw)
  To: Faizal Rahim
  Cc: Vinicius Costa Gomes, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, linux-kernel
In-Reply-To: <20231107112023.676016-4-faizal.abdul.rahim@linux.intel.com>

On Tue, Nov 07, 2023 at 06:20:19AM -0500, Faizal Rahim wrote:
> Update impacted fields in advance_sched() if cycle_corr_active()
> is true, which indicates that the next entry is the last entry
> from oper that it will run.
> 
> Update impacted fields:
> 
> 1. gate_duration[tc], max_open_gate_duration[tc]
> Created a new API update_open_gate_duration().The API sets the
> duration based on the last remaining entry, the original value
> was based on consideration of multiple entries.
> 
> 2. gate_close_time[tc]
> Update next entry gate close time according to the new admin
> base time
> 
> 3. max_sdu[tc], budget[tc]
> Restrict from setting to max value because there's only a single
> entry left to run from oper before changing to the new admin
> schedule, so we shouldn't set to max.
> 
> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>

The commit message shouldn't be a text-to-speech output of the commit body.
Say very shortly how the system should behave, what's wrong such that it
doesn't behave as expected, what's the user-visible impact of the bug,
and try to identify why the bug happened.

In this case, what happened is that commit a306a90c8ffe ("net/sched:
taprio: calculate tc gate durations"), which introduced the impacted
fields you are changing, never took dynamic schedule changes into
consideration. So this commit should also appear in the Fixes: tag.

> ---
>  net/sched/sch_taprio.c | 49 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index ed32654b46f5..119dec3bbe88 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -288,7 +288,8 @@ static void taprio_update_queue_max_sdu(struct taprio_sched *q,
>  		/* TC gate never closes => keep the queueMaxSDU
>  		 * selected by the user
>  		 */
> -		if (sched->max_open_gate_duration[tc] == sched->cycle_time) {
> +		if (sched->max_open_gate_duration[tc] == sched->cycle_time &&
> +		    !cycle_corr_active(sched->cycle_time_correction)) {
>  			max_sdu_dynamic = U32_MAX;
>  		} else {
>  			u32 max_frm_len;
> @@ -684,7 +685,8 @@ static void taprio_set_budgets(struct taprio_sched *q,
>  
>  	for (tc = 0; tc < num_tc; tc++) {
>  		/* Traffic classes which never close have infinite budget */
> -		if (entry->gate_duration[tc] == sched->cycle_time)
> +		if (entry->gate_duration[tc] == sched->cycle_time &&
> +		    !cycle_corr_active(sched->cycle_time_correction))
>  			budget = INT_MAX;
>  		else
>  			budget = div64_u64((u64)entry->gate_duration[tc] * PSEC_PER_NSEC,
> @@ -896,6 +898,32 @@ static bool should_restart_cycle(const struct sched_gate_list *oper,
>  	return false;
>  }
>  
> +/* Open gate duration were calculated at the beginning with consideration of
> + * multiple entries. If cycle time correction is active, there's only a single
> + * remaining entry left from oper to run.
> + * Update open gate duration based on this last entry.
> + */
> +static void update_open_gate_duration(struct sched_entry *entry,
> +				      struct sched_gate_list *oper,
> +				      int num_tc,
> +				      u64 open_gate_duration)
> +{
> +	int tc;
> +
> +	if (!entry || !oper)
> +		return;
> +
> +	for (tc = 0; tc < num_tc; tc++) {
> +		if (entry->gate_mask & BIT(tc)) {
> +			entry->gate_duration[tc] = open_gate_duration;
> +			oper->max_open_gate_duration[tc] = open_gate_duration;
> +		} else {
> +			entry->gate_duration[tc] = 0;
> +			oper->max_open_gate_duration[tc] = 0;
> +		}
> +	}
> +}
> +
>  static bool should_change_sched(struct sched_gate_list *oper)
>  {
>  	bool change_to_admin_sched = false;
> @@ -1010,13 +1038,28 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  			/* The next entry is the last entry we will run from
>  			 * oper, subsequent ones will take from the new admin
>  			 */
> +			u64 new_gate_duration =
> +				next->interval + oper->cycle_time_correction;
> +			struct qdisc_size_table *stab =
> +				rtnl_dereference(q->root->stab);
> +

The lockdep annotation for this RCU accessor is bogus.
rtnl_dereference() is the same as rcu_dereference_protected(..., lockdep_rtnl_is_held()),
which cannot be true in a hrtimer callback, as the rtnetlink lock is a
sleepable mutex and hrtimers run in atomic context.

Running with lockdep enabled will tell you as much:

$ ./test_taprio_cycle_extension.sh 
Testing config change with a delay of 5250000000 ns between schedules
[  100.734925] 
[  100.736703] =============================
[  100.740780] WARNING: suspicious RCU usage
[  100.744857] 6.6.0-10114-gca572939947f #1495 Not tainted
[  100.750162] -----------------------------
[  100.754236] net/sched/sch_taprio.c:1064 suspicious rcu_dereference_protected() usage!
[  100.762155] 
[  100.762155] other info that might help us debug this:
[  100.762155] 
[  100.770242] 
[  100.770242] rcu_scheduler_active = 2, debug_locks = 1
[  100.776851] 1 lock held by swapper/0/0:
[  100.780756]  #0: ffff3d9784b83b00 (&q->current_entry_lock){-...}-{3:3}, at: advance_sched+0x44/0x59c
[  100.790099] 
[  100.790099] stack backtrace:
[  100.794477] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-10114-gca572939947f #1495
[  100.802346] Hardware name: LS1028A RDB Board (DT)
[  100.807072] Call trace:
[  100.809531]  dump_backtrace+0xf4/0x140
[  100.813305]  show_stack+0x18/0x2c
[  100.816638]  dump_stack_lvl+0x60/0x80
[  100.820321]  dump_stack+0x18/0x24
[  100.823654]  lockdep_rcu_suspicious+0x170/0x210
[  100.828210]  advance_sched+0x384/0x59c
[  100.831978]  __hrtimer_run_queues+0x200/0x430
[  100.836360]  hrtimer_interrupt+0xdc/0x39c
[  100.840392]  arch_timer_handler_phys+0x3c/0x4c
[  100.844862]  handle_percpu_devid_irq+0xb8/0x28c
[  100.849417]  generic_handle_domain_irq+0x2c/0x44
[  100.854060]  gic_handle_irq+0x4c/0x110
[  100.857830]  call_on_irq_stack+0x24/0x4c
[  100.861775]  el1_interrupt+0x74/0xc0
[  100.865370]  el1h_64_irq_handler+0x18/0x24
[  100.869489]  el1h_64_irq+0x64/0x68
[  100.872909]  arch_local_irq_enable+0x8/0xc
[  100.877032]  cpuidle_enter+0x38/0x50
[  100.880629]  do_idle+0x1ec/0x280
[  100.883877]  cpu_startup_entry+0x34/0x38
[  100.887822]  kernel_init+0x0/0x1a0
[  100.891245]  start_kernel+0x0/0x3b0
[  100.894756]  start_kernel+0x2f8/0x3b0
[  100.898439]  __primary_switched+0xbc/0xc4

What I would do is:

			struct qdisc_size_table *stab;

			rcu_read_lock();
			stab = rcu_dereference(q->root->stab);
			taprio_update_queue_max_sdu(q, oper, stab);
			rcu_read_unlock();

>  			oper->cycle_end_time = new_base_time;
>  			end_time = new_base_time;
> +
> +			update_open_gate_duration(next, oper, num_tc,
> +						  new_gate_duration);
> +			taprio_update_queue_max_sdu(q, oper, stab);
>  		}
>  	}
>  
>  	for (tc = 0; tc < num_tc; tc++) {
> -		if (next->gate_duration[tc] == oper->cycle_time)
> +		if (cycle_corr_active(oper->cycle_time_correction) &&
> +		    (next->gate_mask & BIT(tc)))
> +			/* Set to the new base time, ensuring a smooth transition
> +			 * to the new schedule when the next entry finishes.
> +			 */
> +			next->gate_close_time[tc] = end_time;
> +		else if (next->gate_duration[tc] == oper->cycle_time)
>  			next->gate_close_time[tc] = KTIME_MAX;
>  		else
>  			next->gate_close_time[tc] = ktime_add_ns(entry->end_time,
> -- 
> 2.25.1
>

^ permalink raw reply

* Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice
From: David Wei @ 2023-11-08 23:47 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Ilias Apalodimas, Arnd Bergmann,
	David Ahern, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang,
	Pavel Begunkov
In-Reply-To: <20231106024413.2801438-5-almasrymina@google.com>

On 2023-11-05 18:44, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> rx queues on the netdevice. On the binding, the dma_buf_attach
> & dma_buf_map_attachment will occur. The entries in the sg_table from
> mapping will be inserted into a genpool to make it ready
> for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> page_pool_iov. We setup the page_pool_iov allocation size in the
> genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leaves an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
> 
> RFC v3:
> - Support multi rx-queue binding
> 
> ---
>  include/linux/netdevice.h     |  80 ++++++++++++++
>  include/net/netdev_rx_queue.h |   1 +
>  include/net/page_pool/types.h |  27 +++++
>  net/core/dev.c                | 203 ++++++++++++++++++++++++++++++++++
>  net/core/netdev-genl.c        | 116 ++++++++++++++++++-
>  5 files changed, 425 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b8bf669212cc..eeeda849115c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,8 @@
>  #include <net/net_trackers.h>
>  #include <net/net_debug.h>
>  #include <net/dropreason-core.h>
> +#include <linux/xarray.h>
> +#include <linux/refcount.h>
>  
>  struct netpoll_info;
>  struct device;
> @@ -808,6 +810,84 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
>  #endif
>  #endif /* CONFIG_RPS */
>  
> +struct netdev_dmabuf_binding {
> +	struct dma_buf *dmabuf;
> +	struct dma_buf_attachment *attachment;
> +	struct sg_table *sgt;
> +	struct net_device *dev;
> +	struct gen_pool *chunk_pool;
> +
> +	/* The user holds a ref (via the netlink API) for as long as they want
> +	 * the binding to remain alive. Each page pool using this binding holds
> +	 * a ref to keep the binding alive. Each allocated page_pool_iov holds a
> +	 * ref.
> +	 *
> +	 * The binding undos itself and unmaps the underlying dmabuf once all
> +	 * those refs are dropped and the binding is no longer desired or in
> +	 * use.
> +	 */
> +	refcount_t ref;
> +
> +	/* The portid of the user that owns this binding. Used for netlink to
> +	 * notify us of the user dropping the bind.
> +	 */
> +	u32 owner_nlportid;
> +
> +	/* The list of bindings currently active. Used for netlink to notify us
> +	 * of the user dropping the bind.
> +	 */
> +	struct list_head list;
> +
> +	/* rxq's this binding is active on. */
> +	struct xarray bound_rxq_list;
> +};
> +
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> +		       struct netdev_dmabuf_binding **out);
> +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding);
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> +				struct netdev_dmabuf_binding *binding);
> +#else
> +static inline void
> +__netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> +}
> +
> +static inline int netdev_bind_dmabuf(struct net_device *dev,
> +				     unsigned int dmabuf_fd,
> +				     struct netdev_dmabuf_binding **out)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> +}
> +
> +static inline int
> +netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> +			    struct netdev_dmabuf_binding *binding)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif
> +
> +static inline void
> +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> +{
> +	refcount_inc(&binding->ref);
> +}
> +
> +static inline void
> +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> +{
> +	if (!refcount_dec_and_test(&binding->ref))
> +		return;
> +
> +	__netdev_devmem_binding_free(binding);
> +}
> +
>  /* XPS map type and offset of the xps map within net_device->xps_maps[]. */
>  enum xps_map_type {
>  	XPS_CPUS = 0,
> diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
> index cdcafb30d437..1bfcf60a145d 100644
> --- a/include/net/netdev_rx_queue.h
> +++ b/include/net/netdev_rx_queue.h
> @@ -21,6 +21,7 @@ struct netdev_rx_queue {
>  #ifdef CONFIG_XDP_SOCKETS
>  	struct xsk_buff_pool            *pool;
>  #endif
> +	struct netdev_dmabuf_binding *binding;

@Pavel - They are using struct netdev_rx_queue to hold the binding,
which is an object that holds the state and is mapped 1:1 to an rxq.
This object is similar to our "interface queue". I wonder if we should
re-visit using this generic struct, instead of driver specific structs
e.g. bnxt_rx_ring_info?

>  } ____cacheline_aligned_in_smp;
>  
>  /*
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index d4bea053bb7e..64386325d965 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -133,6 +133,33 @@ struct pp_memory_provider_ops {
>  	bool (*release_page)(struct page_pool *pool, struct page *page);
>  };
>  
> +/* page_pool_iov support */
> +
> +/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
> + * entry from the dmabuf is inserted into the genpool as a chunk, and needs
> + * this owner struct to keep track of some metadata necessary to create
> + * allocations from this chunk.
> + */
> +struct dmabuf_genpool_chunk_owner {
> +	/* Offset into the dma-buf where this chunk starts.  */
> +	unsigned long base_virtual;
> +
> +	/* dma_addr of the start of the chunk.  */
> +	dma_addr_t base_dma_addr;
> +
> +	/* Array of page_pool_iovs for this chunk. */
> +	struct page_pool_iov *ppiovs;
> +	size_t num_ppiovs;
> +
> +	struct netdev_dmabuf_binding *binding;
> +};
> +
> +struct page_pool_iov {
> +	struct dmabuf_genpool_chunk_owner *owner;
> +
> +	refcount_t refcount;
> +};
> +
>  struct page_pool {
>  	struct page_pool_params p;
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a37a932a3e14..c8c3709d42c8 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -153,6 +153,9 @@
>  #include <linux/prandom.h>
>  #include <linux/once_lite.h>
>  #include <net/netdev_rx_queue.h>
> +#include <linux/genalloc.h>
> +#include <linux/dma-buf.h>
> +#include <net/page_pool/types.h>
>  
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -2040,6 +2043,206 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
>  	return call_netdevice_notifiers_info(val, &info.info);
>  }
>  
> +/* Device memory support */
> +
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> +					   struct gen_pool_chunk *chunk,
> +					   void *not_used)
> +{
> +	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> +
> +	kvfree(owner->ppiovs);
> +	kfree(owner);
> +}
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> +	size_t size, avail;
> +
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +
> +	size = gen_pool_size(binding->chunk_pool);
> +	avail = gen_pool_avail(binding->chunk_pool);
> +
> +	if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> +		  size, avail))
> +		gen_pool_destroy(binding->chunk_pool);
> +
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +	dma_buf_detach(binding->dmabuf, binding->attachment);
> +	dma_buf_put(binding->dmabuf);
> +	kfree(binding);
> +}
> +
> +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> +	struct netdev_rx_queue *rxq;
> +	unsigned long xa_idx;
> +
> +	if (!binding)
> +		return;
> +
> +	list_del_rcu(&binding->list);
> +
> +	xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> +		if (rxq->binding == binding)
> +			/* We hold the rtnl_lock while binding/unbinding
> +			 * dma-buf, so we can't race with another thread that
> +			 * is also modifying this value. However, the driver
> +			 * may read this config while it's creating its
> +			 * rx-queues. WRITE_ONCE() here to match the
> +			 * READ_ONCE() in the driver.
> +			 */
> +			WRITE_ONCE(rxq->binding, NULL);
> +
> +	netdev_devmem_binding_put(binding);
> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> +				struct netdev_dmabuf_binding *binding)
> +{
> +	struct netdev_rx_queue *rxq;
> +	u32 xa_idx;
> +	int err;
> +
> +	rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> +	if (rxq->binding)
> +		return -EEXIST;
> +
> +	err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +		       GFP_KERNEL);
> +	if (err)
> +		return err;
> +
> +	/*We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> +	 * race with another thread that is also modifying this value. However,
> +	 * the driver may read this config while it's creating its * rx-queues.
> +	 * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> +	 */
> +	WRITE_ONCE(rxq->binding, binding);
> +
> +	return 0;
> +}
> +
> +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> +		       struct netdev_dmabuf_binding **out)

I'm not entirely familiar with the Netlink API. Mina, do you know if we
can call into netdev_bind_dmabuf or netdev_nl_bind_rx_doit directly,
without needing to call send/recv on a Netlink socket? We likely want
io_uring to do the registration of a dmabuf fd and keep ownership over
it.

> +{
> +	struct netdev_dmabuf_binding *binding;
> +	struct scatterlist *sg;
> +	struct dma_buf *dmabuf;
> +	unsigned int sg_idx, i;
> +	unsigned long virtual;
> +	int err;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR_OR_NULL(dmabuf))
> +		return -EBADFD;
> +
> +	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +			       dev_to_node(&dev->dev));
> +	if (!binding) {
> +		err = -ENOMEM;
> +		goto err_put_dmabuf;
> +	}
> +
> +	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> +	refcount_set(&binding->ref, 1);
> +
> +	binding->dmabuf = dmabuf;
> +
> +	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> +	if (IS_ERR(binding->attachment)) {
> +		err = PTR_ERR(binding->attachment);
> +		goto err_free_binding;
> +	}
> +
> +	binding->sgt = dma_buf_map_attachment(binding->attachment,
> +					      DMA_BIDIRECTIONAL);
> +	if (IS_ERR(binding->sgt)) {
> +		err = PTR_ERR(binding->sgt);
> +		goto err_detach;
> +	}
> +
> +	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> +	 * binding can be much more flexible than that. We may be able to
> +	 * allocate MTU sized chunks here. Leave that for future work...
> +	 */
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
> +	if (!binding->chunk_pool) {
> +		err = -ENOMEM;
> +		goto err_unmap;
> +	}
> +
> +	virtual = 0;
> +	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> +		dma_addr_t dma_addr = sg_dma_address(sg);
> +		struct dmabuf_genpool_chunk_owner *owner;
> +		size_t len = sg_dma_len(sg);
> +		struct page_pool_iov *ppiov;
> +
> +		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> +				     dev_to_node(&dev->dev));
> +		owner->base_virtual = virtual;
> +		owner->base_dma_addr = dma_addr;
> +		owner->num_ppiovs = len / PAGE_SIZE;
> +		owner->binding = binding;
> +
> +		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> +					 dma_addr, len, dev_to_node(&dev->dev),
> +					 owner);
> +		if (err) {
> +			err = -EINVAL;
> +			goto err_free_chunks;
> +		}
> +
> +		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> +					       sizeof(*owner->ppiovs),
> +					       GFP_KERNEL);
> +		if (!owner->ppiovs) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;
> +		}
> +
> +		for (i = 0; i < owner->num_ppiovs; i++) {
> +			ppiov = &owner->ppiovs[i];
> +			ppiov->owner = owner;
> +			refcount_set(&ppiov->refcount, 1);
> +		}
> +
> +		dma_addr += len;
> +		virtual += len;
> +	}
> +
> +	*out = binding;
> +
> +	return 0;
> +
> +err_free_chunks:
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +	gen_pool_destroy(binding->chunk_pool);
> +err_unmap:
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +err_detach:
> +	dma_buf_detach(dmabuf, binding->attachment);
> +err_free_binding:
> +	kfree(binding);
> +err_put_dmabuf:
> +	dma_buf_put(dmabuf);
> +	return err;
> +}
> +#endif
> +
>  #ifdef CONFIG_NET_INGRESS
>  static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
>  
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 59d3d512d9cc..2c2a62593217 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -129,10 +129,89 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
>  	return skb->len;
>  }
>  
> -/* Stub */
> +static LIST_HEAD(netdev_rbinding_list);
> +
>  int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
>  {
> -	return 0;
> +	struct netdev_dmabuf_binding *out_binding;
> +	u32 ifindex, dmabuf_fd, rxq_idx;
> +	struct net_device *netdev;
> +	struct sk_buff *rsp;
> +	int rem, err = 0;
> +	void *hdr;
> +	struct nlattr *attr;
> +
> +	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUES))
> +		return -EINVAL;
> +
> +	ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> +	dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]);
> +
> +	rtnl_lock();
> +
> +	netdev = __dev_get_by_index(genl_info_net(info), ifindex);
> +	if (!netdev) {
> +		err = -ENODEV;
> +		goto err_unlock;
> +	}
> +
> +	err = netdev_bind_dmabuf(netdev, dmabuf_fd, &out_binding);
> +	if (err)
> +		goto err_unlock;
> +
> +	nla_for_each_attr(attr, genlmsg_data(info->genlhdr),
> +			  genlmsg_len(info->genlhdr), rem) {
> +		switch (nla_type(attr)) {
> +		case NETDEV_A_BIND_DMABUF_QUEUES:
> +			rxq_idx = nla_get_u32(attr);
> +
> +			if (rxq_idx >= netdev->num_rx_queues) {
> +				err = -ERANGE;
> +				goto err_unbind;
> +			}
> +
> +			err = netdev_bind_dmabuf_to_queue(netdev, rxq_idx,
> +							  out_binding);
> +			if (err)
> +				goto err_unbind;
> +
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +
> +	out_binding->owner_nlportid = info->snd_portid;
> +	list_add_rcu(&out_binding->list, &netdev_rbinding_list);
> +
> +	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!rsp) {
> +		err = -ENOMEM;
> +		goto err_unbind;
> +	}
> +
> +	hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq,
> +			  &netdev_nl_family, 0, info->genlhdr->cmd);
> +	if (!hdr) {
> +		err = -EMSGSIZE;
> +		goto err_genlmsg_free;
> +	}
> +
> +	genlmsg_end(rsp, hdr);
> +
> +	rtnl_unlock();
> +
> +	return genlmsg_reply(rsp, info);
> +
> +err_genlmsg_free:
> +	nlmsg_free(rsp);
> +err_unbind:
> +	netdev_unbind_dmabuf(out_binding);
> +err_unlock:
> +	rtnl_unlock();
> +	return err;
>  }
>  
>  static int netdev_genl_netdevice_event(struct notifier_block *nb,
> @@ -155,10 +234,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
> +				 void *_notify)
> +{
> +	struct netlink_notify *notify = _notify;
> +	struct netdev_dmabuf_binding *rbinding;
> +
> +	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
> +		return NOTIFY_DONE;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
> +		if (rbinding->owner_nlportid == notify->portid) {
> +			netdev_unbind_dmabuf(rbinding);
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return NOTIFY_OK;
> +}
> +
>  static struct notifier_block netdev_genl_nb = {
>  	.notifier_call	= netdev_genl_netdevice_event,
>  };
>  
> +static struct notifier_block netdev_netlink_notifier = {
> +	.notifier_call = netdev_netlink_notify,
> +};

Is this mechamism what cleans up TCP devmem in case userspace crashes
and the associated Netlink socket is closed?

> +
>  static int __init netdev_genl_init(void)
>  {
>  	int err;
> @@ -171,8 +277,14 @@ static int __init netdev_genl_init(void)
>  	if (err)
>  		goto err_unreg_ntf;
>  
> +	err = netlink_register_notifier(&netdev_netlink_notifier);
> +	if (err)
> +		goto err_unreg_family;
> +
>  	return 0;
>  
> +err_unreg_family:
> +	genl_unregister_family(&netdev_nl_family);
>  err_unreg_ntf:
>  	unregister_netdevice_notifier(&netdev_genl_nb);
>  	return err;

^ permalink raw reply

* [RFC net-next] net: don't dump stack on queue timeout
From: Jakub Kicinski @ 2023-11-09  0:09 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, Jakub Kicinski,
	syzbot+d55372214aff0faa1f1f, jhs, xiyou.wangcong, jiri

The top syzbot report for networking (#14 for the entire kernel)
is the queue timeout splat. We kept it around for a long time,
because in real life it provides pretty strong signal that
something is wrong with the driver or the device.

Removing it is also likely to break monitoring for those who
track it as a kernel warning.

Nevertheless, WARN()ings are best suited for catching kernel
programming bugs. If a Tx queue gets starved due to a pause
storm, priority configuration, or other weirdness - that's
obviously a problem, but not a problem we can fix at
the kernel level.

Bite the bullet and convert the WARN() to a print.

Before:

  NETDEV WATCHDOG: eni1np1 (netdevsim): transmit queue 0 timed out 1975 ms
  WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x39e/0x3b0
  [... completely pointless stack trace of a timer follows ...]

Now:

  netdevsim netdevsim1 eni1np1: NETDEV WATCHDOG: CPU: 0: transmit queue 0 timed out 1769 ms

Alternatively we could mark the drivers which syzbot has
learned to abuse as "print-instead-of-WARN" selectively.

Reported-by: syzbot+d55372214aff0faa1f1f@syzkaller.appspotmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jhs@mojatatu.com
CC: xiyou.wangcong@gmail.com
CC: jiri@resnulli.us
---
 net/sched/sch_generic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4195a4bc26ca..8dd0e5925342 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
 
 			if (unlikely(timedout_ms)) {
 				trace_net_dev_xmit_timeout(dev, i);
-				WARN_ONCE(1, "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out %u ms\n",
-					  dev->name, netdev_drivername(dev), i, timedout_ms);
+				netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
+					    raw_smp_processor_id(),
+					    i, timedout_ms);
 				netif_freeze_queues(dev);
 				dev->netdev_ops->ndo_tx_timeout(dev, i);
 				netif_unfreeze_queues(dev);
-- 
2.41.0


^ permalink raw reply related

* Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
From: David Wei @ 2023-11-09  1:00 UTC (permalink / raw)
  To: David Ahern, Mina Almasry
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang,
	Pavel Begunkov
In-Reply-To: <a5b95e6b-8716-4e2e-9183-959b754b5b5e@kernel.org>

On 2023-11-07 14:55, David Ahern wrote:
> On 11/7/23 3:10 PM, Mina Almasry wrote:
>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern@kernel.org> wrote:
>>>
>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index eeeda849115c..1c351c138a5b 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>>  };
>>>>
>>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>>> +struct page_pool_iov *
>>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>
>>> netdev_{alloc,free}_dmabuf?
>>>
>>
>> Can do.
>>
>>> I say that because a dmabuf can be host memory, at least I am not aware
>>> of a restriction that a dmabuf is device memory.
>>>
>>
>> In my limited experience dma-buf is generally device memory, and
>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>> dma-buf with a memfd which I think is used for testing. But I can do
>> the rename, it's more clear anyway, I think.
> 
> config UDMABUF
>         bool "userspace dmabuf misc driver"
>         default n
>         depends on DMA_SHARED_BUFFER
>         depends on MEMFD_CREATE || COMPILE_TEST
>         help
>           A driver to let userspace turn memfd regions into dma-bufs.
>           Qemu can use this to create host dmabufs for guest framebuffers.
> 
> 
> Qemu is just a userspace process; it is no way a special one.
> 
> Treating host memory as a dmabuf should radically simplify the io_uring
> extension of this set. That the io_uring set needs to dive into
> page_pools is just wrong - complicating the design and code and pushing
> io_uring into a realm it does not need to be involved in.

I think our io_uring proposal will already be vastly simplified once we
rebase onto Kuba's page pool memory provider API. Using udmabuf means
depending on a driver designed for testing, vs io_uring's registered
buffers API that's been tried and tested.

I don't have an intuitive understanding of the trade offs yet, and would
need to try out udmabuf and compare vs say using our own page pool
memory provider.

> 
> Most (all?) of this patch set can work with any memory; only device
> memory is unreadable.
> 
> 

^ permalink raw reply

* Re: [PATCH 02/22] [RESEND^2] jffs2: mark __jffs2_dbg_superblock_counts() static
From: Zhihao Cheng @ 2023-11-09  1:07 UTC (permalink / raw)
  To: Arnd Bergmann, Andrew Morton, linux-kernel, Masahiro Yamada,
	linux-kbuild
  Cc: Arnd Bergmann, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Guo Ren, Peter Zijlstra, Ard Biesheuvel,
	Huacai Chen, Greg Ungerer, Michal Simek, Thomas Bogendoerfer,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Geoff Levand, Palmer Dabbelt, Heiko Carstens,
	John Paul Adrian Glaubitz, David S. Miller, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, x86, Helge Deller, Sudip Mukherjee,
	Greg Kroah-Hartman, Timur Tabi, Kent Overstreet, David Woodhouse,
	Naveen N. Rao, Anil S Keshavamurthy, Kees Cook, Vincenzo Frascino,
	Juri Lelli, Vincent Guittot, Nathan Chancellor, Nick Desaulniers,
	Nicolas Schier, Al Viro, Uwe Kleine-König, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-trace-kernel, linux-csky,
	loongarch, linux-m68k, linux-mips, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, netdev, linux-parisc, linux-usb,
	linux-fbdev, dri-devel, linux-bcachefs, linux-mtd, Tudor Ambarus
In-Reply-To: <20231108125843.3806765-3-arnd@kernel.org>

在 2023/11/8 20:58, Arnd Bergmann 写道:
> From: Arnd Bergmann <arnd@arndb.de>
>
> This function is only called locally and does not need to be
> global. Since there is no external prototype, gcc warns about
> the non-static definition:
>
> fs/jffs2/debug.c:160:6: error: no previous prototype for '__jffs2_dbg_superblock_counts' [-Werror=missing-prototypes]
>
> Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>   fs/jffs2/debug.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>

^ permalink raw reply

* Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
From: David Wei @ 2023-11-09  1:15 UTC (permalink / raw)
  To: Mina Almasry, David Ahern
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang,
	Pavel Begunkov
In-Reply-To: <CAHS8izMKDOw5_y2MLRfuJHs=ai+sZ6GF7Rg1NuR_JqONg-5u5Q@mail.gmail.com>

On 2023-11-07 15:03, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 2:55 PM David Ahern <dsahern@kernel.org> wrote:
>>
>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern@kernel.org> wrote:
>>>>
>>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>> index eeeda849115c..1c351c138a5b 100644
>>>>> --- a/include/linux/netdevice.h
>>>>> +++ b/include/linux/netdevice.h
>>>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>>>  };
>>>>>
>>>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>>>> +struct page_pool_iov *
>>>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>>
>>>> netdev_{alloc,free}_dmabuf?
>>>>
>>>
>>> Can do.
>>>
>>>> I say that because a dmabuf can be host memory, at least I am not aware
>>>> of a restriction that a dmabuf is device memory.
>>>>
>>>
>>> In my limited experience dma-buf is generally device memory, and
>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>> dma-buf with a memfd which I think is used for testing. But I can do
>>> the rename, it's more clear anyway, I think.
>>
>> config UDMABUF
>>         bool "userspace dmabuf misc driver"
>>         default n
>>         depends on DMA_SHARED_BUFFER
>>         depends on MEMFD_CREATE || COMPILE_TEST
>>         help
>>           A driver to let userspace turn memfd regions into dma-bufs.
>>           Qemu can use this to create host dmabufs for guest framebuffers.
>>
>>
>> Qemu is just a userspace process; it is no way a special one.
>>
>> Treating host memory as a dmabuf should radically simplify the io_uring
>> extension of this set.
> 
> I agree actually, and I was about to make that comment to David Wei's
> series once I have the time.
> 
> David, your io_uring RX zerocopy proposal actually works with devmem
> TCP, if you're inclined to do that instead, what you'd do roughly is
> (I think):
> 
> - Allocate a memfd,
> - Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
> - Bind the dma-buf to the NIC using the netlink API in this RFC.
> - Your io_uring extensions and io_uring uapi should work as-is almost
> on top of this series, I think.
> 
> If you do this the incoming packets should land into your memfd, which
> may or may not work for you. In the future if you feel inclined to use
> device memory, this approach that I'm describing here would be more
> extensible to device memory, because you'd already be using dma-bufs
> for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
> with another.
> 

How would TCP devmem change if we no longer assume that dmabuf is device
memory? Pavel will know more on the perf side, but I wouldn't want to
put any if/else on the hot path if we can avoid it. I could be wrong,
but right now in my mind using different memory providers solves this
neatly and the driver/networking stack doesn't need to care.

Mina, I believe you said at NetDev conf that you already had an udmabuf
implementation for testing. I would like to see this (you can send
privately) to see how TCP devmem would handle both user memory and
device memory.

>> That the io_uring set needs to dive into
>> page_pools is just wrong - complicating the design and code and pushing
>> io_uring into a realm it does not need to be involved in.
>>
>> Most (all?) of this patch set can work with any memory; only device
>> memory is unreadable.
>>
>>
> 
> 

^ permalink raw reply

* Re: [ANN] netdev development stats for 6.7
From: Hangbin Liu @ 2023-11-09  1:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, netdev, netdev-driver-reviewers, Stanislav Fomichev,
	Simon Horman, Toke Høiland-Jørgensen
In-Reply-To: <20231108083307.364cfe91@kernel.org>

Hi Jakub,
On Wed, Nov 08, 2023 at 08:33:07AM -0800, Jakub Kicinski wrote:
> Now, CNCF has a similar setup: https://github.com/cncf/gitdm
> and they do share their database. So I use that, plus my local hacky
> mapping. Unfortunately the CNCF DB is not very up to date for kernel
> folks.
> 
> Hangbin, according to CNCF you're at Red Hat, which seems sane, and
> that's how I count you :)

Thanks for this info. Glad to know my email and company are mapping correctly.

> I brought creating a public DB up at Linux Foundation TAB meetings,
> but after some poking there's no movement.
> 
> It would be great if Linux Foundation helped the community with the
> developer/company DB, which is ACTUALLY USEFUL BEFORE WASTING TIME ON
> SOME WEB STUFF THAT DOESN'T WORK FOR THE KERNEL.

I personally agree Linux Foundation should maintain a developer/company DB.
The developer could submit their information on a voluntary basis, instead of
letting some tools search the website and collect data.

Thanks
Hangbin

^ permalink raw reply

* Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
From: Mina Almasry @ 2023-11-09  1:41 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: David Ahern, netdev, linux-kernel, linux-arch, linux-kselftest,
	linux-media, dri-devel, linaro-mm-sig, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
	Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <6c629d6d-6927-3857-edaa-1971a94b6e93@huawei.com>

> > On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/11/6 10:44, Mina Almasry wrote:
> >>> +
> >>> +void netdev_free_devmem(struct page_pool_iov *ppiov)
> >>> +{
> >>> +     struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> >>> +
> >>> +     refcount_set(&ppiov->refcount, 1);
> >>> +
> >>> +     if (gen_pool_has_addr(binding->chunk_pool,
> >>> +                           page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
> >>
> >> When gen_pool_has_addr() returns false, does it mean something has gone
> >> really wrong here?
> >>
> >
> > Yes, good eye. gen_pool_has_addr() should never return false, but then
> > again, gen_pool_free()  BUG_ON()s if it doesn't find the address,
> > which is an extremely severe reaction to what can be a minor bug in
> > the accounting. I prefer to leak rather than crash the machine. It's a
> > bit of defensive programming that is normally frowned upon, but I feel
> > like in this case it's maybe warranted due to the very severe reaction
> > (BUG_ON).
>
> I would argue that why is the above defensive programming not done in the
> gen_pool core:)
>

I think gen_pool is not really not that new, and suggesting removing
the BUG_ONs must have been proposed before and rejected. I'll try to
do some research and maybe suggest downgrading the BUG_ON to WARN_ON,
but my guess is there is some reason the maintainer wants it to be a
BUG_ON.

On Wed, Nov 8, 2023 at 5:00 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2023-11-07 14:55, David Ahern wrote:
> > On 11/7/23 3:10 PM, Mina Almasry wrote:
> >> On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern@kernel.org> wrote:
> >>>
> >>> On 11/5/23 7:44 PM, Mina Almasry wrote:
> >>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>>> index eeeda849115c..1c351c138a5b 100644
> >>>> --- a/include/linux/netdevice.h
> >>>> +++ b/include/linux/netdevice.h
> >>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
> >>>>  };
> >>>>
> >>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
> >>>> +struct page_pool_iov *
> >>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> >>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
> >>>
> >>> netdev_{alloc,free}_dmabuf?
> >>>
> >>
> >> Can do.
> >>
> >>> I say that because a dmabuf can be host memory, at least I am not aware
> >>> of a restriction that a dmabuf is device memory.
> >>>
> >>
> >> In my limited experience dma-buf is generally device memory, and
> >> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> >> dma-buf with a memfd which I think is used for testing. But I can do
> >> the rename, it's more clear anyway, I think.
> >
> > config UDMABUF
> >         bool "userspace dmabuf misc driver"
> >         default n
> >         depends on DMA_SHARED_BUFFER
> >         depends on MEMFD_CREATE || COMPILE_TEST
> >         help
> >           A driver to let userspace turn memfd regions into dma-bufs.
> >           Qemu can use this to create host dmabufs for guest framebuffers.
> >
> >
> > Qemu is just a userspace process; it is no way a special one.
> >
> > Treating host memory as a dmabuf should radically simplify the io_uring
> > extension of this set. That the io_uring set needs to dive into
> > page_pools is just wrong - complicating the design and code and pushing
> > io_uring into a realm it does not need to be involved in.
>
> I think our io_uring proposal will already be vastly simplified once we
> rebase onto Kuba's page pool memory provider API. Using udmabuf means
> depending on a driver designed for testing, vs io_uring's registered
> buffers API that's been tried and tested.
>

FWIW I also get an impression that udmabuf is mostly targeting
testing, but I'm not aware of any deficiency that makes it concretely
unsuitable for you. You be the judge.

The only quirk of udmabuf I'm aware of is that it seems to cap the max
dma-buf size to 16000 pages. Not sure if that's due to a genuine
technical limitation or just convenience.

> I don't have an intuitive understanding of the trade offs yet, and would
> need to try out udmabuf and compare vs say using our own page pool
> memory provider.
>


On Wed, Nov 8, 2023 at 5:15 PM David Wei <dw@davidwei.uk> wrote:
> How would TCP devmem change if we no longer assume that dmabuf is device
> memory?

It wouldn't. The code already never assumes that dmabuf is device
memory. Any dma-buf should work, as far as I can tell. I'm also quite
confident udmabuf works, I use it for testing.

(Jason Gunthrope is much more of an expert and may chime in to say
'some dma-buf will not work'. My primitive understanding is that we're
using dma-bufs without any quirks and any dma-buf should work. I of
course haven't tested all dma-bufs :D)

> Pavel will know more on the perf side, but I wouldn't want to
> put any if/else on the hot path if we can avoid it. I could be wrong,
> but right now in my mind using different memory providers solves this
> neatly and the driver/networking stack doesn't need to care.
>
> Mina, I believe you said at NetDev conf that you already had an udmabuf
> implementation for testing. I would like to see this (you can send
> privately) to see how TCP devmem would handle both user memory and
> device memory.
>

There is nothing to send privately. The patch series you're looking at
supports udma-buf as-is, and the kselftest included with the series
demonstrates devmem TCP working with udmabuf.

The only thing missing from this series is the driver support. You can
see the GVE driver support for devmem TCP here:

https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3

You may need to implement devmem TCP for your driver before you can
reproduce udmabuf working for yourself, though.

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [PATCH] Documentation: Document the Netlink spec
From: Jakub Kicinski @ 2023-11-09  1:43 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Breno Leitao, linux-doc, netdev, pabeni, edumazet
In-Reply-To: <875y2cxa6n.fsf@meer.lwn.net>

On Wed, 08 Nov 2023 13:27:28 -0700 Jonathan Corbet wrote:
> I do have to wonder, though, whether a sphinx extension is the right way
> to solve this problem.  You're essentially implementing a filter that
> turns one YAML file into one RST file; might it be better to keep that
> outside of sphinx as a standalone script, invoked by the Makefile?

If we're considering other ways of generating the files - I'd also like
to voice a weak preference towards removing the need for the "stub"
files.

Get all the docs rendered under Documentation/netlink/ with an
auto-generated index.

This way newcomers won't have to remember to add a stub to get the doc
rendered. One fewer thing to worry about during review.

^ permalink raw reply

* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Jakub Kicinski @ 2023-11-09  1:52 UTC (permalink / raw)
  To: Nelson, Shannon
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	David Ahern, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko
In-Reply-To: <fa95d5d0-35c0-497e-aea8-a35f9f6304f4@amd.com>

On Wed, 8 Nov 2023 13:30:21 -0800 Nelson, Shannon wrote:
> > Another source of a bug like this could be that your driver does not in
> > fact call xdp_do_flush() before exiting its NAPI cycle, so that there
> > will be packets from the previous cycle in the bq queue, in which case
> > the assumption mentioned in the linked document obviously breaks down.
> > But that would also be a driver bug :)  
> 
> We do call the xdp_do_flush() at the end of the NAPI cycle, just before 
> calling napi_complete_done().

Just to be sure - flush has to happen on every cycle, not only
before calling napi_complete_done().

^ permalink raw reply

* Re: [PATCH net] net/sched: act_ct: Always fill offloading tuple iifidx
From: patchwork-bot+netdevbpf @ 2023-11-09  2:00 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: davem, kuba, pabeni, netdev, jhs, xiyou.wangcong, jiri, pablo,
	paulb
In-Reply-To: <20231103151410.764271-1-vladbu@nvidia.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 3 Nov 2023 16:14:10 +0100 you wrote:
> Referenced commit doesn't always set iifidx when offloading the flow to
> hardware. Fix the following cases:
> 
> - nf_conn_act_ct_ext_fill() is called before extension is created with
> nf_conn_act_ct_ext_add() in tcf_ct_act(). This can cause rule offload with
> unspecified iifidx when connection is offloaded after only single
> original-direction packet has been processed by tc data path. Always fill
> the new nf_conn_act_ct_ext instance after creating it in
> nf_conn_act_ct_ext_add().
> 
> [...]

Here is the summary with links:
  - [net] net/sched: act_ct: Always fill offloading tuple iifidx
    https://git.kernel.org/netdev/net/c/9bc64bd0cd76

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH][ncsi] Revert NCSI link loss/gain commit
From: Jakub Kicinski @ 2023-11-09  2:06 UTC (permalink / raw)
  To: Johnathan Mantey; +Cc: netdev
In-Reply-To: <20231108222943.4098887-1-johnathanx.mantey@intel.com>

On Wed, 8 Nov 2023 14:29:43 -0800 Johnathan Mantey wrote:
> The NCSI commit
> ncsi: Propagate carrier gain/loss events to the NCSI controller
> introduced unwanted behavior.
> 
> The intent for the commit was to be able to detect carrier loss/gain
> for just the NIC connected to the BMC. The unwanted effect is a
> carrier loss for auxiliary paths also causes the BMC to lose
> carrier. The BMC never regains carrier despite the secondary NIC
> regaining a link.
> 
> This change, when merged, needs to be backported to stable kernels.

You need to add a Fixes tag (pointing at the reverted change), 
and a CC: stable tag. Here's an example of a well formatted fix:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=02d5fdbf4f2b8c406f7a4c98fa52aa181a11d733

When you repost make sure you use get_maintainers on the patch file, 
to catch all reviewers. And put [PATCH net v2] ncsi: Revert...
as the subject.
-- 
pw-bot: cr

^ permalink raw reply

* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Nelson, Shannon @ 2023-11-09  2:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	David Ahern, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko
In-Reply-To: <20231108175215.351d22bf@kernel.org>

On 11/8/2023 5:52 PM, Jakub Kicinski wrote:
> 
> On Wed, 8 Nov 2023 13:30:21 -0800 Nelson, Shannon wrote:
>>> Another source of a bug like this could be that your driver does not in
>>> fact call xdp_do_flush() before exiting its NAPI cycle, so that there
>>> will be packets from the previous cycle in the bq queue, in which case
>>> the assumption mentioned in the linked document obviously breaks down.
>>> But that would also be a driver bug :)
>>
>> We do call the xdp_do_flush() at the end of the NAPI cycle, just before
>> calling napi_complete_done().
> 
> Just to be sure - flush has to happen on every cycle, not only
> before calling napi_complete_done().

Ah, good catch.  The notes in redirect.html say "Before exiting its NAPI 
loop" and "must be called before napi_complete_done()".  I interpreted 
those together.

We'll make sure we do the flush at the end of every budget cycle and see 
what happens.

Thanks all.  This is exactly why talking out your problems with others 
is a very good thing.  And yes, this eventually will make it into an 
upstream patch set - after we do a bunch more testing.

Cheers,
sln

^ permalink raw reply

* Re: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it to user-mode
From: Jakub Kicinski @ 2023-11-09  2:13 UTC (permalink / raw)
  To: longli
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	David S. Miller, Eric Dumazet, Paolo Abeni, linux-hyperv, netdev,
	linux-kernel, Long Li
In-Reply-To: <1699484212-24079-1-git-send-email-longli@linuxonhyperv.com>

On Wed,  8 Nov 2023 14:56:52 -0800 longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> When a VF is being exposed form the kernel, it should be marked as "slave"
> before exposing to the user-mode. The VF is not usable without netvsc running
> as master. The user-mode should never see a VF without the "slave" flag.
> 
> An example of a user-mode program depending on this flag is cloud-init
> (https://github.com/canonical/cloud-init/blob/19.3/cloudinit/net/__init__.py)

Quick grep for "flags", "priv" and "slave" doesn't show anything.
Can you point me to the line of code?

> When scanning interfaces, it checks on if this interface has a master to
> decide if it should be configured. There are other user-mode programs perform
> similar checks.
> 
> This commit moves the code of setting the slave flag to the time before VF is
> exposed to user-mode.

> Change since v3:
> Change target to net-next.

You don't consider this a fix? It seems like a race condition.

> -		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
> -			netdev_notice(vf_netdev,
> -				      "falling back to mac addr based matching\n");
> +		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr) ||
> +		    ether_addr_equal(vf_netdev->dev_addr, ndev->perm_addr))

This change doesn't seem to be described in the commit message.

Please note that we have a rule against reposting patches within 24h:

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#resending-after-review

^ permalink raw reply

* Re: pull-request: bpf 2023-11-08
From: patchwork-bot+netdevbpf @ 2023-11-09  2:20 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, kuba, pabeni, edumazet, ast, andrii, martin.lau, netdev,
	bpf
In-Reply-To: <20231108132448.1970-1-daniel@iogearbox.net>

Hello:

This pull request was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed,  8 Nov 2023 14:24:48 +0100 you wrote:
> Hi David, hi Jakub, hi Paolo, hi Eric,
> 
> The following pull-request contains BPF updates for your *net* tree.
> 
> We've added 16 non-merge commits during the last 6 day(s) which contain
> a total of 30 files changed, 341 insertions(+), 130 deletions(-).
> 
> [...]

Here is the summary with links:
  - pull-request: bpf 2023-11-08
    https://git.kernel.org/netdev/net/c/942b8b38de3f

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice
From: Mina Almasry @ 2023-11-09  2:22 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <c1b689bd-a05b-85e9-0ce4-7264c818c2dc@huawei.com>

On Tue, Nov 7, 2023 at 7:40 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/11/8 5:59, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 11:46 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/11/6 10:44, Mina Almasry wrote:
> >>> +
> >>> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> >>> +{
> >>> +     size_t size, avail;
> >>> +
> >>> +     gen_pool_for_each_chunk(binding->chunk_pool,
> >>> +                             netdev_devmem_free_chunk_owner, NULL);
> >>> +
> >>> +     size = gen_pool_size(binding->chunk_pool);
> >>> +     avail = gen_pool_avail(binding->chunk_pool);
> >>> +
> >>> +     if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> >>> +               size, avail))
> >>> +             gen_pool_destroy(binding->chunk_pool);
> >>
> >>
> >> Is there any other place calling the gen_pool_destroy() when the above
> >> warning is triggered? Do we have a leaking for binding->chunk_pool?
> >>
> >
> > gen_pool_destroy BUG_ON() if it's not empty at the time of destroying.
> > Technically that should never happen, because
> > __netdev_devmem_binding_free() should only be called when the refcount
> > hits 0, so all the chunks have been freed back to the gen_pool. But,
> > just in case, I don't want to crash the server just because I'm
> > leaking a chunk... this is a bit of defensive programming that is
> > typically frowned upon, but the behavior of gen_pool is so severe I
> > think the WARN() + check is warranted here.
>
> It seems it is pretty normal for the above to happen nowadays because of
> retransmits timeouts, NAPI defer schemes mentioned below:
>
> https://lkml.kernel.org/netdev/168269854650.2191653.8465259808498269815.stgit@firesoul/
>
> And currently page pool core handles that by using a workqueue.

Forgive me but I'm not understanding the concern here.

__netdev_devmem_binding_free() is called when binding->ref hits 0.

binding->ref is incremented when an iov slice of the dma-buf is
allocated, and decremented when an iov is freed. So,
__netdev_devmem_binding_free() can't really be called unless all the
iovs have been freed, and gen_pool_size() == gen_pool_avail(),
regardless of what's happening on the page_pool side of things, right?

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice
From: Mina Almasry @ 2023-11-09  2:25 UTC (permalink / raw)
  To: David Wei
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Ilias Apalodimas, Arnd Bergmann, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang, Pavel Begunkov
In-Reply-To: <7769b74d-dd23-41de-8e11-434a0acabf72@davidwei.uk>

On Wed, Nov 8, 2023 at 3:47 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2023-11-05 18:44, Mina Almasry wrote:
> > Add a netdev_dmabuf_binding struct which represents the
> > dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> > rx queues on the netdevice. On the binding, the dma_buf_attach
> > & dma_buf_map_attachment will occur. The entries in the sg_table from
> > mapping will be inserted into a genpool to make it ready
> > for allocation.
> >
> > The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> > holds the dma-buf offset of the base of the chunk and the dma_addr of
> > the chunk. Both are needed to use allocations that come from this chunk.
> >
> > We create a new type that represents an allocation from the genpool:
> > page_pool_iov. We setup the page_pool_iov allocation size in the
> > genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> > allocated by the page pool and given to the drivers.
> >
> > The user can unbind the dmabuf from the netdevice by closing the netlink
> > socket that established the binding. We do this so that the binding is
> > automatically unbound even if the userspace process crashes.
> >
> > The binding and unbinding leaves an indicator in struct netdev_rx_queue
> > that the given queue is bound, but the binding doesn't take effect until
> > the driver actually reconfigures its queues, and re-initializes its page
> > pool.
> >
> > The netdev_dmabuf_binding struct is refcounted, and releases its
> > resources only when all the refs are released.
> >
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > RFC v3:
> > - Support multi rx-queue binding
> >
> > ---
> >  include/linux/netdevice.h     |  80 ++++++++++++++
> >  include/net/netdev_rx_queue.h |   1 +
> >  include/net/page_pool/types.h |  27 +++++
> >  net/core/dev.c                | 203 ++++++++++++++++++++++++++++++++++
> >  net/core/netdev-genl.c        | 116 ++++++++++++++++++-
> >  5 files changed, 425 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b8bf669212cc..eeeda849115c 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -52,6 +52,8 @@
> >  #include <net/net_trackers.h>
> >  #include <net/net_debug.h>
> >  #include <net/dropreason-core.h>
> > +#include <linux/xarray.h>
> > +#include <linux/refcount.h>
> >
> >  struct netpoll_info;
> >  struct device;
> > @@ -808,6 +810,84 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
> >  #endif
> >  #endif /* CONFIG_RPS */
> >
> > +struct netdev_dmabuf_binding {
> > +     struct dma_buf *dmabuf;
> > +     struct dma_buf_attachment *attachment;
> > +     struct sg_table *sgt;
> > +     struct net_device *dev;
> > +     struct gen_pool *chunk_pool;
> > +
> > +     /* The user holds a ref (via the netlink API) for as long as they want
> > +      * the binding to remain alive. Each page pool using this binding holds
> > +      * a ref to keep the binding alive. Each allocated page_pool_iov holds a
> > +      * ref.
> > +      *
> > +      * The binding undos itself and unmaps the underlying dmabuf once all
> > +      * those refs are dropped and the binding is no longer desired or in
> > +      * use.
> > +      */
> > +     refcount_t ref;
> > +
> > +     /* The portid of the user that owns this binding. Used for netlink to
> > +      * notify us of the user dropping the bind.
> > +      */
> > +     u32 owner_nlportid;
> > +
> > +     /* The list of bindings currently active. Used for netlink to notify us
> > +      * of the user dropping the bind.
> > +      */
> > +     struct list_head list;
> > +
> > +     /* rxq's this binding is active on. */
> > +     struct xarray bound_rxq_list;
> > +};
> > +
> > +#ifdef CONFIG_DMA_SHARED_BUFFER
> > +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> > +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > +                    struct netdev_dmabuf_binding **out);
> > +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding);
> > +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > +                             struct netdev_dmabuf_binding *binding);
> > +#else
> > +static inline void
> > +__netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> > +{
> > +}
> > +
> > +static inline int netdev_bind_dmabuf(struct net_device *dev,
> > +                                  unsigned int dmabuf_fd,
> > +                                  struct netdev_dmabuf_binding **out)
> > +{
> > +     return -EOPNOTSUPP;
> > +}
> > +static inline void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> > +{
> > +}
> > +
> > +static inline int
> > +netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > +                         struct netdev_dmabuf_binding *binding)
> > +{
> > +     return -EOPNOTSUPP;
> > +}
> > +#endif
> > +
> > +static inline void
> > +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> > +{
> > +     refcount_inc(&binding->ref);
> > +}
> > +
> > +static inline void
> > +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> > +{
> > +     if (!refcount_dec_and_test(&binding->ref))
> > +             return;
> > +
> > +     __netdev_devmem_binding_free(binding);
> > +}
> > +
> >  /* XPS map type and offset of the xps map within net_device->xps_maps[]. */
> >  enum xps_map_type {
> >       XPS_CPUS = 0,
> > diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
> > index cdcafb30d437..1bfcf60a145d 100644
> > --- a/include/net/netdev_rx_queue.h
> > +++ b/include/net/netdev_rx_queue.h
> > @@ -21,6 +21,7 @@ struct netdev_rx_queue {
> >  #ifdef CONFIG_XDP_SOCKETS
> >       struct xsk_buff_pool            *pool;
> >  #endif
> > +     struct netdev_dmabuf_binding *binding;
>
> @Pavel - They are using struct netdev_rx_queue to hold the binding,
> which is an object that holds the state and is mapped 1:1 to an rxq.
> This object is similar to our "interface queue". I wonder if we should
> re-visit using this generic struct, instead of driver specific structs
> e.g. bnxt_rx_ring_info?
>
> >  } ____cacheline_aligned_in_smp;
> >
> >  /*
> > diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> > index d4bea053bb7e..64386325d965 100644
> > --- a/include/net/page_pool/types.h
> > +++ b/include/net/page_pool/types.h
> > @@ -133,6 +133,33 @@ struct pp_memory_provider_ops {
> >       bool (*release_page)(struct page_pool *pool, struct page *page);
> >  };
> >
> > +/* page_pool_iov support */
> > +
> > +/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
> > + * entry from the dmabuf is inserted into the genpool as a chunk, and needs
> > + * this owner struct to keep track of some metadata necessary to create
> > + * allocations from this chunk.
> > + */
> > +struct dmabuf_genpool_chunk_owner {
> > +     /* Offset into the dma-buf where this chunk starts.  */
> > +     unsigned long base_virtual;
> > +
> > +     /* dma_addr of the start of the chunk.  */
> > +     dma_addr_t base_dma_addr;
> > +
> > +     /* Array of page_pool_iovs for this chunk. */
> > +     struct page_pool_iov *ppiovs;
> > +     size_t num_ppiovs;
> > +
> > +     struct netdev_dmabuf_binding *binding;
> > +};
> > +
> > +struct page_pool_iov {
> > +     struct dmabuf_genpool_chunk_owner *owner;
> > +
> > +     refcount_t refcount;
> > +};
> > +
> >  struct page_pool {
> >       struct page_pool_params p;
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index a37a932a3e14..c8c3709d42c8 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -153,6 +153,9 @@
> >  #include <linux/prandom.h>
> >  #include <linux/once_lite.h>
> >  #include <net/netdev_rx_queue.h>
> > +#include <linux/genalloc.h>
> > +#include <linux/dma-buf.h>
> > +#include <net/page_pool/types.h>
> >
> >  #include "dev.h"
> >  #include "net-sysfs.h"
> > @@ -2040,6 +2043,206 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
> >       return call_netdevice_notifiers_info(val, &info.info);
> >  }
> >
> > +/* Device memory support */
> > +
> > +#ifdef CONFIG_DMA_SHARED_BUFFER
> > +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> > +                                        struct gen_pool_chunk *chunk,
> > +                                        void *not_used)
> > +{
> > +     struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> > +
> > +     kvfree(owner->ppiovs);
> > +     kfree(owner);
> > +}
> > +
> > +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> > +{
> > +     size_t size, avail;
> > +
> > +     gen_pool_for_each_chunk(binding->chunk_pool,
> > +                             netdev_devmem_free_chunk_owner, NULL);
> > +
> > +     size = gen_pool_size(binding->chunk_pool);
> > +     avail = gen_pool_avail(binding->chunk_pool);
> > +
> > +     if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> > +               size, avail))
> > +             gen_pool_destroy(binding->chunk_pool);
> > +
> > +     dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> > +                              DMA_BIDIRECTIONAL);
> > +     dma_buf_detach(binding->dmabuf, binding->attachment);
> > +     dma_buf_put(binding->dmabuf);
> > +     kfree(binding);
> > +}
> > +
> > +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> > +{
> > +     struct netdev_rx_queue *rxq;
> > +     unsigned long xa_idx;
> > +
> > +     if (!binding)
> > +             return;
> > +
> > +     list_del_rcu(&binding->list);
> > +
> > +     xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> > +             if (rxq->binding == binding)
> > +                     /* We hold the rtnl_lock while binding/unbinding
> > +                      * dma-buf, so we can't race with another thread that
> > +                      * is also modifying this value. However, the driver
> > +                      * may read this config while it's creating its
> > +                      * rx-queues. WRITE_ONCE() here to match the
> > +                      * READ_ONCE() in the driver.
> > +                      */
> > +                     WRITE_ONCE(rxq->binding, NULL);
> > +
> > +     netdev_devmem_binding_put(binding);
> > +}
> > +
> > +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> > +                             struct netdev_dmabuf_binding *binding)
> > +{
> > +     struct netdev_rx_queue *rxq;
> > +     u32 xa_idx;
> > +     int err;
> > +
> > +     rxq = __netif_get_rx_queue(dev, rxq_idx);
> > +
> > +     if (rxq->binding)
> > +             return -EEXIST;
> > +
> > +     err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> > +                    GFP_KERNEL);
> > +     if (err)
> > +             return err;
> > +
> > +     /*We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> > +      * race with another thread that is also modifying this value. However,
> > +      * the driver may read this config while it's creating its * rx-queues.
> > +      * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> > +      */
> > +     WRITE_ONCE(rxq->binding, binding);
> > +
> > +     return 0;
> > +}
> > +
> > +int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > +                    struct netdev_dmabuf_binding **out)
>
> I'm not entirely familiar with the Netlink API. Mina, do you know if we
> can call into netdev_bind_dmabuf or netdev_nl_bind_rx_doit directly,
> without needing to call send/recv on a Netlink socket? We likely want
> io_uring to do the registration of a dmabuf fd and keep ownership over
> it.
>

You can likely call into netdev_bind_dmabuf(), but not
netdev_nl_bind_rx_doit. The latter is very netlink specific.

> > +{
> > +     struct netdev_dmabuf_binding *binding;
> > +     struct scatterlist *sg;
> > +     struct dma_buf *dmabuf;
> > +     unsigned int sg_idx, i;
> > +     unsigned long virtual;
> > +     int err;
> > +
> > +     if (!capable(CAP_NET_ADMIN))
> > +             return -EPERM;
> > +
> > +     dmabuf = dma_buf_get(dmabuf_fd);
> > +     if (IS_ERR_OR_NULL(dmabuf))
> > +             return -EBADFD;
> > +
> > +     binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> > +                            dev_to_node(&dev->dev));
> > +     if (!binding) {
> > +             err = -ENOMEM;
> > +             goto err_put_dmabuf;
> > +     }
> > +
> > +     xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> > +
> > +     refcount_set(&binding->ref, 1);
> > +
> > +     binding->dmabuf = dmabuf;
> > +
> > +     binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> > +     if (IS_ERR(binding->attachment)) {
> > +             err = PTR_ERR(binding->attachment);
> > +             goto err_free_binding;
> > +     }
> > +
> > +     binding->sgt = dma_buf_map_attachment(binding->attachment,
> > +                                           DMA_BIDIRECTIONAL);
> > +     if (IS_ERR(binding->sgt)) {
> > +             err = PTR_ERR(binding->sgt);
> > +             goto err_detach;
> > +     }
> > +
> > +     /* For simplicity we expect to make PAGE_SIZE allocations, but the
> > +      * binding can be much more flexible than that. We may be able to
> > +      * allocate MTU sized chunks here. Leave that for future work...
> > +      */
> > +     binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> > +                                           dev_to_node(&dev->dev));
> > +     if (!binding->chunk_pool) {
> > +             err = -ENOMEM;
> > +             goto err_unmap;
> > +     }
> > +
> > +     virtual = 0;
> > +     for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> > +             dma_addr_t dma_addr = sg_dma_address(sg);
> > +             struct dmabuf_genpool_chunk_owner *owner;
> > +             size_t len = sg_dma_len(sg);
> > +             struct page_pool_iov *ppiov;
> > +
> > +             owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> > +                                  dev_to_node(&dev->dev));
> > +             owner->base_virtual = virtual;
> > +             owner->base_dma_addr = dma_addr;
> > +             owner->num_ppiovs = len / PAGE_SIZE;
> > +             owner->binding = binding;
> > +
> > +             err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> > +                                      dma_addr, len, dev_to_node(&dev->dev),
> > +                                      owner);
> > +             if (err) {
> > +                     err = -EINVAL;
> > +                     goto err_free_chunks;
> > +             }
> > +
> > +             owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> > +                                            sizeof(*owner->ppiovs),
> > +                                            GFP_KERNEL);
> > +             if (!owner->ppiovs) {
> > +                     err = -ENOMEM;
> > +                     goto err_free_chunks;
> > +             }
> > +
> > +             for (i = 0; i < owner->num_ppiovs; i++) {
> > +                     ppiov = &owner->ppiovs[i];
> > +                     ppiov->owner = owner;
> > +                     refcount_set(&ppiov->refcount, 1);
> > +             }
> > +
> > +             dma_addr += len;
> > +             virtual += len;
> > +     }
> > +
> > +     *out = binding;
> > +
> > +     return 0;
> > +
> > +err_free_chunks:
> > +     gen_pool_for_each_chunk(binding->chunk_pool,
> > +                             netdev_devmem_free_chunk_owner, NULL);
> > +     gen_pool_destroy(binding->chunk_pool);
> > +err_unmap:
> > +     dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> > +                              DMA_BIDIRECTIONAL);
> > +err_detach:
> > +     dma_buf_detach(dmabuf, binding->attachment);
> > +err_free_binding:
> > +     kfree(binding);
> > +err_put_dmabuf:
> > +     dma_buf_put(dmabuf);
> > +     return err;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_NET_INGRESS
> >  static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
> >
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index 59d3d512d9cc..2c2a62593217 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -129,10 +129,89 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
> >       return skb->len;
> >  }
> >
> > -/* Stub */
> > +static LIST_HEAD(netdev_rbinding_list);
> > +
> >  int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> >  {
> > -     return 0;
> > +     struct netdev_dmabuf_binding *out_binding;
> > +     u32 ifindex, dmabuf_fd, rxq_idx;
> > +     struct net_device *netdev;
> > +     struct sk_buff *rsp;
> > +     int rem, err = 0;
> > +     void *hdr;
> > +     struct nlattr *attr;
> > +
> > +     if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) ||
> > +         GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_DMABUF_FD) ||
> > +         GENL_REQ_ATTR_CHECK(info, NETDEV_A_BIND_DMABUF_QUEUES))
> > +             return -EINVAL;
> > +
> > +     ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]);
> > +     dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_BIND_DMABUF_DMABUF_FD]);
> > +
> > +     rtnl_lock();
> > +
> > +     netdev = __dev_get_by_index(genl_info_net(info), ifindex);
> > +     if (!netdev) {
> > +             err = -ENODEV;
> > +             goto err_unlock;
> > +     }
> > +
> > +     err = netdev_bind_dmabuf(netdev, dmabuf_fd, &out_binding);
> > +     if (err)
> > +             goto err_unlock;
> > +
> > +     nla_for_each_attr(attr, genlmsg_data(info->genlhdr),
> > +                       genlmsg_len(info->genlhdr), rem) {
> > +             switch (nla_type(attr)) {
> > +             case NETDEV_A_BIND_DMABUF_QUEUES:
> > +                     rxq_idx = nla_get_u32(attr);
> > +
> > +                     if (rxq_idx >= netdev->num_rx_queues) {
> > +                             err = -ERANGE;
> > +                             goto err_unbind;
> > +                     }
> > +
> > +                     err = netdev_bind_dmabuf_to_queue(netdev, rxq_idx,
> > +                                                       out_binding);
> > +                     if (err)
> > +                             goto err_unbind;
> > +
> > +                     break;
> > +             default:
> > +                     break;
> > +             }
> > +     }
> > +
> > +     out_binding->owner_nlportid = info->snd_portid;
> > +     list_add_rcu(&out_binding->list, &netdev_rbinding_list);
> > +
> > +     rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> > +     if (!rsp) {
> > +             err = -ENOMEM;
> > +             goto err_unbind;
> > +     }
> > +
> > +     hdr = genlmsg_put(rsp, info->snd_portid, info->snd_seq,
> > +                       &netdev_nl_family, 0, info->genlhdr->cmd);
> > +     if (!hdr) {
> > +             err = -EMSGSIZE;
> > +             goto err_genlmsg_free;
> > +     }
> > +
> > +     genlmsg_end(rsp, hdr);
> > +
> > +     rtnl_unlock();
> > +
> > +     return genlmsg_reply(rsp, info);
> > +
> > +err_genlmsg_free:
> > +     nlmsg_free(rsp);
> > +err_unbind:
> > +     netdev_unbind_dmabuf(out_binding);
> > +err_unlock:
> > +     rtnl_unlock();
> > +     return err;
> >  }
> >
> >  static int netdev_genl_netdevice_event(struct notifier_block *nb,
> > @@ -155,10 +234,37 @@ static int netdev_genl_netdevice_event(struct notifier_block *nb,
> >       return NOTIFY_OK;
> >  }
> >
> > +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
> > +                              void *_notify)
> > +{
> > +     struct netlink_notify *notify = _notify;
> > +     struct netdev_dmabuf_binding *rbinding;
> > +
> > +     if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
> > +             return NOTIFY_DONE;
> > +
> > +     rcu_read_lock();
> > +
> > +     list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
> > +             if (rbinding->owner_nlportid == notify->portid) {
> > +                     netdev_unbind_dmabuf(rbinding);
> > +                     break;
> > +             }
> > +     }
> > +
> > +     rcu_read_unlock();
> > +
> > +     return NOTIFY_OK;
> > +}
> > +
> >  static struct notifier_block netdev_genl_nb = {
> >       .notifier_call  = netdev_genl_netdevice_event,
> >  };
> >
> > +static struct notifier_block netdev_netlink_notifier = {
> > +     .notifier_call = netdev_netlink_notify,
> > +};
>
> Is this mechamism what cleans up TCP devmem in case userspace crashes
> and the associated Netlink socket is closed?
>

Correct.

> > +
> >  static int __init netdev_genl_init(void)
> >  {
> >       int err;
> > @@ -171,8 +277,14 @@ static int __init netdev_genl_init(void)
> >       if (err)
> >               goto err_unreg_ntf;
> >
> > +     err = netlink_register_notifier(&netdev_netlink_notifier);
> > +     if (err)
> > +             goto err_unreg_family;
> > +
> >       return 0;
> >
> > +err_unreg_family:
> > +     genl_unregister_family(&netdev_nl_family);
> >  err_unreg_ntf:
> >       unregister_netdevice_notifier(&netdev_genl_nb);
> >       return err;



-- 
Thanks,
Mina

^ permalink raw reply

* Re: [PATCH net,v3, 2/2] hv_netvsc: Fix race of register_netdevice_notifier and VF register
From: Jakub Kicinski @ 2023-11-09  2:26 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, kys, wei.liu, decui, edumazet, pabeni,
	davem, linux-kernel, stable
In-Reply-To: <1699391132-30317-3-git-send-email-haiyangz@microsoft.com>

On Tue,  7 Nov 2023 13:05:32 -0800 Haiyang Zhang wrote:
> If VF NIC is registered earlier, NETDEV_REGISTER event is replayed,
> but NETDEV_POST_INIT is not.

But Long Li sent the patch which starts to use POST_INIT against 
the net-next tree. If we apply this to net and Long Li's patch to
net-next one release will have half of the fixes.

I think that you should add Long Li's patch to this series. That'd
limit the confusion and git preserves authorship of the changes, so
neither of you will loose the credit.

^ permalink raw reply

* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Mina Almasry @ 2023-11-09  2:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Willem de Bruijn, Stanislav Fomichev, netdev, linux-kernel,
	linux-arch, linux-kselftest, linux-media, dri-devel,
	linaro-mm-sig, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Shuah Khan, Sumit Semwal, Christian König,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <7ce2d027-1e02-4a63-afb7-7304fbfbdf90@kernel.org>

On Tue, Nov 7, 2023 at 4:01 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/7/23 4:55 PM, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 4:03 PM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> >>
> >> On Mon, Nov 6, 2023 at 3:55 PM David Ahern <dsahern@kernel.org> wrote:
> >>>
> >>> On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
> >>>>> The concise notification API returns tokens as a range for
> >>>>> compression, encoding as two 32-bit unsigned integers start + length.
> >>>>> It allows for even further batching by returning multiple such ranges
> >>>>> in a single call.
> >>>>
> >>>> Tangential: should tokens be u64? Otherwise we can't have more than
> >>>> 4gb unacknowledged. Or that's a reasonable constraint?
> >>>>
> >>>
> >>> Was thinking the same and with bits reserved for a dmabuf id to allow
> >>> multiple dmabufs in a single rx queue (future extension, but build the
> >>> capability in now). e.g., something like a 37b offset (128GB dmabuf
> >>> size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
> >>
> >> Agreed. Converting to 64b now sounds like a good forward looking revision.
> >
> > The concept of IDing a dma-buf came up in a couple of different
> > contexts. First, in the context of us giving the dma-buf ID to the
> > user on recvmsg() to tell the user the data is in this specific
> > dma-buf. The second context is here, to bind dma-bufs with multiple
> > user-visible IDs to an rx queue.
> >
> > My issue here is that I don't see anything in the struct dma_buf that
> > can practically serve as an ID:
> >
> > https://elixir.bootlin.com/linux/v6.6-rc7/source/include/linux/dma-buf.h#L302
> >
> > Actually, from the userspace, only the name of the dma-buf seems
> > queryable. That's only unique if the user sets it as such. The dmabuf
> > FD can't serve as an ID. For our use case we need to support 1 process
> > doing the dma-buf bind via netlink, sharing the dma-buf FD to another
> > process, and that process receives the data.  In this case the FDs
> > shown by the 2 processes may be different. Converting to 64b is a
> > trivial change I can make now, but I'm not sure how to ID these
> > dma-bufs. Suggestions welcome. I'm not sure the dma-buf guys will
> > allow adding a new ID + APIs to query said dma-buf ID.
> >
>
> The API can be unique to this usage: e.g., add a dmabuf id to the
> netlink API. Userspace manages the ids (tells the kernel what value to
> use with an instance), the kernel validates no 2 dmabufs have the same
> id and then returns the value here.
>
>

Seems reasonable, will do.

On Wed, Nov 8, 2023 at 7:36 AM Edward Cree <ecree.xilinx@gmail.com> wrote:
>
> On 06/11/2023 21:17, Stanislav Fomichev wrote:
> > I guess I'm just wondering whether other people have any suggestions
> > here. Not sure Jonathan's way was better, but we fundamentally
> > have two queues between the kernel and the userspace:
> > - userspace receiving tokens (recvmsg + magical flag)
> > - userspace refilling tokens (setsockopt + magical flag)
> >
> > So having some kind of shared memory producer-consumer queue feels natural.
> > And using 'classic' socket api here feels like a stretch, idk.
>
> Do 'refilled tokens' (returned memory areas) get used for anything other
>  than subsequent RX?

Hi Ed!

Not really, it's only the subsequent RX.

>  If not then surely the way to return a memory area
>  in an io_uring idiom is just to post a new read sqe ('RX descriptor')
>  pointing into it, rather than explicitly returning it with setsockopt.

We're interested in using this with regular TCP sockets, not
necessarily io_uring. The io_uring interface to devmem TCP may very
well use what you suggest and can drop the setsockopt.


> (Being async means you can post lots of these, unlike recvmsg(), so you
>  don't need any kernel management to keep the RX queue filled; it can
>  just be all handled by the userland thus simplifying APIs overall.)
> Or I'm misunderstanding something?
>
> -e


--
Thanks,
Mina

^ permalink raw reply

* Re: [PATCH] iphase: Adding a null pointer check
From: Jakub Kicinski @ 2023-11-09  2:40 UTC (permalink / raw)
  To: Andrey Shumilin
  Cc: 3chas3, linux-atm-general, netdev, linux-kernel, lvc-project,
	khoroshilov, ykarpov, vmerzlyakov, vefanov
In-Reply-To: <20231107123600.14529-1-shum.sdl@nppct.ru>

On Tue,  7 Nov 2023 15:36:00 +0300 Andrey Shumilin wrote:
> The pointer <dev->desc_tbl[i].iavcc> is dereferenced on line 195.
> Further in the code, it is checked for null on line 204.
> It is proposed to add a check before dereferencing the pointer.

How do you know this is the right way to address the problem.
Much easier to debug a crash than misbehaving driver.

This is ancient code, leave it be.
-- 
pw-bot: reject
pv-bot: nit

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox