Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 04/22] [RESEND] time: make sysfs_get_uname() function visible in header
From: Uwe Kleine-König @ 2023-11-10 12:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Morton, linux-kernel, Masahiro Yamada, linux-kbuild,
	Arnd Bergmann, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Guo Ren, Peter Zijlstra, Ard Biesheuvel,
	Huacai Chen, Greg Ungerer, Michal Simek, Thomas Bogendoerfer,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Geoff Levand, Palmer Dabbelt, Heiko Carstens,
	John Paul Adrian Glaubitz, David S. Miller, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, x86, Helge Deller, Sudip Mukherjee,
	Greg Kroah-Hartman, Timur Tabi, Kent Overstreet, David Woodhouse,
	Naveen N. Rao, Anil S Keshavamurthy, Kees Cook, Vincenzo Frascino,
	Juri Lelli, Vincent Guittot, Nathan Chancellor, Nick Desaulniers,
	Nicolas Schier, Al Viro, linux-alpha, linux-snps-arc,
	linux-arm-kernel, linux-trace-kernel, linux-csky, loongarch,
	linux-m68k, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, netdev, linux-parisc, linux-usb,
	linux-fbdev, dri-devel, linux-bcachefs, linux-mtd
In-Reply-To: <20231108125843.3806765-5-arnd@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

Hello Arnd,

On Wed, Nov 08, 2023 at 01:58:25PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> This function is defined globally in clocksource.c and used conditionally
> in clockevent.c, which the declaration hidden when clockevent support

s/which/with/ ?

> is disabled. This causes a harmless warning in the definition:
> 
> kernel/time/clocksource.c:1324:9: warning: no previous prototype for 'sysfs_get_uname' [-Wmissing-prototypes]
>  1324 | ssize_t sysfs_get_uname(const char *buf, char *dst, size_t cnt)
> 
> Move the declaration out of the #ifdef so it is always visible.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Other than that:

Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [RFC net-next] net: don't dump stack on queue timeout
From: Jamal Hadi Salim @ 2023-11-10 13:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, syzbot+d55372214aff0faa1f1f,
	xiyou.wangcong, jiri
In-Reply-To: <20231109000901.949152-1-kuba@kernel.org>

On Wed, Nov 8, 2023 at 7:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> The top syzbot report for networking (#14 for the entire kernel)
> is the queue timeout splat. We kept it around for a long time,
> because in real life it provides pretty strong signal that
> something is wrong with the driver or the device.
>
> Removing it is also likely to break monitoring for those who
> track it as a kernel warning.
>
> Nevertheless, WARN()ings are best suited for catching kernel
> programming bugs. If a Tx queue gets starved due to a pause
> storm, priority configuration, or other weirdness - that's
> obviously a problem, but not a problem we can fix at
> the kernel level.
>
> Bite the bullet and convert the WARN() to a print.
>
> Before:
>
>   NETDEV WATCHDOG: eni1np1 (netdevsim): transmit queue 0 timed out 1975 ms
>   WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x39e/0x3b0
>   [... completely pointless stack trace of a timer follows ...]
>
> Now:
>
>   netdevsim netdevsim1 eni1np1: NETDEV WATCHDOG: CPU: 0: transmit queue 0 timed out 1769 ms
>
> Alternatively we could mark the drivers which syzbot has
> learned to abuse as "print-instead-of-WARN" selectively.
>
> Reported-by: syzbot+d55372214aff0faa1f1f@syzkaller.appspotmail.com
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal

> ---
> CC: jhs@mojatatu.com
> CC: xiyou.wangcong@gmail.com
> CC: jiri@resnulli.us
> ---
>  net/sched/sch_generic.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 4195a4bc26ca..8dd0e5925342 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>
>                         if (unlikely(timedout_ms)) {
>                                 trace_net_dev_xmit_timeout(dev, i);
> -                               WARN_ONCE(1, "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out %u ms\n",
> -                                         dev->name, netdev_drivername(dev), i, timedout_ms);
> +                               netdev_crit(dev, "NETDEV WATCHDOG: CPU: %d: transmit queue %u timed out %u ms\n",
> +                                           raw_smp_processor_id(),
> +                                           i, timedout_ms);
>                                 netif_freeze_queues(dev);
>                                 dev->netdev_ops->ndo_tx_timeout(dev, i);
>                                 netif_unfreeze_queues(dev);
> --
> 2.41.0
>

^ permalink raw reply

* [PATCH net-next RFC] net: sfp: rework the RollBall PHY waiting code
From: Marek Behún @ 2023-11-10 13:12 UTC (permalink / raw)
  To: netdev, Russell King; +Cc: Jakub Kicinski, Andrew Lunn, Marek Behún

RollBall SFP modules allow the access to PHY registers only after a
certain time has passed. Until then, the registers read 0xffff.

Currently we have quirks for modules where we need to wait either 25
seconds or 4 seconds, but recently I got hands on another module where
the wait is even shorter.

Instead of hardcoding different wait times, lets rework the code:
- increase the PHY retry count to 25
- when RollBall module is detected, increase the PHY retry time from
  50ms to 1s

Signed-off-by: Marek Behún <kabel@kernel.org>
---
 drivers/net/phy/sfp.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
index 5468bd209fab..5eb00295b8bf 100644
--- a/drivers/net/phy/sfp.c
+++ b/drivers/net/phy/sfp.c
@@ -191,7 +191,7 @@ static const enum gpiod_flags gpio_flags[] = {
  * R_PHY_RETRY is the number of attempts.
  */
 #define T_PHY_RETRY		msecs_to_jiffies(50)
-#define R_PHY_RETRY		12
+#define R_PHY_RETRY		25
 
 /* SFP module presence detection is poor: the three MOD DEF signals are
  * the same length on the PCB, which means it's possible for MOD DEF 0 to
@@ -274,7 +274,7 @@ struct sfp {
 	struct sfp_eeprom_id id;
 	unsigned int module_power_mW;
 	unsigned int module_t_start_up;
-	unsigned int module_t_wait;
+	unsigned int phy_t_retry;
 
 	unsigned int rate_kbd;
 	unsigned int rs_threshold_kbd;
@@ -372,18 +372,22 @@ static void sfp_fixup_10gbaset_30m(struct sfp *sfp)
 	sfp->id.base.extended_cc = SFF8024_ECC_10GBASE_T_SR;
 }
 
-static void sfp_fixup_rollball_proto(struct sfp *sfp, unsigned int secs)
+static void sfp_fixup_rollball(struct sfp *sfp)
 {
 	sfp->mdio_protocol = MDIO_I2C_ROLLBALL;
-	sfp->module_t_wait = msecs_to_jiffies(secs * 1000);
+
+	/* RollBall modules may disallow access to PHY registers for up to 25
+	 * seconds, and the reads return 0xffff before that. Increase the time
+	 * between PHY probe retries from 50ms to 1s so that we will wait for
+	 * the PHY for a sufficient amount of time.
+	 */
+	sfp->phy_t_retry = msecs_to_jiffies(1000);
 }
 
 static void sfp_fixup_fs_10gt(struct sfp *sfp)
 {
 	sfp_fixup_10gbaset_30m(sfp);
-
-	// These SFPs need 4 seconds before the PHY can be accessed
-	sfp_fixup_rollball_proto(sfp, 4);
+	sfp_fixup_rollball(sfp);
 }
 
 static void sfp_fixup_halny_gsfp(struct sfp *sfp)
@@ -395,12 +399,6 @@ static void sfp_fixup_halny_gsfp(struct sfp *sfp)
 	sfp->state_hw_mask &= ~(SFP_F_TX_FAULT | SFP_F_LOS);
 }
 
-static void sfp_fixup_rollball(struct sfp *sfp)
-{
-	// Rollball SFPs need 25 seconds before the PHY can be accessed
-	sfp_fixup_rollball_proto(sfp, 25);
-}
-
 static void sfp_fixup_rollball_cc(struct sfp *sfp)
 {
 	sfp_fixup_rollball(sfp);
@@ -2331,7 +2329,7 @@ static int sfp_sm_mod_probe(struct sfp *sfp, bool report)
 		mask |= SFP_F_RS1;
 
 	sfp->module_t_start_up = T_START_UP;
-	sfp->module_t_wait = T_WAIT;
+	sfp->phy_t_retry = T_PHY_RETRY;
 
 	sfp->state_ignore_mask = 0;
 
@@ -2568,10 +2566,9 @@ static void sfp_sm_main(struct sfp *sfp, unsigned int event)
 
 		/* We need to check the TX_FAULT state, which is not defined
 		 * while TX_DISABLE is asserted. The earliest we want to do
-		 * anything (such as probe for a PHY) is 50ms (or more on
-		 * specific modules).
+		 * anything (such as probe for a PHY) is 50ms.
 		 */
-		sfp_sm_next(sfp, SFP_S_WAIT, sfp->module_t_wait);
+		sfp_sm_next(sfp, SFP_S_WAIT, T_WAIT);
 		break;
 
 	case SFP_S_WAIT:
@@ -2585,8 +2582,8 @@ static void sfp_sm_main(struct sfp *sfp, unsigned int event)
 			 * deasserting.
 			 */
 			timeout = sfp->module_t_start_up;
-			if (timeout > sfp->module_t_wait)
-				timeout -= sfp->module_t_wait;
+			if (timeout > T_WAIT)
+				timeout -= T_WAIT;
 			else
 				timeout = 1;
 
@@ -2629,7 +2626,11 @@ static void sfp_sm_main(struct sfp *sfp, unsigned int event)
 		ret = sfp_sm_probe_for_phy(sfp);
 		if (ret == -ENODEV) {
 			if (--sfp->sm_phy_retries) {
-				sfp_sm_next(sfp, SFP_S_INIT_PHY, T_PHY_RETRY);
+				sfp_sm_next(sfp, SFP_S_INIT_PHY,
+					    sfp->phy_t_retry);
+				dev_dbg(sfp->dev,
+					"no PHY detected, %u tries left\n",
+					sfp->sm_phy_retries);
 				break;
 			} else {
 				dev_info(sfp->dev, "no PHY detected\n");
-- 
2.41.0


^ permalink raw reply related

* Re: [PATCH net-next v7 1/5] rust: core abstractions for network PHY drivers
From: Andrew Lunn @ 2023-11-10 13:26 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: benno.lossin, boqun.feng, netdev, rust-for-linux, tmgross,
	miguel.ojeda.sandonis, wedsonaf
In-Reply-To: <20231108.194647.1383073631008060059.fujita.tomonori@gmail.com>

On Wed, Nov 08, 2023 at 07:46:47PM +0900, FUJITA Tomonori wrote:
> On Mon, 30 Oct 2023 16:45:38 +0000
> Benno Lossin <benno.lossin@proton.me> wrote:
> 
> >>> But I would wait until we see a response from the bindgen devs on the issue.
> >> 
> >> You meant that they might have a different option on this?
> > 
> > No, before you implement the workaround that Boqun posted you
> > should wait until the bindgen devs say how long/if they will
> > implement it.
> 
> It has been 10 days but no response from bindgen developpers. I guess
> that unlikely bindgen will support the feature until the next merge
> window.

I just took a look around the kernel includes. Here are a few examples
i found, there are many many more.

include/target/iscsi/iscsi_target_core.h:       unsigned int            conn_tx_reset_cpumask:1;
include/media/videobuf2-core.h: unsigned int            synced:1;
include/media/v4l2-ctrls.h:     unsigned int done:1;
include/drm/bridge/samsung-dsim.h:      unsigned int has_freqband:1;
include/net/sock_reuseport.h:   unsigned int            bind_inany:1;
include/net/sock.h:     unsigned char           skc_ipv6only:1;
include/sound/simple_card_utils.h:      unsigned int force_dpcm:1;
include/sound/soc.h:    unsigned int autodisable:1;
include/linux/pci.h:    unsigned int    imm_ready:1;    /* Supports Immediate Readiness */
include/linux/regmap.h: unsigned int use_ack:1;
include/linux/cpuidle.h:        unsigned int            registered:1;
include/linux/regulator/gpio-regulator.h:       unsigned enabled_at_boot:1;
include/linux/phy.h:    unsigned link:1;
include/linux/pm.h:     unsigned int            irq_safe:1;
include/linux/nfs_xdr.h:        unsigned char                   renew:1;
include/linux/iio/iio.h:        unsigned                output:1;
include/linux/drbd.h:           unsigned user_isp:1 ;
include/linux/sched.h:  unsigned                        sched_migrated:1;
include/linux/writeback.h:      unsigned no_cgroup_owner:1;
include/linux/tty_port.h:       unsigned char           console:1;
include/linux/mpi.h:                    unsigned int two_inv_p:1;
include/linux/mmc/host.h:       unsigned int            use_spi_crc:1;
include/linux/netdevice.h:      unsigned                wol_enabled:1;
include/linux/serial_8250.h:    unsigned int            tx_stopped:1;
include/linux/usb.h:    unsigned can_submit:1;
include/linux/firewire.h:       unsigned is_local:1;
include/linux/phylink.h:        unsigned int link:1;
include/linux/gpio_keys.h:      unsigned int rep:1;
include/linux/spi/spi.h:        unsigned        cs_change:1;
include/linux/i2c-mux.h:        unsigned int arbitrator:1;
include/linux/kobject.h:        unsigned int state_in_sysfs:1;
include/linux/kbd_kern.h:       unsigned char ledmode:1;
include/linux/bpf_verifier.h:   unsigned int fit_for_inline:1;
include/linux/rtc.h:    unsigned int uie_irq_active:1;
include/linux/usb/usbnet.h:     unsigned                can_dma_sg:1;
include/linux/usb/serial.h:     unsigned char                   attached:1;
include/linux/usb/gadget.h:     unsigned dir_out:1;
include/linux/comedi/comedidev.h:       unsigned int attached:1;
include/scsi/sg.h:    unsigned int twelve_byte:1;
include/scsi/scsi_host.h:       unsigned emulated:1;
include/scsi/scsi_device.h:     unsigned removable:1;
include/rdma/ib_verbs.h:        unsigned int            ip_gids:1;

And thats just bitfields used as binary values. There are more with
:2, :3, :4, :8, :16.

The point i'm trying to make is that they are used in many
subsystems. Not having bindgen support for them seems like its going
to be a problem.

Isn't this also an unsoundness problem? Is there existing Rust code
which people think is sound but is actually not?

> I prefer adding accessors in the C side rather than the workaround if
> it's fine by Andrew because we have no idea when bindgen will support
> the feature.

Maybe we need a better understanding of the kernel wide implications
of this. It could be this is actually a big issue, and rust-for-linux
can then apply either pressure, or human resources, to get it
implemented.

    Andrew

^ permalink raw reply

* Re: [PATCH iproute2] Revert "Makefile: ensure CONF_USR_DIR honours the libdir config"
From: Petr Machata @ 2023-11-10 13:34 UTC (permalink / raw)
  To: aclaudi; +Cc: luca.boccassi, netdev, stephen
In-Reply-To: <20231106001410.183542-1-luca.boccassi@gmail.com>

luca.boccassi@gmail.com writes:

> From: Luca Boccassi <bluca@debian.org>
>
> LIBDIR in Debian and derivatives is not /usr/lib/, it's
> /usr/lib/<architecture triplet>/, which is different, and it's the
> wrong location where to install architecture-independent default
> configuration files, which should always go to /usr/lib/ instead.
> Installing these files to the per-architecture directory is not
> the right thing, hence revert the change.

So I looked into the Fedora package. Up until recently, the files were
in /etc, but it seems there was a deliberate change in the spec file
this September that moved them to /usr/lib or /usr/lib64.

Luca -- since you both sent the patch under reversion, and are Fedora
maintainer, could you please elaborate on what the logic was behind it?
It does look odd to me to put config files into an arch-dependent
directory, but I've been out of packaging for close to a decade at this
point.

Thanks!

> This reverts commit 946753a4459bd035132a27bb2eb87529c1979b90.
>
> Signed-off-by: Luca Boccassi <bluca@debian.org>
> ---
>  Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Makefile b/Makefile
> index 54539ce4..7d1819ce 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -17,7 +17,7 @@ endif
>  PREFIX?=/usr
>  SBINDIR?=/sbin
>  CONF_ETC_DIR?=/etc/iproute2
> -CONF_USR_DIR?=$(LIBDIR)/iproute2
> +CONF_USR_DIR?=$(PREFIX)/lib/iproute2
>  NETNS_RUN_DIR?=/var/run/netns
>  NETNS_ETC_DIR?=/etc/netns
>  DATADIR?=$(PREFIX)/share


^ permalink raw reply

* Re: [PATCH iproute2] Revert "Makefile: ensure CONF_USR_DIR honours the libdir config"
From: Luca Boccassi @ 2023-11-10 13:54 UTC (permalink / raw)
  To: Petr Machata; +Cc: aclaudi, netdev, stephen
In-Reply-To: <87fs1dyax9.fsf@nvidia.com>

On Fri, 10 Nov 2023 at 13:51, Petr Machata <petrm@nvidia.com> wrote:
>
> luca.boccassi@gmail.com writes:
>
> > From: Luca Boccassi <bluca@debian.org>
> >
> > LIBDIR in Debian and derivatives is not /usr/lib/, it's
> > /usr/lib/<architecture triplet>/, which is different, and it's the
> > wrong location where to install architecture-independent default
> > configuration files, which should always go to /usr/lib/ instead.
> > Installing these files to the per-architecture directory is not
> > the right thing, hence revert the change.
>
> So I looked into the Fedora package. Up until recently, the files were
> in /etc, but it seems there was a deliberate change in the spec file
> this September that moved them to /usr/lib or /usr/lib64.
>
> Luca -- since you both sent the patch under reversion, and are Fedora
> maintainer, could you please elaborate on what the logic was behind it?
> It does look odd to me to put config files into an arch-dependent
> directory, but I've been out of packaging for close to a decade at this
> point.

I am not a Fedora maintainer, and an arch-dependent triplet
subdirectory is exactly the problem that this patch is fixing.

^ permalink raw reply

* Re: [PATCH iproute2] Revert "Makefile: ensure CONF_USR_DIR honours the libdir config"
From: Petr Machata @ 2023-11-10 13:54 UTC (permalink / raw)
  To: aclaudi; +Cc: Petr Machata, luca.boccassi, netdev, stephen
In-Reply-To: <87fs1dyax9.fsf@nvidia.com>


Petr Machata <petrm@nvidia.com> writes:

> luca.boccassi@gmail.com writes:
>
>> From: Luca Boccassi <bluca@debian.org>
>>
>> LIBDIR in Debian and derivatives is not /usr/lib/, it's
>> /usr/lib/<architecture triplet>/, which is different, and it's the
>> wrong location where to install architecture-independent default
>> configuration files, which should always go to /usr/lib/ instead.
>> Installing these files to the per-architecture directory is not
>> the right thing, hence revert the change.
>
> So I looked into the Fedora package. Up until recently, the files were
> in /etc, but it seems there was a deliberate change in the spec file
> this September that moved them to /usr/lib or /usr/lib64.
>
> Luca -- since you both sent the patch under reversion, and are Fedora

Ugh, I mean Andrea, not Luca. Sorry!

> maintainer, could you please elaborate on what the logic was behind it?
> It does look odd to me to put config files into an arch-dependent
> directory, but I've been out of packaging for close to a decade at this
> point.

^ permalink raw reply

* Re: [RFC v1 0/8] vhost-vdpa: add support for iommufd
From: Jason Gunthorpe @ 2023-11-10 14:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cindy Lu, jasowang, yi.l.liu, linux-kernel, virtualization,
	netdev
In-Reply-To: <20231109183407-mutt-send-email-mst@kernel.org>

On Thu, Nov 09, 2023 at 06:48:46PM -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 11:52:17AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 07, 2023 at 09:30:21AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Nov 07, 2023 at 10:12:37AM -0400, Jason Gunthorpe wrote:
> > > > Big company's should take the responsibility to train and provide
> > > > skill development for their own staff.
> > > 
> > > That would result in a beautiful cathedral of a patch. I know this is
> > > how some companies work. We are doing more of a bazaar thing here,
> > > though. In a bunch of subsystems it seems that you don't get the
> > > necessary skills until you have been publically shouted at by
> > > maintainers - better to start early ;). Not a nice environment for
> > > novices, for sure.
> > 
> > In my view the "shouting from maintainers" is harmful to the people
> > buidling skills and it is an unkind thing to dump employees into that
> > kind of situation.
> > 
> > They should have help to establish the basic level of competence where
> > they may do the wrong thing, but all the process and presentation of
> > the wrong thing is top notch. You get a much better reception.
> 
> What - like e.g. mechanically fixing checkpatch warnings without
> understanding? 

No, not at all. I mean actually going through and explaining what the
idea is to another person and ensuing that the commit messages convey
that idea, that the patches reflect the idea, that everything is
convayed, and it isn't obviously internally illogical.

Like, why did this series have a giant block of #ifdef 0'd code with
no explanation at all? That isn't checkpatch nitpicks, that is not
meeting the minimum standard to convey an idea in an RFC.

Jason

^ permalink raw reply

* RE: [PATCH net 2/3] dpll: fix pin dump crash for rebound module
From: Kubalewski, Arkadiusz @ 2023-11-10 14:11 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev@vger.kernel.org, vadim.fedorenko@linux.dev,
	Michalik, Michal, Olech, Milena, pabeni@redhat.com,
	kuba@kernel.org
In-Reply-To: <ZU4XqBtl0CXwvh5n@nanopsycho>

>From: Jiri Pirko <jiri@resnulli.us>
>Sent: Friday, November 10, 2023 12:45 PM
>
>Fri, Nov 10, 2023 at 12:18:51PM CET, arkadiusz.kubalewski@intel.com wrote:
>>>From: Jiri Pirko <jiri@resnulli.us>
>>>Sent: Friday, November 10, 2023 11:06 AM
>>>
>>>Fri, Nov 10, 2023 at 10:01:50AM CET, arkadiusz.kubalewski@intel.com
>>>wrote:
>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>Sent: Friday, November 10, 2023 7:46 AM
>>>>>
>>>>>Fri, Nov 10, 2023 at 12:32:21AM CET, arkadiusz.kubalewski@intel.com
>>>>>wrote:
>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>Sent: Thursday, November 9, 2023 7:06 PM
>>>>>>>
>>>>>>>Thu, Nov 09, 2023 at 05:30:20PM CET, arkadiusz.kubalewski@intel.com
>>>>>>>wrote:
>>>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>>Sent: Thursday, November 9, 2023 2:19 PM
>>>>>>>>>
>>>>>>>>>Thu, Nov 09, 2023 at 01:20:48PM CET, arkadiusz.kubalewski@intel.com
>>>>>>>>>wrote:
>>>>>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>>>>Sent: Wednesday, November 8, 2023 3:30 PM
>>>>>>>>>>>
>>>>>>>>>>>Wed, Nov 08, 2023 at 11:32:25AM CET,
>>>>>>>>>>>arkadiusz.kubalewski@intel.com
>>>>>>>>>>>wrote:
>>>>>>>>>>>>When a kernel module is unbound but the pin resources were not
>>>>>>>>>>>>entirely
>>>>>>>>>>>>freed (other kernel module instance have had kept the reference
>>>>>>>>>>>>to
>>>>>>>>>>>>that
>>>>>>>>>>>>pin), and kernel module is again bound, the pin properties would
>>>>>>>>>>>>not
>>>>>>>>>>>>be
>>>>>>>>>>>>updated (the properties are only assigned when memory for the
>>>>>>>>>>>>pin
>>>>>>>>>>>>is
>>>>>>>>>>>>allocated), prop pointer still points to the kernel module
>>>>>>>>>>>>memory
>>>>>>>>>>>>of
>>>>>>>>>>>>the kernel module which was deallocated on the unbind.
>>>>>>>>>>>>
>>>>>>>>>>>>If the pin dump is invoked in this state, the result is a kernel
>>>>>>>>>>>>crash.
>>>>>>>>>>>>Prevent the crash by storing persistent pin properties in dpll
>>>>>>>>>>>>subsystem,
>>>>>>>>>>>>copy the content from the kernel module when pin is allocated,
>>>>>>>>>>>>instead
>>>>>>>>>>>>of
>>>>>>>>>>>>using memory of the kernel module.
>>>>>>>>>>>>
>>>>>>>>>>>>Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base
>>>>>>>>>>>>functions")
>>>>>>>>>>>>Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>>>>>>>>>>>functions")
>>>>>>>>>>>>Signed-off-by: Arkadiusz Kubalewski
>>>>>>>>>>>><arkadiusz.kubalewski@intel.com>
>>>>>>>>>>>>---
>>>>>>>>>>>> drivers/dpll/dpll_core.c    |  4 ++--
>>>>>>>>>>>> drivers/dpll/dpll_core.h    |  4 ++--
>>>>>>>>>>>> drivers/dpll/dpll_netlink.c | 28 ++++++++++++++--------------
>>>>>>>>>>>> 3 files changed, 18 insertions(+), 18 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>>>>>index 3568149b9562..4077b562ba3b 100644
>>>>>>>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>>>>>>>@@ -442,7 +442,7 @@ dpll_pin_alloc(u64 clock_id, u32 pin_idx,
>>>>>>>>>>>>struct
>>>>>>>>>>>>module *module,
>>>>>>>>>>>> 		ret = -EINVAL;
>>>>>>>>>>>> 		goto err;
>>>>>>>>>>>> 	}
>>>>>>>>>>>>-	pin->prop = prop;
>>>>>>>>>>>>+	memcpy(&pin->prop, prop, sizeof(pin->prop));
>>>>>>>>>>>
>>>>>>>>>>>Odd, you don't care about the pointer within this structure?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Well, true. Need a fix.
>>>>>>>>>>Wondering if copying idea is better than just assigning prop
>>>>>>>>>>pointer
>>>>>>>>>>on
>>>>>>>>>>each call to dpll_pin_get(..) function (when pin already exists)?
>>>>>>>>>
>>>>>>>>>Not sure what do you mean. Examples please.
>>>>>>>>>
>>>>>>>>
>>>>>>>>Sure,
>>>>>>>>
>>>>>>>>Basically this change:
>>>>>>>>
>>>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>index ae884b92d68c..06b72d5877c3 100644
>>>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>>>@@ -483,6 +483,7 @@ dpll_pin_get(u64 clock_id, u32 pin_idx, struct
>>>>>>>>module
>>>>>>>>*module,
>>>>>>>>                    pos->pin_idx == pin_idx &&
>>>>>>>>                    pos->module == module) {
>>>>>>>>                        ret = pos;
>>>>>>>>+                       pos->prop = prop;
>>>>>>>>                        refcount_inc(&ret->refcount);
>>>>>>>>                        break;
>>>>>>>>                }
>>>>>>>>
>>>>>>>>would replace whole of this patch changes, although seems a bit
>>>>>>>hacky.
>>>>>>>
>>>>>>>Or event better, as I suggested in the other patch reply, resolve
>>>>>>>this
>>>>>>>internally in the driver registering things only when they are valid.
>>>>>>>Much better then to hack anything in dpll core.
>>>>>>>
>>>>>>
>>>>>>This approach seemed to me hacky, that is why started with coping the
>>>>>>data.
>>>>>>It is not about registering, rather about unregistering on driver
>>>>>>unbind, which brakes things, and currently cannot be recovered in
>>>>>>described case.
>>>>>
>>>>>Sure it can. PF0 unbind-> internal notification-> PF1 unregisters all
>>>>>related object. Very clean and simple.
>>>>>
>>>>
>>>>What you are suggesting is:
>>>>- special purpose bus in the driver,
>>>
>>>No, it is a simple notificator. Very common infra over the whole
>>>kernel code.
>>>
>>
>>No difference if this is simple or not, it is special purpose.
>>
>>>
>>>>- dpll-related,
>>>
>>>Is this the only thing that PF0 is special with? Perhaps you can
>>>utilize this for other features as well, since your fw design is like
>>>this.
>>>
>>
>>None api user is allowed to the unordered driver bind when there are muxed
>>pins involved.
>>This requires a fix in the api not in the driver.
>>
>>>
>>>>- not needed,
>>>>- prone for errors.
>>>>
>>>>The dpll subsystem is here to make driver life easier.
>>>
>>>No, the subsystem is never here to handle device specific issues. And
>>>your PF0 dependency is very clearly something device specific. Don't
>>>pollute the dpll subsystem with workaround to handle specific device
>>>needs. create/register the dplls objects from your driver only when it
>>>is valid to do so. Make sure the lifetime of such object stays in
>>>the scope of validity. Handle that in the driver. Very clear and simple.
>>>
>>
>>In our case this is PF0 but this is broader issue. The muxed pins
>>infrastructure in now slightly broken and I am fixing it.
>>From the beginning it was designed to allow separated driver instances
>>to create a device and connect their pins with it.
>
>That is true. That's exacly what we have implemented in mlx5. Each
>instance registers dpll device and the pin related to the instance. No
>problem.
>

I read this: "mlx5 is done correctly, you need to follow", funny again.

>The fact that you do register only in PF0 is a limitation in your
>driver. Fix it there.
>

A limitation when comparing to your design, makes it not the limitation at all.
Different design is just different.

According to documentation there is no such thing like a correct order of
registering/unregistering pins.
So a driver can register a parent pin, register a pin with that parent, and
remove the parent first. It shall not crash. Right now it is broken. The
dpll needs to handle such cases. I am fixing it. Simple.

Thank you!
Arkadiusz

>
>>Any driver which would use it would face this issue. What you are trying
>>to imply is that it is better to put traffic lights on each car instead
>>of putting them on crossroads.
>>
>>>Thanks!
>>>
>>
>>Thank you!
>>Arkadiusz
>>
>>>
>>>>
>>>>Thank you!
>>>>Arkadiusz
>>>>
>>>>>
>>>>>>
>>>>>>Thank you!
>>>>>>Arkadiusz
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>Thank you!
>>>>>>>>Arkadiusz
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Thank you!
>>>>>>>>>>Arkadiusz
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> 	refcount_set(&pin->refcount, 1);
>>>>>>>>>>>> 	xa_init_flags(&pin->dpll_refs, XA_FLAGS_ALLOC);
>>>>>>>>>>>> 	xa_init_flags(&pin->parent_refs, XA_FLAGS_ALLOC);
>>>>>>>>>>>>@@ -634,7 +634,7 @@ int dpll_pin_on_pin_register(struct dpll_pin
>>>>>>>>>>>>*parent,
>>>>>>>>>>>>struct dpll_pin *pin,
>>>>>>>>>>>> 	unsigned long i, stop;
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>>-	if (WARN_ON(parent->prop->type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>>>>+	if (WARN_ON(parent->prop.type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>>>>
>>>>>>>>>>>> 	if (WARN_ON(!ops) ||
>>>>>>>>>>>>diff --git a/drivers/dpll/dpll_core.h b/drivers/dpll/dpll_core.h
>>>>>>>>>>>>index 5585873c5c1b..717f715015c7 100644
>>>>>>>>>>>>--- a/drivers/dpll/dpll_core.h
>>>>>>>>>>>>+++ b/drivers/dpll/dpll_core.h
>>>>>>>>>>>>@@ -44,7 +44,7 @@ struct dpll_device {
>>>>>>>>>>>>  * @module:		module of creator
>>>>>>>>>>>>  * @dpll_refs:		hold referencees to dplls pin was
>>>>>>>>>>>>registered
>>>>>>>>>>>>with
>>>>>>>>>>>>  * @parent_refs:	hold references to parent pins pin was
>>>>>>>>>>>>registered
>>>>>>>>>>>>with
>>>>>>>>>>>>- * @prop:		pointer to pin properties given by registerer
>>>>>>>>>>>>+ * @prop:		pin properties copied from the registerer
>>>>>>>>>>>>  * @rclk_dev_name:	holds name of device when pin can
>>>>>>>>>>>>recover
>>>>>>>>>>>>clock
>>>>>>>>>>>>from it
>>>>>>>>>>>>  * @refcount:		refcount
>>>>>>>>>>>>  **/
>>>>>>>>>>>>@@ -55,7 +55,7 @@ struct dpll_pin {
>>>>>>>>>>>> 	struct module *module;
>>>>>>>>>>>> 	struct xarray dpll_refs;
>>>>>>>>>>>> 	struct xarray parent_refs;
>>>>>>>>>>>>-	const struct dpll_pin_properties *prop;
>>>>>>>>>>>>+	struct dpll_pin_properties prop;
>>>>>>>>>>>> 	refcount_t refcount;
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>>diff --git a/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>>b/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>>index 93fc6c4b8a78..963bbbbe6660 100644
>>>>>>>>>>>>--- a/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>>+++ b/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>>@@ -278,17 +278,17 @@ dpll_msg_add_pin_freq(struct sk_buff *msg,
>>>>>>>>>>>>struct
>>>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>>>> 	if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY,
>>>>>>>>>>>>sizeof(freq),
>>>>>>>>>>>>&freq,
>>>>>>>>>>>> 			  DPLL_A_PIN_PAD))
>>>>>>>>>>>> 		return -EMSGSIZE;
>>>>>>>>>>>>-	for (fs = 0; fs < pin->prop->freq_supported_num; fs++) {
>>>>>>>>>>>>+	for (fs = 0; fs < pin->prop.freq_supported_num; fs++) {
>>>>>>>>>>>> 		nest = nla_nest_start(msg,
>>>>>>>>>>>>DPLL_A_PIN_FREQUENCY_SUPPORTED);
>>>>>>>>>>>> 		if (!nest)
>>>>>>>>>>>> 			return -EMSGSIZE;
>>>>>>>>>>>>-		freq = pin->prop->freq_supported[fs].min;
>>>>>>>>>>>>+		freq = pin->prop.freq_supported[fs].min;
>>>>>>>>>>>> 		if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MIN,
>>>>>>>>>>>>sizeof(freq),
>>>>>>>>>>>> 				  &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>>>>> 			nla_nest_cancel(msg, nest);
>>>>>>>>>>>> 			return -EMSGSIZE;
>>>>>>>>>>>> 		}
>>>>>>>>>>>>-		freq = pin->prop->freq_supported[fs].max;
>>>>>>>>>>>>+		freq = pin->prop.freq_supported[fs].max;
>>>>>>>>>>>> 		if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MAX,
>>>>>>>>>>>>sizeof(freq),
>>>>>>>>>>>> 				  &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>>>>> 			nla_nest_cancel(msg, nest);
>>>>>>>>>>>>@@ -304,9 +304,9 @@ static bool
>>>>>>>>>>>>dpll_pin_is_freq_supported(struct
>>>>>>>>>>>>dpll_pin
>>>>>>>>>>>>*pin, u32 freq)
>>>>>>>>>>>> {
>>>>>>>>>>>> 	int fs;
>>>>>>>>>>>>
>>>>>>>>>>>>-	for (fs = 0; fs < pin->prop->freq_supported_num; fs++)
>>>>>>>>>>>>-		if (freq >= pin->prop->freq_supported[fs].min &&
>>>>>>>>>>>>-		    freq <= pin->prop->freq_supported[fs].max)
>>>>>>>>>>>>+	for (fs = 0; fs < pin->prop.freq_supported_num; fs++)
>>>>>>>>>>>>+		if (freq >= pin->prop.freq_supported[fs].min &&
>>>>>>>>>>>>+		    freq <= pin->prop.freq_supported[fs].max)
>>>>>>>>>>>> 			return true;
>>>>>>>>>>>> 	return false;
>>>>>>>>>>>> }
>>>>>>>>>>>>@@ -403,7 +403,7 @@ static int
>>>>>>>>>>>> dpll_cmd_pin_get_one(struct sk_buff *msg, struct dpll_pin *pin,
>>>>>>>>>>>> 		     struct netlink_ext_ack *extack)
>>>>>>>>>>>> {
>>>>>>>>>>>>-	const struct dpll_pin_properties *prop = pin->prop;
>>>>>>>>>>>>+	const struct dpll_pin_properties *prop = &pin->prop;
>>>>>>>>>>>> 	struct dpll_pin_ref *ref;
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>>@@ -696,7 +696,7 @@ dpll_pin_on_pin_state_set(struct dpll_pin
>>>>>>>>>>>>*pin,
>>>>>>>>>>>>u32
>>>>>>>>>>>>parent_idx,
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "state changing is not
>>>>>>>>>>>>allowed");
>>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>>> 	}
>>>>>>>>>>>>@@ -732,7 +732,7 @@ dpll_pin_state_set(struct dpll_device *dpll,
>>>>>>>>>>>>struct
>>>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "state changing is not
>>>>>>>>>>>>allowed");
>>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>>> 	}
>>>>>>>>>>>>@@ -759,7 +759,7 @@ dpll_pin_prio_set(struct dpll_device *dpll,
>>>>>>>>>>>>struct
>>>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE &
>>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "prio changing is not
>>>>>>>>>>>>allowed");
>>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>>> 	}
>>>>>>>>>>>>@@ -787,7 +787,7 @@ dpll_pin_direction_set(struct dpll_pin *pin,
>>>>>>>>>>>>struct
>>>>>>>>>>>>dpll_device *dpll,
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_DIRECTION_CAN_CHANGE &
>>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "direction changing is not
>>>>>>>>>>>>allowed");
>>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>>> 	}
>>>>>>>>>>>>@@ -817,8 +817,8 @@ dpll_pin_phase_adj_set(struct dpll_pin *pin,
>>>>>>>>>>>>struct
>>>>>>>>>>>>nlattr *phase_adj_attr,
>>>>>>>>>>>> 	int ret;
>>>>>>>>>>>>
>>>>>>>>>>>> 	phase_adj = nla_get_s32(phase_adj_attr);
>>>>>>>>>>>>-	if (phase_adj > pin->prop->phase_range.max ||
>>>>>>>>>>>>-	    phase_adj < pin->prop->phase_range.min) {
>>>>>>>>>>>>+	if (phase_adj > pin->prop.phase_range.max ||
>>>>>>>>>>>>+	    phase_adj < pin->prop.phase_range.min) {
>>>>>>>>>>>> 		NL_SET_ERR_MSG_ATTR(extack, phase_adj_attr,
>>>>>>>>>>>> 				    "phase adjust value not
>>>>>>>>>>>>supported");
>>>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>>>>@@ -999,7 +999,7 @@ dpll_pin_find(u64 clock_id, struct nlattr
>>>>>>>>>>>>*mod_name_attr,
>>>>>>>>>>>> 	unsigned long i;
>>>>>>>>>>>>
>>>>>>>>>>>> 	xa_for_each_marked(&dpll_pin_xa, i, pin, DPLL_REGISTERED)
>>>>>>>>>>>>{
>>>>>>>>>>>>-		prop = pin->prop;
>>>>>>>>>>>>+		prop = &pin->prop;
>>>>>>>>>>>> 		cid_match = clock_id ? pin->clock_id == clock_id :
>>>>>>>>>>>>true;
>>>>>>>>>>>> 		mod_match = mod_name_attr && module_name(pin-
>>>>>>>>>>>>module) ?
>>>>>>>>>>>> 			!nla_strcmp(mod_name_attr,
>>>>>>>>>>>>--
>>>>>>>>>>>>2.38.1
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>

^ permalink raw reply

* Re: [PATCH] tls: fix missing memory barrier in tls_init
From: kernel test robot @ 2023-11-10 14:19 UTC (permalink / raw)
  To: Dae R. Jeong, borisp, john.fastabend, kuba, davem, edumazet,
	pabeni, netdev, linux-kernel
  Cc: oe-kbuild-all, ywchoi
In-Reply-To: <ZU4Mk_RfzvRpwkmX@dragonet>

Hi Dae,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.6 next-20231110]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dae-R-Jeong/tls-fix-missing-memory-barrier-in-tls_init/20231110-190047
base:   linus/master
patch link:    https://lore.kernel.org/r/ZU4Mk_RfzvRpwkmX%40dragonet
patch subject: [PATCH] tls: fix missing memory barrier in tls_init
config: csky-randconfig-002-20231110 (https://download.01.org/0day-ci/archive/20231110/202311102254.YAU3c8F7-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231110/202311102254.YAU3c8F7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311102254.YAU3c8F7-lkp@intel.com/

All errors (new ones prefixed by >>):

   csky-linux-ld: net/tls/tls_toe.o: in function `tls_toe_bypass':
>> tls_toe.c:(.text+0xf4): undefined reference to `tls_ctx_create'
>> csky-linux-ld: tls_toe.c:(.text+0x14c): undefined reference to `tls_ctx_create'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator
From: Pavel Begunkov @ 2023-11-10 14:26 UTC (permalink / raw)
  To: Mina Almasry, David Ahern, David Wei
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, Willem de Bruijn, Shuah Khan, Sumit Semwal,
	Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAHS8izMKDOw5_y2MLRfuJHs=ai+sZ6GF7Rg1NuR_JqONg-5u5Q@mail.gmail.com>

On 11/7/23 23:03, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 2:55 PM David Ahern <dsahern@kernel.org> wrote:
>>
>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern@kernel.org> wrote:
>>>>
>>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>> index eeeda849115c..1c351c138a5b 100644
>>>>> --- a/include/linux/netdevice.h
>>>>> +++ b/include/linux/netdevice.h
>>>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>>>   };
>>>>>
>>>>>   #ifdef CONFIG_DMA_SHARED_BUFFER
>>>>> +struct page_pool_iov *
>>>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>>
>>>> netdev_{alloc,free}_dmabuf?
>>>>
>>>
>>> Can do.
>>>
>>>> I say that because a dmabuf can be host memory, at least I am not aware
>>>> of a restriction that a dmabuf is device memory.
>>>>
>>>
>>> In my limited experience dma-buf is generally device memory, and
>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>> dma-buf with a memfd which I think is used for testing. But I can do
>>> the rename, it's more clear anyway, I think.
>>
>> config UDMABUF
>>          bool "userspace dmabuf misc driver"
>>          default n
>>          depends on DMA_SHARED_BUFFER
>>          depends on MEMFD_CREATE || COMPILE_TEST
>>          help
>>            A driver to let userspace turn memfd regions into dma-bufs.
>>            Qemu can use this to create host dmabufs for guest framebuffers.
>>
>>
>> Qemu is just a userspace process; it is no way a special one.
>>
>> Treating host memory as a dmabuf should radically simplify the io_uring
>> extension of this set.
> 
> I agree actually, and I was about to make that comment to David Wei's
> series once I have the time.
> 
> David, your io_uring RX zerocopy proposal actually works with devmem
> TCP, if you're inclined to do that instead, what you'd do roughly is
> (I think):
That would be a Frankenstein's monster api with no good reason for it.
You bind memory via netlink because you don't have a proper context to
work with otherwise, io_uring serves as the context with a separate and
precise abstraction around queues. Same with dmabufs, it totally makes
sense for device memory, but wrapping host memory into a file just to
immediately unwrap it back with no particular benefits from doing so
doesn't seem like a good uapi. Currently, the difference will be
hidden by io_uring.

And we'd still need to have a hook in pp's get page to grab buffers from
the buffer ring instead of refilling via SO_DEVMEM_DONTNEED and a
callback for when skbs are dropped. It's just instead of a new pp ops
it'll be a branch in the devmem path. io_uring might want to use the
added iov format in the future for device memory or even before that,
io_uring doesn't really care whether it's pages or not.

It's also my big concern from how many optimisations it'll fence us off.
With the current io_uring RFC I can get rid of all buffer atomic
refcounting and replace it with a single percpu counting per skb.
Hopefully, that will be doable after we place it on top of pp providers.


> - Allocate a memfd,
> - Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
> - Bind the dma-buf to the NIC using the netlink API in this RFC.
> - Your io_uring extensions and io_uring uapi should work as-is almost
> on top of this series, I think.
> 
> If you do this the incoming packets should land into your memfd, which
> may or may not work for you. In the future if you feel inclined to use
> device memory, this approach that I'm describing here would be more
> extensible to device memory, because you'd already be using dma-bufs
> for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
> with another.
> 
>> That the io_uring set needs to dive into
>> page_pools is just wrong - complicating the design and code and pushing
>> io_uring into a realm it does not need to be involved in.

I disagree. How does it complicate it? io_uring will be just a yet another
provider implementing the callbacks of the API created for such use cases
and not changing common pp/net bits. The rest of code is in io_uring
implementing interaction with userspace and other usability features, but
there will be anyway some amount of code if we want to have a convenient
and performant api via io_uring.


>>
>> Most (all?) of this patch set can work with any memory; only device
>> memory is unreadable.

-- 
Pavel Begunkov

^ permalink raw reply

* [PATCH net,v4, 0/3] hv_netvsc: fix race of netvsc, VF register, and slave bit
From: Haiyang Zhang @ 2023-11-10 14:38 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, kys, wei.liu, decui, edumazet, kuba, pabeni, davem,
	linux-kernel

There are some races between netvsc probe, set notifier, VF register,
and slave bit setting.
This patch set fixes them.


Haiyang Zhang (2):
  hv_netvsc: fix race of netvsc and VF register_netdevice
  hv_netvsc: Fix race of register_netdevice_notifier and VF register

Long Li (1):
  hv_netvsc: Mark VF as slave before exposing it to user-mode

 drivers/net/hyperv/netvsc_drv.c | 64 ++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 21 deletions(-)

-- 
2.25.1


^ permalink raw reply

* [PATCH net,v4, 1/3] hv_netvsc: fix race of netvsc and VF register_netdevice
From: Haiyang Zhang @ 2023-11-10 14:38 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, kys, wei.liu, decui, edumazet, kuba, pabeni, davem,
	linux-kernel, stable
In-Reply-To: <1699627140-28003-1-git-send-email-haiyangz@microsoft.com>

The rtnl lock also needs to be held before rndis_filter_device_add()
which advertises nvsp_2_vsc_capability / sriov bit, and triggers
VF NIC offering and registering. If VF NIC finished register_netdev()
earlier it may cause name based config failure.

To fix this issue, move the call to rtnl_lock() before
rndis_filter_device_add(), so VF will be registered later than netvsc
/ synthetic NIC, and gets a name numbered (ethX) after netvsc.

Cc: stable@vger.kernel.org
Fixes: e04e7a7bbd4b ("hv_netvsc: Fix a deadlock by getting rtnl lock earlier in netvsc_probe()")
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>

---
v3:
  Divide it into two patches, suggested by Jakub Kicinski.
v2:
  Fix rtnl_unlock() in error handling as found by Wojciech Drewek.
---
 drivers/net/hyperv/netvsc_drv.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 3ba3c8fb28a5..5e528a76f5f5 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2531,15 +2531,6 @@ static int netvsc_probe(struct hv_device *dev,
 		goto devinfo_failed;
 	}
 
-	nvdev = rndis_filter_device_add(dev, device_info);
-	if (IS_ERR(nvdev)) {
-		ret = PTR_ERR(nvdev);
-		netdev_err(net, "unable to add netvsc device (ret %d)\n", ret);
-		goto rndis_failed;
-	}
-
-	eth_hw_addr_set(net, device_info->mac_adr);
-
 	/* We must get rtnl lock before scheduling nvdev->subchan_work,
 	 * otherwise netvsc_subchan_work() can get rtnl lock first and wait
 	 * all subchannels to show up, but that may not happen because
@@ -2547,9 +2538,23 @@ static int netvsc_probe(struct hv_device *dev,
 	 * -> ... -> device_add() -> ... -> __device_attach() can't get
 	 * the device lock, so all the subchannels can't be processed --
 	 * finally netvsc_subchan_work() hangs forever.
+	 *
+	 * The rtnl lock also needs to be held before rndis_filter_device_add()
+	 * which advertises nvsp_2_vsc_capability / sriov bit, and triggers
+	 * VF NIC offering and registering. If VF NIC finished register_netdev()
+	 * earlier it may cause name based config failure.
 	 */
 	rtnl_lock();
 
+	nvdev = rndis_filter_device_add(dev, device_info);
+	if (IS_ERR(nvdev)) {
+		ret = PTR_ERR(nvdev);
+		netdev_err(net, "unable to add netvsc device (ret %d)\n", ret);
+		goto rndis_failed;
+	}
+
+	eth_hw_addr_set(net, device_info->mac_adr);
+
 	if (nvdev->num_chn > 1)
 		schedule_work(&nvdev->subchan_work);
 
@@ -2586,9 +2591,9 @@ static int netvsc_probe(struct hv_device *dev,
 	return 0;
 
 register_failed:
-	rtnl_unlock();
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
+	rtnl_unlock();
 	netvsc_devinfo_put(device_info);
 devinfo_failed:
 	free_percpu(net_device_ctx->vf_stats);
-- 
2.25.1


^ permalink raw reply related

* [PATCH net,v4, 2/3] hv_netvsc: Fix race of register_netdevice_notifier and VF register
From: Haiyang Zhang @ 2023-11-10 14:38 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, kys, wei.liu, decui, edumazet, kuba, pabeni, davem,
	linux-kernel, stable
In-Reply-To: <1699627140-28003-1-git-send-email-haiyangz@microsoft.com>

If VF NIC is registered earlier, NETDEV_REGISTER event is replayed,
but NETDEV_POST_INIT is not.

Move register_netdevice_notifier() earlier, so the call back
function is set before probing.

Cc: stable@vger.kernel.org
Fixes: e04e7a7bbd4b ("hv_netvsc: Fix a deadlock by getting rtnl lock earlier in netvsc_probe()")
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>

---
v3:
  Divide it into two patches, suggested by Jakub Kicinski.
v2:
  Fix rtnl_unlock() in error handling as found by Wojciech Drewek.
---
 drivers/net/hyperv/netvsc_drv.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 5e528a76f5f5..1d1491da303b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2793,11 +2793,14 @@ static int __init netvsc_drv_init(void)
 	}
 	netvsc_ring_bytes = ring_size * PAGE_SIZE;
 
+	register_netdevice_notifier(&netvsc_netdev_notifier);
+
 	ret = vmbus_driver_register(&netvsc_drv);
-	if (ret)
+	if (ret) {
+		unregister_netdevice_notifier(&netvsc_netdev_notifier);
 		return ret;
+	}
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH net,v4, 3/3] hv_netvsc: Mark VF as slave before exposing it to user-mode
From: Haiyang Zhang @ 2023-11-10 14:39 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, kys, wei.liu, decui, edumazet, kuba, pabeni, davem,
	linux-kernel, Long Li, stable
In-Reply-To: <1699627140-28003-1-git-send-email-haiyangz@microsoft.com>

From: Long Li <longli@microsoft.com>

When a VF is being exposed form the kernel, it should be marked as "slave"
before exposing to the user-mode. The VF is not usable without netvsc
running as master. The user-mode should never see a VF without the "slave"
flag.

This commit moves the code of setting the slave flag to the time before
VF is exposed to user-mode.

Cc: stable@vger.kernel.org
Fixes: 0c195567a8f6 ("netvsc: transparent VF management")
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>

---
v4:
Add comments in get_netvsc_byslot() explaining the need to check dev_addr
v2:
Use a new function to handle NETDEV_POST_INIT.
---
 drivers/net/hyperv/netvsc_drv.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1d1491da303b..52dbf1de7347 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2206,9 +2206,6 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
 		goto upper_link_failed;
 	}
 
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
-
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
 	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
@@ -2315,16 +2312,18 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 
 	}
 
-	/* Fallback path to check synthetic vf with
-	 * help of mac addr
+	/* Fallback path to check synthetic vf with help of mac addr.
+	 * Because this function can be called before vf_netdev is
+	 * initialized (NETDEV_POST_INIT) when its perm_addr has not been copied
+	 * from dev_addr, also try to match to its dev_addr.
+	 * Note: On Hyper-V and Azure, it's not possible to set a MAC address
+	 * on a VF that matches to the MAC of a unrelated NETVSC device.
 	 */
 	list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
 		ndev = hv_get_drvdata(ndev_ctx->device_ctx);
-		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
-			netdev_notice(vf_netdev,
-				      "falling back to mac addr based matching\n");
+		if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr) ||
+		    ether_addr_equal(vf_netdev->dev_addr, ndev->perm_addr))
 			return ndev;
-		}
 	}
 
 	netdev_notice(vf_netdev,
@@ -2332,6 +2331,19 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 	return NULL;
 }
 
+static int netvsc_prepare_slave(struct net_device *vf_netdev)
+{
+	struct net_device *ndev;
+
+	ndev = get_netvsc_byslot(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
+
+	/* set slave flag before open to prevent IPv6 addrconf */
+	vf_netdev->flags |= IFF_SLAVE;
+	return NOTIFY_DONE;
+}
+
 static int netvsc_register_vf(struct net_device *vf_netdev)
 {
 	struct net_device_context *net_device_ctx;
@@ -2758,6 +2770,8 @@ static int netvsc_netdev_event(struct notifier_block *this,
 		return NOTIFY_DONE;
 
 	switch (event) {
+	case NETDEV_POST_INIT:
+		return netvsc_prepare_slave(event_dev);
 	case NETDEV_REGISTER:
 		return netvsc_register_vf(event_dev);
 	case NETDEV_UNREGISTER:
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net-next] packet: add a generic drop reason for receive
From: Yan Zhai @ 2023-11-10 14:42 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Eric Dumazet, netdev, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Weongyo Jeong, Ivan Babrou,
	David Ahern, Jesper Brouer, linux-kernel, kernel-team
In-Reply-To: <ZU4CLCk1APrD3Yzi@nanopsycho>

On Fri, Nov 10, 2023 at 4:13 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Nov 10, 2023 at 10:30:49AM CET, edumazet@google.com wrote:
> >On Fri, Nov 10, 2023 at 6:49 AM Yan Zhai <yan@cloudflare.com> wrote:
>
> [..]
>
> >1) Note that net-next is currently closed.
>
> I wonder, can't some bot be easily set up to warn about
> this automatically?
>

It's funny that I actually got notified about an individual recipient
mailbox being full.. Side channel :)

^ permalink raw reply

* Re: [PATCH net-next] packet: add a generic drop reason for receive
From: Yan Zhai @ 2023-11-10 14:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Willem de Bruijn, Weongyo Jeong, Ivan Babrou, David Ahern,
	Jesper Brouer, linux-kernel, kernel-team
In-Reply-To: <CANn89iKZYsWGT1weXZ6W7_z28dqJwTZeg+2_Lw+x+6spUHp8Eg@mail.gmail.com>

On Fri, Nov 10, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> it is time we replace the various constructs which do not help readability:
>
> if (something)
>      consume_skb(skb);
> else
>      kfree_skb_reason(skb, drop_reason);
>
> By:
>
> kfree_skb_reason(skb, drop_reason);
>
> (By using drop_reason == SKB_CONSUMED when appropriate)
Will send a V2 when net-next reopens

^ permalink raw reply

* Re: tcpdump and Big TCP
From: Xin Long @ 2023-11-10 14:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Ahern, netdev@vger.kernel.org
In-Reply-To: <CADvbK_epdT+s-peW9v1oKGrTfttrVFCgSLkdwLLBAT2N+ZDdMQ@mail.gmail.com>

On Mon, Oct 2, 2023 at 2:59 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Mon, Oct 2, 2023 at 1:26 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Oct 2, 2023 at 7:19 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> > > On Mon, Oct 2, 2023 at 12:25 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Oct 2, 2023 at 6:20 PM David Ahern <dsahern@kernel.org> wrote:
> > > > >
> > > > > Eric:
> > > > >
> > > > > Looking at the tcpdump source code, it has a GUESS_TSO define that can
> > > > > be enabled to dump IPv4 packets with tot_len = 0:
> > > > >
> > > > >         if (len < hlen) {
> > > > > #ifdef GUESS_TSO
> > > > >             if (len) {
> > > > >                 ND_PRINT("bad-len %u", len);
> > > > >                 return;
> > > > >             }
> > > > >             else {
> > > > >                 /* we guess that it is a TSO send */
> > > > >                 len = length;
> > > > >             }
> > > > > #else
> > > > >             ND_PRINT("bad-len %u", len);
> > > > >             return;
> > > > > #endif /* GUESS_TSO */
> > > > >         }
> > > > >
> > > > >
> > > > > The IPv6 version has a similar check but no compile change needed:
> > > > >         /*
> > > > >          * RFC 1883 says:
> > > > >          *
> > > > >          * The Payload Length field in the IPv6 header must be set to zero
> > > > >          * in every packet that carries the Jumbo Payload option.  If a
> > > > >          * packet is received with a valid Jumbo Payload option present and
> > > > >          * a non-zero IPv6 Payload Length field, an ICMP Parameter Problem
> > > > >          * message, Code 0, should be sent to the packet's source, pointing
> > > > >          * to the Option Type field of the Jumbo Payload option.
> > > > >          *
> > > > >          * Later versions of the IPv6 spec don't discuss the Jumbo Payload
> > > > >          * option.
> > > > >          *
> > > > >          * If the payload length is 0, we temporarily just set the total
> > > > >          * length to the remaining data in the packet (which, for Ethernet,
> > > > >          * could include frame padding, but if it's a Jumbo Payload frame,
> > > > >          * it shouldn't even be sendable over Ethernet, so we don't worry
> > > > >          * about that), so we can process the extension headers in order
> > > > >          * to *find* a Jumbo Payload hop-by-hop option and, when we've
> > > > >          * processed all the extension headers, check whether we found
> > > > >          * a Jumbo Payload option, and fail if we haven't.
> > > > >          */
> > > > >         if (payload_len != 0) {
> > > > >                 len = payload_len + sizeof(struct ip6_hdr);
> > > > >                 if (length < len)
> > > > >                         ND_PRINT("truncated-ip6 - %u bytes missing!",
> > > > >                                 len - length);
> > > > >         } else
> > > > >                 len = length + sizeof(struct ip6_hdr);
> > > > >
> > > > >
> > > > > Maybe I am missing something, but it appears that no code change to
> > > > > tcpdump is needed for Linux Big TCP packets other than enabling that
> > > > > macro when building. I did that in a local build and the large packets
> > > > > were dumped just fine.
> > > > >
> > > Right, wireshark/tshark currently has no problem parsing BIG TCP IPv4 packets.
> > > I think it enables GUESS_TSO by default.
> > >
> > > We also enabled GUESS_TSO in tcpdump for RHEL-9 when BIG TCP IPv4 was
> > > backported in it.
> >
> > Make sure to enable this in tcpdump source, so that other distros do
> > not have to 'guess'.
> Looks the tcpdump maintainer has posted one:
>
A couple of updates:

> https://github.com/the-tcpdump-group/tcpdump/pull/1085
In tcpdump, this one has been Merged into master and tcpdump-4.99 branch.
It means tcpdump has officially supported BIG TCP parsing on upstream and
its next release version.

For wireshark, according to the maintainer Guy Harris,
Code in Wireshark to deal with the total length being 0 in the IPv4 header
dates back to at least 2012.

For the TP_STATUS from packet socket including our TP_STATUS_GSO_TCP,
it has been merged into the pcapng IETF draft doc:

  https://github.com/IETF-OPSAWG-WG/draft-ietf-opsawg-pcap/pull/144

and the next step is to implement it in both tcpdump and wireshark.

But in my opinion, this doesn't block the IPv6 jumbo headers removal,
as both tcpdump and wireshark now have no issue to parse BIG TCP packets.

Thanks.

>
> >
> > >
> > > >
> > > > My point is that tcpdump should not guess, but look at TP_STATUS_GSO_TCP
> > > > (and TP_STATUS_CSUM_VALID would also be nice)
> > > >
> > > > Otherwise, why add TP_STATUS_GSO_TCP in the first place ?
> > > That's for more reliable parsing in the future.
> >
> > We want this. I thought this was obvious.
> >
> > >
> > > As currently in libpcap, it doesn't save meta_data(like
> > > TP_STATUS_CSUM_VALID/GSO_TCP)
> > > to 'pcap' files, and it requires libpcap APIs change and uses the
> > > 'pcap-ng' file format.
> > > I think it will take quite some time to implement in userspace.
> >
> > Great. Until this is implemented as discussed last year, we will not remove
> > IPv6 jumbo headers.
> I will get back to this libpcap APIs and pcap-ng things, and let you
> know when it's done.
>
> Thanks.

^ permalink raw reply

* RE: [PATCH] netlink: introduce netlink poll to resolve fast return issue
From: Jong eon Park @ 2023-11-10 14:54 UTC (permalink / raw)
  To: 'Jakub Kicinski'
  Cc: 'Paolo Abeni', 'David S. Miller',
	'Eric Dumazet', netdev, linux-kernel,
	'Dong ha Kang'
In-Reply-To: <20231107085347.75bc3802@kernel.org>



On Tuesday, Nov 7, 2023 at 08:48 Jakub Kicinski wrote:
> Why does the wake up happen in the first place?
> I don't see anything special in the netlink code, so I'm assuming it's
> because datagram_poll() returns EPOLLERR.
> 
> The man page says:
> 
>        EPOLLERR
>               Error condition happened on the associated file
>               descriptor.  This event is also reported for the write end
>               of a pipe when the read end has been closed.
> 
>               epoll_wait(2) will always report for this event; it is not
>               necessary to set it in events when calling epoll_ctl().
> 
> To me that sounds like EPOLLERR is always implicitly enabled, and should
> be handled by the application. IOW it's an pure application bug.
> 
> Are you aware of any precedent for sockets adding in EPOLLOUT when
> EPOLLERR is set?

In my case, the first wake-up was by POLLIN, not POLLERR.
Please consider the below scenario.

------------CPU1 (kernel)----------  --------------CPU2 (app)--------------
...
a driver delivers uevent.              poll was waiting for schedule.
a driver delivers uevent.
a driver delivers uevent.
...
1) netlink_broadcast_deliver fails.
(sk_rmem_alloc > sk_rcvbuf)
                                            getting schedule and poll
returns,
                                            and the app calls recv.
                                            (rcv queue is empied)
                                            2)

netlink_overrun is called.
(NETLINK_S_CONGESTED flag is set,
ENOBUF is written in sk_err and,
wake up poll.)
                                            finishing its job and call poll
again.
                                            poll returns POLLERR.

                                            (the app doesn't have POLLERR
handler,)
                                            it calls poll, but getting
POLLERR.
                                            it calls poll, but getting
POLLERR.
                                            it calls poll, but getting
POLLERR.
                                            ...
									 

Interestingly, in this issue, even though netlink overrun frequently 
happened and caused POLLERRs, the user was managing it well through 
POLLIN and 'recv' function without a specific POLLERR handler. 
However, in the current situation, rcv queue is already empty and 
NETLINK_S_CONGESTED flag prevents any more incoming packets. This makes 
it impossible for the user to call 'recv'.

This "congested" situation is a bit ambiguous. The queue is empty, yet 
'congested' remains. This means kernel can no longer deliver uevents 
despite the empty queue, and it lead to the persistent 'congested' status.

The reason for the difference in netlink lies in the NETLINK_S_CONGESTED 
flag. If it were UDP, upon seeing the empty queue, it might have kept 
pushing the received packets into the queue (making possible to call 
'recv').

BRs,
JE Park.



^ permalink raw reply

* Re: [PATCH v2 1/2] tg3: Move the [rt]x_dropped counters to tg3_napi
From: Andrew Lunn @ 2023-11-10 15:13 UTC (permalink / raw)
  To: alexey.pakhunov
  Cc: mchan, vincent.wong2, netdev, linux-kernel, siva.kallam, prashant
In-Reply-To: <20231110002340.3612515-1-alexey.pakhunov@spacex.com>

> @@ -11895,6 +11903,9 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
>  {
>  	struct rtnl_link_stats64 *old_stats = &tp->net_stats_prev;
>  	struct tg3_hw_stats *hw_stats = tp->hw_stats;
> +	unsigned long rx_dropped;
> +	unsigned long tx_dropped;
> +	int i;
>  
>  	stats->rx_packets = old_stats->rx_packets +
>  		get_stat64(&hw_stats->rx_ucast_packets) +
> @@ -11941,8 +11952,26 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
>  	stats->rx_missed_errors = old_stats->rx_missed_errors +
>  		get_stat64(&hw_stats->rx_discards);
>  
> -	stats->rx_dropped = tp->rx_dropped;
> -	stats->tx_dropped = tp->tx_dropped;
> +	/* Aggregate per-queue counters. The per-queue counters are updated
> +	 * by a single writer, race-free. The result computed by this loop
> +	 * might not be 100% accurate (counters can be updated in the middle of
> +	 * the loop) but the next tg3_get_nstats() will recompute the current
> +	 * value so it is acceptable.
> +	 *
> +	 * Note that these counters wrap around at 4G on 32bit machines.
> +	 */
> +	rx_dropped = (unsigned long)(old_stats->rx_dropped);
> +	tx_dropped = (unsigned long)(old_stats->tx_dropped);

Isn't this wrapping artificial? old_stats is of type
rtnl_link_stats64, so the counters are 64 bit.

> +
> +	for (i = 0; i < tp->irq_cnt; i++) {
> +		struct tg3_napi *tnapi = &tp->napi[i];
> +
> +		rx_dropped += tnapi->rx_dropped;
> +		tx_dropped += tnapi->tx_dropped;
> +	}
> +
> +	stats->rx_dropped = rx_dropped;
> +	stats->tx_dropped = tx_dropped;

state is also rtnl_link_stats64 so has 64 bit counters. So this code
is throwing away the upper 32 bits.

Why not use include/linux/u64_stats_sync.h, which should cost you
nothing on 64 bit machines, and do the right thing on 32 bit machines.

	Andrew

^ permalink raw reply

* [PATCH net] net: gso_test: support CONFIG_MAX_SKB_FRAGS up to 45
From: Willem de Bruijn @ 2023-11-10 15:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, edumazet, pabeni, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The test allocs a single page to hold all the frag_list skbs. This
is insufficient on kernels with CONFIG_MAX_SKB_FRAGS=45, due to the
increased skb_shared_info frags[] array length.

        gso_test_func: ASSERTION FAILED at net/core/gso_test.c:210
        Expected alloc_size <= ((1UL) << 12), but
            alloc_size == 5075 (0x13d3)
            ((1UL) << 12) == 4096 (0x1000)

Simplify the logic. Just allocate a page for each frag_list skb.

Fixes: 4688ecb1385f ("net: expand skb_segment unit test with frag_list coverage")
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/gso_test.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/net/core/gso_test.c b/net/core/gso_test.c
index ceb684be4cbf..4c2e77bd12f4 100644
--- a/net/core/gso_test.c
+++ b/net/core/gso_test.c
@@ -180,18 +180,17 @@ static void gso_test_func(struct kunit *test)
 	}
 
 	if (tcase->frag_skbs) {
-		unsigned int total_size = 0, total_true_size = 0, alloc_size = 0;
+		unsigned int total_size = 0, total_true_size = 0;
 		struct sk_buff *frag_skb, *prev = NULL;
 
-		page = alloc_page(GFP_KERNEL);
-		KUNIT_ASSERT_NOT_NULL(test, page);
-		page_ref_add(page, tcase->nr_frag_skbs - 1);
-
 		for (i = 0; i < tcase->nr_frag_skbs; i++) {
 			unsigned int frag_size;
 
+			page = alloc_page(GFP_KERNEL);
+			KUNIT_ASSERT_NOT_NULL(test, page);
+
 			frag_size = tcase->frag_skbs[i];
-			frag_skb = build_skb(page_address(page) + alloc_size,
+			frag_skb = build_skb(page_address(page),
 					     frag_size + shinfo_size);
 			KUNIT_ASSERT_NOT_NULL(test, frag_skb);
 			__skb_put(frag_skb, frag_size);
@@ -204,11 +203,8 @@ static void gso_test_func(struct kunit *test)
 
 			total_size += frag_size;
 			total_true_size += frag_skb->truesize;
-			alloc_size += frag_size + shinfo_size;
 		}
 
-		KUNIT_ASSERT_LE(test, alloc_size, PAGE_SIZE);
-
 		skb->len += total_size;
 		skb->data_len += total_size;
 		skb->truesize += total_true_size;
-- 
2.43.0.rc0.421.g78406f8d94-goog


^ permalink raw reply related

* Re: [PATCH bpf-next] bpf, sockmap: Bundle psock->sk_redir and redir_ingress into a tagged pointer
From: Alexei Starovoitov @ 2023-11-10 15:42 UTC (permalink / raw)
  To: Pengcheng Yang; +Cc: Jakub Sitnicki, John Fastabend, bpf, Network Development
In-Reply-To: <1699609302-8605-1-git-send-email-yangpc@wangsu.com>

On Fri, Nov 10, 2023 at 1:44 AM Pengcheng Yang <yangpc@wangsu.com> wrote:
>
> Like skb->_sk_redir, we bundle the sock redirect pointer and
> the ingress bit to manage them together.
>
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Link: https://lore.kernel.org/bpf/87cz97cnz8.fsf@cloudflare.com
> Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
> ---
>  include/linux/skmsg.h | 30 ++++++++++++++++++++++++++++--
>  net/core/skmsg.c      | 18 ++++++++++--------
>  net/ipv4/tcp_bpf.c    | 13 +++++++------
>  net/tls/tls_sw.c      | 11 ++++++-----
>  4 files changed, 51 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index c1637515a8a4..ae021f511f46 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -78,11 +78,10 @@ struct sk_psock_work_state {
>
>  struct sk_psock {
>         struct sock                     *sk;
> -       struct sock                     *sk_redir;
> +       unsigned long                   _sk_redir;

Please don't.
There is no need to bundle them together.

^ permalink raw reply

* [PATCH net v4] net: ti: icssg-prueth: Add missing icss_iep_put to error path
From: Jan Kiszka @ 2023-11-10 16:13 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	MD Danish Anwar
  Cc: netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
	Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D),
	Wojciech Drewek, Roger Quadros

From: Jan Kiszka <jan.kiszka@siemens.com>

Analogously to prueth_remove, just also taking care for NULL'ing the
iep pointers.

Fixes: 186734c15886 ("net: ti: icssg-prueth: add packet timestamping and ptp support")
Fixes: 443a2367ba3c ("net: ti: icssg-prueth: am65x SR2.0 add 10M full duplex support")
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
---

Changes in v4:
 - no functional ones
 - added one original author in CC with new address (no address of
   Grygorii available)

Changes in v3:
 - consolidate cleanup logic further [Wojciech]
 - make sure to NULL iep pointers

Changes in v2:
 - add proper tags

This was lost from the TI SDK version while ripping out SR1.0 support - 
which we are currently restoring for upstream.

 drivers/net/ethernet/ti/icssg/icssg_prueth.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index 6c4b64227ac8..3abbeba26f1b 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -2105,10 +2105,7 @@ static int prueth_probe(struct platform_device *pdev)
 	prueth->iep1 = icss_iep_get_idx(np, 1);
 	if (IS_ERR(prueth->iep1)) {
 		ret = dev_err_probe(dev, PTR_ERR(prueth->iep1), "iep1 get failed\n");
-		icss_iep_put(prueth->iep0);
-		prueth->iep0 = NULL;
-		prueth->iep1 = NULL;
-		goto free_pool;
+		goto put_iep0;
 	}
 
 	if (prueth->pdata.quirk_10m_link_issue) {
@@ -2205,6 +2202,12 @@ static int prueth_probe(struct platform_device *pdev)
 exit_iep:
 	if (prueth->pdata.quirk_10m_link_issue)
 		icss_iep_exit_fw(prueth->iep1);
+	icss_iep_put(prueth->iep1);
+
+put_iep0:
+	icss_iep_put(prueth->iep0);
+	prueth->iep0 = NULL;
+	prueth->iep1 = NULL;
 
 free_pool:
 	gen_pool_free(prueth->sram_pool,
-- 
2.35.3

^ permalink raw reply related

* [PATCH net v3] net: ti: icssg-prueth: Fix error cleanup on failing pruss_request_mem_region
From: Jan Kiszka @ 2023-11-10 16:13 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	MD Danish Anwar
  Cc: netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
	Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D),
	Wojciech Drewek, Roger Quadros

From: Jan Kiszka <jan.kiszka@siemens.com>

We were just continuing in this case, surely not desired.

Fixes: 128d5874c082 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
---

Changes in v3:
 - no functional ones
 - added original author in CC with new address

Changes in v2:
 - add proper tags

 drivers/net/ethernet/ti/icssg/icssg_prueth.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
index 3abbeba26f1b..411898a4f38c 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
@@ -2063,7 +2063,7 @@ static int prueth_probe(struct platform_device *pdev)
 				       &prueth->shram);
 	if (ret) {
 		dev_err(dev, "unable to get PRUSS SHRD RAM2: %d\n", ret);
-		pruss_put(prueth->pruss);
+		goto put_pruss;
 	}
 
 	prueth->sram_pool = of_gen_pool_get(np, "sram", 0);
@@ -2215,6 +2215,8 @@ static int prueth_probe(struct platform_device *pdev)
 
 put_mem:
 	pruss_release_mem_region(prueth->pruss, &prueth->shram);
+
+put_pruss:
 	pruss_put(prueth->pruss);
 
 put_cores:
-- 
2.35.3

^ permalink raw reply related

* Re: [PATCH net v4] net: ti: icssg-prueth: Add missing icss_iep_put to error path
From: Anwar, Md Danish @ 2023-11-10 16:18 UTC (permalink / raw)
  To: Jan Kiszka, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, MD Danish Anwar
  Cc: netdev, linux-kernel, Lopes Ivo, Diogo Miguel (T CED IFD-PT),
	Nishanth Menon, Su, Bao Cheng (RC-CN DF FA R&D),
	Wojciech Drewek, Roger Quadros
In-Reply-To: <7a4e5c5b-e397-479b-b1cb-4b50da248f21@siemens.com>

On 11/10/2023 9:43 PM, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Analogously to prueth_remove, just also taking care for NULL'ing the
> iep pointers.
> 
> Fixes: 186734c15886 ("net: ti: icssg-prueth: add packet timestamping and ptp support")
> Fixes: 443a2367ba3c ("net: ti: icssg-prueth: am65x SR2.0 add 10M full duplex support")
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>

> ---
> 
> Changes in v4:
>  - no functional ones
>  - added one original author in CC with new address (no address of
>    Grygorii available)
> 
> Changes in v3:
>  - consolidate cleanup logic further [Wojciech]
>  - make sure to NULL iep pointers
> 
> Changes in v2:
>  - add proper tags
> 
> This was lost from the TI SDK version while ripping out SR1.0 support - 
> which we are currently restoring for upstream.
> 
>  drivers/net/ethernet/ti/icssg/icssg_prueth.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/icssg/icssg_prueth.c b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
> index 6c4b64227ac8..3abbeba26f1b 100644
> --- a/drivers/net/ethernet/ti/icssg/icssg_prueth.c
> +++ b/drivers/net/ethernet/ti/icssg/icssg_prueth.c
> @@ -2105,10 +2105,7 @@ static int prueth_probe(struct platform_device *pdev)
>  	prueth->iep1 = icss_iep_get_idx(np, 1);
>  	if (IS_ERR(prueth->iep1)) {
>  		ret = dev_err_probe(dev, PTR_ERR(prueth->iep1), "iep1 get failed\n");
> -		icss_iep_put(prueth->iep0);
> -		prueth->iep0 = NULL;
> -		prueth->iep1 = NULL;
> -		goto free_pool;
> +		goto put_iep0;
>  	}
>  
>  	if (prueth->pdata.quirk_10m_link_issue) {
> @@ -2205,6 +2202,12 @@ static int prueth_probe(struct platform_device *pdev)
>  exit_iep:
>  	if (prueth->pdata.quirk_10m_link_issue)
>  		icss_iep_exit_fw(prueth->iep1);
> +	icss_iep_put(prueth->iep1);
> +
> +put_iep0:
> +	icss_iep_put(prueth->iep0);
> +	prueth->iep0 = NULL;
> +	prueth->iep1 = NULL;
>  
>  free_pool:
>  	gen_pool_free(prueth->sram_pool,

-- 
Thanks and Regards,
Md Danish Anwar

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox