Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next v6 0/5] Coalesce mac ocp write/modify calls to reduce spinlock contention
From: Mirsad Goran Todorovac @ 2023-11-04 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Eric Dumazet, Heiner Kallweit, Jakub Kicinski,
	Paolo Abeni, netdev, nic_swsd, Mirsad Goran Todorovac

The motivation for these helpers was the locking overhead of 130 consecutive
r8168_mac_ocp_write() calls in the RTL8411b reset after the NIC gets confused
if the PHY is powered-down.

To quote Heiner:

    On RTL8411b the RX unit gets confused if the PHY is powered-down.
    This was reported in [0] and confirmed by Realtek. Realtek provided
    a sequence to fix the RX unit after PHY wakeup.

A series of about 130 r8168_mac_ocp_write() calls is performed to program the
RTL registers for recovery, each doing an expensive spin_lock_irqsave() and
spin_unlock_irqrestore().

Each mac ocp write is made of:

    static void __r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
                      u32 data)
    {
        if (rtl_ocp_reg_failure(reg))
            return;

        RTL_W32(tp, OCPDR, OCPAR_FLAG | (reg << 15) | data);
    }

    static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg,
                    u32 data)
    {
        unsigned long flags;

        raw_spin_lock_irqsave(&tp->mac_ocp_lock, flags);
        __r8168_mac_ocp_write(tp, reg, data);
        raw_spin_unlock_irqrestore(&tp->mac_ocp_lock, flags);
    }

Register programming is done through RTL_W32() macro which expands into

    #define RTL_W32(tp, reg, val32) writel((val32), tp->mmio_addr + (reg))

which is further (on Alpha):

    extern inline void writel(u32 b, volatile void __iomem *addr)
    {
        mb();
        __raw_writel(b, addr);
    }

or on i386/x86_64:

    #define build_mmio_write(name, size, type, reg, barrier) \
    static inline void name(type val, volatile void __iomem *addr) \
    { asm volatile("mov" size " %0,%1": :reg (val), \
    "m" (*(volatile type __force *)addr) barrier); }

    build_mmio_write(writel, "l", unsigned int, "r", :"memory")

This obviously involves iat least a compiler barrier.

mb() expands into something like this i.e. on x86_64:

    #define mb()    asm volatile("lock; addl $0,0(%%esp)" ::: "memory")

This means a whole lot of memory bus stalls: for spin_lock_irqsave(),
memory barrier, writel(), and spin_unlock_irqrestore().

With about 130 of these sequential calls to r8168_mac_ocp_write() this looks like
a lock storm that will stall all of the cores and CPUs on the same memory controller
for certain time I/O takes to finish.

In a sequential case of RTL register programming, the writes to RTL registers
can be coalesced under a same raw spinlock. This can dramatically decrease the
number of bus stalls in a multicore or multi-CPU system.

Macro helpers r8168_mac_ocp_write_seq() and r8168_mac_ocp_modify_seq() are
provided to reduce lock contention:

    static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
    {

        ...

        /* The following Realtek-provided magic fixes an issue with the RX unit
         * getting confused after the PHY having been powered-down.
         */

        static const struct recover_8411b_info init_zero_seq[] = {
            { 0xFC28, 0x0000 }, { 0xFC2A, 0x0000 }, { 0xFC2C, 0x0000 },
            ...
        };

        ...

        r8168_mac_ocp_write_seq(tp, init_zero_seq);

        ...

    }

The hex data is preserved intact through s/r8168_mac_ocp_write[(]tp,/{ / and s/[)];/ },/
functions that only changed the function names and the ending of the line, so the actual
hex data is unchanged.

To repeat, the reason for the introduction of the original commit
was to enable recovery of the RX unit on the RTL8411b which was confused by the
powered-down PHY. This sequence of r8168_mac_ocp_write() calls amplifies the problem
into a series of about 500+ memory bus locks, most waiting for the main memory read,
modify and write under a LOCK. The memory barrier in RTL_W32 should suffice for
the programming sequence to reach RTL NIC registers.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1692075

v6:
 proceeded according to Jacob Keller's suggestions by creating a cover page and reducing
 the text within the commits. Applying to the net-next tree as Heiner Kallweit requested.

v5:
 attempted some new optimisations, which were rejected, but not all and not completely.

v4:
 fixed complaints as advised by Heiner and checkpatch.pl.
 split the patch into five sections to be more easily manipulated and reviewed
 introduced r8168_mac_ocp_write_seq()
 applied coalescing of mac ocp writes/modifies for 8168H, 8125 and 8125B

v3:
 removed register/mask pair array sentinels, so using ARRAY_SIZE().
 avoided duplication of RTL_W32() call code as advised by Heiner.

Mirsad Goran Todorovac (5):
  r8169: Coalesce r8169_mac_ocp_write/modify calls to reduce spinlock
    stalls
  r8169: Coalesce RTL8411b PHY power-down recovery calls to reduce
    spinlock stalls
  r8169: Coalesce mac ocp write and modify for 8168H start to reduce
    spinlocks
  r8169: Coalesce mac ocp commands for 8125 and 8125B start to reduce
    spinlock contention
  r8169: Coalesce mac ocp commands for rtl_hw_init_8125 to reduce
    spinlocks

 drivers/net/ethernet/realtek/r8169_main.c | 304 +++++++++++-----------
 1 file changed, 150 insertions(+), 154 deletions(-)

-- 
2.34.1


^ permalink raw reply

* [PATCH net] net: xt_recent: fix (increase) ipv6 literal buffer length
From: Maciej Żenczykowski @ 2023-11-04 21:00 UTC (permalink / raw)
  To: Maciej Żenczykowski, David S . Miller, Pablo Neira Ayuso,
	Florian Westphal
  Cc: Linux Network Development Mailing List,
	Netfilter Development Mailing List, Jan Engelhardt,
	Patrick McHardy

From: Maciej Żenczykowski <zenczykowski@gmail.com>

IPv4 in IPv6 is supported by in6_pton
(this is useful with DNS64/NAT64 networks for example):

  # echo +aaaa:bbbb:cccc:dddd:eeee:ffff:1.2.3.4 > /proc/self/net/xt_recent/DEFAULT
  # cat /proc/self/net/xt_recent/DEFAULT
  src=aaaa:bbbb:cccc:dddd:eeee:ffff:0102:0304 ttl: 0 last_seen: 9733848829 oldest_pkt: 1 9733848829

but the provided buffer is too short:

  # echo +aaaa:bbbb:cccc:dddd:eeee:ffff:255.255.255.255 > /proc/self/net/xt_recent/DEFAULT
  -bash: echo: write error: Invalid argument

Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Patrick McHardy <kaber@trash.net>
Fixes: 079aa88fe717 ("netfilter: xt_recent: IPv6 support")
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
---
 net/netfilter/xt_recent.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 7ddb9a78e3fc..ef93e0d3bee0 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -561,7 +561,7 @@ recent_mt_proc_write(struct file *file, const char __user *input,
 {
 	struct recent_table *t = pde_data(file_inode(file));
 	struct recent_entry *e;
-	char buf[sizeof("+b335:1d35:1e55:dead:c0de:1715:5afe:c0de")];
+	char buf[sizeof("+b335:1d35:1e55:dead:c0de:1715:255.255.255.255")];
 	const char *c = buf;
 	union nf_inet_addr addr = {};
 	u_int16_t family;
-- 
2.42.0.869.gea05f2083d-goog


^ permalink raw reply related

* Re: [PATCH v4 3/3] NFSD: convert write_ports to netlink command
From: kernel test robot @ 2023-11-04 20:56 UTC (permalink / raw)
  To: Lorenzo Bianconi, linux-nfs
  Cc: oe-kbuild-all, lorenzo.bianconi, neilb, chuck.lever, netdev,
	jlayton, kuba
In-Reply-To: <153b94db12b5c8fff270706673afffad5d84938c.1699095665.git.lorenzo@kernel.org>

Hi Lorenzo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on next-20231103]
[cannot apply to trondmy-nfs/linux-next v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Bianconi/NFSD-convert-write_threads-to-netlink-command/20231104-202515
base:   linus/master
patch link:    https://lore.kernel.org/r/153b94db12b5c8fff270706673afffad5d84938c.1699095665.git.lorenzo%40kernel.org
patch subject: [PATCH v4 3/3] NFSD: convert write_ports to netlink command
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20231105/202311050409.dPLvgiwN-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231105/202311050409.dPLvgiwN-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311050409.dPLvgiwN-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/nfsd/nfsctl.c: In function 'nfsd_nl_listener_start_doit':
>> fs/nfsd/nfsctl.c:1877:13: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
    1877 |         int ret;
         |             ^~~
--
   fs/nfsd/nfsctl.c:1819: warning: expecting prototype for nfsd_nl_version_get_doit(). Prototype was for nfsd_nl_version_get_dumpit() instead
>> fs/nfsd/nfsctl.c:1901: warning: expecting prototype for nfsd_nl_version_get_dumpit(). Prototype was for nfsd_nl_listener_get_dumpit() instead


vim +/ret +1877 fs/nfsd/nfsctl.c

  1867	
  1868	/**
  1869	 * nfsd_nl_listener_start_doit - start the provided nfs server listener
  1870	 * @skb: reply buffer
  1871	 * @info: netlink metadata and command arguments
  1872	 *
  1873	 * Return 0 on success or a negative errno.
  1874	 */
  1875	int nfsd_nl_listener_start_doit(struct sk_buff *skb, struct genl_info *info)
  1876	{
> 1877		int ret;
  1878	
  1879		if (GENL_REQ_ATTR_CHECK(info, NFSD_A_SERVER_LISTENER_TRANSPORT_NAME) ||
  1880		    GENL_REQ_ATTR_CHECK(info, NFSD_A_SERVER_LISTENER_PORT))
  1881			return -EINVAL;
  1882	
  1883		mutex_lock(&nfsd_mutex);
  1884		ret = ___write_ports_addxprt(genl_info_net(info), get_current_cred(),
  1885				nla_data(info->attrs[NFSD_A_SERVER_LISTENER_TRANSPORT_NAME]),
  1886				nla_get_u32(info->attrs[NFSD_A_SERVER_LISTENER_PORT]));
  1887		mutex_unlock(&nfsd_mutex);
  1888	
  1889		return 0;
  1890	}
  1891	
  1892	/**
  1893	 * nfsd_nl_version_get_dumpit - Handle listener_get dumpit
  1894	 * @skb: reply buffer
  1895	 * @cb: netlink metadata and command arguments
  1896	 *
  1897	 * Returns the size of the reply or a negative errno.
  1898	 */
  1899	int nfsd_nl_listener_get_dumpit(struct sk_buff *skb,
  1900					struct netlink_callback *cb)
> 1901	{
  1902		struct nfsd_net *nn = net_generic(sock_net(skb->sk), nfsd_net_id);
  1903		int i = 0, ret = -ENOMEM;
  1904		struct svc_xprt *xprt;
  1905		struct svc_serv *serv;
  1906	
  1907		mutex_lock(&nfsd_mutex);
  1908	
  1909		serv = nn->nfsd_serv;
  1910		if (!serv) {
  1911			mutex_unlock(&nfsd_mutex);
  1912			return 0;
  1913		}
  1914	
  1915		spin_lock_bh(&serv->sv_lock);
  1916		list_for_each_entry(xprt, &serv->sv_permsocks, xpt_list) {
  1917			void *hdr;
  1918	
  1919			if (i < cb->args[0]) /* already consumed */
  1920				continue;
  1921	
  1922			hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
  1923					  cb->nlh->nlmsg_seq, &nfsd_nl_family,
  1924					  0, NFSD_CMD_LISTENER_GET);
  1925			if (!hdr)
  1926				goto out;
  1927	
  1928			if (nla_put_string(skb, NFSD_A_SERVER_LISTENER_TRANSPORT_NAME,
  1929					   xprt->xpt_class->xcl_name))
  1930				goto out;
  1931	
  1932			if (nla_put_u32(skb, NFSD_A_SERVER_LISTENER_PORT,
  1933					svc_xprt_local_port(xprt)))
  1934				goto out;
  1935	
  1936			genlmsg_end(skb, hdr);
  1937			i++;
  1938		}
  1939		cb->args[0] = i;
  1940		ret = skb->len;
  1941	out:
  1942		spin_unlock_bh(&serv->sv_lock);
  1943	
  1944		mutex_unlock(&nfsd_mutex);
  1945	
  1946		return ret;
  1947	}
  1948	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH net-next v10 09/13] net:ethernet:realtek:rtase: Implement pci_driver suspend and resume function
From: Andrew Lunn @ 2023-11-04 19:52 UTC (permalink / raw)
  To: Justin Lai
  Cc: kuba, davem, edumazet, pabeni, linux-kernel, netdev, pkshih,
	larry.chiu
In-Reply-To: <20231102154505.940783-10-justinlai0215@realtek.com>

> +static int rtase_resume(struct pci_dev *pdev)
> +{
> +	struct net_device *dev = pci_get_drvdata(pdev);
> +	struct rtase_private *tp = netdev_priv(dev);
> +	int ret;
> +
> +	pci_set_power_state(pdev, PCI_D0);
> +	pci_restore_state(pdev);
> +	pci_enable_wake(pdev, PCI_D0, 0);
> +
> +	/* restore last modified mac address */
> +	rtase_rar_set(tp, dev->dev_addr);
> +
> +	if (!netif_running(dev))
> +		goto out;
> +
> +	rtase_wait_for_quiescence(dev);
> +	netif_device_attach(dev);
> +
> +	rtase_tx_clear(tp);
> +	rtase_rx_clear(tp);
> +
> +	ret = rtase_init_ring(dev);
> +	if (ret)
> +		netdev_alert(dev, "unable to init ring\n");

If you fail to init the ring, is it safe to keep going?

	Andrew

^ permalink raw reply

* Re: [PATCH net-next v10 10/13] net:ethernet:realtek:rtase: Implement ethtool function
From: Andrew Lunn @ 2023-11-04 19:48 UTC (permalink / raw)
  To: Justin Lai
  Cc: kuba, davem, edumazet, pabeni, linux-kernel, netdev, pkshih,
	larry.chiu
In-Reply-To: <20231102154505.940783-11-justinlai0215@realtek.com>

> +static int rtase_get_settings(struct net_device *dev,
> +			      struct ethtool_link_ksettings *cmd)
> +{
> +	u32 supported = SUPPORTED_MII | SUPPORTED_Pause;
> +
> +	ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
> +						supported);
> +	cmd->base.speed = SPEED_5000;
> +	cmd->base.duplex = DUPLEX_FULL;
> +	cmd->base.port = PORT_MII;
> +	cmd->base.autoneg = AUTONEG_DISABLE;
> +
> +	return 0;
> +}
> +

> +static int rtase_set_pauseparam(struct net_device *dev,
> +				struct ethtool_pauseparam *pause)
> +{
> +	const struct rtase_private *tp = netdev_priv(dev);
> +	u16 value = rtase_r16(tp, RTASE_CPLUS_CMD);
> +
> +	if (pause->autoneg)
> +		return -EOPNOTSUPP;
> +
> +	value &= ~(FORCE_TXFLOW_EN | FORCE_RXFLOW_EN);
> +
> +	if (pause->tx_pause)
> +		value |= FORCE_TXFLOW_EN;
> +
> +	if (pause->rx_pause)
> +		value |= FORCE_RXFLOW_EN;

It appears the hardware supports asymmetric pause? So i think your
rtase_get_settings() is wrong.

	Andrew

^ permalink raw reply

* Re: [PATCH] leds: triggers: netdev: add a check, whether device is up
From: Klaus Kudielka @ 2023-11-04 19:46 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn
  Cc: David S . Miller, Jakub Kicinski, netdev
In-Reply-To: <ZUaCiGVPwcuTtjYW@shell.armlinux.org.uk>

On Sat, 2023-11-04 at 17:42 +0000, Russell King (Oracle) wrote:
> On Sat, Nov 04, 2023 at 05:46:44PM +0100, Andrew Lunn wrote:
> > On Sat, Nov 04, 2023 at 05:32:19PM +0100, Klaus Kudielka wrote:
> > > On Sat, 2023-11-04 at 16:27 +0100, Klaus Kudielka wrote:
> > > > 
> > > > phylink_start() is the first one that does netif_carrier_off() and thus
> > > > sets the NOCARRIER bit, but that only happens when bringing the device up.
> > > > 
> > > > Before that, I would not know who cares about setting the NOCARRIER bit.
> > > 
> > > A different, driver-specific solution could be like this (tested and working):
> > > 
> > > --- a/drivers/net/ethernet/marvell/mvneta.c
> > > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > > @@ -5690,6 +5690,7 @@ static int mvneta_probe(struct platform_device *pdev)
> > >         /* 9676 == 9700 - 20 and rounding to 8 */
> > >         dev->max_mtu = 9676;
> > >  
> > > +       netif_carrier_off(dev);
> > >         err = register_netdev(dev);
> > >         if (err < 0) {
> > >                 dev_err(&pdev->dev, "failed to register\n");
> > > 
> > > 
> > > Would that be the "correct" approach?
> > 
> > Crossing emails.
> > 
> > Its a better approach. But it fixes just one driver. If we can do this
> > in phylink_create(), we fix it in a lot of drivers with a single
> > change...
> 
> ... and I think we should.
> 

So, the patch below also results in correct behaviour of the netdev trigger
with eth2 down, but I'm not so sure whether it really covers all desired
cases. Any advice?

--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -1616,6 +1616,7 @@ struct phylink *phylink_create(struct phylink_config *config,
        pl->config = config;
        if (config->type == PHYLINK_NETDEV) {
                pl->netdev = to_net_dev(config->dev);
+               netif_carrier_off(pl->netdev);
        } else if (config->type == PHYLINK_DEV) {
                pl->dev = config->dev;
        } else {


Best regards, Klaus


^ permalink raw reply

* Re: [PATCH] net: phy: at803x: add QCA8084 ethernet phy support
From: Russell King (Oracle) @ 2023-11-04 19:35 UTC (permalink / raw)
  To: Luo Jie
  Cc: andrew, hkallweit1, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <20231103123538.15735-1-quic_luoj@quicinc.com>

On Fri, Nov 03, 2023 at 08:35:37PM +0800, Luo Jie wrote:
> Add qca8084 PHY support, which is four-port PHY with maximum
> link capability 2.5G, the features of each port is almost same
> as QCA8081 and slave seed config is not needed.
> 
> There are some initialization configurations needed.
> 1. Configuring qca8084 related initializations including
> MSE detect threshold and ADC clock edge invert.
> 2. Add the additional configurations for the CDT feature.
> 
> Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
> ---
>  drivers/net/phy/at803x.c | 40 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index 37fb033e1c29..4124eb76d835 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -176,6 +176,8 @@
>  #define AT8030_PHY_ID_MASK			0xffffffef
>  
>  #define QCA8081_PHY_ID				0x004dd101
> +#define QCA8081_PHY_MASK			0xffffff00
> +#define QCA8084_PHY_ID				0x004dd180
...
> @@ -2207,8 +2240,9 @@ static struct phy_driver at803x_driver[] = {
>  	.resume			= qca83xx_resume,
>  }, {
>  	/* Qualcomm QCA8081 */
> -	PHY_ID_MATCH_EXACT(QCA8081_PHY_ID),
> -	.name			= "Qualcomm QCA8081",
> +	.phy_id			= QCA8081_PHY_ID,
> +	.phy_id_mask		= QCA8081_PHY_MASK,
> +	.name			= "Qualcomm QCA808X",
...
> @@ -2241,7 +2275,7 @@ static struct mdio_device_id __maybe_unused atheros_tbl[] = {
>  	{ PHY_ID_MATCH_EXACT(QCA8327_A_PHY_ID) },
>  	{ PHY_ID_MATCH_EXACT(QCA8327_B_PHY_ID) },
>  	{ PHY_ID_MATCH_EXACT(QCA9561_PHY_ID) },
> -	{ PHY_ID_MATCH_EXACT(QCA8081_PHY_ID) },
> +	{ QCA8081_PHY_ID, QCA8081_PHY_MASK},

So, in summary from the above, what you're doing is using the pair of
QCA8081_PHY_ID, QCA8081_PHY_MASK to match not only QCA8081 but also
QCA8084. This is confusing.

Are there any other parts that QCA808X would correspond with which
would not be compatible with the above? E.g. QCA8082, QCA8083, QCA8088
etc.

If there are, then the correct approach would be to list them
separately in atheros_tbl, and also have separate driver entries in
at803x_driver so it's unambiguous.

If we keep this approach, then I would suggest:

#define QCA808X_PHY_ID		0x004dd100
#define QCA808X_PHY_MASK	GENMASK(31, 8)

to make it explicit that this phy ID/mask pair is matching several
devices, rather than re-using the QCA8081_PHY_ID definition.


The next point - what about the revision field which occupies bits 3:0
in these:

>  static bool qca808x_has_fast_retrain_or_slave_seed(struct phy_device *phydev)
>  {
> +	if (phydev->phy_id == QCA8084_PHY_ID)
> +		return false;
> +
...
> @@ -1767,6 +1781,20 @@ static int qca808x_config_init(struct phy_device *phydev)
>  {
>  	int ret;
>  
> +	if (phydev->phy_id == QCA8084_PHY_ID) {
> +		/* Invert ADC clock edge */
...
> @@ -1958,6 +1986,11 @@ static int qca808x_cable_test_start(struct phy_device *phydev)
>  	phy_write_mmd(phydev, MDIO_MMD_PCS, 0x807a, 0xc060);
>  	phy_write_mmd(phydev, MDIO_MMD_PCS, 0x807e, 0xb060);
>  
> +	if (phydev->phy_id == QCA8084_PHY_ID) {

Do these need to be exact matches, or should the revision field be
ignored? If so, consider using phy_id_compare(), or if you end up with
separate driver entries, consider using phydev_id_compare().

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [RFCv2 bpf-next 1/7] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
From: Daniel Xu @ 2023-11-04 18:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jakub Kicinski, Jesper Dangaard Brouer, Eric Dumazet,
	Steffen Klassert, Daniel Borkmann, Herbert Xu, Alexei Starovoitov,
	John Fastabend, Paolo Abeni, David S. Miller, antony.antony, LKML,
	Network Development, bpf, devel
In-Reply-To: <CAADnVQJu27HZGaTH5046Smwjpn-ttVCRR7f_0B12es_juZiN5w@mail.gmail.com>

On Wed, Nov 01, 2023 at 06:27:03PM -0700, Alexei Starovoitov wrote:
> On Wed, Nov 1, 2023 at 2:58 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > This commit adds an unstable kfunc helper to access internal xfrm_state
> > associated with an SA. This is intended to be used for the upcoming
> > IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
> > words: for custom software RSS.
> >
> > That being said, the function that this kfunc wraps is fairly generic
> > and used for a lot of xfrm tasks. I'm sure people will find uses
> > elsewhere over time.
> >
> > Co-developed-by: Antony Antony <antony.antony@secunet.com>
> > Signed-off-by: Antony Antony <antony.antony@secunet.com>
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >  include/net/xfrm.h        |   9 ++++
> >  net/xfrm/Makefile         |   1 +
> >  net/xfrm/xfrm_policy.c    |   2 +
> >  net/xfrm/xfrm_state_bpf.c | 105 ++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 117 insertions(+)
> >  create mode 100644 net/xfrm/xfrm_state_bpf.c
> >
> > diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> > index c9bb0f892f55..1d107241b901 100644
> > --- a/include/net/xfrm.h
> > +++ b/include/net/xfrm.h
> > @@ -2190,4 +2190,13 @@ static inline int register_xfrm_interface_bpf(void)
> >
> >  #endif
> >
> > +#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
> > +int register_xfrm_state_bpf(void);
> > +#else
> > +static inline int register_xfrm_state_bpf(void)
> > +{
> > +       return 0;
> > +}
> > +#endif
> > +
> >  #endif /* _NET_XFRM_H */
> > diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> > index cd47f88921f5..547cec77ba03 100644
> > --- a/net/xfrm/Makefile
> > +++ b/net/xfrm/Makefile
> > @@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
> >  obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
> >  obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
> >  obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
> > +obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index c13dc3ef7910..1b7e75159727 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -4218,6 +4218,8 @@ void __init xfrm_init(void)
> >  #ifdef CONFIG_XFRM_ESPINTCP
> >         espintcp_init();
> >  #endif
> > +
> > +       register_xfrm_state_bpf();
> >  }
> >
> >  #ifdef CONFIG_AUDITSYSCALL
> > diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
> > new file mode 100644
> > index 000000000000..4aaac134b97a
> > --- /dev/null
> > +++ b/net/xfrm/xfrm_state_bpf.c
> 
> since net/xfrm/xfrm_interface_bpf.c is already there and
> was meant to be use as a file for interface between xfrm and bpf
> may be add new kfuncs there instead of new file?

IIUC xfrm_interface_bpf.c is for "xfrm interfaces" [0]. See
bpf_xfrm_info:if_id.

I could be wrong. But if you or Steffen want it all in one file I don't
really mind either way.

[0]: https://docs.strongswan.org/docs/5.9/features/routeBasedVpn.html#_xfrm_interfaces_on_linux

> 
> 
> > @@ -0,0 +1,105 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Unstable XFRM state BPF helpers.
> > + *
> > + * Note that it is allowed to break compatibility for these functions since the
> > + * interface they are exposed through to BPF programs is explicitly unstable.
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <net/xdp.h>
> > +#include <net/xfrm.h>
> > +
> > +/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
> > + *
> > + * Members:
> > + * @error      - Out parameter, set for any errors encountered
> > + *              Values:
> > + *                -EINVAL - netns_id is less than -1
> > + *                -EINVAL - Passed NULL for opts
> > + *                -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
> > + *                -ENONET - No network namespace found for netns_id
> > + * @netns_id   - Specify the network namespace for lookup
> > + *              Values:
> > + *                BPF_F_CURRENT_NETNS (-1)
> > + *                  Use namespace associated with ctx
> > + *                [0, S32_MAX]
> > + *                  Network Namespace ID
> > + * @mark       - XFRM mark to match on
> > + * @daddr      - Destination address to match on
> > + * @spi                - Security parameter index to match on
> > + * @proto      - L3 protocol to match on
> > + * @family     - L3 protocol family to match on
> > + */
> > +struct bpf_xfrm_state_opts {
> > +       s32 error;
> > +       s32 netns_id;
> > +       u32 mark;
> > +       xfrm_address_t daddr;
> > +       __be32 spi;
> > +       u8 proto;
> > +       u16 family;
> > +};
> > +
> > +enum {
> > +       BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
> > +};
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +                 "Global functions as their definitions will be in xfrm_state BTF");
> > +
> > +/* bpf_xdp_get_xfrm_state - Get XFRM state
> > + *
> > + * Parameters:
> > + * @ctx        - Pointer to ctx (xdp_md) in XDP program
> > + *                 Cannot be NULL
> > + * @opts       - Options for lookup (documented above)
> > + *                 Cannot be NULL
> > + * @opts__sz   - Length of the bpf_xfrm_state_opts structure
> > + *                 Must be BPF_XFRM_STATE_OPTS_SZ
> > + */
> > +__bpf_kfunc struct xfrm_state *
> > +bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
> > +{
> > +       struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> > +       struct net *net = dev_net(xdp->rxq->dev);
> > +
> > +       if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
> > +               opts->error = -EINVAL;
> > +               return NULL;
> > +       }
> > +
> > +       if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
> > +               opts->error = -EINVAL;
> > +               return NULL;
> > +       }
> > +
> > +       if (opts->netns_id >= 0) {
> > +               net = get_net_ns_by_id(net, opts->netns_id);
> 
> netns is leaking :(

Argh, will fix.

> 
> > +               if (unlikely(!net)) {
> > +                       opts->error = -ENONET;
> > +                       return NULL;
> > +               }
> > +       }
> > +
> > +       return xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
> > +                                opts->proto, opts->family);
> 
> After looking into xfrm internals realized that
> refcnt inc/dec and KF_ACQUIRE maybe unnecessary overhead.
> XDP progs run under rcu_read_lock.
> I think you can make a version of __xfrm_state_lookup()
> without xfrm_state_hold_rcu() and avoid two atomics per packet,
> but such xfrm_state may have refcnt==0.
> Since bpf prog will only read from there and won't pass it anywhere
> else it might be ok.
> 
> But considering the rest of ipsec overhead this might be
> a premature optimization and it's better to stay with clean
> acquire/release semantics.

Yeah I think acquire/release will be fine for now. I've been doing a lot
of profiling on tput tests and xdp barely shows up. There are some lower
hanging fruit atm.

> As far as IETF:
> https://datatracker.ietf.org/doc/html/draft-ietf-ipsecme-multi-sa-performance-02
> it's not clear to me why one Child SA (without new pcpu field)
> has to be handled by one cpu.
> 
> Sounds like it's possible to implement differently. At least in SW.
> In HW, I can see how duplicating multiple crypto state and the rest
> in a single queue is difficult.

The child SA without pcpu flag (aka fallback SA) is not pinned to a cpu.
It's shared between all of the cpus. I think Steffen can probably
explain this better if that's not clear.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] net: ti: icssg-prueth: Add missing icss_iep_put to error path
From: Simon Horman @ 2023-11-04 18:43 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	MD Danish Anwar, netdev, linux-kernel,
	Lopes Ivo, Diogo Miguel (T CED IFD-PT), Nishanth Menon,
	Su, Bao Cheng (RC-CN DF FA R&D)
In-Reply-To: <0b21ba4e-5c47-4625-a9ec-e45e54ba9229@siemens.com>

On Thu, Nov 02, 2023 at 05:03:30PM +0100, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Analogously to prueth_remove.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Hi Jan,

I am a little unclear if this patch addresses a user-visible bug, or
is adding a new feature.


If it is fixing a bug then it should be targeted at the net tree.
It should apply cleanly there, and the tree should be noted in the subject.

  Subject: [PATCH net] ...

Also, if it is a bug fix, it should have a fixes tag, indicating the
revision(s) where the problem was introduced. This to assist in backporting
fixes. In this case perhaps the following is appropriate:


On the other hand, if this is a new feature, then it should be targeted
at net-next:

  Subject: [PATCH net-next] ...

And in that case the following applies.


In either case, I think it would be good to expand the commit message.
It should explain why this change is being made.


## Form letter - net-next-closed

The merge window for v6.7 has begun and therefore net-next is closed
for new drivers, features, code refactoring and optimizations.
We are currently accepting bug fixes only.

Please repost when net-next reopens after November 12th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle

-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH net-next v10 12/13] net:ethernet:realtek: Update the Makefile and Kconfig in the realtek folder
From: Simon Horman @ 2023-11-04 18:39 UTC (permalink / raw)
  To: Justin Lai
  Cc: kuba, davem, edumazet, pabeni, linux-kernel, netdev, andrew,
	pkshih, larry.chiu
In-Reply-To: <20231102154505.940783-13-justinlai0215@realtek.com>

On Thu, Nov 02, 2023 at 11:45:04PM +0800, Justin Lai wrote:
> 1. Add the RTASE entry in the Kconfig.
> 2. Add the CONFIG_RTASE entry in the Makefile.
> 
> Signed-off-by: Justin Lai <justinlai0215@realtek.com>
> ---
>  drivers/net/ethernet/realtek/Kconfig  | 17 +++++++++++++++++
>  drivers/net/ethernet/realtek/Makefile |  1 +
>  2 files changed, 18 insertions(+)
> 
> diff --git a/drivers/net/ethernet/realtek/Kconfig b/drivers/net/ethernet/realtek/Kconfig
> index 93d9df55b361..57ef924deebd 100644
> --- a/drivers/net/ethernet/realtek/Kconfig
> +++ b/drivers/net/ethernet/realtek/Kconfig
> @@ -113,4 +113,21 @@ config R8169
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called r8169.  This is recommended.
>  
> +config RTASE
> +	tristate "Realtek Automotive Switch 9054/9068/9072/9075/9068/9071 PCIe Interface support"
> +	depends on PCI
> +	select CRC32
> +	help
> +	  Say Y here if you have a Realtek Ethernet adapter belonging to
> +	  the following families:
> +	  RTL9054 5GBit Ethernet
> +	  RTL9068 5GBit Ethernet
> +	  RTL9072 5GBit Ethernet
> +	  RTL9075 5GBit Ethernet
> +	  RTL9068 5GBit Ethernet
> +	  RTL9071 5GBit Ethernet
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called rtase. This is recommended.
> +
>  endif # NET_VENDOR_REALTEK
> diff --git a/drivers/net/ethernet/realtek/Makefile b/drivers/net/ethernet/realtek/Makefile
> index 2e1d78b106b0..0c1c16f63e9a 100644
> --- a/drivers/net/ethernet/realtek/Makefile
> +++ b/drivers/net/ethernet/realtek/Makefile
> @@ -8,3 +8,4 @@ obj-$(CONFIG_8139TOO) += 8139too.o
>  obj-$(CONFIG_ATP) += atp.o
>  r8169-objs += r8169_main.o r8169_firmware.o r8169_phy_config.o
>  obj-$(CONFIG_R8169) += r8169.o
> +obj-$(CONFIG_RTASE) += rtase/

An allmodconfig on x86_64 fails to build with the series applied up to this
point.

../drivers/net/ethernet/realtek/rtase/rtase_main.c:68:10: fatal error: net/page_pool.h: No such file or directory    68 | #include <net/page_pool.h>

Please also note that, net-next closed, as per the following:

## Form letter - net-next-closed

The merge window for v6.7 has begun and therefore net-next is closed
for new drivers, features, code refactoring and optimizations.
We are currently accepting bug fixes only.

Please repost when net-next reopens after November 12th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
-- 
pw-bot: cr

^ permalink raw reply

* [PATCH iwl-net] ice: fix DDP package download for packages without signature segment
From: Paul Greenwalt @ 2023-11-04 18:29 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, jesse.brandeburg, anthony.l.nguyen, davem, kuba, horms,
	tony.brelinski, Dan Nowlin, Fijalkowski, Maciej, Paul Greenwalt

From: Dan Nowlin <dan.nowlin@intel.com>

Commit 3cbdb0343022 ("ice: Add support for E830 DDP package segment")
incorrectly removed support for package download for packages without a
signature segment. These packages include the signature buffer inline
in the configurations buffers, and do not in a signature segment.

Fix package download by providing download support for both packages
with (ice_download_pkg_with_sig_seg()) and without signature segment
(ice_download_pkg_without_sig_seg()).

Fixes: 3cbdb0343022 ("ice: Add support for E830 DDP package segment")
Reported-by: Fijalkowski, Maciej <maciej.fijalkowski@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ddp.c | 106 ++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ddp.c b/drivers/net/ethernet/intel/ice/ice_ddp.c
index cfb1580f5850..3f1a11d0252c 100644
--- a/drivers/net/ethernet/intel/ice/ice_ddp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ddp.c
@@ -1479,14 +1479,14 @@ ice_post_dwnld_pkg_actions(struct ice_hw *hw)
 }
 
 /**
- * ice_download_pkg
+ * ice_download_pkg_with_sig_seg
  * @hw: pointer to the hardware structure
  * @pkg_hdr: pointer to package header
  *
  * Handles the download of a complete package.
  */
 static enum ice_ddp_state
-ice_download_pkg(struct ice_hw *hw, struct ice_pkg_hdr *pkg_hdr)
+ice_download_pkg_with_sig_seg(struct ice_hw *hw, struct ice_pkg_hdr *pkg_hdr)
 {
 	enum ice_aq_err aq_err = hw->adminq.sq_last_status;
 	enum ice_ddp_state state = ICE_DDP_PKG_ERR;
@@ -1519,6 +1519,106 @@ ice_download_pkg(struct ice_hw *hw, struct ice_pkg_hdr *pkg_hdr)
 		state = ice_post_dwnld_pkg_actions(hw);
 
 	ice_release_global_cfg_lock(hw);
+
+	return state;
+}
+
+/**
+ * ice_dwnld_cfg_bufs
+ * @hw: pointer to the hardware structure
+ * @bufs: pointer to an array of buffers
+ * @count: the number of buffers in the array
+ *
+ * Obtains global config lock and downloads the package configuration buffers
+ * to the firmware.
+ */
+static enum ice_ddp_state
+ice_dwnld_cfg_bufs(struct ice_hw *hw, struct ice_buf *bufs, u32 count)
+{
+	enum ice_ddp_state state = ICE_DDP_PKG_SUCCESS;
+	struct ice_buf_hdr *bh;
+	int status;
+
+	if (!bufs || !count)
+		return ICE_DDP_PKG_ERR;
+
+	/* If the first buffer's first section has its metadata bit set
+	 * then there are no buffers to be downloaded, and the operation is
+	 * considered a success.
+	 */
+	bh = (struct ice_buf_hdr *)bufs;
+	if (le32_to_cpu(bh->section_entry[0].type) & ICE_METADATA_BUF)
+		return ICE_DDP_PKG_SUCCESS;
+
+	status = ice_acquire_global_cfg_lock(hw, ICE_RES_WRITE);
+	if (status) {
+		if (status == -EALREADY)
+			return ICE_DDP_PKG_ALREADY_LOADED;
+		return ice_map_aq_err_to_ddp_state(hw->adminq.sq_last_status);
+	}
+
+	state = ice_dwnld_cfg_bufs_no_lock(hw, bufs, 0, count, true);
+	if (!state)
+		state = ice_post_dwnld_pkg_actions(hw);
+
+	ice_release_global_cfg_lock(hw);
+
+	return state;
+}
+
+/**
+ * ice_download_pkg_without_sig_seg
+ * @hw: pointer to the hardware structure
+ * @ice_seg: pointer to the segment of the package to be downloaded
+ *
+ * Handles the download of a complete package without signature segment.
+ */
+static enum ice_ddp_state
+ice_download_pkg_without_sig_seg(struct ice_hw *hw, struct ice_seg *ice_seg)
+{
+	struct ice_buf_table *ice_buf_tbl;
+	enum ice_ddp_state state;
+
+	ice_debug(hw, ICE_DBG_PKG, "Segment format version: %d.%d.%d.%d\n",
+		  ice_seg->hdr.seg_format_ver.major,
+		  ice_seg->hdr.seg_format_ver.minor,
+		  ice_seg->hdr.seg_format_ver.update,
+		  ice_seg->hdr.seg_format_ver.draft);
+
+	ice_debug(hw, ICE_DBG_PKG, "Seg: type 0x%X, size %d, name %s\n",
+		  le32_to_cpu(ice_seg->hdr.seg_type),
+		  le32_to_cpu(ice_seg->hdr.seg_size), ice_seg->hdr.seg_id);
+
+	ice_buf_tbl = ice_find_buf_table(ice_seg);
+
+	ice_debug(hw, ICE_DBG_PKG, "Seg buf count: %d\n",
+		  le32_to_cpu(ice_buf_tbl->buf_count));
+
+	state = ice_dwnld_cfg_bufs(hw, ice_buf_tbl->buf_array,
+				   le32_to_cpu(ice_buf_tbl->buf_count));
+
+	return state;
+}
+
+/**
+ * ice_download_pkg
+ * @hw: pointer to the hardware structure
+ * @pkg_hdr: pointer to package header
+ * @ice_seg: pointer to the segment of the package to be downloaded
+ *
+ * Handles the download of a complete package.
+ */
+static enum ice_ddp_state
+ice_download_pkg(struct ice_hw *hw, struct ice_pkg_hdr *pkg_hdr,
+		 struct ice_seg *ice_seg)
+{
+	enum ice_ddp_state state;
+
+	if (hw->pkg_has_signing_seg)
+		state = ice_download_pkg_with_sig_seg(hw, pkg_hdr);
+	else
+		state = ice_download_pkg_without_sig_seg(hw, ice_seg);
+
 	ice_post_pkg_dwnld_vlan_mode_cfg(hw);
 
 	return state;
@@ -2083,7 +2183,7 @@ enum ice_ddp_state ice_init_pkg(struct ice_hw *hw, u8 *buf, u32 len)
 
 	/* initialize package hints and then download package */
 	ice_init_pkg_hints(hw, seg);
-	state = ice_download_pkg(hw, pkg);
+	state = ice_download_pkg(hw, pkg, seg);
 	if (state == ICE_DDP_PKG_ALREADY_LOADED) {
 		ice_debug(hw, ICE_DBG_INIT,
 			  "package previously loaded - no work.\n");

base-commit: 016b9332a3346e97a6cacffea0f9dc10e1235a75
-- 
2.41.0


^ permalink raw reply related

* Re: [PATCH net v2] virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt()
From: David Wei @ 2023-11-04 18:28 UTC (permalink / raw)
  To: Shigeru Yoshida, stefanha, sgarzare, davem, edumazet, kuba,
	pabeni
  Cc: kvm, virtualization, netdev, linux-kernel,
	syzbot+0c8ce1da0ac31abbadcd
In-Reply-To: <20231104150531.257952-1-syoshida@redhat.com>

On 2023-11-04 08:05, Shigeru Yoshida wrote:
> KMSAN reported the following uninit-value access issue:
> 
> =====================================================
> BUG: KMSAN: uninit-value in virtio_transport_recv_pkt+0x1dfb/0x26a0 net/vmw_vsock/virtio_transport_common.c:1421
>  virtio_transport_recv_pkt+0x1dfb/0x26a0 net/vmw_vsock/virtio_transport_common.c:1421
>  vsock_loopback_work+0x3bb/0x5a0 net/vmw_vsock/vsock_loopback.c:120
>  process_one_work kernel/workqueue.c:2630 [inline]
>  process_scheduled_works+0xff6/0x1e60 kernel/workqueue.c:2703
>  worker_thread+0xeca/0x14d0 kernel/workqueue.c:2784
>  kthread+0x3cc/0x520 kernel/kthread.c:388
>  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
>  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
> 
> Uninit was stored to memory at:
>  virtio_transport_space_update net/vmw_vsock/virtio_transport_common.c:1274 [inline]
>  virtio_transport_recv_pkt+0x1ee8/0x26a0 net/vmw_vsock/virtio_transport_common.c:1415
>  vsock_loopback_work+0x3bb/0x5a0 net/vmw_vsock/vsock_loopback.c:120
>  process_one_work kernel/workqueue.c:2630 [inline]
>  process_scheduled_works+0xff6/0x1e60 kernel/workqueue.c:2703
>  worker_thread+0xeca/0x14d0 kernel/workqueue.c:2784
>  kthread+0x3cc/0x520 kernel/kthread.c:388
>  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
>  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
> 
> Uninit was created at:
>  slab_post_alloc_hook+0x105/0xad0 mm/slab.h:767
>  slab_alloc_node mm/slub.c:3478 [inline]
>  kmem_cache_alloc_node+0x5a2/0xaf0 mm/slub.c:3523
>  kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:559
>  __alloc_skb+0x2fd/0x770 net/core/skbuff.c:650
>  alloc_skb include/linux/skbuff.h:1286 [inline]
>  virtio_vsock_alloc_skb include/linux/virtio_vsock.h:66 [inline]
>  virtio_transport_alloc_skb+0x90/0x11e0 net/vmw_vsock/virtio_transport_common.c:58
>  virtio_transport_reset_no_sock net/vmw_vsock/virtio_transport_common.c:957 [inline]
>  virtio_transport_recv_pkt+0x1279/0x26a0 net/vmw_vsock/virtio_transport_common.c:1387
>  vsock_loopback_work+0x3bb/0x5a0 net/vmw_vsock/vsock_loopback.c:120
>  process_one_work kernel/workqueue.c:2630 [inline]
>  process_scheduled_works+0xff6/0x1e60 kernel/workqueue.c:2703
>  worker_thread+0xeca/0x14d0 kernel/workqueue.c:2784
>  kthread+0x3cc/0x520 kernel/kthread.c:388
>  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
>  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
> 
> CPU: 1 PID: 10664 Comm: kworker/1:5 Not tainted 6.6.0-rc3-00146-g9f3ebbef746f #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
> Workqueue: vsock-loopback vsock_loopback_work
> =====================================================
> 
> The following simple reproducer can cause the issue described above:
> 
> int main(void)
> {
>   int sock;
>   struct sockaddr_vm addr = {
>     .svm_family = AF_VSOCK,
>     .svm_cid = VMADDR_CID_ANY,
>     .svm_port = 1234,
>   };
> 
>   sock = socket(AF_VSOCK, SOCK_STREAM, 0);
>   connect(sock, (struct sockaddr *)&addr, sizeof(addr));
>   return 0;
> }
> 
> This issue occurs because the `buf_alloc` and `fwd_cnt` fields of the
> `struct virtio_vsock_hdr` are not initialized when a new skb is allocated
> in `virtio_transport_init_hdr()`. This patch resolves the issue by
> initializing these fields during allocation.
> 
> Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")
> Reported-and-tested-by: syzbot+0c8ce1da0ac31abbadcd@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=0c8ce1da0ac31abbadcd
> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
> ---
> v1->v2:
> - Rebase on the latest net tree
> https://lore.kernel.org/all/20231026150154.3536433-1-syoshida@redhat.com/
> ---
>  net/vmw_vsock/virtio_transport_common.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index e22c81435ef7..dc65dd4d26df 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -130,6 +130,8 @@ static void virtio_transport_init_hdr(struct sk_buff *skb,
>  	hdr->dst_port	= cpu_to_le32(dst_port);
>  	hdr->flags	= cpu_to_le32(info->flags);
>  	hdr->len	= cpu_to_le32(payload_len);
> +	hdr->buf_alloc	= cpu_to_le32(0);
> +	hdr->fwd_cnt	= cpu_to_le32(0);

This change looks good to me.

I did notice that payload_len in virtio_transport_init_hdr is size_t but
len in struct virtio_vsock_hdr is __le32. But I don't think this is an
issue other than wasting some stack space, though.

>  }
>  
>  static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,

^ permalink raw reply

* Re: [PATCH] leds: triggers: netdev: add a check, whether device is up
From: Russell King (Oracle) @ 2023-11-04 17:42 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Klaus Kudielka, David S . Miller, Jakub Kicinski, netdev
In-Reply-To: <53f3e4ff-2afd-4acb-8cd4-55bdd1defd0d@lunn.ch>

On Sat, Nov 04, 2023 at 05:46:44PM +0100, Andrew Lunn wrote:
> On Sat, Nov 04, 2023 at 05:32:19PM +0100, Klaus Kudielka wrote:
> > On Sat, 2023-11-04 at 16:27 +0100, Klaus Kudielka wrote:
> > > 
> > > phylink_start() is the first one that does netif_carrier_off() and thus
> > > sets the NOCARRIER bit, but that only happens when bringing the device up.
> > > 
> > > Before that, I would not know who cares about setting the NOCARRIER bit.
> > 
> > A different, driver-specific solution could be like this (tested and working):
> > 
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -5690,6 +5690,7 @@ static int mvneta_probe(struct platform_device *pdev)
> >         /* 9676 == 9700 - 20 and rounding to 8 */
> >         dev->max_mtu = 9676;
> >  
> > +       netif_carrier_off(dev);
> >         err = register_netdev(dev);
> >         if (err < 0) {
> >                 dev_err(&pdev->dev, "failed to register\n");
> > 
> > 
> > Would that be the "correct" approach?
> 
> Crossing emails.
> 
> Its a better approach. But it fixes just one driver. If we can do this
> in phylink_create(), we fix it in a lot of drivers with a single
> change...

... and I think we should.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH] leds: triggers: netdev: add a check, whether device is up
From: Russell King (Oracle) @ 2023-11-04 17:12 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Klaus Kudielka, David S . Miller, Jakub Kicinski, netdev
In-Reply-To: <196db01b-40ff-44ed-8e45-1b855940417f@lunn.ch>

On Sat, Nov 04, 2023 at 05:41:54PM +0100, Andrew Lunn wrote:
> [Changes the Cc: list. Dropping LED people, adding a few netdev
> people]
> 
> On Sat, Nov 04, 2023 at 04:27:45PM +0100, Klaus Kudielka wrote:
> > After booting, the device is down, but netdev trigger reports "link" active.
> > This looks wrong to me.
> > 
> > I can then "ip link set eth2 up", and the "link" goes away - as I
> > would have expected it to be from the beginning.
> 
> Thanks for the details.
> 
> A brain dump...
> 
> You do see a lot of MAC drivers doing a netif_carrier_off() in there
> probe function. That suggests the carrier is on by default. I doubt we
> can change that, we would break all the drivers which assume the
> carrier is on by default, probably virtual devices and some real
> devices.

Note that one of the things that phylink will do is call
netif_carrier_off() when phylink_start() is called to ensure that the
netdev state matches its internal state.

> Often the MAC and PHY are connected in the open() callback, when using
> phylib. So that is too late.  phylink_create() is however mostly used
> in the probe function. So it could set the carrier to off by default.
> Russell, what do you think?

I was going to ask whether that would be a good idea - since it means
that the carrier is in the right state at probe time as well as after
phylink_start() has been called.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH] leds: triggers: netdev: add a check, whether device is up
From: Andrew Lunn @ 2023-11-04 16:46 UTC (permalink / raw)
  To: Klaus Kudielka; +Cc: David S . Miller, Jakub Kicinski, netdev, Russell King
In-Reply-To: <970325157b7598b6367c293380cace3624e6cb88.camel@gmail.com>

On Sat, Nov 04, 2023 at 05:32:19PM +0100, Klaus Kudielka wrote:
> On Sat, 2023-11-04 at 16:27 +0100, Klaus Kudielka wrote:
> > 
> > phylink_start() is the first one that does netif_carrier_off() and thus
> > sets the NOCARRIER bit, but that only happens when bringing the device up.
> > 
> > Before that, I would not know who cares about setting the NOCARRIER bit.
> 
> A different, driver-specific solution could be like this (tested and working):
> 
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -5690,6 +5690,7 @@ static int mvneta_probe(struct platform_device *pdev)
>         /* 9676 == 9700 - 20 and rounding to 8 */
>         dev->max_mtu = 9676;
>  
> +       netif_carrier_off(dev);
>         err = register_netdev(dev);
>         if (err < 0) {
>                 dev_err(&pdev->dev, "failed to register\n");
> 
> 
> Would that be the "correct" approach?

Crossing emails.

Its a better approach. But it fixes just one driver. If we can do this
in phylink_create(), we fix it in a lot of drivers with a single
change...

	Andrew

^ permalink raw reply

* Re: [PATCH] leds: triggers: netdev: add a check, whether device is up
From: Andrew Lunn @ 2023-11-04 16:41 UTC (permalink / raw)
  To: Klaus Kudielka; +Cc: David S . Miller, Jakub Kicinski, netdev, Russell King
In-Reply-To: <95ff53a1d1b9102c81a05076f40d47242579fc37.camel@gmail.com>

[Changes the Cc: list. Dropping LED people, adding a few netdev
people]

On Sat, Nov 04, 2023 at 04:27:45PM +0100, Klaus Kudielka wrote:
> On Sat, 2023-11-04 at 15:29 +0100, Andrew Lunn wrote:
> > On Sat, Nov 04, 2023 at 01:58:40PM +0100, Klaus Kudielka wrote:
> > > Some net devices do not report NO-CARRIER, if they haven't been brought
> > > up.
> > 
> > Hi Klaus
> > 
> > We should probably fix the driver. What device is it?
> > 
> > > In that case, the netdev trigger results in a wrong link state being
> > > displayed. Fix this, by adding a check, whether the device is up.
> > 
> > Is it wrong? Maybe the carrier really is up, even if the interface is
> > admin down. Broadcast packets are being received by the
> > hardware. Maybe there is a BMC sharing the link and it is active?
> 
> My particular example is Turris Omnia, eth2 (WAN), connected to SFP.
> SFP module is inserted, but no fiber connected, so definitely no carrier. 
> 
> *Without* the patch:
> 
> After booting, the device is down, but netdev trigger reports "link" active.
> This looks wrong to me.
> 
> I can then "ip link set eth2 up", and the "link" goes away - as I
> would have expected it to be from the beginning.

Thanks for the details.

A brain dump...

You do see a lot of MAC drivers doing a netif_carrier_off() in there
probe function. That suggests the carrier is on by default. I doubt we
can change that, we would break all the drivers which assume the
carrier is on by default, probably virtual devices and some real
devices.

Often the MAC and PHY are connected in the open() callback, when using
phylib. So that is too late.  phylink_create() is however mostly used
in the probe function. So it could set the carrier to off by default.
Russell, what do you think?

Turris Omnia uses mvneta. That does not turn the carrier off in its
probe function. So a quick fix would be to add it. However, that then
becomes pointless if we make phylink_create() disable the carrier.

	Andrew

^ permalink raw reply

* Re: [PATCH net-next] netfilter: nf_tables: Remove unused variable nft_net
From: Simon Horman @ 2023-11-04 16:33 UTC (permalink / raw)
  To: Yang Li
  Cc: kuba, edumazet, davem, pabeni, fw, kadlec, pablo, netfilter-devel,
	coreteam, netdev, linux-kernel, Abaci Robot
In-Reply-To: <20231101013351.55902-1-yang.lee@linux.alibaba.com>

On Wed, Nov 01, 2023 at 09:33:51AM +0800, Yang Li wrote:
> The code that uses nft_net has been removed, and the nft_pernet function
> is merely obtaining a reference to shared data through the net pointer.
> The content of the net pointer is not modified or changed, so both of
> them should be removed.
> 
> silence the warning:
> net/netfilter/nft_set_rbtree.c:627:26: warning: variable ‘nft_net’ set but not used
> 
> Reported-by: Abaci Robot <abaci@linux.alibaba.com>
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7103
> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>

I think this is for nf-next, rather than net-next (which is closed).

But as for the change itself, I also noticed this, and am glad
to see it being addressed.

Reviewed-by: Simon Horman <horms@kernel.org>

^ permalink raw reply

* Re: [PATCH net 1/1] net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config
From: Simon Horman @ 2023-11-04 16:12 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: davem, kuba, edumazet, pabeni, jiri, xiyou.wangcong, daniel,
	idosch, netdev
In-Reply-To: <CAM0EoMn195kQXDq9hn=NZoCNHOVyh3qS1kGxN+N3q4qNXj2mVA@mail.gmail.com>

On Sat, Nov 04, 2023 at 11:46:22AM -0400, Jamal Hadi Salim wrote:
> On Sat, Nov 4, 2023 at 9:11 AM Simon Horman <horms@kernel.org> wrote:
> >
> > On Sat, Oct 28, 2023 at 01:16:10PM -0400, Jamal Hadi Salim wrote:
> > > Getting the following splat [1] with CONFIG_DEBUG_NET=y and this
> > > reproducer [2]. Problem seems to be that classifiers clear 'struct
> > > tcf_result::drop_reason', thereby triggering the warning in
> > > __kfree_skb_reason() due to reason being 'SKB_NOT_DROPPED_YET' (0).
> > >
> > > Fixed by disambiguating a legit error from a verdict with a bogus drop_reason
> > >
> > > [1]
> > > WARNING: CPU: 0 PID: 181 at net/core/skbuff.c:1082 kfree_skb_reason+0x38/0x130
> > > Modules linked in:
> > > CPU: 0 PID: 181 Comm: mausezahn Not tainted 6.6.0-rc6-custom-ge43e6d9582e0 #682
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
> > > RIP: 0010:kfree_skb_reason+0x38/0x130
> > > [...]
> > > Call Trace:
> > >  <IRQ>
> > >  __netif_receive_skb_core.constprop.0+0x837/0xdb0
> > >  __netif_receive_skb_one_core+0x3c/0x70
> > >  process_backlog+0x95/0x130
> > >  __napi_poll+0x25/0x1b0
> > >  net_rx_action+0x29b/0x310
> > >  __do_softirq+0xc0/0x29b
> > >  do_softirq+0x43/0x60
> > >  </IRQ>
> > >
> > > [2]
> > >
> > > ip link add name veth0 type veth peer name veth1
> > > ip link set dev veth0 up
> > > ip link set dev veth1 up
> > > tc qdisc add dev veth1 clsact
> > > tc filter add dev veth1 ingress pref 1 proto all flower dst_mac 00:11:22:33:44:55 action drop
> > > mausezahn veth0 -a own -b 00:11:22:33:44:55 -q -c 1
> > >
> > > Ido reported:
> > >
> > >   [...] getting the following splat [1] with CONFIG_DEBUG_NET=y and this
> > >   reproducer [2]. Problem seems to be that classifiers clear 'struct
> > >   tcf_result::drop_reason', thereby triggering the warning in
> > >   __kfree_skb_reason() due to reason being 'SKB_NOT_DROPPED_YET' (0). [...]
> > >
> > >   [1]
> > >   WARNING: CPU: 0 PID: 181 at net/core/skbuff.c:1082 kfree_skb_reason+0x38/0x130
> > >   Modules linked in:
> > >   CPU: 0 PID: 181 Comm: mausezahn Not tainted 6.6.0-rc6-custom-ge43e6d9582e0 #682
> > >   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
> > >   RIP: 0010:kfree_skb_reason+0x38/0x130
> > >   [...]
> > >   Call Trace:
> > >    <IRQ>
> > >    __netif_receive_skb_core.constprop.0+0x837/0xdb0
> > >    __netif_receive_skb_one_core+0x3c/0x70
> > >    process_backlog+0x95/0x130
> > >    __napi_poll+0x25/0x1b0
> > >    net_rx_action+0x29b/0x310
> > >    __do_softirq+0xc0/0x29b
> > >    do_softirq+0x43/0x60
> > >    </IRQ>
> > >
> > >   [2]
> > >   #!/bin/bash
> > >
> > >   ip link add name veth0 type veth peer name veth1
> > >   ip link set dev veth0 up
> > >   ip link set dev veth1 up
> > >   tc qdisc add dev veth1 clsact
> > >   tc filter add dev veth1 ingress pref 1 proto all flower dst_mac 00:11:22:33:44:55 action drop
> > >   mausezahn veth0 -a own -b 00:11:22:33:44:55 -q -c 1
> > >
> > > What happens is that inside most classifiers the tcf_result is copied over
> > > from a filter template e.g. *res = f->res which then implicitly overrides
> > > the prior SKB_DROP_REASON_TC_{INGRESS,EGRESS} default drop code which was
> > > set via sch_handle_{ingress,egress}() for kfree_skb_reason().
> > >
> > > Commit text above copied verbatim from Daniel. The general idea of the patch
> > > is not very different from what Ido originally posted but instead done at the
> > > cls_api codepath.
> > >
> > > Fixes: 54a59aed395c ("net, sched: Make tc-related drop reason more flexible")
> > > Reported-by: Ido Schimmel <idosch@idosch.org>
> > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > > Link: https://lore.kernel.org/netdev/ZTjY959R+AFXf3Xy@shredder
> >
> > Hi Jamal,
> >
> > FWIIW, I think it would be nicer to fix this the classifiers so they don't
> > do this, which by my reading is what Daniel's patch did.
> >
> 
> I dont believe it was nicer tbh.

I thought you might say that :)

> In any case we are going to send cleanups to do this with skb cb.

Thanks, I'll look out for the new patch.

> > But, I don't feel strongly about this and I do tend to think the
> > approach taken in this patch is a nice clean fix for net.
> >
> > Reviewed-by: Simon Horman <horms@kernel.org>
> >
> > BTW, this patch is marked as Not Applicable in patchwork.
> > I am unsure why.
> 
> Weird.

^ permalink raw reply

* Re: [PATCH] net: ethernet: ti: am65-cpsw: rx_pause/tx_pause controls wrong direction
From: Simon Horman @ 2023-11-04 16:10 UTC (permalink / raw)
  To: Ronald Wahl
  Cc: linux-kernel, netdev, Grygorii Strashko, Dan Carpenter,
	Siddharth Vadapalli, Jakub Kicinski, David S . Miller,
	Roger Quadros, Ronald Wahl
In-Reply-To: <20231031122005.13368-1-rwahl@gmx.de>

On Tue, Oct 31, 2023 at 01:20:05PM +0100, Ronald Wahl wrote:
> From: Ronald Wahl <ronald.wahl@raritan.com>
> 
> The rx_pause flag says that whether we support receiving Pause frames.
> When a Pause frame is received TX is delayed for some time. This is TX
> flow control. In the same manner tx_pause is actually RX flow control.
> 
> Signed-off-by: Ronald Wahl <ronald.wahl@raritan.com>

Hi Ronald,

I am a little unclear if this patch addresses a user-visible bug, or
is adding a new feature.


If it is fixing a bug then it should be targeted at the net tree.
It should apply cleanly there, and the tree should be noted in the subject.

  Subject: [PATCH net] net: ethernet: ...

Also, if it is a bug fix, it should have a fixes tag, indicating the
revision(s) where the problem was introduced. This to assist in backporting
fixes. In this case perhaps the following is appropriate:

  Fixes: e8609e69470f ("net: ethernet: ti: am65-cpsw: Convert to PHYLINK")

It is probably not necessary to repost to address the above.
But please keep it in mind for future Networking patches.


On the other hand, if this is a new feature, then it should be targeted
at net-next:

  Subject: [PATCH net-next] net: ethernet: ...

And in that case the following applies.


## Form letter - net-next-closed

The merge window for v6.7 has begun and therefore net-next is closed
for new drivers, features, code refactoring and optimizations.
We are currently accepting bug fixes only.

Please repost when net-next reopens after November 12th.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle

^ permalink raw reply

* Re: [PATCH v4 2/3] NFSD: convert write_version to netlink command
From: kernel test robot @ 2023-11-04 16:01 UTC (permalink / raw)
  To: Lorenzo Bianconi, linux-nfs
  Cc: oe-kbuild-all, lorenzo.bianconi, neilb, chuck.lever, netdev,
	jlayton, kuba
In-Reply-To: <3785da26e14c13e194510eaad9c6bd846d691d5f.1699095665.git.lorenzo@kernel.org>

Hi Lorenzo,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on next-20231103]
[cannot apply to trondmy-nfs/linux-next v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Bianconi/NFSD-convert-write_threads-to-netlink-command/20231104-202515
base:   linus/master
patch link:    https://lore.kernel.org/r/3785da26e14c13e194510eaad9c6bd846d691d5f.1699095665.git.lorenzo%40kernel.org
patch subject: [PATCH v4 2/3] NFSD: convert write_version to netlink command
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20231104/202311042320.nQOTYxhs-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231104/202311042320.nQOTYxhs-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311042320.nQOTYxhs-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/nfsd/nfsctl.c:1810: warning: expecting prototype for nfsd_nl_version_get_doit(). Prototype was for nfsd_nl_version_get_dumpit() instead


vim +1810 fs/nfsd/nfsctl.c

  1800	
  1801	/**
  1802	 * nfsd_nl_version_get_doit - Handle verion_get dumpit
  1803	 * @skb: reply buffer
  1804	 * @cb: netlink metadata and command arguments
  1805	 *
  1806	 * Returns the size of the reply or a negative errno.
  1807	 */
  1808	int nfsd_nl_version_get_dumpit(struct sk_buff *skb,
  1809				       struct netlink_callback *cb)
> 1810	{
  1811		struct nfsd_net *nn = net_generic(sock_net(skb->sk), nfsd_net_id);
  1812		int i, ret = -ENOMEM;
  1813	
  1814		mutex_lock(&nfsd_mutex);
  1815	
  1816		for (i = 2; i <= 4; i++) {
  1817			int j;
  1818	
  1819			if (i < cb->args[0]) /* already consumed */
  1820				continue;
  1821	
  1822			if (!nfsd_vers(nn, i, NFSD_AVAIL))
  1823				continue;
  1824	
  1825			for (j = 0; j <= NFSD_SUPPORTED_MINOR_VERSION; j++) {
  1826				void *hdr;
  1827	
  1828				if (!nfsd_vers(nn, i, NFSD_TEST))
  1829					continue;
  1830	
  1831				/* NFSv{2,3} does not support minor numbers */
  1832				if (i < 4 && j)
  1833					continue;
  1834	
  1835				if (i == 4 && !nfsd_minorversion(nn, j, NFSD_TEST))
  1836					continue;
  1837	
  1838				hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
  1839						  cb->nlh->nlmsg_seq, &nfsd_nl_family,
  1840						  0, NFSD_CMD_VERSION_GET);
  1841				if (!hdr)
  1842					goto out;
  1843	
  1844				if (nla_put_u32(skb, NFSD_A_SERVER_VERSION_MAJOR, i) ||
  1845				    nla_put_u32(skb, NFSD_A_SERVER_VERSION_MINOR, j))
  1846					goto out;
  1847	
  1848				genlmsg_end(skb, hdr);
  1849			}
  1850		}
  1851		cb->args[0] = i;
  1852		ret = skb->len;
  1853	out:
  1854		mutex_unlock(&nfsd_mutex);
  1855	
  1856		return ret;
  1857	}
  1858	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH net 1/1] net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config
From: Jamal Hadi Salim @ 2023-11-04 15:46 UTC (permalink / raw)
  To: Simon Horman
  Cc: davem, kuba, edumazet, pabeni, jiri, xiyou.wangcong, daniel,
	idosch, netdev
In-Reply-To: <20231104131054.GB891380@kernel.org>

On Sat, Nov 4, 2023 at 9:11 AM Simon Horman <horms@kernel.org> wrote:
>
> On Sat, Oct 28, 2023 at 01:16:10PM -0400, Jamal Hadi Salim wrote:
> > Getting the following splat [1] with CONFIG_DEBUG_NET=y and this
> > reproducer [2]. Problem seems to be that classifiers clear 'struct
> > tcf_result::drop_reason', thereby triggering the warning in
> > __kfree_skb_reason() due to reason being 'SKB_NOT_DROPPED_YET' (0).
> >
> > Fixed by disambiguating a legit error from a verdict with a bogus drop_reason
> >
> > [1]
> > WARNING: CPU: 0 PID: 181 at net/core/skbuff.c:1082 kfree_skb_reason+0x38/0x130
> > Modules linked in:
> > CPU: 0 PID: 181 Comm: mausezahn Not tainted 6.6.0-rc6-custom-ge43e6d9582e0 #682
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
> > RIP: 0010:kfree_skb_reason+0x38/0x130
> > [...]
> > Call Trace:
> >  <IRQ>
> >  __netif_receive_skb_core.constprop.0+0x837/0xdb0
> >  __netif_receive_skb_one_core+0x3c/0x70
> >  process_backlog+0x95/0x130
> >  __napi_poll+0x25/0x1b0
> >  net_rx_action+0x29b/0x310
> >  __do_softirq+0xc0/0x29b
> >  do_softirq+0x43/0x60
> >  </IRQ>
> >
> > [2]
> >
> > ip link add name veth0 type veth peer name veth1
> > ip link set dev veth0 up
> > ip link set dev veth1 up
> > tc qdisc add dev veth1 clsact
> > tc filter add dev veth1 ingress pref 1 proto all flower dst_mac 00:11:22:33:44:55 action drop
> > mausezahn veth0 -a own -b 00:11:22:33:44:55 -q -c 1
> >
> > Ido reported:
> >
> >   [...] getting the following splat [1] with CONFIG_DEBUG_NET=y and this
> >   reproducer [2]. Problem seems to be that classifiers clear 'struct
> >   tcf_result::drop_reason', thereby triggering the warning in
> >   __kfree_skb_reason() due to reason being 'SKB_NOT_DROPPED_YET' (0). [...]
> >
> >   [1]
> >   WARNING: CPU: 0 PID: 181 at net/core/skbuff.c:1082 kfree_skb_reason+0x38/0x130
> >   Modules linked in:
> >   CPU: 0 PID: 181 Comm: mausezahn Not tainted 6.6.0-rc6-custom-ge43e6d9582e0 #682
> >   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
> >   RIP: 0010:kfree_skb_reason+0x38/0x130
> >   [...]
> >   Call Trace:
> >    <IRQ>
> >    __netif_receive_skb_core.constprop.0+0x837/0xdb0
> >    __netif_receive_skb_one_core+0x3c/0x70
> >    process_backlog+0x95/0x130
> >    __napi_poll+0x25/0x1b0
> >    net_rx_action+0x29b/0x310
> >    __do_softirq+0xc0/0x29b
> >    do_softirq+0x43/0x60
> >    </IRQ>
> >
> >   [2]
> >   #!/bin/bash
> >
> >   ip link add name veth0 type veth peer name veth1
> >   ip link set dev veth0 up
> >   ip link set dev veth1 up
> >   tc qdisc add dev veth1 clsact
> >   tc filter add dev veth1 ingress pref 1 proto all flower dst_mac 00:11:22:33:44:55 action drop
> >   mausezahn veth0 -a own -b 00:11:22:33:44:55 -q -c 1
> >
> > What happens is that inside most classifiers the tcf_result is copied over
> > from a filter template e.g. *res = f->res which then implicitly overrides
> > the prior SKB_DROP_REASON_TC_{INGRESS,EGRESS} default drop code which was
> > set via sch_handle_{ingress,egress}() for kfree_skb_reason().
> >
> > Commit text above copied verbatim from Daniel. The general idea of the patch
> > is not very different from what Ido originally posted but instead done at the
> > cls_api codepath.
> >
> > Fixes: 54a59aed395c ("net, sched: Make tc-related drop reason more flexible")
> > Reported-by: Ido Schimmel <idosch@idosch.org>
> > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > Link: https://lore.kernel.org/netdev/ZTjY959R+AFXf3Xy@shredder
>
> Hi Jamal,
>
> FWIIW, I think it would be nicer to fix this the classifiers so they don't
> do this, which by my reading is what Daniel's patch did.
>

I dont believe it was nicer tbh.
In any case we are going to send cleanups to do this with skb cb.

> But, I don't feel strongly about this and I do tend to think the
> approach taken in this patch is a nice clean fix for net.
>
> Reviewed-by: Simon Horman <horms@kernel.org>
>
> BTW, this patch is marked as Not Applicable in patchwork.
> I am unsure why.

Weird.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH iwl-next] ice: periodically kick Tx timestamp interrupt
From: Simon Horman @ 2023-11-04 15:36 UTC (permalink / raw)
  To: Karol Kolacinski
  Cc: intel-wired-lan, netdev, anthony.l.nguyen, jesse.brandeburg,
	Jacob Keller, Andrii Staikov
In-Reply-To: <20231103162943.485467-1-karol.kolacinski@intel.com>

On Fri, Nov 03, 2023 at 05:29:43PM +0100, Karol Kolacinski wrote:
> From: Jacob Keller <jacob.e.keller@intel.com>
> 
> The E822 hardware for Tx timestamping keeps track of how many
> outstanding timestamps are still in the PHY memory block. It will not
> generate a new interrupt to the MAC until all of the timestamps in the
> region have been read.
> 
> If somehow all the available data is not read, but the driver has exited
> its interrupt routine already, the PHY will not generate a new interrupt
> even if new timestamp data is captured. Because no interrupt is
> generated, the driver never processes the timestamp data. This state
> results in a permanent failure for all future Tx timestamps.
> 
> It is not clear how the driver and hardware could enter this state.
> However, if it does, there is currently no recovery mechanism.
> 
> Add a recovery mechanism via the periodic PTP work thread which invokes
> ice_ptp_periodic_work(). Introduce a new check,
> ice_ptp_maybe_trigger_tx_interrupt() which checks the PHY timestamp
> ready bitmask. If any bits are set, trigger a software interrupt by
> writing to PFINT_OICR.
> 
> Once triggered, the main timestamp processing thread will read through
> the PHY data and clear the outstanding timestamp data. Once cleared, new
> data should trigger interrupts as expected.
> 
> This should allow recovery from such a state rather than leaving the
> device in a state where we cannot process Tx timestamps.
> 
> It is possible that this function checks for timestamp data
> simultaneously with the interrupt, and it might trigger additional
> unnecessary interrupts. This will cause a small amount of additional
> processing.
> 
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
> Reviewed-by: Andrii Staikov <andrii.staikov@intel.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH iwl-next v2] ice: Reset VF on Tx MDD event
From: Simon Horman @ 2023-11-04 15:33 UTC (permalink / raw)
  To: Pawel Chmielewski
  Cc: intel-wired-lan, netdev, pmenzel, lukasz.czapnik, Liang-Min Wang,
	Michal Swiatkowski
In-Reply-To: <20231102155149.2574209-1-pawel.chmielewski@intel.com>

On Thu, Nov 02, 2023 at 04:51:49PM +0100, Pawel Chmielewski wrote:
> From: Liang-Min Wang <liang-min.wang@intel.com>
> 
> In cases when VF sends malformed packets that are classified as malicious,
> sometimes it causes Tx queue to freeze. This frozen queue can be stuck
> for several minutes being unusable. This behavior can be reproduced with
> DPDK application, testpmd.
> 
> When Malicious Driver Detection event occurs, perform graceful VF reset
> to quickly bring VF back to operational state. Add a log message to
> notify about the cause of the reset.
> 
> Signed-off-by: Liang-Min Wang <liang-min.wang@intel.com>
> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH iwl-net 3/3] ice: restore timestamp configuration after device reset
From: Simon Horman @ 2023-11-04 15:25 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Anthony Nguyen, Intel Wired LAN, netdev, Jesse Brandeburg
In-Reply-To: <20231103234658.511859-4-jacob.e.keller@intel.com>

On Fri, Nov 03, 2023 at 04:46:58PM -0700, Jacob Keller wrote:
> The driver calls ice_ptp_cfg_timestamp() during ice_ptp_prepare_for_reset()
> to disable timestamping while the device is resetting. This operation
> destroys the user requested configuration. While the driver does call
> ice_ptp_cfg_timestamp in ice_rebuild() to restore some hardware settings
> after a reset, it unconditionally passes true or false, resulting in
> failure to restore previous user space configuration.
> 
> This results in a device reset forcibly disabling timestamp configuration
> regardless of current user settings.
> 
> This was not detected previously due to a quirk of the LinuxPTP ptp4l
> application. If ptp4l detects a missing timestamp, it enters a fault state
> and performs recovery logic which includes executing SIOCSHWTSTAMP again,
> restoring the now accidentally cleared configuration.
> 
> Not every application does this, and for these applications, timestamps
> will mysteriously stop after a PF reset, without being restored until an
> application restart.
> 
> Fix this by replacing ice_ptp_cfg_timestamp() with two new functions:
> 
> 1) ice_ptp_disable_timestamp_mode() which unconditionally disables the
>    timestamping logic in ice_ptp_prepare_for_reset() and ice_ptp_release()
> 
> 2) ice_ptp_restore_timestamp_mode() which calls
>    ice_ptp_restore_tx_interrupt() to restore Tx timestamping configuration,
>    calls ice_set_rx_tstamp() to restore Rx timestamping configuration, and
>    issues an immediate TSYN_TX interrupt to ensure that timestamps which
>    may have occurred during the device reset get processed.
> 
> Modify the ice_ptp_set_timestamp_mode to directly save the user
> configuration and then call ice_ptp_restore_timestamp_mode. This way, reset
> no longer destroys the saved user configuration.
> 
> This obsoletes the ice_set_tx_tstamp() function which can now be safely
> removed.
> 
> With this change, all devices should now restore Tx and Rx timestamping
> functionality correctly after a PF reset without application intervention.
> 
> Fixes: 77a781155a65 ("ice: enable receive hardware timestamping")
> Fixes: ea9b847cda64 ("ice: enable transmit timestamps for E810 devices")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply

* Re: [PATCH iwl-net 2/3] ice: unify logic for programming PFINT_TSYN_MSK
From: Simon Horman @ 2023-11-04 15:24 UTC (permalink / raw)
  To: Jacob Keller; +Cc: Anthony Nguyen, Intel Wired LAN, netdev, Jesse Brandeburg
In-Reply-To: <20231103234658.511859-3-jacob.e.keller@intel.com>

On Fri, Nov 03, 2023 at 04:46:57PM -0700, Jacob Keller wrote:
> Commit d938a8cca88a ("ice: Auxbus devices & driver for E822 TS") modified
> how Tx timestamps are handled for E822 devices. On these devices, only the
> clock owner handles reading the Tx timestamp data from firmware. To do
> this, the PFINT_TSYN_MSK register is modified from the default value to one
> which enables reacting to a Tx timestamp on all PHY ports.
> 
> The driver currently programs PFINT_TSYN_MSK in different places depending
> on whether the port is the clock owner or not. For the clock owner, the
> PFINT_TSYN_MSK value is programmed during ice_ptp_init_owner just before
> calling ice_ptp_tx_ena_intr to program the PHY ports.
> 
> For the non-clock owner ports, the PFINT_TSYN_MSK is programmed during
> ice_ptp_init_port.
> 
> If a large enough device reset occurs, the PFINT_TSYN_MSK register will be
> reset to the default value in which only the PHY associated directly with
> the PF will cause the Tx timestamp interrupt to trigger.
> 
> The driver lacks logic to reprogram the PFINT_TSYN_MSK register after a
> device reset. For the E822 device, this results in the PF no longer
> responding to interrupts for other ports. This results in failure to
> deliver Tx timestamps to user space applications.
> 
> Rename ice_ptp_configure_tx_tstamp to ice_ptp_cfg_tx_interrupt, and unify
> the logic for programming PFINT_TSYN_MSK and PFINT_OICR_ENA into one place.
> This function will program both registers according to the combination of
> user configuration and device requirements.
> 
> This ensures that PFINT_TSYN_MSK is always restored when we configure the
> Tx timestamp interrupt.
> 
> Fixes: d938a8cca88a ("ice: Auxbus devices & driver for E822 TS")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

Reviewed-by: Simon Horman <horms@kernel.org>


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox