Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v8 02/10] bpf, net: introduce bpf_struct_ops_desc.
From: Kui-Feng Lee @ 2023-10-31 16:00 UTC (permalink / raw)
  To: Martin KaFai Lau, thinker.li, bpf, ast, song, kernel-team, andrii,
	drosen
  Cc: kuifeng, netdev
In-Reply-To: <bb7bbfa2-94b1-041d-7255-bb8c7e56e6c7@linux.dev>



On 10/30/23 23:40, Martin KaFai Lau wrote:
> 
> nit. I think this should be renamed to bpf_

Sure! It looks better.

^ permalink raw reply

* Re: [PATCH] net: stmmac: Wait a bit for the reset to take effect
From: Bernd Edlinger @ 2023-10-31 16:10 UTC (permalink / raw)
  To: Serge Semin
  Cc: Alexandre Torgue, Jose Abreu, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, netdev, linux-stm32,
	linux-arm-kernel, linux-kernel@vger.kernel.org
In-Reply-To: <j37ktiug7vwbb7h7s44zmng5a2bjzbd663p7pfowbehapjv3by@vrxfmapscaln>



On 10/31/23 11:32, Serge Semin wrote:
> On Mon, Oct 30, 2023 at 07:01:11AM +0100, Bernd Edlinger wrote:
>> otherwise the synopsys_id value may be read out wrong,
>> because the GMAC_VERSION register might still be in reset
>> state, for at least 1 us after the reset is de-asserted.
> 
> From what have you got that delay value?
> 

Just try and error, with very old linux versions and old gcc versions
the synopsys_id was read out correctly most of the time (but not always),
with recent linux versions and recnet gcc versions it was read out
wrongly most of the time, but again not always.
I don't have access to the VHDL code in question, so I cannot
tell why it takes so long to get the correct values, I also do not
have more than a few hardware samples, so I cannot tell how long
this timeout must be in worst case.
Experimentally I can tell that the register is read several times
as zero immediately after the reset is de-asserted, also adding several
no-ops is not enough, adding a printk is enough, also udelay(1) seems to
be enough but I tried that not very often, and I have not access to many
hardware samples to be 100% sure about the necessary delay.
And since the udelay here is only executed once per device instance,
it seems acceptable to delay the boot for 10 us.

BTW: my hardware's synopsys id is 0x37.


Bernd.

> -Serge(y)
> 
>>
>> Add a wait for 10 us before continuing to be on the safe side.
>>
>> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
>> ---
>>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index 5801f4d50f95..e485f4db3605 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -7398,6 +7398,9 @@ int stmmac_dvr_probe(struct device *device,
>>  		dev_err(priv->device, "unable to bring out of ahb reset: %pe\n",
>>  			ERR_PTR(ret));
>>  
>> +	/* Wait a bit for the reset to take effect */
>> +	udelay(10);
>> +
>>  	/* Init MAC and get the capabilities */
>>  	ret = stmmac_hw_init(priv);
>>  	if (ret)
>> -- 
>> 2.39.2
>>
>>

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Gal Pressman @ 2023-10-31 16:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Ahmed Zaki, Jakub Kicinski, mkubecek, andrew,
	willemdebruijn.kernel, Wojciech Drewek, corbet, netdev, linux-doc,
	jesse.brandeburg, edumazet, anthony.l.nguyen, horms,
	vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni, davem
In-Reply-To: <CAKgT0UdObrDUGKMC7Tneqc4j3tU1jxRugoEB=u63drHhxOeKyw@mail.gmail.com>

On 31/10/2023 16:59, Alexander Duyck wrote:
> On Tue, Oct 31, 2023 at 5:01 AM Gal Pressman <gal@nvidia.com> wrote:
>>
>> On 29/10/2023 18:59, Ahmed Zaki wrote:
>>>
>>>
>>> On 2023-10-29 06:48, Gal Pressman wrote:
>>>> On 29/10/2023 14:42, Ahmed Zaki wrote:
>>>>>
>>>>>
>>>>> On 2023-10-29 06:25, Gal Pressman wrote:
>>>>>> On 21/10/2023 3:00, Ahmed Zaki wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2023-10-20 17:49, Jakub Kicinski wrote:
>>>>>>>> On Fri, 20 Oct 2023 17:14:11 -0600 Ahmed Zaki wrote:
>>>>>>>>> I replied to that here:
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/all/afb4a06f-cfba-47ba-adb3-09bea7cb5f00@intel.com/
>>>>>>>>>
>>>>>>>>> I am kind of confused now so please bear with me. ethtool either
>>>>>>>>> sends
>>>>>>>>> "ethtool_rxfh" or "ethtool_rxnfc". AFAIK "ethtool_rxfh" is the
>>>>>>>>> interface
>>>>>>>>> for "ethtool -X" which is used to set the RSS algorithm. But we
>>>>>>>>> kind of
>>>>>>>>> agreed to go with "ethtool -U|-N" for symmetric-xor, and that uses
>>>>>>>>> "ethtool_rxnfc" (as implemented in this series).
>>>>>>>>
>>>>>>>> I have no strong preference. Sounds like Alex prefers to keep it
>>>>>>>> closer
>>>>>>>> to algo, which is "ethtool_rxfh".
>>>>>>>>
>>>>>>>>> Do you mean use "ethtool_rxfh" instead of "ethtool_rxnfc"? how would
>>>>>>>>> that work on the ethtool user interface?
>>>>>>>>
>>>>>>>> I don't know what you're asking of us. If you find the code to
>>>>>>>> confusing
>>>>>>>> maybe someone at Intel can help you :|
>>>>>>>
>>>>>>> The code is straightforward. I am confused by the requirements: don't
>>>>>>> add a new algorithm but use "ethtool_rxfh".
>>>>>>>
>>>>>>> I'll see if I can get more help, may be I am missing something.
>>>>>>>
>>>>>>
>>>>>> What was the decision here?
>>>>>> Is this going to be exposed through ethtool -N or -X?
>>>>>
>>>>> I am working on a new version that uses "ethtool_rxfh" to set the
>>>>> symmetric-xor. The user will set per-device via:
>>>>>
>>>>> ethtool -X eth0 hfunc toeplitz symmetric-xor
>>>>>
>>>>> then specify the per-flow type RSS fields as usual:
>>>>>
>>>>> ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
>>>>>
>>>>> The downside is that all flow-types will have to be either symmetric or
>>>>> asymmetric.
>>>>
>>>> Why are we making the interface less flexible than it can be with -N?
>>>
>>> Alexander Duyck prefers to implement the "symmetric-xor" interface as an
>>> algorithm or extension (please refer to previous messages), but ethtool
>>> does not provide flowtype/RSS fields setting via "-X". The above was the
>>> best solution that we (at Intel) could think of.
>>
>> OK, it's a weird we're deliberately limiting our interface, given
>> there's already hardware that supports controlling symmetric hashing per
>> flow type.
>>
>> I saw you mentioned the way ice hardware implements symmetric-xor
>> somewhere, it definitely needs to be added somewhere in our
>> documentation to prevent confusion.
>> mlx5 hardware also does symmetric hashing with xor, but not exactly as
>> you described, we need the algorithm to be clear.
> 
> It is precisely because of the way the symmetric-xor implements it
> that I suggested that they change the algo type instead of the input
> fields.
> 
> Instead of doing something such as rearranging the inputs, what they
> do is start XORing them together and then using those values for both
> the source and destination ports. It would be one thing if they
> swapped them, but instead they destroy the entropy provided by XORing
> the values together and then doubling them up in the source and
> destination fields. The result is the hash value becomes predictable
> in that once you know the target you just have to offset the source
> and destination port/IP by the same amount so that they hash out to
> the same values, and as a result it would make DDoS attacks based on
> the RSS hash much easier.
> 
> Where I draw the line in this is if we start losing entropy without
> explicitly removing the value then it is part of the algo, whereas if
> it is something such as placement or us explicitly saying we don't
> want certain fields in there then it would be part of the input.
> Adding fields to the input should increase or at least maintain the
> entropy is my point of view.

Thanks for the detailed summary, that was helpful.

Though, if a vendor chooses to implement symmetric by sorting, we will
still have it as part of the algorithm, not input.

My main concern was about losing the ability to control symmetric per
flow-type, but I guess we can resolve that if the need arises.

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Gal Pressman @ 2023-10-31 16:12 UTC (permalink / raw)
  To: Ahmed Zaki, Jakub Kicinski, Alexander H Duyck
  Cc: mkubecek, andrew, willemdebruijn.kernel, Wojciech Drewek, corbet,
	netdev, linux-doc, jesse.brandeburg, edumazet, anthony.l.nguyen,
	horms, vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni,
	davem
In-Reply-To: <e7679b57-af11-42b1-91c7-b18cbcc70119@intel.com>

On 31/10/2023 17:14, Ahmed Zaki wrote:
> 
> 
> On 2023-10-31 08:45, Gal Pressman wrote:
>> On 31/10/2023 16:40, Ahmed Zaki wrote:
>>>
>>>
>>> On 2023-10-31 06:00, Gal Pressman wrote:
>>>> On 29/10/2023 18:59, Ahmed Zaki wrote:
>>>>>
>>>>>
>>>>> On 2023-10-29 06:48, Gal Pressman wrote:
>>>>>> On 29/10/2023 14:42, Ahmed Zaki wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2023-10-29 06:25, Gal Pressman wrote:
>>>>>>>> On 21/10/2023 3:00, Ahmed Zaki wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2023-10-20 17:49, Jakub Kicinski wrote:
>>>>>>>>>> On Fri, 20 Oct 2023 17:14:11 -0600 Ahmed Zaki wrote:
>>>>>>>>>>> I replied to that here:
>>>>>>>>>>>
>>>>>>>>>>> https://lore.kernel.org/all/afb4a06f-cfba-47ba-adb3-09bea7cb5f00@intel.com/
>>>>>>>>>>>
>>>>>>>>>>> I am kind of confused now so please bear with me. ethtool either
>>>>>>>>>>> sends
>>>>>>>>>>> "ethtool_rxfh" or "ethtool_rxnfc". AFAIK "ethtool_rxfh" is the
>>>>>>>>>>> interface
>>>>>>>>>>> for "ethtool -X" which is used to set the RSS algorithm. But we
>>>>>>>>>>> kind of
>>>>>>>>>>> agreed to go with "ethtool -U|-N" for symmetric-xor, and that
>>>>>>>>>>> uses
>>>>>>>>>>> "ethtool_rxnfc" (as implemented in this series).
>>>>>>>>>>
>>>>>>>>>> I have no strong preference. Sounds like Alex prefers to keep it
>>>>>>>>>> closer
>>>>>>>>>> to algo, which is "ethtool_rxfh".
>>>>>>>>>>
>>>>>>>>>>> Do you mean use "ethtool_rxfh" instead of "ethtool_rxnfc"? how
>>>>>>>>>>> would
>>>>>>>>>>> that work on the ethtool user interface?
>>>>>>>>>>
>>>>>>>>>> I don't know what you're asking of us. If you find the code to
>>>>>>>>>> confusing
>>>>>>>>>> maybe someone at Intel can help you :|
>>>>>>>>>
>>>>>>>>> The code is straightforward. I am confused by the requirements:
>>>>>>>>> don't
>>>>>>>>> add a new algorithm but use "ethtool_rxfh".
>>>>>>>>>
>>>>>>>>> I'll see if I can get more help, may be I am missing something.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What was the decision here?
>>>>>>>> Is this going to be exposed through ethtool -N or -X?
>>>>>>>
>>>>>>> I am working on a new version that uses "ethtool_rxfh" to set the
>>>>>>> symmetric-xor. The user will set per-device via:
>>>>>>>
>>>>>>> ethtool -X eth0 hfunc toeplitz symmetric-xor
>>>>>>>
>>>>>>> then specify the per-flow type RSS fields as usual:
>>>>>>>
>>>>>>> ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
>>>>>>>
>>>>>>> The downside is that all flow-types will have to be either
>>>>>>> symmetric or
>>>>>>> asymmetric.
>>>>>>
>>>>>> Why are we making the interface less flexible than it can be with -N?
>>>>>
>>>>> Alexander Duyck prefers to implement the "symmetric-xor" interface
>>>>> as an
>>>>> algorithm or extension (please refer to previous messages), but
>>>>> ethtool
>>>>> does not provide flowtype/RSS fields setting via "-X". The above
>>>>> was the
>>>>> best solution that we (at Intel) could think of.
>>>>
>>>> OK, it's a weird we're deliberately limiting our interface, given
>>>> there's already hardware that supports controlling symmetric hashing
>>>> per
>>>> flow type.
>>>>
>>>> I saw you mentioned the way ice hardware implements symmetric-xor
>>>> somewhere, it definitely needs to be added somewhere in our
>>>> documentation to prevent confusion.
>>>> mlx5 hardware also does symmetric hashing with xor, but not exactly as
>>>> you described, we need the algorithm to be clear.
>>>
>>> Sure. I will add more ice-specific doc in:
>>> Documentation/networking/device_drivers/ethernet/intel/ice.rst
>>
>> I was thinking of somewhere more generic, where ethtool users (not
>> necessarily ice users) can refer to.
>>
>> Perhaps Documentation/networking/ethtool-netlink.rst? Or ethtool man
>> page?
> 
> Do you mean add vendor-specific implementation details to common docs?
> Not sure if I have seen this before. Any examples?
> 
> Or, we can add a note in ethtool doc that each vendor's implementation
> is different and "Refer to your vendor's specifications for more info".

It's a generic ethtool flag, its documentation shouldn't be vendor specific.
The documentation should reflect the exact details about the algorithm,
then other vendors can either use it, or add a new symmetric flag and
document it separately.

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Gal Pressman @ 2023-10-31 16:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ahmed Zaki, Alexander H Duyck, mkubecek, andrew,
	willemdebruijn.kernel, Wojciech Drewek, corbet, netdev, linux-doc,
	jesse.brandeburg, edumazet, anthony.l.nguyen, horms,
	vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni, davem
In-Reply-To: <20231031082023.3fd4761b@kernel.org>

On 31/10/2023 17:20, Jakub Kicinski wrote:
> On Tue, 31 Oct 2023 09:14:58 -0600 Ahmed Zaki wrote:
>> Do you mean add vendor-specific implementation details to common docs? 
>> Not sure if I have seen this before. Any examples?
>>
>> Or, we can add a note in ethtool doc that each vendor's implementation 
>> is different and "Refer to your vendor's specifications for more info".
> 
> Gal, can you shed any more detail on who your implementation differs?
> It will help the discussion, and also - I'm not sure how you can do
> xor differently..

Sure, IIUC, ice's implementation does a:
(SRC_IP ^ DST_IP, SRC_IP ^ DST_IP, SRC_PORT ^ DST_PORT, SRC_PORT ^ DST_PORT)

Our implementation isn't exactly xor, it is:
(SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)

The way I see it, the xor implementation should be clearly documented,
so no one uses the same flag with a different implementation by mistake.

^ permalink raw reply

* Re: [RFC] Questionable RCU/BH usage in cgw_create_job().
From: Oliver Hartkopp @ 2023-10-31 16:14 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Marc Kleine-Budde
  Cc: linux-can, netdev, Thomas Gleixner
In-Reply-To: <20231031112349.y0aLoBrz@linutronix.de>

Hi Sebastian,

thanks for the review!

On 31.10.23 12:23, Sebastian Andrzej Siewior wrote:
> Hi,
> 
> I stumbled over this piece in cgw_create_job():
> |      /* update modifications with disabled softirq & quit */
> |      local_bh_disable();
> |      memcpy(&gwj->mod, &mod, sizeof(mod));
> |      local_bh_enable();
> |      return 0;
> 
> unfortunately the comment did not provide much enlightenment for me.
> Let's look. That memcpy() overwrites struct cf_mod within struct cgw_job
> which is under RCU protection. memcpy() and RCU hardly works as a combo.
> But why the local_bh_disable()?

The content of gwj->mod can be overwritten with new modification rules 
at runtime. But this update (with memcpy) has to take place when there 
is no incoming network traffic.

> Let's look further. The user of this data structure is can_can_gw_rcv().
> There is something like:
> |                 /* check for checksum updates */
> |                 if (gwj->mod.csumfunc.crc8)
> |                         (*gwj->mod.csumfunc.crc8)(cf, &gwj->mod.csum.crc8);
> 
> With optimisation enabled (as in -O2 or so) the compiler will fetch
> mod.csumfunc.crc8, do the comparison and if non-NULL use the previously
> fetched value and jump there. So one could argue that it is not really
> affected by the memcpy() suddenly setting it to NULL. However, adding
> any kind of a function in between, say
> |                 /* check for checksum updates */
> |                 if (gwj->mod.csumfunc.crc8) {
> |+                        trace_event_crc8_sth(cf)
> |                         (*gwj->mod.csumfunc.crc8)(cf, &gwj->mod.csum.crc8);
> |                 }
> 
> will force the compiler to reload mod.csumfunc.crc8. And here is the
> possible NULL pointer if overwritten by update in cgw_create_job().
> 
> One reload that already happens is the one of mod.modfunc. First at the
> top we have:
> |         if (gwj->mod.modfunc[0])
> |                 nskb = skb_copy(skb, GFP_ATOMIC);
> |         else
> |                 nskb = skb_clone(skb, GFP_ATOMIC);
> 
> Here mod.modfunc[0] is NULL and skb_clone() is invoked. Later if
> mod.modfunc has been set to non-NULL value this piece
> |         while (modidx < MAX_MODFUNCTIONS && gwj->mod.modfunc[modidx])
> |                 (*gwj->mod.modfunc[modidx++])(cf, &gwj->mod);
> 
> reloads mod.modfunc and may modify the skb assuming that skb_copy() has
> been used earlier.
> 
> Looking at this makes me think that the local_bh_disable() has been
> added to the memcpy() just to ensure that can_can_gw_rcv() won't run
> because it is invoked in a BH disabled context. Clever little trick that
> is. But this trick is limited to UP environments…
> 
> Am I missing something?
> 
> If not, my suggestion would be replacing the bh-off, memcpy part with:
> |		old_mod = rcu_replace_pointer(gwj->mod, new_mod, true);
> |		kfree_rcu_mightsleep(old_mod);
> 
> and doing the needed pointer replacement with for struct cgw_job::mod
> and RCU annotation.

Replacing a pointer does not copy any data to the cf_mod structure, right?

Best regards,
Oliver

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Florian Fainelli @ 2023-10-31 16:23 UTC (permalink / raw)
  To: Linus Walleij, Andrew Lunn
  Cc: Vladimir Oltean, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel
In-Reply-To: <CACRpkdZ8A-Yz-8itXzaBuLHfFLpmwGyg2gUqyNMpx3NbjR+KPA@mail.gmail.com>

On 10/31/23 07:06, Linus Walleij wrote:
> On Tue, Oct 31, 2023 at 1:37 AM Andrew Lunn <andrew@lunn.ch> wrote:
>> On Tue, Oct 31, 2023 at 01:09:06AM +0200, Vladimir Oltean wrote:
>>> On Mon, Oct 30, 2023 at 11:57:33PM +0100, Linus Walleij wrote:
>>>> This of course make no sense, since the padding function should do nothing
>>>> when the packet is bigger than 60 bytes.
>>>
>>> Indeed, this of course makes no sense. Ping doesn't work, or ARP doesn't
>>> work? Could you add a static ARP entry for the 192.168.1.137 IP address?
>>
>> Probably the ARP, since they are short packets and probably need the
>> padding.
> 
> It seems correct: the reason ping stops working is not that the ping
> isn't reaching the host, I can see the pin request on tcpdumps.
> 
> The reason is that the host has no idea where to send the reply.
> Because it's not getting any ARP replies.
> 
> And that is probably because the ARP replies are short and needs
> padding to ETH_ZLEN.
> 
> I notice the code is probably borrowed from tag_brcm.c which does
> exactly the same thing (ETH_ZLEN + tag). It comes with this
> explanation:
> 
>         /* The Ethernet switch we are interfaced with needs packets to
> be at
>           * least 64 bytes (including FCS) otherwise they will be
> discarded when
>           * they enter the switch port logic. When Broadcom tags are
> enabled, we
>           * need to make sure that packets are at least 68 bytes
>           * (including FCS and tag) because the length verification is
> done after
>           * the Broadcom tag is stripped off the ingress packet.
>           *
>           * Let dsa_slave_xmit() free the SKB
>           */
> 
> The switch fabric is dropping smaller packets when CPU tags
> (DSA tags) are enabled.
> 
> So is the padding to ETH_ZLEN OK or should is happen elsewhere?

This is OK IMHO because you may not always have knowledge of which 
Ethernet MAC the switch is interfaced with, and with the tagger code, 
you have a central location to address them all.

The padding should ideally be done by the Ethernet MAC, even better when 
the HW can do it on its own (as it should), because whether there is a 
switch or not, no link partner should accept runt packets (some might 
but this is not standard).
-- 
Florian


^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Vladimir Oltean @ 2023-10-31 16:34 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Andrew Lunn, Florian Fainelli, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <CACRpkdbLTNVJusuCw2hrHDzx5odw8vw8hMWvvvvgEPsAFwB8hg@mail.gmail.com>

On Tue, Oct 31, 2023 at 03:16:50PM +0100, Linus Walleij wrote:
> On Tue, Oct 31, 2023 at 12:33 AM Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Oct 30, 2023 at 11:57:33PM +0100, Linus Walleij wrote:
> > > Here you can incidentally also see what happens if we don't pad the big packets:
> > > the packet gets truncated.
> >
> > Are we sure we are debugging a switch problem? On how many platforms
> > with the RTL8366RB can the issue be seen? Is the conduit interface the
> > same on all these platforms, or is it different and that makes no
> > difference?
> 
> I don't have any other RTL8366RB systems than the D-Link DIR-685.
> 
> I however have several systems with the same backing ethernet controller
> connected directly to a PHY and they all work fine.
> 
> Yours,
> Linus Walleij

Ok, so we don't have a confirmation of breakage with other conduit
interface than the Gemini driver, either. So a problem there is still
not off the table.

So on the gemini-dlink-dir-685.dts platform, you can also use &gmac1 as
a plain Ethernet port, right?

If possible, could you set up dsa_loop (enable CONFIG_NET_DSA_LOOP, replace
"eth0" in dsa_loop_pdata with the netdev name of &gmac1, replace DSA_TAG_PROTO_NONE
in dsa_loop_get_protocol() with your tagging protocol) and put a tcpdump
on the remote end of the gmac1 port, to see if the issue isn't, in fact,
somewhere else, maybe gmac_start_xmit()?

With dsa_loop, you turn &gmac1 into a conduit interface of sorts, except
for the fact that there is no switch to process the DSA-tagged packets,
you see those packets on the remote end, and you can investigate there
whether it's the switch who's corrupting/truncating, or if they're
somehow sent corrupted/truncated by Gemini on the wire (which would not
be visible in a local tcpdump).

^ permalink raw reply

* Re: [RFC] Questionable RCU/BH usage in cgw_create_job().
From: Sebastian Andrzej Siewior @ 2023-10-31 16:52 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Marc Kleine-Budde, linux-can, netdev, Thomas Gleixner
In-Reply-To: <ba5d5420-a3ef-4368-ba36-3a84ed1458cf@hartkopp.net>

On 2023-10-31 17:14:01 [+0100], Oliver Hartkopp wrote:
> Hi Sebastian,
Hi Oliver,

> The content of gwj->mod can be overwritten with new modification rules at
> runtime. But this update (with memcpy) has to take place when there is no
> incoming network traffic.

This is my assumption. But "no incoming network traffic" is not ensured,
right?

> > If not, my suggestion would be replacing the bh-off, memcpy part with:
> > |		old_mod = rcu_replace_pointer(gwj->mod, new_mod, true);
> > |		kfree_rcu_mightsleep(old_mod);
> > 
> > and doing the needed pointer replacement with for struct cgw_job::mod
> > and RCU annotation.
> 
> Replacing a pointer does not copy any data to the cf_mod structure, right?

Yes. The cf_mod data structure is embedded into cgw_job. So it would
have to become a pointer. Then cgw_create_job() would create a new
cf_mod via cgw_parse_attr() but it would be a new allocated structure
instead on stack like it is now. And then in the existing case you would
do the swap. Otherwise (non-existing, brand new) it becomes part of the
new created cgw_job.

The point is to replace/ update cf_mod at runtime while following RCU
rules so always either new or the old object is observed. Never an
intermediate step.

> Best regards,
> Oliver

Sebastian

^ permalink raw reply

* [net-next v2 PATCH] octeontx2-pf: TC flower offload support for ICMP type and code
From: Geetha sowjanya @ 2023-10-31 16:52 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: kuba, davem, pabeni, edumazet, sgoutham, gakula, sbhatta, hkelam

Adds tc offload support for matching on ICMP type and code.

Example usage:
To enable adding tc ingress rules
        tc qdisc add dev eth0 ingress

TC rule drop the ICMP echo reply:
        tc filter add dev eth0 protocol ip parent ffff: \
        flower ip_proto icmp type 8 code 0 skip_sw action drop

TC rule to drop ICMPv6 echo reply:
        tc filter add dev eth0 protocol ipv6 parent ffff: flower \
        indev eth0 ip_proto icmpv6 type 128 code 0 action drop

Signed-off-by: Geetha sowjanya <gakula@marvell.com>
---
v1-v2:
-Rebased on latest net-next branch.

 .../net/ethernet/marvell/octeontx2/af/mbox.h  |  2 ++
 .../net/ethernet/marvell/octeontx2/af/npc.h   |  2 ++
 .../marvell/octeontx2/af/rvu_debugfs.c        |  8 +++++++
 .../marvell/octeontx2/af/rvu_npc_fs.c         | 23 ++++++++++++++-----
 .../ethernet/marvell/octeontx2/nic/otx2_tc.c  | 14 +++++++++++
 5 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 6845556581c3..b4ced739ae44 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -1479,6 +1479,8 @@ struct flow_msg {
 #define OTX2_FLOWER_MASK_MPLS_TTL		GENMASK(7, 0)
 #define OTX2_FLOWER_MASK_MPLS_NON_TTL		GENMASK(31, 8)
 	u32 mpls_lse[4];
+	u8 icmp_type;
+	u8 icmp_code;
 };
 
 struct npc_install_flow_req {
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/npc.h b/drivers/net/ethernet/marvell/octeontx2/af/npc.h
index ab3e39eef2eb..ffc0aa0a7b47 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/npc.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/npc.h
@@ -214,6 +214,8 @@ enum key_fields {
 	NPC_MPLS3_TTL,
 	NPC_MPLS4_LBTCBOS,
 	NPC_MPLS4_TTL,
+	NPC_TYPE_ICMP,
+	NPC_CODE_ICMP,
 	NPC_HEADER_FIELDS_MAX,
 	NPC_CHAN = NPC_HEADER_FIELDS_MAX, /* Valid when Rx */
 	NPC_PF_FUNC, /* Valid when Tx */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
index bd817ee88735..468b6561ed3f 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
@@ -2889,6 +2889,14 @@ static void rvu_dbg_npc_mcam_show_flows(struct seq_file *s,
 			RVU_DBG_PRINT_MPLS_TTL(rule->packet.mpls_lse[3],
 					       rule->mask.mpls_lse[3]);
 			break;
+		case NPC_TYPE_ICMP:
+			seq_printf(s, "%d ", rule->packet.icmp_type);
+			seq_printf(s, "mask 0x%x\n", rule->mask.icmp_type);
+			break;
+		case NPC_CODE_ICMP:
+			seq_printf(s, "%d ", rule->packet.icmp_code);
+			seq_printf(s, "mask 0x%x\n", rule->mask.icmp_code);
+			break;
 		default:
 			seq_puts(s, "\n");
 			break;
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c
index 114e4ec21802..db8f151636af 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c
@@ -51,6 +51,8 @@ static const char * const npc_flow_names[] = {
 	[NPC_MPLS3_TTL]     = "lse depth 3 ttl",
 	[NPC_MPLS4_LBTCBOS] = "lse depth 4 label tc bos",
 	[NPC_MPLS4_TTL]     = "lse depth 4",
+	[NPC_TYPE_ICMP] = "icmp type",
+	[NPC_CODE_ICMP] = "icmp code",
 	[NPC_UNKNOWN]	= "unknown",
 };
 
@@ -526,6 +528,8 @@ do {									       \
 	NPC_SCAN_HDR(NPC_DPORT_TCP, NPC_LID_LD, NPC_LT_LD_TCP, 2, 2);
 	NPC_SCAN_HDR(NPC_SPORT_SCTP, NPC_LID_LD, NPC_LT_LD_SCTP, 0, 2);
 	NPC_SCAN_HDR(NPC_DPORT_SCTP, NPC_LID_LD, NPC_LT_LD_SCTP, 2, 2);
+	NPC_SCAN_HDR(NPC_TYPE_ICMP, NPC_LID_LD, NPC_LT_LD_ICMP, 0, 1);
+	NPC_SCAN_HDR(NPC_CODE_ICMP, NPC_LID_LD, NPC_LT_LD_ICMP, 1, 1);
 	NPC_SCAN_HDR(NPC_ETYPE_ETHER, NPC_LID_LA, NPC_LT_LA_ETHER, 12, 2);
 	NPC_SCAN_HDR(NPC_ETYPE_TAG1, NPC_LID_LB, NPC_LT_LB_CTAG, 4, 2);
 	NPC_SCAN_HDR(NPC_ETYPE_TAG2, NPC_LID_LB, NPC_LT_LB_STAG_QINQ, 8, 2);
@@ -555,7 +559,7 @@ static void npc_set_features(struct rvu *rvu, int blkaddr, u8 intf)
 {
 	struct npc_mcam *mcam = &rvu->hw->mcam;
 	u64 *features = &mcam->rx_features;
-	u64 tcp_udp_sctp;
+	u64 proto_flags;
 	int hdr;
 
 	if (is_npc_intf_tx(intf))
@@ -566,18 +570,21 @@ static void npc_set_features(struct rvu *rvu, int blkaddr, u8 intf)
 			*features |= BIT_ULL(hdr);
 	}
 
-	tcp_udp_sctp = BIT_ULL(NPC_SPORT_TCP) | BIT_ULL(NPC_SPORT_UDP) |
+	proto_flags = BIT_ULL(NPC_SPORT_TCP) | BIT_ULL(NPC_SPORT_UDP) |
 		       BIT_ULL(NPC_DPORT_TCP) | BIT_ULL(NPC_DPORT_UDP) |
-		       BIT_ULL(NPC_SPORT_SCTP) | BIT_ULL(NPC_DPORT_SCTP);
+		       BIT_ULL(NPC_SPORT_SCTP) | BIT_ULL(NPC_DPORT_SCTP) |
+		       BIT_ULL(NPC_SPORT_SCTP) | BIT_ULL(NPC_DPORT_SCTP) |
+		       BIT_ULL(NPC_TYPE_ICMP) | BIT_ULL(NPC_CODE_ICMP);
 
 	/* for tcp/udp/sctp corresponding layer type should be in the key */
-	if (*features & tcp_udp_sctp) {
+	if (*features & proto_flags) {
 		if (!npc_check_field(rvu, blkaddr, NPC_LD, intf))
-			*features &= ~tcp_udp_sctp;
+			*features &= ~proto_flags;
 		else
 			*features |= BIT_ULL(NPC_IPPROTO_TCP) |
 				     BIT_ULL(NPC_IPPROTO_UDP) |
-				     BIT_ULL(NPC_IPPROTO_SCTP);
+				     BIT_ULL(NPC_IPPROTO_SCTP) |
+				     BIT_ULL(NPC_IPPROTO_ICMP);
 	}
 
 	/* for AH/ICMP/ICMPv6/, check if corresponding layer type is present in the key */
@@ -971,6 +978,10 @@ do {									      \
 		       ntohs(mask->sport), 0);
 	NPC_WRITE_FLOW(NPC_DPORT_SCTP, dport, ntohs(pkt->dport), 0,
 		       ntohs(mask->dport), 0);
+	NPC_WRITE_FLOW(NPC_TYPE_ICMP, icmp_type, pkt->icmp_type, 0,
+		       mask->icmp_type, 0);
+	NPC_WRITE_FLOW(NPC_CODE_ICMP, icmp_code, pkt->icmp_code, 0,
+		       mask->icmp_code, 0);
 
 	NPC_WRITE_FLOW(NPC_IPSEC_SPI, spi, ntohl(pkt->spi), 0,
 		       ntohl(mask->spi), 0);
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_tc.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_tc.c
index 8a5e3987a482..10657a7559d7 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_tc.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_tc.c
@@ -522,6 +522,7 @@ static int otx2_tc_prepare_flow(struct otx2_nic *nic, struct otx2_tc_flow *node,
 	      BIT_ULL(FLOW_DISSECTOR_KEY_PORTS) |
 	      BIT(FLOW_DISSECTOR_KEY_IPSEC) |
 	      BIT_ULL(FLOW_DISSECTOR_KEY_MPLS) |
+	      BIT_ULL(FLOW_DISSECTOR_KEY_ICMP) |
 	      BIT_ULL(FLOW_DISSECTOR_KEY_IP))))  {
 		netdev_info(nic->netdev, "unsupported flow used key 0x%llx",
 			    dissector->used_keys);
@@ -796,6 +797,19 @@ static int otx2_tc_prepare_flow(struct otx2_nic *nic, struct otx2_tc_flow *node,
 		}
 	}
 
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ICMP)) {
+		struct flow_match_icmp match;
+
+		flow_rule_match_icmp(rule, &match);
+
+		flow_spec->icmp_type = match.key->type;
+		flow_mask->icmp_type = match.mask->type;
+		req->features |= BIT_ULL(NPC_TYPE_ICMP);
+
+		flow_spec->icmp_code = match.key->code;
+		flow_mask->icmp_code = match.mask->code;
+		req->features |= BIT_ULL(NPC_CODE_ICMP);
+	}
 	return otx2_tc_parse_actions(nic, &rule->action, req, f, node);
 }
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH net] net/tcp_sigpool: Fix some off by one bugs
From: Dmitry Safonov @ 2023-10-31 16:58 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Eric Dumazet, David S. Miller, David Ahern, Jakub Kicinski,
	Paolo Abeni, Steen Hegelund, netdev, kernel-janitors
In-Reply-To: <ce915d61-04bc-44fb-b450-35fcc9fc8831@moroto.mountain>

Hi Dan,

On 10/31/23 09:51, Dan Carpenter wrote:
> The "cpool_populated" variable is the number of elements in the cpool[]
> array that have been populated.  It is incremented in
> tcp_sigpool_alloc_ahash() every time we populate a new element.
> Unpopulated elements are NULL but if we have populated every element then
> this code will read one element beyond the end of the array.
> 
> Fixes: 8c73b26315aa ("net/tcp: Prepare tcp_md5sig_pool for TCP-AO")
> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>

Yeah, those are barriers for any issues in the caller-code, so that's
not too critical but nice to have. Thanks for the patch!

Reviewed-by: Dmitry Safonov <dima@arista.com>

> ---
> From static analysis and review.
> 
>  net/ipv4/tcp_sigpool.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/tcp_sigpool.c b/net/ipv4/tcp_sigpool.c
> index 65a8eaae2fec..55b310a722c7 100644
> --- a/net/ipv4/tcp_sigpool.c
> +++ b/net/ipv4/tcp_sigpool.c
> @@ -231,7 +231,7 @@ static void cpool_schedule_cleanup(struct kref *kref)
>   */
>  void tcp_sigpool_release(unsigned int id)
>  {
> -	if (WARN_ON_ONCE(id > cpool_populated || !cpool[id].alg))
> +	if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg))
>  		return;
>  
>  	/* slow-path */
> @@ -245,7 +245,7 @@ EXPORT_SYMBOL_GPL(tcp_sigpool_release);
>   */
>  void tcp_sigpool_get(unsigned int id)
>  {
> -	if (WARN_ON_ONCE(id > cpool_populated || !cpool[id].alg))
> +	if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg))
>  		return;
>  	kref_get(&cpool[id].kref);
>  }
> @@ -256,7 +256,7 @@ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) __cond_acquires(RC
>  	struct crypto_ahash *hash;
>  
>  	rcu_read_lock_bh();
> -	if (WARN_ON_ONCE(id > cpool_populated || !cpool[id].alg)) {
> +	if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg)) {
>  		rcu_read_unlock_bh();
>  		return -EINVAL;
>  	}
> @@ -301,7 +301,7 @@ EXPORT_SYMBOL_GPL(tcp_sigpool_end);
>   */
>  size_t tcp_sigpool_algo(unsigned int id, char *buf, size_t buf_len)
>  {
> -	if (WARN_ON_ONCE(id > cpool_populated || !cpool[id].alg))
> +	if (WARN_ON_ONCE(id >= cpool_populated || !cpool[id].alg))
>  		return -EINVAL;
>  
>  	return strscpy(buf, cpool[id].alg, buf_len);

Thanks,
            Dmitry


^ permalink raw reply

* [PATCH 6.1 35/86] r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
From: Greg Kroah-Hartman @ 2023-10-31 17:01 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Heiner Kallweit, nic_swsd,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marco Elver, netdev, Mirsad Goran Todorovac, Sasha Levin
In-Reply-To: <20231031165918.608547597@linuxfoundation.org>

6.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>

[ Upstream commit dcf75a0f6bc136de94e88178ae5f51b7f879abc9 ]

KCSAN reported the following data-race:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4368 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169

race at unknown origin, with read to 0xffff888140d37570 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4368 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:645)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0xb0000042 -> 0x00000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

The read side is in

drivers/net/ethernet/realtek/r8169_main.c
=========================================
   4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
   4356                    int budget)
   4357 {
   4358         unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
   4359         struct sk_buff *skb;
   4360
   4361         dirty_tx = tp->dirty_tx;
   4362
   4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {
   4364                 unsigned int entry = dirty_tx % NUM_TX_DESC;
   4365                 u32 status;
   4366
 → 4367                 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
   4368                 if (status & DescOwn)
   4369                         break;
   4370
   4371                 skb = tp->tx_skb[entry].skb;
   4372                 rtl8169_unmap_tx_skb(tp, entry);
   4373
   4374                 if (skb) {
   4375                         pkts_compl++;
   4376                         bytes_compl += skb->len;
   4377                         napi_consume_skb(skb, budget);
   4378                 }
   4379                 dirty_tx++;
   4380         }
   4381
   4382         if (tp->dirty_tx != dirty_tx) {
   4383                 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
   4384                 WRITE_ONCE(tp->dirty_tx, dirty_tx);
   4385
   4386                 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
   4387                                               rtl_tx_slots_avail(tp),
   4388                                               R8169_TX_START_THRS);
   4389                 /*
   4390                  * 8168 hack: TxPoll requests are lost when the Tx packets are
   4391                  * too close. Let's kick an extra TxPoll request when a burst
   4392                  * of start_xmit activity is detected (if it is not detected,
   4393                  * it is slow enough). -- FR
   4394                  * If skb is NULL then we come here again once a tx irq is
   4395                  * triggered after the last fragment is marked transmitted.
   4396                  */
   4397                 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
   4398                         rtl8169_doorbell(tp);
   4399         }
   4400 }

tp->TxDescArray[entry].opts1 is reported to have a data-race and READ_ONCE() fixes
this KCSAN warning.

   4366
 → 4367                 status = le32_to_cpu(READ_ONCE(tp->TxDescArray[entry].opts1));
   4368                 if (status & DescOwn)
   4369                         break;
   4370

Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/lkml/dc7fc8fa-4ea4-e9a9-30a6-7c83e6b53188@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Acked-by: Marco Elver <elver@google.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/realtek/r8169_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 5288daaf59b5b..f677f625a4939 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4343,7 +4343,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
 		unsigned int entry = dirty_tx % NUM_TX_DESC;
 		u32 status;
 
-		status = le32_to_cpu(tp->TxDescArray[entry].opts1);
+		status = le32_to_cpu(READ_ONCE(tp->TxDescArray[entry].opts1));
 		if (status & DescOwn)
 			break;
 
-- 
2.42.0




^ permalink raw reply related

* [PATCH 6.1 36/86] r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
From: Greg Kroah-Hartman @ 2023-10-31 17:01 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Marco Elver, Heiner Kallweit,
	nic_swsd, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Mirsad Goran Todorovac, Sasha Levin
In-Reply-To: <20231031165918.608547597@linuxfoundation.org>

6.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>

[ Upstream commit f97eee484e71890131f9c563c5cc6d5a69e4308d ]

KCSAN reported the following data-race bug:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4430 drivers/net/ethernet/realtek/r8169_main.c:4583) r8169

race at unknown origin, with read to 0xffff888117e43510 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4430 drivers/net/ethernet/realtek/r8169_main.c:4583) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:645)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0x80003fff -> 0x3402805f

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

drivers/net/ethernet/realtek/r8169_main.c:
==========================================
   4429
 → 4430                 status = le32_to_cpu(desc->opts1);
   4431                 if (status & DescOwn)
   4432                         break;
   4433
   4434                 /* This barrier is needed to keep us from reading
   4435                  * any other fields out of the Rx descriptor until
   4436                  * we know the status of DescOwn
   4437                  */
   4438                 dma_rmb();
   4439
   4440                 if (unlikely(status & RxRES)) {
   4441                         if (net_ratelimit())
   4442                                 netdev_warn(dev, "Rx ERROR. status = %08x\n",

Marco Elver explained that dma_rmb() doesn't prevent the compiler to tear up the access to
desc->opts1 which can be written to concurrently. READ_ONCE() should prevent that from
happening:

   4429
 → 4430                 status = le32_to_cpu(READ_ONCE(desc->opts1));
   4431                 if (status & DescOwn)
   4432                         break;
   4433

As the consequence of this fix, this KCSAN warning was eliminated.

Fixes: 6202806e7c03a ("r8169: drop member opts1_mask from struct rtl8169_private")
Suggested-by: Marco Elver <elver@google.com>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/lkml/dc7fc8fa-4ea4-e9a9-30a6-7c83e6b53188@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/realtek/r8169_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index f677f625a4939..80b6079b8a8e3 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4413,7 +4413,7 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, int budget
 		dma_addr_t addr;
 		u32 status;
 
-		status = le32_to_cpu(desc->opts1);
+		status = le32_to_cpu(READ_ONCE(desc->opts1));
 		if (status & DescOwn)
 			break;
 
-- 
2.42.0




^ permalink raw reply related

* [PATCH 6.1 34/86] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
From: Greg Kroah-Hartman @ 2023-10-31 17:00 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Heiner Kallweit, nic_swsd,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marco Elver, netdev, Mirsad Goran Todorovac, Sasha Levin
In-Reply-To: <20231031165918.608547597@linuxfoundation.org>

6.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>

[ Upstream commit c1c0ce31b2420d5c173228a2132a492ede03d81f ]

KCSAN reported the following data-race:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]

write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
sch_direct_xmit (net/sched/sch_generic.c:342)
__dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
__ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
ip_finish_output (net/ipv4/ip_output.c:328)
ip_output (net/ipv4/ip_output.c:435)
ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
udp_send_skb (net/ipv4/udp.c:963)
udp_sendmsg (net/ipv4/udp.c:1246)
inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
sock_sendmsg (net/socket.c:730 net/socket.c:753)
__sys_sendto (net/socket.c:2177)
__x64_sys_sendto (net/socket.c:2185)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0x002f4815 -> 0x002f4816

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

The write side of drivers/net/ethernet/realtek/r8169_main.c is:
==================
   4251         /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
   4252         smp_wmb();
   4253
 → 4254         WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
   4255
   4256         stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
   4257                                                 R8169_TX_STOP_THRS,
   4258                                                 R8169_TX_START_THRS);

The read side is the function rtl_tx():

   4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
   4356                    int budget)
   4357 {
   4358         unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
   4359         struct sk_buff *skb;
   4360
   4361         dirty_tx = tp->dirty_tx;
   4362
   4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {
   4364                 unsigned int entry = dirty_tx % NUM_TX_DESC;
   4365                 u32 status;
   4366
   4367                 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
   4368                 if (status & DescOwn)
   4369                         break;
   4370
   4371                 skb = tp->tx_skb[entry].skb;
   4372                 rtl8169_unmap_tx_skb(tp, entry);
   4373
   4374                 if (skb) {
   4375                         pkts_compl++;
   4376                         bytes_compl += skb->len;
   4377                         napi_consume_skb(skb, budget);
   4378                 }
   4379                 dirty_tx++;
   4380         }
   4381
   4382         if (tp->dirty_tx != dirty_tx) {
   4383                 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
   4384                 WRITE_ONCE(tp->dirty_tx, dirty_tx);
   4385
   4386                 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
   4387                                               rtl_tx_slots_avail(tp),
   4388                                               R8169_TX_START_THRS);
   4389                 /*
   4390                  * 8168 hack: TxPoll requests are lost when the Tx packets are
   4391                  * too close. Let's kick an extra TxPoll request when a burst
   4392                  * of start_xmit activity is detected (if it is not detected,
   4393                  * it is slow enough). -- FR
   4394                  * If skb is NULL then we come here again once a tx irq is
   4395                  * triggered after the last fragment is marked transmitted.
   4396                  */
 → 4397                 if (tp->cur_tx != dirty_tx && skb)
   4398                         rtl8169_doorbell(tp);
   4399         }
   4400 }

Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
line 4363:

   4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {

but the same solution is required for protecting the other access to tp->cur_tx:

 → 4397                 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
   4398                         rtl8169_doorbell(tp);

The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
might have suffered read tearing under some compiler optimisations.

The fix eliminated the KCSAN data-race report for this bug.

It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
and line 4397. This test should certainly not be cached by the compiler in some register
for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
4254 in the meantime.

Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/lkml/dc7fc8fa-4ea4-e9a9-30a6-7c83e6b53188@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/realtek/r8169_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index a9a0dca0c0305..5288daaf59b5b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4380,7 +4380,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
 		 * If skb is NULL then we come here again once a tx irq is
 		 * triggered after the last fragment is marked transmitted.
 		 */
-		if (tp->cur_tx != dirty_tx && skb)
+		if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
 			rtl8169_doorbell(tp);
 	}
 }
-- 
2.42.0




^ permalink raw reply related

* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
From: Larysa Zaremba @ 2023-10-31 17:32 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke
In-Reply-To: <ZT1nSGYng8sUKQD7@boxer>

On Sat, Oct 28, 2023 at 09:55:52PM +0200, Maciej Fijalkowski wrote:
> On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> > On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > > Usage of XDP hints requires putting additional information after the
> > > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > > metadata (cached time and VLAN protocol ID).
> > > > 
> > > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > > buffer does not contain any reliable information, so everything has to be
> > > > copied, damaging the performance.
> > > > 
> > > > Introduce a static key to enable meta sources assignment only when attached
> > > > XDP program is device-bound.
> > > > 
> > > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > > of addition of XDP hints to the driver.
> > > > 
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > @@ -210,6 +210,7 @@ enum ice_feature {
> > > >  };
> > > >  
> > > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > >  
> > > >  struct ice_channel {
> > > >  	struct list_head list;
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > index 47e8920e1727..ee0df86d34b7 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > > >  
> > > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > > +
> > > >  /**
> > > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > > >   * @hw: pointer to the device HW structure
> > > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > > >  	return -ENOMEM;
> > > >  }
> > > >  
> > > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > > +{
> > > > +	return prog && prog->aux->dev_bound;
> > > > +}
> > > > +
> > > >  /**
> > > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > > >   * @vsi: VSI to set the bpf prog on
> > > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > > >  	struct bpf_prog *old_prog;
> > > >  	int i;
> > > >  
> > > > +	if (ice_xdp_prog_has_meta(prog))
> > > > +		static_branch_inc(&ice_xdp_meta_key);
> > > 
> > > i thought boolean key would be enough but inc/dec should serve properly
> > > for example prog hotswap cases.
> > >
> > 
> > My thought process on using counting instead of boolean was: there can be 
> > several PFs that use the same driver, so therefore we need to keep track of how 
> > many od them use hints. 
> 
> Very good point. This implies that if PF0 has hints-enabled prog loaded,
> PF1 with non-hints prog will "suffer" from it.
> 
> Sorry for such a long delays in responses but I was having a hard time
> making up my mind about it. In the end I have come up to some conclusions.
> I know the timing for sending this response is not ideal, but I need to
> get this off my chest and bring discussion back to life:)
> 
> IMHO having static keys to eliminate ZC overhead does not scale. I assume
> every other driver would have to follow that.
> 
> XSK pool allows us to avoid initializing various things per each packet.
> Instead, taking xdp_rxq_info as an example, each xdp_buff from pool has
> xdp_rxq_info assigned at init time. With this in mind, we should have some
> mechanism to set hints-specific things in xdp_buff_xsk::cb, at init time
> as well. Such mechanism should not require us to expose driver's private
> xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.
> 
> Right now you moved phctime down to ice_pkt_ctx and to me that's the main
> reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
> the cached_phctime at original offset in ring but ice_pkt_ctx would get a
> pointer to that?
> 
> This would allow us to init the pointer in each xdp_buff from XSK pool at
> init time. I have come up with a way to program that via so called XSK
> meta descriptors. Each desc would have data to write onto cb, offset
> within cb and amount of bytes to write/copy.
> 
> I'll share the diff below but note that I didn't measure how much lower
> the performance is degraded. My icelake machine where I used to measure
> performance-sensitive code got broke. For now we can't escape initing
> eop_desc per each xdp_buff, but I moved it to alloc side, as we mangle
> descs there anyway.
> 
> I think mlx5 could benefit from that approach as well with initing the rq
> ptr at init time.
> 
> Diff does mostly these things:
> - move cached_phctime to old place in ice_rx_ring and add ptr to that in
>   ice_pkt_ctx
> - introduce xsk_pool_set_meta()
> - use it from ice side.
> 
> I consider this as a discussion trigger rather than ready code. Any
> feedback will be appreciated.
> 
> ---------------------------------8<---------------------------------
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 7fa43827a3f0..c192e84bee55 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -519,6 +519,23 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
>  	return 0;
>  }
>  

[...]

> > > 
> > > My only concern is that we might be hurting in a minor way hints path now,
> > > no?
> > 
> > I have thought "unlikely" refers to the default state the code is compiled with 
> > and after static key incrementation this should be patched to "likely". Isn't 
> > this how static keys work?
> 
> I was only referring to that it ends with compiler hint:
> #define unlikely_notrace(x)	__builtin_expect(!!(x), 0)
> 
> see include/linux/jump_label.h
> 

You are right, I have misunderstood the concept a little bit.

> > 
> > > 
> > > > +		ice_xdp_meta_set_desc(xdp, eop_desc);
> > > >  
> > > >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > >  	switch (act) {
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > index 39775bb6cec1..f92d7d33fde6 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > >  				   union ice_32b_rx_flex_desc *eop_desc,
> > > >  				   struct ice_rx_ring *rx_ring)
> > > >  {
> > > > +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> > > > +		return;
> > > 
> > > wouldn't it be better to pull it out and avoid calling
> > > ice_prepare_pkt_ctx_zc() unnecessarily?
> > > 
> > > > +
> > > >  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > >  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > > >  	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > -- 
> > > > 2.41.0
> > > > 

^ permalink raw reply

* [PATCH 6.5 049/112] r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
From: Greg Kroah-Hartman @ 2023-10-31 17:00 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Heiner Kallweit, nic_swsd,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marco Elver, netdev, Mirsad Goran Todorovac, Sasha Levin
In-Reply-To: <20231031165901.318222981@linuxfoundation.org>

6.5-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>

[ Upstream commit c1c0ce31b2420d5c173228a2132a492ede03d81f ]

KCSAN reported the following data-race:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll [r8169] / rtl8169_start_xmit [r8169]

write (marked) to 0xffff888102474b74 of 4 bytes by task 5358 on cpu 29:
rtl8169_start_xmit (drivers/net/ethernet/realtek/r8169_main.c:4254) r8169
dev_hard_start_xmit (./include/linux/netdevice.h:4889 ./include/linux/netdevice.h:4903 net/core/dev.c:3544 net/core/dev.c:3560)
sch_direct_xmit (net/sched/sch_generic.c:342)
__dev_queue_xmit (net/core/dev.c:3817 net/core/dev.c:4306)
ip_finish_output2 (./include/linux/netdevice.h:3082 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv4/ip_output.c:233)
__ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:293)
ip_finish_output (net/ipv4/ip_output.c:328)
ip_output (net/ipv4/ip_output.c:435)
ip_send_skb (./include/net/dst.h:458 net/ipv4/ip_output.c:127 net/ipv4/ip_output.c:1486)
udp_send_skb (net/ipv4/udp.c:963)
udp_sendmsg (net/ipv4/udp.c:1246)
inet_sendmsg (net/ipv4/af_inet.c:840 (discriminator 4))
sock_sendmsg (net/socket.c:730 net/socket.c:753)
__sys_sendto (net/socket.c:2177)
__x64_sys_sendto (net/socket.c:2185)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

read to 0xffff888102474b74 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4397 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
common_interrupt (arch/x86/kernel/irq.c:247 (discriminator 14))
asm_common_interrupt (./arch/x86/include/asm/idtentry.h:636)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0x002f4815 -> 0x002f4816

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

The write side of drivers/net/ethernet/realtek/r8169_main.c is:
==================
   4251         /* rtl_tx needs to see descriptor changes before updated tp->cur_tx */
   4252         smp_wmb();
   4253
 → 4254         WRITE_ONCE(tp->cur_tx, tp->cur_tx + frags + 1);
   4255
   4256         stop_queue = !netif_subqueue_maybe_stop(dev, 0, rtl_tx_slots_avail(tp),
   4257                                                 R8169_TX_STOP_THRS,
   4258                                                 R8169_TX_START_THRS);

The read side is the function rtl_tx():

   4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
   4356                    int budget)
   4357 {
   4358         unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
   4359         struct sk_buff *skb;
   4360
   4361         dirty_tx = tp->dirty_tx;
   4362
   4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {
   4364                 unsigned int entry = dirty_tx % NUM_TX_DESC;
   4365                 u32 status;
   4366
   4367                 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
   4368                 if (status & DescOwn)
   4369                         break;
   4370
   4371                 skb = tp->tx_skb[entry].skb;
   4372                 rtl8169_unmap_tx_skb(tp, entry);
   4373
   4374                 if (skb) {
   4375                         pkts_compl++;
   4376                         bytes_compl += skb->len;
   4377                         napi_consume_skb(skb, budget);
   4378                 }
   4379                 dirty_tx++;
   4380         }
   4381
   4382         if (tp->dirty_tx != dirty_tx) {
   4383                 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
   4384                 WRITE_ONCE(tp->dirty_tx, dirty_tx);
   4385
   4386                 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
   4387                                               rtl_tx_slots_avail(tp),
   4388                                               R8169_TX_START_THRS);
   4389                 /*
   4390                  * 8168 hack: TxPoll requests are lost when the Tx packets are
   4391                  * too close. Let's kick an extra TxPoll request when a burst
   4392                  * of start_xmit activity is detected (if it is not detected,
   4393                  * it is slow enough). -- FR
   4394                  * If skb is NULL then we come here again once a tx irq is
   4395                  * triggered after the last fragment is marked transmitted.
   4396                  */
 → 4397                 if (tp->cur_tx != dirty_tx && skb)
   4398                         rtl8169_doorbell(tp);
   4399         }
   4400 }

Obviously from the code, an earlier detected data-race for tp->cur_tx was fixed in the
line 4363:

   4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {

but the same solution is required for protecting the other access to tp->cur_tx:

 → 4397                 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
   4398                         rtl8169_doorbell(tp);

The write in the line 4254 is protected with WRITE_ONCE(), but the read in the line 4397
might have suffered read tearing under some compiler optimisations.

The fix eliminated the KCSAN data-race report for this bug.

It is yet to be evaluated what happens if tp->cur_tx changes between the test in line 4363
and line 4397. This test should certainly not be cached by the compiler in some register
for such a long time, while asynchronous writes to tp->cur_tx might have occurred in line
4254 in the meantime.

Fixes: 94d8a98e6235c ("r8169: reduce number of workaround doorbell rings")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/lkml/dc7fc8fa-4ea4-e9a9-30a6-7c83e6b53188@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/realtek/r8169_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 6351a2dc13bce..281aaa8518472 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4394,7 +4394,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
 		 * If skb is NULL then we come here again once a tx irq is
 		 * triggered after the last fragment is marked transmitted.
 		 */
-		if (tp->cur_tx != dirty_tx && skb)
+		if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
 			rtl8169_doorbell(tp);
 	}
 }
-- 
2.42.0




^ permalink raw reply related

* [PATCH 6.5 050/112] r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
From: Greg Kroah-Hartman @ 2023-10-31 17:00 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Heiner Kallweit, nic_swsd,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marco Elver, netdev, Mirsad Goran Todorovac, Sasha Levin
In-Reply-To: <20231031165901.318222981@linuxfoundation.org>

6.5-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>

[ Upstream commit dcf75a0f6bc136de94e88178ae5f51b7f879abc9 ]

KCSAN reported the following data-race:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4368 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169

race at unknown origin, with read to 0xffff888140d37570 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4368 drivers/net/ethernet/realtek/r8169_main.c:4581) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:645)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0xb0000042 -> 0x00000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

The read side is in

drivers/net/ethernet/realtek/r8169_main.c
=========================================
   4355 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
   4356                    int budget)
   4357 {
   4358         unsigned int dirty_tx, bytes_compl = 0, pkts_compl = 0;
   4359         struct sk_buff *skb;
   4360
   4361         dirty_tx = tp->dirty_tx;
   4362
   4363         while (READ_ONCE(tp->cur_tx) != dirty_tx) {
   4364                 unsigned int entry = dirty_tx % NUM_TX_DESC;
   4365                 u32 status;
   4366
 → 4367                 status = le32_to_cpu(tp->TxDescArray[entry].opts1);
   4368                 if (status & DescOwn)
   4369                         break;
   4370
   4371                 skb = tp->tx_skb[entry].skb;
   4372                 rtl8169_unmap_tx_skb(tp, entry);
   4373
   4374                 if (skb) {
   4375                         pkts_compl++;
   4376                         bytes_compl += skb->len;
   4377                         napi_consume_skb(skb, budget);
   4378                 }
   4379                 dirty_tx++;
   4380         }
   4381
   4382         if (tp->dirty_tx != dirty_tx) {
   4383                 dev_sw_netstats_tx_add(dev, pkts_compl, bytes_compl);
   4384                 WRITE_ONCE(tp->dirty_tx, dirty_tx);
   4385
   4386                 netif_subqueue_completed_wake(dev, 0, pkts_compl, bytes_compl,
   4387                                               rtl_tx_slots_avail(tp),
   4388                                               R8169_TX_START_THRS);
   4389                 /*
   4390                  * 8168 hack: TxPoll requests are lost when the Tx packets are
   4391                  * too close. Let's kick an extra TxPoll request when a burst
   4392                  * of start_xmit activity is detected (if it is not detected,
   4393                  * it is slow enough). -- FR
   4394                  * If skb is NULL then we come here again once a tx irq is
   4395                  * triggered after the last fragment is marked transmitted.
   4396                  */
   4397                 if (READ_ONCE(tp->cur_tx) != dirty_tx && skb)
   4398                         rtl8169_doorbell(tp);
   4399         }
   4400 }

tp->TxDescArray[entry].opts1 is reported to have a data-race and READ_ONCE() fixes
this KCSAN warning.

   4366
 → 4367                 status = le32_to_cpu(READ_ONCE(tp->TxDescArray[entry].opts1));
   4368                 if (status & DescOwn)
   4369                         break;
   4370

Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/lkml/dc7fc8fa-4ea4-e9a9-30a6-7c83e6b53188@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Acked-by: Marco Elver <elver@google.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/realtek/r8169_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 281aaa8518472..7e14a1d958c8e 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4364,7 +4364,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp,
 		unsigned int entry = dirty_tx % NUM_TX_DESC;
 		u32 status;
 
-		status = le32_to_cpu(tp->TxDescArray[entry].opts1);
+		status = le32_to_cpu(READ_ONCE(tp->TxDescArray[entry].opts1));
 		if (status & DescOwn)
 			break;
 
-- 
2.42.0




^ permalink raw reply related

* [PATCH 6.5 051/112] r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
From: Greg Kroah-Hartman @ 2023-10-31 17:00 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Marco Elver, Heiner Kallweit,
	nic_swsd, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Mirsad Goran Todorovac, Sasha Levin
In-Reply-To: <20231031165901.318222981@linuxfoundation.org>

6.5-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>

[ Upstream commit f97eee484e71890131f9c563c5cc6d5a69e4308d ]

KCSAN reported the following data-race bug:

==================================================================
BUG: KCSAN: data-race in rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4430 drivers/net/ethernet/realtek/r8169_main.c:4583) r8169

race at unknown origin, with read to 0xffff888117e43510 of 4 bytes by interrupt on cpu 21:
rtl8169_poll (drivers/net/ethernet/realtek/r8169_main.c:4430 drivers/net/ethernet/realtek/r8169_main.c:4583) r8169
__napi_poll (net/core/dev.c:6527)
net_rx_action (net/core/dev.c:6596 net/core/dev.c:6727)
__do_softirq (kernel/softirq.c:553)
__irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632)
irq_exit_rcu (kernel/softirq.c:647)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14))
asm_sysvec_apic_timer_interrupt (./arch/x86/include/asm/idtentry.h:645)
cpuidle_enter_state (drivers/cpuidle/cpuidle.c:291)
cpuidle_enter (drivers/cpuidle/cpuidle.c:390)
call_cpuidle (kernel/sched/idle.c:135)
do_idle (kernel/sched/idle.c:219 kernel/sched/idle.c:282)
cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:433)

value changed: 0x80003fff -> 0x3402805f

Reported by Kernel Concurrency Sanitizer on:
CPU: 21 PID: 0 Comm: swapper/21 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

drivers/net/ethernet/realtek/r8169_main.c:
==========================================
   4429
 → 4430                 status = le32_to_cpu(desc->opts1);
   4431                 if (status & DescOwn)
   4432                         break;
   4433
   4434                 /* This barrier is needed to keep us from reading
   4435                  * any other fields out of the Rx descriptor until
   4436                  * we know the status of DescOwn
   4437                  */
   4438                 dma_rmb();
   4439
   4440                 if (unlikely(status & RxRES)) {
   4441                         if (net_ratelimit())
   4442                                 netdev_warn(dev, "Rx ERROR. status = %08x\n",

Marco Elver explained that dma_rmb() doesn't prevent the compiler to tear up the access to
desc->opts1 which can be written to concurrently. READ_ONCE() should prevent that from
happening:

   4429
 → 4430                 status = le32_to_cpu(READ_ONCE(desc->opts1));
   4431                 if (status & DescOwn)
   4432                         break;
   4433

As the consequence of this fix, this KCSAN warning was eliminated.

Fixes: 6202806e7c03a ("r8169: drop member opts1_mask from struct rtl8169_private")
Suggested-by: Marco Elver <elver@google.com>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: nic_swsd@realtek.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/lkml/dc7fc8fa-4ea4-e9a9-30a6-7c83e6b53188@alu.unizg.hr/
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/realtek/r8169_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7e14a1d958c8e..361b90007148b 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4427,7 +4427,7 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, int budget
 		dma_addr_t addr;
 		u32 status;
 
-		status = le32_to_cpu(desc->opts1);
+		status = le32_to_cpu(READ_ONCE(desc->opts1));
 		if (status & DescOwn)
 			break;
 
-- 
2.42.0




^ permalink raw reply related

* [PATCH bpf-next v2] net, xdp: allow metadata > 32
From: Larysa Zaremba @ 2023-10-31 17:57 UTC (permalink / raw)
  To: bpf
  Cc: Larysa Zaremba, netdev, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Eric Dumazet, Magnus Karlsson, Willem de Bruijn, Yunsheng Lin,
	Maciej Fijalkowski, John Fastabend, Aleksander Lobakin

32 bytes may be not enough for some custom metadata. Relax the restriction,
allow metadata larger than 32 bytes and make __skb_metadata_differs() work
with bigger lengths.

Now size of metadata is only limited by the fact it is stored as u8
in skb_shared_info, so the upper limit is now is 255. Other important
conditions, such as having enough space for xdp_frame building, are already
checked in bpf_xdp_adjust_meta().

The requirement of having its length aligned to 4 bytes is still
valid.

Signed-off-by: Aleksander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
v1->v2: replaced 'typeof(metalen)' with the actual type
This patch was previously a part of an old BTF-based hints RFC.
Then it was included into "XDP metadata via kfuncs for ice":
https://lore.kernel.org/bpf/20230811161509.19722-1-larysa.zaremba@intel.com/
It is not longer needed in the series, but presents a useful change on its own.

 include/linux/skbuff.h | 13 ++++++++-----
 include/net/xdp.h      |  7 ++++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 97bfef071255..a361a9b8767c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4232,10 +4232,13 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
 {
 	const void *a = skb_metadata_end(skb_a);
 	const void *b = skb_metadata_end(skb_b);
-	/* Using more efficient varaiant than plain call to memcmp(). */
-#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
 	u64 diffs = 0;
 
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+	    BITS_PER_LONG != 64)
+		goto slow;
+
+	/* Using more efficient variant than plain call to memcmp(). */
 	switch (meta_len) {
 #define __it(x, op) (x -= sizeof(u##op))
 #define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op))
@@ -4255,11 +4258,11 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
 		fallthrough;
 	case  4: diffs |= __it_diff(a, b, 32);
 		break;
+	default:
+slow:
+		return memcmp(a - meta_len, b - meta_len, meta_len);
 	}
 	return diffs;
-#else
-	return memcmp(a - meta_len, b - meta_len, meta_len);
-#endif
 }
 
 static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 349c36fb5fd8..5d3673afc037 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -369,7 +369,12 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 
 static inline bool xdp_metalen_invalid(unsigned long metalen)
 {
-	return (metalen & (sizeof(__u32) - 1)) || (metalen > 32);
+	unsigned long meta_max;
+
+	meta_max = type_max(typeof_member(struct skb_shared_info, meta_len));
+	BUILD_BUG_ON(!__builtin_constant_p(meta_max));
+
+	return !IS_ALIGNED(metalen, sizeof(u32)) || metalen > meta_max;
 }
 
 struct xdp_attachment_info {
-- 
2.41.0


^ permalink raw reply related

* Re: [PATCH v2 05/12] net: stmmac: dwmac-starfive: Add support for JH7100 SoC
From: Cristian Ciocaltea @ 2023-10-31 18:07 UTC (permalink / raw)
  To: Emil Renner Berthing, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Emil Renner Berthing, Samin Guo, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Torgue, Jose Abreu,
	Maxime Coquelin, Richard Cochran, Giuseppe Cavallaro
  Cc: netdev, devicetree, linux-kernel, linux-riscv, linux-stm32,
	linux-arm-kernel, kernel
In-Reply-To: <CAJM55Z8K5QztgU9NYiJ1kv+-BSsgP=LCABN7BYDtQ30_G1Nc7w@mail.gmail.com>

On 10/31/23 16:33, Emil Renner Berthing wrote:
> Cristian Ciocaltea wrote:
>> Add a missing quirk to enable support for the StarFive JH7100 SoC.
>>
>> Additionally, for greater flexibility in operation, allow using the
>> rgmii-rxid and rgmii-txid phy modes.
>>
>> Co-developed-by: Emil Renner Berthing <kernel@esmil.dk>
>> Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
>> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
> 
> Hi Cristian,
> 
> Thanks for working on this! This driver has code to update the phy clock for
> different line speeds. I don't think that will work without the
> CLK_SET_RATE_PARENT flag added to the clock in [1] which in turn depends on
> [2].
> 
> [1]: https://github.com/esmil/linux/commit/b200c3054b58a49ba25af67aff82d9045e3c3666
> [2]: https://github.com/esmil/linux/commit/dce189542c16bf0eb8533d96c0305cb59d149dae
> 
> Two more comments below..

Hi Emil,

Thanks for the review!

I've been only testing this at 1000 Mbps and so far it seems to work
fine. I will try with different line speeds and report back.

>> ---
>>  drivers/net/ethernet/stmicro/stmmac/Kconfig   |  6 ++--
>>  .../ethernet/stmicro/stmmac/dwmac-starfive.c  | 32 ++++++++++++++++---
>>  2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
>> index a2b9e289aa36..c3c2c8360047 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
>> +++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
>> @@ -165,9 +165,9 @@ config DWMAC_STARFIVE
>>  	help
>>  	  Support for ethernet controllers on StarFive RISC-V SoCs
>>
>> -	  This selects the StarFive platform specific glue layer support for
>> -	  the stmmac device driver. This driver is used for StarFive JH7110
>> -	  ethernet controller.
>> +	  This selects the StarFive platform specific glue layer support
>> +	  for the stmmac device driver. This driver is used for the
>> +	  StarFive JH7100 and JH7110 ethernet controllers.
>>
>>  config DWMAC_STI
>>  	tristate "STi GMAC support"
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
>> index 5d630affb4d1..88c431edcea0 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
>> @@ -15,13 +15,20 @@
>>
>>  #include "stmmac_platform.h"
>>
>> -#define STARFIVE_DWMAC_PHY_INFT_RGMII	0x1
>> -#define STARFIVE_DWMAC_PHY_INFT_RMII	0x4
>> -#define STARFIVE_DWMAC_PHY_INFT_FIELD	0x7U
>> +#define STARFIVE_DWMAC_PHY_INFT_RGMII		0x1
>> +#define STARFIVE_DWMAC_PHY_INFT_RMII		0x4
>> +#define STARFIVE_DWMAC_PHY_INFT_FIELD		0x7U
>> +
>> +#define JH7100_SYSMAIN_REGISTER49_DLYCHAIN	0xc8
>> +
>> +struct starfive_dwmac_data {
>> +	unsigned int gtxclk_dlychain;
>> +};
>>
>>  struct starfive_dwmac {
>>  	struct device *dev;
>>  	struct clk *clk_tx;
>> +	const struct starfive_dwmac_data *data;
>>  };
>>
>>  static void starfive_dwmac_fix_mac_speed(void *priv, unsigned int speed, unsigned int mode)
>> @@ -67,6 +74,8 @@ static int starfive_dwmac_set_mode(struct plat_stmmacenet_data *plat_dat)
>>
>>  	case PHY_INTERFACE_MODE_RGMII:
>>  	case PHY_INTERFACE_MODE_RGMII_ID:
>> +	case PHY_INTERFACE_MODE_RGMII_RXID:
>> +	case PHY_INTERFACE_MODE_RGMII_TXID:
>>  		mode = STARFIVE_DWMAC_PHY_INFT_RGMII;
>>  		break;
>>
>> @@ -89,6 +98,14 @@ static int starfive_dwmac_set_mode(struct plat_stmmacenet_data *plat_dat)
>>  	if (err)
>>  		return dev_err_probe(dwmac->dev, err, "error setting phy mode\n");
>>
>> +	if (dwmac->data) {
> 
> I think you want something like this so future quirks don't need to touch this
> code:
> 
> 	if (dwmac->data && dwmac->data->gtxclk_dlychain)

Yes, but that would prevent having a quirk that explicitly wants to write 0.

I was initially thinking to something more generic, like providing a
list of register-value pairs, but I'm not sure this is going to be ever
needed.

I'm still open to apply the suggested change, though.

>> +		err = regmap_write(regmap, JH7100_SYSMAIN_REGISTER49_DLYCHAIN,
>> +				   dwmac->data->gtxclk_dlychain);
>> +		if (err)
>> +			return dev_err_probe(dwmac->dev, err,
>> +					     "error selecting gtxclk delay chain\n");
>> +	}
>> +
>>  	return 0;
>>  }
>>
>> @@ -114,6 +131,8 @@ static int starfive_dwmac_probe(struct platform_device *pdev)
>>  	if (!dwmac)
>>  		return -ENOMEM;
>>
>> +	dwmac->data = device_get_match_data(&pdev->dev);
>> +
>>  	dwmac->clk_tx = devm_clk_get_enabled(&pdev->dev, "tx");
>>  	if (IS_ERR(dwmac->clk_tx))
>>  		return dev_err_probe(&pdev->dev, PTR_ERR(dwmac->clk_tx),
>> @@ -144,8 +163,13 @@ static int starfive_dwmac_probe(struct platform_device *pdev)
>>  	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
>>  }
>>
>> +static const struct starfive_dwmac_data jh7100_data = {
>> +	.gtxclk_dlychain = 4
> 
> Please add a , at the end of this line. I know it's unlikely that we need to
> add more properties, but it's still good practice to do. This way such patches
> won't need to touch this line.

Sure, will do.

>> +};
>> +
>>  static const struct of_device_id starfive_dwmac_match[] = {
>> -	{ .compatible = "starfive,jh7110-dwmac"	},
>> +	{ .compatible = "starfive,jh7100-dwmac", .data = &jh7100_data },
>> +	{ .compatible = "starfive,jh7110-dwmac" },
>>  	{ /* sentinel */ }
>>  };
>>  MODULE_DEVICE_TABLE(of, starfive_dwmac_match);
>> --
>> 2.42.0
>>

^ permalink raw reply

* Re: [PATCH v2 07/12] riscv: dts: starfive: jh7100: Add ccache DT node
From: Cristian Ciocaltea @ 2023-10-31 19:01 UTC (permalink / raw)
  To: Emil Renner Berthing, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Emil Renner Berthing, Samin Guo, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Torgue, Jose Abreu,
	Maxime Coquelin, Richard Cochran, Giuseppe Cavallaro
  Cc: netdev, devicetree, linux-kernel, linux-riscv, linux-stm32,
	linux-arm-kernel, kernel
In-Reply-To: <CAJM55Z8D12XoRG4WGaf=PG0_yp7d_xk9EhOk7bnCKQRMok9eBA@mail.gmail.com>

On 10/31/23 16:38, Emil Renner Berthing wrote:
> Cristian Ciocaltea wrote:
>> Provide a DT node for the SiFive Composable Cache controller found on
>> the StarFive JH7100 SoC.
>>
>> Note this is also used to support non-coherent DMA, via the
>> sifive,cache-ops cache flushing operations.
> 
> This property is no longer needed:
> https://lore.kernel.org/linux-riscv/20231031141444.53426-1-emil.renner.berthing@canonical.com/

Thanks for the heads up! I actually noticed that from v1 reviews and was
just waiting for v2. :)

> Also it would be nice to mention that these nodes are copied from my
> visionfive patches ;)

Ups, sorry about that! Those were initially taken from a patch adding a
full DT (the repo is mentioned in the cover letter) with many
contributors mentioned, without being clear who did what. That's why I
didn't provide a Co-developed-by tag and, unfortunately, I also missed
to add it in v2 (will handle this in v3 and also provide the link to the
new repo), but I'm still not sure about the gmac stuff.

Thanks,
Cristian

>>
>> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
>> ---
>>  arch/riscv/boot/dts/starfive/jh7100.dtsi | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/arch/riscv/boot/dts/starfive/jh7100.dtsi b/arch/riscv/boot/dts/starfive/jh7100.dtsi
>> index 06bb157ce111..a8a5bb00b0d8 100644
>> --- a/arch/riscv/boot/dts/starfive/jh7100.dtsi
>> +++ b/arch/riscv/boot/dts/starfive/jh7100.dtsi
>> @@ -32,6 +32,7 @@ U74_0: cpu@0 {
>>  			i-tlb-sets = <1>;
>>  			i-tlb-size = <32>;
>>  			mmu-type = "riscv,sv39";
>> +			next-level-cache = <&ccache>;
>>  			riscv,isa = "rv64imafdc";
>>  			riscv,isa-base = "rv64i";
>>  			riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "zicntr", "zicsr",
>> @@ -60,6 +61,7 @@ U74_1: cpu@1 {
>>  			i-tlb-sets = <1>;
>>  			i-tlb-size = <32>;
>>  			mmu-type = "riscv,sv39";
>> +			next-level-cache = <&ccache>;
>>  			riscv,isa = "rv64imafdc";
>>  			riscv,isa-base = "rv64i";
>>  			riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "zicntr", "zicsr",
>> @@ -147,6 +149,18 @@ soc {
>>  		dma-noncoherent;
>>  		ranges;
>>
>> +		ccache: cache-controller@2010000 {
>> +			compatible = "starfive,jh7100-ccache", "sifive,ccache0", "cache";
>> +			reg = <0x0 0x2010000 0x0 0x1000>;
>> +			interrupts = <128>, <130>, <131>, <129>;
>> +			cache-block-size = <64>;
>> +			cache-level = <2>;
>> +			cache-sets = <2048>;
>> +			cache-size = <2097152>;
>> +			cache-unified;
>> +			sifive,cache-ops;
>> +		};
>> +
>>  		clint: clint@2000000 {
>>  			compatible = "starfive,jh7100-clint", "sifive,clint0";
>>  			reg = <0x0 0x2000000 0x0 0x10000>;
>> --
>> 2.42.0
>>

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Linus Walleij @ 2023-10-31 19:02 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20231031163439.tqab5axhk5q2r62i@skbuf>

On Tue, Oct 31, 2023 at 5:34 PM Vladimir Oltean <olteanv@gmail.com> wrote:

> Ok, so we don't have a confirmation of breakage with other conduit
> interface than the Gemini driver, either. So a problem there is still
> not off the table.

True!

> So on the gemini-dlink-dir-685.dts platform, you can also use &gmac1 as
> a plain Ethernet port, right?

As a port it exist on the SoC yes but it is not connected physically
to anything.

&gmac0 is connected to the switch, and the switch has all the PHYs.

(I don't know if I misunderstand the question...)

> If possible, could you set up dsa_loop (enable CONFIG_NET_DSA_LOOP, replace
> "eth0" in dsa_loop_pdata with the netdev name of &gmac1, replace DSA_TAG_PROTO_NONE
> in dsa_loop_get_protocol() with your tagging protocol) and put a tcpdump
> on the remote end of the gmac1 port, to see if the issue isn't, in fact,
> somewhere else, maybe gmac_start_xmit()?

If you by remote end mean the end of a physical cable there is
no way I can do that, as I have no PHY on gmac1.

But I have other Gemini platforms, so I will try to do it on one
of them! Let's see if I can do this thing....

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH v2 08/12] riscv: dts: starfive: Add pool for coherent DMA memory on JH7100 boards
From: Cristian Ciocaltea @ 2023-10-31 19:16 UTC (permalink / raw)
  To: Emil Renner Berthing, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Emil Renner Berthing, Samin Guo, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Torgue, Jose Abreu,
	Maxime Coquelin, Richard Cochran, Giuseppe Cavallaro
  Cc: netdev, devicetree, linux-kernel, linux-riscv, linux-stm32,
	linux-arm-kernel, kernel
In-Reply-To: <CAJM55Z_2hdsvw8gdYLs2kZbRrH6xcM6+xCZn8BCf5zsWYyhY3w@mail.gmail.com>

On 10/31/23 16:40, Emil Renner Berthing wrote:
> Cristian Ciocaltea wrote:
>> From: Emil Renner Berthing <emil.renner.berthing@canonical.com>
>>
>> The StarFive JH7100 SoC has non-coherent device DMAs, but most drivers
>> expect to be able to allocate coherent memory for DMA descriptors and
>> such. However on the JH7100 DDR memory appears twice in the physical
>> memory map, once cached and once uncached:
>>
>>   0x00_8000_0000 - 0x08_7fff_ffff : Off chip DDR memory, cached
>>   0x10_0000_0000 - 0x17_ffff_ffff : Off chip DDR memory, uncached
>>
>> To use this uncached region we create a global DMA memory pool there and
>> reserve the corresponding area in the cached region.
>>
>> However the uncached region is fully above the 32bit address limit, so add
>> a dma-ranges map so the DMA address used for peripherals is still in the
>> regular cached region below the limit.
> 
> Adding these nodes to the device tree won't actually do anything without
> enabling CONFIG_DMA_GLOBAL_POOL as is done here:
> 
> https://github.com/esmil/linux/commit/e14ad9ff67fd51dcc76415d4cc7f3a30ffcba379

Should I pick this up for v3 or maybe it would be better to be handled
as part of your ccache series?

Thanks,
Cristian

>>
>> Link: https://github.com/starfive-tech/JH7100_Docs/blob/main/JH7100%20Data%20Sheet%20V01.01.04-EN%20(4-21-2021).pdf
>> Signed-off-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
>> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
>> ---
>>  .../boot/dts/starfive/jh7100-common.dtsi      | 24 +++++++++++++++++++
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/arch/riscv/boot/dts/starfive/jh7100-common.dtsi b/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
>> index b93ce351a90f..504c73f01f14 100644
>> --- a/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
>> +++ b/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
>> @@ -39,6 +39,30 @@ led-ack {
>>  			label = "ack";
>>  		};
>>  	};
>> +
>> +	reserved-memory {
>> +		#address-cells = <2>;
>> +		#size-cells = <2>;
>> +		ranges;
>> +
>> +		dma-reserved {
>> +			reg = <0x0 0xfa000000 0x0 0x1000000>;
>> +			no-map;
>> +		};
>> +
>> +		linux,dma {
>> +			compatible = "shared-dma-pool";
>> +			reg = <0x10 0x7a000000 0x0 0x1000000>;
>> +			no-map;
>> +			linux,dma-default;
>> +		};
>> +	};
>> +
>> +	soc {
>> +		dma-ranges = <0x00 0x80000000 0x00 0x80000000 0x00 0x7a000000>,
>> +			     <0x00 0xfa000000 0x10 0x7a000000 0x00 0x01000000>,
>> +			     <0x00 0xfb000000 0x00 0xfb000000 0x07 0x85000000>;
>> +	};
>>  };
>>
>>  &gpio {
>> --
>> 2.42.0
>>

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Luiz Angelo Daros de Luca @ 2023-10-31 19:18 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Vladimir Oltean, Andrew Lunn, Florian Fainelli, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <CACRpkdbLTNVJusuCw2hrHDzx5odw8vw8hMWvvvvgEPsAFwB8hg@mail.gmail.com>

> I don't have any other RTL8366RB systems than the D-Link DIR-685.
>
> I however have several systems with the same backing ethernet controller
> connected directly to a PHY and they all work fine.

Hi Linus,

I ported TL-WR1043nd to DSA using RTL8366RB on OpenWrt main. Do you
need some help testing the switch?

I just need to test ping with different sizes?

Regards,

Luiz

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Linus Walleij @ 2023-10-31 19:27 UTC (permalink / raw)
  To: Luiz Angelo Daros de Luca
  Cc: Vladimir Oltean, Andrew Lunn, Florian Fainelli, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <CAJq09z4+3g7-h5asYPs_3g4e9NbPnxZQK+NxggYXGGxO+oHU1g@mail.gmail.com>

On Tue, Oct 31, 2023 at 8:18 PM Luiz Angelo Daros de Luca
<luizluca@gmail.com> wrote:

> > I don't have any other RTL8366RB systems than the D-Link DIR-685.
> >
> > I however have several systems with the same backing ethernet controller
> > connected directly to a PHY and they all work fine.
>
> Hi Linus,
>
> I ported TL-WR1043nd to DSA using RTL8366RB on OpenWrt main. Do you
> need some help testing the switch?

Yes!

> I just need to test ping with different sizes?

Yes try to ping the host from the router:

ping -s 1472 192.168.1.1 or so to send a 1500 byte ping packet,
which will be padded up to a 1518 byte ethernet frame and
1522 bytes from the conduit interface.

Then if it doesn't work, see if this patch solves the issue!

Yours,
Linus Walleij

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox