* Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
From: Álvaro G. M. @ 2025-04-01 10:52 UTC
To: netdev

Hello,

I have a custom PCB fitted with an AMD/Xilinx Artix 7 FPGA running a MicroBlaze design that uses Xilinx's AXI 1G/2.5G Ethernet Subsystem connected via DMA.

This board and HDL design have been tested and in production since 2016 using kernel 4.4.43 without any issue. The Ethernet PHY is a DP83620 running in 100base-FX mode, which back in the day required a small patch to dp83848.c from me that has been in the kernel since.

I am now trying to upgrade to a recent kernel (v6.13) and I'm facing some strange behavior of the Ethernet system. The most probable cause is a misconfiguration on my part of the device tree, since things have changed since then and I've found the device tree documentation confusing, but I can't rule out some kind of bug, for I have never seen anything similar to this.

Relevant boot messages:

xilinx_axienet 40c00000.ethernet eth0: PHY [axienet-40c00000:01] driver [TI DP83620 10/100 Mbps PHY] (irq=POLL)
xilinx_axienet 40c00000.ethernet eth0: configuring for phy/mii link mode
xilinx_axienet 40c00000.ethernet eth0: Link is Up - 100Mbps/Half - flow control off

Now, transmission from the MicroBlaze seems to work fine, but reception does not. I run tcpdump on the MicroBlaze and I can see that some kind of buffering is occurring: a single ARP packet sent from my directly connected computer won't reach tcpdump unless I also send a big chunk of data via, for example, multicast, or after enough time of ping flooding.

It's not, however, a matter of sending a big chunk of data at the beginning: it seems like the buffer empties once full and the process starts over again, so a single ping packet won't be received after the buffer has emptied.

I can see that interrupts increase, but not as fast as they do with the old kernel. For example, in the ping case, kernel 4.4.43 will report one interrupt for each single ping packet received with ping -c 1 (so no coalescing shenanigans can occur), but the new kernel won't show any increase in the number of interrupts, which means that the DMA core is either not generating the IRQ for some reason or isn't even executing the DMA transfer at all.

Output packets, however, do seem to be sent expeditiously and are received on my working computer as soon as I send them from the MicroBlaze.

I guess I may have made some mistake in upgrading the DTS to the new format, although I've tried the two available methods (either setting node "dmas" or using the "axistream-connected" property) and both methods result in the same boot messages and behavior.

By crafting properly sized UDP multicast packets (so I don't have to rely on ARP, which isn't working due to timeouts), I've been able to determine that I need to send 131072 bytes before reception can truly occur, although it somehow seems that sending multicast UDP packets won't trigger the receive IRQ unless I have a specific UDP listener program running on the MicroBlaze. I'm quite confused about that too.

So please, if anyone could inspect the DTS for me and/or guide me on how to debug this, I'd be grateful.

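The byte-threshold probe described above could be reproduced from the host side with something like the following. This is only a minimal sketch under stated assumptions: the helper names `plan_payloads` and `send_burst` are hypothetical, the multicast group and port are whatever the listener on the MicroBlaze has joined, and the sketch counts UDP payload bytes only (the message does not say whether 131072 refers to payload or on-wire bytes).

```python
import socket


def plan_payloads(total_bytes, max_payload=1024):
    """Split total_bytes into datagram payload sizes of at most max_payload bytes."""
    if total_bytes < 0 or max_payload <= 0:
        raise ValueError("total_bytes must be >= 0 and max_payload must be > 0")
    full, rest = divmod(total_bytes, max_payload)
    # 'full' maximum-size datagrams, plus one shorter datagram for the remainder.
    return [max_payload] * full + ([rest] if rest else [])


def send_burst(group, port, sizes, ttl=1):
    """Send one UDP multicast datagram per entry in sizes.

    Returns the total number of payload bytes handed to the socket.
    group/port are assumptions: use the group the listener actually joined.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Keep the datagrams on the local segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        for size in sizes:
            sent += sock.sendto(b"\x00" * size, (group, port))
    return sent
```

Calling, for example, `send_burst("239.1.2.3", 5000, plan_payloads(131072))` (group and port are placeholders) while watching tcpdump and /proc/interrupts on the target, and then repeating with smaller totals, gives a repeatable way to check where reception starts.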
These are the relevant parts of the DTS for kernel 6.13, which I've handcrafted with help from Documentation/devicetree/bindings and by peeking at xilinx_axienet_main.c:

axi_ethernet_0_dma: dma@41e00000 {
    compatible = "xlnx,axi-dma-1.00.a";
    #dma-cells = <1>;
    reg = <0x41e00000 0x10000>;
    interrupt-parent = <&microblaze_0_axi_intc>;
    interrupts = <7 1 8 1>;
    xlnx,addrwidth = <32>;
    xlnx,datawidth = <32>;
    xlnx,include-sg;
    xlnx,sg-length-width = <16>;
    xlnx,include-dre = <1>;
    xlnx,axistream-connected = <1>;
    xlnx,irq-delay = <1>;
    dma-channels = <2>;
    clock-names = "s_axi_lite_aclk", "m_axi_mm2s_aclk", "m_axi_s2mm_aclk", "m_axi_sg_aclk";
    clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
    dma-channel@41e00000 {
        compatible = "xlnx,axi-dma-mm2s-channel";
        xlnx,include-dre = <1>;
        interrupts = <7 1>;
        xlnx,datawidth = <32>;
    };
    dma-channel@41e00030 {
        compatible = "xlnx,axi-dma-s2mm-channel";
        xlnx,include-dre = <1>;
        interrupts = <8 1>;
        xlnx,datawidth = <32>;
    };
};
axi_ethernet_eth: ethernet@40c00000 {
    compatible = "xlnx,axi-ethernet-1.00.a";
    reg = <0x40c00000 0x40000>, <0x41e00000 0x10000>;
    phy-handle = <&phy1>;
    xlnx,rxmem = <0x1000>;
    phy-mode = "mii";
    xlnx,txcsum = <0x2>;
    xlnx,rxcsum = <0x2>;
    clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk";
    clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
    /* axistream-connected = <&axi_ethernet_0_dma>; */
    dmas = <&axi_ethernet_0_dma 0>, <&axi_ethernet_0_dma 1>;
    dma-names = "tx_chan0", "rx_chan0";
    mdio {
        #address-cells = <1>;
        #size-cells = <0>;
        phy1: ethernet-phy@1 {
            device_type = "ethernet-phy";
            reg = <1>;
        };
    };
};

And these are the same parts of the DTS for kernel 4.4.43, which worked fine. These were created with help from Xilinx tools.

axi_ethernet_0_dma: dma@41e00000 {
    #dma-cells = <1>;
    compatible = "xlnx,axi-dma-1.00.a";
    interrupt-parent = <&microblaze_0_axi_intc>;
    interrupts = <7 1 8 1>;
    reg = <0x41e00000 0x10000>;
    xlnx,include-sg ;
    dma-channel@41e00000 {
        compatible = "xlnx,axi-dma-mm2s-channel";
        dma-channels = <0x1>;
        interrupts = <7 1>;
        xlnx,datawidth = <0x8>;
        xlnx,device-id = <0x0>;
    };
    dma-channel@41e00030 {
        compatible = "xlnx,axi-dma-s2mm-channel";
        dma-channels = <0x1>;
        interrupts = <8 1>;
        xlnx,datawidth = <0x8>;
        xlnx,device-id = <0x0>;
    };
};
axi_ethernet_eth: ethernet@40c00000 {
    axistream-connected = <&axi_ethernet_0_dma>;
    axistream-control-connected = <&axi_ethernet_0_dma>;
    clock-frequency = <83250000>;
    clocks = <&clk_bus_0>;
    compatible = "xlnx,axi-ethernet-1.00.a";
    device_type = "network";
    interrupt-parent = <&microblaze_0_axi_intc>;
    interrupts = <3 0>;
    phy-mode = "mii";
    reg = <0x40c00000 0x40000>;
    xlnx = <0x0>;
    xlnx,axiliteclkrate = <0x0>;
    xlnx,axisclkrate = <0x0>;
    xlnx,gt-type = <0x0>;
    xlnx,gtinex = <0x0>;
    xlnx,phy-type = <0x0>;
    xlnx,phyaddr = <0x1>;
    xlnx,rable = <0x0>;
    xlnx,rxcsum = <0x2>;
    xlnx,rxlane0-placement = <0x0>;
    xlnx,rxlane1-placement = <0x0>;
    xlnx,rxmem = <0x1000>;
    xlnx,rxnibblebitslice0used = <0x1>;
    xlnx,tx-in-upper-nibble = <0x1>;
    xlnx,txcsum = <0x2>;
    xlnx,txlane0-placement = <0x0>;
    xlnx,txlane1-placement = <0x0>;
    phy-handle = <&phy0>;
    axi_ethernetlite_0_mdio: mdio {
        #address-cells = <1>;
        #size-cells = <0>;
        phy0: phy@1 {
            device_type = "ethernet-phy";
            reg = <1>;
            ti,rx-internal-delay = <7>;
            ti,tx-internal-delay = <7>;
            ti,fifo-depth = <1>;
        };
    };
};

Best regards,
--
Álvaro G. M.

* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
From: Jakub Kicinski @ 2025-04-02 17:00 UTC
To: Álvaro "G. M."; +Cc: netdev, Radhey Shyam Pandey

+CC Radhey, maintainer of axienet

* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
From: Álvaro G. M. @ 2025-04-03 5:44 UTC
To: Jakub Kicinski; +Cc: netdev, Radhey Shyam Pandey

Hi,

On Wed, 2025-04-02 at 10:00 -0700, Jakub Kicinski wrote:
> +CC Radhey, maintainer of axienet

Thanks, I don't know why I didn't think of that.

So, I can provide a little more information, and I now definitely believe there are some issues with this driver.

> On Tue, 01 Apr 2025 12:52:15 +0200 Álvaro "G. M." wrote:
> > I guess I may have made some mistake in upgrading the DTS to the new format, although
> > I've tried the two available methods (either setting node "dmas" or using "axistream-connected"
> > property) and both methods result in the same boot messages and behavior.

This turned out not to be true; I'm sorry for the confusion. Using the "dmas" entry enables use_dmaengine and produces the effect I explained: data is only received after a 2^17-byte buffer is filled.

If I remove the "dmas" entry and provide an "axistream-connected" one, things get a little better (but see the end for some DTS notes). In this mode, which uses the legacy DMA code inside axienet itself rather than dmaengine, tcpdump -vv shows packets arriving at a normal rate.
However, the system is not answering ARP requests:

00:02:37.800814 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
00:02:38.801974 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
00:02:39.804137 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
00:02:40.806434 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
00:02:41.808084 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
00:02:42.810592 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
00:02:43.813155 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46

Here's the normal answer from a second device running the old 4.4.43 kernel, connected to the same switch:

00:21:12.057326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 tell 10.188.139.1, length 46
00:21:12.057905 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at 06:00:0a:bc:8c:01 (oui Unknown), length 28
00:21:13.059460 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 tell 10.188.139.1, length 46
00:21:13.060031 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at 06:00:0a:bc:8c:01 (oui Unknown), length 28
00:21:14.060502 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 tell 10.188.139.1, length 46
00:21:14.061051 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at 06:00:0a:bc:8c:01 (oui Unknown), length 28

The funny thing is that once I manually add ARP entries on both my computer and the embedded one, I can establish full TCP communication and iperf3 shows a relatively nice speed, similar to the throughput I get with the old 4.4.43 kernel.

# arp -s 10.188.139.1 f4:4d:ad:02:11:29
# iperf3 -c 10.188.139.1
Connecting to host 10.188.139.1, port 5201
[  5] local 10.188.140.2 port 55480 connected to 10.188.139.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec  3.63 MBytes  30.1 Mbits/sec    0    130 KBytes
[  5]   1.01-2.01   sec  3.75 MBytes  31.5 Mbits/sec    0    130 KBytes
[  5]   2.01-3.01   sec  3.63 MBytes  30.4 Mbits/sec    0    130 KBytes
[  5]   3.01-4.01   sec  3.75 MBytes  31.4 Mbits/sec    0    130 KBytes
[  5]   4.01-5.01   sec  3.75 MBytes  31.4 Mbits/sec    0    130 KBytes
[  5]   5.01-6.01   sec  3.75 MBytes  31.5 Mbits/sec    0    130 KBytes
[  5]   6.01-7.01   sec  3.75 MBytes  31.6 Mbits/sec    0    130 KBytes
[  5]   7.01-7.75   sec  2.63 MBytes  29.5 Mbits/sec    0    130 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-7.75   sec  28.6 MBytes  31.0 Mbits/sec    0             sender
[  5]   0.00-7.75   sec  0.00 Bytes   0.00 bits/sec                  receiver
iperf3: interrupt - the client has terminated

# iperf3 -c 10.188.139.1 -R
Connecting to host 10.188.139.1, port 5201
Reverse mode, remote host 10.188.139.1 is sending
[  5] local 10.188.140.2 port 45582 connected to 10.188.139.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.03   sec  5.13 MBytes  41.9 Mbits/sec
[  5]   1.03-2.03   sec  5.38 MBytes  44.8 Mbits/sec
[  5]   2.03-3.02   sec  5.38 MBytes  45.6 Mbits/sec
[  5]   3.02-4.02   sec  5.38 MBytes  45.2 Mbits/sec
[  5]   4.02-5.01   sec  5.38 MBytes  45.4 Mbits/sec
[  5]   5.01-5.30   sec  1.50 MBytes  43.2 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-5.30   sec  0.00 Bytes   0.00 bits/sec                  sender
[  5]   0.00-5.30   sec  28.1 MBytes  44.5 Mbits/sec                 receiver
iperf3: interrupt - the client has terminated

I had never seen a device able to fully establish communication except for replying to ARP requests, so I'm not sure what's happening here.

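To compare captures like the two above without reading them line by line, the request/reply pairing can be checked mechanically. The following is a rough sketch, not part of any existing tool: `unanswered_arp` is a hypothetical helper, and the regular expressions assume the tcpdump text format shown in this message ("Request who-has X tell Y" and "Reply X is-at MAC").

```python
import re
from collections import Counter

# Patterns assume the tcpdump ARP text format quoted in this thread.
REQUEST = re.compile(r"Request who-has (\S+) tell ([^\s,]+)")
REPLY = re.compile(r"Reply (\S+) is-at (\S+)")


def unanswered_arp(lines):
    """Return {target_ip: count} for ARP requests that never got a reply.

    Each Reply from an address cancels one earlier Request for that address.
    """
    pending = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            pending[match.group(1)] += 1
            continue
        match = REPLY.search(line)
        if match and pending[match.group(1)] > 0:
            pending[match.group(1)] -= 1
    return {ip: count for ip, count in pending.items() if count > 0}
```

Fed with the first capture, this reports seven unanswered requests for 10.188.140.2; fed with the second, it reports none, which is the difference between the two kernels described above.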
On the other hand, and since I don't know how to debug this ARP issue, I went back to see if I could diagnose what's happening in DMA Engine mode, so I peeked at the code and saw an asymmetry between RX and TX, which sounded promising given that in dmaengine mode TX works perfectly (or so it seems) and RX is heavily buffered. This asymmetry lies precisely in the number of SG blocks and the number of skb buffers.

Both bd_nums are defined in the same way:

    lp->rx_bd_num = RX_BD_NUM_DEFAULT; // = 1024
    lp->tx_bd_num = TX_BD_NUM_DEFAULT; // = 128

But the skb ring size is defined in a different fashion:

    lp->tx_skb_ring = kcalloc(TX_BD_NUM_MAX, sizeof(*lp->tx_skb_ring), // = 4096
                              GFP_KERNEL);
    ...
    lp->rx_skb_ring = kcalloc(RX_BUF_NUM_DEFAULT, sizeof(*lp->rx_skb_ring), // = 128
                              GFP_KERNEL);

So, for TX we allocate space for up to 4096 buffers but by default use 128. For RX we allocate space for 128 buffers but somehow are setting 1024 as the default bd number.

The fact that RX_BD_NUM_DEFAULT is used nowhere else is also a sign that there was some mistake here, so I went and replaced all RX_BUF_NUM_DEFAULT occurrences with RX_BD_NUM_DEFAULT, so that both TX and RX skb rings are declared and operated on using the same strategy:

    sed -i '/^#define/!s#RX_BUF_NUM_DEFAULT#RX_BD_NUM_MAX#g' xilinx_axienet_main.c

Doing this solved the buffering problem, although the system still doesn't reply to ARP requests, and when I tried to run an iperf3 test after manually adding ARP entries, the kernel segfaulted (so I probably shouldn't have blindly 'sed' like that :)

# iperf3 -c 10.188.139.1
Connecting to host 10.188.139.1, port 5201
[  5] local 10.188.140.2 port 46356 connected to 10.188.139.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.01   sec   640 KBytes  5.18 Mbits/sec    3   84.8 KBykernel task_size exceed
Oops: Exception in kernel mode, sig: 11
CPU: 0 UID: 0 PID: 147 Comm: iperf3 Not tainted 6.13.8 #13
 Registers dump: mode=8269B900
 r1=00000000, r2=00000000, r3=00000000, r4=00000010
 r5=00000000, r6=000005F2, r7=FFFF7FFF, r8=00000000
 r9=00000000, r10=00000000, r11=00000000, r12=CF5FF24C
 r13=00000000, r14=C241AB70, r15=C0383EB8, r16=00000000
 r17=C0383EC0, r18=000005F0, r19=C10124A0, r20=480F8520
 r21=4831F960, r22=00000000, r23=00000000, r24=FFFFFFEA
 r25=C12BE0A8, r26=C12BE03C, r27=C12BE020, r28=00000122
 r29=00000100, r30=000065A2, r31=C120F780, rPC=C0383EC0
 msr=000046A2, ear=FFFFFFFA, esr=00000312, fsr=00000000
Kernel panic - not syncing: Aiee, killing interrupt handler!
---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]---
tes

I couldn't see what was wrong with the new code, so I just went and changed the RX_BD_NUM_DEFAULT value from 1024 down to 128, so it's now the same size as its TX counterpart, but the kernel segfaulted again when trying to measure throughput. Sadly, my kernel debugging abilities are not much stronger than this, so I'm stuck at this point but firmly believe there's something wrong here, although I can't see what it is.

Any help will be greatly appreciated.

DTS NOTES:

Using the old DMA code inside xilinx_axienet_main.c requires removing the "dmas" entry and adding a reference to the DMA device, either via axistream-connected or by adding resources manually to the node. Referring to the node linked by axistream-connected requires a DMA node to exist, but its compatible string can't be xlnx,axi-dma-1.00.a, because then the AXI DMA driver will lock onto it and axienet will complain about the device being busy. So my solution for this is to use a compatible string that doesn't match.
As such, with the following DTS I can establish TCP connections as long as ARP entries are manually added:

axi_ethernet_0_dma: dma@41e00000 {
    /* NOTE THE NOT */
    compatible = "notxlnx,axi-dma-1.00.a";
    #dma-cells = <1>;
    reg = <0x41e00000 0x10000>;
    interrupt-parent = <&microblaze_0_axi_intc>;
    interrupts = <7 1 8 1>;
    xlnx,addrwidth = <32>; // Address size in bits
    xlnx,datawidth = <32>;
    xlnx,include-sg;
    xlnx,sg-length-width = <16>;
    xlnx,include-dre = <1>;
    xlnx,axistream-connected = <1>;
    xlnx,irq-delay = <1>;
    dma-channels = <2>;
    clock-names = "s_axi_lite_aclk", "m_axi_mm2s_aclk", "m_axi_s2mm_aclk", "m_axi_sg_aclk";
    clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
    dma-channel@41e00000 {
        compatible = "xlnx,axi-dma-mm2s-channel";
        xlnx,include-dre = <1>;
        interrupts = <7 1>;
        xlnx,datawidth = <32>;
    };
    dma-channel@41e00030 {
        compatible = "xlnx,axi-dma-s2mm-channel";
        xlnx,include-dre = <1>;
        interrupts = <8 1>;
        xlnx,datawidth = <32>;
    };
};
axi_ethernet_eth: ethernet@40c00000 {
    compatible = "xlnx,axi-ethernet-1.00.a";
    reg = <0x40c00000 0x40000>;
    phy-handle = <&phy1>;
    interrupt-parent = <&microblaze_0_axi_intc>;
    interrupts = <3 0>;
    xlnx,rxmem = <0x1000>;
    max-speed = <100000>;
    phy-mode = "mii";
    xlnx,txcsum = <0x2>;
    xlnx,rxcsum = <0x2>;
    clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk";
    clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
    axistream-connected = <&axi_ethernet_0_dma>;
    dma-names = "tx_chan0", "rx_chan0";
    mdio {
        #address-cells = <1>;
        #size-cells = <0>;
        phy1: ethernet-phy@1 {
            device_type = "ethernet-phy";
            reg = <1>;
        };
    };
};

So this mode of working would definitely NOT need AXI DMA, and this hack with the compatible string should not be needed if the dependency on AXI DMA were removed.

Best regards,
--
Álvaro G. M.

* RE: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
From: Pandey, Radhey Shyam @ 2025-04-03 5:54 UTC
To: Álvaro G. M., Jakub Kicinski
Cc: netdev@vger.kernel.org, Katakam, Harini, Gupta, Suraj

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Álvaro G. M. <alvaro.gamez@hazent.com>
> Sent: Thursday, April 3, 2025 11:15 AM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: netdev@vger.kernel.org; Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>
> Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
>
> Hi
>
> On Wed, 2025-04-02 at 10:00 -0700, Jakub Kicinski wrote:
> > +CC Radhey, maintainer of axienet
>
> Thanks, I don't know why I didn't think of that.
>
> So, I can provide a little more information and I definitely believe now there are some issues with this driver.
>
> > On Tue, 01 Apr 2025 12:52:15 +0200 Álvaro "G. M." wrote:
> > > I guess I may have made some mistake in upgrading the DTS to the new
> > > format, although I've tried the two available methods (either setting node "dmas" or using "axistream-connected"
> > > property) and both methods result in the same boot messages and behavior.
>
> This has happened not to be true, I'm sorry for the confusion. Using node "dmas" enables use_dmaengine and produces the effect I explained: data is only received after a 2^17 bytes buffer is filled.
>
> If I remove "dmas" entry and provide a "axistream-connected" one, things get a little better (but see at the end for some DTS notes).
In this mode, in which dmaengine is > not used but legacy DMA code inside axienet itself, tcpdump -vv shows packets > incoming at a normal rate. However, the system is not answering to ARP requests: > > 00:02:37.800814 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > 00:02:38.801974 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > 00:02:39.804137 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > 00:02:40.806434 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > 00:02:41.808084 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > 00:02:42.810592 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > 00:02:43.813155 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 > tell 10.188.139.1, length 46 > > Here's the normal answer for a second device running old 4.4.43 kernel connected to > the same switch: > > 00:21:12.057326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 > tell 10.188.139.1, length 46 > 00:21:12.057905 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at > 06:00:0a:bc:8c:01 (oui Unknown), length 28 > 00:21:13.059460 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 > tell 10.188.139.1, length 46 > 00:21:13.060031 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at > 06:00:0a:bc:8c:01 (oui Unknown), length 28 > 00:21:14.060502 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 > tell 10.188.139.1, length 46 > 00:21:14.061051 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at > 06:00:0a:bc:8c:01 (oui Unknown), length 28 > > The funny thing is that once I manually add arp entries in both my computer and the > embedded one, I can establish full TCP communication and iperf3 shows a 
relatively > nice speed, similar to the throughput I get with old 4.4.43 kernel. > > # arp -s 10.188.139.1 f4:4d:ad:02:11:29 > # iperf3 -c 10.188.139.1 > Connecting to host 10.188.139.1, port 5201 [ 5] local 10.188.140.2 port 55480 > connected to 10.188.139.1 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.01 sec 3.63 MBytes 30.1 Mbits/sec 0 130 KBytes > [ 5] 1.01-2.01 sec 3.75 MBytes 31.5 Mbits/sec 0 130 KBytes > [ 5] 2.01-3.01 sec 3.63 MBytes 30.4 Mbits/sec 0 130 KBytes > [ 5] 3.01-4.01 sec 3.75 MBytes 31.4 Mbits/sec 0 130 KBytes > [ 5] 4.01-5.01 sec 3.75 MBytes 31.4 Mbits/sec 0 130 KBytes > [ 5] 5.01-6.01 sec 3.75 MBytes 31.5 Mbits/sec 0 130 KBytes > [ 5] 6.01-7.01 sec 3.75 MBytes 31.6 Mbits/sec 0 130 KBytes > [ 5] 7.01-7.75 sec 2.63 MBytes 29.5 Mbits/sec 0 130 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate Retr > [ 5] 0.00-7.75 sec 28.6 MBytes 31.0 Mbits/sec 0 sender > [ 5] 0.00-7.75 sec 0.00 Bytes 0.00 bits/sec receiver > iperf3: interrupt - the client has terminated # iperf3 -c 10.188.139.1 -R Connecting to > host 10.188.139.1, port 5201 Reverse mode, remote host 10.188.139.1 is sending [ > 5] local 10.188.140.2 port 45582 connected to 10.188.139.1 port 5201 > [ ID] Interval Transfer Bitrate > [ 5] 0.00-1.03 sec 5.13 MBytes 41.9 Mbits/sec > [ 5] 1.03-2.03 sec 5.38 MBytes 44.8 Mbits/sec > [ 5] 2.03-3.02 sec 5.38 MBytes 45.6 Mbits/sec > [ 5] 3.02-4.02 sec 5.38 MBytes 45.2 Mbits/sec > [ 5] 4.02-5.01 sec 5.38 MBytes 45.4 Mbits/sec > [ 5] 5.01-5.30 sec 1.50 MBytes 43.2 Mbits/sec > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bitrate > [ 5] 0.00-5.30 sec 0.00 Bytes 0.00 bits/sec sender > [ 5] 0.00-5.30 sec 28.1 MBytes 44.5 Mbits/sec receiver > iperf3: interrupt - the client has terminated > > I had never seen a device able to fully stablish communication except for replying to > MAC requests, so I'm not sure what's happening here. 
> > > On the other hand, and since I don't know how to debug this ARP issue, I went back > to see if I could diagnose what's happening in DMA Engine mode, so I peeked at the > code and I saw an asymmetry between RX and TX, which sounded good given that > in dmaengine mode TX works perfectly (or so it seems) and RX is heavily buffered. > This asymmetry lies precisely on the number of SG blocks and number of skb > buffers. > > Both bd_nums are defined in the same way: > lp->rx_bd_num = RX_BD_NUM_DEFAULT; // = 1024 > lp->tx_bd_num = TX_BD_NUM_DEFAULT; // = 128 > > > But the skb ring size is defined in a different fashion: > lp->tx_skb_ring = kcalloc(TX_BD_NUM_MAX, sizeof(*lp->tx_skb_ring), // = > 4096 > GFP_KERNEL); > ... > lp->rx_skb_ring = kcalloc(RX_BUF_NUM_DEFAULT, sizeof(*lp->rx_skb_ring), > // = 128 > GFP_KERNEL); > > So, for TX we allocate space for up to 4096 buffers but by default use 128. > For RX we allocate space for 128 buffers but somehow are setting 1024 as the > default bd number. 
> > The fact that RX_BD_NUM_DEFAULT is used nowhere else is also a signal that > there was some mistake here, so I went and replaced all RX_BUF_NUM_DEFAULT > occurances with RX_BD_NUM_DEFAULT, so that both TX and RX skb rings are > declared and operated with using the same strategy: > > sed -i '/^#define/!s#RX_BUF_NUM_DEFAULT#RX_BD_NUM_MAX#g' > xilinx_axienet_main.c > > Doing this solved the buffering problem, although the system still doesn't reply to > ARP requests, and when I tried to run an iperf3 test after manually adding arp tables, > the kernel segfaulted (so I probably shouldn't have blindly 'sed' like that :) > > # iperf3 -c 10.188.139.1 > Connecting to host 10.188.139.1, port 5201 [ 5] local 10.188.140.2 port 46356 > connected to 10.188.139.1 port 5201 > [ ID] Interval Transfer Bitrate Retr Cwnd > [ 5] 0.00-1.01 sec 640 KBytes 5.18 Mbits/sec 3 84.8 KBykernel task_size > exceed > Oops: Exception in kernel mode, sig: 11 > CPU: 0 UID: 0 PID: 147 Comm: iperf3 Not tainted 6.13.8 #13 Registers dump: > mode=8269B900 r1=00000000, r2=00000000, r3=00000000, r4=00000010 > r5=00000000, r6=000005F2, r7=FFFF7FFF, r8=00000000 r9=00000000, > r10=00000000, r11=00000000, r12=CF5FF24C r13=00000000, r14=C241AB70, > r15=C0383EB8, r16=00000000 r17=C0383EC0, r18=000005F0, r19=C10124A0, > r20=480F8520 r21=4831F960, r22=00000000, r23=00000000, r24=FFFFFFEA > r25=C12BE0A8, r26=C12BE03C, r27=C12BE020, r28=00000122 r29=00000100, > r30=000065A2, r31=C120F780, rPC=C0383EC0 msr=000046A2, ear=FFFFFFFA, > esr=00000312, fsr=00000000 Kernel panic - not syncing: Aiee, killing interrupt > handler! > ---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]--- > tes > > I couldn't see what was wrong with new code, so I just went and replaced the > RX_BD_NUM_DEFAULT value from 1024 down to 128, so it's now the same size as > its TX counterpart, but the kernel segfaulted again when trying to measure > throughput. 
> Sadly, my kernel debugging abilities are not much stronger than this, so
> I'm stuck at this point but firmly believe there's something wrong here, although I
> can't see what it is.
>
> Any help will be greatly appreciated.
>
>
> DTS NOTES:
> Using the old DMA code inside xilinx_axienet_main.c requires removing the "dmas" entry
> and adding a reference to the DMA device, either via axistream-connected or by adding
> resources manually to the node. Referring to the node linked by axistream-connected
> requires a DMA node to exist, but its compatible string can't be xlnx,axi-dma-1.00.a,
> because then the AXI DMA driver will lock onto it and axienet will complain
> about the device being busy. So my solution for this is to use a "not" compatible string.
> As such, with the following DTS I can establish TCP connections as long as ARP
> tables are manually entered:
>
>
> axi_ethernet_0_dma: dma@41e00000 {
>     /* NOTE THE NOT */
>     compatible = "notxlnx,axi-dma-1.00.a";
>     #dma-cells = <1>;
>     reg = <0x41e00000 0x10000>;
>     interrupt-parent = <&microblaze_0_axi_intc>;
>     interrupts = <7 1 8 1>;
>     xlnx,addrwidth = <32>; // Address size in bits
>     xlnx,datawidth = <32>;
>     xlnx,include-sg;
>     xlnx,sg-length-width = <16>;
>     xlnx,include-dre = <1>;
>     xlnx,axistream-connected = <1>;
>     xlnx,irq-delay = <1>;
>     dma-channels = <2>;
>     clock-names = "s_axi_lite_aclk", "m_axi_mm2s_aclk", "m_axi_s2mm_aclk",
>                   "m_axi_sg_aclk";
>     clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
>     dma-channel@41e00000 {
>         compatible = "xlnx,axi-dma-mm2s-channel";
>         xlnx,include-dre = <1>;
>         interrupts = <7 1>;
>         xlnx,datawidth = <32>;
>     };
>     dma-channel@41e00030 {
>         compatible = "xlnx,axi-dma-s2mm-channel";
>         xlnx,include-dre = <1>;
>         interrupts = <8 1>;
>         xlnx,datawidth = <32>;
>     };
> };
> axi_ethernet_eth: ethernet@40c00000 {
>     compatible = "xlnx,axi-ethernet-1.00.a";
>     reg = <0x40c00000 0x40000>;
>     phy-handle = <&phy1>;
>     interrupt-parent = <&microblaze_0_axi_intc>;
>     interrupts = <3 0>;
>     xlnx,rxmem = <0x1000>;
>     max-speed = <100000>;
>     phy-mode = "mii";
>     xlnx,txcsum = <0x2>;
>     xlnx,rxcsum = <0x2>;
>     clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk";
>     clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
>     axistream-connected = <&axi_ethernet_0_dma>;
>     dma-names = "tx_chan0", "rx_chan0";
>     mdio {
>         #address-cells = <1>;
>         #size-cells = <0>;
>         phy1: ethernet-phy@1 {
>             device_type = "ethernet-phy";
>             reg = <1>;
>         };
>     };
> };
>
> So this mode of working would definitely NOT need AXI DMA, and this hack with the
> compatible string should not be needed if the dependency on AXI DMA was
> removed.
>
> Best regards,

+ Going through the details and will get back to you. Just to confirm: there is no
Vivado design update, and we are only updating the Linux kernel to the latest?

Thanks,
Radhey
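The ring-size asymmetry described in the quoted message can be illustrated outside the kernel. What follows is only a minimal userspace sketch, assuming (this is not confirmed anywhere in the thread) that a ring is indexed by masking a free-running counter with a power-of-two size. The helpers ring_slot() and indexing_is_safe() are hypothetical names and are not taken from xilinx_axienet_main.c; only the four constants come from the message. The sketch shows why an index derived from one constant is only safe against an array allocated with a constant at least as large. It does not claim to explain the crash seen after the sed replacement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values quoted in the message above. */
#define RX_BD_NUM_DEFAULT  1024
#define TX_BD_NUM_DEFAULT  128
#define TX_BD_NUM_MAX      4096
#define RX_BUF_NUM_DEFAULT 128

/* Hypothetical helper: map a free-running counter onto a power-of-two ring. */
static size_t ring_slot(size_t head, size_t ring_size)
{
    /* The mask trick only works for power-of-two sizes. */
    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
    return head & (ring_size - 1);
}

/*
 * Hypothetical check: true if every slot reachable with a ring of
 * index_size entries fits inside an array of `allocated` entries.
 */
static bool indexing_is_safe(size_t allocated, size_t index_size)
{
    size_t highest_slot = ring_slot(index_size - 1, index_size);

    return highest_slot < allocated;
}
```

With these definitions, indexing_is_safe(TX_BD_NUM_MAX, TX_BD_NUM_DEFAULT) holds (4096 slots allocated, 128 used), while indexing_is_safe(RX_BUF_NUM_DEFAULT, RX_BD_NUM_DEFAULT) does not (128 slots allocated, indices up to 1023). Whether the driver ever indexes the RX skb ring with the larger value is exactly the open question in the message.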
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-03 5:54 ` Pandey, Radhey Shyam
@ 2025-04-03 6:10 ` Álvaro G. M.
  2025-04-09 11:00 ` Álvaro G. M.
  1 sibling, 0 replies; 16+ messages in thread
From: Álvaro G. M. @ 2025-04-03 6:10 UTC
To: Pandey, Radhey Shyam, Jakub Kicinski
Cc: netdev@vger.kernel.org, Katakam, Harini, Gupta, Suraj

On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
>
> + Going through the details and will get back to you. Just to confirm: there is no
> Vivado design update, and we are only updating the Linux kernel to the latest?
>

That's right, there have been no changes. Well, to be completely honest, I removed
a custom HDL block from the design (a watchdog) because it was supposed to be fed
when ping to a network device failed, but since the network isn't working properly,
well, I had to remove it or the system would keep rebooting itself.

I am also now building the kernel with an initramfs so as to add small scripts
without having to write to flash, and things to ease debugging, but my first tries
were done with the original filesystem that was already on the device (I'm now
loading the kernel with initramfs using xsdb).

Thanks

--
Álvaro G. M.
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-03 5:54 ` Pandey, Radhey Shyam
  2025-04-03 6:10 ` Álvaro G. M.
@ 2025-04-09 11:00 ` Álvaro G. M.
  2025-04-09 11:14 ` Pandey, Radhey Shyam
  1 sibling, 1 reply; 16+ messages in thread
From: Álvaro G. M. @ 2025-04-09 11:00 UTC
To: Pandey, Radhey Shyam, Jakub Kicinski
Cc: netdev@vger.kernel.org, Katakam, Harini, Gupta, Suraj

On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> [...]
> + Going through the details and will get back to you. Just to confirm: there is no
> Vivado design update, and we are only updating the Linux kernel to the latest?
>

Hi again,

I've reconsidered the upgrading approach and I've first upgraded buildroot and kept
the same kernel version (4.4.43). This has the effect of upgrading gcc from version
10 to version 13.

With buildroot's compiled gcc-13 and keeping this same old kernel, the effect is that
the system drops ARP requests. Compiling with the older gcc-10, ARP requests are
replied to. Keeping the old buildroot version but asking it to use gcc-11 will cause the
same issue with kernel 4.4.43, so something must have happened in between those
gcc versions.

So this does not look like an axienet driver problem, which I first thought it was,
because who would blame the compiler in the first instance?

But then things started to get even stranger.

What I did next was slowly upgrade buildroot, using the kernel version that
buildroot considered "latest" at the point it was released. I reached a point at which
the ARP requests were being dropped again. This happened on buildroot 2021.11,
which still used gcc-10 as the default and kernel version 5.15.6. So some gcc bug
that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by
kernel 5.15.6.

Using gcc-10, I bisected the kernel and found that this commit was triggering
whatever it is that is happening, around 5.11-rc2:

commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD)
Author: Menglong Dong <dong.menglong@zte.com.cn>
Date:   Mon Jan 11 02:42:21 2021 -0800

    net: core: use eth_type_vlan in __netif_receive_skb_core

    Replace the check for ETH_P_8021Q and ETH_P_8021AD in
    __netif_receive_skb_core with eth_type_vlan.

    Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
    Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cn
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

I've been staring at the diff for hours because I can't understand what can be wrong
about this:

diff --git a/net/core/dev.c b/net/core/dev.c
index e4d77c8abe76..267c4a8daa55 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                 skb_reset_mac_len(skb);
         }
 
-        if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
-            skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+        if (eth_type_vlan(skb->protocol)) {
                 skb = skb_vlan_untag(skb);
                 if (unlikely(!skb))
                         goto out;
@@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                          * find vlan device.
                          */
                         skb->pkt_type = PACKET_OTHERHOST;
-                } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
-                           skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+                } else if (eth_type_vlan(skb->protocol)) {
                         /* Outer header is 802.1P with vlan 0, inner header is
                          * 802.1Q or 802.1AD and vlan_do_receive() above could
                          * not find vlan dev for vlan id 0.

Given that eth_type_vlan is simply this:

static inline bool eth_type_vlan(__be16 ethertype)
{
        switch (ethertype) {
        case htons(ETH_P_8021Q):
        case htons(ETH_P_8021AD):
                return true;
        default:
                return false;
        }
}

I've added a small printk to see these values right before the first time they are
checked:

printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d",
       skb->protocol, cpu_to_be16(ETH_P_8021Q), cpu_to_be16(ETH_P_8021AD),
       eth_type_vlan(skb->protocol));

And each ARP ping delivers a packet reported as:

skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, eth_type_vlan(skb->protocol) = 0

To add insult to injury, adding this printk line solves the ARP deafness, so no matter
whether I use the eth_type_vlan function or the manual comparison, ARP packets are
no longer dropped.

Removing this printk and adding one inside the if-clause that should not be taken
shows nothing, so I can neither directly inspect the packets or the return value of
the misbehaving code, nor indirectly prove that the wrong branch of the if is being
taken. This reinforces the idea of a compiler bug, but I could very well be wrong.

Adding this printk:

diff --git i/net/core/dev.c w/net/core/dev.c
index 267c4a8daa55..a3ae3bcb3a21 100644
--- i/net/core/dev.c
+++ w/net/core/dev.c
@@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                                  * check again for vlan id to set OTHERHOST.
                                  */
                                 goto check_vlan_id;
+                        } else {
+                                printk(KERN_ALERT "(1) skb->protocol is not type vlan\n");
                         }
                         /* Note: we might in the future use prio bits
                          * and set skb->priority like in vlan_do_receive()

is even weirder because it has the same effect: the message does not appear, but ARP
requests are answered. If I remove this printk, ARP requests are dropped.

I've generated assembly output and this is the difference between having that extra
else with the printk and not having it.
It doesn't even make much sense that execution would reach this region of
code, because there's no VLAN involved at all here.

And so here I am again, staring at all this without knowing how to proceed.

I guess I will be trying different and more modern versions of gcc, even some
precompiled toolchains, and see what else may be going on.

If anyone has any insight as to what is causing this or how to solve it, it'd be great
if you could share it.

Thanks!

--
Álvaro G. M.
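As a sanity check on the numbers in the message above, here is a small userspace sketch in plain C (not kernel code). It assumes a little-endian CPU, which is what the printed value ETH_P_8021Q=129 (0x0081) suggests. The SWAB16 macro and the function names are illustrative stand-ins for htons()/cpu_to_be16() and for the two forms of the VLAN check; they are not kernel APIs. The sketch decodes the printed decimals and confirms that, at the source level, the explicit comparison and the switch agree for every 16-bit value, so a behavioural difference would have to come from somewhere other than the logic of the change itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_ARP    0x0806
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Constant byte swap, usable in case labels (stand-in for htons() on a
 * little-endian CPU; an assumption of this sketch). */
#define SWAB16(x) ((uint16_t)((((x) & 0x00FFu) << 8) | (((x) & 0xFF00u) >> 8)))

/* The older form: two explicit comparisons. */
static bool vlan_by_compare(uint16_t proto)
{
    return proto == SWAB16(ETH_P_8021Q) || proto == SWAB16(ETH_P_8021AD);
}

/* The eth_type_vlan() form: a switch over the same two constants. */
static bool vlan_by_switch(uint16_t proto)
{
    switch (proto) {
    case SWAB16(ETH_P_8021Q):
    case SWAB16(ETH_P_8021AD):
        return true;
    default:
        return false;
    }
}

/* Count the 16-bit values on which the two forms disagree. */
static unsigned int count_mismatches(void)
{
    unsigned int mismatches = 0;

    for (uint32_t v = 0; v <= 0xFFFF; v++)
        if (vlan_by_compare((uint16_t)v) != vlan_by_switch((uint16_t)v))
            mismatches++;
    return mismatches;
}

/* Decode the decimal values printed by the printk in the message above. */
static void check_printed_values(void)
{
    assert(SWAB16(1544) == ETH_P_ARP);      /* skb->protocol: 0x0608 on the wire is 0x0806 */
    assert(SWAB16(129) == ETH_P_8021Q);     /* 0x0081 */
    assert(SWAB16(43144) == ETH_P_8021AD);  /* 0xA888 */
    assert(!vlan_by_switch(1544));          /* ARP is not a VLAN ethertype */
}
```

Calling check_printed_values() and checking that count_mismatches() returns 0 is enough to confirm that the reported frame is a plain ARP frame and that neither form of the check should treat it as VLAN-tagged.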
* RE: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-09 11:00 ` Álvaro G. M.
@ 2025-04-09 11:14 ` Pandey, Radhey Shyam
  2025-04-09 13:09 ` Álvaro G. M.
  0 siblings, 1 reply; 16+ messages in thread
From: Pandey, Radhey Shyam @ 2025-04-09 11:14 UTC
To: Álvaro G. M., Jakub Kicinski
Cc: netdev@vger.kernel.org, Katakam, Harini, Gupta, Suraj

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Álvaro G. M. <alvaro.gamez@hazent.com>
> Sent: Wednesday, April 9, 2025 4:31 PM
> To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski
> <kuba@kernel.org>
> Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta,
> Suraj <Suraj.Gupta2@amd.com>
> Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> Packets only received after some buffer is full
>
> On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > [...]
> > + Going through the details and will get back to you. Just to confirm: there is no
> > Vivado design update, and we are only updating the Linux kernel to the latest?
> >
>
> Hi again,
>
> I've reconsidered the upgrading approach and I've first upgraded buildroot and kept
> the same kernel version (4.4.43). This has the effect of upgrading gcc from version
> 10 to version 13.
>
> With buildroot's compiled gcc-13 and keeping this same old kernel, the effect is that
> the system drops ARP requests. Compiling with older gcc-10, ARP requests are

When the system drops an ARP packet, is it dropped by the MAC hardware or by the
software layer? Reading the MAC stats and DMA descriptors would help us know whether
it reaches the software layer or not.

> replied to. Keeping old buildroot version but asking it to use gcc-11 will cause the
> same issue with kernel 4.4.43, so something must have happened in between those
> gcc versions.
> [...]
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-09 11:14 ` Pandey, Radhey Shyam
@ 2025-04-09 13:09 ` Álvaro G. M.
  2025-04-10 6:25 ` Gupta, Suraj
  2025-04-17 16:12 ` Sean Anderson
  0 siblings, 2 replies; 16+ messages in thread
From: Álvaro G. M. @ 2025-04-09 13:09 UTC
To: Pandey, Radhey Shyam, Jakub Kicinski
Cc: netdev@vger.kernel.org, Katakam, Harini, Gupta, Suraj

On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Álvaro G. M. <alvaro.gamez@hazent.com>
> > Sent: Wednesday, April 9, 2025 4:31 PM
> > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski
> > <kuba@kernel.org>
> > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta,
> > Suraj <Suraj.Gupta2@amd.com>
> > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > Packets only received after some buffer is full
> >
> > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > > [...]
> > > + Going through the details and will get back to you. Just to confirm: there is no
> > > Vivado design update, and we are only updating the Linux kernel to the latest?
> > >
> >
> > Hi again,
> >
> > I've reconsidered the upgrading approach and I've first upgraded buildroot and kept
> > the same kernel version (4.4.43). This has the effect of upgrading gcc from version
> > 10 to version 13.
> >
> > With buildroot's compiled gcc-13 and keeping this same old kernel, the effect is that
> > the system drops ARP requests. Compiling with older gcc-10, ARP requests are
>
> When the system drops an ARP packet, is it dropped by the MAC hardware or by the
> software layer? Reading the MAC stats and DMA descriptors would help us know whether
> it reaches the software layer or not.

I'm not sure who is the one dropping packets. I can only check with ethtool -S eth0,
and this is its output after a few dozen arpings:

# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 06:00:0A:BC:8C:01
          inet addr:10.188.140.1  Bcast:10.188.143.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:164 errors:0 dropped:99 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:11236 (10.9 KiB)  TX bytes:1844 (1.8 KiB)

# ethtool -S eth0
NIC statistics:
     Received bytes: 13950
     Transmitted bytes: 2016
     RX Good VLAN Tagged Frames: 0
     TX Good VLAN Tagged Frames: 0
     TX Good PFC Frames: 0
     RX Good PFC Frames: 0
     User Defined Counter 0: 0
     User Defined Counter 1: 0
     User Defined Counter 2: 0

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             1024
RX Mini:        0
RX Jumbo:       0
TX:             128

# ethtool -d eth0
Offset          Values
------          ------
0x0000:         00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00
0x0010:         00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00
0x0020:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0030:         00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18
0x0040:         00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00
0x0050:         80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00
0x0060:         00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc
0x0070:         8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00
0x0080:         03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80
0x0090:         03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81

Running tcpdump makes it so that the ifconfig dropped value doesn't increment and
shows me ARP requests (although it won't reply to them), but just setting the
interface to promiscuous mode does not.

If you can give me any indications on how to gather more data about DMA
descriptors, I'll try my best.

This is using the internal emaclite dma, because when using dmaengine there's no
dropping of packets, but a big buffering, and kernel 6.13.8, because in the ~5.11 series,
which I'm also working with, axienet didn't have support for reading statistics from
the core.

I assume the old dma code inside axienet is to be deprecated, and I would be pretty
glad to use dmaengine, but that has the buffering problem. So if you want to focus
efforts on solving that issue, I'm completely open to whatever you all deem more
appropriate.

I can even add some ILA to the Vivado design and inspect whatever you think could
be useful.

Thanks

> [...]
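The "dropped" figure shown by ifconfig in the message above is also exported per interface under /sys/class/net/<iface>/statistics/ (for example rx_dropped), so it can be sampled before and after a burst of arpings without parsing ifconfig output. The following is a minimal userspace sketch of that idea. The helper names read_counter() and counter_delta() are hypothetical, and the interface name eth0 is simply the one used in this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Read one decimal counter from a sysfs-style file; returns -1 on any error. */
static long long read_counter(const char *path)
{
    FILE *f;
    long long value = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Difference between two samples of the same counter, or -1 if a read failed. */
static long long counter_delta(long long before, long long after)
{
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

A typical use would be to call read_counter("/sys/class/net/eth0/statistics/rx_dropped") once, send a known number of ARP requests from the peer, read it again, and compare counter_delta() with the number of requests sent, once with tcpdump running and once without.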
* RE: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-09 13:09 ` Álvaro G. M.
@ 2025-04-10 6:25 ` Gupta, Suraj
  2025-04-10 6:54 ` Álvaro G. M.
  2025-04-17 16:12 ` Sean Anderson
  1 sibling, 1 reply; 16+ messages in thread
From: Gupta, Suraj @ 2025-04-10 6:25 UTC
To: Álvaro G. M.
Cc: netdev@vger.kernel.org, Katakam, Harini, Pandey, Radhey Shyam, Jakub Kicinski

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Álvaro G. M. <alvaro.gamez@hazent.com>
> Sent: Wednesday, April 9, 2025 6:40 PM
> To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski
> <kuba@kernel.org>
> Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta,
> Suraj <Suraj.Gupta2@amd.com>
> Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> Packets only received after some buffer is full
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > > -----Original Message-----
> > > From: Álvaro G. M. <alvaro.gamez@hazent.com>
> > > Sent: Wednesday, April 9, 2025 4:31 PM
> > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub
> > > Kicinski <kuba@kernel.org>
> > > Cc: netdev@vger.kernel.org; Katakam, Harini
> > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com>
> > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > > Packets only received after some buffer is full
> > >
> > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > > > [...]
> > > > + Going through the details and will get back to you. Just to
> > > > confirm: there is no Vivado design update, and we are only
> > > > updating the Linux kernel to the latest?
> > > >
> > >
> > > Hi again,
> > >
> > > I've reconsidered the upgrading approach and I've first upgraded
> > > buildroot and kept the same kernel version (4.4.43). This has the
> > > effect of upgrading gcc from version
> > > 10 to version 13.
> > >
> > > With buildroot's compiled gcc-13 and keeping this same old kernel,
> > > the effect is that the system drops ARP requests. Compiling with
> > > older gcc-10, ARP requests are
> >
> > When the system drops an ARP packet, is it dropped by the MAC hardware or by the
> > software layer? Reading the MAC stats and DMA descriptors would help us know
> > whether it reaches the software layer or not.
>
> I'm not sure who is the one dropping packets. I can only check with ethtool -S
> eth0, and this is its output after a few dozen arpings:
>
> # ifconfig eth0
> eth0      Link encap:Ethernet  HWaddr 06:00:0A:BC:8C:01
>           inet addr:10.188.140.1  Bcast:10.188.143.255  Mask:255.255.248.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:164 errors:0 dropped:99 overruns:0 frame:0
>           TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:11236 (10.9 KiB)  TX bytes:1844 (1.8 KiB)
>
> # ethtool -S eth0
> NIC statistics:
>      Received bytes: 13950
>      Transmitted bytes: 2016
>      RX Good VLAN Tagged Frames: 0
>      TX Good VLAN Tagged Frames: 0
>      TX Good PFC Frames: 0
>      RX Good PFC Frames: 0
>      User Defined Counter 0: 0
>      User Defined Counter 1: 0
>      User Defined Counter 2: 0
>
> # ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             1024
> RX Mini:        0
> RX Jumbo:       0
> TX:             128
>
> # ethtool -d eth0
> Offset          Values
> ------          ------
> 0x0000:         00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00
> 0x0010:         00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00
> 0x0020:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 0x0030:         00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18
> 0x0040:         00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00
> 0x0050:         80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00
> 0x0060:         00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc
> 0x0070:         8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00
> 0x0080:         03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80
> 0x0090:         03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81
>
>

As per the register dump, the packet is not dropped by the MAC. It's being dropped
somewhere in the software layer. Since you started bisecting Linux commits, could you
please try reverting the suspected commit and check whether that's actually the first
bad commit?

> Running tcpdump makes it so that ifconfig dropped value doesn't increment and
> shows me ARP requests (although it won't reply to them), but just setting the
> interface to promiscuous mode does not.
>
> If you can give me any indications on how to gather more data about DMA
> descriptors I'll try my best.
>
> This is using the internal emaclite dma, because when using dmaengine there's no
> dropping of packets, but a big buffering, and kernel 6.13.8, because in series ~5.11
> which I'm also working with, axienet didn't have support for reading statistics from
> the core.
>
> I assume the old dma code inside axienet is to be deprecated, and I would be pretty
> glad to use dmaengine, but that has the buffering problem. So if you want to focus
> efforts on solving that issue I'm completely open to whatever you all deem more
> appropriate.
>

We're not planning to make the dmaengine flow the default soon, as significant work
and optimizations are still required there and are in progress. However, we didn't
observe this buffering issue on our platforms the last time we ran it with Linux v6.12.

> I can even add some ILA to the Vivado design and inspect whatever you think could
> be useful
>
> Thanks
>
> >
> > > replied to. Keeping old buildroot version but asking it to use
> > > gcc-11 will cause the same issue with kernel 4.4.43, so something
> > > must have happened in between those gcc versions.
> > > > > > So this does not look like an axienet driver problem, which I first > > > thought it was, because who would blame the compiler in first instance? > > > > > > But then things started to get even stranger. > > > > > > What I did next, was slowly upgrading buildroot and using the kernel > > > version that buildroot considered "latest" at the point it was > > > released. I reached a point in which the ARP requests were being > > > dropped again. This happened on buildroot 2021.11, which still used > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by > kernel 5.15.6. > > > > > > Using gcc-10, I bisected the kernel and found that this commit was > > > triggering whatever it is that is happening, around 5.11-rc2: > > > > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD) > > > Author: Menglong Dong <dong.menglong@zte.com.cn> > > > Date: Mon Jan 11 02:42:21 2021 -0800 > > > > > > net: core: use eth_type_vlan in __netif_receive_skb_core > > > > > > Replace the check for ETH_P_8021Q and ETH_P_8021AD in > > > __netif_receive_skb_core with eth_type_vlan. 
> > > > > > Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> > > > Link: https://lore.kernel.org/r/20210111104221.3451-1- > > > dong.menglong@zte.com.cn > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > > > > > > > > I've been staring at the diff for hours because I can't understand > > > what can be wrong about this: > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c index > > > e4d77c8abe76..267c4a8daa55 > > > 100644 > > > --- a/net/core/dev.c > > > +++ b/net/core/dev.c > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct > > > sk_buff **pskb, bool pfmemalloc, > > > skb_reset_mac_len(skb); > > > } > > > > > > - if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > + if (eth_type_vlan(skb->protocol)) { > > > skb = skb_vlan_untag(skb); > > > if (unlikely(!skb)) > > > goto out; > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct > > > sk_buff **pskb, bool pfmemalloc, > > > * find vlan device. > > > */ > > > skb->pkt_type = PACKET_OTHERHOST; > > > - } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > + } else if (eth_type_vlan(skb->protocol)) { > > > /* Outer header is 802.1P with vlan 0, inner header is > > > * 802.1Q or 802.1AD and vlan_do_receive() above could > > > * not find vlan dev for vlan id 0. 
> > > > > > > > > > > > Given that eth_type_vlan is simply this: > > > > > > static inline bool eth_type_vlan(__be16 ethertype) { > > > switch (ethertype) { > > > case htons(ETH_P_8021Q): > > > case htons(ETH_P_8021AD): > > > return true; > > > default: > > > return false; > > > } > > > } > > > > > > I've added a small printk to see these values right before the first > > > time they are > > > checked: > > > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d > > > ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d", > > > skb->protocol, cpu_to_be16(ETH_P_8021Q), > > > cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol)); > > > > > > And each ARP ping delivers a packet reported as: > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, > > > skb->eth_type_vlan(skb->protocol) = 0 > > > > > > To add insult to injury, adding this printk line solves the ARP > > > deafness, so no matter whether I use eth_type_vlan function or > > > manual comparison, now ARP packets aren't dropped. > > > > > > Removing this printk and adding one inside the if-clause that should > > > not be happening, shows nothing, so neither I can directly inspect > > > the packets or return value of the wrong working code, nor can I > > > indirectly proof that the wrong branch of the if is being taken. > > > This reinforces the idea of a compiler bug, but I very well could be wrong. > > > > > > Adding this printk: > > > diff --git i/net/core/dev.c w/net/core/dev.c index > > > 267c4a8daa55..a3ae3bcb3a21 > > > 100644 > > > --- i/net/core/dev.c > > > +++ w/net/core/dev.c > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct > > > sk_buff **pskb, bool pfmemalloc, > > > * check again for vlan id to set OTHERHOST. 
> > > */ > > > goto check_vlan_id; > > > + } else { > > > + printk(KERN_ALERT "(1) skb->protocol is not type > > > + vlan\n"); > > > } > > > /* Note: we might in the future use prio bits > > > * and set skb->priority like in vlan_do_receive() > > > > > > is even weirder because the same effect: the message does not appear > > > but ARP requests are answered back. If I remove this printk, ARP requests are > dropped. > > > > > > I've generated assembly output and this is the difference between > > > having that extra else with the printk and not having it. > > > > > > It doesn't even make much any sense that code would even reach this > > > region of code because there's no vlan involved in at all here. > > > > > > And so here I am again, staring at all this without knowing how to proceed. > > > > > > I guess I will be trying different and more modern versions of gcc, > > > even some precompiled toolchains and see what else may be going on. > > > > > > If anyone has any hindsight as to what is causing this or how to > > > solve it, it'd be great if you could share it. > > > > > > Thanks! > > > > > > -- > > > Álvaro G. M. ^ permalink raw reply [flat|nested] 16+ messages in thread
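For reference, the raw numbers in the printk output quoted above can be decoded by hand. The following is only a minimal userspace sketch, assuming the kernel in question is built little-endian (which the printed ETH_P_8021Q=129, i.e. 0x0081, suggests). The helpers swap16(), is_vlan_ethertype_le() and check_printed_values() are hypothetical names made up for this illustration and are not kernel APIs; the EtherType constants are the standard ones from <linux/if_ether.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EtherType constants in host byte order, as in <linux/if_ether.h>. */
#define ETH_P_ARP    0x0806
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/*
 * Hypothetical helper: swap the two bytes of a 16-bit value.
 * Assumption: on a little-endian CPU this is the effect of
 * cpu_to_be16()/htons(), so applying it to a value printed with %d
 * recovers the EtherType in its usual notation.
 */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Hypothetical userspace stand-in for eth_type_vlan(): takes the raw
 * __be16 value as printed on a little-endian machine and reports
 * whether it is one of the two VLAN EtherTypes.
 */
static bool is_vlan_ethertype_le(uint16_t raw_be16)
{
	uint16_t host = swap16(raw_be16);

	return host == ETH_P_8021Q || host == ETH_P_8021AD;
}

/* Checks the three values reported in the thread's printk output. */
static void check_printed_values(void)
{
	assert(swap16(1544) == ETH_P_ARP);      /* skb->protocol = 1544 (0x0608) */
	assert(swap16(129) == ETH_P_8021Q);     /* ETH_P_8021Q = 129 (0x0081)    */
	assert(swap16(43144) == ETH_P_8021AD);  /* ETH_P_8021AD = 43144 (0xA888) */
	assert(!is_vlan_ethertype_le(1544));    /* ARP is not a VLAN EtherType   */
}
```

Under that assumption the printk output is self-consistent: the frame is plain ARP, and neither VLAN branch should be taken, which agrees with the reported eth_type_vlan(skb->protocol) = 0.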
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full 2025-04-10 6:25 ` Gupta, Suraj @ 2025-04-10 6:54 ` Álvaro G. M. 2025-04-10 7:10 ` Gupta, Suraj 2025-04-10 7:14 ` Álvaro G. M. 0 siblings, 2 replies; 16+ messages in thread From: Álvaro G. M. @ 2025-04-10 6:54 UTC (permalink / raw) To: Gupta, Suraj Cc: netdev@vger.kernel.org, Katakam, Harini, Pandey, Radhey Shyam, Jakub Kicinski On Thu, 2025-04-10 at 06:25 +0000, Gupta, Suraj wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > > -----Original Message----- > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > Sent: Wednesday, April 9, 2025 6:40 PM > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski > > <kuba@kernel.org> > > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta, > > Suraj <Suraj.Gupta2@amd.com> > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > Packets only received after some buffer is full > > > > Caution: This message originated from an External Source. Use proper caution > > when opening attachments, clicking links, or responding. > > > > > > On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote: > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > -----Original Message----- > > > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > > > Sent: Wednesday, April 9, 2025 4:31 PM > > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub > > > > Kicinski <kuba@kernel.org> > > > > Cc: netdev@vger.kernel.org; Katakam, Harini > > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com> > > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > > > Packets only received after some buffer is full > > > > > > > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote: > > > > > [...] > > > > > + Going through the details and will get back to you . 
Just to > > > > > confirm there is no vivado design update ? and we are only > > > > > updating linux kernel to > > > > latest? > > > > > > > > > > > > > Hi again, > > > > > > > > I've reconsidered the upgrading approach and I've first upgraded > > > > buildroot and kept the same kernel version (4.4.43). This has the > > > > effect of upgrading gcc from version > > > > 10 to version 13. > > > > > > > > With buildroot's compiled gcc-13 and keeping this same old kernel, > > > > the effect is that the system drops ARP requests. Compiling with > > > > older gcc-10, ARP requests are > > > > > > When the system drops ARP packet - Is it drop by MAC hw or by software layer. > > > Reading MAC stats and DMA descriptors help us know if it reaches > > > software layer or not ? > > > > I'm not sure, who is the open dropping packets, I can only check with ethtool -S > > eth0 and this is its output after a few dozens of arpings: > > > > # ifconfig eth0 > > eth0 Link encap:Ethernet HWaddr 06:00:0A:BC:8C:01 > > inet addr:10.188.140.1 Bcast:10.188.143.255 Mask:255.255.248.0 > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > RX packets:164 errors:0 dropped:99 overruns:0 frame:0 > > TX packets:22 errors:0 dropped:0 overruns:0 carrier:0 > > collisions:0 txqueuelen:1000 > > RX bytes:11236 (10.9 KiB) TX bytes:1844 (1.8 KiB) > > > > # ethtool -S eth0 > > NIC statistics: > > Received bytes: 13950 > > Transmitted bytes: 2016 > > RX Good VLAN Tagged Frames: 0 > > TX Good VLAN Tagged Frames: 0 > > TX Good PFC Frames: 0 > > RX Good PFC Frames: 0 > > User Defined Counter 0: 0 > > User Defined Counter 1: 0 > > User Defined Counter 2: 0 > > > > # ethtool -g eth0 > > Ring parameters for eth0: > > Pre-set maximums: > > RX: 4096 > > RX Mini: 0 > > RX Jumbo: 0 > > TX: 4096 > > Current hardware settings: > > RX: 1024 > > RX Mini: 0 > > RX Jumbo: 0 > > TX: 128 > > > > # ethtool -d eth0 > > Offset Values > > ------ ------ > > 0x0000: 00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00 > > 0x0010: 00 
00 00 00 18 00 00 00 00 00 00 00 00 00 00 00 > > 0x0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 0x0030: 00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18 > > 0x0040: 00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00 > > 0x0050: 80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00 > > 0x0060: 00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc > > 0x0070: 8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00 > > 0x0080: 03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80 > > 0x0090: 03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81 > > > > > > > > As per registers dump, packet is not dropped by MAC. It's dropping somewhere in the software layer. > Since you started bisecting linux commits, could you please try reverting suspected commit and check if that's actually the first bad commit? >

I already kinda did, please read the whole message quoted below.

* To summarize:
Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315^ = 679500e385fc4d65c3fac5bfbe6ee55d65698f20 works fine.
Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315 drops packets.

But using commit 324cefaf1c723625e93f703d6e6d78e28996b315 and adding a printk around the suspect lines solves the issue. Looks like a compiler bug.

* New information from yesterday's email:
Reverting commit 324cefaf1c723625e93f703d6e6d78e28996b315 on kernel 6.13.8 does not solve the issue. Neither does tinkering around with printks.

> > Running tcpdump makes it so that ifconfig dropped value doesn't increment and > > shows me ARP requests (although it won't reply to them), but just setting the > > interface as promisc do not. > > > > If you can give me any indications on how to gather more data about DMA > > descriptors I'll try my best. > > > > This is using internal's emaclite dma, because when using dmaengine there's no > > dropping of packets, but a big buffering, and kernel 6.13.8, because in series ~5.11 > > which I'm also working with, axienet didn't have support for reading statistics from > > the core. 
> > > > I assume the old dma code inside axienet is to be deprecated, and I would be pretty > > glad to use dmaengine, but that has the buffering problem. So if you want to focus > > efforts on solving that issue I'm completely open to whatever you all deem more > > appropriate. > > > > We're not planning to make DMAengine flow default soon as there is some significant work and optimizations required there which are under progress. > But this buffering issue we didn't observe on our platforms last time we ran it with linux v6.12. > I just tried dmaengine on 6.12 and have the same buffering issue. Did you try on Microblaze too or only on Zynq? > > I can even add some ILA to the Vivado design and inspect whatever you think could > > be useful > > > > Thanks > > > > > > > > > replied to. Keeping old buildroot version but asking it to use > > > > gcc-11 will cause the same issue with kernel 4.4.43, so something > > > > must have happened in between those gcc versions. > > > > > > > > So this does not look like an axienet driver problem, which I first > > > > thought it was, because who would blame the compiler in first instance? > > > > > > > > But then things started to get even stranger. > > > > > > > > What I did next, was slowly upgrading buildroot and using the kernel > > > > version that buildroot considered "latest" at the point it was > > > > released. I reached a point in which the ARP requests were being > > > > dropped again. This happened on buildroot 2021.11, which still used > > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug > > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by > > kernel 5.15.6. 
> > > > > > > > Using gcc-10, I bisected the kernel and found that this commit was > > > > triggering whatever it is that is happening, around 5.11-rc2: > > > > > > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD) > > > > Author: Menglong Dong <dong.menglong@zte.com.cn> > > > > Date: Mon Jan 11 02:42:21 2021 -0800 > > > > > > > > net: core: use eth_type_vlan in __netif_receive_skb_core > > > > > > > > Replace the check for ETH_P_8021Q and ETH_P_8021AD in > > > > __netif_receive_skb_core with eth_type_vlan. > > > > > > > > Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> > > > > Link: https://lore.kernel.org/r/20210111104221.3451-1- > > > > dong.menglong@zte.com.cn > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > > > > > > > > > > > I've been staring at the diff for hours because I can't understand > > > > what can be wrong about this: > > > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c index > > > > e4d77c8abe76..267c4a8daa55 > > > > 100644 > > > > --- a/net/core/dev.c > > > > +++ b/net/core/dev.c > > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct > > > > sk_buff **pskb, bool pfmemalloc, > > > > skb_reset_mac_len(skb); > > > > } > > > > > > > > - if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > + if (eth_type_vlan(skb->protocol)) { > > > > skb = skb_vlan_untag(skb); > > > > if (unlikely(!skb)) > > > > goto out; > > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct > > > > sk_buff **pskb, bool pfmemalloc, > > > > * find vlan device. > > > > */ > > > > skb->pkt_type = PACKET_OTHERHOST; > > > > - } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > + } else if (eth_type_vlan(skb->protocol)) { > > > > /* Outer header is 802.1P with vlan 0, inner header is > > > > * 802.1Q or 802.1AD and vlan_do_receive() above could > > > > * not find vlan dev for vlan id 0. 
> > > > > > > > > > > > > > > > Given that eth_type_vlan is simply this: > > > > > > > > static inline bool eth_type_vlan(__be16 ethertype) { > > > > switch (ethertype) { > > > > case htons(ETH_P_8021Q): > > > > case htons(ETH_P_8021AD): > > > > return true; > > > > default: > > > > return false; > > > > } > > > > } > > > > > > > > I've added a small printk to see these values right before the first > > > > time they are > > > > checked: > > > > > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d > > > > ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d", > > > > skb->protocol, cpu_to_be16(ETH_P_8021Q), > > > > cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol)); > > > > > > > > And each ARP ping delivers a packet reported as: > > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, > > > > skb->eth_type_vlan(skb->protocol) = 0 > > > > > > > > To add insult to injury, adding this printk line solves the ARP > > > > deafness, so no matter whether I use eth_type_vlan function or > > > > manual comparison, now ARP packets aren't dropped. > > > > > > > > Removing this printk and adding one inside the if-clause that should > > > > not be happening, shows nothing, so neither I can directly inspect > > > > the packets or return value of the wrong working code, nor can I > > > > indirectly proof that the wrong branch of the if is being taken. > > > > This reinforces the idea of a compiler bug, but I very well could be wrong. > > > > > > > > Adding this printk: > > > > diff --git i/net/core/dev.c w/net/core/dev.c index > > > > 267c4a8daa55..a3ae3bcb3a21 > > > > 100644 > > > > --- i/net/core/dev.c > > > > +++ w/net/core/dev.c > > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct > > > > sk_buff **pskb, bool pfmemalloc, > > > > * check again for vlan id to set OTHERHOST. 
> > > > */ > > > > goto check_vlan_id; > > > > + } else { > > > > + printk(KERN_ALERT "(1) skb->protocol is not type > > > > + vlan\n"); > > > > } > > > > /* Note: we might in the future use prio bits > > > > * and set skb->priority like in vlan_do_receive() > > > > > > > > is even weirder because the same effect: the message does not appear > > > > but ARP requests are answered back. If I remove this printk, ARP requests are > > dropped. > > > > > > > > I've generated assembly output and this is the difference between > > > > having that extra else with the printk and not having it. > > > > > > > > It doesn't even make much any sense that code would even reach this > > > > region of code because there's no vlan involved in at all here. > > > > > > > > And so here I am again, staring at all this without knowing how to proceed. > > > > > > > > I guess I will be trying different and more modern versions of gcc, > > > > even some precompiled toolchains and see what else may be going on. > > > > > > > > If anyone has any hindsight as to what is causing this or how to > > > > solve it, it'd be great if you could share it. > > > > > > > > Thanks! > > > > > > > > -- > > > > Álvaro G. M. ^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full 2025-04-10 6:54 ` Álvaro G. M. @ 2025-04-10 7:10 ` Gupta, Suraj 2025-04-21 11:12 ` Álvaro G. M. 2025-04-10 7:14 ` Álvaro G. M. 1 sibling, 1 reply; 16+ messages in thread From: Gupta, Suraj @ 2025-04-10 7:10 UTC (permalink / raw) To: Álvaro G. M. Cc: netdev@vger.kernel.org, Katakam, Harini, Pandey, Radhey Shyam, Jakub Kicinski [AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Álvaro G. M. <alvaro.gamez@hazent.com> > Sent: Thursday, April 10, 2025 12:24 PM > To: Gupta, Suraj <Suraj.Gupta2@amd.com> > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Pandey, > Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski > <kuba@kernel.org> > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > Packets only received after some buffer is full > > Caution: This message originated from an External Source. Use proper caution > when opening attachments, clicking links, or responding. > > > On Thu, 2025-04-10 at 06:25 +0000, Gupta, Suraj wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > -----Original Message----- > > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > > Sent: Wednesday, April 9, 2025 6:40 PM > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub > > > Kicinski <kuba@kernel.org> > > > Cc: netdev@vger.kernel.org; Katakam, Harini > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com> > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > > Packets only received after some buffer is full > > > > > > Caution: This message originated from an External Source. Use proper > > > caution when opening attachments, clicking links, or responding. 
> > > > > > > > > On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote: > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > > > -----Original Message----- > > > > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > > > > Sent: Wednesday, April 9, 2025 4:31 PM > > > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub > > > > > Kicinski <kuba@kernel.org> > > > > > Cc: netdev@vger.kernel.org; Katakam, Harini > > > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com> > > > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on > MicroBlaze: > > > > > Packets only received after some buffer is full > > > > > > > > > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote: > > > > > > [...] > > > > > > + Going through the details and will get back to you . Just > > > > > > to confirm there is no vivado design update ? and we are only > > > > > > updating linux kernel to > > > > > latest? > > > > > > > > > > > > > > > > Hi again, > > > > > > > > > > I've reconsidered the upgrading approach and I've first upgraded > > > > > buildroot and kept the same kernel version (4.4.43). This has > > > > > the effect of upgrading gcc from version > > > > > 10 to version 13. > > > > > > > > > > With buildroot's compiled gcc-13 and keeping this same old > > > > > kernel, the effect is that the system drops ARP requests. > > > > > Compiling with older gcc-10, ARP requests are > > > > > > > > When the system drops ARP packet - Is it drop by MAC hw or by software > layer. > > > > Reading MAC stats and DMA descriptors help us know if it reaches > > > > software layer or not ? 
> > > > > > I'm not sure, who is the open dropping packets, I can only check > > > with ethtool -S > > > eth0 and this is its output after a few dozens of arpings: > > > > > > # ifconfig eth0 > > > eth0 Link encap:Ethernet HWaddr 06:00:0A:BC:8C:01 > > > inet addr:10.188.140.1 Bcast:10.188.143.255 Mask:255.255.248.0 > > > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > > > RX packets:164 errors:0 dropped:99 overruns:0 frame:0 > > > TX packets:22 errors:0 dropped:0 overruns:0 carrier:0 > > > collisions:0 txqueuelen:1000 > > > RX bytes:11236 (10.9 KiB) TX bytes:1844 (1.8 KiB) > > > > > > # ethtool -S eth0 > > > NIC statistics: > > > Received bytes: 13950 > > > Transmitted bytes: 2016 > > > RX Good VLAN Tagged Frames: 0 > > > TX Good VLAN Tagged Frames: 0 > > > TX Good PFC Frames: 0 > > > RX Good PFC Frames: 0 > > > User Defined Counter 0: 0 > > > User Defined Counter 1: 0 > > > User Defined Counter 2: 0 > > > > > > # ethtool -g eth0 > > > Ring parameters for eth0: > > > Pre-set maximums: > > > RX: 4096 > > > RX Mini: 0 > > > RX Jumbo: 0 > > > TX: 4096 > > > Current hardware settings: > > > RX: 1024 > > > RX Mini: 0 > > > RX Jumbo: 0 > > > TX: 128 > > > > > > # ethtool -d eth0 > > > Offset Values > > > ------ ------ > > > 0x0000: 00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00 > > > 0x0010: 00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00 > > > 0x0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > 0x0030: 00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18 > > > 0x0040: 00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00 > > > 0x0050: 80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00 > > > 0x0060: 00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc > > > 0x0070: 8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00 > > > 0x0080: 03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80 > > > 0x0090: 03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81 > > > > > > > > > > > > > As per registers dump, packet is not dropped by MAC. It's dropping somewhere in > the software layer. 
> > Since you started bisecting linux commits, could you please try reverting > suspected commit and check if that's actually the first bad commit? > > > > I already kinda did, please read the whole message quoted below. > > * To summarize: > Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315^ = > 679500e385fc4d65c3fac5bfbe6ee55d65698f20 works fine Kernel commit > 324cefaf1c723625e93f703d6e6d78e28996b315 drops packets > > But using commit 324cefaf1c723625e93f703d6e6d78e28996b315 and adding printk > around suspect lines, solves the issue. Looks a like a compiler bug. > > * New information from yesterday's email: > Reverting commit 324cefaf1c723625e93f703d6e6d78e28996b315 on kernel 6.13.8 > does not solve the issue. Neither does tinkering around with printks > >

I didn't suspect that commit last time, as it looked unrelated to the issue. Could you please try bisecting Linux again, keeping everything else the same? To start with, you could bisect among the AXI Ethernet commits to see whether the problem is related to AXI Ethernet changes or to something else.

> > > Running tcpdump makes it so that ifconfig dropped value doesn't > > > increment and shows me ARP requests (although it won't reply to > > > them), but just setting the interface as promisc do not. > > > > > > If you can give me any indications on how to gather more data about > > > DMA descriptors I'll try my best. > > > > > > This is using internal's emaclite dma, because when using dmaengine > > > there's no dropping of packets, but a big buffering, and kernel > > > 6.13.8, because in series ~5.11 which I'm also working with, axienet > > > didn't have support for reading statistics from the core. > > > > > > I assume the old dma code inside axienet is to be deprecated, and I > > > would be pretty glad to use dmaengine, but that has the buffering > > > problem. So if you want to focus efforts on solving that issue I'm > > > completely open to whatever you all deem more appropriate. 
> > > > > > > We're not planning to make DMAengine flow default soon as there is some > significant work and optimizations required there which are under progress. > > But this buffering issue we didn't observe on our platforms last time we ran it with > linux v6.12. > > > > I just tried dmaengine on 6.12 and have the same buffering issue. > > Did you try on Microblaze too or only on Zynq? We have tested with ZynqMP AXI 1G ethernet. > > > > > > I can even add some ILA to the Vivado design and inspect whatever > > > you think could be useful > > > > > > Thanks > > > > > > > > > > > > replied to. Keeping old buildroot version but asking it to use > > > > > gcc-11 will cause the same issue with kernel 4.4.43, so > > > > > something must have happened in between those gcc versions. > > > > > > > > > > So this does not look like an axienet driver problem, which I > > > > > first thought it was, because who would blame the compiler in first instance? > > > > > > > > > > But then things started to get even stranger. > > > > > > > > > > What I did next, was slowly upgrading buildroot and using the > > > > > kernel version that buildroot considered "latest" at the point > > > > > it was released. I reached a point in which the ARP requests > > > > > were being dropped again. This happened on buildroot 2021.11, > > > > > which still used > > > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug > > > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also > > > > > triggered on gcc-10 by > > > kernel 5.15.6. 
> > > > > > > > > > Using gcc-10, I bisected the kernel and found that this commit > > > > > was triggering whatever it is that is happening, around 5.11-rc2: > > > > > > > > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD) > > > > > Author: Menglong Dong <dong.menglong@zte.com.cn> > > > > > Date: Mon Jan 11 02:42:21 2021 -0800 > > > > > > > > > > net: core: use eth_type_vlan in __netif_receive_skb_core > > > > > > > > > > Replace the check for ETH_P_8021Q and ETH_P_8021AD in > > > > > __netif_receive_skb_core with eth_type_vlan. > > > > > > > > > > Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> > > > > > Link: https://lore.kernel.org/r/20210111104221.3451-1- > > > > > dong.menglong@zte.com.cn > > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > > > > > > > > > > > > > > I've been staring at the diff for hours because I can't > > > > > understand what can be wrong about this: > > > > > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c index > > > > > e4d77c8abe76..267c4a8daa55 > > > > > 100644 > > > > > --- a/net/core/dev.c > > > > > +++ b/net/core/dev.c > > > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > skb_reset_mac_len(skb); > > > > > } > > > > > > > > > > - if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > > + if (eth_type_vlan(skb->protocol)) { > > > > > skb = skb_vlan_untag(skb); > > > > > if (unlikely(!skb)) > > > > > goto out; > > > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > * find vlan device. 
> > > > > */ > > > > > skb->pkt_type = PACKET_OTHERHOST; > > > > > - } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > > + } else if (eth_type_vlan(skb->protocol)) { > > > > > /* Outer header is 802.1P with vlan 0, inner header is > > > > > * 802.1Q or 802.1AD and vlan_do_receive() above could > > > > > * not find vlan dev for vlan id 0. > > > > > > > > > > > > > > > > > > > > Given that eth_type_vlan is simply this: > > > > > > > > > > static inline bool eth_type_vlan(__be16 ethertype) { > > > > > switch (ethertype) { > > > > > case htons(ETH_P_8021Q): > > > > > case htons(ETH_P_8021AD): > > > > > return true; > > > > > default: > > > > > return false; > > > > > } > > > > > } > > > > > > > > > > I've added a small printk to see these values right before the > > > > > first time they are > > > > > checked: > > > > > > > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d > > > > > ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d", > > > > > skb->protocol, cpu_to_be16(ETH_P_8021Q), > > > > > cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol)); > > > > > > > > > > And each ARP ping delivers a packet reported as: > > > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, > > > > > skb->eth_type_vlan(skb->protocol) = 0 > > > > > > > > > > To add insult to injury, adding this printk line solves the ARP > > > > > deafness, so no matter whether I use eth_type_vlan function or > > > > > manual comparison, now ARP packets aren't dropped. > > > > > > > > > > Removing this printk and adding one inside the if-clause that > > > > > should not be happening, shows nothing, so neither I can > > > > > directly inspect the packets or return value of the wrong > > > > > working code, nor can I indirectly proof that the wrong branch of the if is > being taken. > > > > > This reinforces the idea of a compiler bug, but I very well could be wrong. 
> > > > > > > > > > Adding this printk: > > > > > diff --git i/net/core/dev.c w/net/core/dev.c index > > > > > 267c4a8daa55..a3ae3bcb3a21 > > > > > 100644 > > > > > --- i/net/core/dev.c > > > > > +++ w/net/core/dev.c > > > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > * check again for vlan id to set OTHERHOST. > > > > > */ > > > > > goto check_vlan_id; > > > > > + } else { > > > > > + printk(KERN_ALERT "(1) skb->protocol is not type > > > > > + vlan\n"); > > > > > } > > > > > /* Note: we might in the future use prio bits > > > > > * and set skb->priority like in vlan_do_receive() > > > > > > > > > > is even weirder because the same effect: the message does not > > > > > appear but ARP requests are answered back. If I remove this > > > > > printk, ARP requests are > > > dropped. > > > > > > > > > > I've generated assembly output and this is the difference > > > > > between having that extra else with the printk and not having it. > > > > > > > > > > It doesn't even make much any sense that code would even reach > > > > > this region of code because there's no vlan involved in at all here. > > > > > > > > > > And so here I am again, staring at all this without knowing how to proceed. > > > > > > > > > > I guess I will be trying different and more modern versions of > > > > > gcc, even some precompiled toolchains and see what else may be going on. > > > > > > > > > > If anyone has any hindsight as to what is causing this or how to > > > > > solve it, it'd be great if you could share it. > > > > > > > > > > Thanks! > > > > > > > > > > -- > > > > > Álvaro G. M. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full 2025-04-10 7:10 ` Gupta, Suraj @ 2025-04-21 11:12 ` Álvaro G. M. 0 siblings, 0 replies; 16+ messages in thread From: Álvaro G. M. @ 2025-04-21 11:12 UTC (permalink / raw) To: Gupta, Suraj Cc: netdev@vger.kernel.org, Katakam, Harini, Pandey, Radhey Shyam, Jakub Kicinski On Thu, 2025-04-10 at 07:10 +0000, Gupta, Suraj wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > > -----Original Message----- > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > Sent: Thursday, April 10, 2025 12:24 PM > > To: Gupta, Suraj <Suraj.Gupta2@amd.com> > > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Pandey, > > Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski > > <kuba@kernel.org> > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > Packets only received after some buffer is full > > > > Caution: This message originated from an External Source. Use proper caution > > when opening attachments, clicking links, or responding. > > > > > > On Thu, 2025-04-10 at 06:25 +0000, Gupta, Suraj wrote: > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > -----Original Message----- > > > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > > > Sent: Wednesday, April 9, 2025 6:40 PM > > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub > > > > Kicinski <kuba@kernel.org> > > > > Cc: netdev@vger.kernel.org; Katakam, Harini > > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com> > > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > > > Packets only received after some buffer is full > > > > > > > > Caution: This message originated from an External Source. Use proper > > > > caution when opening attachments, clicking links, or responding. 
> > > > > > > > > > > > On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote: > > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > > > > > -----Original Message----- > > > > > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > > > > > Sent: Wednesday, April 9, 2025 4:31 PM > > > > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub > > > > > > Kicinski <kuba@kernel.org> > > > > > > Cc: netdev@vger.kernel.org; Katakam, Harini > > > > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com> > > > > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on > > MicroBlaze: > > > > > > Packets only received after some buffer is full > > > > > > > > > > > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote: > > > > > > > [...] > > > > > > > + Going through the details and will get back to you . Just > > > > > > > to confirm there is no vivado design update ? and we are only > > > > > > > updating linux kernel to > > > > > > latest? > > > > > > > > > > > > > > > > > > > Hi again, > > > > > > > > > > > > I've reconsidered the upgrading approach and I've first upgraded > > > > > > buildroot and kept the same kernel version (4.4.43). This has > > > > > > the effect of upgrading gcc from version > > > > > > 10 to version 13. > > > > > > > > > > > > With buildroot's compiled gcc-13 and keeping this same old > > > > > > kernel, the effect is that the system drops ARP requests. > > > > > > Compiling with older gcc-10, ARP requests are > > > > > > > > > > When the system drops ARP packet - Is it drop by MAC hw or by software > > layer. > > > > > Reading MAC stats and DMA descriptors help us know if it reaches > > > > > software layer or not ? 
> > > >
> > > > I'm not sure who is the one dropping packets. I can only check
> > > > with ethtool -S eth0, and this is its output after a few dozen
> > > > arpings:
> > > >
> > > > # ifconfig eth0
> > > > eth0  Link encap:Ethernet HWaddr 06:00:0A:BC:8C:01
> > > >       inet addr:10.188.140.1 Bcast:10.188.143.255 Mask:255.255.248.0
> > > >       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > > >       RX packets:164 errors:0 dropped:99 overruns:0 frame:0
> > > >       TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
> > > >       collisions:0 txqueuelen:1000
> > > >       RX bytes:11236 (10.9 KiB) TX bytes:1844 (1.8 KiB)
> > > >
> > > > # ethtool -S eth0
> > > > NIC statistics:
> > > >      Received bytes: 13950
> > > >      Transmitted bytes: 2016
> > > >      RX Good VLAN Tagged Frames: 0
> > > >      TX Good VLAN Tagged Frames: 0
> > > >      TX Good PFC Frames: 0
> > > >      RX Good PFC Frames: 0
> > > >      User Defined Counter 0: 0
> > > >      User Defined Counter 1: 0
> > > >      User Defined Counter 2: 0
> > > >
> > > > # ethtool -g eth0
> > > > Ring parameters for eth0:
> > > > Pre-set maximums:
> > > > RX: 4096
> > > > RX Mini: 0
> > > > RX Jumbo: 0
> > > > TX: 4096
> > > > Current hardware settings:
> > > > RX: 1024
> > > > RX Mini: 0
> > > > RX Jumbo: 0
> > > > TX: 128
> > > >
> > > > # ethtool -d eth0
> > > > Offset  Values
> > > > ------  ------
> > > > 0x0000: 00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00
> > > > 0x0010: 00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00
> > > > 0x0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > 0x0030: 00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18
> > > > 0x0040: 00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00
> > > > 0x0050: 80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00
> > > > 0x0060: 00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc
> > > > 0x0070: 8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00
> > > > 0x0080: 03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80
> > > > 0x0090: 03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81
> > > >
> > > >
> > > >
> > >
> > > As
per registers dump, packet is not dropped by MAC. It's dropping somewhere in the software layer.
> > > Since you started bisecting linux commits, could you please try reverting
> > > suspected commit and check if that's actually the first bad commit?
> > >
> >
> > I already kinda did, please read the whole message quoted below.
> >
> > * To summarize:
> > Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315^ =
> > 679500e385fc4d65c3fac5bfbe6ee55d65698f20 works fine
> > Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315 drops packets
> >
> > But using commit 324cefaf1c723625e93f703d6e6d78e28996b315 and adding printk
> > around suspect lines, solves the issue. Looks like a compiler bug.
> >
> > * New information from yesterday's email:
> > Reverting commit 324cefaf1c723625e93f703d6e6d78e28996b315 on kernel 6.13.8
> > does not solve the issue. Neither does tinkering around with printks
> >
> >
>
> I didn't suspect that commit last time as it was unrelated to the issue. Could you please try effectively bisecting linux, keeping other things same? For the starting you can try bisecting among AXI ethernet commits and see if it's related to AXI ethernet changes or something else?
>

And rightfully so: the commit basically replaces a chunk of code with
another that does exactly the same thing, only with a different comparison
structure (if vs. switch) and a function call.

I've now managed to build kernel 6.13.8 with the same gcc version (5.4)
that I used to compile kernel 4.4.14, and the problem has gone away. I am
now completely sure that the compiler is responsible: going back to gcc 5.4
has solved my ethernet issues without touching a single line of kernel code.

Using gcc 13 causes the behavior I described, and bisecting the kernel with
gcc 13 definitely points to that commit, which of course isn't the bad
actor but somehow triggers the fault. So for now I'll just keep using
gcc 5.4.

Conversations on the gcc bugzilla point to the Microblaze architecture being
in the process of being deprecated, as it doesn't seem to get much attention
and there are standing bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118280

We'll probably be deprecating Microblaze internally, and I will definitely
recommend not using it anymore. Maybe Microblaze V could be a nice
alternative, but AFAIK it doesn't have an MMU yet and Linux probably won't
run on it, so...

Thanks a lot for your guidance. I've learned a lot diagnosing this, and you
guys pointed me at things I didn't know about. And please forgive all the
misdirection: I blamed the axienet driver and the net code when they were
innocent.

Best regards

> > > > Running tcpdump makes it so that ifconfig dropped value doesn't
> > > > increment and shows me ARP requests (although it won't reply to
> > > > them), but just setting the interface as promisc does not.
> > > >
> > > > If you can give me any indications on how to gather more data about
> > > > DMA descriptors I'll try my best.
> > > >
> > > > This is using internal's emaclite dma, because when using dmaengine
> > > > there's no dropping of packets, but a big buffering, and kernel
> > > > 6.13.8, because in series ~5.11 which I'm also working with, axienet
> > > > didn't have support for reading statistics from the core.
> > > >
> > > > I assume the old dma code inside axienet is to be deprecated, and I
> > > > would be pretty glad to use dmaengine, but that has the buffering
> > > > problem. So if you want to focus efforts on solving that issue I'm
> > > > completely open to whatever you all deem more appropriate.
> > > > > > > > > > We're not planning to make DMAengine flow default soon as there is some > > significant work and optimizations required there which are under progress. > > > But this buffering issue we didn't observe on our platforms last time we ran it with > > linux v6.12. > > > > > > > I just tried dmaengine on 6.12 and have the same buffering issue. > > > > Did you try on Microblaze too or only on Zynq? > > We have tested with ZynqMP AXI 1G ethernet. > > > > > > > > > > > I can even add some ILA to the Vivado design and inspect whatever > > > > you think could be useful > > > > > > > > Thanks > > > > > > > > > > > > > > > replied to. Keeping old buildroot version but asking it to use > > > > > > gcc-11 will cause the same issue with kernel 4.4.43, so > > > > > > something must have happened in between those gcc versions. > > > > > > > > > > > > So this does not look like an axienet driver problem, which I > > > > > > first thought it was, because who would blame the compiler in first instance? > > > > > > > > > > > > But then things started to get even stranger. > > > > > > > > > > > > What I did next, was slowly upgrading buildroot and using the > > > > > > kernel version that buildroot considered "latest" at the point > > > > > > it was released. I reached a point in which the ARP requests > > > > > > were being dropped again. This happened on buildroot 2021.11, > > > > > > which still used > > > > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug > > > > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also > > > > > > triggered on gcc-10 by > > > > kernel 5.15.6. 
> > > > > > > > > > > > Using gcc-10, I bisected the kernel and found that this commit > > > > > > was triggering whatever it is that is happening, around 5.11-rc2: > > > > > > > > > > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD) > > > > > > Author: Menglong Dong <dong.menglong@zte.com.cn> > > > > > > Date: Mon Jan 11 02:42:21 2021 -0800 > > > > > > > > > > > > net: core: use eth_type_vlan in __netif_receive_skb_core > > > > > > > > > > > > Replace the check for ETH_P_8021Q and ETH_P_8021AD in > > > > > > __netif_receive_skb_core with eth_type_vlan. > > > > > > > > > > > > Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> > > > > > > Link: https://lore.kernel.org/r/20210111104221.3451-1- > > > > > > dong.menglong@zte.com.cn > > > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > > > > > > > > > > > > > > > > > I've been staring at the diff for hours because I can't > > > > > > understand what can be wrong about this: > > > > > > > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c index > > > > > > e4d77c8abe76..267c4a8daa55 > > > > > > 100644 > > > > > > --- a/net/core/dev.c > > > > > > +++ b/net/core/dev.c > > > > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct > > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > > skb_reset_mac_len(skb); > > > > > > } > > > > > > > > > > > > - if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > > > + if (eth_type_vlan(skb->protocol)) { > > > > > > skb = skb_vlan_untag(skb); > > > > > > if (unlikely(!skb)) > > > > > > goto out; > > > > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct > > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > > * find vlan device. 
> > > > > > */ > > > > > > skb->pkt_type = PACKET_OTHERHOST; > > > > > > - } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > > > + } else if (eth_type_vlan(skb->protocol)) { > > > > > > /* Outer header is 802.1P with vlan 0, inner header is > > > > > > * 802.1Q or 802.1AD and vlan_do_receive() above could > > > > > > * not find vlan dev for vlan id 0. > > > > > > > > > > > > > > > > > > > > > > > > Given that eth_type_vlan is simply this: > > > > > > > > > > > > static inline bool eth_type_vlan(__be16 ethertype) { > > > > > > switch (ethertype) { > > > > > > case htons(ETH_P_8021Q): > > > > > > case htons(ETH_P_8021AD): > > > > > > return true; > > > > > > default: > > > > > > return false; > > > > > > } > > > > > > } > > > > > > > > > > > > I've added a small printk to see these values right before the > > > > > > first time they are > > > > > > checked: > > > > > > > > > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d > > > > > > ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d", > > > > > > skb->protocol, cpu_to_be16(ETH_P_8021Q), > > > > > > cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol)); > > > > > > > > > > > > And each ARP ping delivers a packet reported as: > > > > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, > > > > > > skb->eth_type_vlan(skb->protocol) = 0 > > > > > > > > > > > > To add insult to injury, adding this printk line solves the ARP > > > > > > deafness, so no matter whether I use eth_type_vlan function or > > > > > > manual comparison, now ARP packets aren't dropped. > > > > > > > > > > > > Removing this printk and adding one inside the if-clause that > > > > > > should not be happening, shows nothing, so neither I can > > > > > > directly inspect the packets or return value of the wrong > > > > > > working code, nor can I indirectly proof that the wrong branch of the if is > > being taken. 
> > > > > > This reinforces the idea of a compiler bug, but I very well could be wrong. > > > > > > > > > > > > Adding this printk: > > > > > > diff --git i/net/core/dev.c w/net/core/dev.c index > > > > > > 267c4a8daa55..a3ae3bcb3a21 > > > > > > 100644 > > > > > > --- i/net/core/dev.c > > > > > > +++ w/net/core/dev.c > > > > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct > > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > > * check again for vlan id to set OTHERHOST. > > > > > > */ > > > > > > goto check_vlan_id; > > > > > > + } else { > > > > > > + printk(KERN_ALERT "(1) skb->protocol is not type > > > > > > + vlan\n"); > > > > > > } > > > > > > /* Note: we might in the future use prio bits > > > > > > * and set skb->priority like in vlan_do_receive() > > > > > > > > > > > > is even weirder because the same effect: the message does not > > > > > > appear but ARP requests are answered back. If I remove this > > > > > > printk, ARP requests are > > > > dropped. > > > > > > > > > > > > I've generated assembly output and this is the difference > > > > > > between having that extra else with the printk and not having it. > > > > > > > > > > > > It doesn't even make much any sense that code would even reach > > > > > > this region of code because there's no vlan involved in at all here. > > > > > > > > > > > > And so here I am again, staring at all this without knowing how to proceed. > > > > > > > > > > > > I guess I will be trying different and more modern versions of > > > > > > gcc, even some precompiled toolchains and see what else may be going on. > > > > > > > > > > > > If anyone has any hindsight as to what is causing this or how to > > > > > > solve it, it'd be great if you could share it. > > > > > > > > > > > > Thanks! > > > > > > > > > > > > -- > > > > > > Álvaro G. M. ^ permalink raw reply [flat|nested] 16+ messages in thread
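Since the bisected commit only swaps an if-chain for a switch, one possible starting point for a reduced test case is to check both forms against each other in user space, built with the suspect cross-compiler. The following is only a sketch under assumptions: HTONS_LE() is a hypothetical stand-in for the kernel's htons() on a little-endian target, and vlan_if(), vlan_switch() and count_mismatches() are illustrative names. A correct compiler must make count_mismatches() return 0. Because the fault in this thread came and went with unrelated printk() calls, this isolated check may well pass even with an affected toolchain, in which case comparing the generated assembly of __netif_receive_skb_core across compiler versions is probably the more telling step.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Hypothetical stand-in for htons() on a little-endian CPU (assumption). */
#define HTONS_LE(x) ((uint16_t)((((x) & 0xffu) << 8) | (((x) >> 8) & 0xffu)))

/* The comparison as written before commit 324cefaf1c72. */
static bool vlan_if(uint16_t proto)
{
    return proto == HTONS_LE(ETH_P_8021Q) ||
           proto == HTONS_LE(ETH_P_8021AD);
}

/* The comparison in the shape of eth_type_vlan(), as quoted in the thread. */
static bool vlan_switch(uint16_t proto)
{
    switch (proto) {
    case HTONS_LE(ETH_P_8021Q):
    case HTONS_LE(ETH_P_8021AD):
        return true;
    default:
        return false;
    }
}

/* Count the 16-bit inputs on which the two forms disagree. */
static unsigned int count_mismatches(void)
{
    unsigned int bad = 0;

    for (uint32_t p = 0; p <= 0xffff; p++)
        if (vlan_if((uint16_t)p) != vlan_switch((uint16_t)p))
            bad++;
    return bad;
}
```

Calling count_mismatches() from a small main() on the target, at the same optimization level the kernel uses, keeps the check close to the real build.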
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full 2025-04-10 6:54 ` Álvaro G. M. 2025-04-10 7:10 ` Gupta, Suraj @ 2025-04-10 7:14 ` Álvaro G. M. 2025-04-10 9:06 ` Álvaro G. M. 1 sibling, 1 reply; 16+ messages in thread From: Álvaro G. M. @ 2025-04-10 7:14 UTC (permalink / raw) To: Gupta, Suraj Cc: netdev@vger.kernel.org, Katakam, Harini, Pandey, Radhey Shyam, Jakub Kicinski On Thu, 2025-04-10 at 08:54 +0200, Álvaro G. M. wrote: > On Thu, 2025-04-10 at 06:25 +0000, Gupta, Suraj wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > -----Original Message----- > > > From: Álvaro G. M. <alvaro.gamez@hazent.com> > > > Sent: Wednesday, April 9, 2025 6:40 PM > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski > > > <kuba@kernel.org> > > > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta, > > > Suraj <Suraj.Gupta2@amd.com> > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > > Packets only received after some buffer is full > > > > > > Caution: This message originated from an External Source. Use proper caution > > > when opening attachments, clicking links, or responding. > > > > > > > > > On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote: > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > > > -----Original Message----- > > > > > From: Álvaro G. M. 
<alvaro.gamez@hazent.com> > > > > > Sent: Wednesday, April 9, 2025 4:31 PM > > > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub > > > > > Kicinski <kuba@kernel.org> > > > > > Cc: netdev@vger.kernel.org; Katakam, Harini > > > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com> > > > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: > > > > > Packets only received after some buffer is full > > > > > > > > > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote: > > > > > > [...] > > > > > > + Going through the details and will get back to you . Just to > > > > > > confirm there is no vivado design update ? and we are only > > > > > > updating linux kernel to > > > > > latest? > > > > > > > > > > > > > > > > Hi again, > > > > > > > > > > I've reconsidered the upgrading approach and I've first upgraded > > > > > buildroot and kept the same kernel version (4.4.43). This has the > > > > > effect of upgrading gcc from version > > > > > 10 to version 13. > > > > > > > > > > With buildroot's compiled gcc-13 and keeping this same old kernel, > > > > > the effect is that the system drops ARP requests. Compiling with > > > > > older gcc-10, ARP requests are > > > > > > > > When the system drops ARP packet - Is it drop by MAC hw or by software layer. > > > > Reading MAC stats and DMA descriptors help us know if it reaches > > > > software layer or not ? 
> > >
> > > I'm not sure who is the one dropping packets. I can only check with
> > > ethtool -S eth0, and this is its output after a few dozen arpings:
> > >
> > > # ifconfig eth0
> > > eth0  Link encap:Ethernet HWaddr 06:00:0A:BC:8C:01
> > >       inet addr:10.188.140.1 Bcast:10.188.143.255 Mask:255.255.248.0
> > >       UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > >       RX packets:164 errors:0 dropped:99 overruns:0 frame:0
> > >       TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
> > >       collisions:0 txqueuelen:1000
> > >       RX bytes:11236 (10.9 KiB) TX bytes:1844 (1.8 KiB)
> > >
> > > # ethtool -S eth0
> > > NIC statistics:
> > >      Received bytes: 13950
> > >      Transmitted bytes: 2016
> > >      RX Good VLAN Tagged Frames: 0
> > >      TX Good VLAN Tagged Frames: 0
> > >      TX Good PFC Frames: 0
> > >      RX Good PFC Frames: 0
> > >      User Defined Counter 0: 0
> > >      User Defined Counter 1: 0
> > >      User Defined Counter 2: 0
> > >
> > > # ethtool -g eth0
> > > Ring parameters for eth0:
> > > Pre-set maximums:
> > > RX: 4096
> > > RX Mini: 0
> > > RX Jumbo: 0
> > > TX: 4096
> > > Current hardware settings:
> > > RX: 1024
> > > RX Mini: 0
> > > RX Jumbo: 0
> > > TX: 128
> > >
> > > # ethtool -d eth0
> > > Offset  Values
> > > ------  ------
> > > 0x0000: 00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00
> > > 0x0010: 00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00
> > > 0x0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 0x0030: 00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18
> > > 0x0040: 00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00
> > > 0x0050: 80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00
> > > 0x0060: 00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc
> > > 0x0070: 8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00
> > > 0x0080: 03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80
> > > 0x0090: 03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81
> > >
> > >
> > >
> >
> > As per registers dump, packet is not dropped by MAC. It's dropping somewhere in the software layer.

Via printk debugging I've just determined that packets are being dropped
here because pt_prev is NULL, but I don't yet understand the code of this
function. The message seems to indicate SKB_DROP_REASON_UNHANDLED_PROTO.
How come?

Jan 1 00:01:39 DRVM kern.alert kernel: Dropping packet from else
Jan 1 00:01:39 DRVM kern.alert kernel: Dropped packet
Jan 1 00:01:39 DRVM kern.alert kernel: OUT: dropped packet

net/core/dev.c:5719

    if (pt_prev) {
        if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) {
            printk(KERN_ALERT "2 Goto drop\n");
            goto drop;
        }
        *ppt_prev = pt_prev;
    } else {
        printk(KERN_ALERT "Dropping packet from else\n");
    drop:
        printk(KERN_ALERT "Dropped packet\n");
        if (!deliver_exact)
            dev_core_stats_rx_dropped_inc(skb->dev);
        else
            dev_core_stats_rx_nohandler_inc(skb->dev);
        kfree_skb_reason(skb, SKB_DROP_REASON_UNHANDLED_PROTO);
        /* Jamal, now you will not able to escape explaining
         * me how you were going to use this. :-)
         */
        ret = NET_RX_DROP;
    }

> > Since you started bisecting linux commits, could you please try reverting suspected commit and check if that's actually the first bad commit?
> >
>
> I already kinda did, please read the whole message quoted below.
>
> * To summarize:
> Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315^ = 679500e385fc4d65c3fac5bfbe6ee55d65698f20 works fine
> Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315 drops packets
>
> But using commit 324cefaf1c723625e93f703d6e6d78e28996b315 and adding printk
> around suspect lines, solves the issue. Looks like a compiler bug.
>
> * New information from yesterday's email:
> Reverting commit 324cefaf1c723625e93f703d6e6d78e28996b315 on kernel 6.13.8
> does not solve the issue. Neither does tinkering around with printks
>
>
> > > Running tcpdump makes it so that ifconfig dropped value doesn't increment and
> > > shows me ARP requests (although it won't reply to them), but just setting the
> > > interface as promisc does not.
> > > > > > If you can give me any indications on how to gather more data about DMA > > > descriptors I'll try my best. > > > > > > This is using internal's emaclite dma, because when using dmaengine there's no > > > dropping of packets, but a big buffering, and kernel 6.13.8, because in series ~5.11 > > > which I'm also working with, axienet didn't have support for reading statistics from > > > the core. > > > > > > I assume the old dma code inside axienet is to be deprecated, and I would be pretty > > > glad to use dmaengine, but that has the buffering problem. So if you want to focus > > > efforts on solving that issue I'm completely open to whatever you all deem more > > > appropriate. > > > > > > > We're not planning to make DMAengine flow default soon as there is some significant work and optimizations required there which are under progress. > > But this buffering issue we didn't observe on our platforms last time we ran it with linux v6.12. > > > > I just tried dmaengine on 6.12 and have the same buffering issue. > > Did you try on Microblaze too or only on Zynq? > > > > > > I can even add some ILA to the Vivado design and inspect whatever you think could > > > be useful > > > > > > Thanks > > > > > > > > > > > > replied to. Keeping old buildroot version but asking it to use > > > > > gcc-11 will cause the same issue with kernel 4.4.43, so something > > > > > must have happened in between those gcc versions. > > > > > > > > > > So this does not look like an axienet driver problem, which I first > > > > > thought it was, because who would blame the compiler in first instance? > > > > > > > > > > But then things started to get even stranger. > > > > > > > > > > What I did next, was slowly upgrading buildroot and using the kernel > > > > > version that buildroot considered "latest" at the point it was > > > > > released. I reached a point in which the ARP requests were being > > > > > dropped again. 
This happened on buildroot 2021.11, which still used > > > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug > > > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by > > > kernel 5.15.6. > > > > > > > > > > Using gcc-10, I bisected the kernel and found that this commit was > > > > > triggering whatever it is that is happening, around 5.11-rc2: > > > > > > > > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD) > > > > > Author: Menglong Dong <dong.menglong@zte.com.cn> > > > > > Date: Mon Jan 11 02:42:21 2021 -0800 > > > > > > > > > > net: core: use eth_type_vlan in __netif_receive_skb_core > > > > > > > > > > Replace the check for ETH_P_8021Q and ETH_P_8021AD in > > > > > __netif_receive_skb_core with eth_type_vlan. > > > > > > > > > > Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> > > > > > Link: https://lore.kernel.org/r/20210111104221.3451-1- > > > > > dong.menglong@zte.com.cn > > > > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > > > > > > > > > > > > > > I've been staring at the diff for hours because I can't understand > > > > > what can be wrong about this: > > > > > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c index > > > > > e4d77c8abe76..267c4a8daa55 > > > > > 100644 > > > > > --- a/net/core/dev.c > > > > > +++ b/net/core/dev.c > > > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > skb_reset_mac_len(skb); > > > > > } > > > > > > > > > > - if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > > + if (eth_type_vlan(skb->protocol)) { > > > > > skb = skb_vlan_untag(skb); > > > > > if (unlikely(!skb)) > > > > > goto out; > > > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > * find vlan device. 
> > > > > */ > > > > > skb->pkt_type = PACKET_OTHERHOST; > > > > > - } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) || > > > > > - skb->protocol == cpu_to_be16(ETH_P_8021AD)) { > > > > > + } else if (eth_type_vlan(skb->protocol)) { > > > > > /* Outer header is 802.1P with vlan 0, inner header is > > > > > * 802.1Q or 802.1AD and vlan_do_receive() above could > > > > > * not find vlan dev for vlan id 0. > > > > > > > > > > > > > > > > > > > > Given that eth_type_vlan is simply this: > > > > > > > > > > static inline bool eth_type_vlan(__be16 ethertype) { > > > > > switch (ethertype) { > > > > > case htons(ETH_P_8021Q): > > > > > case htons(ETH_P_8021AD): > > > > > return true; > > > > > default: > > > > > return false; > > > > > } > > > > > } > > > > > > > > > > I've added a small printk to see these values right before the first > > > > > time they are > > > > > checked: > > > > > > > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d > > > > > ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d", > > > > > skb->protocol, cpu_to_be16(ETH_P_8021Q), > > > > > cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol)); > > > > > > > > > > And each ARP ping delivers a packet reported as: > > > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, > > > > > skb->eth_type_vlan(skb->protocol) = 0 > > > > > > > > > > To add insult to injury, adding this printk line solves the ARP > > > > > deafness, so no matter whether I use eth_type_vlan function or > > > > > manual comparison, now ARP packets aren't dropped. > > > > > > > > > > Removing this printk and adding one inside the if-clause that should > > > > > not be happening, shows nothing, so neither I can directly inspect > > > > > the packets or return value of the wrong working code, nor can I > > > > > indirectly proof that the wrong branch of the if is being taken. > > > > > This reinforces the idea of a compiler bug, but I very well could be wrong. 
> > > > > > > > > > Adding this printk: > > > > > diff --git i/net/core/dev.c w/net/core/dev.c index > > > > > 267c4a8daa55..a3ae3bcb3a21 > > > > > 100644 > > > > > --- i/net/core/dev.c > > > > > +++ w/net/core/dev.c > > > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct > > > > > sk_buff **pskb, bool pfmemalloc, > > > > > * check again for vlan id to set OTHERHOST. > > > > > */ > > > > > goto check_vlan_id; > > > > > + } else { > > > > > + printk(KERN_ALERT "(1) skb->protocol is not type > > > > > + vlan\n"); > > > > > } > > > > > /* Note: we might in the future use prio bits > > > > > * and set skb->priority like in vlan_do_receive() > > > > > > > > > > is even weirder because the same effect: the message does not appear > > > > > but ARP requests are answered back. If I remove this printk, ARP requests are > > > dropped. > > > > > > > > > > I've generated assembly output and this is the difference between > > > > > having that extra else with the printk and not having it. > > > > > > > > > > It doesn't even make much any sense that code would even reach this > > > > > region of code because there's no vlan involved in at all here. > > > > > > > > > > And so here I am again, staring at all this without knowing how to proceed. > > > > > > > > > > I guess I will be trying different and more modern versions of gcc, > > > > > even some precompiled toolchains and see what else may be going on. > > > > > > > > > > If anyone has any hindsight as to what is causing this or how to > > > > > solve it, it'd be great if you could share it. > > > > > > > > > > Thanks! > > > > > > > > > > -- > > > > > Álvaro G. M. ^ permalink raw reply [flat|nested] 16+ messages in thread
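On how pt_prev can end up NULL in the snippet from this message: as far as the receive path reads, each matching protocol handler is remembered in pt_prev and delivery is deferred by one step, so if no registered handler matches the packet's protocol, pt_prev stays NULL and the packet falls into the drop branch with SKB_DROP_REASON_UNHANDLED_PROTO. That would also fit the observation that the dropped counter stops increasing while tcpdump runs, since a capture socket registers a catch-all handler. The model below is a deliberately simplified user-space sketch of that pattern, not kernel code: struct handler, PROTO_ALL, enum rx_result and receive() are all hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified handler record; not the kernel's struct packet_type. */
struct handler {
    uint16_t type;   /* protocol this handler accepts */
    int delivered;   /* number of packets handed to it */
};

#define PROTO_ALL 0xffff /* hypothetical "match everything" marker, like a tap */

enum rx_result { RX_SUCCESS, RX_DROP };

/*
 * Walk the handler table with one-step deferred delivery: a handler is
 * only called once the next match is found, and the last match is
 * delivered after the loop. With no match at all, pt_prev stays NULL
 * and the packet is dropped.
 */
static enum rx_result receive(struct handler *h, size_t n, uint16_t proto)
{
    struct handler *pt_prev = NULL;

    for (size_t i = 0; i < n; i++) {
        if (h[i].type != PROTO_ALL && h[i].type != proto)
            continue;
        if (pt_prev)
            pt_prev->delivered++; /* deliver to the previous match */
        pt_prev = &h[i];          /* remember the newest match */
    }

    if (!pt_prev)
        return RX_DROP;           /* nobody claimed the packet */

    pt_prev->delivered++;         /* final, deferred delivery */
    return RX_SUCCESS;
}
```

In this model an ARP packet (0x0806) offered to a table holding only an IPv4 handler (0x0800) returns RX_DROP, while adding a PROTO_ALL entry makes the same call return RX_SUCCESS even though nothing would answer the ARP request.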
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-10 7:14 ` Álvaro G. M.
@ 2025-04-10 9:06 ` Álvaro G. M.
  0 siblings, 0 replies; 16+ messages in thread
From: Álvaro G. M. @ 2025-04-10 9:06 UTC (permalink / raw)
  To: Gupta, Suraj
  Cc: netdev@vger.kernel.org, Katakam, Harini, Pandey, Radhey Shyam, Jakub Kicinski

On Thu, 2025-04-10 at 09:14 +0200, Álvaro G. M. wrote:
> On Thu, 2025-04-10 at 08:54 +0200, Álvaro G. M. wrote:
> > On Thu, 2025-04-10 at 06:25 +0000, Gupta, Suraj wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > > -----Original Message-----
> > > > From: Álvaro G. M. <alvaro.gamez@hazent.com>
> > > > Sent: Wednesday, April 9, 2025 6:40 PM
> > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski
> > > > <kuba@kernel.org>
> > > > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta,
> > > > Suraj <Suraj.Gupta2@amd.com>
> > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > > > Packets only received after some buffer is full
> > > >
> > > > Caution: This message originated from an External Source. Use proper caution
> > > > when opening attachments, clicking links, or responding.
> > > >
> > > >
> > > > On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote:
> > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Álvaro G. M. <alvaro.gamez@hazent.com>
> > > > > > Sent: Wednesday, April 9, 2025 4:31 PM
> > > > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub
> > > > > > Kicinski <kuba@kernel.org>
> > > > > > Cc: netdev@vger.kernel.org; Katakam, Harini
> > > > > > <harini.katakam@amd.com>; Gupta, Suraj <Suraj.Gupta2@amd.com>
> > > > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > > > > > Packets only received after some buffer is full
> > > > > >
> > > > > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > > > > > > [...]
> > > > > > > + Going through the details and will get back to you . Just to
> > > > > > > confirm there is no vivado design update ? and we are only
> > > > > > > updating linux kernel to latest?
> > > > > > >
> > > > > >
> > > > > > Hi again,
> > > > > >
> > > > > > I've reconsidered the upgrading approach and I've first upgraded
> > > > > > buildroot and kept the same kernel version (4.4.43). This has the
> > > > > > effect of upgrading gcc from version
> > > > > > 10 to version 13.
> > > > > >
> > > > > > With buildroot's compiled gcc-13 and keeping this same old kernel,
> > > > > > the effect is that the system drops ARP requests. Compiling with
> > > > > > older gcc-10, ARP requests are
> > > > >
> > > > > When the system drops ARP packet - Is it drop by MAC hw or by software layer.
> > > > > Reading MAC stats and DMA descriptors help us know if it reaches
> > > > > software layer or not ?
> > > >
> > > > I'm not sure who is the one dropping packets, I can only check with ethtool -S
> > > > eth0 and this is its output after a few dozens of arpings:
> > > >
> > > > # ifconfig eth0
> > > > eth0      Link encap:Ethernet  HWaddr 06:00:0A:BC:8C:01
> > > >           inet addr:10.188.140.1  Bcast:10.188.143.255  Mask:255.255.248.0
> > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > > >           RX packets:164 errors:0 dropped:99 overruns:0 frame:0
> > > >           TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
> > > >           collisions:0 txqueuelen:1000
> > > >           RX bytes:11236 (10.9 KiB)  TX bytes:1844 (1.8 KiB)
> > > >
> > > > # ethtool -S eth0
> > > > NIC statistics:
> > > >      Received bytes: 13950
> > > >      Transmitted bytes: 2016
> > > >      RX Good VLAN Tagged Frames: 0
> > > >      TX Good VLAN Tagged Frames: 0
> > > >      TX Good PFC Frames: 0
> > > >      RX Good PFC Frames: 0
> > > >      User Defined Counter 0: 0
> > > >      User Defined Counter 1: 0
> > > >      User Defined Counter 2: 0
> > > >
> > > > # ethtool -g eth0
> > > > Ring parameters for eth0:
> > > > Pre-set maximums:
> > > > RX:             4096
> > > > RX Mini:        0
> > > > RX Jumbo:       0
> > > > TX:             4096
> > > > Current hardware settings:
> > > > RX:             1024
> > > > RX Mini:        0
> > > > RX Jumbo:       0
> > > > TX:             128
> > > >
> > > > # ethtool -d eth0
> > > > Offset          Values
> > > > ------          ------
> > > > 0x0000:         00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00
> > > > 0x0010:         00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00
> > > > 0x0020:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > 0x0030:         00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18
> > > > 0x0040:         00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00
> > > > 0x0050:         80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00
> > > > 0x0060:         00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc
> > > > 0x0070:         8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00
> > > > 0x0080:         03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80
> > > > 0x0090:         03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81
> > > >
> > > >
> > > >
> > >
> > > As per registers dump, packet is not dropped by MAC. It's dropping somewhere in the software layer.
>
> Via printk-debugging I've just determined that packets are being dropped here
> because pt_prev is NULL. But I don't understand yet the code of this function.
> The message seems to indicate SKB_DROP_REASON_UNHANDLED_PROTO. How come?
>
> Jan 1 00:01:39 DRVM kern.alert kernel: Dropping packet from else
> Jan 1 00:01:39 DRVM kern.alert kernel: Dropped packet
> Jan 1 00:01:39 DRVM kern.alert kernel: OUT: dropped packet
>
> net/core/dev.c:5719
>         if (pt_prev) {
>                 if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) {
>                         printk(KERN_ALERT "2 Goto drop\n");
>                         goto drop;
>                 }
>                 *ppt_prev = pt_prev;
>         } else {
>                 printk(KERN_ALERT "Dropping packet from else\n");
> drop:
>                 printk(KERN_ALERT "Dropped packet\n");
>                 if (!deliver_exact)
>                         dev_core_stats_rx_dropped_inc(skb->dev);
>                 else
>                         dev_core_stats_rx_nohandler_inc(skb->dev);
>                 kfree_skb_reason(skb, SKB_DROP_REASON_UNHANDLED_PROTO);
>                 /* Jamal, now you will not able to escape explaining
>                  * me how you were going to use this. :-)
>                  */
>                 ret = NET_RX_DROP;
>         }
>
>

I'm testing with this patch to see what's going on

diff --git i/net/core/dev.c w/net/core/dev.c
index b180d175c37f..78a19d5f5d73 100644
--- i/net/core/dev.c
+++ w/net/core/dev.c
@@ -2310,11 +2310,15 @@ static inline void deliver_ptype_list_skb(struct sk_buff *skb,
 {
         struct packet_type *ptype, *pt_prev = *pt;
 
+        printk(KERN_ALERT "deliver_ptype_list_skb, ptype_list = %p", ptype_list);
         list_for_each_entry_rcu(ptype, ptype_list, list) {
+                printk(KERN_ALERT "ptype->type = 0x%04x, type = 0x%04x\n", ptype->type, type);
                 if (ptype->type != type)
                         continue;
-                if (pt_prev)
+                if (pt_prev) {
+                        printk(KERN_ALERT "pt_prev exists, calling deliver_skb\n");
                         deliver_skb(skb, pt_prev, orig_dev);
+                }
                 pt_prev = ptype;
         }
         *pt = pt_prev;

Sending an ARP request dumps this:

Jan 1 00:06:22 DRVM kern.alert kernel: deliver_ptype_list_skb, ptype_list = 8d85f396
Jan 1 00:06:22 DRVM kern.alert kernel: deliver_ptype_list_skb, ptype_list = 43cad5fa

So there seem to be no entries on the ptype list to match the ARP type.

>
> > > Since you started bisecting linux commits, could you please try reverting suspected commit and check if that's actually the first bad commit?
> > > I'll try again to bisect using same compiler version, but this is looking like it's not related to axienet, is it?
> >
> > I already kinda did, please read the whole message quoted below.
> >
> > * To summarize:
> > Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315^ = 679500e385fc4d65c3fac5bfbe6ee55d65698f20 works fine
> > Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315 drops packets
> >
> > But using commit 324cefaf1c723625e93f703d6e6d78e28996b315 and adding printk
> > around suspect lines, solves the issue. Looks a like a compiler bug.
> >
> > * New information from yesterday's email:
> > Reverting commit 324cefaf1c723625e93f703d6e6d78e28996b315 on kernel 6.13.8
> > does not solve the issue. Neither does tinkering around with printks
> > > >
> > > > Running tcpdump makes it so that ifconfig dropped value doesn't increment and
> > > > shows me ARP requests (although it won't reply to them), but just setting the
> > > > interface as promisc do not.
> > > >
> > > > If you can give me any indications on how to gather more data about DMA
> > > > descriptors I'll try my best.
> > > >
> > > > This is using internal's emaclite dma, because when using dmaengine there's no
> > > > dropping of packets, but a big buffering, and kernel 6.13.8, because in series ~5.11
> > > > which I'm also working with, axienet didn't have support for reading statistics from
> > > > the core.
> > > >
> > > > I assume the old dma code inside axienet is to be deprecated, and I would be pretty
> > > > glad to use dmaengine, but that has the buffering problem. So if you want to focus
> > > > efforts on solving that issue I'm completely open to whatever you all deem more
> > > > appropriate.
> > > >
> > >
> > > We're not planning to make DMAengine flow default soon as there is some significant work and optimizations required there which are under progress.
> > > But this buffering issue we didn't observe on our platforms last time we ran it with linux v6.12.
> > >
> >
> > I just tried dmaengine on 6.12 and have the same buffering issue.
> >
> > Did you try on Microblaze too or only on Zynq?
> >
> > > >
> > > > I can even add some ILA to the Vivado design and inspect whatever you think could
> > > > be useful
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > > replied to. Keeping old buildroot version but asking it to use
> > > > > > gcc-11 will cause the same issue with kernel 4.4.43, so something
> > > > > > must have happened in between those gcc versions.
> > > > > >
> > > > > > So this does not look like an axienet driver problem, which I first
> > > > > > thought it was, because who would blame the compiler in first instance?
> > > > > >
> > > > > > But then things started to get even stranger.
> > > > > >
> > > > > > What I did next, was slowly upgrading buildroot and using the kernel
> > > > > > version that buildroot considered "latest" at the point it was
> > > > > > released. I reached a point in which the ARP requests were being
> > > > > > dropped again. This happened on buildroot 2021.11, which still used
> > > > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug
> > > > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by
> > > > > > kernel 5.15.6.
> > > > > >
> > > > > > Using gcc-10, I bisected the kernel and found that this commit was
> > > > > > triggering whatever it is that is happening, around 5.11-rc2:
> > > > > >
> > > > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD)
> > > > > > Author: Menglong Dong <dong.menglong@zte.com.cn>
> > > > > > Date: Mon Jan 11 02:42:21 2021 -0800
> > > > > >
> > > > > >     net: core: use eth_type_vlan in __netif_receive_skb_core
> > > > > >
> > > > > >     Replace the check for ETH_P_8021Q and ETH_P_8021AD in
> > > > > >     __netif_receive_skb_core with eth_type_vlan.
> > > > > >
> > > > > >     Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
> > > > > >     Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cn
> > > > > >     Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > > > > >
> > > > > >
> > > > > > I've been staring at the diff for hours because I can't understand
> > > > > > what can be wrong about this:
> > > > > >
> > > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > > index e4d77c8abe76..267c4a8daa55 100644
> > > > > > --- a/net/core/dev.c
> > > > > > +++ b/net/core/dev.c
> > > > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> > > > > >                  skb_reset_mac_len(skb);
> > > > > >          }
> > > > > >
> > > > > > -        if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> > > > > > -            skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> > > > > > +        if (eth_type_vlan(skb->protocol)) {
> > > > > >                  skb = skb_vlan_untag(skb);
> > > > > >                  if (unlikely(!skb))
> > > > > >                          goto out;
> > > > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> > > > > >                           * find vlan device.
> > > > > >                           */
> > > > > >                          skb->pkt_type = PACKET_OTHERHOST;
> > > > > > -                } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> > > > > > -                           skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> > > > > > +                } else if (eth_type_vlan(skb->protocol)) {
> > > > > >                          /* Outer header is 802.1P with vlan 0, inner header is
> > > > > >                           * 802.1Q or 802.1AD and vlan_do_receive() above could
> > > > > >                           * not find vlan dev for vlan id 0.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Given that eth_type_vlan is simply this:
> > > > > >
> > > > > > static inline bool eth_type_vlan(__be16 ethertype) {
> > > > > >         switch (ethertype) {
> > > > > >         case htons(ETH_P_8021Q):
> > > > > >         case htons(ETH_P_8021AD):
> > > > > >                 return true;
> > > > > >         default:
> > > > > >                 return false;
> > > > > >         }
> > > > > > }
> > > > > >
> > > > > > I've added a small printk to see these values right before the first
> > > > > > time they are checked:
> > > > > >
> > > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d",
> > > > > >        skb->protocol, cpu_to_be16(ETH_P_8021Q),
> > > > > >        cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol));
> > > > > >
> > > > > > And each ARP ping delivers a packet reported as:
> > > > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144,
> > > > > > skb->eth_type_vlan(skb->protocol) = 0
> > > > > >
> > > > > > To add insult to injury, adding this printk line solves the ARP
> > > > > > deafness, so no matter whether I use eth_type_vlan function or
> > > > > > manual comparison, now ARP packets aren't dropped.
> > > > > >
> > > > > > Removing this printk and adding one inside the if-clause that should
> > > > > > not be happening, shows nothing, so neither I can directly inspect
> > > > > > the packets or return value of the wrong working code, nor can I
> > > > > > indirectly proof that the wrong branch of the if is being taken.
> > > > > > This reinforces the idea of a compiler bug, but I very well could be wrong.
> > > > > >
> > > > > > Adding this printk:
> > > > > > diff --git i/net/core/dev.c w/net/core/dev.c
> > > > > > index 267c4a8daa55..a3ae3bcb3a21 100644
> > > > > > --- i/net/core/dev.c
> > > > > > +++ w/net/core/dev.c
> > > > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> > > > > >                                   * check again for vlan id to set OTHERHOST.
> > > > > >                                   */
> > > > > >                                  goto check_vlan_id;
> > > > > > +                } else {
> > > > > > +                        printk(KERN_ALERT "(1) skb->protocol is not type vlan\n");
> > > > > >                  }
> > > > > >                  /* Note: we might in the future use prio bits
> > > > > >                   * and set skb->priority like in vlan_do_receive()
> > > > > >
> > > > > > is even weirder because the same effect: the message does not appear
> > > > > > but ARP requests are answered back. If I remove this printk, ARP requests are dropped.
> > > > > >
> > > > > > I've generated assembly output and this is the difference between
> > > > > > having that extra else with the printk and not having it.
> > > > > >
> > > > > > It doesn't even make much any sense that code would even reach this
> > > > > > region of code because there's no vlan involved in at all here.
> > > > > >
> > > > > > And so here I am again, staring at all this without knowing how to proceed.
> > > > > >
> > > > > > I guess I will be trying different and more modern versions of gcc,
> > > > > > even some precompiled toolchains and see what else may be going on.
> > > > > >
> > > > > > If anyone has any hindsight as to what is causing this or how to
> > > > > > solve it, it'd be great if you could share it.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > --
> > > > > > Álvaro G. M.

^ permalink raw reply related [flat|nested] 16+ messages in thread
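For anyone following along, the decimal values in the quoted printk output and the empty ptype list seen in this message can be checked by hand. What follows is only a rough userspace sketch in Python, not kernel code. It assumes the printk values are raw host-order integers on a little-endian MicroBlaze build (which is what ETH_P_8021Q printing as 129 suggests), and it uses plain EtherType numbers as stand-ins for struct packet_type entries. The helper names as_printed_on_le and deliver_ptype_list are made up for illustration.

```python
import struct

# EtherType values from include/uapi/linux/if_ether.h
ETH_P_ARP = 0x0806
ETH_P_8021Q = 0x8100
ETH_P_8021AD = 0x88A8


def as_printed_on_le(ethertype):
    """Hypothetical helper: the integer a little-endian CPU gets when it
    reads a 16-bit network-order (big-endian) field without swapping."""
    return struct.unpack("<H", struct.pack(">H", ethertype))[0]


def eth_type_vlan(ethertype):
    """Userspace model of the kernel helper: true only for VLAN EtherTypes."""
    return ethertype in (ETH_P_8021Q, ETH_P_8021AD)


def deliver_ptype_list(skb_type, ptype_list, pt_prev=None):
    """Simplified model of deliver_ptype_list_skb(): every matching handler
    except the last one is delivered to at once; the last is handed back."""
    delivered = []
    for ptype in ptype_list:
        if ptype != skb_type:
            continue
        if pt_prev is not None:
            delivered.append(pt_prev)
        pt_prev = ptype
    return pt_prev, delivered


# Byte-swapping is its own inverse, so the same helper decodes the printk value.
protocol = as_printed_on_le(1544)

# With an empty handler list the loop body never runs, so pt_prev stays None.
# In the quoted dev.c code that is the else branch that frees the skb with
# SKB_DROP_REASON_UNHANDLED_PROTO.
pt_prev, delivered = deliver_ptype_list(protocol, [])
```

Under those assumptions 1544 decodes to 0x0806 (ARP), 129 and 43144 are the swapped forms of 0x8100 and 0x88A8, and an ARP frame should never satisfy eth_type_vlan(). The model also shows why a list with no entries leaves pt_prev unset, which matches the "Dropping packet from else" log lines.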
* Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-09 13:09 ` Álvaro G. M.
  2025-04-10 6:25 ` Gupta, Suraj
@ 2025-04-17 16:12 ` Sean Anderson
  1 sibling, 0 replies; 16+ messages in thread
From: Sean Anderson @ 2025-04-17 16:12 UTC (permalink / raw)
  To: Álvaro G. M., Pandey, Radhey Shyam, Jakub Kicinski
  Cc: netdev@vger.kernel.org, Katakam, Harini, Gupta, Suraj

On 4/9/25 09:09, Álvaro G. M. wrote:
> On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> > -----Original Message-----
>> > From: Álvaro G. M. <alvaro.gamez@hazent.com>
>> > Sent: Wednesday, April 9, 2025 4:31 PM
>> > To: Pandey, Radhey Shyam <radhey.shyam.pandey@amd.com>; Jakub Kicinski
>> > <kuba@kernel.org>
>> > Cc: netdev@vger.kernel.org; Katakam, Harini <harini.katakam@amd.com>; Gupta,
>> > Suraj <Suraj.Gupta2@amd.com>
>> > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
>> > Packets only received after some buffer is full
>> >
>> > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
>> > > [...]
>> > > + Going through the details and will get back to you . Just to
>> > > confirm there is no vivado design update ? and we are only updating linux kernel to
>> > > latest?
>> > >
>> >
>> > Hi again,
>> >
>> > I've reconsidered the upgrading approach and I've first upgraded buildroot and kept
>> > the same kernel version (4.4.43). This has the effect of upgrading gcc from version
>> > 10 to version 13.
>> >
>> > With buildroot's compiled gcc-13 and keeping this same old kernel, the effect is that
>> > the system drops ARP requests. Compiling with older gcc-10, ARP requests are
>>
>> When the system drops ARP packet - Is it drop by MAC hw or by software layer.
>> Reading MAC stats and DMA descriptors help us know if it reaches software
>> layer or not ?
>
> I'm not sure who is the one dropping packets, I can only check with
> ethtool -S eth0 and this is its output after a few dozens of arpings:
>
> # ifconfig eth0
> eth0      Link encap:Ethernet  HWaddr 06:00:0A:BC:8C:01
>           inet addr:10.188.140.1  Bcast:10.188.143.255  Mask:255.255.248.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:164 errors:0 dropped:99 overruns:0 frame:0
>           TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:11236 (10.9 KiB)  TX bytes:1844 (1.8 KiB)
>
> # ethtool -S eth0
> NIC statistics:
>      Received bytes: 13950
>      Transmitted bytes: 2016
>      RX Good VLAN Tagged Frames: 0
>      TX Good VLAN Tagged Frames: 0
>      TX Good PFC Frames: 0
>      RX Good PFC Frames: 0
>      User Defined Counter 0: 0
>      User Defined Counter 1: 0
>      User Defined Counter 2: 0

FYI you can do

# ethtool -S net4 --all-groups
Standard stats for net4:
     eth-mac-FramesTransmittedOK: 74
     eth-mac-SingleCollisionFrames: 0
     eth-mac-MultipleCollisionFrames: 0
     eth-mac-FramesReceivedOK: 92
     eth-mac-FrameCheckSequenceErrors: 0
     eth-mac-AlignmentErrors: 0
     eth-mac-FramesWithDeferredXmissions: 0
     eth-mac-LateCollisions: 0
     eth-mac-FramesAbortedDueToXSColls: 0
     eth-mac-MulticastFramesXmittedOK: 38
     eth-mac-BroadcastFramesXmittedOK: 3
     eth-mac-FramesWithExcessiveDeferral: 0
     eth-mac-MulticastFramesReceivedOK: 24
     eth-mac-BroadcastFramesReceivedOK: 33
     eth-mac-InRangeLengthErrors: 0
     eth-ctrl-MACControlFramesTransmitted: 0
     eth-ctrl-MACControlFramesReceived: 0
     eth-ctrl-UnsupportedOpcodesReceived: 0
     rmon-etherStatsUndersizePkts: 0
     rmon-etherStatsOversizePkts: 0
     rmon-etherStatsFragments: 0
     rx-rmon-etherStatsPkts64to64Octets: 19
     rx-rmon-etherStatsPkts65to127Octets: 22
     rx-rmon-etherStatsPkts128to255Octets: 9
     rx-rmon-etherStatsPkts256to511Octets: 3
     rx-rmon-etherStatsPkts512to1023Octets: 27
     rx-rmon-etherStatsPkts1024to1518Octets: 12
     rx-rmon-etherStatsPkts1519to16384Octets: 0
     tx-rmon-etherStatsPkts64to64Octets: 2
     tx-rmon-etherStatsPkts65to127Octets: 52
     tx-rmon-etherStatsPkts128to255Octets: 18
     tx-rmon-etherStatsPkts256to511Octets: 2
     tx-rmon-etherStatsPkts512to1023Octets: 0
     tx-rmon-etherStatsPkts1024to1518Octets: 0
     tx-rmon-etherStatsPkts1519to16384Octets: 0

to get standard statistics.

--Sean

^ permalink raw reply [flat|nested] 16+ messages in thread
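To compare these counters before and after a burst of arpings, the "name: value" lines can be collected into a dictionary. This is only a small Python sketch. parse_stats is an illustrative helper and not part of ethtool, and the sample text is a shortened copy of the output above.

```python
def parse_stats(text):
    """Collect 'name: value' counter lines into a dict, skipping headers."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Header lines such as "Standard stats for net4:" have no number
        # after the last colon and are ignored.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


sample = """\
Standard stats for net4:
     eth-mac-FramesTransmittedOK: 74
     eth-mac-FramesReceivedOK: 92
     eth-mac-FrameCheckSequenceErrors: 0
     eth-mac-BroadcastFramesReceivedOK: 33
"""

stats = parse_stats(sample)
```

If eth-mac-BroadcastFramesReceivedOK grows with each ARP request while the "dropped" value from ifconfig grows too, that would point at the frames reaching the MAC and being discarded later in software.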
* RE: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full
  2025-04-03 5:44 ` Álvaro G. M.
  2025-04-03 5:54 ` Pandey, Radhey Shyam
@ 2025-04-03 13:58 ` Gupta, Suraj
  1 sibling, 0 replies; 16+ messages in thread
From: Gupta, Suraj @ 2025-04-03 13:58 UTC (permalink / raw)
  To: Álvaro G. M.
  Cc: netdev@vger.kernel.org, Jakub Kicinski, Pandey, Radhey Shyam

[AMD Official Use Only - AMD Internal Distribution Only]

Hi Alvaro,

> -----Original Message-----
> From: Álvaro G. M. <alvaro.gamez@hazent.com>
> Sent: Thursday, April 3, 2025 11:15 AM
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: netdev@vger.kernel.org; Pandey, Radhey Shyam
> <radhey.shyam.pandey@amd.com>
> Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> Packets only received after some buffer is full
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> Hi
>
>
> On Wed, 2025-04-02 at 10:00 -0700, Jakub Kicinski wrote:
> > +CC Radhey, maintainer of axienet
>
> Thanks, I don't know why I didn't think of that.
>
> So, I can provide a little more information and I definitely believe now there are some
> issues with this driver.
>
> > On Tue, 01 Apr 2025 12:52:15 +0200 Álvaro "G. M." wrote:
> > > I guess I may have made some mistake in upgrading the DTS to the new
> > > format, although I've tried the two available methods (either setting node "dmas"
> > > or using "axistream-connected"
> > > property) and both methods result in the same boot messages and behavior.
>
> This has happened not to be true, I'm sorry for the confusion. Using node "dmas"
> enables use_dmaengine and produces the effect I explained: data is only received
> after a 2^17 bytes buffer is filled.
>
> If I remove "dmas" entry and provide a "axistream-connected" one, things get a little
> better (but see at the end for some DTS notes). In this mode, in which dmaengine is
> not used but legacy DMA code inside axienet itself, tcpdump -vv shows packets
> incoming at a normal rate. However, the system is not answering to ARP requests:
>

Could you please check ifconfig for any packet drop/error?

> 00:02:37.800814 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
> 00:02:38.801974 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
> 00:02:39.804137 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
> 00:02:40.806434 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
> 00:02:41.808084 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
> 00:02:42.810592 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
> 00:02:43.813155 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.2 tell 10.188.139.1, length 46
>
> Here's the normal answer for a second device running old 4.4.43 kernel connected to
> the same switch:
>
> 00:21:12.057326 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 tell 10.188.139.1, length 46
> 00:21:12.057905 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at 06:00:0a:bc:8c:01 (oui Unknown), length 28
> 00:21:13.059460 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 tell 10.188.139.1, length 46
> 00:21:13.060031 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at 06:00:0a:bc:8c:01 (oui Unknown), length 28
> 00:21:14.060502 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.188.140.1 tell 10.188.139.1, length 46
> 00:21:14.061051 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.188.140.1 is-at 06:00:0a:bc:8c:01 (oui Unknown), length 28
>
> The funny thing is that once I manually add arp entries in both my computer and the
> embedded one, I can establish full TCP communication and iperf3 shows a relatively
> nice speed, similar to the throughput I get with old 4.4.43 kernel.
>
> # arp -s 10.188.139.1 f4:4d:ad:02:11:29
> # iperf3 -c 10.188.139.1
> Connecting to host 10.188.139.1, port 5201
> [  5] local 10.188.140.2 port 55480 connected to 10.188.139.1 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.01   sec  3.63 MBytes  30.1 Mbits/sec    0    130 KBytes
> [  5]   1.01-2.01   sec  3.75 MBytes  31.5 Mbits/sec    0    130 KBytes
> [  5]   2.01-3.01   sec  3.63 MBytes  30.4 Mbits/sec    0    130 KBytes
> [  5]   3.01-4.01   sec  3.75 MBytes  31.4 Mbits/sec    0    130 KBytes
> [  5]   4.01-5.01   sec  3.75 MBytes  31.4 Mbits/sec    0    130 KBytes
> [  5]   5.01-6.01   sec  3.75 MBytes  31.5 Mbits/sec    0    130 KBytes
> [  5]   6.01-7.01   sec  3.75 MBytes  31.6 Mbits/sec    0    130 KBytes
> [  5]   7.01-7.75   sec  2.63 MBytes  29.5 Mbits/sec    0    130 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-7.75   sec  28.6 MBytes  31.0 Mbits/sec    0             sender
> [  5]   0.00-7.75   sec  0.00 Bytes   0.00 bits/sec                   receiver
> iperf3: interrupt - the client has terminated
> # iperf3 -c 10.188.139.1 -R
> Connecting to host 10.188.139.1, port 5201
> Reverse mode, remote host 10.188.139.1 is sending
> [  5] local 10.188.140.2 port 45582 connected to 10.188.139.1 port 5201
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.03   sec  5.13 MBytes  41.9 Mbits/sec
> [  5]   1.03-2.03   sec  5.38 MBytes  44.8 Mbits/sec
> [  5]   2.03-3.02   sec  5.38 MBytes  45.6 Mbits/sec
> [  5]   3.02-4.02   sec  5.38 MBytes  45.2 Mbits/sec
> [  5]   4.02-5.01   sec  5.38 MBytes  45.4 Mbits/sec
> [  5]   5.01-5.30   sec  1.50 MBytes  43.2 Mbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-5.30   sec  0.00 Bytes   0.00 bits/sec                   sender
> [  5]   0.00-5.30   sec  28.1 MBytes  44.5 Mbits/sec                  receiver
> iperf3: interrupt - the client has terminated
>
> I had never seen a device able to fully establish communication except for replying to
> MAC requests, so I'm not sure what's happening here.
>
>
> On the other hand, and since I don't know how to debug this ARP issue, I went back
> to see if I could diagnose what's happening in DMA Engine mode, so I peeked at the
> code and I saw an asymmetry between RX and TX, which sounded good given that
> in dmaengine mode TX works perfectly (or so it seems) and RX is heavily buffered.
> This asymmetry lies precisely on the number of SG blocks and number of skb
> buffers.
>
> Both bd_nums are defined in the same way:
>         lp->rx_bd_num = RX_BD_NUM_DEFAULT; // = 1024
>         lp->tx_bd_num = TX_BD_NUM_DEFAULT; // = 128
>
>
> But the skb ring size is defined in a different fashion:
>         lp->tx_skb_ring = kcalloc(TX_BD_NUM_MAX, sizeof(*lp->tx_skb_ring), // = 4096
>                                   GFP_KERNEL);
>         ...
>         lp->rx_skb_ring = kcalloc(RX_BUF_NUM_DEFAULT, sizeof(*lp->rx_skb_ring), // = 128
>                                   GFP_KERNEL);
>
> So, for TX we allocate space for up to 4096 buffers but by default use 128.
> For RX we allocate space for 128 buffers but somehow are setting 1024 as the
> default bd number.
>
> The fact that RX_BD_NUM_DEFAULT is used nowhere else is also a signal that
> there was some mistake here, so I went and replaced all RX_BUF_NUM_DEFAULT
> occurrences with RX_BD_NUM_DEFAULT, so that both TX and RX skb rings are
> declared and operated with using the same strategy:
>
> sed -i '/^#define/!s#RX_BUF_NUM_DEFAULT#RX_BD_NUM_MAX#g' xilinx_axienet_main.c
>
> Doing this solved the buffering problem, although the system still doesn't reply to
> ARP requests, and when I tried to run an iperf3 test after manually adding arp tables,
> the kernel segfaulted (so I probably shouldn't have blindly 'sed' like that :)
>
> # iperf3 -c 10.188.139.1
> Connecting to host 10.188.139.1, port 5201
> [  5] local 10.188.140.2 port 46356 connected to 10.188.139.1 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.01   sec   640 KBytes  5.18 Mbits/sec    3   84.8 KBykernel task_size exceed
> Oops: Exception in kernel mode, sig: 11
> CPU: 0 UID: 0 PID: 147 Comm: iperf3 Not tainted 6.13.8 #13
>  Registers dump: mode=8269B900
>  r1=00000000, r2=00000000, r3=00000000, r4=00000010
>  r5=00000000, r6=000005F2, r7=FFFF7FFF, r8=00000000
>  r9=00000000, r10=00000000, r11=00000000, r12=CF5FF24C
>  r13=00000000, r14=C241AB70, r15=C0383EB8, r16=00000000
>  r17=C0383EC0, r18=000005F0, r19=C10124A0, r20=480F8520
>  r21=4831F960, r22=00000000, r23=00000000, r24=FFFFFFEA
>  r25=C12BE0A8, r26=C12BE03C, r27=C12BE020, r28=00000122
>  r29=00000100, r30=000065A2, r31=C120F780, rPC=C0383EC0
>  msr=000046A2, ear=FFFFFFFA, esr=00000312, fsr=00000000
> Kernel panic - not syncing: Aiee, killing interrupt handler!
> ---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]---
> tes
>
> I couldn't see what was wrong with new code, so I just went and replaced the
> RX_BD_NUM_DEFAULT value from 1024 down to 128, so it's now the same size as
> its TX counterpart, but the kernel segfaulted again when trying to measure
> throughput. Sadly, my kernel debugging abilities are not much stronger than this, so
> I'm stuck at this point but firmly believe there's something wrong here, although I
> can't see what it is.
>
> Any help will be greatly appreciated.
>

This doesn't look like the reason, as the driver doesn't use lp->rx_bd_num and lp->tx_bd_num to traverse the skb ring in the DMAengine flow. It uses axienet_get_rx_desc() and axienet_get_tx_desc() respectively, which use the same size as allocated. The only difference between working and non-working that I can see is the increased Rx skb ring size. But later you mentioned you tried to bring it down to 128; could you please confirm that small-size transfers still work?

FYI, basic ping and iperf both work for us in the DMAengine flow for AXI Ethernet 1G designs. We tested in full-duplex mode. But I can see half duplex in your case; could you please confirm whether that is expected and correct?

>
> DTS NOTES:
> Using old DMA code inside xilinx_axienet_main.c requires removing "dmas" entry
> and add a reference to DMA device either via axistream-connected or by adding
> resources manually to the node. Referring to the node linked by axistream-connected
> requires a DMA node to exist, but its compatible string can't be xlnx,axi-dma-1.00.a,
> because then AXI DMA driver will lock onto it and axienet will complain
> about the device being busy. So my solution for this is to use a not compatible string.
> As such, with the following DTS I can establish TCP connections as long as ARP
> tables are manually entered:
>
>
> axi_ethernet_0_dma: dma@41e00000 {
>         /* NOTE THE NOT */
>         compatible = "notxlnx,axi-dma-1.00.a";
>         #dma-cells = <1>;
>         reg = <0x41e00000 0x10000>;
>         interrupt-parent = <&microblaze_0_axi_intc>;
>         interrupts = <7 1 8 1>;
>         xlnx,addrwidth = <32>; // Address width in bits
>         xlnx,datawidth = <32>;
>         xlnx,include-sg;
>         xlnx,sg-length-width = <16>;
>         xlnx,include-dre = <1>;
>         xlnx,axistream-connected = <1>;
>         xlnx,irq-delay = <1>;
>         dma-channels = <2>;
>         clock-names = "s_axi_lite_aclk", "m_axi_mm2s_aclk", "m_axi_s2mm_aclk",
>                       "m_axi_sg_aclk";
>         clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
>         dma-channel@41e00000 {
>                 compatible = "xlnx,axi-dma-mm2s-channel";
>                 xlnx,include-dre = <1>;
>                 interrupts = <7 1>;
>                 xlnx,datawidth = <32>;
>         };
>         dma-channel@41e00030 {
>                 compatible = "xlnx,axi-dma-s2mm-channel";
>                 xlnx,include-dre = <1>;
>                 interrupts = <8 1>;
>                 xlnx,datawidth = <32>;
>         };
> };
> axi_ethernet_eth: ethernet@40c00000 {
>         compatible = "xlnx,axi-ethernet-1.00.a";
>         reg = <0x40c00000 0x40000>;
>         phy-handle = <&phy1>;
>         interrupt-parent = <&microblaze_0_axi_intc>;
>         interrupts = <3 0>;
>         xlnx,rxmem = <0x1000>;
>         max-speed = <100000>;
>         phy-mode = "mii";
>         xlnx,txcsum = <0x2>;
>         xlnx,rxcsum = <0x2>;
>         clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk";
>         clocks = <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>, <&clk_bus_0>;
>         axistream-connected = <&axi_ethernet_0_dma>;
>         dma-names = "tx_chan0", "rx_chan0";
>         mdio {
>                 #address-cells = <1>;
>                 #size-cells = <0>;
>                 phy1: ethernet-phy@1 {
>                         device_type = "ethernet-phy";
>                         reg = <1>;
>                 };
>         };
> };
>
> So this mode of working would definitely NOT need AXI DMA, and this hack with the
> compatible string should not be needed if the dependency with AXI DMA was
> removed.
>
> Best regards,
>
> --
> Álvaro G. M.

^ permalink raw reply [flat|nested] 16+ messages in thread
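As a side note on the ring-size discussion in this message: the reply says the DMAengine flow looks up skb ring entries through axienet_get_rx_desc() and axienet_get_tx_desc(), using the same size the rings were allocated with. The Python sketch below shows only the general idea of mapping a free-running index onto a fixed-size ring. The power-of-two mask is an assumption made for illustration and has not been checked against the driver source. ring_slot is a hypothetical name, and the two sizes are the values quoted in the thread.

```python
RX_BUF_NUM_DEFAULT = 128   # RX skb ring entries allocated (value quoted in the thread)
TX_BD_NUM_MAX = 4096       # TX skb ring entries allocated (value quoted in the thread)


def ring_slot(index, ring_size):
    """Assumed scheme: wrap a free-running index onto a power-of-two ring."""
    if ring_size <= 0 or ring_size & (ring_size - 1):
        raise ValueError("ring size must be a power of two")
    return index & (ring_size - 1)


# An index that keeps counting past the ring size lands back inside the ring,
# so the lookup never walks off the end of the allocation.
rx_slots = [ring_slot(i, RX_BUF_NUM_DEFAULT) for i in (0, 127, 128, 1023, 1024)]
tx_slots = [ring_slot(i, TX_BD_NUM_MAX) for i in (0, 4095, 4096, 4097)]
```

Under that assumption, the lookup stays in bounds only while the mask constant and the allocation size agree, which is why renaming one constant but not the other can matter.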
end of thread, other threads:[~2025-04-21 11:12 UTC | newest]

Thread overview: 16+ messages
2025-04-01 10:52 Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze: Packets only received after some buffer is full Álvaro G. M.
2025-04-02 17:00 ` Jakub Kicinski
2025-04-03 5:44 ` Álvaro G. M.
2025-04-03 5:54 ` Pandey, Radhey Shyam
2025-04-03 6:10 ` Álvaro G. M.
2025-04-09 11:00 ` Álvaro G. M.
2025-04-09 11:14 ` Pandey, Radhey Shyam
2025-04-09 13:09 ` Álvaro G. M.
2025-04-10 6:25 ` Gupta, Suraj
2025-04-10 6:54 ` Álvaro G. M.
2025-04-10 7:10 ` Gupta, Suraj
2025-04-21 11:12 ` Álvaro G. M.
2025-04-10 7:14 ` Álvaro G. M.
2025-04-10 9:06 ` Álvaro G. M.
2025-04-17 16:12 ` Sean Anderson
2025-04-03 13:58 ` Gupta, Suraj