Hi Paolo,

> On 5/28/25 12:53 PM, Lukasz Majewski wrote:
> >> On 5/22/25 9:54 AM, Lukasz Majewski wrote:
> >>> +/* dynamicms MAC address table learn and migration */
> >>> +static void mtip_aging_timer(struct timer_list *t)
> >>> +{
> >>> +    struct switch_enet_private *fep = from_timer(fep, t, timer_aging);
> >>> +
> >>> +    fep->curr_time = mtip_timeincrement(fep->curr_time);
> >>> +
> >>> +    mod_timer(&fep->timer_aging,
> >>> +              jiffies + msecs_to_jiffies(LEARNING_AGING_INTERVAL));
> >>> +}
> >>
> >> It's unclear to me why you need to maintain a timer just to update
> >> a timestamp?!?
> >>
> >
> > This timestamp is afterwards used in:
> > mtip_atable_dynamicms_learn_migration(), which in turn manages the
> > entries in switch "dynamic" table (it is one of the fields in the
> > record.
> >
> >> (jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
> >> AT_DENTRY_TIMESTAMP_WIDTH) - 1)
> >>
> >
> > If I understood you correctly - I shall remove the timer and then
> > just use the above line (based on jiffies) when
> > mtip_atable_dynamicms_learn_migration() is called (and it requires
> > the timestamp)?
> >
> > Otherwise the mtip_timeincrement() seems like a nice wrapper on
> > incrementing the timestamp.
>
> Scheduling a timer to obtain a value you can have for free is not a
> good resource usage strategy. Note that is a pending question/check
> above: verify that the suggested expression yield the expected value
> in all the possible use-case.

This is a bit trickier than just taking the value from jiffies.

The current code provides a monotonic time "base", starting from 0, for
learning and managing entries in the MTIP internal routing tables.

To be more specific: fep->curr_time is a value incremented every ~10 ms.

Simply masking jiffies would not provide these properties.

However, I've rewritten the relevant portions where GENMASK() could be
used, to simplify the code and make it more readable.
>
> >>> +    if (!fep->link[0] && !fep->link[1]) {
> >>> +        /* Link is down or autonegotiation is in progress. */
> >>> +        netif_stop_queue(dev);
> >>> +        spin_unlock_irqrestore(&fep->hw_lock, flags);
> >>> +        return NETDEV_TX_BUSY;
> >>
> >> Intead you should probably stop the queue when such events happen
> >
> > Please correct me if I'm wrong - the netif_stop_queue(dev); is
> > called before return. Shall something different be also done?
>
> The xmit routine should assume the link is up and the tx ring has
> enough free slot to enqueue a packet.

In the MTIP driver there is a circular buffer of 16 "sets" of
descriptors (allocated as coherent), each with a corresponding buffer
(mapped with dma_map_single() at start).

The size of each "buffer" is set to 2048 B, to accommodate at least a
single packet.

> After enqueueing it should
> check for enough space availble for the next xmit and stop the queue,
> likely using the netif_txq_maybe_stop() helper.

The reason for not using netif_txq_maybe_stop() is that I'm not using a
"txq" (netdev_queue).

With the current code, netif_stop_queue() looks like the most suitable
helper from the network API.

>
> Documentation/networking/driver.rst
>
> >>> +    }
> >>> +
> >>> +    /* Clear all of the status flags */
> >>> +    status &= ~BD_ENET_TX_STATS;
> >>> +
> >>> +    /* Set buffer length and buffer pointer */
> >>> +    bufaddr = skb->data;
> >>> +    bdp->cbd_datlen = skb->len;
> >>> +
> >>> +    /* On some FEC implementations data must be aligned on
> >>> +     * 4-byte boundaries. Use bounce buffers to copy data
> >>> +     * and get it aligned.
> >>> +     */
> >>> +    if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
> >>> +        unsigned int index;
> >>> +
> >>> +        index = bdp - fep->tx_bd_base;
> >>> +        memcpy(fep->tx_bounce[index],
> >>> +               (void *)skb->data, skb->len);
> >>> +        bufaddr = fep->tx_bounce[index];
> >>> +    }
> >>> +
> >>> +    if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> >>> +        swap_buffer(bufaddr, skb->len);
> >>
> >> Ouch, the above will kill performances.
> >
> > This unfortunately must be done in such a way (the same approach is
> > present on fec_main.c) as the IP block is implemented in such a way
> > (explicit conversion from big endian to little endian).
> >
> >> Also it looks like it will
> >> access uninitialized memory if skb->len is not 4 bytes aligned.
> >>
> >
> > There is a few lines above a special code to prevent from such a
> > situation ((unsigned long)bufaddr & MTIP_ALIGNMENT).
>
> The problem here is not with memory buffer alignment, but with the
> packet length, that can be not a multiple of 4. In such a case the
> last swap will do an out-of-bound read touching uninitialized data.

In the init function, each buffer is allocated with a size of 2048
bytes, so there is no such threat.

>
> >>> +    bdp->cbd_sc = status;
> >>> +
> >>> +    netif_trans_update(dev);
> >>> +    skb_tx_timestamp(skb);
> >>> +
> >>> +    /* For port separation - force sending via specified port */
> >>> +    if (!fep->br_offload && port != 0)
> >>> +        mtip_forced_forward(fep, port, 1);
> >>> +
> >>> +    /* Trigger transmission start */
> >>> +    writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);
> >>>
> >>
> >> Possibly you should check skb->xmit_more to avoid ringing the
> >> doorbell when not needed.
> >
> > I couldn't find skb->xmit_more in the current sources. Instead,
> > there is netdev_xmit_more().
>
> Yeah, I referred to the old code, sorry.
>
> > However, the TX code just is supposed to setup one frame
> > transmission and hence there is no risk that we trigger "empty"
> > transmission.
>
> The point is that doorbell ringing is usually very expensive (slow)
> for the H/W. And is not needed when netdev_xmit_more() is true,
> because the another xmit operation will follow. If you care about
> performances you should leverage such info.

My impression is that this matters most for network devices that have
many queues with separate priorities.
In my case, I have a single uDMA0 port with a single RX and TX circular
buffer (16 packets can be "queued").

>
> >
> >>> +    /* First, grab all of the stats for the incoming packet.
> >>> +     * These get messed up if we get called due to a busy condition.
> >>> +     */
> >>> +    bdp = fep->cur_rx;
> >>> +
> >>> +    while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> >>> +        if (pkt_received >= budget)
> >>> +            break;
> >>> +
> >>> +        pkt_received++;
> >>> +        /* Since we have allocated space to hold a complete frame,
> >>> +         * the last indicator should be set.
> >>> +         */
> >>> +        if ((status & BD_ENET_RX_LAST) == 0)
> >>> +            dev_warn_ratelimited(&dev->dev,
> >>> +                                 "SWITCH ENET: rcv is not +last\n");
> >>> +
> >>> +        if (!fep->usage_count)
> >>> +            goto rx_processing_done;
> >>> +
> >>> +        /* Check for errors. */
> >>> +        if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO |
> >>> +                      BD_ENET_RX_CR | BD_ENET_RX_OV)) {
> >>> +            dev->stats.rx_errors++;
> >>> +            if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH)) {
> >>> +                /* Frame too long or too short. */
> >>> +                dev->stats.rx_length_errors++;
> >>> +            }
> >>> +            if (status & BD_ENET_RX_NO) /* Frame alignment */
> >>> +                dev->stats.rx_frame_errors++;
> >>> +            if (status & BD_ENET_RX_CR) /* CRC Error */
> >>> +                dev->stats.rx_crc_errors++;
> >>> +            if (status & BD_ENET_RX_OV) /* FIFO overrun */
> >>> +                dev->stats.rx_fifo_errors++;
> >>> +        }
> >>> +
> >>> +        /* Report late collisions as a frame error.
> >>> +         * On this error, the BD is closed, but we don't know what we
> >>> +         * have in the buffer. So, just drop this frame on the floor.
> >>> +         */
> >>> +        if (status & BD_ENET_RX_CL) {
> >>> +            dev->stats.rx_errors++;
> >>> +            dev->stats.rx_frame_errors++;
> >>> +            goto rx_processing_done;
> >>> +        }
> >>> +
> >>> +        /* Process the incoming frame */
> >>> +        pkt_len = bdp->cbd_datlen;
> >>> +        data = (__u8 *)__va(bdp->cbd_bufaddr);
> >>> +
> >>> +        dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> >>> +                         bdp->cbd_datlen, DMA_FROM_DEVICE);
> >>
> >> I have read your explaination WRT unmap/map. Actually you don't
> >> need to do any mapping here,
> >
> > There are 16 cbd_t descriptors allocated (as dma_alloc_coherent).
> > Those descriptors contain pointer to data (being read in this
> > case).
>
> I'm referring to the actual packet payload, that is the buffer at
> bdp-cbd_bufaddr with len bdp->cbd_datlen; I'm not discussing the
> descriptors contents.

+1

>
> > Hence the need to perform dma_map_single() for each descriptor,
>
> You are not unmapping the descriptor, you are unmapping the packet
> payload.

+1

>
> >> since you are unconditionally copying the
> >> whole buffer (why???)
> >
> > Only the value of
> > pkt_len = bdp->cbd_datlen; is copied to SKB (after byte
> > swap_buffer()).
>
> The relevant line is:
>
>     skb_copy_to_linear_data(skb, data, pkt_len);
>
> AFAICS that copies whole packet contents, which is usually quite
> sub-optimal from performance PoV.
>

fec_main.c just assigns:
data = skb->data;

so I would prefer to keep:
skb_copy_to_linear_data(skb, data, pkt_len);

> >> and re-using it.
> >>
> >> Still you need a dma_sync_single() to ensure the CPUs see the
> >> correct data.
> >
> > The descriptors - i.e. struct cbd_t fields are allocated with
> > dma_alloc_coherent(), so this is OK.
>
> I'm talking about packets contents, not packet descriptors. Please
> re-read the above and have a look at other drivers code.

Using dma_sync_single_for_cpu() in mtip_switch_rx() works without
issues.
>
> An additional point that I missed in the previous review is that the
> rx allocation schema is quite uncorrect. At ring initialization time
> you allocate full skbs, while what you need and use is just raw
> buffers for the packet payload. Instead you could/should use the page
> pool:
>
> Documentation/networking/page_pool.rst
>

Yes, for the RX packet payload a page of 2048 bytes is allocated.

By using the DMA page pool I can state the same maximum size, but the
memory usage can be much more flexible.

> That will also help doing the right thing WRT DMA handling.
>

dma_sync_single_for_cpu() should work correctly with the current
approach as well.

> >> This patch is really too big, I'm pretty sure I missed some
> >> relevant issues. You should split it in multiple ones: i.e.
> >> initialization and h/w access, rx/tx, others ndos.
> >
> > It is quite hard to "scatter" this patch as:
> >
> > 1. I've already split it to several files (which correspond to
> > different "logical" entities - like mtipl2sw_br.c).
> > 2. The mtipl2sw.c file is the smallest part of the "core" of the
> > driver.
> > 3. If I split it, then at some point I would break bisectability for
> > imx28.
>
> Note that each patch don't need to provide complete functionality.
> i.e. patch 1 could implement ndo_open()/close and related helper,
> leaving ndo_start_xmit() and napi_poll empty and avoid allocating the
> rx buffers. patch 2 could implement the rx patch, patch 3 the tx path.
>

Yes, this seems to be a good idea. I will implement this approach.

> The only constraint is that each patch will build successufully, which
> is usually easy to achieve.

+1

>
> A 2K lines patches will probably lead to many more iterations and
> unhappy (or no) reviewers.

The problem is that all the patches "around" this driver (like *yaml,
bindings, defconfig) would get outdated very quickly if they are not
pulled into mainline, so work that is already done would need to be
redone...
>
> /P
>

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de