Hi Paolo,

> On 5/28/25 12:53 PM, Lukasz Majewski wrote:
> >> On 5/22/25 9:54 AM, Lukasz Majewski wrote:  
> >>> +/* dynamicms MAC address table learn and migration */
> >>> +static void mtip_aging_timer(struct timer_list *t)
> >>> +{
> >>> +	struct switch_enet_private *fep = from_timer(fep, t,
> >>> timer_aging); +
> >>> +	fep->curr_time = mtip_timeincrement(fep->curr_time);
> >>> +
> >>> +	mod_timer(&fep->timer_aging,
> >>> +		  jiffies +
> >>> msecs_to_jiffies(LEARNING_AGING_INTERVAL)); +}    
> >>
> >> It's unclear to me why you need to maintain a timer just to update
> >> a timestamp?!?
> >>  
> > 
> > This timestamp is afterwards used in:
> > mtip_atable_dynamicms_learn_migration(), which in turn manages the
> > entries in switch "dynamic" table (it is one of the fields in the
> > record.
> >   
> >> (jiffies >> msecs_to_jiffies(LEARNING_AGING_INTERVAL)) & ((1 <<
> >> AT_DENTRY_TIMESTAMP_WIDTH) - 1)
> >>  
> > 
> > If I understood you correctly - I shall remove the timer and then
> > just use the above line (based on jiffies) when
> > mtip_atable_dynamicms_learn_migration() is called (and it requires
> > the timestamp)?
> > 
> > Otherwise the mtip_timeincrement() seems like a nice wrapper on
> > incrementing the timestamp.  
> 
> Scheduling a timer to obtain a value you can have for free is not a
> good resource usage strategy. Note that is a pending question/check
> above: verify that the suggested expression yield the expected value
> in all the possible use-case.

This is a bit trickier than just deriving the value from jiffies.

The current code provides a monotonic time "base", starting from 0, for
learning and managing entries in the internal MTIP routing tables.

To be more specific - fep->curr_time is a value incremented every
~10 ms.

Simply masking jiffies would not provide these properties.

However, I've rewritten the relevant portions where GENMASK() could be
used to simplify the code and make it more readable.
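
To illustrate the difference, here is a minimal userspace sketch (not
the driver code). TS_WIDTH, ts_increment() and ts_from_ticks() are
hypothetical names, and the width of 10 bits is only an assumption - the
driver uses AT_DENTRY_TIMESTAMP_WIDTH:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field width, for illustration only. */
#define TS_WIDTH 10u
/* Same bits as the kernel's GENMASK(TS_WIDTH - 1, 0). */
#define TS_MASK ((1u << TS_WIDTH) - 1u)

/* Counter-based time base: starts at 0, advances by one per tick of
 * the aging timer and wraps at the field width.
 */
static uint32_t ts_increment(uint32_t ts)
{
	return (ts + 1u) & TS_MASK;
}

/* Clock-derived variant: the value depends on the absolute tick count
 * 'now', so it does not necessarily start at 0.
 */
static uint32_t ts_from_ticks(uint64_t now, uint32_t ticks_per_interval)
{
	assert(ticks_per_interval != 0);
	return (uint32_t)(now / ticks_per_interval) & TS_MASK;
}
```

The first helper gives a time base which always starts from 0, while
the second one starts from whatever the free-running counter holds
when it is first sampled.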

> 
> >>> +	if (!fep->link[0] && !fep->link[1]) {
> >>> +		/* Link is down or autonegotiation is in
> >>> progress. */
> >>> +		netif_stop_queue(dev);
> >>> +		spin_unlock_irqrestore(&fep->hw_lock, flags);
> >>> +		return NETDEV_TX_BUSY;    
> >>
> >> Intead you should probably stop the queue when such events happen  
> > 
> > Please correct me if I'm wrong - the netif_stop_queue(dev); is
> > called before return. Shall something different be also done?  
> 
> The xmit routine should assume the link is up and the tx ring has
> enough free slot to enqueue a packet.

In the MTIP driver there is a circular buffer of 16 "sets" of
descriptors (allocated as coherent memory), each with a corresponding
data buffer (mapped with dma_map_single() at start).

The size of each "buffer" is set to 2048 B to accommodate at least a
single packet.

> After enqueueing it should
> check for enough space availble for the next xmit and stop the queue,
> likely using the netif_txq_maybe_stop() helper.

The reason for not using netif_txq_maybe_stop() is that I'm not
using the "txq" (netdev_queue).

With the current code it looks like netif_stop_queue() is the most
suitable one from the network API.
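
For reference, the "check for space after enqueueing" scheme suggested
above can be modelled with netif_stop_queue() alone. This is only a
userspace sketch; struct tx_ring_model, tx_enqueue() and tx_complete()
are hypothetical names, and the "stopped" flag stands in for the real
queue state:

```c
#include <assert.h>
#include <stdbool.h>

#define TX_RING_SIZE 16u	/* 16 descriptors, as described above */

struct tx_ring_model {
	unsigned int used;	/* descriptors currently owned by the HW */
	bool stopped;		/* models the netif_stop_queue() state */
};

/* Enqueue one frame, then stop the queue if no slot is left for the
 * next one. Returns false only if called with a full ring.
 */
static bool tx_enqueue(struct tx_ring_model *r)
{
	assert(r->used <= TX_RING_SIZE);
	if (r->used == TX_RING_SIZE)
		return false;	/* not reached if the queue was stopped in time */

	r->used++;
	if (r->used == TX_RING_SIZE)
		r->stopped = true;	/* driver: netif_stop_queue(dev) */

	return true;
}

/* TX completion: release one slot and wake a stopped queue. */
static void tx_complete(struct tx_ring_model *r)
{
	if (r->used == 0)
		return;

	r->used--;
	if (r->stopped)
		r->stopped = false;	/* driver: netif_wake_queue(dev) */
}
```

With this ordering the xmit routine is never entered with a full ring,
so the NETDEV_TX_BUSY return becomes an exceptional path.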

> 
> Documentation/networking/driver.rst
> 
> >>> +	}
> >>> +
> >>> +	/* Clear all of the status flags */
> >>> +	status &= ~BD_ENET_TX_STATS;
> >>> +
> >>> +	/* Set buffer length and buffer pointer */
> >>> +	bufaddr = skb->data;
> >>> +	bdp->cbd_datlen = skb->len;
> >>> +
> >>> +	/* On some FEC implementations data must be aligned on
> >>> +	 * 4-byte boundaries. Use bounce buffers to copy data
> >>> +	 * and get it aligned.
> >>> +	 */
> >>> +	if ((unsigned long)bufaddr & MTIP_ALIGNMENT) {
> >>> +		unsigned int index;
> >>> +
> >>> +		index = bdp - fep->tx_bd_base;
> >>> +		memcpy(fep->tx_bounce[index],
> >>> +		       (void *)skb->data, skb->len);
> >>> +		bufaddr = fep->tx_bounce[index];
> >>> +	}
> >>> +
> >>> +	if (fep->quirks & FEC_QUIRK_SWAP_FRAME)
> >>> +		swap_buffer(bufaddr, skb->len);    
> >>
> >> Ouch, the above will kill performances.  
> > 
> > This unfortunately must be done in such a way (the same approach is
> > present on fec_main.c) as the IP block is implemented in such a way
> > (explicit conversion from big endian to little endian).
> >   
> >> Also it looks like it will
> >> access uninitialized memory if skb->len is not 4 bytes aligned.
> >>  
> > 
> > There is a few lines above a special code to prevent from such a
> > situation ((unsigned long)bufaddr & MTIP_ALIGNMENT).  
> 
> The problem here is not with memory buffer alignment, but with the
> packet length, that can be not a multiple of 4. In such a case the
> last swap will do an out-of-bound read touching uninitialized data.

In the init function the allocation size for each buffer is set to
2048 bytes, so there is no such threat.
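
A minimal userspace sketch of the arithmetic, assuming the swap operates
on one of those 2048-byte buffers. BUF_SIZE, swap_span() and
swap_words() are hypothetical names used only for illustration, not
the driver's swap_buffer():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 2048u	/* per-buffer allocation size stated above */

/* Number of bytes touched by a 32-bit-wise swap over 'len' bytes:
 * 'len' rounded up to the next multiple of 4.
 */
static size_t swap_span(size_t len)
{
	return (len + 3u) & ~(size_t)3u;
}

/* Reverse the byte order of every 32-bit word covering buf[0..len).
 * When len is not a multiple of 4, the last word extends up to
 * 3 bytes past len, so the buffer must hold swap_span(len) bytes.
 */
static void swap_words(uint8_t *buf, size_t len)
{
	assert(swap_span(len) <= BUF_SIZE);

	for (size_t i = 0; i < len; i += 4) {
		uint8_t t;

		t = buf[i];     buf[i]     = buf[i + 3]; buf[i + 3] = t;
		t = buf[i + 1]; buf[i + 1] = buf[i + 2]; buf[i + 2] = t;
	}
}
```

For a 1518-byte frame swap_span() gives 1520, which stays well inside
the 2048-byte allocation; the bytes past skb->len are still read and
written, so this only holds for buffers of that size.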

> 
> >>> +	bdp->cbd_sc = status;
> >>> +
> >>> +	netif_trans_update(dev);
> >>> +	skb_tx_timestamp(skb);
> >>> +
> >>> +	/* For port separation - force sending via specified port
> >>> */
> >>> +	if (!fep->br_offload && port != 0)
> >>> +		mtip_forced_forward(fep, port, 1);
> >>> +
> >>> +	/* Trigger transmission start */
> >>> +	writel(MCF_ESW_TDAR_X_DES_ACTIVE, fep->hwp + ESW_TDAR);
> >>>   
> >>
> >> Possibly you should check skb->xmit_more to avoid ringing the
> >> doorbell when not needed.  
> > 
> > I couldn't find skb->xmit_more in the current sources. Instead,
> > there is netdev_xmit_more().  
> 
> Yeah, I referred to the old code, sorry.
> 
> > However, the TX code just is supposed to setup one frame
> > transmission and hence there is no risk that we trigger "empty"
> > transmission.  
> 
> The point is that doorbell ringing is usually very expensive (slow)
> for the H/W. And is not needed when netdev_xmit_more() is true,
> because the another xmit operation will follow. If you care about
> performances you should leverage such info.

My impression is that this matters most for network devices with many
queues with separate priorities.

In my case - I have a single uDMA0 port with a single RX and a single
TX circular buffer (16 packets can be "queued").

> 
> >   
> >>> +	/* First, grab all of the stats for the incoming packet.
> >>> +	 * These get messed up if we get called due to a busy
> >>> condition.
> >>> +	 */
> >>> +	bdp = fep->cur_rx;
> >>> +
> >>> +	while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) {
> >>> +		if (pkt_received >= budget)
> >>> +			break;
> >>> +
> >>> +		pkt_received++;
> >>> +		/* Since we have allocated space to hold a
> >>> complete frame,
> >>> +		 * the last indicator should be set.
> >>> +		 */
> >>> +		if ((status & BD_ENET_RX_LAST) == 0)
> >>> +			dev_warn_ratelimited(&dev->dev,
> >>> +					     "SWITCH ENET: rcv is
> >>> not +last\n"); +
> >>> +		if (!fep->usage_count)
> >>> +			goto rx_processing_done;
> >>> +
> >>> +		/* Check for errors. */
> >>> +		if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH |
> >>> BD_ENET_RX_NO |
> >>> +			      BD_ENET_RX_CR | BD_ENET_RX_OV)) {
> >>> +			dev->stats.rx_errors++;
> >>> +			if (status & (BD_ENET_RX_LG |
> >>> BD_ENET_RX_SH)) {
> >>> +				/* Frame too long or too short.
> >>> */
> >>> +				dev->stats.rx_length_errors++;
> >>> +			}
> >>> +			if (status & BD_ENET_RX_NO)	/*
> >>> Frame alignment */
> >>> +				dev->stats.rx_frame_errors++;
> >>> +			if (status & BD_ENET_RX_CR)	/* CRC
> >>> Error */
> >>> +				dev->stats.rx_crc_errors++;
> >>> +			if (status & BD_ENET_RX_OV)	/*
> >>> FIFO overrun */
> >>> +				dev->stats.rx_fifo_errors++;
> >>> +		}
> >>> +
> >>> +		/* Report late collisions as a frame error.
> >>> +		 * On this error, the BD is closed, but we don't
> >>> know what we
> >>> +		 * have in the buffer.  So, just drop this frame
> >>> on the floor.
> >>> +		 */
> >>> +		if (status & BD_ENET_RX_CL) {
> >>> +			dev->stats.rx_errors++;
> >>> +			dev->stats.rx_frame_errors++;
> >>> +			goto rx_processing_done;
> >>> +		}
> >>> +
> >>> +		/* Process the incoming frame */
> >>> +		pkt_len = bdp->cbd_datlen;
> >>> +		data = (__u8 *)__va(bdp->cbd_bufaddr);
> >>> +
> >>> +		dma_unmap_single(&fep->pdev->dev,
> >>> bdp->cbd_bufaddr,
> >>> +				 bdp->cbd_datlen,
> >>> DMA_FROM_DEVICE);    
> >>
> >> I have read your explaination WRT unmap/map. Actually you don't
> >> need to do any mapping here,   
> > 
> > There are 16 cbd_t descriptors allocated (as dma_alloc_coherent).
> > Those descriptors contain pointer to data (being read in this
> > case).  
> 
> I'm referring to the actual packet payload, that is the buffer at
> bdp-cbd_bufaddr with len bdp->cbd_datlen; I'm not discussing the
> descriptors contents.

+1

> 
> > Hence the need to perform dma_map_single() for each descriptor,   
> 
> You are not unmapping the descriptor, you are unmapping the packet
> payload.

+1

> 
> >> since you are unconditionally copying the
> >> whole buffer (why???)  
> > 
> > Only the value of 
> > pkt_len = bdp->cbd_datlen; is copied to SKB (after byte
> > swap_buffer()).  
> 
> The relevant line is:
> 
> 		skb_copy_to_linear_data(skb, data, pkt_len);
> 
> AFAICS that copies whole packet contents, which is usually quite
> sub-optimal from performance PoV.
> 

fec_main.c just assigns:
data = skb->data;

so I would prefer to keep the:
skb_copy_to_linear_data(skb, data, pkt_len);

> >> and re-using it.
> >>
> >> Still you need a dma_sync_single() to ensure the CPUs see the
> >> correct data.  
> > 
> > The descriptors - i.e. struct cbd_t fields are allocated with
> > dma_alloc_coherent(), so this is OK.  
> 
> I'm talking about packets contents, not packet descriptors. Please
> re-read the above and have a look at other drivers code.

The usage of dma_sync_single_for_cpu() works without issues in the
mtip_switch_rx().

> 
> An additional point that I missed in the previous review is that the
> rx allocation schema is quite uncorrect. At ring initialization time
> you allocate full skbs, while what you need and use is just raw
> buffers for the packet payload. Instead you could/should use the page
> pool:
> 
> Documentation/networking/page_pool.rst
> 

Yes, for the RX packet payload a page of 2048 bytes is allocated. By
using the DMA page pool I can keep the same maximal size, while the
memory usage becomes much more flexible.

> That will also help doing the right thing WRT DMA handling.
> 

The dma_sync_single_for_cpu() should work correctly with the current
approach as well.

> >> This patch is really too big, I'm pretty sure I missed some
> >> relevant issues. You should split it in multiple ones: i.e.
> >> initialization and h/w access, rx/tx, others ndos.  
> > 
> > It is quite hard to "scatter" this patch as:
> > 
> > 1. I've already split it to several files (which correspond to
> > different "logical" entities - like mtipl2sw_br.c).
> > 2. The mtipl2sw.c file is the smallest part of the "core" of the
> > driver.
> > 3. If I split it, then at some point I would break bisectability for
> > imx28.  
> 
> Note that each patch don't need to provide complete functionality.
> i.e. patch 1 could implement ndo_open()/close and related helper,
> leaving ndo_start_xmit() and napi_poll empty and avoid allocating the
> rx buffers. patch 2 could implement the rx patch, patch 3 the tx path.
> 

Yes, this seems to be a good idea... I will implement this approach.

> The only constraint is that each patch will build successufully, which
> is usually easy to achieve.

+1

> 
> A 2K lines patches will probably lead to many more iterations and
> unhappy (or no) reviewers.

The problem is that all the patches "around" this driver (like *yaml,
bindings, defconfig) would get outdated very fast if not pulled into
mainline.

In that case the work already done would need to be redone...

> 
> /P
> 




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de