* [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
@ 2012-08-08 12:26 Claudiu Manoil
  2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
  2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
  0 siblings, 2 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker

Hi all,

This set of patches splits the existing napi poll routine into two
separate napi functions: one for Rx processing (triggered by frame
receive interrupts only) and one for the Tx confirmation path
(triggered by Tx confirmation interrupts only). The underlying polling
algorithm remains much the same.

Significant throughput improvements have been measured on low power
boards with this set of changes. For instance, the following netperf
test:

netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500

goes from an oscillating ~500-~700 Mbps to a steady ~940 Mbps (when the
Rx and Tx paths are processed on different cores), with no increase in
CPU utilization, on a p1020rdb - a 2-core machine featuring etsec2.0
(Multi-Queue Multi-Group driver mode).

This change, which should balance Rx and Tx processing, also proves
effective against Rx busy interrupt occurrences.

Thanks for your review.

Claudiu

Claudiu Manoil (4):
  gianfar: Remove redundant programming of [rt]xic registers
  gianfar: Clear ievent from interrupt handler for [RT]x int
  gianfar: Separate out the Rx and Tx coalescing functions
  gianfar: Use separate NAPIs for Tx and Rx processing

 drivers/net/ethernet/freescale/gianfar.c | 220 +++++++++++++++++++++--------
 drivers/net/ethernet/freescale/gianfar.h |  16 ++-
 2 files changed, 171 insertions(+), 65 deletions(-)

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers
  2012-08-08 12:26 [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Claudiu Manoil
@ 2012-08-08 12:26 ` Claudiu Manoil
  2012-08-08 12:26   ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
  2012-08-08 16:24   ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
  1 sibling, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

In Multi Q Multi Group (MQ_MG_MODE) mode, the Rx/Tx coalescing registers
[rt]xic are aliased with the [rt]xic0 registers (the coalescing setting
registers for Q0). Restructure the function so that the coalescing
registers for the Rx/Tx hw Q0 are no longer programmed twice in a row.
Also, replace an inconsistent "unlikely" in the process.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4605f72..e9feeb9 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -1799,20 +1799,9 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	u32 __iomem *baddr;
-	int i = 0;
-
-	/* Backward compatible case ---- even if we enable
-	 * multiple queues, there's only single reg to program
-	 */
-	gfar_write(&regs->txic, 0);
-	if (likely(priv->tx_queue[0]->txcoalescing))
-		gfar_write(&regs->txic, priv->tx_queue[0]->txic);
-
-	gfar_write(&regs->rxic, 0);
-	if (unlikely(priv->rx_queue[0]->rxcoalescing))
-		gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
 
 	if (priv->mode == MQ_MG_MODE) {
+		int i;
 		baddr = &regs->txic0;
 		for_each_set_bit(i, &tx_mask, priv->num_tx_queues) {
 			gfar_write(baddr + i, 0);
@@ -1826,6 +1815,17 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 			if (likely(priv->rx_queue[i]->rxcoalescing))
 				gfar_write(baddr + i, priv->rx_queue[i]->rxic);
 		}
+	} else {
+		/* Backward compatible case ---- even if we enable
+		 * multiple queues, there's only single reg to program
+		 */
+		gfar_write(&regs->txic, 0);
+		if (likely(priv->tx_queue[0]->txcoalescing))
+			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
+
+		gfar_write(&regs->rxic, 0);
+		if (likely(priv->rx_queue[0]->rxcoalescing))
+			gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
 	}
 }
-- 
1.6.6

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int
  2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
@ 2012-08-08 12:26   ` Claudiu Manoil
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
  2012-08-08 16:11     ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
  0 siblings, 2 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

It's the interrupt handler's job to clear ievent for the Tx/Rx paths,
as soon as the corresponding interrupt sources have been masked.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index e9feeb9..ddd350a 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2568,12 +2568,13 @@ static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp)
 	if (napi_schedule_prep(&gfargrp->napi)) {
 		gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED);
 		__napi_schedule(&gfargrp->napi);
-	} else {
-		/* Clear IEVENT, so interrupts aren't called again
-		 * because of the packets that have already arrived.
-		 */
-		gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
 	}
+
+	/* Clear IEVENT, so interrupts aren't called again
+	 * because of the packets that have already arrived.
+	 */
+	gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
+
 	spin_unlock_irqrestore(&gfargrp->grplock, flags);
 }
 
@@ -2837,11 +2838,6 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 	num_queues = gfargrp->num_rx_queues;
 	budget_per_queue = budget/num_queues;
 
-	/* Clear IEVENT, so interrupts aren't called again
-	 * because of the packets that have already arrived
-	 */
-	gfar_write(&regs->ievent, IEVENT_RTX_MASK);
-
 	while (num_queues && left_over_budget) {
 		budget_per_queue = left_over_budget/num_queues;
 		left_over_budget = 0;
-- 
1.6.6

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions
  2012-08-08 12:26   ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
@ 2012-08-08 12:26     ` Claudiu Manoil
  2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
                          ` (2 more replies)
  2012-08-08 16:11     ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
  1 sibling, 3 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Claudiu Manoil

Split the coalescing programming support by Rx and Tx h/w queues, in
order to introduce a separate NAPI for the Tx confirmation path (next
patch). This way, the Rx processing path will handle the coalescing
settings for the Rx queues only, and the Tx confirmation processing
path will handle the Tx queues.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |   36 +++++++++++++++++++++++------
 1 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index ddd350a..919acb3 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev)
 	dev->trans_start = jiffies; /* prevent tx timeout */
 }
 
-void gfar_configure_coalescing(struct gfar_private *priv,
-	unsigned long tx_mask, unsigned long rx_mask)
+static inline void gfar_configure_tx_coalescing(struct gfar_private *priv,
+	unsigned long mask)
 {
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
 	u32 __iomem *baddr;
@@ -1803,14 +1803,31 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 	if (priv->mode == MQ_MG_MODE) {
 		int i;
 		baddr = &regs->txic0;
-		for_each_set_bit(i, &tx_mask, priv->num_tx_queues) {
+		for_each_set_bit(i, &mask, priv->num_tx_queues) {
 			gfar_write(baddr + i, 0);
 			if (likely(priv->tx_queue[i]->txcoalescing))
 				gfar_write(baddr + i, priv->tx_queue[i]->txic);
 		}
+	} else {
+		/* Backward compatible case ---- even if we enable
+		 * multiple queues, there's only single reg to program
+		 */
+		gfar_write(&regs->txic, 0);
+		if (likely(priv->tx_queue[0]->txcoalescing))
+			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
+	}
+}
+
+static inline void gfar_configure_rx_coalescing(struct gfar_private *priv,
+	unsigned long mask)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 __iomem *baddr;
+
+	if (priv->mode == MQ_MG_MODE) {
+		int i;
 		baddr = &regs->rxic0;
-		for_each_set_bit(i, &rx_mask, priv->num_rx_queues) {
+		for_each_set_bit(i, &mask, priv->num_rx_queues) {
 			gfar_write(baddr + i, 0);
 			if (likely(priv->rx_queue[i]->rxcoalescing))
 				gfar_write(baddr + i, priv->rx_queue[i]->rxic);
@@ -1819,16 +1836,19 @@ void gfar_configure_coalescing(struct gfar_private *priv,
 		/* Backward compatible case ---- even if we enable
 		 * multiple queues, there's only single reg to program
 		 */
-		gfar_write(&regs->txic, 0);
-		if (likely(priv->tx_queue[0]->txcoalescing))
-			gfar_write(&regs->txic, priv->tx_queue[0]->txic);
-
 		gfar_write(&regs->rxic, 0);
 		if (likely(priv->rx_queue[0]->rxcoalescing))
 			gfar_write(&regs->rxic, priv->rx_queue[0]->rxic);
 	}
 }
 
+void gfar_configure_coalescing(struct gfar_private *priv,
+	unsigned long tx_mask, unsigned long rx_mask)
+{
+	gfar_configure_tx_coalescing(priv, tx_mask);
+	gfar_configure_rx_coalescing(priv, rx_mask);
+}
+
 static int register_grp_irqs(struct gfar_priv_grp *grp)
 {
 	struct gfar_private *priv = grp->priv;
-- 
1.6.6

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing
  2012-08-08 12:26     ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
@ 2012-08-08 12:26       ` Claudiu Manoil
  2012-08-14  0:51         ` Paul Gortmaker
  2012-08-08 15:44       ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker
  2012-08-15  1:29       ` Paul Gortmaker
  2 siblings, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-08 12:26 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Paul Gortmaker, Pankaj Chauhan, Claudiu Manoil

Add a separate napi poll routine for Tx cleanup, to be triggered by Tx
confirmation interrupts only. The existing poll function is modified to
handle only the Rx path processing. This allows parallel processing of
the Rx and Tx confirmation paths on an smp machine (2 cores).
The split also results in simpler/cleaner napi poll function
implementations, where each processing path has its own budget, thus
improving the fairness between the processing paths at the same time.
Signed-off-by: Pankaj Chauhan <pankaj.chauhan@freescale.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c |  154 +++++++++++++++++++++++-------
 drivers/net/ethernet/freescale/gianfar.h |   16 +++-
 2 files changed, 130 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 919acb3..2774961 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -128,12 +128,14 @@ static void free_skb_resources(struct gfar_private *priv);
 static void gfar_set_multi(struct net_device *dev);
 static void gfar_set_hash_for_addr(struct net_device *dev, u8 *addr);
 static void gfar_configure_serdes(struct net_device *dev);
-static int gfar_poll(struct napi_struct *napi, int budget);
+static int gfar_poll_rx(struct napi_struct *napi, int budget);
+static int gfar_poll_tx(struct napi_struct *napi, int budget);
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void gfar_netpoll(struct net_device *dev);
 #endif
 int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit);
-static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue);
+static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
+			      int tx_work_limit);
 static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb,
 			      int amount_pull, struct napi_struct *napi);
 void gfar_halt(struct net_device *dev);
@@ -543,16 +545,20 @@ static void disable_napi(struct gfar_private *priv)
 {
 	int i;
 
-	for (i = 0; i < priv->num_grps; i++)
-		napi_disable(&priv->gfargrp[i].napi);
+	for (i = 0; i < priv->num_grps; i++) {
+		napi_disable(&priv->gfargrp[i].napi_rx);
+		napi_disable(&priv->gfargrp[i].napi_tx);
+	}
 }
 
 static void enable_napi(struct gfar_private *priv)
 {
 	int i;
 
-	for (i = 0; i < priv->num_grps; i++)
-		napi_enable(&priv->gfargrp[i].napi);
+	for (i = 0; i < priv->num_grps; i++) {
+		napi_enable(&priv->gfargrp[i].napi_rx);
+		napi_enable(&priv->gfargrp[i].napi_tx);
+	}
 }
 
 static int gfar_parse_group(struct device_node *np,
@@ -1028,9 +1034,12 @@ static int gfar_probe(struct platform_device *ofdev)
 	dev->ethtool_ops = &gfar_ethtool_ops;
 
 	/* Register for napi ...We are registering NAPI for each grp */
-	for (i = 0; i < priv->num_grps; i++)
-		netif_napi_add(dev, &priv->gfargrp[i].napi, gfar_poll,
-			       GFAR_DEV_WEIGHT);
+	for (i = 0; i < priv->num_grps; i++) {
+		netif_napi_add(dev, &priv->gfargrp[i].napi_rx, gfar_poll_rx,
+			       GFAR_DEV_RX_WEIGHT);
+		netif_napi_add(dev, &priv->gfargrp[i].napi_tx, gfar_poll_tx,
+			       GFAR_DEV_TX_WEIGHT);
+	}
 
 	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_CSUM) {
 		dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG |
@@ -2465,7 +2474,8 @@ static void gfar_align_skb(struct sk_buff *skb)
 }
 
 /* Interrupt Handler for Transmit complete */
-static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
+static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue,
+			      int tx_work_limit)
 {
 	struct net_device *dev = tx_queue->dev;
 	struct netdev_queue *txq;
@@ -2490,7 +2500,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	bdp = tx_queue->dirty_tx;
 	skb_dirtytx = tx_queue->skb_dirtytx;
 
-	while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
+	while ((skb = tx_queue->tx_skbuff[skb_dirtytx]) && tx_work_limit--) {
 		unsigned long flags;
 
 		frags = skb_shinfo(skb)->nr_frags;
@@ -2580,29 +2590,50 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	return howmany;
 }
 
-static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp)
+static void gfar_schedule_rx_cleanup(struct gfar_priv_grp *gfargrp)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&gfargrp->grplock, flags);
-	if (napi_schedule_prep(&gfargrp->napi)) {
-		gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED);
-		__napi_schedule(&gfargrp->napi);
+	if (napi_schedule_prep(&gfargrp->napi_rx)) {
+		u32 imask;
+		spin_lock_irqsave(&gfargrp->grplock, flags);
+		imask = gfar_read(&gfargrp->regs->imask);
+		imask &= ~(IMASK_RX_DEFAULT);
+		gfar_write(&gfargrp->regs->imask, imask);
+		__napi_schedule(&gfargrp->napi_rx);
+		spin_unlock_irqrestore(&gfargrp->grplock, flags);
 	}
 
 	/* Clear IEVENT, so interrupts aren't called again
 	 * because of the packets that have already arrived.
 	 */
-	gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK);
+	gfar_write(&gfargrp->regs->ievent, IEVENT_RX_MASK);
+}
 
-	spin_unlock_irqrestore(&gfargrp->grplock, flags);
+static void gfar_schedule_tx_cleanup(struct gfar_priv_grp *gfargrp)
+{
+	unsigned long flags;
+
+	if (napi_schedule_prep(&gfargrp->napi_tx)) {
+		u32 imask;
+		spin_lock_irqsave(&gfargrp->grplock, flags);
+		imask = gfar_read(&gfargrp->regs->imask);
+		imask &= ~(IMASK_TX_DEFAULT);
+		gfar_write(&gfargrp->regs->imask, imask);
+		__napi_schedule(&gfargrp->napi_tx);
+		spin_unlock_irqrestore(&gfargrp->grplock, flags);
+	}
+
+	/* Clear IEVENT, so interrupts aren't called again
+	 * because of the packets that have already arrived.
+	 */
+	gfar_write(&gfargrp->regs->ievent, IEVENT_TX_MASK);
 }
 
 /* Interrupt Handler for Transmit complete */
 static irqreturn_t gfar_transmit(int irq, void *grp_id)
 {
-	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
+	gfar_schedule_tx_cleanup((struct gfar_priv_grp *)grp_id);
 	return IRQ_HANDLED;
 }
 
@@ -2683,7 +2714,7 @@ static inline void count_errors(unsigned short status, struct net_device *dev)
 
 irqreturn_t gfar_receive(int irq, void *grp_id)
 {
-	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
+	gfar_schedule_rx_cleanup((struct gfar_priv_grp *)grp_id);
 	return IRQ_HANDLED;
 }
 
@@ -2813,7 +2844,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 			rx_queue->stats.rx_bytes += pkt_len;
 			skb_record_rx_queue(skb, rx_queue->qindex);
 			gfar_process_frame(dev, skb, amount_pull,
-					   &rx_queue->grp->napi);
+					   &rx_queue->grp->napi_rx);
 
 		} else {
 			netif_warn(priv, rx_err, dev, "Missing skb!\n");
@@ -2842,21 +2873,19 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
 	return howmany;
 }
 
-static int gfar_poll(struct napi_struct *napi, int budget)
+static int gfar_poll_rx(struct napi_struct *napi, int budget)
 {
 	struct gfar_priv_grp *gfargrp =
-		container_of(napi, struct gfar_priv_grp, napi);
+		container_of(napi, struct gfar_priv_grp, napi_rx);
 	struct gfar_private *priv = gfargrp->priv;
 	struct gfar __iomem *regs = gfargrp->regs;
-	struct gfar_priv_tx_q *tx_queue = NULL;
 	struct gfar_priv_rx_q *rx_queue = NULL;
-	int rx_cleaned = 0, budget_per_queue = 0, rx_cleaned_per_queue = 0;
-	int tx_cleaned = 0, i, left_over_budget = budget;
+	int rx_cleaned = 0, budget_per_queue, rx_cleaned_per_queue;
+	int i, left_over_budget = budget;
 	unsigned long serviced_queues = 0;
 	int num_queues = 0;
 
 	num_queues = gfargrp->num_rx_queues;
-	budget_per_queue = budget/num_queues;
 
 	while (num_queues && left_over_budget) {
 		budget_per_queue = left_over_budget/num_queues;
@@ -2866,9 +2895,6 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 			if (test_bit(i, &serviced_queues))
 				continue;
 			rx_queue = priv->rx_queue[i];
-			tx_queue = priv->tx_queue[rx_queue->qindex];
-
-			tx_cleaned += gfar_clean_tx_ring(tx_queue);
 			rx_cleaned_per_queue =
 				gfar_clean_rx_ring(rx_queue, budget_per_queue);
 			rx_cleaned += rx_cleaned_per_queue;
@@ -2882,27 +2908,83 @@ static int gfar_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
-	if (tx_cleaned)
-		return budget;
-
 	if (rx_cleaned < budget) {
+		u32 imask;
 		napi_complete(napi);
 
 		/* Clear the halt bit in RSTAT */
 		gfar_write(&regs->rstat, gfargrp->rstat);
 
-		gfar_write(&regs->imask, IMASK_DEFAULT);
+		spin_lock_irq(&gfargrp->grplock);
+		imask = gfar_read(&regs->imask);
+		imask |= IMASK_RX_DEFAULT;
+		gfar_write(&regs->imask, imask);
+		spin_unlock_irq(&gfargrp->grplock);
 
 		/* If we are coalescing interrupts, update the timer
 		 * Otherwise, clear it
 		 */
-		gfar_configure_coalescing(priv, gfargrp->rx_bit_map,
-					  gfargrp->tx_bit_map);
+		gfar_configure_rx_coalescing(priv, gfargrp->rx_bit_map);
 	}
 
 	return rx_cleaned;
 }
 
+static int gfar_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct gfar_priv_grp *gfargrp =
+		container_of(napi, struct gfar_priv_grp, napi_tx);
+	struct gfar_private *priv = gfargrp->priv;
+	struct gfar __iomem *regs = gfargrp->regs;
+	struct gfar_priv_tx_q *tx_queue = NULL;
+	int tx_cleaned = 0, budget_per_queue, tx_cleaned_per_queue;
+	int i, left_over_budget = budget;
+	unsigned long serviced_queues = 0;
+	int num_queues = 0;
+
+	num_queues = gfargrp->num_tx_queues;
+
+	while (num_queues && left_over_budget) {
+		budget_per_queue = left_over_budget/num_queues;
+		left_over_budget = 0;
+
+		for_each_set_bit(i, &gfargrp->tx_bit_map, priv->num_tx_queues) {
+			if (test_bit(i, &serviced_queues))
+				continue;
+			tx_queue = priv->tx_queue[i];
+			tx_cleaned_per_queue =
+				gfar_clean_tx_ring(tx_queue, budget_per_queue);
+			tx_cleaned += tx_cleaned_per_queue;
+			if (tx_cleaned_per_queue < budget_per_queue) {
+				left_over_budget = left_over_budget +
+					(budget_per_queue -
+					 tx_cleaned_per_queue);
+				set_bit(i, &serviced_queues);
+				num_queues--;
+			}
+		}
+	}
+
+	if (tx_cleaned < budget) {
+		u32 imask;
+		napi_complete(napi);
+
+		gfar_write(&regs->imask, IMASK_DEFAULT);
+		spin_lock_irq(&gfargrp->grplock);
+		imask = gfar_read(&regs->imask);
+		imask |= IMASK_TX_DEFAULT;
+		gfar_write(&regs->imask, imask);
+		spin_unlock_irq(&gfargrp->grplock);
+
+		/* If we are coalescing interrupts, update the timer
+		 * Otherwise, clear it
+		 */
+		gfar_configure_tx_coalescing(priv, gfargrp->tx_bit_map);
+	}
+
+	return tx_cleaned;
+}
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 /* Polling 'interrupt' - used by things like netconsole to send skbs
  * without having to re-enable interrupts. It's not called while
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 2136c7f..f5be234 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -57,8 +57,10 @@ struct ethtool_rx_list {
 	unsigned int count;
 };
 
-/* The maximum number of packets to be handled in one call of gfar_poll */
-#define GFAR_DEV_WEIGHT 64
+/* The maximum number of packets to be handled in one call of gfar_poll_rx */
+#define GFAR_DEV_RX_WEIGHT 64
+/* The maximum number of packets to be handled in one call of gfar_poll_tx */
+#define GFAR_DEV_TX_WEIGHT 64
 
 /* Length for FCB */
 #define GMAC_FCB_LEN 8
@@ -366,6 +368,10 @@ extern const char gfar_driver_version[];
 			 | IMASK_PERR)
 #define IMASK_RTX_DISABLED ((~(IMASK_RXFEN0 | IMASK_TXFEN | IMASK_BSY)) \
 			   & IMASK_DEFAULT)
+#define IMASK_RX_DEFAULT (IMASK_RXFEN0 | IMASK_BSY)
+#define IMASK_TX_DEFAULT (IMASK_TXFEN)
+#define IMASK_RX_DISABLED ((~(IMASK_RX_DEFAULT)) & IMASK_DEFAULT)
+#define IMASK_TX_DISABLED ((~(IMASK_TX_DEFAULT)) & IMASK_DEFAULT)
 
 /* Fifo management */
 #define FIFO_TX_THR_MASK 0x01ff
@@ -993,7 +999,8 @@ struct gfar_priv_rx_q {
 
 /**
  *	struct gfar_priv_grp - per group structure
- *	@napi: the napi poll function
+ *	@napi_rx: the RX napi poll function
+ *	@napi_tx: the TX confirmation napi poll function
 *	@priv: back pointer to the priv structure
 *	@regs: the ioremapped register space for this group
 *	@grp_id: group id for this group
@@ -1007,7 +1014,8 @@ struct gfar_priv_rx_q {
 
 struct gfar_priv_grp {
 	spinlock_t grplock __attribute__ ((aligned (SMP_CACHE_BYTES)));
-	struct napi_struct napi;
+	struct napi_struct napi_rx;
+	struct napi_struct napi_tx;
 	struct gfar_private *priv;
 	struct gfar __iomem *regs;
 	unsigned int grp_id;
-- 
1.6.6

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing
  2012-08-08 12:26       ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
@ 2012-08-14  0:51         ` Paul Gortmaker
  2012-08-14 16:08           ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-14  0:51 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev, David S. Miller, Pankaj Chauhan, Eric Dumazet

[[RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote:

> Add a separate napi poll routine for Tx cleanup, to be triggered by Tx
> confirmation interrupts only. The existing poll function is modified to
> handle only the Rx path processing. This allows parallel processing of
> the Rx and Tx confirmation paths on an smp machine (2 cores).
> The split also results in simpler/cleaner napi poll function
> implementations, where each processing path has its own budget, thus
> improving the fairness between the processing paths at the same time.
> > Signed-off-by: Pankaj Chauhan <pankaj.chauhan@freescale.com> > Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> > --- > drivers/net/ethernet/freescale/gianfar.c | 154 +++++++++++++++++++++++------- > drivers/net/ethernet/freescale/gianfar.h | 16 +++- > 2 files changed, 130 insertions(+), 40 deletions(-) > > diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c > index 919acb3..2774961 100644 > --- a/drivers/net/ethernet/freescale/gianfar.c > +++ b/drivers/net/ethernet/freescale/gianfar.c > @@ -128,12 +128,14 @@ static void free_skb_resources(struct gfar_private *priv); > static void gfar_set_multi(struct net_device *dev); > static void gfar_set_hash_for_addr(struct net_device *dev, u8 *addr); > static void gfar_configure_serdes(struct net_device *dev); > -static int gfar_poll(struct napi_struct *napi, int budget); > +static int gfar_poll_rx(struct napi_struct *napi, int budget); > +static int gfar_poll_tx(struct napi_struct *napi, int budget); > #ifdef CONFIG_NET_POLL_CONTROLLER > static void gfar_netpoll(struct net_device *dev); > #endif > int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit); > -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue); > +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue, > + int tx_work_limit); I'm looking at this in a bit more detail now (was on vacation last wk). With the above, you push a work limit down into the clean_tx_ring. I'm wondering if the above is implicitly involved in the performance difference you see, since... 
> static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb, > int amount_pull, struct napi_struct *napi); > void gfar_halt(struct net_device *dev); > @@ -543,16 +545,20 @@ static void disable_napi(struct gfar_private *priv) > { > int i; > > - for (i = 0; i < priv->num_grps; i++) > - napi_disable(&priv->gfargrp[i].napi); > + for (i = 0; i < priv->num_grps; i++) { > + napi_disable(&priv->gfargrp[i].napi_rx); > + napi_disable(&priv->gfargrp[i].napi_tx); > + } > } > > static void enable_napi(struct gfar_private *priv) > { > int i; > > - for (i = 0; i < priv->num_grps; i++) > - napi_enable(&priv->gfargrp[i].napi); > + for (i = 0; i < priv->num_grps; i++) { > + napi_enable(&priv->gfargrp[i].napi_rx); > + napi_enable(&priv->gfargrp[i].napi_tx); > + } > } > > static int gfar_parse_group(struct device_node *np, > @@ -1028,9 +1034,12 @@ static int gfar_probe(struct platform_device *ofdev) > dev->ethtool_ops = &gfar_ethtool_ops; > > /* Register for napi ...We are registering NAPI for each grp */ > - for (i = 0; i < priv->num_grps; i++) > - netif_napi_add(dev, &priv->gfargrp[i].napi, gfar_poll, > - GFAR_DEV_WEIGHT); > + for (i = 0; i < priv->num_grps; i++) { > + netif_napi_add(dev, &priv->gfargrp[i].napi_rx, gfar_poll_rx, > + GFAR_DEV_RX_WEIGHT); > + netif_napi_add(dev, &priv->gfargrp[i].napi_tx, gfar_poll_tx, > + GFAR_DEV_TX_WEIGHT); > + } > > if (priv->device_flags & FSL_GIANFAR_DEV_HAS_CSUM) { > dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG | > @@ -2465,7 +2474,8 @@ static void gfar_align_skb(struct sk_buff *skb) > } > > /* Interrupt Handler for Transmit complete */ > -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue) > +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue, > + int tx_work_limit) > { > struct net_device *dev = tx_queue->dev; > struct netdev_queue *txq; > @@ -2490,7 +2500,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue) > bdp = tx_queue->dirty_tx; > skb_dirtytx = tx_queue->skb_dirtytx; > > - while 
((skb = tx_queue->tx_skbuff[skb_dirtytx])) { > + while ((skb = tx_queue->tx_skbuff[skb_dirtytx]) && tx_work_limit--) { ...code like this provides a new exit point that did not exist before, for the case of a massive transmit blast. Do you have any data on how often the work limit is hit? The old Don Becker ether drivers which originally introduced the idea of work limits (on IRQs) used to printk a message when they hit it, since ideally it shouldn't be happening all the time. In any case, it might be worth while to split this change out into a separate commit; something like: gianfar: push transmit cleanup work limit down to clean_tx_ring The advantage being (1) we can test this change in isolation, and (2) it makes your remaining rx/tx separate thread patch smaller and easier to review. Thanks, Paul. -- > unsigned long flags; > > frags = skb_shinfo(skb)->nr_frags; > @@ -2580,29 +2590,50 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue) > return howmany; > } > > -static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp) > +static void gfar_schedule_rx_cleanup(struct gfar_priv_grp *gfargrp) > { > unsigned long flags; > > - spin_lock_irqsave(&gfargrp->grplock, flags); > - if (napi_schedule_prep(&gfargrp->napi)) { > - gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED); > - __napi_schedule(&gfargrp->napi); > + if (napi_schedule_prep(&gfargrp->napi_rx)) { > + u32 imask; > + spin_lock_irqsave(&gfargrp->grplock, flags); > + imask = gfar_read(&gfargrp->regs->imask); > + imask &= ~(IMASK_RX_DEFAULT); > + gfar_write(&gfargrp->regs->imask, imask); > + __napi_schedule(&gfargrp->napi_rx); > + spin_unlock_irqrestore(&gfargrp->grplock, flags); > } > > /* Clear IEVENT, so interrupts aren't called again > * because of the packets that have already arrived. 
> */ > - gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK); > + gfar_write(&gfargrp->regs->ievent, IEVENT_RX_MASK); > +} > > - spin_unlock_irqrestore(&gfargrp->grplock, flags); > +static void gfar_schedule_tx_cleanup(struct gfar_priv_grp *gfargrp) > +{ > + unsigned long flags; > + > + if (napi_schedule_prep(&gfargrp->napi_tx)) { > + u32 imask; > + spin_lock_irqsave(&gfargrp->grplock, flags); > + imask = gfar_read(&gfargrp->regs->imask); > + imask &= ~(IMASK_TX_DEFAULT); > + gfar_write(&gfargrp->regs->imask, imask); > + __napi_schedule(&gfargrp->napi_tx); > + spin_unlock_irqrestore(&gfargrp->grplock, flags); > + } > > + /* Clear IEVENT, so interrupts aren't called again > + * because of the packets that have already arrived. > + */ > + gfar_write(&gfargrp->regs->ievent, IEVENT_TX_MASK); > } > > /* Interrupt Handler for Transmit complete */ > static irqreturn_t gfar_transmit(int irq, void *grp_id) > { > - gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id); > + gfar_schedule_tx_cleanup((struct gfar_priv_grp *)grp_id); > return IRQ_HANDLED; > } > > @@ -2683,7 +2714,7 @@ static inline void count_errors(unsigned short status, struct net_device *dev) > > irqreturn_t gfar_receive(int irq, void *grp_id) > { > - gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id); > + gfar_schedule_rx_cleanup((struct gfar_priv_grp *)grp_id); > return IRQ_HANDLED; > } > > @@ -2813,7 +2844,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit) > rx_queue->stats.rx_bytes += pkt_len; > skb_record_rx_queue(skb, rx_queue->qindex); > gfar_process_frame(dev, skb, amount_pull, > - &rx_queue->grp->napi); > + &rx_queue->grp->napi_rx); > > } else { > netif_warn(priv, rx_err, dev, "Missing skb!\n"); > @@ -2842,21 +2873,19 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit) > return howmany; > } > > -static int gfar_poll(struct napi_struct *napi, int budget) > +static int gfar_poll_rx(struct napi_struct *napi, int budget) > { > struct gfar_priv_grp 
*gfargrp = > - container_of(napi, struct gfar_priv_grp, napi); > + container_of(napi, struct gfar_priv_grp, napi_rx); > struct gfar_private *priv = gfargrp->priv; > struct gfar __iomem *regs = gfargrp->regs; > - struct gfar_priv_tx_q *tx_queue = NULL; > struct gfar_priv_rx_q *rx_queue = NULL; > - int rx_cleaned = 0, budget_per_queue = 0, rx_cleaned_per_queue = 0; > - int tx_cleaned = 0, i, left_over_budget = budget; > + int rx_cleaned = 0, budget_per_queue, rx_cleaned_per_queue; > + int i, left_over_budget = budget; > unsigned long serviced_queues = 0; > int num_queues = 0; > > num_queues = gfargrp->num_rx_queues; > - budget_per_queue = budget/num_queues; > > while (num_queues && left_over_budget) { > budget_per_queue = left_over_budget/num_queues; > @@ -2866,9 +2895,6 @@ static int gfar_poll(struct napi_struct *napi, int budget) > if (test_bit(i, &serviced_queues)) > continue; > rx_queue = priv->rx_queue[i]; > - tx_queue = priv->tx_queue[rx_queue->qindex]; > - > - tx_cleaned += gfar_clean_tx_ring(tx_queue); > rx_cleaned_per_queue = > gfar_clean_rx_ring(rx_queue, budget_per_queue); > rx_cleaned += rx_cleaned_per_queue; > @@ -2882,27 +2908,83 @@ static int gfar_poll(struct napi_struct *napi, int budget) > } > } > > - if (tx_cleaned) > - return budget; > - > if (rx_cleaned < budget) { > + u32 imask; > napi_complete(napi); > > /* Clear the halt bit in RSTAT */ > gfar_write(®s->rstat, gfargrp->rstat); > > - gfar_write(®s->imask, IMASK_DEFAULT); > + spin_lock_irq(&gfargrp->grplock); > + imask = gfar_read(®s->imask); > + imask |= IMASK_RX_DEFAULT; > + gfar_write(®s->imask, imask); > + spin_unlock_irq(&gfargrp->grplock); > > /* If we are coalescing interrupts, update the timer > * Otherwise, clear it > */ > - gfar_configure_coalescing(priv, gfargrp->rx_bit_map, > - gfargrp->tx_bit_map); > + gfar_configure_rx_coalescing(priv, gfargrp->rx_bit_map); > } > > return rx_cleaned; > } > > +static int gfar_poll_tx(struct napi_struct *napi, int budget) > +{ > + struct gfar_priv_grp 
*gfargrp = > + container_of(napi, struct gfar_priv_grp, napi_tx); > + struct gfar_private *priv = gfargrp->priv; > + struct gfar __iomem *regs = gfargrp->regs; > + struct gfar_priv_tx_q *tx_queue = NULL; > + int tx_cleaned = 0, budget_per_queue, tx_cleaned_per_queue; > + int i, left_over_budget = budget; > + unsigned long serviced_queues = 0; > + int num_queues = 0; > + > + num_queues = gfargrp->num_tx_queues; > + > + while (num_queues && left_over_budget) { > + budget_per_queue = left_over_budget/num_queues; > + left_over_budget = 0; > + > + for_each_set_bit(i, &gfargrp->tx_bit_map, priv->num_tx_queues) { > + if (test_bit(i, &serviced_queues)) > + continue; > + tx_queue = priv->tx_queue[i]; > + tx_cleaned_per_queue = > + gfar_clean_tx_ring(tx_queue, budget_per_queue); > + tx_cleaned += tx_cleaned_per_queue; > + if (tx_cleaned_per_queue < budget_per_queue) { > + left_over_budget = left_over_budget + > + (budget_per_queue - > + tx_cleaned_per_queue); > + set_bit(i, &serviced_queues); > + num_queues--; > + } > + } > + } > + > + if (tx_cleaned < budget) { > + u32 imask; > + napi_complete(napi); > + > + gfar_write(®s->imask, IMASK_DEFAULT); > + spin_lock_irq(&gfargrp->grplock); > + imask = gfar_read(®s->imask); > + imask |= IMASK_TX_DEFAULT; > + gfar_write(®s->imask, imask); > + spin_unlock_irq(&gfargrp->grplock); > + > + /* If we are coalescing interrupts, update the timer > + * Otherwise, clear it > + */ > + gfar_configure_tx_coalescing(priv, gfargrp->tx_bit_map); > + } > + > + return tx_cleaned; > +} > + > #ifdef CONFIG_NET_POLL_CONTROLLER > /* Polling 'interrupt' - used by things like netconsole to send skbs > * without having to re-enable interrupts. 
It's not called while > diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h > index 2136c7f..f5be234 100644 > --- a/drivers/net/ethernet/freescale/gianfar.h > +++ b/drivers/net/ethernet/freescale/gianfar.h > @@ -57,8 +57,10 @@ struct ethtool_rx_list { > unsigned int count; > }; > > -/* The maximum number of packets to be handled in one call of gfar_poll */ > -#define GFAR_DEV_WEIGHT 64 > +/* The maximum number of packets to be handled in one call of gfar_poll_rx */ > +#define GFAR_DEV_RX_WEIGHT 64 > +/* The maximum number of packets to be handled in one call of gfar_poll_tx */ > +#define GFAR_DEV_TX_WEIGHT 64 > > /* Length for FCB */ > #define GMAC_FCB_LEN 8 > @@ -366,6 +368,10 @@ extern const char gfar_driver_version[]; > | IMASK_PERR) > #define IMASK_RTX_DISABLED ((~(IMASK_RXFEN0 | IMASK_TXFEN | IMASK_BSY)) \ > & IMASK_DEFAULT) > +#define IMASK_RX_DEFAULT (IMASK_RXFEN0 | IMASK_BSY) > +#define IMASK_TX_DEFAULT (IMASK_TXFEN) > +#define IMASK_RX_DISABLED ((~(IMASK_RX_DEFAULT)) & IMASK_DEFAULT) > +#define IMASK_TX_DISABLED ((~(IMASK_TX_DEFAULT)) & IMASK_DEFAULT) > > /* Fifo management */ > #define FIFO_TX_THR_MASK 0x01ff > @@ -993,7 +999,8 @@ struct gfar_priv_rx_q { > > /** > * struct gfar_priv_grp - per group structure > - * @napi: the napi poll function > + * @napi_rx: the RX napi poll function > + * @napi_tx: the TX confirmation napi poll function > * @priv: back pointer to the priv structure > * @regs: the ioremapped register space for this group > * @grp_id: group id for this group > @@ -1007,7 +1014,8 @@ struct gfar_priv_rx_q { > > struct gfar_priv_grp { > spinlock_t grplock __attribute__ ((aligned (SMP_CACHE_BYTES))); > - struct napi_struct napi; > + struct napi_struct napi_rx; > + struct napi_struct napi_tx; > struct gfar_private *priv; > struct gfar __iomem *regs; > unsigned int grp_id; > -- > 1.6.6 > > ^ permalink raw reply [flat|nested] 21+ messages in thread
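The budget-redistribution loop shared by gfar_poll_rx() and gfar_poll_tx() above can be sketched as a stand-alone model. This is a sketch only: clean() below stands in for gfar_clean_rx_ring()/gfar_clean_tx_ring(), and the per-queue packet counts are invented for illustration.

```c
#include <assert.h>

/* Model of the budget-redistribution loop: each pass divides the leftover
 * budget among the queues that still have work; a queue that finishes under
 * its share is marked serviced and donates the remainder to the next pass. */
#define NQ 4

static int pending[NQ] = { 10, 2, 3, 50 };   /* packets waiting per queue */

static int clean(int q, int limit)           /* process at most 'limit' */
{
    int done = pending[q] < limit ? pending[q] : limit;

    pending[q] -= done;
    return done;
}

static int poll_model(int budget)
{
    int cleaned = 0, left_over = budget, num_queues = NQ;
    unsigned long serviced = 0;

    while (num_queues && left_over) {
        int per_queue = left_over / num_queues;

        left_over = 0;
        for (int i = 0; i < NQ; i++) {
            if (serviced & (1UL << i))
                continue;
            int done = clean(i, per_queue);

            cleaned += done;
            if (done < per_queue) {          /* idle queue donates budget */
                left_over += per_queue - done;
                serviced |= 1UL << i;
                num_queues--;
            }
        }
    }
    return cleaned;          /* returning == budget keeps napi scheduled */
}
```

With 65 packets pending and a budget of 64, the model cleans exactly 64: the three lightly loaded queues finish in the first pass and their unused share is handed to the busy queue in the second pass.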
* Re: [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing 2012-08-14 0:51 ` Paul Gortmaker @ 2012-08-14 16:08 ` Claudiu Manoil 0 siblings, 0 replies; 21+ messages in thread From: Claudiu Manoil @ 2012-08-14 16:08 UTC (permalink / raw) To: Paul Gortmaker; +Cc: netdev, David S. Miller, Pankaj Chauhan, Eric Dumazet On 08/14/2012 03:51 AM, Paul Gortmaker wrote: > [[RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > >> Add a separate napi poll routine for Tx cleanup, to be triggerred by Tx >> confirmation interrupts only. Existing poll function is modified to handle >> only the Rx path processing. This allows parallel processing of Rx and Tx >> confirmation paths on a smp machine (2 cores). >> The split also results in simpler/cleaner napi poll function implementations, >> where each processing path has its own budget, thus improving the fairness b/w >> the processing paths at the same time. >> >> Signed-off-by: Pankaj Chauhan <pankaj.chauhan@freescale.com> >> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> >> --- >> drivers/net/ethernet/freescale/gianfar.c | 154 +++++++++++++++++++++++------- >> drivers/net/ethernet/freescale/gianfar.h | 16 +++- >> 2 files changed, 130 insertions(+), 40 deletions(-) >> >> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c >> index 919acb3..2774961 100644 >> --- a/drivers/net/ethernet/freescale/gianfar.c >> +++ b/drivers/net/ethernet/freescale/gianfar.c >> @@ -128,12 +128,14 @@ static void free_skb_resources(struct gfar_private *priv); >> static void gfar_set_multi(struct net_device *dev); >> static void gfar_set_hash_for_addr(struct net_device *dev, u8 *addr); >> static void gfar_configure_serdes(struct net_device *dev); >> -static int gfar_poll(struct napi_struct *napi, int budget); >> +static int gfar_poll_rx(struct napi_struct *napi, int budget); >> +static int gfar_poll_tx(struct 
napi_struct *napi, int budget); >> #ifdef CONFIG_NET_POLL_CONTROLLER >> static void gfar_netpoll(struct net_device *dev); >> #endif >> int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit); >> -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue); >> +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue, >> + int tx_work_limit); > I'm looking at this in a bit more detail now (was on vacation last wk). > With the above, you push a work limit down into the clean_tx_ring. > I'm wondering if the above is implicitly involved in the performance > difference you see, since... > >> static int gfar_process_frame(struct net_device *dev, struct sk_buff *skb, >> int amount_pull, struct napi_struct *napi); >> void gfar_halt(struct net_device *dev); >> @@ -543,16 +545,20 @@ static void disable_napi(struct gfar_private *priv) >> { >> int i; >> >> - for (i = 0; i < priv->num_grps; i++) >> - napi_disable(&priv->gfargrp[i].napi); >> + for (i = 0; i < priv->num_grps; i++) { >> + napi_disable(&priv->gfargrp[i].napi_rx); >> + napi_disable(&priv->gfargrp[i].napi_tx); >> + } >> } >> >> static void enable_napi(struct gfar_private *priv) >> { >> int i; >> >> - for (i = 0; i < priv->num_grps; i++) >> - napi_enable(&priv->gfargrp[i].napi); >> + for (i = 0; i < priv->num_grps; i++) { >> + napi_enable(&priv->gfargrp[i].napi_rx); >> + napi_enable(&priv->gfargrp[i].napi_tx); >> + } >> } >> >> static int gfar_parse_group(struct device_node *np, >> @@ -1028,9 +1034,12 @@ static int gfar_probe(struct platform_device *ofdev) >> dev->ethtool_ops = &gfar_ethtool_ops; >> >> /* Register for napi ...We are registering NAPI for each grp */ >> - for (i = 0; i < priv->num_grps; i++) >> - netif_napi_add(dev, &priv->gfargrp[i].napi, gfar_poll, >> - GFAR_DEV_WEIGHT); >> + for (i = 0; i < priv->num_grps; i++) { >> + netif_napi_add(dev, &priv->gfargrp[i].napi_rx, gfar_poll_rx, >> + GFAR_DEV_RX_WEIGHT); >> + netif_napi_add(dev, &priv->gfargrp[i].napi_tx, gfar_poll_tx, >> + 
GFAR_DEV_TX_WEIGHT); >> + } >> >> if (priv->device_flags & FSL_GIANFAR_DEV_HAS_CSUM) { >> dev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG | >> @@ -2465,7 +2474,8 @@ static void gfar_align_skb(struct sk_buff *skb) >> } >> >> /* Interrupt Handler for Transmit complete */ >> -static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue) >> +static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue, >> + int tx_work_limit) >> { >> struct net_device *dev = tx_queue->dev; >> struct netdev_queue *txq; >> @@ -2490,7 +2500,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue) >> bdp = tx_queue->dirty_tx; >> skb_dirtytx = tx_queue->skb_dirtytx; >> >> - while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) { >> + while ((skb = tx_queue->tx_skbuff[skb_dirtytx]) && tx_work_limit--) { > ...code like this provides a new exit point that did not exist before, > for the case of a massive transmit blast. Do you have any data on how > often the work limit is hit? The old Don Becker ether drivers which > originally introduced the idea of work limits (on IRQs) used to printk a > message when they hit it, since ideally it shouldn't be happening all > the time. > > In any case, it might be worth while to split this change out into a > separate commit; something like: > > gianfar: push transmit cleanup work limit down to clean_tx_ring > > The advantage being (1) we can test this change in isolation, and > (2) it makes your remaining rx/tx separate thread patch smaller and > easier to review. Sounds interesting, I think I'll give it a try. Thanks, Claudiu ^ permalink raw reply [flat|nested] 21+ messages in thread
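Paul's suggestion of splitting out the work limit pushed into the Tx cleanup loop can be sketched in isolation from the hardware. The ring contents and the run_scenario() helper below are invented for illustration; the loop mirrors the "while ((skb = ...) && tx_work_limit--)" change, stopping when either the ring has no more completed entries or the budget runs out.

```c
#include <assert.h>

#define RING_SIZE 8

/* Tx cleanup with a work limit: a non-zero ring entry stands in for a
 * completed skb waiting to be freed. */
static int clean_tx_model(int *ring, int tx_work_limit)
{
    int howmany = 0;
    int dirty = 0;                       /* index of next dirty entry */

    while (ring[dirty] && tx_work_limit--) {
        ring[dirty] = 0;                 /* "free" the completed skb */
        dirty = (dirty + 1) % RING_SIZE;
        howmany++;
    }
    return howmany;                      /* caller charges this to its budget */
}

static int run_scenario(int limit)
{
    int ring[RING_SIZE] = { 1, 1, 1, 1, 1, 0, 0, 0 };  /* 5 completed skbs */

    return clean_tx_model(ring, limit);
}
```

The two exit conditions are visible in the return value: a limit of 3 stops on budget exhaustion, a limit of 10 stops when the ring drains at 5.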
* Re: [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions 2012-08-08 12:26 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil 2012-08-08 12:26 ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil @ 2012-08-08 15:44 ` Paul Gortmaker 2012-08-09 16:24 ` Claudiu Manoil 2012-08-15 1:29 ` Paul Gortmaker 2 siblings, 1 reply; 21+ messages in thread From: Paul Gortmaker @ 2012-08-08 15:44 UTC (permalink / raw) To: Claudiu Manoil; +Cc: netdev, David S. Miller [[RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > Split the coalescing programming support by Rx and Tx h/w queues, in order to > introduce a separate NAPI for the Tx confirmation path (next patch). This way, > the Rx processing path will handle the coalescing settings for the Rx queues > only, resp. the Tx confirmation processing path will handle the Tx queues. > > Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> > --- > drivers/net/ethernet/freescale/gianfar.c | 36 +++++++++++++++++++++++------ > 1 files changed, 28 insertions(+), 8 deletions(-) > > diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c > index ddd350a..919acb3 100644 > --- a/drivers/net/ethernet/freescale/gianfar.c > +++ b/drivers/net/ethernet/freescale/gianfar.c > @@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev) > dev->trans_start = jiffies; /* prevent tx timeout */ > } > > -void gfar_configure_coalescing(struct gfar_private *priv, > - unsigned long tx_mask, unsigned long rx_mask) > +static inline void gfar_configure_tx_coalescing(struct gfar_private *priv, I believe the preference is to not specify inline when all the chunks in play are present in the one C file -- i.e. let gcc figure it out. Same for the Rx instance below. P. 
-- > + unsigned long mask) > { > struct gfar __iomem *regs = priv->gfargrp[0].regs; > u32 __iomem *baddr; > @@ -1803,14 +1803,31 @@ void gfar_configure_coalescing(struct gfar_private *priv, > if (priv->mode == MQ_MG_MODE) { > int i; > baddr = ®s->txic0; > - for_each_set_bit(i, &tx_mask, priv->num_tx_queues) { > + for_each_set_bit(i, &mask, priv->num_tx_queues) { > gfar_write(baddr + i, 0); > if (likely(priv->tx_queue[i]->txcoalescing)) > gfar_write(baddr + i, priv->tx_queue[i]->txic); > } > + } else { > + /* Backward compatible case ---- even if we enable > + * multiple queues, there's only single reg to program > + */ > + gfar_write(®s->txic, 0); > + if (likely(priv->tx_queue[0]->txcoalescing)) > + gfar_write(®s->txic, priv->tx_queue[0]->txic); > + } > +} > + > +static inline void gfar_configure_rx_coalescing(struct gfar_private *priv, > + unsigned long mask) > +{ > + struct gfar __iomem *regs = priv->gfargrp[0].regs; > + u32 __iomem *baddr; > > + if (priv->mode == MQ_MG_MODE) { > + int i; > baddr = ®s->rxic0; > - for_each_set_bit(i, &rx_mask, priv->num_rx_queues) { > + for_each_set_bit(i, &mask, priv->num_rx_queues) { > gfar_write(baddr + i, 0); > if (likely(priv->rx_queue[i]->rxcoalescing)) > gfar_write(baddr + i, priv->rx_queue[i]->rxic); > @@ -1819,16 +1836,19 @@ void gfar_configure_coalescing(struct gfar_private *priv, > /* Backward compatible case ---- even if we enable > * multiple queues, there's only single reg to program > */ > - gfar_write(®s->txic, 0); > - if (likely(priv->tx_queue[0]->txcoalescing)) > - gfar_write(®s->txic, priv->tx_queue[0]->txic); > - > gfar_write(®s->rxic, 0); > if (likely(priv->rx_queue[0]->rxcoalescing)) > gfar_write(®s->rxic, priv->rx_queue[0]->rxic); > } > } > > +void gfar_configure_coalescing(struct gfar_private *priv, > + unsigned long tx_mask, unsigned long rx_mask) > +{ > + gfar_configure_tx_coalescing(priv, tx_mask); > + gfar_configure_rx_coalescing(priv, rx_mask); > +} > + > static int register_grp_irqs(struct 
gfar_priv_grp *grp) > { > struct gfar_private *priv = grp->priv; > -- > 1.6.6 > > ^ permalink raw reply [flat|nested] 21+ messages in thread
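The register aliasing that patch 1 works around, and the per-path split introduced here, can be illustrated against a fake register file. Register names mimic the driver but the values and the alias macro are invented; on eTSEC 2.0 the legacy txic register aliases txic0 (the queue-0 coalescing register), which is why the driver must not program both in a row.

```c
#include <assert.h>

enum { SQ_MODE, MQ_MODE };
#define NUM_TXQ 4

static unsigned regs_txic0[NUM_TXQ];     /* per-queue coalescing regs  */
#define regs_txic (regs_txic0[0])        /* legacy txic aliases txic0  */

/* Tx-only coalescing setup, as in the split helper: MQ mode programs
 * txic0..txicN; the legacy path touches only the single txic register. */
static void configure_tx_coalescing(int mode, const unsigned *txic,
                                    const int *txcoalescing)
{
    if (mode == MQ_MODE) {
        for (int i = 0; i < NUM_TXQ; i++) {
            regs_txic0[i] = 0;
            if (txcoalescing[i])
                regs_txic0[i] = txic[i];
        }
    } else {
        /* backward compatible case: only one register to program */
        regs_txic = 0;
        if (txcoalescing[0])
            regs_txic = txic[0];
    }
}

static int demo(void)
{
    unsigned txic[NUM_TXQ] = { 0x11, 0x22, 0x33, 0x44 };
    int txcoalescing[NUM_TXQ] = { 1, 0, 1, 1 };

    configure_tx_coalescing(MQ_MODE, txic, txcoalescing);
    /* queue 1 had coalescing disabled, so its register reads back 0;
     * the legacy alias reads whatever queue 0 was programmed with. */
    return regs_txic0[0] == 0x11 && regs_txic0[1] == 0 &&
           regs_txic0[3] == 0x44 && regs_txic == 0x11;
}
```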
* Re: [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions 2012-08-08 15:44 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker @ 2012-08-09 16:24 ` Claudiu Manoil 0 siblings, 0 replies; 21+ messages in thread From: Claudiu Manoil @ 2012-08-09 16:24 UTC (permalink / raw) To: Paul Gortmaker; +Cc: netdev, David S. Miller On 8/8/2012 6:44 PM, Paul Gortmaker wrote: > [[RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > >> Split the coalescing programming support by Rx and Tx h/w queues, in order to >> introduce a separate NAPI for the Tx confirmation path (next patch). This way, >> the Rx processing path will handle the coalescing settings for the Rx queues >> only, resp. the Tx confirmation processing path will handle the Tx queues. >> >> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> >> --- >> drivers/net/ethernet/freescale/gianfar.c | 36 +++++++++++++++++++++++------ >> 1 files changed, 28 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c >> index ddd350a..919acb3 100644 >> --- a/drivers/net/ethernet/freescale/gianfar.c >> +++ b/drivers/net/ethernet/freescale/gianfar.c >> @@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev) >> dev->trans_start = jiffies; /* prevent tx timeout */ >> } >> >> -void gfar_configure_coalescing(struct gfar_private *priv, >> - unsigned long tx_mask, unsigned long rx_mask) >> +static inline void gfar_configure_tx_coalescing(struct gfar_private *priv, > > I believe the preference is to not specify inline when all the chunks in > play are present in the one C file -- i.e. let gcc figure it out. Same > for the Rx instance below. > > P. > -- I agree with you. thanks ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions 2012-08-08 12:26 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil 2012-08-08 12:26 ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil 2012-08-08 15:44 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker @ 2012-08-15 1:29 ` Paul Gortmaker 2 siblings, 0 replies; 21+ messages in thread From: Paul Gortmaker @ 2012-08-15 1:29 UTC (permalink / raw) To: Claudiu Manoil; +Cc: netdev, David S. Miller [[RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > Split the coalescing programming support by Rx and Tx h/w queues, in order to > introduce a separate NAPI for the Tx confirmation path (next patch). This way, > the Rx processing path will handle the coalescing settings for the Rx queues > only, resp. the Tx confirmation processing path will handle the Tx queues. > > Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> > --- > drivers/net/ethernet/freescale/gianfar.c | 36 +++++++++++++++++++++++------ > 1 files changed, 28 insertions(+), 8 deletions(-) > > diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c > index ddd350a..919acb3 100644 > --- a/drivers/net/ethernet/freescale/gianfar.c > +++ b/drivers/net/ethernet/freescale/gianfar.c > @@ -1794,8 +1794,8 @@ void gfar_start(struct net_device *dev) > dev->trans_start = jiffies; /* prevent tx timeout */ > } > > -void gfar_configure_coalescing(struct gfar_private *priv, > - unsigned long tx_mask, unsigned long rx_mask) > +static inline void gfar_configure_tx_coalescing(struct gfar_private *priv, > + unsigned long mask) Hi Claudiu, I had it in mind to mention this earlier, but forgot. Align line two (and three, and four...) 
of args with the 1st arg in line one, and you'll save yourself needing a resend for basic formatting issues. In other words, you need: void foo(int some_really_long_line, struct blah *more_long_line, int b, ... ) i.e. allowing for mail mangling, make sure you have the the two int directly over each other, in the example above. Thanks, Paul. -- > { > struct gfar __iomem *regs = priv->gfargrp[0].regs; > u32 __iomem *baddr; > @@ -1803,14 +1803,31 @@ void gfar_configure_coalescing(struct gfar_private *priv, > if (priv->mode == MQ_MG_MODE) { > int i; > baddr = ®s->txic0; > - for_each_set_bit(i, &tx_mask, priv->num_tx_queues) { > + for_each_set_bit(i, &mask, priv->num_tx_queues) { > gfar_write(baddr + i, 0); > if (likely(priv->tx_queue[i]->txcoalescing)) > gfar_write(baddr + i, priv->tx_queue[i]->txic); > } > + } else { > + /* Backward compatible case ---- even if we enable > + * multiple queues, there's only single reg to program > + */ > + gfar_write(®s->txic, 0); > + if (likely(priv->tx_queue[0]->txcoalescing)) > + gfar_write(®s->txic, priv->tx_queue[0]->txic); > + } > +} > + > +static inline void gfar_configure_rx_coalescing(struct gfar_private *priv, > + unsigned long mask) > +{ > + struct gfar __iomem *regs = priv->gfargrp[0].regs; > + u32 __iomem *baddr; > > + if (priv->mode == MQ_MG_MODE) { > + int i; > baddr = ®s->rxic0; > - for_each_set_bit(i, &rx_mask, priv->num_rx_queues) { > + for_each_set_bit(i, &mask, priv->num_rx_queues) { > gfar_write(baddr + i, 0); > if (likely(priv->rx_queue[i]->rxcoalescing)) > gfar_write(baddr + i, priv->rx_queue[i]->rxic); > @@ -1819,16 +1836,19 @@ void gfar_configure_coalescing(struct gfar_private *priv, > /* Backward compatible case ---- even if we enable > * multiple queues, there's only single reg to program > */ > - gfar_write(®s->txic, 0); > - if (likely(priv->tx_queue[0]->txcoalescing)) > - gfar_write(®s->txic, priv->tx_queue[0]->txic); > - > gfar_write(®s->rxic, 0); > if (likely(priv->rx_queue[0]->rxcoalescing)) > 
gfar_write(®s->rxic, priv->rx_queue[0]->rxic); > } > } > > +void gfar_configure_coalescing(struct gfar_private *priv, > + unsigned long tx_mask, unsigned long rx_mask) > +{ > + gfar_configure_tx_coalescing(priv, tx_mask); > + gfar_configure_rx_coalescing(priv, rx_mask); > +} > + > static int register_grp_irqs(struct gfar_priv_grp *grp) > { > struct gfar_private *priv = grp->priv; > -- > 1.6.6 > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int 2012-08-08 12:26 ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil 2012-08-08 12:26 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil @ 2012-08-08 16:11 ` Paul Gortmaker 2012-08-09 16:04 ` Claudiu Manoil 1 sibling, 1 reply; 21+ messages in thread From: Paul Gortmaker @ 2012-08-08 16:11 UTC (permalink / raw) To: Claudiu Manoil; +Cc: netdev, David S. Miller [[RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > It's the interrupt handler's job to clear ievent for the Tx/Rx paths, as soon > as the corresponding interrupt sources have been masked. What wasn't clear to me was whether we'd ever have an instance of gfar_poll run without RTX_MASK being cleared (in less normal conditions, like netconsole, KGDBoE etc), since the gfar_schedule_cleanup is only called from rx/tx IRQ threads, and neither of those are used by gfar_poll, it seems. Paul. -- > > Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> > --- > drivers/net/ethernet/freescale/gianfar.c | 16 ++++++---------- > 1 files changed, 6 insertions(+), 10 deletions(-) > > diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c > index e9feeb9..ddd350a 100644 > --- a/drivers/net/ethernet/freescale/gianfar.c > +++ b/drivers/net/ethernet/freescale/gianfar.c > @@ -2568,12 +2568,13 @@ static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp) > if (napi_schedule_prep(&gfargrp->napi)) { > gfar_write(&gfargrp->regs->imask, IMASK_RTX_DISABLED); > __napi_schedule(&gfargrp->napi); > - } else { > - /* Clear IEVENT, so interrupts aren't called again > - * because of the packets that have already arrived. 
> - */ > - gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK); > } > + > + /* Clear IEVENT, so interrupts aren't called again > + * because of the packets that have already arrived. > + */ > + gfar_write(&gfargrp->regs->ievent, IEVENT_RTX_MASK); > + > spin_unlock_irqrestore(&gfargrp->grplock, flags); > > } > @@ -2837,11 +2838,6 @@ static int gfar_poll(struct napi_struct *napi, int budget) > num_queues = gfargrp->num_rx_queues; > budget_per_queue = budget/num_queues; > > - /* Clear IEVENT, so interrupts aren't called again > - * because of the packets that have already arrived > - */ > - gfar_write(®s->ievent, IEVENT_RTX_MASK); > - > while (num_queues && left_over_budget) { > budget_per_queue = left_over_budget/num_queues; > left_over_budget = 0; > -- > 1.6.6 > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int 2012-08-08 16:11 ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker @ 2012-08-09 16:04 ` Claudiu Manoil 0 siblings, 0 replies; 21+ messages in thread From: Claudiu Manoil @ 2012-08-09 16:04 UTC (permalink / raw) To: Paul Gortmaker; +Cc: netdev, David S. Miller On 8/8/2012 7:11 PM, Paul Gortmaker wrote: > [[RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > >> It's the interrupt handler's job to clear ievent for the Tx/Rx paths, as soon >> as the corresponding interrupt sources have been masked. > > What wasn't clear to me was whether we'd ever have an instance of > gfar_poll run without RTX_MASK being cleared (in less normal conditions, > like netconsole, KGDBoE etc), since the gfar_schedule_cleanup is only > called from rx/tx IRQ threads, and neither of those are used by > gfar_poll, it seems. Hi Paul, As I see it, netconsole has the ndo_poll_controller hook, which points to gfar_netpoll() -> gfar_interrupt() -> gfar_receive|transmit() -> gfar_schedule_cleanup(), so it passes through schedule_cleanup. My understanding is that gfar_poll() is scheduled for execution only by "__napi_schedule(&gfargrp->napi);" from the Rx/Tx interrupt handler (gfar_receive/transmit()->gfar_schedule_cleanup()), and that it will be executed sometimes after the interrupt handler returns, which means RTX_MASK has been cleared (and the interrupt sources are already masked). I think that we might have an issue if we don't clear IEVENT right away in the interrupt handler, as this register might cause additional hw interrupts to the PIC upon returning from the interrupt handler. ^ permalink raw reply [flat|nested] 21+ messages in thread
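The ordering Claudiu describes (mask the sources, schedule NAPI, then ack IEVENT unconditionally in the handler) can be modeled with a toy register pair. The bit values below are invented, not the real eTSEC layout; the point is that after the change a second interrupt arriving while a poll is pending still gets its event bits acked, so it cannot re-raise the line on exit.

```c
#include <assert.h>

#define IMASK_RTX_DISABLED 0x0000ffffu /* fake: rx/tx sources masked */
#define IEVENT_RTX_MASK    0x30000000u /* fake: rx/tx event bits     */

static unsigned imask = 0xffffffffu;   /* all sources enabled */
static unsigned ievent;                /* latched event bits  */
static int napi_scheduled;

static void irq_model(void)
{
    ievent |= IEVENT_RTX_MASK;         /* hw latches the event */
    if (!napi_scheduled) {             /* napi_schedule_prep() succeeded */
        imask = IMASK_RTX_DISABLED;    /* mask rx/tx sources */
        napi_scheduled = 1;            /* __napi_schedule()  */
    }
    ievent &= ~IEVENT_RTX_MASK;        /* always ack, even if already polling */
}

static int demo(void)
{
    int first_ok, second_ok;

    irq_model();                       /* first irq: schedules the poll */
    first_ok = napi_scheduled == 1 && ievent == 0 &&
               imask == IMASK_RTX_DISABLED;
    irq_model();                       /* irq during pending poll: still acked */
    second_ok = napi_scheduled == 1 && ievent == 0;
    return first_ok && second_ok;
}
```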
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing 2012-08-08 12:26 [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Claudiu Manoil 2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil @ 2012-08-08 16:24 ` Paul Gortmaker 2012-08-08 16:44 ` Eric Dumazet 1 sibling, 1 reply; 21+ messages in thread From: Paul Gortmaker @ 2012-08-08 16:24 UTC (permalink / raw) To: Claudiu Manoil; +Cc: netdev, David S. Miller [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > Hi all, > This set of patches basically splits the existing napi poll routine into > two separate napi functions, one for Rx processing (triggered by frame > receive interrupts only) and one for the Tx confirmation path processing > (triggerred by Tx confirmation interrupts only). The polling algorithm > behind remains much the same. > > Important throughput improvements have been noted on low power boards with > this set of changes. > For instance, for the following netperf test: > netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > yields a throughput gain from oscilating ~500-~700 Mbps to steady ~940 Mbps, > (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%, > on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group > driver mode). It would be interesting to know more about what was causing that large an oscillation -- presumably you will have it reappear once one core becomes 100% utilized. Also, any thoughts on how the change will change performance on an older low power single core gianfar system (e.g. 83xx)? P. -- > > Also, this change, which should ballance Rx and Tx processing, proves to > be effective against Rx busy interrupt occurrences. > > Thanks for your review. 
> Claudiu > > > Claudiu Manoil (4): > gianfar: Remove redundant programming of [rt]xic registers > gianfar: Clear ievent from interrupt handler for [RT]x int > gianfar: Separate out the Rx and Tx coalescing functions > gianfar: Use separate NAPIs for Tx and Rx processing > > drivers/net/ethernet/freescale/gianfar.c | 220 +++++++++++++++++++++-------- > drivers/net/ethernet/freescale/gianfar.h | 16 ++- > 2 files changed, 171 insertions(+), 65 deletions(-) > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing 2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker @ 2012-08-08 16:44 ` Eric Dumazet 2012-08-08 23:06 ` Tomas Hruby 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2012-08-08 16:44 UTC (permalink / raw) To: Paul Gortmaker; +Cc: Claudiu Manoil, netdev, David S. Miller On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote: > [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: > > > Hi all, > > This set of patches basically splits the existing napi poll routine into > > two separate napi functions, one for Rx processing (triggered by frame > > receive interrupts only) and one for the Tx confirmation path processing > > (triggerred by Tx confirmation interrupts only). The polling algorithm > > behind remains much the same. > > > > Important throughput improvements have been noted on low power boards with > > this set of changes. > > For instance, for the following netperf test: > > netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > > yields a throughput gain from oscilating ~500-~700 Mbps to steady ~940 Mbps, > > (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%, > > on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group > > driver mode). > > It would be interesting to know more about what was causing that large > an oscillation -- presumably you will have it reappear once one core > becomes 100% utilized. Also, any thoughts on how the change will change > performance on an older low power single core gianfar system (e.g. 83xx)? 
I also was wondering if this low performance could be caused by BQL Since TCP stack is driven by incoming ACKS, a NAPI run could have to handle 10 TCP acks in a row, and resulting xmits could hit BQL and transit on qdisc (Because NAPI handler wont handle TX completions in the middle of RX handler) So experiments would be nice, maybe by reducing a bit /proc/sys/net/ipv4/tcp_limit_output_bytes (from 131072 to 65536 or 32768) ^ permalink raw reply [flat|nested] 21+ messages in thread
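The experiment Eric suggests can be run from the shell; eth0 and the tx-0 queue index below are placeholders for the interface under test, and the paths assume a kernel with BQL and TCP small queues (net-next of that era).

```shell
# Current TCP small-queues limit (131072 bytes by default at the time):
sysctl net.ipv4.tcp_limit_output_bytes

# Reduced values suggested above; re-run the netperf test after each:
sysctl -w net.ipv4.tcp_limit_output_bytes=65536
sysctl -w net.ipv4.tcp_limit_output_bytes=32768

# BQL state for a tx queue (the limit should settle if BQL is behaving):
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit

# Effectively disable BQL for that queue by pinning a huge minimum limit:
echo 1000000000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min
```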
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing 2012-08-08 16:44 ` Eric Dumazet @ 2012-08-08 23:06 ` Tomas Hruby 2012-08-09 15:07 ` Claudiu Manoil 0 siblings, 1 reply; 21+ messages in thread From: Tomas Hruby @ 2012-08-08 23:06 UTC (permalink / raw) To: Eric Dumazet; +Cc: Paul Gortmaker, Claudiu Manoil, netdev, David S. Miller On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote: >> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: >> >> > Hi all, >> > This set of patches basically splits the existing napi poll routine into >> > two separate napi functions, one for Rx processing (triggered by frame >> > receive interrupts only) and one for the Tx confirmation path processing >> > (triggerred by Tx confirmation interrupts only). The polling algorithm >> > behind remains much the same. >> > >> > Important throughput improvements have been noted on low power boards with >> > this set of changes. >> > For instance, for the following netperf test: >> > netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 >> > yields a throughput gain from oscilating ~500-~700 Mbps to steady ~940 Mbps, >> > (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%, >> > on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group >> > driver mode). >> >> It would be interesting to know more about what was causing that large >> an oscillation -- presumably you will have it reappear once one core >> becomes 100% utilized. Also, any thoughts on how the change will change >> performance on an older low power single core gianfar system (e.g. 83xx)? 
> > I also was wondering if this low performance could be caused by BQL > > Since TCP stack is driven by incoming ACKS, a NAPI run could have to > handle 10 TCP acks in a row, and resulting xmits could hit BQL and > transit on qdisc (Because NAPI handler wont handle TX completions in the > middle of RX handler) Does disabling BQL help? Is the BQL limit stable? To what value is it set? I would be very much interested in more data if the issue is BQL related. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing 2012-08-08 23:06 ` Tomas Hruby @ 2012-08-09 15:07 ` Claudiu Manoil 2012-08-13 16:23 ` Claudiu Manoil 0 siblings, 1 reply; 21+ messages in thread From: Claudiu Manoil @ 2012-08-09 15:07 UTC (permalink / raw) To: Tomas Hruby, Eric Dumazet, Paul Gortmaker; +Cc: netdev, David S. Miller On 8/9/2012 2:06 AM, Tomas Hruby wrote: > On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote: >>> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: >>> >>>> Hi all, >>>> This set of patches basically splits the existing napi poll routine into >>>> two separate napi functions, one for Rx processing (triggered by frame >>>> receive interrupts only) and one for the Tx confirmation path processing >>>> (triggerred by Tx confirmation interrupts only). The polling algorithm >>>> behind remains much the same. >>>> >>>> Important throughput improvements have been noted on low power boards with >>>> this set of changes. >>>> For instance, for the following netperf test: >>>> netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 >>>> yields a throughput gain from oscilating ~500-~700 Mbps to steady ~940 Mbps, >>>> (if the Rx/Tx paths are processed on different cores), w/ no increase in CPU%, >>>> on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue Multi-Group >>>> driver mode). >>> >>> It would be interesting to know more about what was causing that large >>> an oscillation -- presumably you will have it reappear once one core >>> becomes 100% utilized. Also, any thoughts on how the change will change >>> performance on an older low power single core gianfar system (e.g. 83xx)? 
>> >> I also was wondering if this low performance could be caused by BQL >> >> Since TCP stack is driven by incoming ACKS, a NAPI run could have to >> handle 10 TCP acks in a row, and resulting xmits could hit BQL and >> transit on qdisc (Because NAPI handler wont handle TX completions in the >> middle of RX handler) > > Does disabling BQL help? Is the BQL limit stable? To what value is it > set? I would be very much interested in more data if the issue is BQL > related. > > . > I agree that more tests should be run to investigate why gianfar underperforms on the low-power p1020rdb platform, and BQL seems to be a good starting point (thanks for the hint). What I can say now is that the issue is not apparent on p2020rdb, for instance, which is a more powerful platform: 1200 MHz CPUs instead of 800 MHz, twice the L2 cache size (512 KB), and a higher bus (CCB) frequency. On this board (p2020rdb) the netperf test reaches 940 Mbps both w/ and w/o these patches. For a single-core system I'm not expecting any performance degradation, simply because I don't see why the proposed napi poll implementation would be slower than the existing one. I'll do some measurements on a p1010rdb too (single core, CPU: 800 MHz) and get back to you with the results. Thanks. Claudiu ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing 2012-08-09 15:07 ` Claudiu Manoil @ 2012-08-13 16:23 ` Claudiu Manoil 2012-08-14 1:15 ` Paul Gortmaker 0 siblings, 1 reply; 21+ messages in thread From: Claudiu Manoil @ 2012-08-13 16:23 UTC (permalink / raw) To: Tomas Hruby, Eric Dumazet, Paul Gortmaker; +Cc: netdev, David S. Miller On 08/09/2012 06:07 PM, Claudiu Manoil wrote: > On 8/9/2012 2:06 AM, Tomas Hruby wrote: >> On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet <eric.dumazet@gmail.com> >> wrote: >>> On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote: >>>> [[RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation >>>> processing] On 08/08/2012 (Wed 15:26) Claudiu Manoil wrote: >>>> >>>>> Hi all, >>>>> This set of patches basically splits the existing napi poll >>>>> routine into >>>>> two separate napi functions, one for Rx processing (triggered by >>>>> frame >>>>> receive interrupts only) and one for the Tx confirmation path >>>>> processing >>>>> (triggerred by Tx confirmation interrupts only). The polling >>>>> algorithm >>>>> behind remains much the same. >>>>> >>>>> Important throughput improvements have been noted on low power >>>>> boards with >>>>> this set of changes. >>>>> For instance, for the following netperf test: >>>>> netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 >>>>> yields a throughput gain from oscilating ~500-~700 Mbps to steady >>>>> ~940 Mbps, >>>>> (if the Rx/Tx paths are processed on different cores), w/ no >>>>> increase in CPU%, >>>>> on a p1020rdb - 2 core machine featuring etsec2.0 (Multi-Queue >>>>> Multi-Group >>>>> driver mode). >>>> >>>> It would be interesting to know more about what was causing that large >>>> an oscillation -- presumably you will have it reappear once one core >>>> becomes 100% utilized. Also, any thoughts on how the change will >>>> change >>>> performance on an older low power single core gianfar system (e.g. >>>> 83xx)? 
>>> >>> I also was wondering if this low performance could be caused by BQL >>> >>> Since TCP stack is driven by incoming ACKS, a NAPI run could have to >>> handle 10 TCP acks in a row, and resulting xmits could hit BQL and >>> transit on qdisc (Because NAPI handler wont handle TX completions in >>> the >>> middle of RX handler) >> >> Does disabling BQL help? Is the BQL limit stable? To what value is it >> set? I would be very much interested in more data if the issue is BQL >> related. >> >> . >> > > I agree that more tests should be run to investigate why gianfar under- > performs on the low power p1020rdb platform, and BQL seems to be > a good starting point (thanks for the hint). What I can say now is that > the issue is not apparent on p2020rdb, for instance, which is a more > powerful platform: the CPUs - 1200 MHz instead of 800 MHz; twice the > size of L2 cache (512 KB), greater bus (CCB) frequency ... On this > board (p2020rdb) the netperf test reaches 940Mbps both w/ and w/o these > patches. > > For a single core system I'm not expecting any performance degradation, > simply because I don't see why the proposed napi poll implementation > would be slower than the existing one. I'll do some measurements on a > p1010rdb too (single core, CPU:800 MHz) and get back to you with the > results. > Hi all, Please find below the netperf measurements performed on a p1010rdb machine (single core, low power). Three kernel images were used: 1) Linux version 3.5.0-20970-gaae06bf -- net-next commit aae06bf 2) Linux version 3.5.0-20974-g2920464 -- commit aae06bf + Tx NAPI patches 3) Linux version 3.5.0-20970-gaae06bf-dirty -- commit aae06bf + CONFIG_BQL set to 'n' The results show that, on *Image 1)*, adjusting tcp_limit_output_bytes brings no substantial improvement: the throughput stays in the 580-60x Mbps range.
By changing the coalescing settings from the defaults (rx coalescing off, tx-usecs: 10, tx-frames: 16) to: "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32" we get a throughput of ~710 Mbps. For *Image 2)*, using the default tcp_limit_output_bytes value (131072) - "tweaking" tcp_limit_output_bytes did not improve the throughput here either - we get the following performance numbers: * default coalescing settings: ~650 Mbps * rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps For *Image 3)*, with BQL disabled (CONFIG_BQL = n), there's *no* relevant performance improvement compared to Image 1). (note: For all the measurements, rx and tx BD ring sizes have been set to 64, for best performance.) So, I really tend to believe that the performance degradation comes primarily from the driver, and the napi poll processing turns out to be an important source for that. The proposed patches show substantial improvement, especially for SMP systems where Tx and Rx processing may be done in parallel. What do you think? Is it ok to proceed by re-spinning the patches? Do you recommend additional measurements? Regards, Claudiu //=Image 1)================ root@p1010rdb:~# cat /proc/version Linux version 3.5.0-20970-gaae06bf [...] root@p1010rdb:~# zcat /proc/config.gz | grep BQL CONFIG_BQL=y root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes 131072 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 580.76 99.95 11.76 14.099 1.659 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 598.21 99.95 10.91 13.687 1.493 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 583.04 99.95 11.25 14.043 1.581 root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes 65536 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 604.29 99.95 11.15 13.550 1.512 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 603.52 99.50 12.57 13.506 1.706 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 596.18 99.95 12.81 13.734 1.760 root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes 32768 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 582.32 99.95 12.96 14.061 1.824 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 583.79 99.95 11.19 14.026 1.571 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 584.16 99.95 11.36 14.016 1.592 root@p1010rdb:~# ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 708.77 99.85 13.32 11.541 1.540 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 710.50 99.95 12.46 11.524 1.437 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 709.95 99.95 14.15 11.533 1.633 //=Image 2)================ root@p1010rdb:~# cat /proc/version Linux version 3.5.0-20974-g2920464 [...] root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 652.60 99.95 13.05 12.547 1.638 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 657.47 99.95 11.81 12.454 1.471 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 655.77 99.95 11.80 12.486 1.474 root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32 root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes 131072 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.01 882.42 99.20 18.06 9.209 1.676 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 867.02 99.75 16.21 9.425 1.531 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.01 874.29 99.85 15.25 9.356 1.429 //=Image 3)================ Linux version 3.5.0-20970-gaae06bf-dirty [...] //CONFIG_BQL = n root@p1010rdb:~# cat /proc/version Linux version 3.5.0-20970-gaae06bf-dirty (b08782@zro04-ws574.ea.freescale.net) (gcc version 4.6.2 (GCC) ) #3 Mon Aug 13 13:58:25 EEST 2012 root@p1010rdb:~# zcat /proc/config.gz | grep BQL # CONFIG_BQL is not set root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 595.08 99.95 12.51 13.759 1.722 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 593.95 99.95 10.96 13.785 1.511 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 595.30 99.90 11.11 13.747 1.528 root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 710.46 99.95 12.46 11.525 1.437 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 714.27 99.95 14.05 11.463 1.611 root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 (192.168.10.1) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB 87380 16384 1500 20.00 717.69 99.95 12.56 11.409 1.433 . ^ permalink raw reply [flat|nested] 21+ messages in thread
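[Editorial note: the measurement procedure shown in the transcripts above condenses to a few commands. The interface name, peer address, and the use of `ethtool -G` for the BD ring sizes are assumptions based on this thread (the transcripts only state that the ring sizes were set to 64); adjust for your own setup.]

```shell
ethtool -G eth1 rx 64 tx 64      # BD ring sizes used for all runs (mechanism assumed)
sysctl -w net.ipv4.tcp_limit_output_bytes=131072
# Run with default coalescing:
netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
# Run with the tuned coalescing settings from the thread:
ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32
netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
```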
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing 2012-08-13 16:23 ` Claudiu Manoil @ 2012-08-14 1:15 ` Paul Gortmaker 2012-08-14 16:08 ` Claudiu Manoil 0 siblings, 1 reply; 21+ messages in thread From: Paul Gortmaker @ 2012-08-14 1:15 UTC (permalink / raw) To: Claudiu Manoil; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller [Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 13/08/2012 (Mon 19:23) Claudiu Manoil wrote: > On 08/09/2012 06:07 PM, Claudiu Manoil wrote: > >On 8/9/2012 2:06 AM, Tomas Hruby wrote: > >>On Wed, Aug 8, 2012 at 9:44 AM, Eric Dumazet > >><eric.dumazet@gmail.com> wrote: > >>>On Wed, 2012-08-08 at 12:24 -0400, Paul Gortmaker wrote: > >>>>[[RFC net-next 0/4] gianfar: Use separate NAPI for Tx > >>>>confirmation processing] On 08/08/2012 (Wed 15:26) Claudiu > >>>>Manoil wrote: > >>>> > >>>>>Hi all, > >>>>>This set of patches basically splits the existing napi > >>>>>poll routine into > >>>>>two separate napi functions, one for Rx processing > >>>>>(triggered by frame > >>>>>receive interrupts only) and one for the Tx confirmation > >>>>>path processing > >>>>>(triggerred by Tx confirmation interrupts only). The > >>>>>polling algorithm > >>>>>behind remains much the same. > >>>>> > >>>>>Important throughput improvements have been noted on low > >>>>>power boards with > >>>>>this set of changes. > >>>>>For instance, for the following netperf test: > >>>>>netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > >>>>>yields a throughput gain from oscilating ~500-~700 Mbps to > >>>>>steady ~940 Mbps, > >>>>>(if the Rx/Tx paths are processed on different cores), w/ > >>>>>no increase in CPU%, > >>>>>on a p1020rdb - 2 core machine featuring etsec2.0 > >>>>>(Multi-Queue Multi-Group > >>>>>driver mode). 
> >>>> > >>>>It would be interesting to know more about what was causing that large > >>>>an oscillation -- presumably you will have it reappear once one core > >>>>becomes 100% utilized. Also, any thoughts on how the change > >>>>will change > >>>>performance on an older low power single core gianfar system > >>>>(e.g. 83xx)? > >>> > >>>I also was wondering if this low performance could be caused by BQL > >>> > >>>Since TCP stack is driven by incoming ACKS, a NAPI run could have to > >>>handle 10 TCP acks in a row, and resulting xmits could hit BQL and > >>>transit on qdisc (Because NAPI handler wont handle TX > >>>completions in the > >>>middle of RX handler) > >> > >>Does disabling BQL help? Is the BQL limit stable? To what value is it > >>set? I would be very much interested in more data if the issue is BQL > >>related. > >> > >>. > >> > > > >I agree that more tests should be run to investigate why gianfar under- > >performs on the low power p1020rdb platform, and BQL seems to be > >a good starting point (thanks for the hint). What I can say now is that > >the issue is not apparent on p2020rdb, for instance, which is a more > >powerful platform: the CPUs - 1200 MHz instead of 800 MHz; twice the > >size of L2 cache (512 KB), greater bus (CCB) frequency ... On this > >board (p2020rdb) the netperf test reaches 940Mbps both w/ and w/o these > >patches. > > > >For a single core system I'm not expecting any performance degradation, > >simply because I don't see why the proposed napi poll implementation > >would be slower than the existing one. I'll do some measurements on a > >p1010rdb too (single core, CPU:800 MHz) and get back to you with the > >results. > > > > Hi all, > > Please find below the netperf measurements performed on a p1010rdb machine > (single core, low power). 
Three kernel images were used: > 1) Linux version 3.5.0-20970-gaae06bf -- net-next commit aae06bf > 2) Linux version 3.5.0-20974-g2920464 -- commit aae06bf + Tx NAPI patches > 3) Linux version 3.5.0-20970-gaae06bf-dirty -- commit aae06bf + > CONFIG_BQL set to 'n' For future reference, you don't need to dirty the tree to disable CONFIG_BQL at compile time; there is a runtime disable: http://permalink.gmane.org/gmane.linux.network/223107 > > The results show that, on *Image 1)*, by adjusting > tcp_limit_output_bytes no substantial > improvements are seen, as the throughput stays in the 580-60x Mbps range . This is a lot lower variation than what you reported earlier (20 versus 200, I think). It was the variation that raised a red flag for me... > By changing the coalescing settings from default* (rx coalescing off, > tx-usecs: 10, tx-frames: 16) to: > "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32" > we get a throughput of ~710 Mbps. > > For *Image 2)*, using the default tcp_limit_output_bytes value > (131072) - I've noticed > that "tweaking" tcp_limit_output_bytes does not improve the > throughput -, we get the > following performance numbers: > * default coalescing settings: ~650 Mbps > * rx-frames tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps > > For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no* > relevant performance > improvement compared to Image 1). > (note: > For all the measurements, rx and tx BD ring sizes have been set to > 64, for best performance.) > > So, I really tend to believe that the performance degradation comes > primarily from the driver, > and the napi poll processing turns out to be an important source for > that. The proposed patches This would make sense, if the CPU was slammed at 100% load in dealing with the tx processing, and the change made the driver considerably more efficient. But is that really the case? Is the p1010 really going flat out just to handle the Tx processing? 
Have you done any sort of profiling to confirm/deny where the CPU is spending its time? > show substantial improvement, especially for SMP systems where Tx > and Rx processing may be > done in parallel. > What do you think? > Is it ok to proceed by re-spinning the patches? Do you recommend > additional measurements? Unfortunately Eric is out this week, so we will be without his input for a while. However, we are only at 3.6-rc1 -- meaning net-next will be open for quite some time, hence no need to rush to try and jam stuff in. Also, I have two targets I'm interested in testing your patches on. The 1st is a 500MHz mpc8349 board -- which should replicate what you see on your p1010 (slow, single core). The other is an 8641D, which is interesting since it will give us the SMP tx/rx as separate threads, but without the MQ_MG_MODE support (is that a correct assumption?) I don't have any fundamental problem with your patches (although 4/4 might be better as two patches) -- the above targets/tests are only of interest, since I'm not convinced we yet understand _why_ your changes give a performance boost, and there might be something interesting hiding in there. So, while Eric is out, let me see if I can collect some more data on those two targets sometime this week. Thanks, Paul. -- > > Regards, > Claudiu > > //=Image 1)================ > root@p1010rdb:~# cat /proc/version > Linux version 3.5.0-20970-gaae06bf [...] > > root@p1010rdb:~# zcat /proc/config.gz | grep BQL > CONFIG_BQL=y > root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes > 131072 > > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 580.76 99.95 11.76 14.099 1.659 > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 598.21 99.95 10.91 13.687 1.493 > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 583.04 99.95 11.25 14.043 1.581 > > > root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes > 65536 > > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 604.29 99.95 11.15 13.550 1.512 > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 
10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 603.52 99.50 12.57 13.506 1.706 > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 596.18 99.95 12.81 13.734 1.760 > > > > root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes > 32768 > > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 582.32 99.95 12.96 14.061 1.824 > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB > > 87380 16384 1500 20.00 583.79 99.95 11.19 14.026 1.571 > root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500 > TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > 192.168.10.1 (192.168.10.1) port 0 AF_INET > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 
> 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 584.16 99.95 11.36 14.016 1.592
>
> root@p1010rdb:~# ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 708.77 99.85 13.32 11.541 1.540
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 710.50 99.95 12.46 11.524 1.437
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 709.95 99.95 14.15 11.533 1.633
>
>
> //=Image 2)================
>
> root@p1010rdb:~# cat /proc/version
> Linux version 3.5.0-20974-g2920464 [...]
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 652.60 99.95 13.05 12.547 1.638
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 657.47 99.95 11.81 12.454 1.471
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 655.77 99.95 11.80 12.486 1.474
>
>
> root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32
>
> root@p1010rdb:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
> 131072
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.01 882.42 99.20 18.06 9.209 1.676
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 867.02 99.75 16.21 9.425 1.531
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.01 874.29 99.85 15.25 9.356 1.429
>
>
> //=Image 3)================
>
> Linux version 3.5.0-20970-gaae06bf-dirty [...] //CONFIG_BQL = n
>
> root@p1010rdb:~# cat /proc/version
> Linux version 3.5.0-20970-gaae06bf-dirty
> (b08782@zro04-ws574.ea.freescale.net) (gcc version 4.6.2 (GCC) ) #3
> Mon Aug 13 13:58:25 EEST 2012
> root@p1010rdb:~# zcat /proc/config.gz | grep BQL
> # CONFIG_BQL is not set
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 595.08 99.95 12.51 13.759 1.722
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 593.95 99.95 10.96 13.785 1.511
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 595.30 99.90 11.11 13.747 1.528
>
> root@p1010rdb:~# ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 710.46 99.95 12.46 11.525 1.437
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 714.27 99.95 14.05 11.463 1.611
>
> root@p1010rdb:~# netperf -l 20 -cC -H 192.168.10.1 -t TCP_STREAM -- -m 1500
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 192.168.10.1 (192.168.10.1) port 0 AF_INET
> Recv Send Send Utilization Service Demand
> Socket Socket Message Elapsed Send Recv
> Size Size Size Time Throughput local remote local remote
> bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
>
> 87380 16384 1500 20.00 717.69 99.95 12.56 11.409 1.433
> .

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-14  1:15 ` Paul Gortmaker
@ 2012-08-14 16:08   ` Claudiu Manoil
  2012-08-16 15:36     ` Paul Gortmaker
  0 siblings, 1 reply; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-14 16:08 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
> This is a lot lower variation than what you reported earlier (20
> versus 200, I think). It was the variation that raised a red flag for
> me...

Hi Paul,

The earlier variation, which is much bigger (indeed ~200), was observed
on a p1020 (slow, 2 cores, MQ_MG_MODE). However, I did not collect
measurement results for that board as detailed as those for the p1010
(previous email). The most important performance improvement I've
noticed, though, was on the p1020 platform.

>> By changing the coalescing settings from default* (rx coalescing off,
>> tx-usecs: 10, tx-frames: 16) to:
>> "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
>> we get a throughput of ~710 Mbps.
>>
>> For *Image 2)*, using the default tcp_limit_output_bytes value
>> (131072) - I've noticed that "tweaking" tcp_limit_output_bytes does
>> not improve the throughput - we get the following performance numbers:
>> * default coalescing settings: ~650 Mbps
>> * rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
>>
>> For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
>> relevant performance improvement compared to Image 1).
>> (note: for all the measurements, rx and tx BD ring sizes have been
>> set to 64, for best performance.)
>>
>> So, I really tend to believe that the performance degradation comes
>> primarily from the driver, and the napi poll processing turns out to
>> be an important source for that. The proposed patches
> This would make sense, if the CPU was slammed at 100% load in dealing
> with the tx processing, and the change made the driver considerably more
> efficient. But is that really the case? Is the p1010 really going flat
> out just to handle the Tx processing? Have you done any sort of
> profiling to confirm/deny where the CPU is spending its time?

The current gfar_poll implementation processes the tx confirmation path
first, exhaustively, without a budget/work limit, and only then proceeds
with the rx processing within the allotted budget. And this happens for
both Rx and Tx confirmation interrupts. I find this unfair and out of
balance. Maybe by letting rx processing be triggered by rx interrupts
only, and the tx conf path processing be triggered by tx confirmation
interrupts only, and, on top of that, by imposing a work limit on the
tx confirmation path too, we get a more responsive driver that performs
better. Indeed some profiling data to confirm this would be great, but
I don't have it.

There's another issue that seems to be solved by this patchset, and
I've noticed it only on p1020rdb (this time): excessive occurrence of
Rx busy interrupts. Solving this issue may be another factor in the
performance improvement on p1020. But maybe this is another discussion.

>> show substantial improvement, especially for SMP systems where Tx
>> and Rx processing may be done in parallel.
>> What do you think?
>> Is it ok to proceed by re-spinning the patches? Do you recommend
>> additional measurements?
> Unfortunately Eric is out this week, so we will be without his input for
> a while. However, we are only at 3.6-rc1 -- meaning net-next will be
> open for quite some time, hence no need to rush to try and jam stuff in.
>
> Also, I have two targets I'm interested in testing your patches on. The
> 1st is a 500MHz mpc8349 board -- which should replicate what you see on
> your p1010 (slow, single core). The other is an 8641D, which is
> interesting since it will give us the SMP tx/rx as separate threads, but
> without the MQ_MG_MODE support (is that a correct assumption?)
>
> I don't have any fundamental problem with your patches (although 4/4
> might be better as two patches) -- the above targets/tests are only
> of interest, since I'm not convinced we yet understand _why_ your
> changes give a performance boost, and there might be something
> interesting hiding in there.
>
> So, while Eric is out, let me see if I can collect some more data on
> those two targets sometime this week.

Great, I don't mean to rush. The more data we get on this the better.
It would be great if you could do some measurements on your platforms
too. 8641D is indeed a dual core with etsec 1.x (so without the
MQ_MG_MODE), but I did run some tests on a p2020, which has the same
features. However, I'm eager to see your results.

Thanks for helping.

Regards,
Claudiu

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-14 16:08 ` Claudiu Manoil
@ 2012-08-16 15:36   ` Paul Gortmaker
  2012-08-17 11:28     ` Claudiu Manoil
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Gortmaker @ 2012-08-16 15:36 UTC (permalink / raw)
To: Claudiu Manoil; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

[Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing]

On 14/08/2012 (Tue 19:08) Claudiu Manoil wrote:

> On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
> >This is a lot lower variation than what you reported earlier (20
> >versus 200, I think). It was the variation that raised a red flag
> >for me...
> Hi Paul,
> The earlier variation, which is much bigger (indeed ~200), was
> observed on a p1020 (slow, 2 cores, MQ_MG_MODE).
> I did not collect however as detailed measurement results for that
> board, as I did for p1010 (previous email).
> The most important performance improvement I've noticed however was
> on the p1020 platform.
>
> >>By changing the coalescing settings from default* (rx coalescing off,
> >>tx-usecs: 10, tx-frames: 16) to:
> >>"ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
> >>we get a throughput of ~710 Mbps.
> >>
> >>For *Image 2)*, using the default tcp_limit_output_bytes value
> >>(131072) - I've noticed
> >>that "tweaking" tcp_limit_output_bytes does not improve the
> >>throughput -, we get the
> >>following performance numbers:
> >>* default coalescing settings: ~650 Mbps
> >>* rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
> >>
> >>For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
> >>relevant performance
> >>improvement compared to Image 1).
> >>(note:
> >>For all the measurements, rx and tx BD ring sizes have been set to
> >>64, for best performance.)
> >>
> >>So, I really tend to believe that the performance degradation comes
> >>primarily from the driver,
> >>and the napi poll processing turns out to be an important source for
> >>that. The proposed patches
> >This would make sense, if the CPU was slammed at 100% load in dealing
> >with the tx processing, and the change made the driver considerably more
> >efficient. But is that really the case? Is the p1010 really going flat
> >out just to handle the Tx processing? Have you done any sort of
> >profiling to confirm/deny where the CPU is spending its time?
> The current gfar_poll implementation processes first the tx
> confirmation path exhaustively, without a budget/work limit,
> and only then proceeds with the rx processing within the allotted
> budget. And this happens for both Rx and Tx confirmation
> interrupts. I find this unfair and out of balance. Maybe by letting
> rx processing be triggered by rx interrupts only, and
> the tx conf path processing be triggered by tx confirmation
> interrupts only, and, on top of that, by imposing a work limit
> on the tx confirmation path too, we get a more responsive driver
> that performs better. Indeed some profiling data to
> confirm this would be great, but I don't have it.
>
> There's another issue that seems to be solved by this patchset, and
> I've noticed it only on p1020rdb (this time).
> And that is excessive Rx busy interrupts occurrence. Solving this
> issue may be another factor for the performance
> improvement on p1020. But maybe this is another discussion.
>
> >>show substantial improvement, especially for SMP systems where Tx
> >>and Rx processing may be
> >>done in parallel.
> >>What do you think?
> >>Is it ok to proceed by re-spinning the patches? Do you recommend
> >>additional measurements?
> >Unfortunately Eric is out this week, so we will be without his input for
> >a while. However, we are only at 3.6-rc1 -- meaning net-next will be
> >open for quite some time, hence no need to rush to try and jam stuff in.
> >
> >Also, I have two targets I'm interested in testing your patches on. The
> >1st is a 500MHz mpc8349 board -- which should replicate what you see on
> >your p1010 (slow, single core). The other is an 8641D, which is
> >interesting since it will give us the SMP tx/rx as separate threads, but
> >without the MQ_MG_MODE support (is that a correct assumption?)
> >
> >I don't have any fundamental problem with your patches (although 4/4
> >might be better as two patches) -- the above targets/tests are only
> >of interest, since I'm not convinced we yet understand _why_ your
> >changes give a performance boost, and there might be something
> >interesting hiding in there.
> >
> >So, while Eric is out, let me see if I can collect some more data on
> >those two targets sometime this week.
>
> Great, I don't mean to rush. The more data we get on this the better.
> It would be great if you could do some measurements on your platforms too.
> 8641D is indeed a dual core with etsec 1.x (so without the MQ_MG_MODE),
> but I did run some tests on a p2020, which has the same features. However
> I'm eager to see your results.

So, I've collected data on 8349 (520MHz single core) and 8641D (1GHz
dual core) and the results are kind of surprising (to me). The SMP
target, which in theory should have benefited from the change, actually
saw about an 8% reduction in throughput. And the slower single core saw
about a 5% increase.

I also retested the 8641D with just your 1st 3 patches (i.e. drop the
"Use separate NAPIs for Tx and Rx processing" patch) and it recovered
about 1/2 the lost throughput, but not all.

I've used your patches exactly as posted, and the same netperf cmdline.
I briefly experimented with disabling BQL on the 8349 but didn't see any
impact from doing that (consistent with what you'd reported).
I didn't see any real large variations either (target and server on
same switch), but I'm thinking the scatter could be reduced further if
I isolated the switch entirely to just the target and server. I'll do
that if I end up doing any more testing on this, since the averages
seem to be reproducible to about +/- 2% at the moment...

Paul.

--------------
Command: netperf -l 20 -cC -H 192.168.146.65 -t TCP_STREAM -- -m 1500
net-next baseline: commit 1f07b62f3205f6ed41759df2892eaf433bc051a1
fsl RFC: http://patchwork.ozlabs.org/patch/175919/ applied to above.
Default queue sizes (256), BQL defaults.

8349 (528 MHz, single core):
net-next, 10 runs:
avg=123 max=124 min=121
send utilization > 99%

fsl RFC, 13 runs:
avg=129 (+ ~5%) max=131 min=127
send utilization > 99%

8641D (1GHz, dual core):
net-next, 10 runs:
avg=826 max=839 min=807
send utilization ~ 70%

fsl RFC, 12 runs:
avg=762 (- ~8%) max=783 min=698
send utilization ~ 70%

fsl RFC, _only_ 1st 3 of 4 patches, 13 runs:
avg=794 (- ~4%) max=816 min=758
send utilization ~ 70%
--------------

> Thanks for helping.
>
> Regards,
> Claudiu

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
  2012-08-16 15:36 ` Paul Gortmaker
@ 2012-08-17 11:28   ` Claudiu Manoil
  0 siblings, 0 replies; 21+ messages in thread
From: Claudiu Manoil @ 2012-08-17 11:28 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: Tomas Hruby, Eric Dumazet, netdev, David S. Miller

On 08/16/2012 06:36 PM, Paul Gortmaker wrote:
> [Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing]
> On 14/08/2012 (Tue 19:08) Claudiu Manoil wrote:
>
>> On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
>>> This is a lot lower variation than what you reported earlier (20
>>> versus 200, I think). It was the variation that raised a red flag
>>> for me...
>> Hi Paul,
>> The earlier variation, which is much bigger (indeed ~200), was
>> observed on a p1020 (slow, 2 cores, MQ_MG_MODE).
>> I did not collect however as detailed measurement results for that
>> board, as I did for p1010 (previous email).
>> The most important performance improvement I've noticed however was
>> on the p1020 platform.
>>
>>>> By changing the coalescing settings from default* (rx coalescing off,
>>>> tx-usecs: 10, tx-frames: 16) to:
>>>> "ethtool -C eth1 rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32"
>>>> we get a throughput of ~710 Mbps.
>>>>
>>>> For *Image 2)*, using the default tcp_limit_output_bytes value
>>>> (131072) - I've noticed
>>>> that "tweaking" tcp_limit_output_bytes does not improve the
>>>> throughput -, we get the
>>>> following performance numbers:
>>>> * default coalescing settings: ~650 Mbps
>>>> * rx-frames 22 tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
>>>>
>>>> For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
>>>> relevant performance
>>>> improvement compared to Image 1).
>>>> (note:
>>>> For all the measurements, rx and tx BD ring sizes have been set to
>>>> 64, for best performance.)
>>>>
>>>> So, I really tend to believe that the performance degradation comes
>>>> primarily from the driver,
>>>> and the napi poll processing turns out to be an important source for
>>>> that. The proposed patches
>>> This would make sense, if the CPU was slammed at 100% load in dealing
>>> with the tx processing, and the change made the driver considerably more
>>> efficient. But is that really the case? Is the p1010 really going flat
>>> out just to handle the Tx processing? Have you done any sort of
>>> profiling to confirm/deny where the CPU is spending its time?
>> The current gfar_poll implementation processes first the tx
>> confirmation path exhaustively, without a budget/work limit,
>> and only then proceeds with the rx processing within the allotted
>> budget. And this happens for both Rx and Tx confirmation
>> interrupts. I find this unfair and out of balance. Maybe by letting
>> rx processing be triggered by rx interrupts only, and
>> the tx conf path processing be triggered by tx confirmation
>> interrupts only, and, on top of that, by imposing a work limit
>> on the tx confirmation path too, we get a more responsive driver
>> that performs better. Indeed some profiling data to
>> confirm this would be great, but I don't have it.
>>
>> There's another issue that seems to be solved by this patchset, and
>> I've noticed it only on p1020rdb (this time).
>> And that is excessive Rx busy interrupts occurrence. Solving this
>> issue may be another factor for the performance
>> improvement on p1020. But maybe this is another discussion.
>>
>>>> show substantial improvement, especially for SMP systems where Tx
>>>> and Rx processing may be
>>>> done in parallel.
>>>> What do you think?
>>>> Is it ok to proceed by re-spinning the patches? Do you recommend
>>>> additional measurements?
>>> Unfortunately Eric is out this week, so we will be without his input for
>>> a while. However, we are only at 3.6-rc1 -- meaning net-next will be
>>> open for quite some time, hence no need to rush to try and jam stuff in.
>>>
>>> Also, I have two targets I'm interested in testing your patches on. The
>>> 1st is a 500MHz mpc8349 board -- which should replicate what you see on
>>> your p1010 (slow, single core). The other is an 8641D, which is
>>> interesting since it will give us the SMP tx/rx as separate threads, but
>>> without the MQ_MG_MODE support (is that a correct assumption?)
>>>
>>> I don't have any fundamental problem with your patches (although 4/4
>>> might be better as two patches) -- the above targets/tests are only
>>> of interest, since I'm not convinced we yet understand _why_ your
>>> changes give a performance boost, and there might be something
>>> interesting hiding in there.
>>>
>>> So, while Eric is out, let me see if I can collect some more data on
>>> those two targets sometime this week.
>> Great, I don't mean to rush. The more data we get on this the better.
>> It would be great if you could do some measurements on your platforms too.
>> 8641D is indeed a dual core with etsec 1.x (so without the MQ_MG_MODE),
>> but I did run some tests on a p2020, which has the same features. However
>> I'm eager to see your results.
> So, I've collected data on 8349 (520MHz single core) and 8641D (1GHz
> dual core) and the results are kind of surprising (to me). The SMP
> target, which in theory should have benefited from the change, actually
> saw about an 8% reduction in throughput. And the slower single core saw
> about a 5% increase.
>
> I also retested the 8641D with just your 1st 3 patches (i.e. drop the
> "Use separate NAPIs for Tx and Rx processing" patch) and it recovered
> about 1/2 the lost throughput, but not all.
>
> I've used your patches exactly as posted, and the same netperf cmdline.
> I briefly experimented with disabling BQL on the 8349 but didn't see any
> impact from doing that (consistent with what you'd reported).
> I didn't
> see any real large variations either (target and server on same switch),
> but I'm thinking the scatter could be reduced further if I isolated
> the switch entirely to just the target and server. I'll do that if I
> end up doing any more testing on this, since the averages seem to be
> reproducible to about +/- 2% at the moment...
>
> Paul.
>
> --------------
> Command: netperf -l 20 -cC -H 192.168.146.65 -t TCP_STREAM -- -m 1500
> net-next baseline: commit 1f07b62f3205f6ed41759df2892eaf433bc051a1
> fsl RFC: http://patchwork.ozlabs.org/patch/175919/ applied to above.
> Default queue sizes (256), BQL defaults.
>
> 8349 (528 MHz, single core):
> net-next 10 runs
> avg=123
> max=124
> min=121
> send utilization > 99%
>
> fsl RFC 13 runs:
> avg=129 (+ ~5%)
> max=131
> min=127
> send utilization > 99%
>
> 8641D (1GHz, dual core):
> net-next 10 runs
> avg=826
> max=839
> min=807
> send utilization ~ 70%
>
> fsl RFC 12 runs
> avg=762 (- ~8%)
> max=783
> min=698
> send utilization ~ 70%
>
> fsl RFC, _only_ 1st 3 of 4 patches, 13 runs
> avg=794 (- ~4%)
> max=816
> min=758
> send utilization ~ 70%
> --------------

Hello Paul,

Thanks again for the measurements. It will take me some time to "digest"
the results and even do more tests/analysis on the platforms at my
disposal. Your results are indeed surprising, but there are some
noticeable differences b/w our setups too. First of all, as noted
before, I'm using BD rings of size 64 for best performance (as this
proved to be an optimal setting over time). So, before starting any
tests, I was issuing: "ethtool -G <ethX> rx 64 tx 64". Another point is
that, to enhance the performance gain, I was using some "decent"
interrupt coalescing settings (at least to have rx coalescing enabled
too, which is off by default).
So I've been using:
"ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32"
I think that the proposed code enhancement requires some balanced
interrupt coalescing settings too, for best results. It's interesting
that with these settings, I was able to reach ~940 Mbps on a p2020rdb
(which is also "single queue single group", but with two 1.2GHz e500v2
cores), both with and without the RFC patches.

Another point to consider when doing these measurements on SMP systems
is that Rx/Tx interrupt handling should happen on distinct CPUs. I
think this happens by default for netperf on the "non-MQ_MG_MODE"
systems (like 8641D, or p2020), but this condition must be verified for
the MQ_MG_MODE systems (like p1020) by checking /proc/interrupts, and
(if needed) forced by setting interrupt affinities accordingly.

Btw., do you happen to have a p1020 board at your disposal too?

Best regards,
Claudiu

^ permalink raw reply [flat|nested] 21+ messages in thread
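[Editor's aside: collected in one place, the tuning steps described in this thread look like the following. These are illustrative commands only; eth1, <ethX>, the IRQ numbers, and the smp_affinity masks (which assume a 2-core layout) are all board-specific and must be adapted.]

```shell
# BD ring sizes of 64 (reported in this thread as optimal on these boards)
ethtool -G eth1 rx 64 tx 64

# Balanced Rx/Tx interrupt coalescing (rx coalescing is off by default)
ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32

# On MQ_MG_MODE parts (e.g. p1020): verify that Rx and Tx interrupts are
# serviced by different cores, and pin them if they are not.
grep eth1 /proc/interrupts
echo 1 > /proc/irq/<rx_irq>/smp_affinity   # CPU0; <rx_irq> from /proc/interrupts
echo 2 > /proc/irq/<tx_irq>/smp_affinity   # CPU1; <tx_irq> from /proc/interrupts
```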
end of thread, other threads:[~2012-08-17 11:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-08 12:26 [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Claudiu Manoil
2012-08-08 12:26 ` [RFC net-next 1/4] gianfar: Remove redundant programming of [rt]xic registers Claudiu Manoil
2012-08-08 12:26 ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Claudiu Manoil
2012-08-08 12:26 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Claudiu Manoil
2012-08-08 12:26 ` [RFC net-next 4/4] gianfar: Use separate NAPIs for Tx and Rx processing Claudiu Manoil
2012-08-14  0:51 ` Paul Gortmaker
2012-08-14 16:08 ` Claudiu Manoil
2012-08-08 15:44 ` [RFC net-next 3/4] gianfar: Separate out the Rx and Tx coalescing functions Paul Gortmaker
2012-08-09 16:24 ` Claudiu Manoil
2012-08-15  1:29 ` Paul Gortmaker
2012-08-08 16:11 ` [RFC net-next 2/4] gianfar: Clear ievent from interrupt handler for [RT]x int Paul Gortmaker
2012-08-09 16:04 ` Claudiu Manoil
2012-08-08 16:24 ` [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing Paul Gortmaker
2012-08-08 16:44 ` Eric Dumazet
2012-08-08 23:06 ` Tomas Hruby
2012-08-09 15:07 ` Claudiu Manoil
2012-08-13 16:23 ` Claudiu Manoil
2012-08-14  1:15 ` Paul Gortmaker
2012-08-14 16:08 ` Claudiu Manoil
2012-08-16 15:36 ` Paul Gortmaker
2012-08-17 11:28 ` Claudiu Manoil