* [PATCH] myr10ge: again fix lro_gen_skb() alignment
@ 2009-04-15 8:09 Stanislaw Gruszka
2009-04-15 9:28 ` David Miller
0 siblings, 1 reply; 41+ messages in thread
From: Stanislaw Gruszka @ 2009-04-15 8:09 UTC (permalink / raw)
To: netdev; +Cc: Andrew Gallatin, Brice Goglin
Add LRO alignment initially committed in 621544eb8c3beaa859c75850f816dd9b056a00a3
and removed in 0dcffac1a329be69bab0ac604bf7283737108e68 during conversion to
multi-slice.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
drivers/net/myri10ge/myri10ge.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 9eed126..f2c4a66 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -2447,6 +2447,7 @@ static int myri10ge_open(struct net_device *dev)
lro_mgr->lro_arr = ss->rx_done.lro_desc;
lro_mgr->get_frag_header = myri10ge_get_frag_header;
lro_mgr->max_aggr = myri10ge_lro_max_pkts;
+ lro_mgr->frag_align_pad = 2;
if (lro_mgr->max_aggr > MAX_SKB_FRAGS)
lro_mgr->max_aggr = MAX_SKB_FRAGS;
--
1.5.5.6
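For readers wondering why the pad is 2, here is a minimal userspace sketch. It assumes that frag_align_pad is the number of bytes reserved in front of the Ethernet header of the skb that lro_gen_skb() builds, and that the goal is an IP header starting on a 4-byte boundary. The helper names are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Length of an untagged Ethernet header in bytes (ETH_HLEN in the kernel). */
#define ETH_HEADER_LEN 14

/*
 * Offset of the IP header from an aligned buffer start when `pad` bytes
 * are reserved in front of the Ethernet header.
 */
static size_t ip_header_offset(size_t pad)
{
	return pad + ETH_HEADER_LEN;
}

/* Nonzero when the IP header would start on a 4-byte boundary. */
static int ip_header_is_aligned(size_t pad)
{
	return ip_header_offset(pad) % 4 == 0;
}
```

With no pad the IP header lands at offset 14, which is not a multiple of 4. With a pad of 2 it lands at offset 16.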
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 8:09 [PATCH] myr10ge: again fix lro_gen_skb() alignment Stanislaw Gruszka
@ 2009-04-15 9:28 ` David Miller
2009-04-15 9:48 ` Brice Goglin
0 siblings, 1 reply; 41+ messages in thread
From: David Miller @ 2009-04-15 9:28 UTC (permalink / raw)
To: sgruszka; +Cc: netdev, gallatin, brice
From: Stanislaw Gruszka <sgruszka@redhat.com>
Date: Wed, 15 Apr 2009 10:09:37 +0200
> Add LRO alignment initially committed in 621544eb8c3beaa859c75850f816dd9b056a00a3
> and removed in 0dcffac1a329be69bab0ac604bf7283737108e68 during conversion to
> multi-slice.
>
> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Applied, thanks.
Please, in the future, add the header string of the commit message
when referencing GIT commits. When this patch is added to the -stable
kernel or similar, the GIT commit IDs might be different and it
will be impossible for someone reading your commit message to find
the referenced commit using only the SHA ID.
I fixed that up for you this time.
Also, it would be great to get this driver converted over to GRO;
bugs like this one aren't even possible with GRO :-)
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 9:28 ` David Miller
@ 2009-04-15 9:48 ` Brice Goglin
2009-04-15 10:02 ` David Miller
0 siblings, 1 reply; 41+ messages in thread
From: Brice Goglin @ 2009-04-15 9:48 UTC (permalink / raw)
To: David Miller; +Cc: sgruszka, netdev, gallatin, brice
David Miller wrote:
> From: Stanislaw Gruszka <sgruszka@redhat.com>
> Date: Wed, 15 Apr 2009 10:09:37 +0200
>
>
>> Add LRO alignment initially committed in 621544eb8c3beaa859c75850f816dd9b056a00a3
>> and removed in 0dcffac1a329be69bab0ac604bf7283737108e68 during conversion to
>> multi-slice.
>>
>> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
>>
>
> Applied, thanks.
>
> Please, in the future, add the header string of the commit message
> when referencing GIT commits. When this patch is added to the -stable
> kernel or similar, the GIT commit IDs might be different and it
> will be impossible for someone reading your commit message to find
> the referenced commit using only the SHA ID.
>
I guess we need to send this patch to the stable maintainers since it
should affect 2.6.27, .28 and .29.
> Also, it would be great to get this driver converted over to GRO;
> bugs like this one aren't even possible with GRO :-)
>
It looks like nobody complains about GRO bugs or performance-problems
anymore so we might indeed look at converting myri10ge for 2.6.31.
Is there a good summary somewhere of why GRO is better, and how to
actually convert drivers?
Brice
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 9:48 ` Brice Goglin
@ 2009-04-15 10:02 ` David Miller
2009-04-15 13:01 ` Andrew Gallatin
0 siblings, 1 reply; 41+ messages in thread
From: David Miller @ 2009-04-15 10:02 UTC (permalink / raw)
To: brice; +Cc: sgruszka, netdev, gallatin
From: Brice Goglin <brice@myri.com>
Date: Wed, 15 Apr 2009 11:48:06 +0200
> David Miller wrote:
>> From: Stanislaw Gruszka <sgruszka@redhat.com>
>> Date: Wed, 15 Apr 2009 10:09:37 +0200
>>
>>
>>> Add LRO alignment initially committed in 621544eb8c3beaa859c75850f816dd9b056a00a3
>>> and removed in 0dcffac1a329be69bab0ac604bf7283737108e68 during conversion to
>>> multi-slice.
>>>
>>> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
>>>
>>
>> Applied, thanks.
>>
>> Please, in the future, add the header string of the commit message
>> when referencing GIT commits. When this patch is added to the -stable
>> kernel or similar, the GIT commit IDs might be different and it
>> will be impossible for someone reading your commit message to find
>> the referenced commit using only the SHA ID.
>>
>
> I guess we need to send this patch to the stable maintainers since it
> should affect 2.6.27, .28 and .29.
I will queue it up for -stable, you just have to ask me to do
that.
> Is there a good summary somewhere of why GRO is better,
Transparent forwarding/bridging support, easier driver port.
> and how to
> actually convert drivers?
Step 1: Remove all of your LRO support code, every last line
Step 2: netif_receive_skb() --> napi_gro_receive()
vlan_hwaccel_rx() --> vlan_gro_receive()
It couldn't be any easier.
And it would also behoove you to look at the commits that converted or
added GRO support to other drivers. That's how I learned it :-)
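The call-site change in step 2 can be sketched in userspace as follows. The struct definitions and the body of napi_gro_receive() below are stand-ins written for this illustration, not the kernel's; only the function name and parameter order are taken from the prototype in include/linux/netdevice.h. example_rx_one() is a hypothetical driver hook.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for illustration only, NOT the kernel definitions. */
struct sk_buff { unsigned int len; };
struct napi_struct { unsigned int gro_calls; };

#define NET_RX_SUCCESS 0

/* Stub with the same parameter order as the kernel's napi_gro_receive(). */
static int napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
	(void)skb;
	napi->gro_calls++;
	return NET_RX_SUCCESS;
}

/*
 * Hypothetical driver receive hook after conversion. Before conversion,
 * this line would have been netif_receive_skb(skb), and the LRO manager
 * setup and flush calls from step 1 would have lived around it.
 */
static int example_rx_one(struct napi_struct *napi, struct sk_buff *skb)
{
	return napi_gro_receive(napi, skb);
}
```

The point of the sketch is that the hook now needs the napi context of the queue it is polling, which the old netif_receive_skb() call did not.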
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 10:02 ` David Miller
@ 2009-04-15 13:01 ` Andrew Gallatin
2009-04-15 21:04 ` Andrew Gallatin
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-15 13:01 UTC (permalink / raw)
To: David Miller; +Cc: brice, sgruszka, netdev
David Miller wrote:
> From: Brice Goglin <brice@myri.com>
>> Is there a good summary somewhere of why GRO is better,
>
> Transparent forwarding/bridging support, easier driver port.
>
>> and how to
>> actually convert drivers?
>
> Step 1: Remove all of your LRO support code, every last line
> Step 2: netif_receive_skb() --> napi_gro_receive()
> vlan_hwaccel_rx() --> vlan_gro_receive()
>
> It couldn't be any easier.
>
> And it would also behoove you to look at the commits that converted or
> added GRO support to other drivers. That's how I learned it :-)
Unfortunately, it doesn't appear that GRO is able to handle frags
(like lro_receive_frags()), so I anticipate its overhead would
be much higher than LRO for us, due to extra memory allocation
and freeing overheads. I'll try to find the time to convert
the driver and run some quick tests to confirm.
However, since LRO is optional, it would make sense to
convert the non-LRO code path at the very least.
Drew
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 13:01 ` Andrew Gallatin
@ 2009-04-15 21:04 ` Andrew Gallatin
2009-04-15 23:42 ` David Miller
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-15 21:04 UTC (permalink / raw)
To: David Miller; +Cc: brice, sgruszka, netdev
Andrew Gallatin wrote:
> David Miller wrote:
>> From: Brice Goglin <brice@myri.com>
>>> Is there a good summary somewhere of why GRO is better,
>>
>> Transparent forwarding/bridging support, easier driver port.
>>
>>> and how to
>>> actually convert drivers?
>>
>> Step 1: Remove all of your LRO support code, every last line
>> Step 2: netif_receive_skb() --> napi_gro_receive()
>> vlan_hwaccel_rx() --> vlan_gro_receive()
>>
>> It couldn't be any easier.
>>
>> And it would also behoove you to look at the commits that converted or
>> added GRO support to other drivers. That's how I learned it :-)
>
> Unfortunately, it doesn't appear that GRO is able to handle frags
> (like lro_receive_frags()), so I anticipate its overhead would
Ah, I missed napi_gro_frags()! I've got a quick and dirty test
patch which uses that, but I need to fix a few things. I also need
to figure out why it seems to be a bit slower than LRO
(varies from 8.5 to 9.2 Gb/s, rather than always 9.4Gb/s)
on my old, weak 2.0GHz athlon64.
Thanks for the pointer,
Drew
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 21:04 ` Andrew Gallatin
@ 2009-04-15 23:42 ` David Miller
2009-04-16 8:50 ` Herbert Xu
0 siblings, 1 reply; 41+ messages in thread
From: David Miller @ 2009-04-15 23:42 UTC (permalink / raw)
To: gallatin; +Cc: brice, sgruszka, netdev, herbert
From: Andrew Gallatin <gallatin@myri.com>
Date: Wed, 15 Apr 2009 17:04:36 -0400
> Andrew Gallatin wrote:
>> Unfortunately, it doesn't appear that GRO is able to handle frags
>> (like lro_receive_frags()), so I anticipate its overhead would
>
> Ah, I missed napi_gro_frags()! I've got a quick and dirty test
> patch which uses that, but I need to fix a few things. I also need
> to figure out why it seems to be a bit slower than LRO
> (varies from 8.5 to 9.2 Gb/s, rather than always 9.4Gb/s)
> on my old, weak 2.0GHz athlon64.
Herbert has been working on various optimizations to get
cxgb3 GRO performance on par with LRO. Perhaps he has
some things for you to try :-)
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-15 23:42 ` David Miller
@ 2009-04-16 8:50 ` Herbert Xu
2009-04-16 9:02 ` David Miller
2009-04-21 19:19 ` Andrew Gallatin
0 siblings, 2 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-16 8:50 UTC (permalink / raw)
To: David Miller; +Cc: gallatin, brice, sgruszka, netdev
On Wed, Apr 15, 2009 at 04:42:48PM -0700, David Miller wrote:
>
> Herbert has been working on various optimizations to get
> cxgb3 GRO performance on par with LRO. Perhaps he has
> some things for you to try :-)
Yes, this patch should improve performance. In fact, when you
reopen the net-next tree feel free to put this patch in :)
gro: New frags interface to avoid copying shinfo
It turns out that copying a 16-byte area at ~800k times a second
can be really expensive :) This patch redesigns the frags GRO
interface to avoid copying that area twice.
The two disciples of the frags interface have been converted.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/drivers/net/cxgb3/adapter.h b/drivers/net/cxgb3/adapter.h
index 714df2b..322434a 100644
--- a/drivers/net/cxgb3/adapter.h
+++ b/drivers/net/cxgb3/adapter.h
@@ -195,7 +195,7 @@ struct sge_qset { /* an SGE queue set */
struct sge_rspq rspq;
struct sge_fl fl[SGE_RXQ_PER_SET];
struct sge_txq txq[SGE_TXQ_PER_SET];
- struct napi_gro_fraginfo lro_frag_tbl;
+ int nomem;
int lro_enabled;
void *lro_va;
struct net_device *netdev;
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index 26d3587..73d569e 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -654,7 +654,8 @@ static void t3_reset_qset(struct sge_qset *q)
q->txq_stopped = 0;
q->tx_reclaim_timer.function = NULL; /* for t3_stop_sge_timers() */
q->rx_reclaim_timer.function = NULL;
- q->lro_frag_tbl.nr_frags = q->lro_frag_tbl.len = 0;
+ q->nomem = 0;
+ napi_free_frags(&q->napi);
}
@@ -2074,20 +2075,19 @@ static void lro_add_page(struct adapter *adap, struct sge_qset *qs,
struct sge_fl *fl, int len, int complete)
{
struct rx_sw_desc *sd = &fl->sdesc[fl->cidx];
+ struct sk_buff *skb = NULL;
struct cpl_rx_pkt *cpl;
- struct skb_frag_struct *rx_frag = qs->lro_frag_tbl.frags;
- int nr_frags = qs->lro_frag_tbl.nr_frags;
- int frag_len = qs->lro_frag_tbl.len;
+ struct skb_frag_struct *rx_frag;
+ int nr_frags;
int offset = 0;
- if (!nr_frags) {
- offset = 2 + sizeof(struct cpl_rx_pkt);
- qs->lro_va = cpl = sd->pg_chunk.va + 2;
+ if (!qs->nomem) {
+ skb = napi_get_frags(&qs->napi);
+ qs->nomem = !skb;
}
fl->credits--;
- len -= offset;
pci_dma_sync_single_for_cpu(adap->pdev,
pci_unmap_addr(sd, dma_addr),
fl->buf_size - SGE_PG_RSVD,
@@ -2100,21 +2100,38 @@ static void lro_add_page(struct adapter *adap, struct sge_qset *qs,
fl->alloc_size,
PCI_DMA_FROMDEVICE);
+ if (!skb) {
+ put_page(sd->pg_chunk.page);
+ if (complete)
+ qs->nomem = 0;
+ return;
+ }
+
+ rx_frag = skb_shinfo(skb)->frags;
+ nr_frags = skb_shinfo(skb)->nr_frags;
+
+ if (!nr_frags) {
+ offset = 2 + sizeof(struct cpl_rx_pkt);
+ qs->lro_va = sd->pg_chunk.va + 2;
+ }
+ len -= offset;
+
prefetch(qs->lro_va);
rx_frag += nr_frags;
rx_frag->page = sd->pg_chunk.page;
rx_frag->page_offset = sd->pg_chunk.offset + offset;
rx_frag->size = len;
- frag_len += len;
- qs->lro_frag_tbl.nr_frags++;
- qs->lro_frag_tbl.len = frag_len;
+ skb->len += len;
+ skb->data_len += len;
+ skb->truesize += len;
+ skb_shinfo(skb)->nr_frags++;
if (!complete)
return;
- qs->lro_frag_tbl.ip_summed = CHECKSUM_UNNECESSARY;
+ skb->ip_summed = CHECKSUM_UNNECESSARY;
cpl = qs->lro_va;
if (unlikely(cpl->vlan_valid)) {
@@ -2123,15 +2140,11 @@ static void lro_add_page(struct adapter *adap, struct sge_qset *qs,
struct vlan_group *grp = pi->vlan_grp;
if (likely(grp != NULL)) {
- vlan_gro_frags(&qs->napi, grp, ntohs(cpl->vlan),
- &qs->lro_frag_tbl);
- goto out;
+ vlan_gro_frags(&qs->napi, grp, ntohs(cpl->vlan));
+ return;
}
}
- napi_gro_frags(&qs->napi, &qs->lro_frag_tbl);
-
-out:
- qs->lro_frag_tbl.nr_frags = qs->lro_frag_tbl.len = 0;
+ napi_gro_frags(&qs->napi);
}
/**
@@ -2300,8 +2313,6 @@ no_mem:
if (fl->use_pages) {
void *addr = fl->sdesc[fl->cidx].pg_chunk.va;
- prefetch(&qs->lro_frag_tbl);
-
prefetch(addr);
#if L1_CACHE_BYTES < 128
prefetch(addr + L1_CACHE_BYTES);
diff --git a/drivers/net/sfc/rx.c b/drivers/net/sfc/rx.c
index 66d7fe3..01f9432 100644
--- a/drivers/net/sfc/rx.c
+++ b/drivers/net/sfc/rx.c
@@ -450,17 +450,27 @@ static void efx_rx_packet_lro(struct efx_channel *channel,
/* Pass the skb/page into the LRO engine */
if (rx_buf->page) {
- struct napi_gro_fraginfo info;
+ struct sk_buff *skb = napi_get_frags(napi);
- info.frags[0].page = rx_buf->page;
- info.frags[0].page_offset = efx_rx_buf_offset(rx_buf);
- info.frags[0].size = rx_buf->len;
- info.nr_frags = 1;
- info.ip_summed = CHECKSUM_UNNECESSARY;
- info.len = rx_buf->len;
+ if (!skb) {
+ put_page(rx_buf->page);
+ goto out;
+ }
+
+ skb_shinfo(skb)->frags[0].page = rx_buf->page;
+ skb_shinfo(skb)->frags[0].page_offset =
+ efx_rx_buf_offset(rx_buf);
+ skb_shinfo(skb)->frags[0].size = rx_buf->len;
+ skb_shinfo(skb)->nr_frags = 1;
+
+ skb->len = rx_buf->len;
+ skb->data_len = rx_buf->len;
+ skb->truesize += rx_buf->len;
+ skb->ip_summed = CHECKSUM_UNNECESSARY;
- napi_gro_frags(napi, &info);
+ napi_gro_frags(napi);
+out:
EFX_BUG_ON_PARANOID(rx_buf->skb);
rx_buf->page = NULL;
} else {
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index e1ff5b1..7ff9af1 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -118,8 +118,7 @@ extern int vlan_hwaccel_do_receive(struct sk_buff *skb);
extern int vlan_gro_receive(struct napi_struct *napi, struct vlan_group *grp,
unsigned int vlan_tci, struct sk_buff *skb);
extern int vlan_gro_frags(struct napi_struct *napi, struct vlan_group *grp,
- unsigned int vlan_tci,
- struct napi_gro_fraginfo *info);
+ unsigned int vlan_tci);
#else
static inline struct net_device *vlan_dev_real_dev(const struct net_device *dev)
@@ -154,8 +153,7 @@ static inline int vlan_gro_receive(struct napi_struct *napi,
}
static inline int vlan_gro_frags(struct napi_struct *napi,
- struct vlan_group *grp, unsigned int vlan_tci,
- struct napi_gro_fraginfo *info)
+ struct vlan_group *grp, unsigned int vlan_tci)
{
return NET_RX_DROP;
}
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..54db3eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1047,14 +1047,6 @@ struct packet_type {
struct list_head list;
};
-struct napi_gro_fraginfo {
- skb_frag_t frags[MAX_SKB_FRAGS];
- unsigned int nr_frags;
- unsigned int ip_summed;
- unsigned int len;
- __wsum csum;
-};
-
#include <linux/interrupt.h>
#include <linux/notifier.h>
@@ -1442,12 +1434,18 @@ extern int napi_gro_receive(struct napi_struct *napi,
struct sk_buff *skb);
extern void napi_reuse_skb(struct napi_struct *napi,
struct sk_buff *skb);
-extern struct sk_buff * napi_fraginfo_skb(struct napi_struct *napi,
- struct napi_gro_fraginfo *info);
+extern struct sk_buff * napi_get_frags(struct napi_struct *napi);
extern int napi_frags_finish(struct napi_struct *napi,
struct sk_buff *skb, int ret);
-extern int napi_gro_frags(struct napi_struct *napi,
- struct napi_gro_fraginfo *info);
+extern struct sk_buff * napi_frags_skb(struct napi_struct *napi);
+extern int napi_gro_frags(struct napi_struct *napi);
+
+static inline void napi_free_frags(struct napi_struct *napi)
+{
+ kfree_skb(napi->skb);
+ napi->skb = NULL;
+}
+
extern void netif_nit_deliver(struct sk_buff *skb);
extern int dev_valid_name(const char *name);
extern int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index c67fe6f..7f7de1a 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -114,9 +114,9 @@ int vlan_gro_receive(struct napi_struct *napi, struct vlan_group *grp,
EXPORT_SYMBOL(vlan_gro_receive);
int vlan_gro_frags(struct napi_struct *napi, struct vlan_group *grp,
- unsigned int vlan_tci, struct napi_gro_fraginfo *info)
+ unsigned int vlan_tci)
{
- struct sk_buff *skb = napi_fraginfo_skb(napi, info);
+ struct sk_buff *skb = napi_frags_skb(napi);
if (!skb)
return NET_RX_DROP;
diff --git a/net/core/dev.c b/net/core/dev.c
index 91d792d..619fa14 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2519,16 +2519,10 @@ void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
}
EXPORT_SYMBOL(napi_reuse_skb);
-struct sk_buff *napi_fraginfo_skb(struct napi_struct *napi,
- struct napi_gro_fraginfo *info)
+struct sk_buff *napi_get_frags(struct napi_struct *napi)
{
struct net_device *dev = napi->dev;
struct sk_buff *skb = napi->skb;
- struct ethhdr *eth;
- skb_frag_t *frag;
- int i;
-
- napi->skb = NULL;
if (!skb) {
skb = netdev_alloc_skb(dev, GRO_MAX_HEAD + NET_IP_ALIGN);
@@ -2536,47 +2530,14 @@ struct sk_buff *napi_fraginfo_skb(struct napi_struct *napi,
goto out;
skb_reserve(skb, NET_IP_ALIGN);
- }
-
- BUG_ON(info->nr_frags > MAX_SKB_FRAGS);
- frag = &info->frags[info->nr_frags - 1];
- for (i = skb_shinfo(skb)->nr_frags; i < info->nr_frags; i++) {
- skb_fill_page_desc(skb, i, frag->page, frag->page_offset,
- frag->size);
- frag++;
+ napi->skb = skb;
}
- skb_shinfo(skb)->nr_frags = info->nr_frags;
-
- skb->data_len = info->len;
- skb->len += info->len;
- skb->truesize += info->len;
-
- skb_reset_mac_header(skb);
- skb_gro_reset_offset(skb);
-
- eth = skb_gro_header(skb, sizeof(*eth));
- if (!eth) {
- napi_reuse_skb(napi, skb);
- skb = NULL;
- goto out;
- }
-
- skb_gro_pull(skb, sizeof(*eth));
-
- /*
- * This works because the only protocols we care about don't require
- * special handling. We'll fix it up properly at the end.
- */
- skb->protocol = eth->h_proto;
-
- skb->ip_summed = info->ip_summed;
- skb->csum = info->csum;
out:
return skb;
}
-EXPORT_SYMBOL(napi_fraginfo_skb);
+EXPORT_SYMBOL(napi_get_frags);
int napi_frags_finish(struct napi_struct *napi, struct sk_buff *skb, int ret)
{
@@ -2606,9 +2567,39 @@ int napi_frags_finish(struct napi_struct *napi, struct sk_buff *skb, int ret)
}
EXPORT_SYMBOL(napi_frags_finish);
-int napi_gro_frags(struct napi_struct *napi, struct napi_gro_fraginfo *info)
+struct sk_buff *napi_frags_skb(struct napi_struct *napi)
+{
+ struct sk_buff *skb = napi->skb;
+ struct ethhdr *eth;
+
+ napi->skb = NULL;
+
+ skb_reset_mac_header(skb);
+ skb_gro_reset_offset(skb);
+
+ eth = skb_gro_header(skb, sizeof(*eth));
+ if (!eth) {
+ napi_reuse_skb(napi, skb);
+ skb = NULL;
+ goto out;
+ }
+
+ skb_gro_pull(skb, sizeof(*eth));
+
+ /*
+ * This works because the only protocols we care about don't require
+ * special handling. We'll fix it up properly at the end.
+ */
+ skb->protocol = eth->h_proto;
+
+out:
+ return skb;
+}
+EXPORT_SYMBOL(napi_frags_skb);
+
+int napi_gro_frags(struct napi_struct *napi)
{
- struct sk_buff *skb = napi_fraginfo_skb(napi, info);
+ struct sk_buff *skb = napi_frags_skb(napi);
if (!skb)
return NET_RX_DROP;
@@ -2712,7 +2703,7 @@ void netif_napi_del(struct napi_struct *napi)
struct sk_buff *skb, *next;
list_del_init(&napi->dev_list);
- kfree_skb(napi->skb);
+ napi_free_frags(napi);
for (skb = napi->gro_list; skb; skb = next) {
next = skb->next;
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
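The driver-side pattern that the new frags interface asks for can be sketched in userspace as below, modelled on the sfc hunk in the patch above. Everything here is a mock: the struct layouts, the single-skb "pool", the fail_alloc switch and example_rx_page() are assumptions made for illustration. Only the names napi_get_frags() and napi_gro_frags(), and the idea that the skb is cached in the napi context until it is handed off, come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for illustration only, NOT the kernel structures. */
struct frag { void *page; unsigned int page_offset; unsigned int size; };
struct sk_buff {
	unsigned int len, data_len, truesize;
	unsigned int nr_frags;
	struct frag frags[4];
};
struct napi_struct {
	struct sk_buff *skb;    /* skb cached between calls, as in the patch */
	struct sk_buff pool;    /* single backing skb for this mock */
	int fail_alloc;         /* set to simulate an allocation failure */
	unsigned int delivered; /* bytes handed to the mock GRO layer */
};

/* Mock: return the cached skb, "allocating" one if none is cached. */
static struct sk_buff *napi_get_frags(struct napi_struct *napi)
{
	if (!napi->skb && !napi->fail_alloc) {
		napi->pool = (struct sk_buff){0};
		napi->skb = &napi->pool;
	}
	return napi->skb;
}

/* Mock: take the cached skb and clear the cache. Caller ensures skb != NULL. */
static int napi_gro_frags(struct napi_struct *napi)
{
	napi->delivered += napi->skb->len;
	napi->skb = NULL;
	return 0;
}

/* Hypothetical single-page receive helper: 1 if delivered, 0 if dropped. */
static int example_rx_page(struct napi_struct *napi, void *page,
			   unsigned int offset, unsigned int len)
{
	struct sk_buff *skb = napi_get_frags(napi);

	if (!skb)
		return 0; /* a real driver would also release the page here */

	/* Fill the fragment in place instead of copying a side table. */
	skb->frags[0].page = page;
	skb->frags[0].page_offset = offset;
	skb->frags[0].size = len;
	skb->nr_frags = 1;
	skb->len = len;
	skb->data_len = len;
	skb->truesize += len;

	napi_gro_frags(napi);
	return 1;
}
```

Note the early return when no skb is available: the fragment array lives inside the skb, so nothing after that point may run without one.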
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-16 8:50 ` Herbert Xu
@ 2009-04-16 9:02 ` David Miller
2009-04-21 19:19 ` Andrew Gallatin
1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2009-04-16 9:02 UTC (permalink / raw)
To: herbert; +Cc: gallatin, brice, sgruszka, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 16 Apr 2009 16:50:22 +0800
> On Wed, Apr 15, 2009 at 04:42:48PM -0700, David Miller wrote:
>>
>> Herbert has been working on various optimizations to get
>> cxgb3 GRO performance on par with LRO. Perhaps he has
>> some things for you to try :-)
>
> Yes, this patch should improve performance. In fact, when you
> reopen the net-next tree feel free to put this patch in :)
>
> gro: New frags interface to avoid copying shinfo
>
> It turns out that copying a 16-byte area at ~800k times a second
> can be really expensive :) This patch redesigns the frags GRO
> interface to avoid copying that area twice.
>
> The two disciples of the frags interface have been converted.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It is open already, so, applied :-)
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-16 8:50 ` Herbert Xu
2009-04-16 9:02 ` David Miller
@ 2009-04-21 19:19 ` Andrew Gallatin
2009-04-22 10:48 ` Herbert Xu
2009-04-23 8:00 ` Herbert Xu
1 sibling, 2 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-21 19:19 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
[-- Attachment #1: Type: text/plain, Size: 2075 bytes --]
Herbert Xu wrote:
> On Wed, Apr 15, 2009 at 04:42:48PM -0700, David Miller wrote:
>> Herbert has been working on various optimizations to get
>> cxgb3 GRO performance on par with LRO. Perhaps he has
>> some things for you to try :-)
>
> Yes, this patch should improve performance. In fact, when you
> reopen the net-next tree feel free to put this patch in :)
>
> gro: New frags interface to avoid copying shinfo
<...>
Hi Herbert,
With a net-next tree pulled 2 hours ago, I can now see line rate when
using frags with myri10ge on my weakest machines when receiving a
1500-byte MTU TCP stream. To achieve line rate on these machines with both
inet_lro and GRO, I must bind the netserver and device IRQ to
different CPUs. Unfortunately, CPU accounting seems to currently be
broken in the Linux kernel, so I cannot provide an accurate comparison
at line rate.
So to compare inet_lro and GRO, I'm binding the netserver and device IRQ
to the same CPU. When I do this, that CPU is saturated and GRO is
roughly 17% slower than inet_lro. For comparison, here are netperf
results from a fast peer sending to my weak machine (AMD Athlon(tm) 64
X2 Dual Core Processor 3800+, 2GHz). First inet_lro:
Recv   Send   Send                        Utilization     Service Demand
Socket Socket Message Elapsed             Send    Recv    Send    Recv
Size   Size   Size    Time    Throughput  local   remote  local   remote
bytes  bytes  bytes   secs.   10^6bits/s  % S     % S     us/KB   us/KB
87380  65536  65536   60.02   6631.52     12.45   50.10   0.308   1.238
And now GRO:
87380  65536  65536   60.01   5488.99     9.79    50.00   0.292   1.492
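As a quick check of the "roughly 17%" figure, the sketch below recomputes the gap from the two netperf runs above. The constants are copied from those runs. The helper names are hypothetical, and reading the last column as the receiver's service demand assumes the usual netperf column order (send local, receive remote).

```c
#include <assert.h>

/* Throughput in 10^6 bits/s, copied from the two netperf runs. */
static const double INET_LRO_MBPS = 6631.52;
static const double GRO_MBPS = 5488.99;

/* Receiver (remote) service demand in us/KB, copied from the same runs. */
static const double INET_LRO_RECV_SD = 1.238;
static const double GRO_RECV_SD = 1.492;

/* How much lower `value` is than `baseline`, as a percentage of baseline. */
static double percent_lower(double baseline, double value)
{
	return (baseline - value) / baseline * 100.0;
}

/* How much higher `value` is than `baseline`, as a percentage of baseline. */
static double percent_higher(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

percent_lower(INET_LRO_MBPS, GRO_MBPS) comes to about 17.2, matching the figure quoted in the text, and percent_higher(INET_LRO_RECV_SD, GRO_RECV_SD) comes to about 20.5, meaning GRO spends about a fifth more receiver CPU time per KB in this run.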
Also, can you tell me how to handle my device, which passes a simple
16-bit checksum across the entire frame (excluding first 14 bytes),
via GRO? Simply setting skb->ip_summed = CHECKSUM_COMPLETE leads
to "hw csum failure".
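For background on the arithmetic involved, here is a userspace sketch of the ones' complement sum and of removing a leading header's contribution from it, which is what the csum_sub() call in myri10ge_get_frag_header() does for the VLAN header. It does not diagnose the "hw csum failure"; it only illustrates the checksum identity. The function names are hypothetical, and the kernel helpers work on a 32-bit __wsum rather than the folded 16-bit values used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum of `len` bytes, read as big-endian 16-bit words. */
static uint16_t ones_sum(const uint8_t *data, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(data[i] << 8 | data[i + 1]);
	if (i < len) /* odd trailing byte, padded with a zero byte */
		sum += (uint32_t)data[i] << 8;
	return fold16(sum);
}

/* Remove `part` from `total`: add its complement, as csum_sub() does. */
static uint16_t ones_sub(uint16_t total, uint16_t part)
{
	return fold16((uint64_t)total + (uint16_t)~part);
}

/* 0x0000 and 0xffff both represent zero in ones' complement. */
static int ones_equal(uint16_t a, uint16_t b)
{
	return (a == 0xffff ? 0 : a) == (b == 0xffff ? 0 : b);
}
```

If the device's sum covers a 4-byte VLAN header followed by the IP packet, then ones_sub(sum_of_both, sum_of_vlan_header) equals the sum over the IP packet alone, provided the header has an even length so the 16-bit word boundaries stay aligned.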
I've attached my work-in-progress patch so you can see what I'm doing.
I do not want this applied due to performance and correctness issues.
Thanks for your help,
Drew
[-- Attachment #2: myri10ge_gro.diff --]
[-- Type: text/x-diff, Size: 9039 bytes --]
--- /home/gallatin/linux/git/net-next-2.6/drivers/net/myri10ge/myri10ge.c 2009-04-21 14:00:22.166783937 -0400
+++ linux-tmp/drivers/net/myri10ge/myri10ge.c 2009-04-21 15:08:06.241539153 -0400
@@ -75,7 +75,7 @@
#include "myri10ge_mcp.h"
#include "myri10ge_mcp_gen_header.h"
-#define MYRI10GE_VERSION_STR "1.4.4-1.412"
+#define MYRI10GE_VERSION_STR "1.4.4-1.413"
MODULE_DESCRIPTION("Myricom 10G driver (10GbE)");
MODULE_AUTHOR("Maintainer: help@myri.com");
@@ -161,8 +161,6 @@ struct myri10ge_rx_done {
dma_addr_t bus;
int cnt;
int idx;
- struct net_lro_mgr lro_mgr;
- struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
};
struct myri10ge_slice_netstats {
@@ -1272,11 +1270,11 @@ myri10ge_unmap_rx_page(struct pci_dev *p
static inline int
myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
- int bytes, int len, __wsum csum)
+ struct napi_struct *napi, int bytes, int len, __wsum csum)
{
struct myri10ge_priv *mgp = ss->mgp;
struct sk_buff *skb;
- struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
+ struct skb_frag_struct *rx_frags;
int i, idx, hlen, remainder;
struct pci_dev *pdev = mgp->pdev;
struct net_device *dev = mgp->dev;
@@ -1286,6 +1284,19 @@ myri10ge_rx_done(struct myri10ge_slice_s
idx = rx->cnt & rx->mask;
va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
prefetch(va);
+
+ skb = napi_get_frags(napi);
+ if ((unlikely(!skb))) {
+ for (i = 0, remainder = len; remainder > 0; i++) {
+ myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
+ put_page(rx->info[idx].page);
+ rx->cnt++;
+ idx = rx->cnt & rx->mask;
+ remainder -= MYRI10GE_ALLOC_SIZE;
+ }
+ }
+ rx_frags = skb_shinfo(skb)->frags;
+
/* Fill skb_frag_struct(s) with data from our receive */
for (i = 0, remainder = len; remainder > 0; i++) {
myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
@@ -1300,52 +1311,18 @@ myri10ge_rx_done(struct myri10ge_slice_s
remainder -= MYRI10GE_ALLOC_SIZE;
}
- if (mgp->csum_flag && myri10ge_lro) {
- rx_frags[0].page_offset += MXGEFW_PAD;
- rx_frags[0].size -= MXGEFW_PAD;
- len -= MXGEFW_PAD;
- lro_receive_frags(&ss->rx_done.lro_mgr, rx_frags,
- /* opaque, will come back in get_frag_header */
- len, len,
- (void *)(__force unsigned long)csum, csum);
-
- return 1;
- }
-
- hlen = MYRI10GE_HLEN > len ? len : MYRI10GE_HLEN;
-
- /* allocate an skb to attach the page(s) to. This is done
- * after trying LRO, so as to avoid skb allocation overheads */
-
- skb = netdev_alloc_skb(dev, MYRI10GE_HLEN + 16);
- if (unlikely(skb == NULL)) {
- ss->stats.rx_dropped++;
- do {
- i--;
- put_page(rx_frags[i].page);
- } while (i != 0);
- return 0;
- }
-
- /* Attach the pages to the skb, and trim off any padding */
- myri10ge_rx_skb_build(skb, va, rx_frags, len, hlen);
- if (skb_shinfo(skb)->frags[0].size <= 0) {
- put_page(skb_shinfo(skb)->frags[0].page);
- skb_shinfo(skb)->nr_frags = 0;
- }
- skb->protocol = eth_type_trans(skb, dev);
- skb_record_rx_queue(skb, ss - &mgp->ss[0]);
-
- if (mgp->csum_flag) {
- if ((skb->protocol == htons(ETH_P_IP)) ||
- (skb->protocol == htons(ETH_P_IPV6))) {
- skb->csum = csum;
- skb->ip_summed = CHECKSUM_COMPLETE;
- } else
- myri10ge_vlan_ip_csum(skb, csum);
- }
- netif_receive_skb(skb);
+ rx_frags[0].page_offset += MXGEFW_PAD;
+ rx_frags[0].size -= MXGEFW_PAD;
+ len -= MXGEFW_PAD;
+ skb_shinfo(skb)->nr_frags = i;
+ skb->len = len;
+ skb->data_len = len;
+ skb->truesize += len;
+ /* skb->ip_summed = CHECKSUM_COMPLETE; */
+ skb->ip_summed = CHECKSUM_UNNECESSARY; /* XXXXXX */
+ napi_gro_frags(napi);
return 1;
+
}
static inline void
@@ -1418,7 +1395,8 @@ myri10ge_tx_done(struct myri10ge_slice_s
}
static inline int
-myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
+myri10ge_clean_rx_done(struct myri10ge_slice_state *ss,
+ struct napi_struct *napi, int budget)
{
struct myri10ge_rx_done *rx_done = &ss->rx_done;
struct myri10ge_priv *mgp = ss->mgp;
@@ -1438,10 +1416,12 @@ myri10ge_clean_rx_done(struct myri10ge_s
checksum = csum_unfold(rx_done->entry[idx].checksum);
if (length <= mgp->small_bytes)
rx_ok = myri10ge_rx_done(ss, &ss->rx_small,
+ napi,
mgp->small_bytes,
length, checksum);
else
rx_ok = myri10ge_rx_done(ss, &ss->rx_big,
+ napi,
mgp->big_bytes,
length, checksum);
rx_packets += rx_ok;
@@ -1455,9 +1435,6 @@ myri10ge_clean_rx_done(struct myri10ge_s
ss->stats.rx_packets += rx_packets;
ss->stats.rx_bytes += rx_bytes;
- if (myri10ge_lro)
- lro_flush_all(&rx_done->lro_mgr);
-
/* restock receive rings if needed */
if (ss->rx_small.fill_cnt - ss->rx_small.cnt < myri10ge_fill_thresh)
myri10ge_alloc_rx_pages(mgp, &ss->rx_small,
@@ -1522,7 +1499,7 @@ static int myri10ge_poll(struct napi_str
#endif
/* process as many rx events as NAPI will allow */
- work_done = myri10ge_clean_rx_done(ss, budget);
+ work_done = myri10ge_clean_rx_done(ss, napi, budget);
if (work_done < budget) {
napi_complete(napi);
@@ -1762,9 +1739,7 @@ static const char myri10ge_gstrings_slic
"----------- slice ---------",
"tx_pkt_start", "tx_pkt_done", "tx_req", "tx_done",
"rx_small_cnt", "rx_big_cnt",
- "wake_queue", "stop_queue", "tx_linearized", "LRO aggregated",
- "LRO flushed",
- "LRO avg aggr", "LRO no_desc"
+ "wake_queue", "stop_queue", "tx_linearized"
};
#define MYRI10GE_NET_STATS_LEN 21
@@ -1863,14 +1838,6 @@ myri10ge_get_ethtool_stats(struct net_de
data[i++] = (unsigned int)ss->tx.wake_queue;
data[i++] = (unsigned int)ss->tx.stop_queue;
data[i++] = (unsigned int)ss->tx.linearized;
- data[i++] = ss->rx_done.lro_mgr.stats.aggregated;
- data[i++] = ss->rx_done.lro_mgr.stats.flushed;
- if (ss->rx_done.lro_mgr.stats.flushed)
- data[i++] = ss->rx_done.lro_mgr.stats.aggregated /
- ss->rx_done.lro_mgr.stats.flushed;
- else
- data[i++] = 0;
- data[i++] = ss->rx_done.lro_mgr.stats.no_desc;
}
}
@@ -2198,67 +2165,6 @@ static void myri10ge_free_irq(struct myr
pci_disable_msix(pdev);
}
-static int
-myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
- void **ip_hdr, void **tcpudp_hdr,
- u64 * hdr_flags, void *priv)
-{
- struct ethhdr *eh;
- struct vlan_ethhdr *veh;
- struct iphdr *iph;
- u8 *va = page_address(frag->page) + frag->page_offset;
- unsigned long ll_hlen;
- /* passed opaque through lro_receive_frags() */
- __wsum csum = (__force __wsum) (unsigned long)priv;
-
- /* find the mac header, aborting if not IPv4 */
-
- eh = (struct ethhdr *)va;
- *mac_hdr = eh;
- ll_hlen = ETH_HLEN;
- if (eh->h_proto != htons(ETH_P_IP)) {
- if (eh->h_proto == htons(ETH_P_8021Q)) {
- veh = (struct vlan_ethhdr *)va;
- if (veh->h_vlan_encapsulated_proto != htons(ETH_P_IP))
- return -1;
-
- ll_hlen += VLAN_HLEN;
-
- /*
- * HW checksum starts ETH_HLEN bytes into
- * frame, so we must subtract off the VLAN
- * header's checksum before csum can be used
- */
- csum = csum_sub(csum, csum_partial(va + ETH_HLEN,
- VLAN_HLEN, 0));
- } else {
- return -1;
- }
- }
- *hdr_flags = LRO_IPV4;
-
- iph = (struct iphdr *)(va + ll_hlen);
- *ip_hdr = iph;
- if (iph->protocol != IPPROTO_TCP)
- return -1;
- if (iph->frag_off & htons(IP_MF | IP_OFFSET))
- return -1;
- *hdr_flags |= LRO_TCP;
- *tcpudp_hdr = (u8 *) (*ip_hdr) + (iph->ihl << 2);
-
- /* verify the IP checksum */
- if (unlikely(ip_fast_csum((u8 *) iph, iph->ihl)))
- return -1;
-
- /* verify the checksum */
- if (unlikely(csum_tcpudp_magic(iph->saddr, iph->daddr,
- ntohs(iph->tot_len) - (iph->ihl << 2),
- IPPROTO_TCP, csum)))
- return -1;
-
- return 0;
-}
-
static int myri10ge_get_txrx(struct myri10ge_priv *mgp, int slice)
{
struct myri10ge_cmd cmd;
@@ -2329,7 +2235,6 @@ static int myri10ge_open(struct net_devi
struct myri10ge_cmd cmd;
int i, status, big_pow2, slice;
u8 *itable;
- struct net_lro_mgr *lro_mgr;
if (mgp->running != MYRI10GE_ETH_STOPPED)
return -EBUSY;
@@ -2450,19 +2355,6 @@ static int myri10ge_open(struct net_devi
goto abort_with_rings;
}
- lro_mgr = &ss->rx_done.lro_mgr;
- lro_mgr->dev = dev;
- lro_mgr->features = LRO_F_NAPI;
- lro_mgr->ip_summed = CHECKSUM_COMPLETE;
- lro_mgr->ip_summed_aggr = CHECKSUM_UNNECESSARY;
- lro_mgr->max_desc = MYRI10GE_MAX_LRO_DESCRIPTORS;
- lro_mgr->lro_arr = ss->rx_done.lro_desc;
- lro_mgr->get_frag_header = myri10ge_get_frag_header;
- lro_mgr->max_aggr = myri10ge_lro_max_pkts;
- lro_mgr->frag_align_pad = 2;
- if (lro_mgr->max_aggr > MAX_SKB_FRAGS)
- lro_mgr->max_aggr = MAX_SKB_FRAGS;
-
/* must happen prior to any irq */
napi_enable(&(ss)->napi);
}
@@ -3910,7 +3802,7 @@ static int myri10ge_probe(struct pci_dev
if (dac_enabled)
netdev->features |= NETIF_F_HIGHDMA;
-
+ netdev->features |= NETIF_F_GRO;
/* make sure we can get an irq, and that MSI can be
* setup (if available). Also ensure netdev->irq
* is set to correct value if MSI is enabled */
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-21 19:19 ` Andrew Gallatin
@ 2009-04-22 10:48 ` Herbert Xu
2009-04-22 15:37 ` Andrew Gallatin
2009-04-23 8:00 ` Herbert Xu
1 sibling, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2009-04-22 10:48 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Tue, Apr 21, 2009 at 03:19:14PM -0400, Andrew Gallatin wrote:
>
> So to compare inet_lro and GRO, I'm binding the netserver and device IRQ
> to the same CPU. When I do this, that CPU is saturated and GRO is
> roughly 17% slower than inet_lro. For comparison, here are netperf
> results from a fast peer sending to my weak machine (AMD Athlon(tm) 64
> X2 Dual Core Processor 3800+, 2GHz). First inet_lro:
>
> Recv   Send   Send                          Utilization       Service Demand
> Socket Socket Message  Elapsed              Send     Recv     Send    Recv
> Size   Size   Size     Time     Throughput  local    remote   local   remote
> bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
> 87380  65536  65536    60.02    6631.52     12.45    50.10    0.308   1.238
>
> And now GRO:
> 87380  65536  65536    60.01    5488.99     9.79     50.00    0.292   1.492
I was sure I had tested my case with the IRQs bound (using a cxgb3),
but when I tried it again today GRO was indeed slower (8G vs. 9.4G)!
I fiddled with it all day and couldn't figure out why this was
so. We weren't spending any more time in the GRO code than LRO,
and in fact we were aggregating more packets with GRO (700k segments
instead of 900k segments). GRO was also sending far fewer ACKs
than LRO.
It finally dawned on me that my sender had been upgraded from 2.6.18
to 2.6.30-rc1. Indeed, rebooting into 2.6.18 seems to restore
the balance between GRO and LRO. I wonder if the ACK reduction
has anything to do with this.
Hopefully tomorrow I'll get my hands onto a myricom and try to
replicate your problem.
In the meantime, can you see if there is any disparity in the
number of aggregated segments and ACKs between GRO and LRO?
netstat -s should be sufficient to measure this (TCP segments
received and sent).
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-22 10:48 ` Herbert Xu
@ 2009-04-22 15:37 ` Andrew Gallatin
2009-04-24 5:45 ` Herbert Xu
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-22 15:37 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
Herbert Xu wrote:
>
> In the meantime, can you see if there is any disparity in the
> number of aggregated segments and ACKs between GRO and LRO?
> netstat -s should be sufficient to measure this (TCP segments
> received and sent).
I booted the sender into a kernel.org 2.6.18.2 so as to try to have
results as close to yours as possible (I was running 2.6.22 on the
sender before).
I ran 2 sets of experiments, with different CPU bindings. First
I bound the netserver and IRQ to the same CPU:
LRO:
2301987 segments received
570331 segments send out
Recv   Send   Send                          Utilization       Service Demand
Socket Socket Message  Elapsed              Send     Recv     Send    Recv
Size   Size   Size     Time     Throughput  local    remote   local   remote
bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
87380  65536  65536    60.01    6637.79     10.07    49.99    0.249   1.234
GRO:
2035181 segments received
493042 segments send out
87380  65536  65536    60.01    5768.21     8.60     49.98    0.244   1.420
Then I bound them to different CPUs, so as to get close to line rate:
LRO:
3165013 segments received
1763169 segments send out
87380  65536  65536    60.01    9473.27     15.75    49.58    0.272   0.858
GRO:
3032484 segments received
2265453 segments send out
87380  65536  65536    60.01    9472.69     15.64    48.73    0.270   0.843
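To put the figures above side by side, here is a minimal sketch of the arithmetic. It assumes the netstat numbers are per-run counts taken on the receiver (so "segments send out" is presumably mostly ACKs); the struct and function names are made up for illustration and come from no tool.

```c
#include <stdio.h>

/* One netperf run, using figures quoted in this message. */
struct run {
	const char *label;
	double rx_segments;  /* netstat "segments received" */
	double tx_segments;  /* netstat "segments send out" */
	double mbps;         /* netperf throughput, 10^6 bits/s */
};

/* Segments received per segment sent (a rough data-per-ACK ratio). */
double rx_per_tx(const struct run *r)
{
	return r->rx_segments / r->tx_segments;
}

/* How far b's throughput falls short of a's, in percent. */
double deficit_pct(const struct run *a, const struct run *b)
{
	return 100.0 * (a->mbps - b->mbps) / a->mbps;
}

/* Print both ratios and the throughput gap for one LRO/GRO pair. */
void compare(const struct run *lro, const struct run *gro)
{
	printf("%s: %.3f rx/tx, %s: %.3f rx/tx, GRO deficit %.2f%%\n",
	       lro->label, rx_per_tx(lro), gro->label, rx_per_tx(gro),
	       deficit_pct(lro, gro));
}
```

Fed the same-CPU numbers (2301987/570331 segments at 6637.79 for LRO, 2035181/493042 at 5768.21 for GRO), this gives a GRO throughput deficit of roughly 13% with similar receive-to-send ratios for the two.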
Do you know what is broken with respect to the CPU utilization in recent
kernels? If I bind the IRQ to CPU0, then watch mpstat I see
zero load on that CPU:
% mpstat -P 0 1
Linux 2.6.30-rc1 (venice) 04/22/09
11:25:25  CPU   %user   %nice  %system  %iowait    %irq   %soft   %idle    intr/s
11:25:26    0    0.00    0.00     0.00     0.00    0.00    0.00  100.00  13248.00
11:25:27    0    0.00    0.00     0.00     0.00    0.00    0.00  100.00  13280.00
Common sense tells me that is wrong, and oprofile verifies there is
a lot happening on CPU0. This makes it hard to use netperf's
service demand to compare LRO and GRO.
When I run a cpu-soaker in usermode bound to CPU0, I start to see
irq, softirq, etc:
11:28:02  CPU   %user   %nice  %system  %iowait    %irq   %soft   %idle    intr/s
11:28:03    0   45.10    0.00     0.00     0.00    1.96   52.94    0.00  13019.61
11:28:04    0   46.46    0.00     0.00     0.00    2.02   51.52    0.00  13414.14
If I use this as a poor man's way to measure CPU load on the CPU running
the softirq, then it's clear that GRO is using a bit more CPU than LRO.
The above mpstat output is from LRO, and this is from GRO:
11:29:16    0   39.60    0.00     0.00     0.00    2.97   57.43    0.00  13146.53
11:29:17    0   38.00    0.00     0.00     0.00    2.00   60.00    0.00  13278.00
11:29:18    0   39.00    0.00     0.00     0.00    4.00   57.00    0.00  13273.00
Once we have the checksum issue worked out, either GRO or my driver
will be using even more CPU as it will need to verify the partial
checksums. Remember that my current patch is just setting
CHECKSUM_UNNECESSARY to get around the checksum problem I was seeing.
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-21 19:19 ` Andrew Gallatin
2009-04-22 10:48 ` Herbert Xu
@ 2009-04-23 8:00 ` Herbert Xu
1 sibling, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-23 8:00 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Tue, Apr 21, 2009 at 03:19:14PM -0400, Andrew Gallatin wrote:
>
> - if (mgp->csum_flag) {
> - if ((skb->protocol == htons(ETH_P_IP)) ||
> - (skb->protocol == htons(ETH_P_IPV6))) {
> - skb->csum = csum;
> - skb->ip_summed = CHECKSUM_COMPLETE;
> - } else
> - myri10ge_vlan_ip_csum(skb, csum);
> - }
> - netif_receive_skb(skb);
> + rx_frags[0].page_offset += MXGEFW_PAD;
> + rx_frags[0].size -= MXGEFW_PAD;
> + len -= MXGEFW_PAD;
> + skb_shinfo(skb)->nr_frags = i;
> + skb->len = len;
> + skb->data_len = len;
> + skb->truesize += len;
> + /* skb->ip_summed = CHECKSUM_COMPLETE; */
You need to set skb->csum, just as you did for the non-LRO case.
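For context, CHECKSUM_COMPLETE means the device supplies a ones'-complement sum over the packet data in skb->csum, which is why the removed get_frag_header code had to subtract the VLAN header's contribution before using it. Here is a minimal userspace sketch of that arithmetic. It is not the kernel's csum_partial()/csum_sub(), the function names are hypothetical, and it assumes short buffers and an even starting offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator down to a 16-bit ones'-complement sum. */
uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Ones'-complement sum of len bytes, read as big-endian 16-bit words.
 * A trailing odd byte is padded with zero. Intended for short buffers
 * only: the 32-bit accumulator is not folded inside the loop.
 */
uint16_t oc_sum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return fold16(sum);
}

/* a - b in ones'-complement arithmetic: add the bitwise complement of b. */
uint16_t oc_sub(uint16_t a, uint16_t b)
{
	return fold16((uint32_t)a + (uint16_t)~b);
}
```

Subtracting the sum of a 4-byte prefix from the sum of the whole buffer yields the sum of the remainder, which is the same idea as removing the VLAN header from a hardware checksum that starts ETH_HLEN bytes into the frame.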
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-22 15:37 ` Andrew Gallatin
@ 2009-04-24 5:45 ` Herbert Xu
2009-04-24 12:45 ` Andrew Gallatin
2009-04-24 16:16 ` Andrew Gallatin
0 siblings, 2 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-24 5:45 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Wed, Apr 22, 2009 at 11:37:24AM -0400, Andrew Gallatin wrote:
>
> I booted the sender into a kernel.org 2.6.18.2 so as to try to have
> results as close to yours as possible (I was running 2.6.22 on the
> sender before).
OK I've got my hands on a myricom card. I've tested it using the
same 2.6.18 sender that I used against the earlier cxgb3 test.
I wasn't able to discern any significant deviations between LRO
and GRO.
Unfortunately it seems that this machine is a little too fast
so even with the IRQ bound to a single CPU it's way overspecced
for 10GbE:
       Idle at 10Gb   IRQ rate   soaker IRQ rate   soaker throughput
GRO    43-45          14700      13300             7933
LRO    43-45          14700      13300             7943
But even with the soaker running they seem to be neck and neck.
Here's the patch I used BTW. I got the checksums to work by
just setting skb->csum.
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index f2c4a66..91de78f 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -48,7 +48,6 @@
#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/if_vlan.h>
-#include <linux/inet_lro.h>
#include <linux/dca.h>
#include <linux/ip.h>
#include <linux/inet.h>
@@ -92,7 +91,6 @@ MODULE_LICENSE("Dual BSD/GPL");
#define MYRI10GE_EEPROM_STRINGS_SIZE 256
#define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
-#define MYRI10GE_MAX_LRO_DESCRIPTORS 8
#define MYRI10GE_LRO_MAX_PKTS 64
#define MYRI10GE_NO_CONFIRM_DATA htonl(0xffffffff)
@@ -161,8 +159,6 @@ struct myri10ge_rx_done {
dma_addr_t bus;
int cnt;
int idx;
- struct net_lro_mgr lro_mgr;
- struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
};
struct myri10ge_slice_netstats {
@@ -1264,18 +1260,31 @@ static inline int
myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
int bytes, int len, __wsum csum)
{
+ struct napi_struct *napi = &ss->napi;
struct myri10ge_priv *mgp = ss->mgp;
struct sk_buff *skb;
- struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
- int i, idx, hlen, remainder;
+ struct skb_frag_struct *rx_frags;
+ int i, idx, remainder;
struct pci_dev *pdev = mgp->pdev;
- struct net_device *dev = mgp->dev;
u8 *va;
len += MXGEFW_PAD;
idx = rx->cnt & rx->mask;
va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
prefetch(va);
+
+ skb = napi_get_frags(napi);
+ if ((unlikely(!skb))) {
+ for (i = 0, remainder = len; remainder > 0; i++) {
+ myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
+ put_page(rx->info[idx].page);
+ rx->cnt++;
+ idx = rx->cnt & rx->mask;
+ remainder -= MYRI10GE_ALLOC_SIZE;
+ }
+ }
+ rx_frags = skb_shinfo(skb)->frags;
+
/* Fill skb_frag_struct(s) with data from our receive */
for (i = 0, remainder = len; remainder > 0; i++) {
myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
@@ -1290,52 +1299,18 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
remainder -= MYRI10GE_ALLOC_SIZE;
}
- if (mgp->csum_flag && myri10ge_lro) {
- rx_frags[0].page_offset += MXGEFW_PAD;
- rx_frags[0].size -= MXGEFW_PAD;
- len -= MXGEFW_PAD;
- lro_receive_frags(&ss->rx_done.lro_mgr, rx_frags,
- /* opaque, will come back in get_frag_header */
- len, len,
- (void *)(__force unsigned long)csum, csum);
-
- return 1;
- }
-
- hlen = MYRI10GE_HLEN > len ? len : MYRI10GE_HLEN;
-
- /* allocate an skb to attach the page(s) to. This is done
- * after trying LRO, so as to avoid skb allocation overheads */
-
- skb = netdev_alloc_skb(dev, MYRI10GE_HLEN + 16);
- if (unlikely(skb == NULL)) {
- ss->stats.rx_dropped++;
- do {
- i--;
- put_page(rx_frags[i].page);
- } while (i != 0);
- return 0;
- }
-
- /* Attach the pages to the skb, and trim off any padding */
- myri10ge_rx_skb_build(skb, va, rx_frags, len, hlen);
- if (skb_shinfo(skb)->frags[0].size <= 0) {
- put_page(skb_shinfo(skb)->frags[0].page);
- skb_shinfo(skb)->nr_frags = 0;
- }
- skb->protocol = eth_type_trans(skb, dev);
- skb_record_rx_queue(skb, ss - &mgp->ss[0]);
-
- if (mgp->csum_flag) {
- if ((skb->protocol == htons(ETH_P_IP)) ||
- (skb->protocol == htons(ETH_P_IPV6))) {
- skb->csum = csum;
- skb->ip_summed = CHECKSUM_COMPLETE;
- } else
- myri10ge_vlan_ip_csum(skb, csum);
- }
- netif_receive_skb(skb);
+ rx_frags[0].page_offset += MXGEFW_PAD;
+ rx_frags[0].size -= MXGEFW_PAD;
+ len -= MXGEFW_PAD;
+ skb_shinfo(skb)->nr_frags = i;
+ skb->len = len;
+ skb->data_len = len;
+ skb->truesize += len;
+ skb->csum = csum;
+ skb->ip_summed = CHECKSUM_COMPLETE;
+ napi_gro_frags(napi);
return 1;
+
}
static inline void
@@ -1445,9 +1420,6 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
ss->stats.rx_packets += rx_packets;
ss->stats.rx_bytes += rx_bytes;
- if (myri10ge_lro)
- lro_flush_all(&rx_done->lro_mgr);
-
/* restock receive rings if needed */
if (ss->rx_small.fill_cnt - ss->rx_small.cnt < myri10ge_fill_thresh)
myri10ge_alloc_rx_pages(mgp, &ss->rx_small,
@@ -1752,9 +1724,7 @@ static const char myri10ge_gstrings_slice_stats[][ETH_GSTRING_LEN] = {
"----------- slice ---------",
"tx_pkt_start", "tx_pkt_done", "tx_req", "tx_done",
"rx_small_cnt", "rx_big_cnt",
- "wake_queue", "stop_queue", "tx_linearized", "LRO aggregated",
- "LRO flushed",
- "LRO avg aggr", "LRO no_desc"
+ "wake_queue", "stop_queue", "tx_linearized"
};
#define MYRI10GE_NET_STATS_LEN 21
@@ -1851,14 +1821,6 @@ myri10ge_get_ethtool_stats(struct net_device *netdev,
data[i++] = (unsigned int)ss->tx.wake_queue;
data[i++] = (unsigned int)ss->tx.stop_queue;
data[i++] = (unsigned int)ss->tx.linearized;
- data[i++] = ss->rx_done.lro_mgr.stats.aggregated;
- data[i++] = ss->rx_done.lro_mgr.stats.flushed;
- if (ss->rx_done.lro_mgr.stats.flushed)
- data[i++] = ss->rx_done.lro_mgr.stats.aggregated /
- ss->rx_done.lro_mgr.stats.flushed;
- else
- data[i++] = 0;
- data[i++] = ss->rx_done.lro_mgr.stats.no_desc;
}
}
@@ -2186,67 +2148,6 @@ static void myri10ge_free_irq(struct myri10ge_priv *mgp)
pci_disable_msix(pdev);
}
-static int
-myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
- void **ip_hdr, void **tcpudp_hdr,
- u64 * hdr_flags, void *priv)
-{
- struct ethhdr *eh;
- struct vlan_ethhdr *veh;
- struct iphdr *iph;
- u8 *va = page_address(frag->page) + frag->page_offset;
- unsigned long ll_hlen;
- /* passed opaque through lro_receive_frags() */
- __wsum csum = (__force __wsum) (unsigned long)priv;
-
- /* find the mac header, aborting if not IPv4 */
-
- eh = (struct ethhdr *)va;
- *mac_hdr = eh;
- ll_hlen = ETH_HLEN;
- if (eh->h_proto != htons(ETH_P_IP)) {
- if (eh->h_proto == htons(ETH_P_8021Q)) {
- veh = (struct vlan_ethhdr *)va;
- if (veh->h_vlan_encapsulated_proto != htons(ETH_P_IP))
- return -1;
-
- ll_hlen += VLAN_HLEN;
-
- /*
- * HW checksum starts ETH_HLEN bytes into
- * frame, so we must subtract off the VLAN
- * header's checksum before csum can be used
- */
- csum = csum_sub(csum, csum_partial(va + ETH_HLEN,
- VLAN_HLEN, 0));
- } else {
- return -1;
- }
- }
- *hdr_flags = LRO_IPV4;
-
- iph = (struct iphdr *)(va + ll_hlen);
- *ip_hdr = iph;
- if (iph->protocol != IPPROTO_TCP)
- return -1;
- if (iph->frag_off & htons(IP_MF | IP_OFFSET))
- return -1;
- *hdr_flags |= LRO_TCP;
- *tcpudp_hdr = (u8 *) (*ip_hdr) + (iph->ihl << 2);
-
- /* verify the IP checksum */
- if (unlikely(ip_fast_csum((u8 *) iph, iph->ihl)))
- return -1;
-
- /* verify the checksum */
- if (unlikely(csum_tcpudp_magic(iph->saddr, iph->daddr,
- ntohs(iph->tot_len) - (iph->ihl << 2),
- IPPROTO_TCP, csum)))
- return -1;
-
- return 0;
-}
-
static int myri10ge_get_txrx(struct myri10ge_priv *mgp, int slice)
{
struct myri10ge_cmd cmd;
@@ -2317,7 +2218,6 @@ static int myri10ge_open(struct net_device *dev)
struct myri10ge_cmd cmd;
int i, status, big_pow2, slice;
u8 *itable;
- struct net_lro_mgr *lro_mgr;
if (mgp->running != MYRI10GE_ETH_STOPPED)
return -EBUSY;
@@ -2438,19 +2338,6 @@ static int myri10ge_open(struct net_device *dev)
goto abort_with_rings;
}
- lro_mgr = &ss->rx_done.lro_mgr;
- lro_mgr->dev = dev;
- lro_mgr->features = LRO_F_NAPI;
- lro_mgr->ip_summed = CHECKSUM_COMPLETE;
- lro_mgr->ip_summed_aggr = CHECKSUM_UNNECESSARY;
- lro_mgr->max_desc = MYRI10GE_MAX_LRO_DESCRIPTORS;
- lro_mgr->lro_arr = ss->rx_done.lro_desc;
- lro_mgr->get_frag_header = myri10ge_get_frag_header;
- lro_mgr->max_aggr = myri10ge_lro_max_pkts;
- lro_mgr->frag_align_pad = 2;
- if (lro_mgr->max_aggr > MAX_SKB_FRAGS)
- lro_mgr->max_aggr = MAX_SKB_FRAGS;
-
/* must happen prior to any irq */
napi_enable(&(ss)->napi);
}
@@ -3884,7 +3771,7 @@ static int myri10ge_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
if (dac_enabled)
netdev->features |= NETIF_F_HIGHDMA;
-
+ netdev->features |= NETIF_F_GRO;
/* make sure we can get an irq, and that MSI can be
* setup (if available). Also ensure netdev->irq
* is set to correct value if MSI is enabled */
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 5:45 ` Herbert Xu
@ 2009-04-24 12:45 ` Andrew Gallatin
2009-04-24 12:51 ` Herbert Xu
2009-04-24 17:13 ` Rick Jones
2009-04-24 16:16 ` Andrew Gallatin
1 sibling, 2 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-24 12:45 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
Herbert Xu wrote:
> On Wed, Apr 22, 2009 at 11:37:24AM -0400, Andrew Gallatin wrote:
>> I booted the sender into a kernel.org 2.6.18.2 so as to try to have
>> results as close to yours as possible (I was running 2.6.22 on the
>> sender before).
>
> OK I've got my hands on a myricom card. I've tested it using the
> same 2.6.18 sender that I used against the earlier cxgb3 test.
> I wasn't able to discern any significant deviations between LRO
> and GRO.
>
> Unfortunately it seems that this machine is a little too fast
> so even with the IRQ bound to a single CPU it's way overspecced
> for 10GbE:
>
>        Idle at 10Gb   IRQ rate   soaker IRQ rate   soaker throughput
> GRO    43-45          14700      13300             7933
> LRO    43-45          14700      13300             7943
>
> But even with the soaker running they seem to be neck and neck.
This is strange. I wonder if it might be a cache footprint issue?
My intentionally weak receiver is an athlon64 x2 "Toledo", and
has only 512KB L2 cache. I can re-test with a core-2 based Xeon.
But can you describe your setup in more detail? What CPU does the
receiver have? You say the sender is running 2.6.18. Is this
a RHEL5 kernel, or a kernel.org kernel?
> Here's the patch I used BTW. I got the checksums to work by
> just setting skb->csum.
Yes, sorry about that stupidity.
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 12:45 ` Andrew Gallatin
@ 2009-04-24 12:51 ` Herbert Xu
2009-04-24 17:13 ` Rick Jones
1 sibling, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-24 12:51 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Fri, Apr 24, 2009 at 08:45:28AM -0400, Andrew Gallatin wrote:
>
> This is strange. I wonder if it might be a cache footprint issue?
> My intentionally weak receiver is an athlon64 x2 "Toledo", and
> has only 512KB L2 cache. I can re-test with a core-2 based Xeon.
>
> But can you describe your setup in more detail? What CPU does the
> receiver have? You say the sender is running 2.6.18. Is this
> a RHEL5 kernel, or a kernel.org kernel?
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
stepping : 6
cpu MHz : 2833.155
cache size : 6144 KB
physical id : 1
siblings : 1
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5665.85
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
It's a RHEL-5 kernel, -128 to be exact. But I rebooted into
2.6.30-rc1 and it made no difference. The previous observed
problem with 30-rc1 seems to only affect cxgb3.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 5:45 ` Herbert Xu
2009-04-24 12:45 ` Andrew Gallatin
@ 2009-04-24 16:16 ` Andrew Gallatin
2009-04-24 16:30 ` Herbert Xu
2009-04-27 8:05 ` Herbert Xu
1 sibling, 2 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-24 16:16 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
Herbert Xu wrote:
> On Wed, Apr 22, 2009 at 11:37:24AM -0400, Andrew Gallatin wrote:
>> I booted the sender into a kernel.org 2.6.18.2 so as to try to have
>> results as close to yours as possible (I was running 2.6.22 on the
>> sender before).
>
> OK I've got my hands on a myricom card. I've tested it using the
> same 2.6.18 sender that I used against the earlier cxgb3 test.
> I wasn't able to discern any significant deviations between LRO
> and GRO.
>
> Unfortunately it seems that this machine is a little too fast
> so even with the IRQ bound to a single CPU it's way overspecced
> for 10GbE:
>
>        Idle at 10Gb   IRQ rate   soaker IRQ rate   soaker throughput
> GRO    43-45          14700      13300             7933
> LRO    43-45          14700      13300             7943
>
> But even with the soaker running they seem to be neck and neck.
From what I can tell, CPU utilization is only broken when a CPU is
otherwise idle, so it should be accurate when you bind the IRQ and the
netserver to the same CPU. Here are results from an older, slower
core-2 Xeon with a 4MB L2 cache:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
stepping : 6
cpu MHz : 2659.916
cache size : 4096 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow
bogomips : 5319.83
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
The Xeon was running net-next, had DCA enabled, ioatdma disabled for
TCP (CONFIG_NET_DMA is not set). The sender was the weak athlon64,
running 2.6.22.
LRO, no soaker: (13,200 intrs/sec)
Recv   Send   Send                          Utilization       Service Demand
Socket Socket Message  Elapsed              Send     Recv     Send    Recv
Size   Size   Size     Time     Throughput  local    remote   local   remote
bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
87380  65536  65536    60.02    9469.74     17.44    13.31    0.302   0.461
LRO, soaker: (6,500 intrs/sec)
87380  65536  65536    60.06    3955.74     7.11     25.02    0.294   2.072
GRO, no soaker: (13,200 intrs/sec)
87380  65536  65536    60.02    9467.90     16.76    14.16    0.290   0.490
GRO, soaker: (6,500 intrs/sec)
87380  65536  65536    60.02    3774.88     6.20     25.01    0.269   2.171
These results are indeed quite close, so the performance problem seems
isolated to AMD CPUs, and perhaps due to the smaller caches.
Do you have any AMD you can use as a receiver?
Note that the GRO results were still obtained by (bogusly) setting
CHECKSUM_UNNECESSARY. I tried to use your patch, and I see
terrible performance. Netperf shows between 1Gb/s and 2Gb/s (compared
to 5Gb/s with GRO disabled). I don't see bad checksums in netstat
on the receiver, but it *feels* like something like that.
Here's a diff of netstat -st taken on the sender before and after
a 5 second netperf:
2c2
< 157 active connections openings
---
> 159 active connections openings
7,9c7,9
< 31465934 segments received
< 72887021 segments send out
< 679 segments retransmited
---
> 32184827 segments received
> 73473546 segments send out
> 698 segments retransmited
16c16
< 4596 packets directly queued to recvmsg prequeue.
---
> 4603 packets directly queued to recvmsg prequeue.
18,21c18,21
< 15928 packets header predicted
< 18100148 acknowledgments not containing data received
< 13351873 predicted acknowledgments
< 343 times recovered from packet loss due to SACK data
---
> 15930 packets header predicted
> 18464095 acknowledgments not containing data received
> 13706813 predicted acknowledgments
> 365 times recovered from packet loss due to SACK data
23,25c23,25
< 53 congestion windows fully recovered
< 221 congestion windows partially recovered using Hoe heuristic
< TCPDSACKUndo: 268
---
> 60 congestion windows fully recovered
> 228 congestion windows partially recovered using Hoe heuristic
> TCPDSACKUndo: 281
27,28c27,28
< 584 fast retransmits
< 93 forward retransmits
---
> 597 fast retransmits
> 99 forward retransmits
30c30
< 674 DSACKs received
---
> 693 DSACKs received
And on the receiver (whose netstat is confused, and cannot read ext
stats in a net-next kernel):
diff /tmp/a /tmp/b
3c3
< 12 passive connection openings
---
> 14 passive connection openings
7,8c7,8
< 3776478 segments received
< 3775846 segments send out
---
> 4495385 segments received
> 4494747 segments send out
This was using a net-next pulled 1/2 hour ago. The only patch was your
GRO patch applied to myri10ge. Do you have some other local patch
which might be helping you?
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 16:16 ` Andrew Gallatin
@ 2009-04-24 16:30 ` Herbert Xu
2009-04-24 16:31 ` Herbert Xu
2009-04-27 8:05 ` Herbert Xu
1 sibling, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2009-04-24 16:30 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>
> Note that the GRO results were still obtained by (bogusly) setting
> CHECKSUM_UNNECESSARY. I tried to use your patch, and I see
> terrible performance. Netperf shows between 1Gb/s and 2Gb/s (compared
> to 5Gb/s with GRO disabled). I don't see bad checksums in netstat
> on the receiver, but it *feels* like something like that.
Well if the hardware checksum ends up being wrong then we'll always
fall back to using software checksums. So somehow I doubt it's
causing what you're seeing.
> Here's a diff of netstat -st taken on the sender before and after
> a 5 second netperf:
> 2c2
> < 157 active connections openings
> ---
> > 159 active connections openings
> 7,9c7,9
> < 31465934 segments received
> < 72887021 segments send out
> < 679 segments retransmited
> ---
> > 32184827 segments received
> > 73473546 segments send out
> > 698 segments retransmited
So you're losing packets. This is indeed something that I didn't
see here at all. I'll see if I can get the card moved to an AMD
machine.
> This was using a net-next pulled 1/2 hour ago. The only patch was your
> GRO patch applied to myri10ge. Do you have some other local patch
> which might be helping you?
I was using Linus's tree + the GRO patches from net-next. I do
have two further optimisation patches applied but they don't
actually make much difference (I made them while trying to figure
out why cxgb3's GRO became slow again, which turned out to be sender
related).
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 16:30 ` Herbert Xu
@ 2009-04-24 16:31 ` Herbert Xu
0 siblings, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-24 16:31 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Sat, Apr 25, 2009 at 12:30:09AM +0800, Herbert Xu wrote:
>
> > This was using a net-next pulled 1/2 hour ago. The only patch was your
> > GRO patch applied to myri10ge. Do you have some other local patch
> > which might be helping you?
>
> I was using Linus's tree + the GRO patches from net-next. I do
> have two further optimisation patches applied but they don't
> actually make much difference (I made them while trying to figure
> out why cxgb3's GRO became slow again, which turned out to be sender
> related).
I'll also retest using net-next.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 12:45 ` Andrew Gallatin
2009-04-24 12:51 ` Herbert Xu
@ 2009-04-24 17:13 ` Rick Jones
1 sibling, 0 replies; 41+ messages in thread
From: Rick Jones @ 2009-04-24 17:13 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
> This is strange. I wonder if it might be a cache footprint issue?
> My intentionally weak receiver is an athlon64 x2 "Toledo", and
> has only 512KB L2 cache. I can re-test with a core-2 based Xeon.
A point about netperf :) By default, it will use one more buffer than the
initial size of the socket buffer divided by the send/recv buffer size - this
goes back to days of copy-avoidance, a flavor of which can be found in reading:
ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf
particularly section 3.2. It can be overridden with the global -W option:
-W send,recv Set the number of send,recv buffers
Of course, this will interact with other things such as:
The default send/recv size will be the send/recv socket buffer size. That can be
overridden with the test-specific -m/-M options.
The default socket buffer size will be whatever the system gives it. That can be
overridden with the test-specific -s/-S options.
So, the various options can have a non-trivial effect on the cache footprint of
the data netperf is shoving around.
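As a rough illustration of the default described above, the buffer count and the resulting data footprint can be worked out as follows. This is a sketch of the stated rule only, assuming integer division; the function names are hypothetical and are not taken from the netperf source.

```c
#include <stddef.h>

/* Buffers in the ring: socket buffer size over message size, plus one. */
size_t ring_buffers(size_t sockbuf_bytes, size_t msg_bytes)
{
	return sockbuf_bytes / msg_bytes + 1;
}

/* Total bytes of buffer memory the ring cycles through. */
size_t ring_footprint(size_t sockbuf_bytes, size_t msg_bytes)
{
	return ring_buffers(sockbuf_bytes, msg_bytes) * msg_bytes;
}
```

With the 87380-byte receive socket buffer and 65536-byte messages shown in the results earlier in the thread, this gives 2 buffers and a 131072-byte footprint, before any -W, -m/-M or -s/-S overrides.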
rick jones
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-24 16:16 ` Andrew Gallatin
2009-04-24 16:30 ` Herbert Xu
@ 2009-04-27 8:05 ` Herbert Xu
2009-04-27 8:07 ` Herbert Xu
` (2 more replies)
1 sibling, 3 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-27 8:05 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>
> These results are indeed quite close, so the performance problem seems
> isolated to AMD CPUs, and perhaps due to the smaller caches.
> Do you have any AMD you can use as a receiver?
I now have an AMD with 512K cache to test this. Unfortunately
I'd just locked it up before I got a chance to do any serious
testing. So it might take a while.
> Note that the GRO results were still obtained by (bogusly) setting
> CHECKSUM_UNNECESSARY. I tried to use your patch, and I see
> terrible performance. Netperf shows between 1Gb/s and 2Gb/s (compared
> to 5Gb/s with GRO disabled). I don't see bad checksums in netstat
> on the receiver, but it *feels* like something like that.
OK, I'd created a silly bug with the skb_gro_* optimisation.
Here are two patches to fix it.
gro: Fix handling of headers that extend over the tail
The skb_gro_* code fails to handle the case where a header starts
in the linear area but ends in the frags area. Since the goal
of skb_gro_* is to optimise the case of completely non-linear
packets, we can simply bail out if we have anything in the linear
area.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..0396447 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1145,7 +1145,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb)
static inline void *skb_gro_mac_header(struct sk_buff *skb)
{
- return skb_mac_header(skb) < skb->data ? skb_mac_header(skb) :
+ return skb_headlen(skb) ? skb_mac_header(skb) :
page_address(skb_shinfo(skb)->frags[0].page) +
skb_shinfo(skb)->frags[0].page_offset;
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..ef38e4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2378,18 +2378,13 @@ void *skb_gro_header(struct sk_buff *skb, unsigned int hlen)
unsigned int offset = skb_gro_offset(skb);
hlen += offset;
- if (hlen <= skb_headlen(skb))
- return skb->data + offset;
-
- if (unlikely(!skb_shinfo(skb)->nr_frags ||
- skb_shinfo(skb)->frags[0].size <=
- hlen - skb_headlen(skb) ||
+ if (unlikely(skb_headlen(skb) ||
+ skb_shinfo(skb)->frags[0].size < hlen ||
PageHighMem(skb_shinfo(skb)->frags[0].page)))
return pskb_may_pull(skb, hlen) ? skb->data + offset : NULL;
return page_address(skb_shinfo(skb)->frags[0].page) +
- skb_shinfo(skb)->frags[0].page_offset +
- offset - skb_headlen(skb);
+ skb_shinfo(skb)->frags[0].page_offset + offset;
}
EXPORT_SYMBOL(skb_gro_header);
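The decision the patched skb_gro_header() makes can be sketched in userspace C as below. This is only a toy model under stated assumptions: struct toy_skb and toy_gro_header_fast() are hypothetical names, not kernel APIs, the PageHighMem() test is left out, and the NULL check on the fragment pointer is an addition of the sketch rather than something the patch does. Returning NULL here stands for "fall back to the pskb_may_pull() slow path".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for the few sk_buff fields the patched skb_gro_header()
 * consults. All names are hypothetical; this is not the kernel layout.
 */
struct toy_skb {
    unsigned int headlen;        /* bytes in the linear area (skb_headlen) */
    const unsigned char *frag0;  /* start of the first page fragment, or NULL */
    unsigned int frag0_size;     /* bytes available in that fragment */
    unsigned int gro_offset;     /* current parse offset (skb_gro_offset) */
};

/*
 * Return a pointer to the header at the current GRO offset when the fast
 * path applies, or NULL when the caller would have to take the slow path
 * (pskb_may_pull() in the real code, not modelled here).
 */
static const unsigned char *toy_gro_header_fast(const struct toy_skb *skb,
                                                unsigned int hlen)
{
    assert(skb != NULL);

    hlen += skb->gro_offset;

    /* Anything in the linear area means we bail out, as in the patch. */
    if (skb->headlen != 0)
        return NULL;

    /* The header must fit entirely inside the first fragment. */
    if (skb->frag0 == NULL || skb->frag0_size < hlen)
        return NULL;

    return skb->frag0 + skb->gro_offset;
}
```

With a completely non-linear packet and a large enough first fragment, the function returns a pointer straight into the fragment; a packet with even one byte in the linear area is pushed to the slow path, which is the simplification the commit message describes.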
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-27 8:05 ` Herbert Xu
@ 2009-04-27 8:07 ` Herbert Xu
2009-04-27 9:32 ` David Miller
2009-04-27 12:45 ` David Miller
2009-04-27 12:45 ` David Miller
2009-04-28 6:12 ` Herbert Xu
2 siblings, 2 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-27 8:07 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
>
> OK, I'd created a silly bug with the skb_gro_* optimisation.
> Here are two patches to fix them.
>
> gro: Fix handling of headers that extend over the tail
gro: Fix COMPLETE checksum handling
On a brand new GRO skb, we cannot call ip_hdr since the header
may lie in the non-linear area. This patch adds the helper
skb_gro_network_header to handle this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0396447..287bec9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1150,6 +1150,13 @@ static inline void *skb_gro_mac_header(struct sk_buff *skb)
skb_shinfo(skb)->frags[0].page_offset;
}
+static inline void *skb_gro_network_header(struct sk_buff *skb)
+{
+ return skb_headlen(skb) ? skb_network_header(skb) :
+ page_address(skb_shinfo(skb)->frags[0].page) +
+ skb_shinfo(skb)->frags[0].page_offset + skb_network_offset(skb);
+}
+
static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
unsigned short type,
const void *daddr, const void *saddr,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5d427f8..bda74e8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2343,7 +2343,7 @@ void tcp4_proc_exit(void)
struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
{
- struct iphdr *iph = ip_hdr(skb);
+ struct iphdr *iph = skb_gro_network_header(skb);
switch (skb->ip_summed) {
case CHECKSUM_COMPLETE:
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4b5aa18..d9dd94b 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -943,7 +943,7 @@ static int tcp_v6_gso_send_check(struct sk_buff *skb)
struct sk_buff **tcp6_gro_receive(struct sk_buff **head, struct sk_buff *skb)
{
- struct ipv6hdr *iph = ipv6_hdr(skb);
+ struct ipv6hdr *iph = skb_gro_network_header(skb);
switch (skb->ip_summed) {
case CHECKSUM_COMPLETE:
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-27 8:07 ` Herbert Xu
@ 2009-04-27 9:32 ` David Miller
2009-04-27 11:01 ` Herbert Xu
2009-04-27 12:45 ` David Miller
1 sibling, 1 reply; 41+ messages in thread
From: David Miller @ 2009-04-27 9:32 UTC (permalink / raw)
To: herbert; +Cc: gallatin, brice, sgruszka, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 27 Apr 2009 16:07:35 +0800
> On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
>>
>> OK, I'd created a silly bug with the skb_gro_* optimisation.
>> Here are two patches to fix them.
>>
>> gro: Fix handling of headers that extend over the tail
>
> gro: Fix COMPLETE checksum handling
These look good, want me to toss them into net-next-2.6?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-27 9:32 ` David Miller
@ 2009-04-27 11:01 ` Herbert Xu
0 siblings, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-27 11:01 UTC (permalink / raw)
To: David Miller; +Cc: gallatin, brice, sgruszka, netdev
On Mon, Apr 27, 2009 at 02:32:41AM -0700, David Miller wrote:
>
> These look good, want me to toss them into net-next-2.6?
Yes please. Thanks!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-27 8:05 ` Herbert Xu
2009-04-27 8:07 ` Herbert Xu
@ 2009-04-27 12:45 ` David Miller
2009-04-28 6:12 ` Herbert Xu
2 siblings, 0 replies; 41+ messages in thread
From: David Miller @ 2009-04-27 12:45 UTC (permalink / raw)
To: herbert; +Cc: gallatin, brice, sgruszka, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 27 Apr 2009 16:05:01 +0800
> gro: Fix handling of headers that extend over the tail
>
> The skb_gro_* code fails to handle the case where a header starts
> in the linear area but ends in the frags area. Since the goal
> of skb_gro_* is to optimise the case of completely non-linear
> packets, we can simply bail out if we have anything in the linear
> area.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-27 8:07 ` Herbert Xu
2009-04-27 9:32 ` David Miller
@ 2009-04-27 12:45 ` David Miller
1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2009-04-27 12:45 UTC (permalink / raw)
To: herbert; +Cc: gallatin, brice, sgruszka, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 27 Apr 2009 16:07:35 +0800
> gro: Fix COMPLETE checksum handling
>
> On a brand new GRO skb, we cannot call ip_hdr since the header
> may lie in the non-linear area. This patch adds the helper
> skb_gro_network_header to handle this.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-27 8:05 ` Herbert Xu
2009-04-27 8:07 ` Herbert Xu
2009-04-27 12:45 ` David Miller
@ 2009-04-28 6:12 ` Herbert Xu
2009-04-28 15:00 ` Andrew Gallatin
2 siblings, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2009-04-28 6:12 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
> >
> > These results are indeed quite close, so the performance problem seems
> > isolated to AMD CPUS, and perhaps due to the smaller caches.
> > Do you have any AMD you can use as a receiver?
>
> I now have an AMD with 512K cache to test this. Unfortunately
> I'd just locked it up before I got a chance to do any serious
> testing. So it might take a while.
OK that's been fixed up. Indeed the AMD can't do wire speed.
But still the performance seems comparable. Both of them sit
between 6600Mb/s and 7100Mb/s. The sender is running at about
66% idle in either case.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-28 6:12 ` Herbert Xu
@ 2009-04-28 15:00 ` Andrew Gallatin
2009-04-28 15:02 ` David Miller
2009-04-28 15:20 ` Herbert Xu
0 siblings, 2 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-28 15:00 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
Herbert Xu wrote:
> On Mon, Apr 27, 2009 at 04:05:01PM +0800, Herbert Xu wrote:
>> On Fri, Apr 24, 2009 at 12:16:08PM -0400, Andrew Gallatin wrote:
>>> These results are indeed quite close, so the performance problem seems
>>> isolated to AMD CPUS, and perhaps due to the smaller caches.
>>> Do you have any AMD you can use as a receiver?
>> I now have an AMD with 512K cache to test this. Unfortunately
>> I'd just locked it up before I got a chance to do any serious
>> testing. So it might take a while.
>
> OK that's been fixed up. Indeed the AMD can't do wire speed.
> But still the performance seems comparable. Both of them sit
> between 6600Mb/s and 7100Mb/s. The sender is running at about
> 66% idle in either case.
It's strange, I still consistently see about 1Gb/s better performance
from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
when binding everything to the same CPU. Mpstat -P 0 shows roughly
10% more time spent in "soft" when using GRO vs LRO:
GRO:
10:17:45  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:17:46    0   0.00   0.00    54.00     0.00  0.00  46.00   0.00  11754.00
10:17:47    0   0.00   0.00    54.00     0.00  1.00  45.00   0.00  11718.00
10:17:48    0   0.00   0.00    47.00     0.00  2.00  51.00   0.00  11639.00
LRO:
10:21:55  CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle    intr/s
10:21:56    0   0.00   0.00    66.00     0.00  1.00  33.00   0.00  13228.00
10:21:57    0   0.00   0.00    65.35     0.00  1.98  32.67   0.00  13118.81
10:21:58    0   0.00   0.00    63.00     0.00  1.00  36.00   0.00  13238.00
According to oprofile, the top 20 samples running GRO are:
CPU: AMD64 processors, speed 2050.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name     symbol name
4382     30.5408  vmlinux      vmlinux      copy_user_generic_string
534       3.7218  myri10ge.ko  myri10ge     myri10ge_poll
463       3.2269  vmlinux      vmlinux      _raw_spin_lock
394       2.7460  vmlinux      vmlinux      rb_get_reader_page
382       2.6624  vmlinux      vmlinux      acpi_pm_read
356       2.4812  vmlinux      vmlinux      inet_gro_receive
293       2.0421  oprofiled    oprofiled    (no symbols)
268       1.8679  vmlinux      vmlinux      find_next_bit
268       1.8679  vmlinux      vmlinux      tg_shares_up
257       1.7912  vmlinux      vmlinux      ring_buffer_consume
247       1.7215  myri10ge.ko  myri10ge     myri10ge_alloc_rx_pages
247       1.7215  vmlinux      vmlinux      tcp_gro_receive
228       1.5891  vmlinux      vmlinux      __free_pages_ok
219       1.5263  vmlinux      vmlinux      skb_gro_receive
167       1.1639  vmlinux      vmlinux      skb_gro_header
149       1.0385  bash         bash         (no symbols)
141       0.9827  vmlinux      vmlinux      skb_copy_datagram_iovec
132       0.9200  vmlinux      vmlinux      rb_buffer_peek
129       0.8991  vmlinux      vmlinux      _raw_spin_unlock
123       0.8573  vmlinux      vmlinux      delay_tsc
Nothing really stands out for me. Here is LRO:
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   app name     symbol name
4884     33.1164  vmlinux      vmlinux      copy_user_generic_string
721       4.8888  myri10ge.ko  myri10ge     myri10ge_poll
580       3.9327  vmlinux      vmlinux      _raw_spin_lock
409       2.7733  vmlinux      vmlinux      acpi_pm_read
306       2.0749  vmlinux      vmlinux      rb_get_reader_page
293       1.9867  oprofiled    oprofiled    (no symbols)
286       1.9392  myri10ge.ko  myri10ge     myri10ge_get_frag_header
253       1.7155  vmlinux      vmlinux      __lro_proc_segment
250       1.6951  vmlinux      vmlinux      rb_buffer_peek
247       1.6748  vmlinux      vmlinux      ring_buffer_consume
232       1.5731  vmlinux      vmlinux      __free_pages_ok
211       1.4307  myri10ge.ko  myri10ge     myri10ge_alloc_rx_pages
206       1.3968  vmlinux      vmlinux      tg_shares_up
175       1.1866  vmlinux      vmlinux      skb_copy_datagram_iovec
158       1.0713  vmlinux      vmlinux      find_next_bit
146       0.9900  vmlinux      vmlinux      lro_tcp_ip_check
131       0.8883  oprofile.ko  oprofile     op_cpu_buffer_read_entry
127       0.8611  vmlinux      vmlinux      delay_tsc
125       0.8476  bash         bash         (no symbols)
125       0.8476  vmlinux      vmlinux      _raw_spin_unlock
If I can't figure out why LRO is so much faster in some cases, then I
think maybe I'll just put together a patch which keeps LRO, and does
GRO only if LRO is disabled. Kind of ugly, but better than losing
15% performance on some machines.
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-28 15:00 ` Andrew Gallatin
@ 2009-04-28 15:02 ` David Miller
2009-04-28 15:20 ` Herbert Xu
1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2009-04-28 15:02 UTC (permalink / raw)
To: gallatin; +Cc: herbert, brice, sgruszka, netdev
From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 28 Apr 2009 11:00:16 -0400
> If I can't figure out why LRO is so much faster in some cases, then I
> think maybe I'll just put together a patch which keeps LRO, and does
> GRO only if LRO is disabled. Kind of ugly, but better than losing
> 15% performance on some machines.
I refuse to apply such a patch.
Figure out this performance problem, don't work around it.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-28 15:00 ` Andrew Gallatin
2009-04-28 15:02 ` David Miller
@ 2009-04-28 15:20 ` Herbert Xu
2009-04-28 15:44 ` Andrew Gallatin
2009-04-28 21:12 ` Andrew Gallatin
1 sibling, 2 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-28 15:20 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: David Miller, brice, sgruszka, netdev
On Tue, Apr 28, 2009 at 11:00:16AM -0400, Andrew Gallatin wrote:
>
> It's strange, I still consistently see about 1Gb/s better performance
> from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
> when binding everything to the same CPU. Mpstat -P 0 shows roughly
> 10% more time spent in "soft" when using GRO vs LRO:
Did you check the utilisation of all the cores on the sender?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-28 15:20 ` Herbert Xu
@ 2009-04-28 15:44 ` Andrew Gallatin
2009-04-28 21:12 ` Andrew Gallatin
1 sibling, 0 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-28 15:44 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
Herbert Xu wrote:
> On Tue, Apr 28, 2009 at 11:00:16AM -0400, Andrew Gallatin wrote:
>> It's strange, I still consistently see about 1Gb/s better performance
>> from LRO than GRO on this weak machine (6.5Gb/s LRO, 5.5Gb/s GRO)
>> when binding everything to the same CPU. Mpstat -P 0 shows roughly
>> 10% more time spent in "soft" when using GRO vs LRO:
>
> Did you check the utilisation of all the cores on the sender?
Yes. It is about the same +/- 2%. The utilization when sending
to GRO is a bit lower, but it's going slower.
Here is what might be more interesting.. I'm trying to isolate the
softirq path in oprofile. So in this test, I bound the IRQ to CPU1,
and the netserver to CPU0. In these tests, I see near line rate from
both LRO and GRO. Here is oprofile output separated by CPU, and
sorted on CPU1. (Sorry about binding to CPU1 and making the output
more confusing; I could not get oprofile to emit samples when the irq
was bound to CPU0). I've included the top 20 entries:
GRO:
0    0       1414  15.8485  myri10ge.ko  myri10ge  myri10ge_poll
0    0        932  10.4461  vmlinux      vmlinux   inet_gro_receive
0    0        705   7.9018  vmlinux      vmlinux   tcp_gro_receive
0    0        681   7.6328  vmlinux      vmlinux   skb_gro_receive
0    0        652   7.3078  vmlinux      vmlinux   skb_gro_header
0    0        517   5.7947  vmlinux      vmlinux   __napi_gro_receive
0    0        316   3.5418  vmlinux      vmlinux   dev_gro_receive
0    0        309   3.4633  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
415  3.1243   251   2.8133  vmlinux      vmlinux   _raw_spin_lock
0    0        233   2.6115  vmlinux      vmlinux   napi_frags_skb
0    0        178   1.9951  vmlinux      vmlinux   tcp4_gro_receive
306  2.3037   152   1.7037  vmlinux      vmlinux   rb_get_reader_page
0    0        150   1.6812  vmlinux      vmlinux   napi_get_frags
188  1.4153   131   1.4683  vmlinux      vmlinux   rb_buffer_peek
195  1.4680   101   1.1320  vmlinux      vmlinux   ring_buffer_consume
0    0         96   1.0760  vmlinux      vmlinux   ip_rcv_finish
0    0         94   1.0536  vmlinux      vmlinux   napi_gro_frags
0    0         92   1.0312  vmlinux      vmlinux   skb_copy_bits
0    0         86   0.9639  vmlinux      vmlinux   napi_frags_finish
225  1.6939    85   0.9527  oprofile.ko  oprofile  op_cpu_buffer_read_entry
LRO:
0    0       1937  15.1281  myri10ge.ko  myri10ge  myri10ge_poll
0    0       1876  14.6517  myri10ge.ko  myri10ge  myri10ge_get_frag_header
0    0        943   7.3649  vmlinux      vmlinux   __lro_proc_segment
0    0        723   5.6467  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
0    0        392   3.0615  vmlinux      vmlinux   lro_gen_skb
0    0        369   2.8819  vmlinux      vmlinux   lro_tcp_ip_check
353  2.7435   357   2.7882  vmlinux      vmlinux   _raw_spin_lock
290  2.2538   328   2.5617  vmlinux      vmlinux   rb_get_reader_page
4    0.0311   270   2.1087  vmlinux      vmlinux   csum_partial
26   0.2021   214   1.6714  vmlinux      vmlinux   memset_c
0    0        202   1.5776  vmlinux      vmlinux   lro_add_common
8    0.0622   191   1.4917  vmlinux      vmlinux   __slab_alloc
0    0        188   1.4683  vmlinux      vmlinux   ip_rcv_finish
84   0.6528   183   1.4292  vmlinux      vmlinux   _raw_spin_unlock
0    0        180   1.4058  vmlinux      vmlinux   lro_tcp_data_csum
0    0        180   1.4058  vmlinux      vmlinux   lro_get_desc
167  1.2979   178   1.3902  vmlinux      vmlinux   ring_buffer_consume
0    0        167   1.3043  vmlinux      vmlinux   netif_receive_skb
0    0        143   1.1168  vmlinux      vmlinux   ip_route_input
0    0        125   0.9763  vmlinux      vmlinux   __inet_lookup_established
Does anything strike you as being inordinately expensive for GRO?
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-28 15:20 ` Herbert Xu
2009-04-28 15:44 ` Andrew Gallatin
@ 2009-04-28 21:12 ` Andrew Gallatin
2009-04-29 13:42 ` Andrew Gallatin
1 sibling, 1 reply; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-28 21:12 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
For variety, I grabbed a different "slow" receiver. This is another
2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 37
model name : AMD Opteron(tm) Processor 252
stepping : 1
cpu MHz : 2611.738
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
lm 3dnowext 3dnow rep_good pni lahf_lm
bogomips : 5223.47
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
The sender was an identical machine running an ancient RHEL4 kernel
(2.6.9-42.ELsmp) and our downloadable (backported) driver.
(http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
I disabled LRO on the sender.
Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
LRO and 8.0Gb/s with GRO.
Binding the IRQ to CPU0, and the netserver to CPU0, I see 6.9Gb/s
with LRO and 5.5 Gb/s with GRO. Monitoring the packet/byte counts
on the interface once per second, LRO looks like this:
Ipkts IBytes Opkts Obytes
588992 891733888 9758 644028
589610 892669540 9771 644886
589079 891865606 9754 643764
And GRO looks like this:
480309 727187826 7949 524634
480032 726768448 7947 524502
480000 726720000 7943 524238
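As a sanity check on the counters above, the average frame size and the on-the-wire receive rate can be derived from a single row. The following is a minimal sketch, assuming each row is a one-second delta as described; struct if_sample and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One one-second sample of interface counters (hypothetical struct). */
struct if_sample {
    uint64_t ipkts;   /* packets received in the interval */
    uint64_t ibytes;  /* bytes received in the interval */
    uint64_t opkts;   /* packets sent in the interval */
    uint64_t obytes;  /* bytes sent in the interval */
};

/* Average size of a received frame, in bytes. */
static double avg_rx_frame_bytes(const struct if_sample *s)
{
    assert(s->ipkts > 0);
    return (double)s->ibytes / (double)s->ipkts;
}

/* Average size of a transmitted frame, in bytes. */
static double avg_tx_frame_bytes(const struct if_sample *s)
{
    assert(s->opkts > 0);
    return (double)s->obytes / (double)s->opkts;
}

/* Receive rate in Gb/s, assuming the sample covers exactly one second. */
static double rx_gbits_per_sec(const struct if_sample *s)
{
    return (double)s->ibytes * 8.0 / 1e9;
}
```

Fed the first LRO row (588992, 891733888, 9758, 644028), this works out to 1514 bytes per received frame, 66 bytes per transmitted frame, and a little over 7.1 Gb/s on the wire; the first GRO row gives the same frame sizes at a little over 5.8 Gb/s. So the frames themselves look identical in both cases and only the rate differs.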
Similarly, in this same scenario, binding the app/irq to the same
CPU and running mpstat -P 0 1 shows about 60% sys and 40% irq+softirq
for LRO, while GRO shows about 45% sys and 55% irq+softirq.
I can't put my finger on it, but something about GRO is certainly
more expensive on these types of machines. I wish there was some
way you could see it, since it happens on every older AMD I try
it on. If you haven't been able to reproduce it, I'll see if I
can make it happen on a newer "slow" amd64 box I have tomorrow.
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-28 21:12 ` Andrew Gallatin
@ 2009-04-29 13:42 ` Andrew Gallatin
2009-04-29 13:53 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-29 13:42 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, brice, sgruszka, netdev
Andrew Gallatin wrote:
> For variety, I grabbed a different "slow" receiver. This is another
> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 37
> model name : AMD Opteron(tm) Processor 252
<...>
> The sender was an identical machine running an ancient RHEL4 kernel
> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
> I disabled LRO, on the sender.
>
> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
> LRO and 8.0Gb/s with GRO.
With the recent patch to fix idle CPU time accounting from LKML applied,
it is again possible to trust netperf's service demand (based on %CPU).
So here is raw netperf output for LRO and GRO, bound as above.
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
Recv   Send   Send                         Utilization     Service Demand
Socket Socket Message Elapsed              Send    Recv    Send    Recv
Size   Size   Size    Time    Throughput   local   remote  local   remote
bytes  bytes  bytes   secs.   10^6bits/s   % S     % S     us/KB   us/KB
LRO:
87380  65536  65536   60.00   8279.36      8.10    77.55   0.160   1.535
GRO:
87380  65536  65536   60.00   8053.19      7.86    85.47   0.160   1.739
The difference is bigger if you disable TCP timestamps (and thus shrink
the packet headers down so they require fewer cachelines):
LRO:
87380  65536  65536   60.02   7753.55      8.01    74.06   0.169   1.565
GRO:
87380  65536  65536   60.02   7535.12      7.27    84.57   0.158   1.839
As you can see, even though the raw bandwidth is very close, the
service demand makes it clear that GRO is more expensive
than LRO. I just wish I understood why.
Drew
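The service demand column can be reproduced from the utilization and throughput columns. The sketch below is an assumption about how the figure is derived, not netperf's actual source: it takes the utilization as an average over all CPUs (two on these machines) and 1 KB as 1024 bytes, which is what makes the arithmetic match the printed values. The function name is hypothetical.

```c
#include <assert.h>

/*
 * CPU microseconds spent per KB transferred.
 *
 * cpu_util_pct:    average utilization across all CPUs, in percent
 * ncpus:           number of CPUs the percentage is averaged over
 * throughput_mbit: throughput in 10^6 bits per second
 *
 * Assumes 1 KB = 1024 bytes.
 */
static double service_demand_us_per_kb(double cpu_util_pct,
                                       unsigned int ncpus,
                                       double throughput_mbit)
{
    assert(throughput_mbit > 0.0);

    /* CPU time consumed per second of wall clock, in microseconds. */
    double cpu_us_per_sec = cpu_util_pct / 100.0 * (double)ncpus * 1e6;

    /* Data moved per second, in KB. */
    double kb_per_sec = throughput_mbit * 1e6 / 8.0 / 1024.0;

    return cpu_us_per_sec / kb_per_sec;
}
```

With the receiver figures from the timestamps-enabled run, service_demand_us_per_kb(77.55, 2, 8279.36) lands on roughly 1.535 us/KB for LRO and service_demand_us_per_kb(85.47, 2, 8053.19) on roughly 1.739 us/KB for GRO, i.e. GRO costs about 13% more receiver CPU per KB even though the throughput differs by under 3%.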
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-29 13:42 ` Andrew Gallatin
@ 2009-04-29 13:53 ` Eric Dumazet
2009-04-29 14:18 ` Andrew Gallatin
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-04-29 13:53 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
Andrew Gallatin a écrit :
> Andrew Gallatin wrote:
>> For variety, I grabbed a different "slow" receiver. This is another
>> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
>>
>> processor : 0
>> vendor_id : AuthenticAMD
>> cpu family : 15
>> model : 37
>> model name : AMD Opteron(tm) Processor 252
> <...>
>> The sender was an identical machine running an ancient RHEL4 kernel
>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
>> I disabled LRO, on the sender.
>>
>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
>> LRO and 8.0Gb/s with GRO.
>
> With the recent patch to fix idle CPU time accounting from LKML applied,
> it is again possible to trust netperf's service demand (based on %CPU).
> So here is raw netperf output for LRO and GRO, bound as above.
>
> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
> Recv   Send   Send                         Utilization     Service Demand
> Socket Socket Message Elapsed              Send    Recv    Send    Recv
> Size   Size   Size    Time    Throughput   local   remote  local   remote
> bytes  bytes  bytes   secs.   10^6bits/s   % S     % S     us/KB   us/KB
>
> LRO:
> 87380  65536  65536   60.00   8279.36      8.10    77.55   0.160   1.535
> GRO:
> 87380  65536  65536   60.00   8053.19      7.86    85.47   0.160   1.739
>
> The difference is bigger if you disable TCP timestamps (and thus shrink
> the packet headers down so they require fewer cachelines):
> LRO:
> 87380  65536  65536   60.02   7753.55      8.01    74.06   0.169   1.565
> GRO:
> 87380  65536  65536   60.02   7535.12      7.27    84.57   0.158   1.839
>
>
> As you can see, even though the raw bandwidth is very close, the
> service demand makes it clear that GRO is more expensive
> than LRO. I just wish I understood why.
>
What are the "vmstat 1" outputs on both tests? Any difference in, say, context switches?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-29 13:53 ` Eric Dumazet
@ 2009-04-29 14:18 ` Andrew Gallatin
2009-04-29 15:26 ` Eric Dumazet
0 siblings, 1 reply; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-29 14:18 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
Eric Dumazet wrote:
> Andrew Gallatin a écrit :
>> Andrew Gallatin wrote:
>>> For variety, I grabbed a different "slow" receiver. This is another
>>> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
>>>
>>> processor : 0
>>> vendor_id : AuthenticAMD
>>> cpu family : 15
>>> model : 37
>>> model name : AMD Opteron(tm) Processor 252
>> <...>
>>> The sender was an identical machine running an ancient RHEL4 kernel
>>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
>>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
>>> I disabled LRO, on the sender.
>>>
>>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
>>> LRO and 8.0Gb/s with GRO.
>> With the recent patch to fix idle CPU time accounting from LKML applied,
>> it is again possible to trust netperf's service demand (based on %CPU).
>> So here is raw netperf output for LRO and GRO, bound as above.
>>
>> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
>> Recv   Send   Send                         Utilization     Service Demand
>> Socket Socket Message Elapsed              Send    Recv    Send    Recv
>> Size   Size   Size    Time    Throughput   local   remote  local   remote
>> bytes  bytes  bytes   secs.   10^6bits/s   % S     % S     us/KB   us/KB
>>
>> LRO:
>> 87380  65536  65536   60.00   8279.36      8.10    77.55   0.160   1.535
>> GRO:
>> 87380  65536  65536   60.00   8053.19      7.86    85.47   0.160   1.739
>>
>> The difference is bigger if you disable TCP timestamps (and thus shrink
>> the packet headers down so they require fewer cachelines):
>> LRO:
>> 87380  65536  65536   60.02   7753.55      8.01    74.06   0.169   1.565
>> GRO:
>> 87380  65536  65536   60.02   7535.12      7.27    84.57   0.158   1.839
>>
>>
>> As you can see, even though the raw bandwidth is very close, the
>> service demand makes it clear that GRO is more expensive
>> than LRO. I just wish I understood why.
>>
>
> What are the "vmstat 1" outputs on both tests? Any difference in, say,
> context switches?
Not much difference is apparent from vmstat, except for a
lower load and slightly higher IRQ rate from LRO:
LRO:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
 1  0      0 676960  19280 209812    0    0     0     0 14817   24  0 73 27  0  0
 1  0      0 677084  19280 209812    0    0     0     0 14834   20  0 73 27  0  0
 1  0      0 676916  19280 209812    0    0     0     0 14833   16  0 74 26  0  0
GRO:
 r  b   swpd   free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
 1  0      0 678244  18008 209784    0    0     0    24 14288   32  0 84 16  0  0
 1  0      0 678268  18008 209788    0    0     0     0 14403   22  0 85 15  0  0
 1  0      0 677956  18008 209788    0    0     0     0 14331   20  0 84 16  0  0
The real difference is visible mainly from mpstat on the CPU handling the
interrupts, where you see softirq is much higher:
LRO:
07:15:16  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
07:15:17    0   0.00   0.00   0.00     0.00  0.00  45.00    0.00  55.00  12907.92
07:15:18    0   0.00   0.00   1.00     0.00  2.00  43.00    0.00  54.00  12707.92
07:15:19    0   0.00   0.00   1.00     0.00  0.00  46.00    0.00  53.00  12825.00
GRO:
07:11:59  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal  %idle    intr/s
07:12:00    0   0.00   0.00   0.00     0.00  0.99  66.34    0.00  32.67  12242.57
07:12:01    0   0.00   0.00   0.00     0.00  1.01  66.67    0.00  32.32  12220.00
07:12:02    0   0.00   0.00   0.99     0.00  0.99  65.35    0.00  32.67  12336.00
So it is like "something" GRO is doing in the softirq context is more
expensive than what LRO is doing.
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-29 14:18 ` Andrew Gallatin
@ 2009-04-29 15:26 ` Eric Dumazet
2009-04-29 17:28 ` Andrew Gallatin
0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-04-29 15:26 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
Andrew Gallatin a écrit :
> Eric Dumazet wrote:
>> Andrew Gallatin a écrit :
>>> Andrew Gallatin wrote:
>>>> For variety, I grabbed a different "slow" receiver. This is another
>>>> 2 CPU machine, but a dual-socket single-core opteron (Tyan S2895)
>>>>
>>>> processor : 0
>>>> vendor_id : AuthenticAMD
>>>> cpu family : 15
>>>> model : 37
>>>> model name : AMD Opteron(tm) Processor 252
>>> <...>
>>>> The sender was an identical machine running an ancient RHEL4 kernel
>>>> (2.6.9-42.ELsmp) and our downloadable (backported) driver.
>>>> (http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.4.4.tgz)
>>>> I disabled LRO, on the sender.
>>>>
>>>> Binding the IRQ to CPU0, and the netserver to CPU1 I see 8.1Gb/s with
>>>> LRO and 8.0Gb/s with GRO.
>>> With the recent patch to fix idle CPU time accounting from LKML applied,
>>> it is again possible to trust netperf's service demand (based on %CPU).
>>> So here is raw netperf output for LRO and GRO, bound as above.
>>>
>>> TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>>> hail1-m.sw.myri.com (10.0.130.167) port 0 AF_INET : cpu bind
>>> Recv   Send   Send                         Utilization     Service Demand
>>> Socket Socket Message Elapsed              Send    Recv    Send    Recv
>>> Size   Size   Size    Time    Throughput   local   remote  local   remote
>>> bytes  bytes  bytes   secs.   10^6bits/s   % S     % S     us/KB   us/KB
>>>
>>> LRO:
>>> 87380  65536  65536   60.00   8279.36      8.10    77.55   0.160   1.535
>>> GRO:
>>> 87380  65536  65536   60.00   8053.19      7.86    85.47   0.160   1.739
>>>
>>> The difference is bigger if you disable TCP timestamps (and thus shrink
>>> the packet headers down so they require fewer cachelines):
>>> LRO:
>>> 87380  65536  65536   60.02   7753.55      8.01    74.06   0.169   1.565
>>> GRO:
>>> 87380  65536  65536   60.02   7535.12      7.27    84.57   0.158   1.839
>>>
>>>
>>> As you can see, even though the raw bandwidth is very close, the
>>> service demand makes it clear that GRO is more expensive
>>> than LRO. I just wish I understood why.
>>>
>>
>> What are the "vmstat 1" outputs on both tests? Any difference in, say,
>> context switches?
>
> Not much difference is apparent from vmstat, except for a
> lower load and slightly higher IRQ rate from LRO:
>
> LRO:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
>  1  0      0 676960  19280 209812    0    0     0     0 14817   24  0 73 27  0  0
>  1  0      0 677084  19280 209812    0    0     0     0 14834   20  0 73 27  0  0
>  1  0      0 676916  19280 209812    0    0     0     0 14833   16  0 74 26  0  0
>
> GRO:
>  r  b   swpd   free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
>  1  0      0 678244  18008 209784    0    0     0    24 14288   32  0 84 16  0  0
>  1  0      0 678268  18008 209788    0    0     0     0 14403   22  0 85 15  0  0
>  1  0      0 677956  18008 209788    0    0     0     0 14331   20  0 84 16  0  0
>
>
>
>
> The real difference is visible mainly from mpstat on the CPU handling the
> interrupts, where softirq time is much higher:
>
> LRO:
> 07:15:16  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle  intr/s
> 07:15:17    0   0.00   0.00  0.00     0.00  0.00  45.00    0.00  55.00  12907.92
> 07:15:18    0   0.00   0.00  1.00     0.00  2.00  43.00    0.00  54.00  12707.92
> 07:15:19    0   0.00   0.00  1.00     0.00  0.00  46.00    0.00  53.00  12825.00
>
>
> GRO:
> 07:11:59  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle  intr/s
> 07:12:00    0   0.00   0.00  0.00     0.00  0.99  66.34    0.00  32.67  12242.57
> 07:12:01    0   0.00   0.00  0.00     0.00  1.01  66.67    0.00  32.32  12220.00
> 07:12:02    0   0.00   0.00  0.99     0.00  0.99  65.35    0.00  32.67  12336.00
>
>
> So it is like "something" GRO is doing in the softirq context is more
> expensive than what LRO is doing.
Sure, probably more cache misses or something...
You could try a longer oprofile session (with at least one million samples)
and :
opannotate -a vmlinux >/tmp/FILE
And select 3 or 4 suspect functions : inet_gro_receive() tcp_gro_receive(),
skb_gro_receive(), skb_gro_header()
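
As a cross-check on the netperf figures quoted earlier in this message, the
receive-side service demand can apparently be reproduced from the remote CPU
utilization and the throughput. The sketch below is only that: the function
name is hypothetical, the 1024-byte KB is an assumption, and num_cpus = 2 is
inferred from the numbers (it is what makes them agree), not stated anywhere
in the thread.

```c
#include <assert.h>

/* Sketch only: CPU microseconds spent per KB transferred.
 * Assumes utilization is averaged over num_cpus CPUs and KB = 1024 bytes. */
double service_demand_us_per_kb(double cpu_util_percent, int num_cpus,
                                double throughput_mbit_per_s)
{
    assert(num_cpus > 0 && throughput_mbit_per_s > 0.0);

    /* CPU microseconds consumed per wall-clock second, across all CPUs. */
    double cpu_us_per_s = (cpu_util_percent / 100.0) * num_cpus * 1e6;

    /* Payload moved per second, in units of 1024 bytes. */
    double kb_per_s = throughput_mbit_per_s * 1e6 / 8.0 / 1024.0;

    return cpu_us_per_s / kb_per_s;
}
```

With the LRO row (77.55 %, 8279.36 Mbit/s) and the GRO row (85.47 %,
8053.19 Mbit/s) this lands within rounding of the reported 1.535 and
1.739 us/KB, which is why service demand rather than raw throughput is the
figure to compare.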
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-29 15:26 ` Eric Dumazet
@ 2009-04-29 17:28 ` Andrew Gallatin
2009-04-30 8:10 ` Herbert Xu
2009-04-30 8:17 ` Eric Dumazet
0 siblings, 2 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-29 17:28 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
Eric Dumazet wrote:
>
> Sure, probably more cache misses or something...
Yes, that's what I thought. The code is much more complete,
and spread out than LRO, and seems to open itself to cache
misses.
> You could try a longer oprofile session (with at least one million samples)
> and :
>
> opannotate -a vmlinux >/tmp/FILE
>
> And select 3 or 4 suspect functions : inet_gro_receive() tcp_gro_receive(),
> skb_gro_receive(), skb_gro_header()
Here is the opreport -l output from this machine for GRO for a 25 minute
profiling run:
samples  %        image name   app name  symbol name
3742674  32.2793  vmlinux      vmlinux   copy_user_generic_string
890179    7.6775  myri10ge.ko  myri10ge  myri10ge_poll
547572    4.7226  vmlinux      vmlinux   inet_gro_receive
477479    4.1181  vmlinux      vmlinux   skb_gro_receive
406562    3.5065  vmlinux      vmlinux   free_hot_cold_page
396796    3.4222  vmlinux      vmlinux   tcp_gro_receive
332364    2.8665  vmlinux      vmlinux   __rmqueue_smallest
319455    2.7552  vmlinux      vmlinux   skb_gro_header
269040    2.3204  vmlinux      vmlinux   dev_gro_receive
252885    2.1810  vmlinux      vmlinux   free_pages_bulk
247832    2.1375  vmlinux      vmlinux   get_pageblock_flags_group
211592    1.8249  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
208867    1.8014  vmlinux      vmlinux   __list_add
201491    1.7378  vmlinux      vmlinux   tcp4_gro_receive
187591    1.6179  vmlinux      vmlinux   __napi_gro_receive
170156    1.4675  vmlinux      vmlinux   get_page_from_freelist
116321    1.0032  vmlinux      vmlinux   list_del
107994    0.9314  vmlinux      vmlinux   kfree
106434    0.9180  vmlinux      vmlinux   skb_copy_datagram_iovec
100675    0.8683  vmlinux      vmlinux   put_page
And here is the opannotate -a output for a few GRO functions. BTW, did you
mean -s rather than -a? I'd naively think source might be more helpful. But
here is what you asked for:
ffffffff80479f20 <inet_gro_receive>: /* inet_gro_receive total: 547572 5.2554 */
12187 0.1170 :ffffffff80479f20: push %r13
2611 0.0251 :ffffffff80479f22: mov %rdi,%r13
:ffffffff80479f25: push %r12
:ffffffff80479f27: push %rbp
4031 0.0387 :ffffffff80479f28: push %rbx
:ffffffff80479f29: mov %rsi,%rbx
:ffffffff80479f2c: mov $0x14,%esi
6303 0.0605 :ffffffff80479f31: mov %rbx,%rdi
:ffffffff80479f34: sub $0x8,%rsp
:ffffffff80479f38: callq ffffffff804357a1
<skb_gro_header>
:ffffffff80479f3d: test %rax,%rax
2494 0.0239 :ffffffff80479f40: mov %rax,%r8
:ffffffff80479f43: je ffffffff8047a0a4
<inet_gro_receive+0x184>
:ffffffff80479f49: movzbl 0x9(%rax),%eax
2541 0.0244 :ffffffff80479f4d: mov
0xffffffff80d06280(,%rax,8),%r11
33 3.2e-04 :ffffffff80479f55: test %r11,%r11
5 4.8e-05 :ffffffff80479f58: je ffffffff8047a0a4
<inet_gro_receive+0x184>
11016 0.1057 :ffffffff80479f5e: cmpq $0x0,0x20(%r11)
292 0.0028 :ffffffff80479f63: je ffffffff8047a0a4
<inet_gro_receive+0x184>
1 9.6e-06 :ffffffff80479f69: cmpb $0x45,(%r8)
4297 0.0412 :ffffffff80479f6d: jne ffffffff8047a0a4
<inet_gro_receive+0x184>
6086 0.0584 :ffffffff80479f73: mov $0x5,%eax
:ffffffff80479f78: mov %r8,%rcx
18706 0.1795 :ffffffff80479f7b: mov (%rcx),%edx
341 0.0033 :ffffffff80479f7d: sub $0x4,%eax
:ffffffff80479f80: jbe ffffffff80479fa6
<inet_gro_receive+0x86>
4609 0.0442 :ffffffff80479f82: add 0x4(%rcx),%edx
398 0.0038 :ffffffff80479f85: adc 0x8(%rcx),%edx
:ffffffff80479f88: adc 0xc(%rcx),%edx
4310 0.0414 :ffffffff80479f8b: adc 0x10(%rcx),%edx
790 0.0076 :ffffffff80479f8e: lea 0x4(%rcx),%rcx
:ffffffff80479f92: dec %eax
9097 0.0873 :ffffffff80479f94: jne ffffffff80479f8b
<inet_gro_receive+0x6b>
541 0.0052 :ffffffff80479f96: adc $0x0,%edx
:ffffffff80479f99: mov %edx,%eax
1919 0.0184 :ffffffff80479f9b: shr $0x10,%edx
535 0.0051 :ffffffff80479f9e: add %ax,%dx
:ffffffff80479fa1: adc $0x0,%edx
3633 0.0349 :ffffffff80479fa4: not %edx
683 0.0066 :ffffffff80479fa6: test %dx,%dx
1 9.6e-06 :ffffffff80479fa9: jne ffffffff8047a0a4
<inet_gro_receive+0x184>
4725 0.0453 :ffffffff80479faf: movzwl 0x2(%r8),%eax
9728 0.0934 :ffffffff80479fb4: mov 0x68(%rbx),%edx
8 7.7e-05 :ffffffff80479fb7: mov $0x1,%ebp
43000 0.4127 :ffffffff80479fbc: sub 0x38(%rbx),%edx
11149 0.1070 :ffffffff80479fbf: mov %eax,%ecx
:ffffffff80479fc1: shl $0x8,%eax
66497 0.6382 :ffffffff80479fc4: shr $0x8,%ecx
735 0.0071 :ffffffff80479fc7: or %ecx,%eax
:ffffffff80479fc9: movzwl %ax,%eax
5459 0.0524 :ffffffff80479fcc: cmp %edx,%eax
522 0.0050 :ffffffff80479fce: jne ffffffff80479fdc
<inet_gro_receive+0xbc>
:ffffffff80479fd0: xor %ebp,%ebp
5373 0.0516 :ffffffff80479fd2: cmpw $0x40,0x6(%r8)
345 0.0033 :ffffffff80479fd8: setne %bpl
:ffffffff80479fdc: movzwl 0x4(%r8),%eax
2384 0.0229 :ffffffff80479fe1: mov 0x0(%r13),%r10
631 0.0061 :ffffffff80479fe5: mov %eax,%edx
:ffffffff80479fe7: shl $0x8,%eax
3044 0.0292 :ffffffff80479fea: shr $0x8,%edx
303 0.0029 :ffffffff80479fed: or %edx,%eax
:ffffffff80479fef: movzwl %ax,%r12d
2747 0.0264 :ffffffff80479ff3: jmp ffffffff8047a071
<inet_gro_receive+0x151>
2109 0.0202 :ffffffff80479ff5: lea 0x38(%r10),%r9
12 1.2e-04 :ffffffff80479ff9: cmpl $0x0,0x4(%r9)
23 2.2e-04 :ffffffff80479ffe: je ffffffff8047a06e
<inet_gro_receive+0x14e>
2104 0.0202 :ffffffff8047a000: mov 0xac(%r10),%edi
2 1.9e-05 :ffffffff8047a007: add 0xc0(%r10),%rdi
:ffffffff8047a00e: mov 0x9(%rdi),%sil
2391 0.0229 :ffffffff8047a012: mov 0x1(%rdi),%al
2 1.9e-05 :ffffffff8047a015: xor 0x9(%r8),%sil
7 6.7e-05 :ffffffff8047a019: xor 0x1(%r8),%al
2101 0.0202 :ffffffff8047a01d: mov 0xc(%rdi),%edx
1 9.6e-06 :ffffffff8047a020: mov 0x10(%rdi),%ecx
:ffffffff8047a023: xor 0xc(%r8),%edx
2775 0.0266 :ffffffff8047a027: xor 0x10(%r8),%ecx
:ffffffff8047a02b: or %esi,%eax
:ffffffff8047a02d: movzbl %al,%eax
62734 0.6021 :ffffffff8047a030: or %edx,%ecx
:ffffffff8047a032: or %eax,%ecx
:ffffffff8047a034: je ffffffff8047a040
<inet_gro_receive+0x120>
:ffffffff8047a036: movl $0x0,0x4(%r9)
:ffffffff8047a03e: jmp ffffffff8047a06e
<inet_gro_receive+0x14e>
2106 0.0202 :ffffffff8047a040: movzwl 0x4(%rdi),%edx
:ffffffff8047a044: mov 0x8(%rdi),%al
:ffffffff8047a047: xor 0x8(%r8),%eax
64244 0.6166 :ffffffff8047a04b: mov %edx,%ecx
:ffffffff8047a04d: shl $0x8,%edx
:ffffffff8047a050: shr $0x8,%ecx
2072 0.0199 :ffffffff8047a053: movzbl %al,%eax
:ffffffff8047a056: or 0x8(%r9),%eax
:ffffffff8047a05a: or %ecx,%edx
2629 0.0252 :ffffffff8047a05c: add 0xc(%r9),%edx
2 1.9e-05 :ffffffff8047a060: movzwl %dx,%edx
:ffffffff8047a063: xor %r12d,%edx
58223 0.5588 :ffffffff8047a066: or %edx,%eax
3 2.9e-05 :ffffffff8047a068: or %ebp,%eax
:ffffffff8047a06a: mov %eax,0x8(%r9)
21878 0.2100 :ffffffff8047a06e: mov (%r10),%r10
2156 0.0207 :ffffffff8047a071: test %r10,%r10
:ffffffff8047a074: jne ffffffff80479ff5
<inet_gro_receive+0xd5>
3007 0.0289 :ffffffff8047a07a: mov 0x38(%rbx),%eax
61 5.9e-04 :ffffffff8047a07d: or %ebp,0x40(%rbx)
3 2.9e-05 :ffffffff8047a080: mov %rbx,%rsi
3091 0.0297 :ffffffff8047a083: mov %r13,%rdi
41 3.9e-04 :ffffffff8047a086: add $0x14,%eax
:ffffffff8047a089: mov %eax,0x38(%rbx)
3704 0.0355 :ffffffff8047a08c: sub 0xc0(%rbx),%eax
33 3.2e-04 :ffffffff8047a092: add 0xc8(%rbx),%eax
:ffffffff8047a098: mov %eax,0xa8(%rbx)
2468 0.0237 :ffffffff8047a09e: callq *0x20(%r11)
20011 0.1921 :ffffffff8047a0a2: jmp ffffffff8047a0ab
<inet_gro_receive+0x18b>
:ffffffff8047a0a4: xor %eax,%eax
:ffffffff8047a0a6: mov $0x1,%ebp
24082 0.2311 :ffffffff8047a0ab: or %ebp,0x40(%rbx)
626 0.0060 :ffffffff8047a0ae: pop %r10
1718 0.0165 :ffffffff8047a0b0: pop %rbx
446 0.0043 :ffffffff8047a0b1: pop %rbp
4074 0.0391 :ffffffff8047a0b2: pop %r12
2089 0.0200 :ffffffff8047a0b4: pop %r13
434 0.0042 :ffffffff8047a0b6: retq
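
For readers following the listing above: the cmpb $0x45 test followed by the
add/adc run and the final "not %edx; test %dx,%dx" looks like a check for a
plain 20-byte IPv4 header followed by an inlined header checksum
verification. A minimal portable sketch of that kind of check is below. It
is a byte-wise one's-complement sum, not the kernel's optimized routine, and
the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: one's-complement checksum over an IPv4 header of `ihl`
 * 32-bit words. Returns 0 when the header, including its own checksum
 * field, is intact. */
uint16_t ipv4_header_csum(const uint8_t *hdr, unsigned int ihl)
{
    uint32_t sum = 0;

    assert(hdr != NULL && ihl >= 5);

    /* Add the header as a sequence of big-endian 16-bit words. */
    for (size_t i = 0; i < (size_t)ihl * 4; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A nonzero result corresponds to the "jne" that bails out of the function in
the listing, so a packet with a damaged header is never considered for
merging.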
ffffffff80430ea9 <skb_gro_receive>: /* skb_gro_receive total: 477479 4.5827 */
2158 0.0207 :ffffffff80430ea9: push %r15
2492 0.0239 :ffffffff80430eab: mov %rdi,%r15
:ffffffff80430eae: push %r14
:ffffffff80430eb0: push %r13
2432 0.0233 :ffffffff80430eb2: push %r12
1 9.6e-06 :ffffffff80430eb4: push %rbp
1 9.6e-06 :ffffffff80430eb5: mov %rsi,%rbp
2430 0.0233 :ffffffff80430eb8: push %rbx
:ffffffff80430eb9: sub $0x8,%rsp
:ffffffff80430ebd: mov 0x68(%rsi),%ecx
2420 0.0232 :ffffffff80430ec0: mov (%rdi),%r12
1 9.6e-06 :ffffffff80430ec3: mov %ecx,%r14d
1 9.6e-06 :ffffffff80430ec6: sub 0x38(%rsi),%r14d
2317 0.0222 :ffffffff80430eca: mov %r14d,%eax
1 9.6e-06 :ffffffff80430ecd: add 0x68(%r12),%eax
1 9.6e-06 :ffffffff80430ed2: cmp $0xffff,%eax
3865 0.0371 :ffffffff80430ed7: ja ffffffff80431261
<skb_gro_receive+0x3b8>
:ffffffff80430edd: mov 0xb8(%r12),%eax
:ffffffff80430ee5: mov 0xc0(%r12),%rdx
8082 0.0776 :ffffffff80430eed: lea (%rdx,%rax,1),%rsi
:ffffffff80430ef1: cmpq $0x0,0x18(%rsi)
2 1.9e-05 :ffffffff80430ef6: jne ffffffff804311ab
<skb_gro_receive+0x302>
9249 0.0888 :ffffffff80430efc: mov %ecx,%edi
:ffffffff80430efe: sub 0x6c(%rbp),%edi
6 5.8e-05 :ffffffff80430f01: cmp 0x38(%rbp),%edi
3104 0.0298 :ffffffff80430f04: ja ffffffff80430fe2
<skb_gro_receive+0x139>
2 1.9e-05 :ffffffff80430f0a: mov 0xb8(%rbp),%ecx
:ffffffff80430f10: movzwl 0x4(%rsi),%edx
8825 0.0847 :ffffffff80430f14: add 0xc0(%rbp),%rcx
:ffffffff80430f1b: movzwl 0x4(%rcx),%eax
21 2.0e-04 :ffffffff80430f1f: add %edx,%eax
19668 0.1888 :ffffffff80430f21: cmp $0x12,%eax
1 9.6e-06 :ffffffff80430f24: ja ffffffff80431261
<skb_gro_receive+0x3b8>
:ffffffff80430f2a: mov 0x38(%rcx),%eax
1974 0.0189 :ffffffff80430f2d: add 0x38(%rbp),%eax
:ffffffff80430f30: cld
:ffffffff80430f31: sub %edi,%eax
7666 0.0736 :ffffffff80430f33: mov %eax,0x38(%rcx)
2 1.9e-05 :ffffffff80430f36: mov 0xb8(%rbp),%edx
:ffffffff80430f3c: add 0xc0(%rbp),%rdx
52468 0.5036 :ffffffff80430f43: mov 0x3c(%rdx),%eax
2 1.9e-05 :ffffffff80430f46: add 0x68(%rbp),%eax
1 9.6e-06 :ffffffff80430f49: sub 0x6c(%rbp),%eax
6592 0.0633 :ffffffff80430f4c: sub 0x38(%rbp),%eax
:ffffffff80430f4f: mov %eax,0x3c(%rdx)
:ffffffff80430f52: mov 0xb8(%r12),%eax
23018 0.2209 :ffffffff80430f5a: add 0xc0(%r12),%rax
1 9.6e-06 :ffffffff80430f62: mov 0xb8(%rbp),%esi
:ffffffff80430f68: add 0xc0(%rbp),%rsi
8477 0.0814 :ffffffff80430f6f: movzwl 0x4(%rax),%edi
6 5.8e-05 :ffffffff80430f73: movzwl 0x4(%rsi),%ecx
:ffffffff80430f77: add $0x30,%rsi
21338 0.2048 :ffffffff80430f7b: shl $0x4,%rdi
3 2.9e-05 :ffffffff80430f7f: lea 0x30(%rdi,%rax,1),%rdi
1 9.6e-06 :ffffffff80430f84: shl $0x4,%rcx
150632 1.4457 :ffffffff80430f88: rep movsb %ds:(%rsi),%es:(%rdi)
3988 0.0383 :ffffffff80430f8a: mov 0xb8(%r12),%eax
2015 0.0193 :ffffffff80430f92: mov 0xb8(%rbp),%ecx
11 1.1e-04 :ffffffff80430f98: add 0xc0(%r12),%rax
8 7.7e-05 :ffffffff80430fa0: mov 0xc0(%rbp),%rdx
3295 0.0316 :ffffffff80430fa7: mov 0x4(%rdx,%rcx,1),%edx
:ffffffff80430fab: add %dx,0x4(%rax)
8 7.7e-05 :ffffffff80430faf: mov 0xb8(%rbp),%edx
2507 0.0241 :ffffffff80430fb5: mov 0xc0(%rbp),%rax
:ffffffff80430fbc: movw $0x0,0x4(%rax,%rdx,1)
3233 0.0310 :ffffffff80430fc3: mov 0x6c(%rbp),%eax
1 9.6e-06 :ffffffff80430fc6: sub %eax,0xd0(%rbp)
:ffffffff80430fcc: sub %eax,0x68(%rbp)
41540 0.3987 :ffffffff80430fcf: movl $0x0,0x6c(%rbp)
:ffffffff80430fd6: movl $0x1,0x48(%rbp)
:ffffffff80430fdd: jmpq ffffffff8043123f
<skb_gro_receive+0x396>
:ffffffff80430fe2: mov 0xc8(%r12),%rax
:ffffffff80430fea: mov 0x20(%r12),%rdi
:ffffffff80430fef: mov %eax,%r13d
:ffffffff80430ff2: sub %edx,%r13d
:ffffffff80430ff5: mov $0x20,%edx
:ffffffff80430ffa: mov %r13d,%esi
:ffffffff80430ffd: add 0x38(%r12),%esi
:ffffffff80431002: callq ffffffff8042ffe0
<__netdev_alloc_skb>
:ffffffff80431007: mov %rax,%rbx
:ffffffff8043100a: mov $0xfffffff4,%eax
:ffffffff8043100f: test %rbx,%rbx
:ffffffff80431012: je ffffffff80431266
<skb_gro_receive+0x3bd>
:ffffffff80431018: mov %r12,%rsi
:ffffffff8043101b: mov %rbx,%rdi
:ffffffff8043101e: callq ffffffff8042e2c0
<__copy_skb_header>
:ffffffff80431023: mov 0x70(%r12),%eax
:ffffffff80431028: add %r13d,0xb4(%rbx)
:ffffffff8043102f: mov %ax,0x70(%rbx)
:ffffffff80431033: movslq %r13d,%rax
:ffffffff80431036: add %rax,0xc8(%rbx)
:ffffffff8043103d: cmpl $0x0,0x6c(%rbx)
:ffffffff80431041: mov 0x38(%r12),%edx
:ffffffff80431046: mov 0xb4(%rbx),%eax
:ffffffff8043104c: je ffffffff80431052
<skb_gro_receive+0x1a9>
:ffffffff8043104e: ud2a
:ffffffff80431050: jmp ffffffff80431050
<skb_gro_receive+0x1a7>
:ffffffff80431052: lea (%rdx,%rax,1),%eax
:ffffffff80431055: add %edx,0x68(%rbx)
:ffffffff80431058: mov 0xc8(%r12),%rcx
:ffffffff80431060: mov 0xc8(%rbx),%rdx
:ffffffff80431067: sub 0xc0(%rbx),%edx
:ffffffff8043106d: mov %eax,0xb4(%rbx)
:ffffffff80431073: mov 0xb0(%r12),%eax
:ffffffff8043107b: add 0xc0(%r12),%rax
:ffffffff80431083: sub %ecx,%eax
:ffffffff80431085: add %edx,%eax
:ffffffff80431087: mov %eax,0xb0(%rbx)
:ffffffff8043108d: mov 0xac(%r12),%eax
:ffffffff80431095: add 0xc0(%r12),%rax
:ffffffff8043109d: sub %ecx,%eax
:ffffffff8043109f: add %edx,%eax
:ffffffff804310a1: mov %eax,0xac(%rbx)
:ffffffff804310a7: mov 0xa8(%r12),%eax
:ffffffff804310af: add 0xc0(%r12),%rax
:ffffffff804310b7: sub %ecx,%eax
:ffffffff804310b9: add %edx,%eax
:ffffffff804310bb: mov %eax,0xa8(%rbx)
:ffffffff804310c1: mov 0x68(%r12),%eax
:ffffffff804310c6: mov 0x38(%r12),%edx
:ffffffff804310cb: sub %edx,%eax
:ffffffff804310cd: cmp 0x6c(%r12),%eax
:ffffffff804310d2: mov %eax,0x68(%r12)
:ffffffff804310d7: jae ffffffff804310dd
<skb_gro_receive+0x234>
:ffffffff804310d9: ud2a
:ffffffff804310db: jmp ffffffff804310db
<skb_gro_receive+0x232>
:ffffffff804310dd: mov 0xb0(%r12),%esi
:ffffffff804310e5: mov %edx,%ecx
:ffffffff804310e7: add 0xc8(%r12),%rcx
:ffffffff804310ef: add 0xc0(%r12),%rsi
:ffffffff804310f7: mov 0xb0(%rbx),%edi
:ffffffff804310fd: add 0xc0(%rbx),%rdi
:ffffffff80431104: cld
:ffffffff80431105: mov %rcx,0xc8(%r12)
:ffffffff8043110d: sub %rsi,%rcx
:ffffffff80431110: rep movsb %ds:(%rsi),%es:(%rdi)
:ffffffff80431112: lea 0x38(%rbx),%rdi
:ffffffff80431116: lea 0x38(%r12),%rsi
:ffffffff8043111b: mov $0x5,%cl
:ffffffff8043111d: rep movsl %ds:(%rsi),%es:(%rdi)
:ffffffff8043111f: mov 0xb8(%rbx),%edx
:ffffffff80431125: mov 0xc0(%rbx),%rax
:ffffffff8043112c: mov %r12,0x18(%rax,%rdx,1)
:ffffffff80431131: mov 0xb8(%r12),%edx
:ffffffff80431139: mov 0xc0(%r12),%rax
:ffffffff80431141: mov 0xb8(%rbx),%esi
:ffffffff80431147: mov 0xc0(%rbx),%rcx
:ffffffff8043114e: mov 0x6(%rax,%rdx,1),%ax
:ffffffff80431153: mov %ax,0x6(%rcx,%rsi,1)
:ffffffff80431158: testb $0x10,0x7c(%r12)
:ffffffff8043115e: je ffffffff80431164
<skb_gro_receive+0x2bb>
:ffffffff80431160: ud2a
:ffffffff80431162: jmp ffffffff80431162
<skb_gro_receive+0x2b9>
:ffffffff80431164: mov 0xb8(%r12),%eax
:ffffffff8043116c: orb $0x10,0x7c(%r12)
:ffffffff80431172: add 0xc0(%r12),%rax
:ffffffff8043117a: lock addl $0x10000,(%rax)
:ffffffff80431181: mov 0x68(%r12),%eax
:ffffffff80431186: mov %r12,0x8(%rbx)
:ffffffff8043118a: add %eax,0x6c(%rbx)
:ffffffff8043118d: add %eax,0xd0(%rbx)
:ffffffff80431193: add %eax,0x68(%rbx)
:ffffffff80431196: mov %rbx,(%r15)
:ffffffff80431199: mov (%r12),%rax
:ffffffff8043119d: mov %rax,(%rbx)
:ffffffff804311a0: movq $0x0,(%r12)
:ffffffff804311a8: mov %rbx,%r12
:ffffffff804311ab: mov 0x68(%rbp),%ecx
:ffffffff804311ae: sub 0x6c(%rbp),%ecx
:ffffffff804311b1: cmp %ecx,0x38(%rbp)
:ffffffff804311b4: jbe ffffffff804311f3
<skb_gro_receive+0x34a>
:ffffffff804311b6: mov 0xb8(%rbp),%edx
:ffffffff804311bc: add 0xc0(%rbp),%rdx
:ffffffff804311c3: mov 0x38(%rdx),%eax
:ffffffff804311c6: add 0x38(%rbp),%eax
:ffffffff804311c9: sub %ecx,%eax
:ffffffff804311cb: mov %eax,0x38(%rdx)
:ffffffff804311ce: mov 0xb8(%rbp),%edx
:ffffffff804311d4: add 0xc0(%rbp),%rdx
:ffffffff804311db: mov 0x3c(%rdx),%eax
:ffffffff804311de: add 0x68(%rbp),%eax
:ffffffff804311e1: sub 0x6c(%rbp),%eax
:ffffffff804311e4: sub 0x38(%rbp),%eax
:ffffffff804311e7: mov %eax,0x3c(%rdx)
:ffffffff804311ea: mov 0x68(%rbp),%eax
:ffffffff804311ed: sub 0x6c(%rbp),%eax
:ffffffff804311f0: mov %eax,0x38(%rbp)
:ffffffff804311f3: mov 0x68(%rbp),%eax
:ffffffff804311f6: mov 0x38(%rbp),%edx
:ffffffff804311f9: sub %edx,%eax
:ffffffff804311fb: cmp 0x6c(%rbp),%eax
:ffffffff804311fe: mov %eax,0x68(%rbp)
:ffffffff80431201: jae ffffffff80431207
<skb_gro_receive+0x35e>
:ffffffff80431203: ud2a
:ffffffff80431205: jmp ffffffff80431205
<skb_gro_receive+0x35c>
:ffffffff80431207: mov %edx,%eax
:ffffffff80431209: add %rax,0xc8(%rbp)
:ffffffff80431210: mov 0x8(%r12),%rax
:ffffffff80431215: mov %rbp,0x8(%r12)
:ffffffff8043121a: mov %rbp,(%rax)
:ffffffff8043121d: testb $0x10,0x7c(%rbp)
:ffffffff80431221: je ffffffff80431227
<skb_gro_receive+0x37e>
:ffffffff80431223: ud2a
:ffffffff80431225: jmp ffffffff80431225
<skb_gro_receive+0x37c>
:ffffffff80431227: mov 0xb8(%rbp),%eax
:ffffffff8043122d: orb $0x10,0x7c(%rbp)
:ffffffff80431231: add 0xc0(%rbp),%rax
:ffffffff80431238: lock addl $0x10000,(%rax)
34919 0.3351 :ffffffff8043123f: add %r14d,0x6c(%r12)
1989 0.0191 :ffffffff80431244: add %r14d,0xd0(%r12)
1 9.6e-06 :ffffffff8043124c: xor %eax,%eax
:ffffffff8043124e: add %r14d,0x68(%r12)
20605 0.1978 :ffffffff80431253: incl 0x44(%r12)
:ffffffff80431258: movl $0x1,0x3c(%rbp)
:ffffffff8043125f: jmp ffffffff80431266
<skb_gro_receive+0x3bd>
:ffffffff80431261: mov $0xfffffff9,%eax
13260 0.1273 :ffffffff80431266: pop %r11
1946 0.0187 :ffffffff80431268: pop %rbx
2010 0.0193 :ffffffff80431269: pop %rbp
64 6.1e-04 :ffffffff8043126a: pop %r12
1948 0.0187 :ffffffff8043126c: pop %r13
2746 0.0264 :ffffffff8043126e: pop %r14
57 5.5e-04 :ffffffff80431270: pop %r15
2067 0.0198 :ffffffff80431272: retq
ffffffff80460663 <tcp_gro_receive>: /* tcp_gro_receive total: 396796 3.8083 */
4433 0.0425 :ffffffff80460663: push %r15
2204 0.0212 :ffffffff80460665: push %r14
:ffffffff80460667: mov %rdi,%r14
:ffffffff8046066a: push %r13
2275 0.0218 :ffffffff8046066c: push %r12
:ffffffff8046066e: mov %rsi,%r12
:ffffffff80460671: mov $0x14,%esi
5933 0.0569 :ffffffff80460676: mov %r12,%rdi
:ffffffff80460679: push %rbp
:ffffffff8046067a: push %rbx
2180 0.0209 :ffffffff8046067b: sub $0x8,%rsp
:ffffffff8046067f: callq ffffffff804357a1
<skb_gro_header>
:ffffffff80460684: test %rax,%rax
3218 0.0309 :ffffffff80460687: je ffffffff804607ed
<tcp_gro_receive+0x18a>
:ffffffff8046068d: mov 0xc(%rax),%al
1 9.6e-06 :ffffffff80460690: shr $0x4,%al
3528 0.0339 :ffffffff80460693: movzbl %al,%eax
:ffffffff80460696: lea 0x0(,%rax,4),%r13d
1 9.6e-06 :ffffffff8046069e: cmp $0x13,%r13d
2773 0.0266 :ffffffff804606a2: jbe ffffffff804607ed
<tcp_gro_receive+0x18a>
:ffffffff804606a8: mov %r13d,%esi
:ffffffff804606ab: mov %r12,%rdi
3327 0.0319 :ffffffff804606ae: callq ffffffff804357a1
<skb_gro_header>
:ffffffff804606b3: test %rax,%rax
2094 0.0201 :ffffffff804606b6: mov %rax,%r8
:ffffffff804606b9: je ffffffff804607ed
<tcp_gro_receive+0x18a>
:ffffffff804606bf: lea 0x38(%r12),%r15
2245 0.0215 :ffffffff804606c4: add %r13d,(%r15)
:ffffffff804606c7: mov 0x68(%r12),%ebp
:ffffffff804606cc: sub 0x38(%r12),%ebp
2394 0.0230 :ffffffff804606d1: mov 0xc(%rax),%ebx
:ffffffff804606d4: jmp ffffffff80460710
<tcp_gro_receive+0xad>
2111 0.0203 :ffffffff804606d6: lea 0x38(%rdi),%r9
3 2.9e-05 :ffffffff804606da: cmpl $0x0,0x4(%r9)
21 2.0e-04 :ffffffff804606df: je ffffffff8046070d
<tcp_gro_receive+0xaa>
2592 0.0249 :ffffffff804606e1: mov 0xa8(%rdi),%eax
:ffffffff804606e7: mov 0xc0(%rdi),%r10
:ffffffff804606ee: mov 0x2(%r8),%dx
2440 0.0234 :ffffffff804606f3: lea (%r10,%rax,1),%rcx
:ffffffff804606f7: mov (%r8),%eax
1 9.6e-06 :ffffffff804606fa: xor 0x2(%rcx),%dx
6275 0.0602 :ffffffff804606fe: xor (%rcx),%eax
3 2.9e-05 :ffffffff80460700: or %ax,%dx
:ffffffff80460703: je ffffffff8046071d
<tcp_gro_receive+0xba>
:ffffffff80460705: movl $0x0,0x4(%r9)
:ffffffff8046070d: mov %rdi,%r14
2920 0.0280 :ffffffff80460710: mov (%r14),%rdi
18 1.7e-04 :ffffffff80460713: test %rdi,%rdi
2 1.9e-05 :ffffffff80460716: jne ffffffff804606d6
<tcp_gro_receive+0x73>
33 3.2e-04 :ffffffff80460718: jmpq ffffffff80460807
<tcp_gro_receive+0x1a4>
4253 0.0408 :ffffffff8046071d: mov 0xe(%r8),%ax
2125 0.0204 :ffffffff80460722: xor 0xe(%rcx),%ax
2 1.9e-05 :ffffffff80460726: mov %ebx,%edx
:ffffffff80460728: and $0x8000,%edx
8066 0.0774 :ffffffff8046072e: or 0x8(%r9),%edx
:ffffffff80460732: movzwl %ax,%esi
:ffffffff80460735: mov 0x8(%r8),%eax
64740 0.6214 :ffffffff80460739: xor 0x8(%rcx),%eax
:ffffffff8046073c: or %eax,%esi
:ffffffff8046073e: mov %ebx,%eax
2084 0.0200 :ffffffff80460740: xor 0xc(%rcx),%eax
:ffffffff80460743: and $0x76,%ah
:ffffffff80460746: or %eax,%edx
2132 0.0205 :ffffffff80460748: or %edx,%esi
:ffffffff8046074a: mov $0x14,%edx
:ffffffff8046074f: jmp ffffffff8046075e
<tcp_gro_receive+0xfb>
:ffffffff80460751: movslq %edx,%rax
:ffffffff80460754: add $0x4,%edx
:ffffffff80460757: mov (%r8,%rax,1),%esi
:ffffffff8046075b: xor (%rcx,%rax,1),%esi
3670 0.0352 :ffffffff8046075e: test %esi,%esi
2162 0.0208 :ffffffff80460760: jne ffffffff80460767
<tcp_gro_receive+0x104>
:ffffffff80460762: cmp %r13d,%edx
1 9.6e-06 :ffffffff80460765: jb ffffffff80460751
<tcp_gro_receive+0xee>
50209 0.4819 :ffffffff80460767: mov 0xb8(%rdi),%eax
4473 0.0429 :ffffffff8046076d: mov 0x4(%rcx),%edx
:ffffffff80460770: bswap %edx
9554 0.0917 :ffffffff80460772: mov 0x4(%r8),%ecx
:ffffffff80460776: bswap %ecx
:ffffffff80460778: movzwl 0x6(%r10,%rax,1),%r13d
7572 0.0727 :ffffffff8046077e: mov 0x68(%rdi),%eax
:ffffffff80460781: sub 0x38(%rdi),%eax
:ffffffff80460784: add %edx,%eax
9803 0.0941 :ffffffff80460786: xor %eax,%ecx
:ffffffff80460788: cmp %r13d,%ebp
:ffffffff8046078b: seta %al
50608 0.4857 :ffffffff8046078e: test %ebp,%ebp
:ffffffff80460790: sete %dl
:ffffffff80460793: or %edx,%eax
3161 0.0303 :ffffffff80460795: movzbl %al,%eax
:ffffffff80460798: or %eax,%esi
:ffffffff8046079a: or %esi,%ecx
3278 0.0315 :ffffffff8046079c: jne ffffffff804607f6
<tcp_gro_receive+0x193>
:ffffffff8046079e: mov %r12,%rsi
2 1.9e-05 :ffffffff804607a1: mov %r14,%rdi
2579 0.0248 :ffffffff804607a4: callq ffffffff80430ea9
<skb_gro_receive>
2059 0.0198 :ffffffff804607a9: test %eax,%eax
49 4.7e-04 :ffffffff804607ab: jne ffffffff804607f6
<tcp_gro_receive+0x193>
:ffffffff804607ad: mov (%r14),%rcx
1945 0.0187 :ffffffff804607b0: mov %ebx,%edx
3 2.9e-05 :ffffffff804607b2: and $0x900,%edx
:ffffffff804607b8: mov 0xa8(%rcx),%eax
2530 0.0243 :ffffffff804607be: add 0xc0(%rcx),%rax
3 2.9e-05 :ffffffff804607c5: or %edx,0xc(%rax)
13 1.2e-04 :ffffffff804607c8: xor %eax,%eax
4881 0.0468 :ffffffff804607ca: cmp %r13d,%ebp
:ffffffff804607cd: setb %al
:ffffffff804607d0: and $0x2f00,%ebx
1912 0.0184 :ffffffff804607d6: or %ebx,%eax
:ffffffff804607d8: test %rcx,%rcx
:ffffffff804607db: je ffffffff80460816
<tcp_gro_receive+0x1b3>
2163 0.0208 :ffffffff804607dd: cmpl $0x0,0x4(%r15)
136 0.0013 :ffffffff804607e2: je ffffffff804607e8
<tcp_gro_receive+0x185>
2455 0.0236 :ffffffff804607e4: test %eax,%eax
57 5.5e-04 :ffffffff804607e6: je ffffffff80460816
<tcp_gro_receive+0x1b3>
148 0.0014 :ffffffff804607e8: mov %r14,%rdi
735 0.0071 :ffffffff804607eb: jmp ffffffff80460818
<tcp_gro_receive+0x1b5>
:ffffffff804607ed: xor %edi,%edi
:ffffffff804607ef: mov $0x1,%eax
:ffffffff804607f4: jmp ffffffff80460818
<tcp_gro_receive+0x1b5>
68 6.5e-04 :ffffffff804607f6: xor %eax,%eax
1 9.6e-06 :ffffffff804607f8: test %ebp,%ebp
67 6.4e-04 :ffffffff804607fa: sete %al
47 4.5e-04 :ffffffff804607fd: and $0x2f00,%ebx
:ffffffff80460803: or %ebx,%eax
58 5.6e-04 :ffffffff80460805: jmp ffffffff804607dd
<tcp_gro_receive+0x17a>
122 0.0012 :ffffffff80460807: xor %eax,%eax
9 8.6e-05 :ffffffff80460809: test %ebp,%ebp
:ffffffff8046080b: sete %al
67 6.4e-04 :ffffffff8046080e: and $0x2f00,%ebx
6 5.8e-05 :ffffffff80460814: or %ebx,%eax
1995 0.0191 :ffffffff80460816: xor %edi,%edi
68 6.5e-04 :ffffffff80460818: or %eax,0x40(%r12)
275 0.0026 :ffffffff8046081d: mov %rdi,%rax
2037 0.0196 :ffffffff80460820: pop %r11
191 0.0018 :ffffffff80460822: pop %rbx
4346 0.0417 :ffffffff80460823: pop %rbp
4739 0.0455 :ffffffff80460824: pop %r12
167 0.0016 :ffffffff80460826: pop %r13
23735 0.2278 :ffffffff80460828: pop %r14
56070 0.5381 :ffffffff8046082a: pop %r15
140 0.0013 :ffffffff8046082c: retq
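
The xor/xor/or sequence at the top of the loop in the listing above reads
like a branch-free comparison of the TCP source and destination ports
between the incoming packet and each held packet. A small sketch of that
idiom follows. The struct and function names are hypothetical stand-ins,
not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first four bytes of a TCP header. */
struct tcp_ports {
    uint16_t source;
    uint16_t dest;
};

/* Sketch only: returns 1 when either port differs, 0 when both match.
 * XOR yields zero for equal fields, so OR-ing the results lets a single
 * test replace two separate compare-and-branch pairs. */
int ports_differ(const struct tcp_ports *a, const struct tcp_ports *b)
{
    assert(a != NULL && b != NULL);
    return ((a->source ^ b->source) | (a->dest ^ b->dest)) != 0;
}
```

The idiom saves branches, but it still has to load the header of every held
packet, which may be where the samples on the following instructions come
from.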
ffffffff804357a1 <skb_gro_header>: /* skb_gro_header total: 319455 3.0660 */
13604 0.1306 :ffffffff804357a1: push %rbp
14938 0.1434 :ffffffff804357a2: push %rbx
:ffffffff804357a3: mov %rdi,%rbx
:ffffffff804357a6: sub $0x8,%rsp
18392 0.1765 :ffffffff804357aa: mov 0x38(%rdi),%ebp
:ffffffff804357ad: mov 0x68(%rdi),%edx
1 9.6e-06 :ffffffff804357b0: add %ebp,%esi
20559 0.1973 :ffffffff804357b2: mov %edx,%edi
:ffffffff804357b4: sub 0x6c(%rbx),%edi
:ffffffff804357b7: jne ffffffff804357cc
<skb_gro_header+0x2b>
36626 0.3515 :ffffffff804357b9: mov 0xb8(%rbx),%ecx
2 1.9e-05 :ffffffff804357bf: mov 0xc0(%rbx),%rax
3 2.9e-05 :ffffffff804357c6: cmp %esi,0x3c(%rax,%rcx,1)
18577 0.1783 :ffffffff804357ca: jae ffffffff804357ee
<skb_gro_header+0x4d>
:ffffffff804357cc: cmp %edi,%esi
:ffffffff804357ce: jbe ffffffff804357e3
<skb_gro_header+0x42>
:ffffffff804357d0: cmp %edx,%esi
:ffffffff804357d2: ja ffffffff80435833
<skb_gro_header+0x92>
:ffffffff804357d4: sub %edi,%esi
:ffffffff804357d6: mov %rbx,%rdi
:ffffffff804357d9: callq ffffffff8042f6ee
<__pskb_pull_tail>
:ffffffff804357de: test %rax,%rax
:ffffffff804357e1: je ffffffff80435833
<skb_gro_header+0x92>
:ffffffff804357e3: mov %ebp,%eax
:ffffffff804357e5: add 0xc8(%rbx),%rax
:ffffffff804357ec: jmp ffffffff80435835
<skb_gro_header+0x94>
3 2.9e-05 :ffffffff804357ee: add 0xc0(%rbx),%rcx
25999 0.2495 :ffffffff804357f5: mov $0x1e0000000000,%rax
:ffffffff804357ff: mov $0x6db6db6db6db6db7,%rdx
44557 0.4276 :ffffffff80435809: add 0x30(%rcx),%rax
:ffffffff8043580d: sar $0x3,%rax
12588 0.1208 :ffffffff80435811: imul %rdx,%rax
10104 0.0970 :ffffffff80435815: mov $0xffff880000000000,%rdx
:ffffffff8043581f: shl $0xc,%rax
:ffffffff80435823: add %rdx,%rax
16404 0.1574 :ffffffff80435826: mov 0x38(%rcx),%edx
:ffffffff80435829: add %rdx,%rax
:ffffffff8043582c: mov %ebp,%edx
15264 0.1465 :ffffffff8043582e: add %rdx,%rax
:ffffffff80435831: jmp ffffffff80435835
<skb_gro_header+0x94>
:ffffffff80435833: xor %eax,%eax
45844 0.4400 :ffffffff80435835: pop %r10
2 1.9e-05 :ffffffff80435837: pop %rbx
12844 0.1233 :ffffffff80435838: pop %rbp
13144 0.1262 :ffffffff80435839: retq
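
A note on the "sar $0x3" / "imul" pair in the listing above: the constant
0x6db6db6db6db6db7 is the multiplicative inverse of 7 modulo 2^64, so the
two instructions together divide an exact multiple of 56 by 56 without a
div instruction. This is presumably a pointer difference being turned into
an array index for a 56-byte structure, which is an inference from the
constants, not something stated in the thread. The sketch below shows only
the arithmetic. The names are hypothetical, and it uses unsigned values
where the listing uses an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* 7 * INV7 == 1 (mod 2^64) */
#define INV7 0x6db6db6db6db6db7ULL

/* Sketch only: exact division of a multiple of 56 by 56. Shift out the
 * factor 8, then multiply by the modular inverse of 7. The result is
 * meaningless unless n is a multiple of 56. */
uint64_t div56_exact(uint64_t n)
{
    assert(n % 56 == 0);
    return (n >> 3) * INV7;
}
```

The shift and multiply are cheap, so the samples attributed to this stretch
are more likely caused by the loads around it than by the arithmetic.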
Thanks for your help,
Drew
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-29 17:28 ` Andrew Gallatin
@ 2009-04-30 8:10 ` Herbert Xu
2009-04-30 8:14 ` Herbert Xu
2009-04-30 8:17 ` Eric Dumazet
1 sibling, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2009-04-30 8:10 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Eric Dumazet, David Miller, brice, sgruszka, netdev
Hi:
Unfortunately the myricom card I was using is now refusing to work:
myri10ge: Version 1.4.4-1.412
myri10ge 0000:04:00.0: PCI INT A -> GSI 33 (level, low) -> IRQ 33
myri10ge 0000:04:00.0: setting latency timer to 64
myri10ge 0000:04:00.0: PCIE x4 Link
myri10ge 0000:04:00.0: firmware: requesting myri10ge_eth_z8e.dat
myri10ge 0000:04:00.0: command 1 failed, result = 14
myri10ge 0000:04:00.0: failed reset
myri10ge 0000:04:00.0: failed reset
myri10ge 0000:04:00.0: myri10ge_probe() failed: MAC=00:60:dd:47:80:7d, SN=312225
myri10ge 0000:04:00.0: PCI INT A disabled
So I won't be able to test this until I locate another myri10ge
card or get this one back up and running again.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-30 8:10 ` Herbert Xu
@ 2009-04-30 8:14 ` Herbert Xu
0 siblings, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2009-04-30 8:14 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Eric Dumazet, David Miller, brice, sgruszka, netdev
On Thu, Apr 30, 2009 at 04:10:51PM +0800, Herbert Xu wrote:
>
> So I won't be able to test this until I locate another myri10ge
> card or get this one back up and running again.
Another reboot seems to have fixed it. So all is good.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-29 17:28 ` Andrew Gallatin
2009-04-30 8:10 ` Herbert Xu
@ 2009-04-30 8:17 ` Eric Dumazet
2009-04-30 19:14 ` Andrew Gallatin
1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-04-30 8:17 UTC (permalink / raw)
To: Andrew Gallatin; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
Andrew Gallatin wrote:
> Eric Dumazet wrote:
>
>>
>> Sure, probably more cache misses or something...
>
> Yes, that's what I thought. The code is much more complete,
> and spread out than LRO, and seems to open itself to cache
> misses.
>
>> You could try a longer oprofile session (with at least one million samples)
>> and :
>>
>> opannotate -a vmlinux >/tmp/FILE
>>
>> And select 3 or 4 suspect functions : inet_gro_receive() tcp_gro_receive(),
>> skb_gro_receive(), skb_gro_header()
>
> Here is the opreport -l output from this machine for GRO for a 25 minute
> profiling run:
>
>
> samples  %        image name   app name  symbol name
> 3742674  32.2793  vmlinux      vmlinux   copy_user_generic_string
> 890179    7.6775  myri10ge.ko  myri10ge  myri10ge_poll
> 547572    4.7226  vmlinux      vmlinux   inet_gro_receive
> 477479    4.1181  vmlinux      vmlinux   skb_gro_receive
> 406562    3.5065  vmlinux      vmlinux   free_hot_cold_page
> 396796    3.4222  vmlinux      vmlinux   tcp_gro_receive
> 332364    2.8665  vmlinux      vmlinux   __rmqueue_smallest
> 319455    2.7552  vmlinux      vmlinux   skb_gro_header
> 269040    2.3204  vmlinux      vmlinux   dev_gro_receive
> 252885    2.1810  vmlinux      vmlinux   free_pages_bulk
> 247832    2.1375  vmlinux      vmlinux   get_pageblock_flags_group
> 211592    1.8249  myri10ge.ko  myri10ge  myri10ge_alloc_rx_pages
> 208867    1.8014  vmlinux      vmlinux   __list_add
> 201491    1.7378  vmlinux      vmlinux   tcp4_gro_receive
> 187591    1.6179  vmlinux      vmlinux   __napi_gro_receive
> 170156    1.4675  vmlinux      vmlinux   get_page_from_freelist
> 116321    1.0032  vmlinux      vmlinux   list_del
> 107994    0.9314  vmlinux      vmlinux   kfree
> 106434    0.9180  vmlinux      vmlinux   skb_copy_datagram_iovec
> 100675    0.8683  vmlinux      vmlinux   put_page
>
> And here is the opannotate -a output for a few GRO functions. BTW, did you
> mean -s rather than -a? I'd naively think source might be more helpful. But
> here is what you asked for:
>
> ffffffff80479f20 <inet_gro_receive>: /* inet_gro_receive total: 547572 5.2554 */
> 12187 0.1170 :ffffffff80479f20: push %r13
> 2611 0.0251 :ffffffff80479f22: mov %rdi,%r13
> :ffffffff80479f25: push %r12
> :ffffffff80479f27: push %rbp
> 4031 0.0387 :ffffffff80479f28: push %rbx
> :ffffffff80479f29: mov %rsi,%rbx
> :ffffffff80479f2c: mov $0x14,%esi
> 6303 0.0605 :ffffffff80479f31: mov %rbx,%rdi
> :ffffffff80479f34: sub $0x8,%rsp
> :ffffffff80479f38: callq ffffffff804357a1
> <skb_gro_header>
> :ffffffff80479f3d: test %rax,%rax
> 2494 0.0239 :ffffffff80479f40: mov %rax,%r8
> :ffffffff80479f43: je ffffffff8047a0a4
> <inet_gro_receive+0x184>
> :ffffffff80479f49: movzbl 0x9(%rax),%eax
> 2541 0.0244 :ffffffff80479f4d: mov
> 0xffffffff80d06280(,%rax,8),%r11
> 33 3.2e-04 :ffffffff80479f55: test %r11,%r11
> 5 4.8e-05 :ffffffff80479f58: je ffffffff8047a0a4
> <inet_gro_receive+0x184>
> 11016 0.1057 :ffffffff80479f5e: cmpq $0x0,0x20(%r11)
> 292 0.0028 :ffffffff80479f63: je ffffffff8047a0a4
> <inet_gro_receive+0x184>
> 1 9.6e-06 :ffffffff80479f69: cmpb $0x45,(%r8)
> 4297 0.0412 :ffffffff80479f6d: jne ffffffff8047a0a4
> <inet_gro_receive+0x184>
> 6086 0.0584 :ffffffff80479f73: mov $0x5,%eax
> :ffffffff80479f78: mov %r8,%rcx
> 18706 0.1795 :ffffffff80479f7b: mov (%rcx),%edx
> 341 0.0033 :ffffffff80479f7d: sub $0x4,%eax
> :ffffffff80479f80: jbe ffffffff80479fa6
> <inet_gro_receive+0x86>
> 4609 0.0442 :ffffffff80479f82: add 0x4(%rcx),%edx
> 398 0.0038 :ffffffff80479f85: adc 0x8(%rcx),%edx
> :ffffffff80479f88: adc 0xc(%rcx),%edx
> 4310 0.0414 :ffffffff80479f8b: adc 0x10(%rcx),%edx
> 790 0.0076 :ffffffff80479f8e: lea 0x4(%rcx),%rcx
> :ffffffff80479f92: dec %eax
> 9097 0.0873 :ffffffff80479f94: jne ffffffff80479f8b
> <inet_gro_receive+0x6b>
> 541 0.0052 :ffffffff80479f96: adc $0x0,%edx
> :ffffffff80479f99: mov %edx,%eax
> 1919 0.0184 :ffffffff80479f9b: shr $0x10,%edx
> 535 0.0051 :ffffffff80479f9e: add %ax,%dx
> :ffffffff80479fa1: adc $0x0,%edx
> 3633 0.0349 :ffffffff80479fa4: not %edx
> 683 0.0066 :ffffffff80479fa6: test %dx,%dx
> 1 9.6e-06 :ffffffff80479fa9: jne ffffffff8047a0a4
> <inet_gro_receive+0x184>
> 4725 0.0453 :ffffffff80479faf: movzwl 0x2(%r8),%eax
> 9728 0.0934 :ffffffff80479fb4: mov 0x68(%rbx),%edx
> 8 7.7e-05 :ffffffff80479fb7: mov $0x1,%ebp
> 43000 0.4127 :ffffffff80479fbc: sub 0x38(%rbx),%edx
> 11149 0.1070 :ffffffff80479fbf: mov %eax,%ecx
> :ffffffff80479fc1: shl $0x8,%eax
> 66497 0.6382 :ffffffff80479fc4: shr $0x8,%ecx
> 735 0.0071 :ffffffff80479fc7: or %ecx,%eax
> :ffffffff80479fc9: movzwl %ax,%eax
> 5459 0.0524 :ffffffff80479fcc: cmp %edx,%eax
> 522 0.0050 :ffffffff80479fce: jne ffffffff80479fdc
> <inet_gro_receive+0xbc>
> :ffffffff80479fd0: xor %ebp,%ebp
> 5373 0.0516 :ffffffff80479fd2: cmpw $0x40,0x6(%r8)
> 345 0.0033 :ffffffff80479fd8: setne %bpl
> :ffffffff80479fdc: movzwl 0x4(%r8),%eax
> 2384 0.0229 :ffffffff80479fe1: mov 0x0(%r13),%r10
> 631 0.0061 :ffffffff80479fe5: mov %eax,%edx
> :ffffffff80479fe7: shl $0x8,%eax
> 3044 0.0292 :ffffffff80479fea: shr $0x8,%edx
> 303 0.0029 :ffffffff80479fed: or %edx,%eax
> :ffffffff80479fef: movzwl %ax,%r12d
> 2747 0.0264 :ffffffff80479ff3: jmp ffffffff8047a071
> <inet_gro_receive+0x151>
> 2109 0.0202 :ffffffff80479ff5: lea 0x38(%r10),%r9
> 12 1.2e-04 :ffffffff80479ff9: cmpl $0x0,0x4(%r9)
> 23 2.2e-04 :ffffffff80479ffe: je ffffffff8047a06e
> <inet_gro_receive+0x14e>
> 2104 0.0202 :ffffffff8047a000: mov 0xac(%r10),%edi
> 2 1.9e-05 :ffffffff8047a007: add 0xc0(%r10),%rdi
> :ffffffff8047a00e: mov 0x9(%rdi),%sil
> 2391 0.0229 :ffffffff8047a012: mov 0x1(%rdi),%al
> 2 1.9e-05 :ffffffff8047a015: xor 0x9(%r8),%sil
> 7 6.7e-05 :ffffffff8047a019: xor 0x1(%r8),%al
> 2101 0.0202 :ffffffff8047a01d: mov 0xc(%rdi),%edx
> 1 9.6e-06 :ffffffff8047a020: mov 0x10(%rdi),%ecx
> :ffffffff8047a023: xor 0xc(%r8),%edx
> 2775 0.0266 :ffffffff8047a027: xor 0x10(%r8),%ecx
> :ffffffff8047a02b: or %esi,%eax
> :ffffffff8047a02d: movzbl %al,%eax
> 62734 0.6021 :ffffffff8047a030: or %edx,%ecx
> :ffffffff8047a032: or %eax,%ecx
> :ffffffff8047a034: je ffffffff8047a040
> <inet_gro_receive+0x120>
> :ffffffff8047a036: movl $0x0,0x4(%r9)
> :ffffffff8047a03e: jmp ffffffff8047a06e
> <inet_gro_receive+0x14e>
> 2106 0.0202 :ffffffff8047a040: movzwl 0x4(%rdi),%edx
> :ffffffff8047a044: mov 0x8(%rdi),%al
> :ffffffff8047a047: xor 0x8(%r8),%eax
> 64244 0.6166 :ffffffff8047a04b: mov %edx,%ecx
> :ffffffff8047a04d: shl $0x8,%edx
> :ffffffff8047a050: shr $0x8,%ecx
> 2072 0.0199 :ffffffff8047a053: movzbl %al,%eax
> :ffffffff8047a056: or 0x8(%r9),%eax
> :ffffffff8047a05a: or %ecx,%edx
> 2629 0.0252 :ffffffff8047a05c: add 0xc(%r9),%edx
> 2 1.9e-05 :ffffffff8047a060: movzwl %dx,%edx
> :ffffffff8047a063: xor %r12d,%edx
> 58223 0.5588 :ffffffff8047a066: or %edx,%eax
> 3 2.9e-05 :ffffffff8047a068: or %ebp,%eax
> :ffffffff8047a06a: mov %eax,0x8(%r9)
> 21878 0.2100 :ffffffff8047a06e: mov (%r10),%r10
> 2156 0.0207 :ffffffff8047a071: test %r10,%r10
> :ffffffff8047a074: jne ffffffff80479ff5
> <inet_gro_receive+0xd5>
> 3007 0.0289 :ffffffff8047a07a: mov 0x38(%rbx),%eax
> 61 5.9e-04 :ffffffff8047a07d: or %ebp,0x40(%rbx)
> 3 2.9e-05 :ffffffff8047a080: mov %rbx,%rsi
> 3091 0.0297 :ffffffff8047a083: mov %r13,%rdi
> 41 3.9e-04 :ffffffff8047a086: add $0x14,%eax
> :ffffffff8047a089: mov %eax,0x38(%rbx)
> 3704 0.0355 :ffffffff8047a08c: sub 0xc0(%rbx),%eax
> 33 3.2e-04 :ffffffff8047a092: add 0xc8(%rbx),%eax
> :ffffffff8047a098: mov %eax,0xa8(%rbx)
> 2468 0.0237 :ffffffff8047a09e: callq *0x20(%r11)
> 20011 0.1921 :ffffffff8047a0a2: jmp ffffffff8047a0ab
> <inet_gro_receive+0x18b>
> :ffffffff8047a0a4: xor %eax,%eax
> :ffffffff8047a0a6: mov $0x1,%ebp
> 24082 0.2311 :ffffffff8047a0ab: or %ebp,0x40(%rbx)
> 626 0.0060 :ffffffff8047a0ae: pop %r10
> 1718 0.0165 :ffffffff8047a0b0: pop %rbx
> 446 0.0043 :ffffffff8047a0b1: pop %rbp
> 4074 0.0391 :ffffffff8047a0b2: pop %r12
> 2089 0.0200 :ffffffff8047a0b4: pop %r13
> 434 0.0042 :ffffffff8047a0b6: retq
>
>
>
>
> ffffffff80430ea9 <skb_gro_receive>: /* skb_gro_receive total: 477479 4.5827 */
> 2158 0.0207 :ffffffff80430ea9: push %r15
> 2492 0.0239 :ffffffff80430eab: mov %rdi,%r15
> :ffffffff80430eae: push %r14
> :ffffffff80430eb0: push %r13
> 2432 0.0233 :ffffffff80430eb2: push %r12
> 1 9.6e-06 :ffffffff80430eb4: push %rbp
> 1 9.6e-06 :ffffffff80430eb5: mov %rsi,%rbp
> 2430 0.0233 :ffffffff80430eb8: push %rbx
> :ffffffff80430eb9: sub $0x8,%rsp
> :ffffffff80430ebd: mov 0x68(%rsi),%ecx
> 2420 0.0232 :ffffffff80430ec0: mov (%rdi),%r12
> 1 9.6e-06 :ffffffff80430ec3: mov %ecx,%r14d
> 1 9.6e-06 :ffffffff80430ec6: sub 0x38(%rsi),%r14d
> 2317 0.0222 :ffffffff80430eca: mov %r14d,%eax
> 1 9.6e-06 :ffffffff80430ecd: add 0x68(%r12),%eax
> 1 9.6e-06 :ffffffff80430ed2: cmp $0xffff,%eax
> 3865 0.0371 :ffffffff80430ed7: ja ffffffff80431261
> <skb_gro_receive+0x3b8>
> :ffffffff80430edd: mov 0xb8(%r12),%eax
> :ffffffff80430ee5: mov 0xc0(%r12),%rdx
> 8082 0.0776 :ffffffff80430eed: lea (%rdx,%rax,1),%rsi
> :ffffffff80430ef1: cmpq $0x0,0x18(%rsi)
> 2 1.9e-05 :ffffffff80430ef6: jne ffffffff804311ab
> <skb_gro_receive+0x302>
> 9249 0.0888 :ffffffff80430efc: mov %ecx,%edi
> :ffffffff80430efe: sub 0x6c(%rbp),%edi
> 6 5.8e-05 :ffffffff80430f01: cmp 0x38(%rbp),%edi
> 3104 0.0298 :ffffffff80430f04: ja ffffffff80430fe2
> <skb_gro_receive+0x139>
> 2 1.9e-05 :ffffffff80430f0a: mov 0xb8(%rbp),%ecx
> :ffffffff80430f10: movzwl 0x4(%rsi),%edx
> 8825 0.0847 :ffffffff80430f14: add 0xc0(%rbp),%rcx
> :ffffffff80430f1b: movzwl 0x4(%rcx),%eax
> 21 2.0e-04 :ffffffff80430f1f: add %edx,%eax
> 19668 0.1888 :ffffffff80430f21: cmp $0x12,%eax
> 1 9.6e-06 :ffffffff80430f24: ja ffffffff80431261
> <skb_gro_receive+0x3b8>
> :ffffffff80430f2a: mov 0x38(%rcx),%eax
> 1974 0.0189 :ffffffff80430f2d: add 0x38(%rbp),%eax
> :ffffffff80430f30: cld
> :ffffffff80430f31: sub %edi,%eax
> 7666 0.0736 :ffffffff80430f33: mov %eax,0x38(%rcx)
> 2 1.9e-05 :ffffffff80430f36: mov 0xb8(%rbp),%edx
> :ffffffff80430f3c: add 0xc0(%rbp),%rdx
The compiler apparently has a hard time optimizing this function... :(
skb_shinfo(skb) and skb_shinfo(p) are re-evaluated many times.
> 52468 0.5036 :ffffffff80430f43: mov 0x3c(%rdx),%eax
> 2 1.9e-05 :ffffffff80430f46: add 0x68(%rbp),%eax
> 1 9.6e-06 :ffffffff80430f49: sub 0x6c(%rbp),%eax
> 6592 0.0633 :ffffffff80430f4c: sub 0x38(%rbp),%eax
> :ffffffff80430f4f: mov %eax,0x3c(%rdx)
> :ffffffff80430f52: mov 0xb8(%r12),%eax
> 23018 0.2209 :ffffffff80430f5a: add 0xc0(%r12),%rax
> 1 9.6e-06 :ffffffff80430f62: mov 0xb8(%rbp),%esi
> :ffffffff80430f68: add 0xc0(%rbp),%rsi
> 8477 0.0814 :ffffffff80430f6f: movzwl 0x4(%rax),%edi
> 6 5.8e-05 :ffffffff80430f73: movzwl 0x4(%rsi),%ecx
> :ffffffff80430f77: add $0x30,%rsi
> 21338 0.2048 :ffffffff80430f7b: shl $0x4,%rdi
> 3 2.9e-05 :ffffffff80430f7f: lea 0x30(%rdi,%rax,1),%rdi
> 1 9.6e-06 :ffffffff80430f84: shl $0x4,%rcx
> 150632 1.4457 :ffffffff80430f88: rep movsb %ds:(%rsi),%es:(%rdi)
Ouch... What a stupid compiler... it should use movsq here, not a byte-wise rep movsb :(
We could try to inline the likely case of a single fragment being copied...
> 3988 0.0383 :ffffffff80430f8a: mov 0xb8(%r12),%eax
> 2015 0.0193 :ffffffff80430f92: mov 0xb8(%rbp),%ecx
> 11 1.1e-04 :ffffffff80430f98: add 0xc0(%r12),%rax
> 8 7.7e-05 :ffffffff80430fa0: mov 0xc0(%rbp),%rdx
> 3295 0.0316 :ffffffff80430fa7: mov 0x4(%rdx,%rcx,1),%edx
> :ffffffff80430fab: add %dx,0x4(%rax)
> 8 7.7e-05 :ffffffff80430faf: mov 0xb8(%rbp),%edx
> 2507 0.0241 :ffffffff80430fb5: mov 0xc0(%rbp),%rax
> :ffffffff80430fbc: movw $0x0,0x4(%rax,%rdx,1)
> 3233 0.0310 :ffffffff80430fc3: mov 0x6c(%rbp),%eax
> 1 9.6e-06 :ffffffff80430fc6: sub %eax,0xd0(%rbp)
> :ffffffff80430fcc: sub %eax,0x68(%rbp)
> 41540 0.3987 :ffffffff80430fcf: movl $0x0,0x6c(%rbp)
> :ffffffff80430fd6: movl $0x1,0x48(%rbp)
> :ffffffff80430fdd: jmpq ffffffff8043123f
> <skb_gro_receive+0x396>
> :ffffffff80430fe2: mov 0xc8(%r12),%rax
> :ffffffff80430fea: mov 0x20(%r12),%rdi
> :ffffffff80430fef: mov %eax,%r13d
> :ffffffff80430ff2: sub %edx,%r13d
> :ffffffff80430ff5: mov $0x20,%edx
> :ffffffff80430ffa: mov %r13d,%esi
> :ffffffff80430ffd: add 0x38(%r12),%esi
> :ffffffff80431002: callq ffffffff8042ffe0
...
> :ffffffff80431223: ud2a
> :ffffffff80431225: jmp ffffffff80431225
> <skb_gro_receive+0x37c>
> :ffffffff80431227: mov 0xb8(%rbp),%eax
> :ffffffff8043122d: orb $0x10,0x7c(%rbp)
> :ffffffff80431231: add 0xc0(%rbp),%rax
> :ffffffff80431238: lock addl $0x10000,(%rax)
> 34919 0.3351 :ffffffff8043123f: add %r14d,0x6c(%r12)
> 1989 0.0191 :ffffffff80431244: add %r14d,0xd0(%r12)
> 1 9.6e-06 :ffffffff8043124c: xor %eax,%eax
> :ffffffff8043124e: add %r14d,0x68(%r12)
> 20605 0.1978 :ffffffff80431253: incl 0x44(%r12)
> :ffffffff80431258: movl $0x1,0x3c(%rbp)
> :ffffffff8043125f: jmp ffffffff80431266
> <skb_gro_receive+0x3bd>
> :ffffffff80431261: mov $0xfffffff9,%eax
> 13260 0.1273 :ffffffff80431266: pop %r11
> 1946 0.0187 :ffffffff80431268: pop %rbx
> 2010 0.0193 :ffffffff80431269: pop %rbp
> 64 6.1e-04 :ffffffff8043126a: pop %r12
> 1948 0.0187 :ffffffff8043126c: pop %r13
> 2746 0.0264 :ffffffff8043126e: pop %r14
> 57 5.5e-04 :ffffffff80431270: pop %r15
> 2067 0.0198 :ffffffff80431272: retq
>
> ffffffff80460663 <tcp_gro_receive>: /* tcp_gro_receive total: 396796 3.8083 */
> 4433 0.0425 :ffffffff80460663: push %r15
> 2204 0.0212 :ffffffff80460665: push %r14
> :ffffffff80460667: mov %rdi,%r14
> :ffffffff8046066a: push %r13
> 2275 0.0218 :ffffffff8046066c: push %r12
> :ffffffff8046066e: mov %rsi,%r12
> :ffffffff80460671: mov $0x14,%esi
> 5933 0.0569 :ffffffff80460676: mov %r12,%rdi
> :ffffffff80460679: push %rbp
> :ffffffff8046067a: push %rbx
> 2180 0.0209 :ffffffff8046067b: sub $0x8,%rsp
> :ffffffff8046067f: callq ffffffff804357a1
> <skb_gro_header>
> :ffffffff80460684: test %rax,%rax
> 3218 0.0309 :ffffffff80460687: je ffffffff804607ed
> <tcp_gro_receive+0x18a>
> :ffffffff8046068d: mov 0xc(%rax),%al
> 1 9.6e-06 :ffffffff80460690: shr $0x4,%al
> 3528 0.0339 :ffffffff80460693: movzbl %al,%eax
> :ffffffff80460696: lea 0x0(,%rax,4),%r13d
> 1 9.6e-06 :ffffffff8046069e: cmp $0x13,%r13d
> 2773 0.0266 :ffffffff804606a2: jbe ffffffff804607ed
> <tcp_gro_receive+0x18a>
> :ffffffff804606a8: mov %r13d,%esi
> :ffffffff804606ab: mov %r12,%rdi
> 3327 0.0319 :ffffffff804606ae: callq ffffffff804357a1
> <skb_gro_header>
> :ffffffff804606b3: test %rax,%rax
> 2094 0.0201 :ffffffff804606b6: mov %rax,%r8
> :ffffffff804606b9: je ffffffff804607ed
> <tcp_gro_receive+0x18a>
> :ffffffff804606bf: lea 0x38(%r12),%r15
> 2245 0.0215 :ffffffff804606c4: add %r13d,(%r15)
> :ffffffff804606c7: mov 0x68(%r12),%ebp
> :ffffffff804606cc: sub 0x38(%r12),%ebp
> 2394 0.0230 :ffffffff804606d1: mov 0xc(%rax),%ebx
> :ffffffff804606d4: jmp ffffffff80460710
> <tcp_gro_receive+0xad>
> 2111 0.0203 :ffffffff804606d6: lea 0x38(%rdi),%r9
> 3 2.9e-05 :ffffffff804606da: cmpl $0x0,0x4(%r9)
> 21 2.0e-04 :ffffffff804606df: je ffffffff8046070d
> <tcp_gro_receive+0xaa>
> 2592 0.0249 :ffffffff804606e1: mov 0xa8(%rdi),%eax
> :ffffffff804606e7: mov 0xc0(%rdi),%r10
> :ffffffff804606ee: mov 0x2(%r8),%dx
> 2440 0.0234 :ffffffff804606f3: lea (%r10,%rax,1),%rcx
> :ffffffff804606f7: mov (%r8),%eax
> 1 9.6e-06 :ffffffff804606fa: xor 0x2(%rcx),%dx
> 6275 0.0602 :ffffffff804606fe: xor (%rcx),%eax
> 3 2.9e-05 :ffffffff80460700: or %ax,%dx
> :ffffffff80460703: je ffffffff8046071d
> <tcp_gro_receive+0xba>
> :ffffffff80460705: movl $0x0,0x4(%r9)
> :ffffffff8046070d: mov %rdi,%r14
> 2920 0.0280 :ffffffff80460710: mov (%r14),%rdi
> 18 1.7e-04 :ffffffff80460713: test %rdi,%rdi
> 2 1.9e-05 :ffffffff80460716: jne ffffffff804606d6
> <tcp_gro_receive+0x73>
> 33 3.2e-04 :ffffffff80460718: jmpq ffffffff80460807
> <tcp_gro_receive+0x1a4>
> 4253 0.0408 :ffffffff8046071d: mov 0xe(%r8),%ax
> 2125 0.0204 :ffffffff80460722: xor 0xe(%rcx),%ax
> 2 1.9e-05 :ffffffff80460726: mov %ebx,%edx
> :ffffffff80460728: and $0x8000,%edx
> 8066 0.0774 :ffffffff8046072e: or 0x8(%r9),%edx
> :ffffffff80460732: movzwl %ax,%esi
> :ffffffff80460735: mov 0x8(%r8),%eax
> 64740 0.6214 :ffffffff80460739: xor 0x8(%rcx),%eax
> :ffffffff8046073c: or %eax,%esi
> :ffffffff8046073e: mov %ebx,%eax
> 2084 0.0200 :ffffffff80460740: xor 0xc(%rcx),%eax
> :ffffffff80460743: and $0x76,%ah
> :ffffffff80460746: or %eax,%edx
> 2132 0.0205 :ffffffff80460748: or %edx,%esi
> :ffffffff8046074a: mov $0x14,%edx
> :ffffffff8046074f: jmp ffffffff8046075e
> <tcp_gro_receive+0xfb>
> :ffffffff80460751: movslq %edx,%rax
> :ffffffff80460754: add $0x4,%edx
> :ffffffff80460757: mov (%r8,%rax,1),%esi
> :ffffffff8046075b: xor (%rcx,%rax,1),%esi
> 3670 0.0352 :ffffffff8046075e: test %esi,%esi
> 2162 0.0208 :ffffffff80460760: jne ffffffff80460767
> <tcp_gro_receive+0x104>
> :ffffffff80460762: cmp %r13d,%edx
> 1 9.6e-06 :ffffffff80460765: jb ffffffff80460751
> <tcp_gro_receive+0xee>
> 50209 0.4819 :ffffffff80460767: mov 0xb8(%rdi),%eax
> 4473 0.0429 :ffffffff8046076d: mov 0x4(%rcx),%edx
> :ffffffff80460770: bswap %edx
> 9554 0.0917 :ffffffff80460772: mov 0x4(%r8),%ecx
> :ffffffff80460776: bswap %ecx
> :ffffffff80460778: movzwl 0x6(%r10,%rax,1),%r13d
> 7572 0.0727 :ffffffff8046077e: mov 0x68(%rdi),%eax
> :ffffffff80460781: sub 0x38(%rdi),%eax
> :ffffffff80460784: add %edx,%eax
> 9803 0.0941 :ffffffff80460786: xor %eax,%ecx
> :ffffffff80460788: cmp %r13d,%ebp
> :ffffffff8046078b: seta %al
> 50608 0.4857 :ffffffff8046078e: test %ebp,%ebp
> :ffffffff80460790: sete %dl
> :ffffffff80460793: or %edx,%eax
> 3161 0.0303 :ffffffff80460795: movzbl %al,%eax
> :ffffffff80460798: or %eax,%esi
> :ffffffff8046079a: or %esi,%ecx
> 3278 0.0315 :ffffffff8046079c: jne ffffffff804607f6
> <tcp_gro_receive+0x193>
> :ffffffff8046079e: mov %r12,%rsi
> 2 1.9e-05 :ffffffff804607a1: mov %r14,%rdi
> 2579 0.0248 :ffffffff804607a4: callq ffffffff80430ea9
> <skb_gro_receive>
> 2059 0.0198 :ffffffff804607a9: test %eax,%eax
> 49 4.7e-04 :ffffffff804607ab: jne ffffffff804607f6
> <tcp_gro_receive+0x193>
> :ffffffff804607ad: mov (%r14),%rcx
> 1945 0.0187 :ffffffff804607b0: mov %ebx,%edx
> 3 2.9e-05 :ffffffff804607b2: and $0x900,%edx
> :ffffffff804607b8: mov 0xa8(%rcx),%eax
> 2530 0.0243 :ffffffff804607be: add 0xc0(%rcx),%rax
> 3 2.9e-05 :ffffffff804607c5: or %edx,0xc(%rax)
> 13 1.2e-04 :ffffffff804607c8: xor %eax,%eax
> 4881 0.0468 :ffffffff804607ca: cmp %r13d,%ebp
> :ffffffff804607cd: setb %al
> :ffffffff804607d0: and $0x2f00,%ebx
> 1912 0.0184 :ffffffff804607d6: or %ebx,%eax
> :ffffffff804607d8: test %rcx,%rcx
> :ffffffff804607db: je ffffffff80460816
> <tcp_gro_receive+0x1b3>
> 2163 0.0208 :ffffffff804607dd: cmpl $0x0,0x4(%r15)
> 136 0.0013 :ffffffff804607e2: je ffffffff804607e8
> <tcp_gro_receive+0x185>
> 2455 0.0236 :ffffffff804607e4: test %eax,%eax
> 57 5.5e-04 :ffffffff804607e6: je ffffffff80460816
> <tcp_gro_receive+0x1b3>
> 148 0.0014 :ffffffff804607e8: mov %r14,%rdi
> 735 0.0071 :ffffffff804607eb: jmp ffffffff80460818
> <tcp_gro_receive+0x1b5>
> :ffffffff804607ed: xor %edi,%edi
> :ffffffff804607ef: mov $0x1,%eax
> :ffffffff804607f4: jmp ffffffff80460818
> <tcp_gro_receive+0x1b5>
> 68 6.5e-04 :ffffffff804607f6: xor %eax,%eax
> 1 9.6e-06 :ffffffff804607f8: test %ebp,%ebp
> 67 6.4e-04 :ffffffff804607fa: sete %al
> 47 4.5e-04 :ffffffff804607fd: and $0x2f00,%ebx
> :ffffffff80460803: or %ebx,%eax
> 58 5.6e-04 :ffffffff80460805: jmp ffffffff804607dd
> <tcp_gro_receive+0x17a>
> 122 0.0012 :ffffffff80460807: xor %eax,%eax
> 9 8.6e-05 :ffffffff80460809: test %ebp,%ebp
> :ffffffff8046080b: sete %al
> 67 6.4e-04 :ffffffff8046080e: and $0x2f00,%ebx
> 6 5.8e-05 :ffffffff80460814: or %ebx,%eax
> 1995 0.0191 :ffffffff80460816: xor %edi,%edi
> 68 6.5e-04 :ffffffff80460818: or %eax,0x40(%r12)
> 275 0.0026 :ffffffff8046081d: mov %rdi,%rax
> 2037 0.0196 :ffffffff80460820: pop %r11
> 191 0.0018 :ffffffff80460822: pop %rbx
> 4346 0.0417 :ffffffff80460823: pop %rbp
> 4739 0.0455 :ffffffff80460824: pop %r12
> 167 0.0016 :ffffffff80460826: pop %r13
> 23735 0.2278 :ffffffff80460828: pop %r14
> 56070 0.5381 :ffffffff8046082a: pop %r15
> 140 0.0013 :ffffffff8046082c: retq
>
> ffffffff804357a1 <skb_gro_header>: /* skb_gro_header total: 319455 3.0660 */
> 13604 0.1306 :ffffffff804357a1: push %rbp
> 14938 0.1434 :ffffffff804357a2: push %rbx
> :ffffffff804357a3: mov %rdi,%rbx
> :ffffffff804357a6: sub $0x8,%rsp
> 18392 0.1765 :ffffffff804357aa: mov 0x38(%rdi),%ebp
> :ffffffff804357ad: mov 0x68(%rdi),%edx
> 1 9.6e-06 :ffffffff804357b0: add %ebp,%esi
> 20559 0.1973 :ffffffff804357b2: mov %edx,%edi
> :ffffffff804357b4: sub 0x6c(%rbx),%edi
> :ffffffff804357b7: jne ffffffff804357cc
> <skb_gro_header+0x2b>
> 36626 0.3515 :ffffffff804357b9: mov 0xb8(%rbx),%ecx
> 2 1.9e-05 :ffffffff804357bf: mov 0xc0(%rbx),%rax
> 3 2.9e-05 :ffffffff804357c6: cmp %esi,0x3c(%rax,%rcx,1)
> 18577 0.1783 :ffffffff804357ca: jae ffffffff804357ee
> <skb_gro_header+0x4d>
> :ffffffff804357cc: cmp %edi,%esi
> :ffffffff804357ce: jbe ffffffff804357e3
> <skb_gro_header+0x42>
> :ffffffff804357d0: cmp %edx,%esi
> :ffffffff804357d2: ja ffffffff80435833
> <skb_gro_header+0x92>
> :ffffffff804357d4: sub %edi,%esi
> :ffffffff804357d6: mov %rbx,%rdi
> :ffffffff804357d9: callq ffffffff8042f6ee
> <__pskb_pull_tail>
> :ffffffff804357de: test %rax,%rax
> :ffffffff804357e1: je ffffffff80435833
> <skb_gro_header+0x92>
> :ffffffff804357e3: mov %ebp,%eax
> :ffffffff804357e5: add 0xc8(%rbx),%rax
> :ffffffff804357ec: jmp ffffffff80435835
> <skb_gro_header+0x94>
> 3 2.9e-05 :ffffffff804357ee: add 0xc0(%rbx),%rcx
> 25999 0.2495 :ffffffff804357f5: mov $0x1e0000000000,%rax
> :ffffffff804357ff: mov $0x6db6db6db6db6db7,%rdx
OK, sizeof(struct page) is 0x38 (56 bytes); we know this hurts some workloads.
It would be better to waste a few bytes and align them on cache lines here.
> 44557 0.4276 :ffffffff80435809: add 0x30(%rcx),%rax
> :ffffffff8043580d: sar $0x3,%rax
> 12588 0.1208 :ffffffff80435811: imul %rdx,%rax
> 10104 0.0970 :ffffffff80435815: mov $0xffff880000000000,%rdx
> :ffffffff8043581f: shl $0xc,%rax
> :ffffffff80435823: add %rdx,%rax
> 16404 0.1574 :ffffffff80435826: mov 0x38(%rcx),%edx
> :ffffffff80435829: add %rdx,%rax
> :ffffffff8043582c: mov %ebp,%edx
> 15264 0.1465 :ffffffff8043582e: add %rdx,%rax
> :ffffffff80435831: jmp ffffffff80435835
> <skb_gro_header+0x94>
> :ffffffff80435833: xor %eax,%eax
> 45844 0.4400 :ffffffff80435835: pop %r10
> 2 1.9e-05 :ffffffff80435837: pop %rbx
> 12844 0.1233 :ffffffff80435838: pop %rbp
> 13144 0.1262 :ffffffff80435839: retq
>
I wonder if you could try to enlarge 'struct page' by 8 bytes and redo a test...
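For reference, a small userspace sketch of what the skb_gro_header listing shows. Since 0x38 = 56 = 8 * 7, turning a struct page pointer into a page index is an exact division by 56, which the compiler emits as "sar $0x3" followed by an imul with 0x6db6db6db6db6db7 (the multiplicative inverse of 7 modulo 2^64). With a 64-byte struct page the same conversion is a single shift, and no element straddles a cache line. The function names below are hypothetical and a 64-byte cache line is assumed; this is an illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of 7 modulo 2^64, the constant seen in the listing. */
#define INV7 UINT64_C(0x6db6db6db6db6db7)

/* Assumed cache line size for this sketch. */
#define CACHE_LINE 64

/*
 * Index of a 56-byte element from its byte offset: shift out the
 * factor 8, then divide exactly by 7 by multiplying with its inverse.
 */
static inline uint64_t index_56(uint64_t byte_off)
{
	assert(byte_off % 56 == 0);
	return (byte_off >> 3) * INV7;
}

/* Same conversion for a 64-byte element: a single shift. */
static inline uint64_t index_64(uint64_t byte_off)
{
	assert(byte_off % 64 == 0);
	return byte_off >> 6;
}

/* Does element 'index' of an array of 'elem_size'-byte items cross a line? */
static inline int straddles_cache_line(uint64_t index, uint64_t elem_size)
{
	uint64_t start = index * elem_size;
	uint64_t end = start + elem_size - 1;

	return (start / CACHE_LINE) != (end / CACHE_LINE);
}
```

With 56-byte elements, every element whose start offset modulo 64 exceeds 8 crosses a line boundary; with 64-byte elements (and a line-aligned array) none do.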
Here is a patch combining the two ideas. But I guess it won't make GRO much faster :(
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0e80e26..44e97e2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -98,6 +98,7 @@ struct page {
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
unsigned long debug_flags; /* Use atomic bitops on this */
#endif
+ unsigned long _pad; /* so that sizeof(struct page) is 64 bytes */
};
/*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ce6356c..74a6900 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2660,28 +2660,37 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
struct sk_buff *nskb;
unsigned int headroom;
unsigned int len = skb_gro_len(skb);
+ int delta;
+ struct skb_shared_info *skb_shinfo_p = skb_shinfo(p);
if (p->len + len >= 65536)
return -E2BIG;
- if (skb_shinfo(p)->frag_list)
+ delta = skb_gro_offset(skb) - skb_headlen(skb);
+ if (skb_shinfo_p->frag_list)
goto merge;
- else if (skb_headlen(skb) <= skb_gro_offset(skb)) {
- if (skb_shinfo(p)->nr_frags + skb_shinfo(skb)->nr_frags >
+ if (delta >= 0) {
+ struct skb_shared_info *skb_shinfo_skb = skb_shinfo(skb);
+
+ if (skb_shinfo_p->nr_frags + skb_shinfo_skb->nr_frags >
MAX_SKB_FRAGS)
return -E2BIG;
- skb_shinfo(skb)->frags[0].page_offset +=
- skb_gro_offset(skb) - skb_headlen(skb);
- skb_shinfo(skb)->frags[0].size -=
- skb_gro_offset(skb) - skb_headlen(skb);
-
- memcpy(skb_shinfo(p)->frags + skb_shinfo(p)->nr_frags,
- skb_shinfo(skb)->frags,
- skb_shinfo(skb)->nr_frags * sizeof(skb_frag_t));
+ skb_shinfo_skb->frags[0].page_offset += delta;
+ skb_shinfo_skb->frags[0].size -= delta;
- skb_shinfo(p)->nr_frags += skb_shinfo(skb)->nr_frags;
- skb_shinfo(skb)->nr_frags = 0;
+ if (likely(skb_shinfo_skb->nr_frags == 1)) {
+ memcpy(skb_shinfo_p->frags + skb_shinfo_p->nr_frags,
+ skb_shinfo_skb->frags,
+ sizeof(skb_frag_t));
+ skb_shinfo_p->nr_frags += 1;
+ } else {
+ memcpy(skb_shinfo_p->frags + skb_shinfo_p->nr_frags,
+ skb_shinfo_skb->frags,
+ skb_shinfo_skb->nr_frags * sizeof(skb_frag_t));
+ skb_shinfo_p->nr_frags += skb_shinfo_skb->nr_frags;
+ }
+ skb_shinfo_skb->nr_frags = 0;
skb->truesize -= skb->data_len;
skb->len -= skb->data_len;
@@ -2726,12 +2735,11 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
p = nskb;
+ delta = skb_gro_offset(skb) - skb_headlen(skb);
merge:
- if (skb_gro_offset(skb) > skb_headlen(skb)) {
- skb_shinfo(skb)->frags[0].page_offset +=
- skb_gro_offset(skb) - skb_headlen(skb);
- skb_shinfo(skb)->frags[0].size -=
- skb_gro_offset(skb) - skb_headlen(skb);
+ if (delta > 0) {
+ skb_shinfo(skb)->frags[0].page_offset += delta;
+ skb_shinfo(skb)->frags[0].size -= delta;
skb_gro_reset_offset(skb);
skb_gro_pull(skb, skb_headlen(skb));
}
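The two ideas in the skb_gro_receive() part of the patch can be sketched in userspace as follows: evaluate the shared-info pointer once per skb and keep it in a local, and give the likely single-fragment case a fixed-size copy so the compiler need not emit a variable-length rep movsb. All mock_* types and functions are hypothetical stand-ins with illustrative layouts; only the 16-byte fragment size and the limit of 18 fragments are taken from the listing above ("shl $0x4" and "cmp $0x12").

```c
#include <assert.h>
#include <string.h>

#define MOCK_MAX_FRAGS 18

/* Stand-in for skb_frag_t: 16 bytes on a 64-bit build. */
struct mock_frag {
	void *page;
	unsigned int page_offset;
	unsigned int size;
};

/* Stand-in for struct skb_shared_info (only the fields used here). */
struct mock_shinfo {
	unsigned short nr_frags;
	struct mock_frag frags[MOCK_MAX_FRAGS];
};

/* Stand-in for struct sk_buff: shared info lives at head + end. */
struct mock_skb {
	unsigned char *head;
	unsigned int end;
};

static inline struct mock_shinfo *mock_skb_shinfo(const struct mock_skb *skb)
{
	return (struct mock_shinfo *)(skb->head + skb->end);
}

/* Append skb's fragments to p's. Returns 0, or -1 if they do not fit. */
static inline int mock_merge_frags(struct mock_skb *p, struct mock_skb *skb)
{
	/* Computed once; the stores below cannot force a reload of head/end. */
	struct mock_shinfo *pi = mock_skb_shinfo(p);
	struct mock_shinfo *si = mock_skb_shinfo(skb);

	if (pi->nr_frags + si->nr_frags > MOCK_MAX_FRAGS)
		return -1;

	if (si->nr_frags == 1) {
		/* Likely case: one fragment, fixed-size struct copy. */
		pi->frags[pi->nr_frags] = si->frags[0];
		pi->nr_frags += 1;
	} else {
		memcpy(pi->frags + pi->nr_frags, si->frags,
		       si->nr_frags * sizeof(si->frags[0]));
		pi->nr_frags += si->nr_frags;
	}
	si->nr_frags = 0;
	return 0;
}
```

Whether a real compiler generates better code for the kernel function depends on its version and flags; the sketch only shows the shape of the change.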
* Re: [PATCH] myr10ge: again fix lro_gen_skb() alignment
2009-04-30 8:17 ` Eric Dumazet
@ 2009-04-30 19:14 ` Andrew Gallatin
0 siblings, 0 replies; 41+ messages in thread
From: Andrew Gallatin @ 2009-04-30 19:14 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Herbert Xu, David Miller, brice, sgruszka, netdev
Eric Dumazet wrote:
>
>
> I wonder if you could try to enlarge 'struct page' by 8 bytes and redo a test...
>
> Here is a patch to combine two ideas. But it won't allow GRO to go much faster I guess :(
The patch seems to help both GRO and LRO a little with timestamps disabled,
but seems to hurt a little with them enabled. I don't pretend to understand why:
LRO:
87380 65536 65536 60.00 8279.36 8.10 77.55 0.160 1.535
LRO + patch:
87380 65536 65536 60.01 7897.51 7.45 74.92 0.155 1.554
LRO + timestamp disable:
87380 65536 65536 60.02 7753.55 8.01 74.06 0.169 1.565
LRO + patch + timestamp disable:
87380 65536 65536 60.01 7915.63 7.74 74.57 0.160 1.544
GRO:
87380 65536 65536 60.00 8053.19 7.86 85.47 0.160 1.739
GRO + patch:
87380 65536 65536 60.00 7910.02 7.69 85.86 0.159 1.778
GRO + timestamp disable:
87380 65536 65536 60.02 7535.12 7.27 84.57 0.158 1.839
GRO + timestamp disable + patch:
87380 65536 65536 60.02 7735.26 7.92 83.68 0.168 1.772
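To put the deltas in proportion, the relative change can be computed from the figures above. This is a minimal sketch that assumes the fifth column is netperf's throughput in 10^6 bits/s (the header line is not shown in this message); the struct and function names are made up for the illustration.

```c
#include <assert.h>

/* One baseline/patched pair, throughput taken from the fifth column above. */
struct run {
	const char *name;
	double baseline;
	double patched;
};

const struct run runs[] = {
	{ "LRO",                      8279.36, 7897.51 },
	{ "LRO, timestamps disabled", 7753.55, 7915.63 },
	{ "GRO",                      8053.19, 7910.02 },
	{ "GRO, timestamps disabled", 7535.12, 7735.26 },
};

/* Relative change, in percent, from baseline to patched throughput. */
static inline double pct_change(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

On these single runs the patch moves throughput by roughly -4.6% (LRO), +2.1% (LRO, timestamps disabled), -1.8% (GRO) and +2.7% (GRO, timestamps disabled); run-to-run variance is not known from this data.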
Drew