Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: rps perfomance WAS(Re: rps: question
From: Rick Jones @ 2010-04-15 16:41 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, eric.dumazet, therbert, netdev, robert, xiaosuo,
	andi
In-Reply-To: <1271332528.4567.150.camel@bigi>

> 
> I speculate again that it may be too costly to run rps on something like
> a tigerton or intel clovertown where you have cores sharing/contending
> for an FSB. If I can get answers to the question: "What h/ware are
> people running?" i could be proven wrong.
> [Note: I am not against RPS - i think it has its place; so i hope my
> desire to find out when to use rps doesnt show as hostility towards
> rps.]

IPS (~= RPS) was running on shared FSB HP9000's.  Now, that was also a BSD 
networking stack with netisrq's and the like.  TOPS (~= RFS) was also run on 
shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems.  TOPS was 
implemented in a Streams-based stack tracing its history to a common ancestor 
with Solaris (Mentat).

rick jones

^ permalink raw reply

* [RFC][PATCH] xfrm6 refcnt problem in bundle creation
From: Nicolas Dichtel @ 2010-04-15 16:32 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1383 bytes --]

Hi all,

I got a ref count problem in xfrm IPv6 part, but I don't really know what is the 
best way to fix it.

When xfrm6_fill_dst() is called, a dev is given as parameter:

static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
                           struct flowi *fl)
{
         struct rt6_info *rt = (struct rt6_info*)xdst->route;

         xdst->u.dst.dev = dev;
         dev_hold(dev);

         xdst->u.rt6.rt6i_idev = in6_dev_get(rt->u.dst.dev);
         if (!xdst->u.rt6.rt6i_idev)
                 return -ENODEV;
[snip]

In my case, dev points to an ethernet device and the route (rt->u.dst.dev) 
points to a tunnel interface (ip6 over ip6). This function will get a ref on the 
idev of the tunnel (xdst->u.rt6.rt6i_idev = in6_dev_get(rt->u.dst.dev)), but dev 
of the dst is set to the ethernet interface (xdst->u.dst.dev = dev).
After, when we try to remove the tunnel interface, the xfrm gc function will 
never check rt6i_idev, it will only check u.dst.dev, hence it will not remove 
the dst.
The consequence is that the interface cannot be removed.

IPv4 code takes the same dev to get idev, rather than using rt->u.dst.dev. Is it 
right to do the same in IPv6?
A proposal patch is attached.

Code, before the patch of the bundle creation merge, takes 'rt->u.dst.dev' to 
get idev and to set dst.dev.

Suggestions are welcome.

Regards,
Nicolas

[-- Attachment #2: 0001-xfrm6-ensure-to-use-the-same-dev-when-building-a-bu.patch --]
[-- Type: text/x-diff, Size: 950 bytes --]

>From 80432d47369925d4e9e38bcb1068ebf923de3a8f Mon Sep 17 00:00:00 2001
From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 15 Apr 2010 18:27:30 +0200
Subject: [PATCH] xfrm6: ensure to use the same dev when building a bundle

When building a bundle, we set dst.dev and rt6.rt6i_idev.
We must ensure to set the same device for both fields.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/xfrm6_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ae18165..00bf7c9 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -124,7 +124,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.dst.dev = dev;
 	dev_hold(dev);

-	xdst->u.rt6.rt6i_idev = in6_dev_get(rt->u.dst.dev);
+	xdst->u.rt6.rt6i_idev = in6_dev_get(dev);
 	if (!xdst->u.rt6.rt6i_idev)
 		return -ENODEV;

-- 
1.5.4.5

^ permalink raw reply related

* Poor localhost net performance on recent stable kernel
From: Kelly Burkhart @ 2010-04-15 15:44 UTC (permalink / raw)
  To: netdev, linux-kernel

Hello,

While working on upgrading distributions, I've noticed that local
network communication is much slower on 2.6.33.2 than on our old
kernel 2.6.16.60 (sles 10.2).

Results of netperf, UDP_RR against localhost I get around 150000 tps
on the new kernel vs. 290000 tps with the old kernel.  The netperf
command:

netperf -T 1 -H 127.0.0.1 -t UDP_RR -c -C -- -r 100

TCP_RR had similar results.  The problem did not exist with TCP_STREAM.

While trying to track this down, I wrote a test program that writes
then reads a 32 bit integer to a pipe:

static void tst_pipe0( int sleep_us )
{
    int pipefd[2];
    int idx;
    uint32_t tarr[ITERS];

    printf("tst_pipe0 -- sleep %dus\n", sleep_us);

    if (pipe(pipefd) < 0)
        err_exit("pipe");

    for(idx=0; idx<ITERS; ++idx) {
        uint32_t btsc;
        uint32_t rtsc;
        uint32_t etsc;
        get_tscl(btsc);
        write(pipefd[1], (char *)&btsc, sizeof(btsc));
        read(pipefd[0], (char *)&rtsc, sizeof(rtsc));
        get_tscl(etsc);
        tarr[idx] = etsc-btsc;
        do_sleep(sleep_us);
    }
    prt_avg(tarr, ITERS);
    close(pipefd[0]);
    close(pipefd[1]);
    printf("\n");
}

There's a dramatic difference if there's a sleep between iterations on
the new kernel.  On the old kernel the write/read round trip takes
1100-1300 cycles with or without sleep.  On the new kernel, with no
sleep the round trip is about 1400 cycles.  It doubles with a 1us
sleep then gradually increases to 12000-14000 cycles then stabilizes
as I increase the sleep time to 1500us.  I'm not sure if this is
related to the netperf difference or is a completely different
scheduling issue.

I'm running on an Intel Xeon X5570 @ 2.93GHz.  Different tick/notick,
preemption, HZ kernel config option values doesn't substantially change
the magnitude of the difference.

Does anyone have any ideas regarding what could be causing the netperf
issue?  And is the pipe microbenchmark meaningful and if so what does
it mean?

Thanks,

-Kelly

^ permalink raw reply

* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Arnd Bergmann @ 2010-04-15 15:06 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu,
	davem@davemloft.net, jdike@linux.intel.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C18969A5@shzsmsx502.ccr.corp.intel.com>

On Thursday 15 April 2010, Xin, Xiaohui wrote:
> 
> >It seems that you are duplicating a lot of functionality that
> >is already in macvtap. I've asked about this before but then
> >didn't look at your newer versions. Can you explain the value
> >of introducing another interface to user land?
> 
> >I'm still planning to add zero-copy support to macvtap,
> >hopefully reusing parts of your code, but do you think there
> >is value in having both?
> 
> I have not looked into your macvtap code in detail before.
> Does the two interface exactly the same? We just want to create a simple
> way to do zero-copy. Now it can only support vhost, but in future
> we also want it to support directly read/write operations from user space too.

Right now, the features are mostly distinct. Macvtap first of all provides
a "tap" style interface for users, and can also be used by vhost-net.
It also provides a way to share a NIC among a number of guests by software,
though I indent to add support for VMDq and SR-IOV as well. Zero-copy
is also not yet done in macvtap but should be added.

mpassthru right now does not allow sharing a NIC between guests, and
does not have a tap interface for non-vhost operation, but does the
zero-copy that is missing in macvtap.

> Basically, compared to the interface, I'm more worried about the modification
> to net core we have made to implement zero-copy now. If this hardest part
> can be done, then any user space interface modifications or integrations are 
> more easily to be done after that.

I agree that the network stack modifications are the hard part for zero-copy,
and your work on that looks very promising and is complementary to what I've
done with macvtap. Your current user interface looks good for testing this out,
but I think we should not merge it (the interface) upstream if we can get the
same or better result by integrating your buffer management code into macvtap.

I can try to merge your code into macvtap myself if you agree, so you
can focus on getting the internals right.

> >Not sure what I'm missing, but who calls the vq->receiver? This seems
> >to be neither in the upstream version of vhost nor introduced by your
> >patch.
> 
> See Patch v3 2/3 I have sent out, it is called by handle_rx() in vhost.

Ok, I see. As a general rule, it's preferred to split a patch series
in a way that makes it possible to apply each patch separately and still
get a working kernel, ideally with more features than the version before
the patch. I believe you could get there by reordering your patches to
make the actual driver the last one in the series.

Not a big problem though, I was mostly looking in the wrong place.

> >> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> >> +
> >> +		ret = -EBUSY;
> >> +
> >> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
> >> +			break;
> 
> >Your current use of the IFF_MPASSTHRU* flags does not seem to make
> >any sense whatsoever. You check that this flag is never set, but set
> >it later yourself and then ignore all flags.
> 
> Using that flag is tried to prevent if another one wants to bind the same device
> Again. But I will see if it really ignore all other flags.

The ifr variable is on the stack of the mp_chr_ioctl function, and you never
look at the value after setting it. In order to prevent multiple opens
of that device, you probably need to lock out any other users as well,
and make it a property of the underlying device. E.g. you also want to
prevent users on the host from setting an IP address on the NIC and
using it to send and receive data there.

	Arnd

^ permalink raw reply

* Re: "kernel:nf_ct_icmp: bad HW ICMP checksum" too noisy
From: Patrick McHardy @ 2010-04-15 14:34 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: netdev, Netfilter Development Mailinglist
In-Reply-To: <m31vegztav.fsf@ursa.amorsen.dk>

Benny Amorsen wrote:
> Would it be possible to lower the log level of "kernel:nf_ct_icmp: bad
> HW ICMP checksum" so that they don't show up on the console? Obviously I
> could configure rsyslog to never send anything to the console, but then
> I might miss something which is actually critical.

Yeah, I guess defaulting to KERN_EMERG wasn't the best choice :)
I'll lower it to something reasonable - I guess KERN_NOTICE would
be appropriate.

> It's obviously nice to know about corrupted ICMP on a controlled LAN,
> but on the open Internet you can't really do anything about it.

You should only see that message when nf_conntrack_log_invalid is
active.

^ permalink raw reply

* "kernel:nf_ct_icmp: bad HW ICMP checksum" too noisy
From: Benny Amorsen @ 2010-04-15 14:25 UTC (permalink / raw)
  To: netdev

Would it be possible to lower the log level of "kernel:nf_ct_icmp: bad
HW ICMP checksum" so that they don't show up on the console? Obviously I
could configure rsyslog to never send anything to the console, but then
I might miss something which is actually critical.

It's obviously nice to know about corrupted ICMP on a controlled LAN,
but on the open Internet you can't really do anything about it.

/Benny

^ permalink raw reply

* Re: Strange packet drops with heavy firewalling
From: Eric Dumazet @ 2010-04-15 13:42 UTC (permalink / raw)
  To: Benny Amorsen; +Cc: Changli Gao, zhigang gong, netdev
In-Reply-To: <m3633szw61.fsf@ursa.amorsen.dk>

Le jeudi 15 avril 2010 à 15:23 +0200, Benny Amorsen a écrit :
> Benny Amorsen <benny+usenet@amorsen.dk> writes:
> 
> > I'll keep monitoring the server, and if it starts dropping packets again
> > or load increases I'll check whether irqbalanced does the right thing,
> > and if not I'll implement your suggestion.
> 
> It did start dropping packets (although very few, a few packets dropped
> at once perhaps every ten minutes). Irqbalanced didn't move the
> interrupts.
> 
> Doing
> 
> echo 01 >/proc/irq/99/smp_affinity
> echo 02 >/proc/irq/100/smp_affinity
> echo 04 >/proc/irq/101/smp_affinity
> 
> and so on like Erik Dumazet suggested seems to have helped, but not
> entirely solved the problem.
> 
> The problem now manifests itself this way in ethtool -S:
>      rx_no_buffer_count: 270
>      rx_queue_drop_packet_count: 270
> 
> I can't be sure that I'm not just getting hit by a 1Gbps traffic spike,
> of course, but it is a bit strange that a machine which can do 200Mbps
> at 92% idle can't handle subsecond peaks close to 1Gbps...
> 

Even with multiqueue, its quite possible one queue gets more than one
packet per micro second. Time to process a packet might be greater then
1 us even on recent hardware. So bursts of 1000 small packets with same
flow information, hit one queue, one cpu, and fill rx ring.

Loosing these packets is OK, its very likely its an attack :)

> I wish ifstat could report errors so I could see what the traffic rate
> was when the problem occurred...

yes, it could be added I guess.



^ permalink raw reply

* Re: Strange packet drops with heavy firewalling
From: Benny Amorsen @ 2010-04-15 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, zhigang gong, netdev
In-Reply-To: <m3hbnfzsye.fsf@ursa.amorsen.dk>

Benny Amorsen <benny+usenet@amorsen.dk> writes:

> I'll keep monitoring the server, and if it starts dropping packets again
> or load increases I'll check whether irqbalanced does the right thing,
> and if not I'll implement your suggestion.

It did start dropping packets (although very few, a few packets dropped
at once perhaps every ten minutes). Irqbalanced didn't move the
interrupts.

Doing

echo 01 >/proc/irq/99/smp_affinity
echo 02 >/proc/irq/100/smp_affinity
echo 04 >/proc/irq/101/smp_affinity

and so on like Erik Dumazet suggested seems to have helped, but not
entirely solved the problem.

The problem now manifests itself this way in ethtool -S:
     rx_no_buffer_count: 270
     rx_queue_drop_packet_count: 270

I can't be sure that I'm not just getting hit by a 1Gbps traffic spike,
of course, but it is a bit strange that a machine which can do 200Mbps
at 92% idle can't handle subsecond peaks close to 1Gbps...

I wish ifstat could report errors so I could see what the traffic rate
was when the problem occurred...

/Benny

^ permalink raw reply

* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon:  caller is netif_rx
From: Eric Dumazet @ 2010-04-15 13:16 UTC (permalink / raw)
  To: Eric Paris, David Miller; +Cc: netdev, Tom Herbert
In-Reply-To: <1271101251.16881.135.camel@edumazet-laptop>

Le lundi 12 avril 2010 à 21:40 +0200, Eric Dumazet a écrit :
> Good spot, RPS changed a bit netif_rx() requirements.
> 
> I would change ip_dev_loopback_xmit() to call netif_rx_ni() instead...
> 
> David, Tom ?
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c65f18e..d1bcc9f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -120,7 +120,7 @@ static int ip_dev_loopback_xmit(struct sk_buff *newskb)
>  	newskb->pkt_type = PACKET_LOOPBACK;
>  	newskb->ip_summed = CHECKSUM_UNNECESSARY;
>  	WARN_ON(!skb_dst(newskb));
> -	netif_rx(newskb);
> +	netif_rx_ni(newskb);
>  	return 0;
>  }
>  

After some confusion, it seems this was the right fix after all :)

[PATCH] ip: Fix ip_dev_loopback_xmit()

Eric Paris got following trace with a linux-next kernel

[   14.203970] BUG: using smp_processor_id() in preemptible [00000000]
code: avahi-daemon/2093
[   14.204025] caller is netif_rx+0xfa/0x110
[   14.204035] Call Trace:
[   14.204064]  [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
[   14.204070]  [<ffffffff8142163a>] netif_rx+0xfa/0x110
[   14.204090]  [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
[   14.204095]  [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
[   14.204099]  [<ffffffff8145d610>] ip_local_out+0x20/0x30
[   14.204105]  [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
[   14.204119]  [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
[   14.204125]  [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
[   14.204137]  [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
[   14.204149]  [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
[   14.204189]  [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
[   14.204233]  [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b

While current linux-2.6 kernel doesnt emit this warning, bug is latent
and might cause unexpected failures.

ip_dev_loopback_xmit() runs in process context, preemption enabled, so
must call netif_rx_ni() instead of netif_rx(), to make sure that we
process pending software interrupt.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65f18e..d1bcc9f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -120,7 +120,7 @@ static int ip_dev_loopback_xmit(struct sk_buff *newskb)
 	newskb->pkt_type = PACKET_LOOPBACK;
 	newskb->ip_summed = CHECKSUM_UNNECESSARY;
 	WARN_ON(!skb_dst(newskb));
-	netif_rx(newskb);
+	netif_rx_ni(newskb);
 	return 0;
 }
 



^ permalink raw reply related

* [net-next-2.6 PATCH 1/2] ipv6: cancel to setting local_df in ip6_xmit()
From: Shan Wei @ 2010-04-15 13:04 UTC (permalink / raw)
  To: David Miller, Herbert Xu, emils.tantilov
  Cc: kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, eric.dumazet, sri, netdev@vger.kernel.org,
	Shan Wei

commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

So the change of commit 77e2f1(ipv6: Fix ip6_xmit to 
send fragments if ipfragok is true) is not needed. 
So the patch remove them.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 net/ipv6/ip6_output.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 16c4391..f3a847e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -231,10 +231,6 @@ int ip6_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);
 
-	/* Allow local fragmentation. */
-	if (ipfragok)
-		skb->local_df = 1;
-
 	/*
 	 *	Fill in the IPv6 header
 	 */
--
1.6.3.3 

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 12:50 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, Tom Herbert, netdev
In-Reply-To: <s2p412e6f7f1004150532y4b13a0bfgadab3e6f2f4aecd@mail.gmail.com>

On Thu, 2010-04-15 at 20:32 +0800, Changli Gao wrote:

> For historical reason, we use Linux-2.6.18. Our company have several
> products with CPU Xen, P4, or i7. Some of them are SMP, Multi-Core and
> Multi-Threaded. 

Thanks for sharing. How much more can you say? ;-> Do you have a paper
or description of some sort somewhere?

> We use the similar mechanism like dynamic weighted
> RPS. The total throughput is increased nearly linear with the number
> of the worker threads(one worker thread per CPU).

Other than the i7 - have you tried to run rps on on the P4?

cheers,
jamal



^ permalink raw reply

* [PATCH 2/3] ipv4: ipmr: fix invalid cache resolving when adding a non-matching entry
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

The patch to convert struct mfc_cache to list_heads (ipv4: ipmr: convert
struct mfc_cache to struct list_head) introduced a bug when adding new
cache entries that don't match any unresolved entries.

The unres queue is searched for a matching entry, which is then resolved.
When no matching entry is present, the iterator points to the head of the
list, but is treated as a matching entry. Use a seperate variable to
indicate that a matching entry was found.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5df5fd7..0643fb6 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1089,12 +1089,14 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 	 *	Check to see if we resolved a queued list. If so we
 	 *	need to send on the frames and tidy up.
 	 */
+	found = false;
 	spin_lock_bh(&mfc_unres_lock);
 	list_for_each_entry(uc, &mrt->mfc_unres_queue, list) {
 		if (uc->mfc_origin == c->mfc_origin &&
 		    uc->mfc_mcastgrp == c->mfc_mcastgrp) {
 			list_del(&uc->list);
 			atomic_dec(&mrt->cache_resolve_queue_len);
+			found = true;
 			break;
 		}
 	}
@@ -1102,7 +1104,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 		del_timer(&mrt->ipmr_expire_timer);
 	spin_unlock_bh(&mfc_unres_lock);
 
-	if (uc) {
+	if (found) {
 		ipmr_cache_resolve(net, mrt, uc, c);
 		ipmr_cache_free(uc);
 	}
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 0/3]: fixes for multicast routing rules
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev

Hi Dave,

the following three patches fix a few bugs introduced by the multicast routing
rule patches:

- a missing Kconfig dependency on IP_MROUTE

- a bug introduced by the list conversion, causing the list head to be treated
  as an element

- a NULL pointer dereference: the net pointer in ipmr_destroy_unres() was
  initialized to NULL. This patch was actually intended to be folded into
  the patch introducing multicast routing rules, but I missed it when
  rebasing to the current tree.

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6.git master

Thanks!

^ permalink raw reply

* [PATCH 3/3] ipv4: ipmr: fix NULL pointer deref during unres queue destruction
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

Fix an oversight in ipmr_destroy_unres() - the net pointer is
unconditionally initialized to NULL, resulting in a NULL pointer
dereference later on.

Fix by adding a net pointer to struct mr_table and using it in
ipmr_destroy_unres().

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0643fb6..7d8a2bc 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -71,6 +71,9 @@
 
 struct mr_table {
 	struct list_head	list;
+#ifdef CONFIG_NET_NS
+	struct net		*net;
+#endif
 	u32			id;
 	struct sock		*mroute_sk;
 	struct timer_list	ipmr_expire_timer;
@@ -308,6 +311,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 	mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
 	if (mrt == NULL)
 		return NULL;
+	write_pnet(&mrt->net, net);
 	mrt->id = id;
 
 	/* Forwarding cache */
@@ -580,7 +584,7 @@ static inline void ipmr_cache_free(struct mfc_cache *c)
 
 static void ipmr_destroy_unres(struct mr_table *mrt, struct mfc_cache *c)
 {
-	struct net *net = NULL; //mrt->net;
+	struct net *net = read_pnet(&mrt->net);
 	struct sk_buff *skb;
 	struct nlmsgerr *e;
 
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 1/3] ipv4: ipmr: fix IP_MROUTE_MULTIPLE_TABLES Kconfig dependencies
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

IP_MROUTE_MULTIPLE_TABLES should depend on IP_MROUTE.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index be59774..8e3a1fd 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -252,7 +252,7 @@ config IP_MROUTE
 
 config IP_MROUTE_MULTIPLE_TABLES
 	bool "IP: multicast policy routing"
-	depends on IP_ADVANCED_ROUTER
+	depends on IP_MROUTE && IP_ADVANCED_ROUTER
 	select FIB_RULES
 	help
 	  Normally, a multicast router runs a userspace daemon and decides
-- 
1.7.0.4


^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-15 12:32 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, Stephen Hemminger, netdev, robert,
	David Miller, Andi Kleen
In-Reply-To: <1271333428.23780.3.camel@bigi>

On Thu, Apr 15, 2010 at 8:10 PM, jamal <hadi@cyberus.ca> wrote:
> On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:
>
>> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
>> 10Gigabit has 16 queues. It might be good to use less queues according
>> to your results on some workloads, and eventually use RPS on a second
>> layering.
>

For historical reason, we use Linux-2.6.18. Our company have several
products with CPU Xen, P4, or i7. Some of them are SMP, Multi-Core and
Multi-Threaded. We use the similar mechanism like dynamic weighted
RPS. The total throughput is increased nearly linear with the number
of the worker threads(one worker thread per CPU).

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [PATCH] net: small cleanup of lib8390
From: Nikanth Karthikesan @ 2010-04-15 12:21 UTC (permalink / raw)
  To: Paul Gortmaker, David S. Miller; +Cc: netdev, Al Viro, Jeff Garzik

Remove the always true #if 1. Also the unecessary re-test of ei_local->irqlock
and the unreachable printk format string.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

diff --git a/drivers/net/lib8390.c b/drivers/net/lib8390.c
index 56f66f4..6e1dcd3 100644
--- a/drivers/net/lib8390.c
+++ b/drivers/net/lib8390.c
@@ -445,14 +445,14 @@ static irqreturn_t __ei_interrupt(int irq, void *dev_id)
 
 	if (ei_local->irqlock)
 	{
-#if 1 /* This might just be an interrupt for a PCI device sharing this line */
-		/* The "irqlock" check is only for testing. */
-		printk(ei_local->irqlock
-			   ? "%s: Interrupted while interrupts are masked! isr=%#2x imr=%#2x.\n"
-			   : "%s: Reentering the interrupt handler! isr=%#2x imr=%#2x.\n",
+		/*
+		 * This might just be an interrupt for a PCI device sharing
+		 * this line
+		 */
+		printk("%s: Interrupted while interrupts are masked!"
+			   " isr=%#2x imr=%#2x.\n",
 			   dev->name, ei_inb_p(e8390_base + EN0_ISR),
 			   ei_inb_p(e8390_base + EN0_IMR));
-#endif
 		spin_unlock(&ei_local->page_lock);
 		return IRQ_NONE;
 	}

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 12:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271278661.16881.1761.camel@edumazet-laptop>

On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:

> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> 10Gigabit has 16 queues. It might be good to use less queues according
> to your results on some workloads, and eventually use RPS on a second
> layering.

Ok Eric, you seem to be running a system with two Nehalems
interconnected by QPI.
Is there any difference, performance-wise, between redirecting from
coreX to coreY when they are on the same Nehalem vs when you
are going across QPI?

cheers,
jamal


^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 11:55 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, therbert, netdev, robert, xiaosuo, andi
In-Reply-To: <20100415.014857.168270765.davem@davemloft.net>

On Thu, 2010-04-15 at 01:48 -0700, David Miller wrote:

> A single-queue NIC is actually not a requirement, 
> RPS helps also in cases where you have 'N' application threads 
> and N is less than the number of CPUs your multi-queue NIC is 
> distributing traffic to.

sure..

> Moving the bulk of the input packet processing to the cpus where
> the applications actually sit had a non-trivial benefit.  

This is true regardless of rps though. 

> RFS takes this aspect to yet another level.

rfs looks quiet interesting;-> I think with some twist it could be
used with multiqueue nics as well

> I think for the case where application locality is important,
> RPS/RFS can help regardless of cache details.

Generally true, as long as there's not much shared data across the cpus
or the cost of a cache miss is reasonably tolerable. The socket layer
just happens to be not sharing much with ingress packet path and
for a single processor Nehalem, the caching system works so well that
the cost of cache misses is not as an important a variable. Everything
is on the same die including the MM controller etc.
I am speculating (didnt get any answer to the question i asked) that
people running rps use such hardware;->

I speculate again that it may be too costly to run rps on something like
a tigerton or intel clovertown where you have cores sharing/contending
for an FSB. If I can get answers to the question: "What h/ware are
people running?" i could be proven wrong.
[Note: I am not against RPS - i think it has its place; so i hope my
desire to find out when to use rps doesnt show as hostility towards
rps.]

cheers,
jamal

^ permalink raw reply

* Re: NULL pointer dereference panic in stable (2.6.33.2), amd64
From: Eric Dumazet @ 2010-04-15 10:37 UTC (permalink / raw)
  To: Denys Fedorysychenko; +Cc: David Miller, krkumar2, netdev
In-Reply-To: <201004151211.28315.nuclearcat@nuclearcat.com>

Le jeudi 15 avril 2010 à 12:11 +0300, Denys Fedorysychenko a écrit :
> Btw i have application using tun.

Could you add following sanity test to catch the error ?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..b67274a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -988,6 +988,7 @@ static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
+	WARN_ON(index >= dev->num_tx_queues);
 	return &dev->_tx[index];
 }
 



^ permalink raw reply related

* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon: caller is netif_rx
From: Eric Dumazet @ 2010-04-15 10:29 UTC (permalink / raw)
  To: David Miller; +Cc: xiaosuo, therbert, eparis, netdev
In-Reply-To: <20100415.020246.218622820.davem@davemloft.net>

Le jeudi 15 avril 2010 à 02:02 -0700, David Miller a écrit :

> Since we keep coming back to this issue why don't we simply
> solve it forever?  Let's make netif_rx() work in all contexts
> and get rid of netif_rx_ni().
> 
> I think this is the thing to do because this whole netif_rx_ni()
> vs. netif_rx() thing was meant to be an optimization of sorts (this
> goes back to like 8+ years ago :-), and really I doubt it really
> matters on that level any more.
> 
> What do you think?

I was about to come to same idea indeed.

netif_receive_skb() is supposed to be used for modern devices anyway,
avoiding netif_rx() overhead...





^ permalink raw reply

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-15 10:05 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C18969CC@shzsmsx502.ccr.corp.intel.com>

On Thu, Apr 15, 2010 at 05:36:07PM +0800, Xin, Xiaohui wrote:
> 
> Michael,
> >> The idea is simple, just to pin the guest VM user space and then
> >> let host NIC driver has the chance to directly DMA to it. 
> >> The patches are based on vhost-net backend driver. We add a device
> >> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> >> send/recv directly to/from the NIC driver. KVM guest who use the
> >> vhost-net backend may bind any ethX interface in the host side to
> >> get copyless data transfer thru guest virtio-net frontend.
> >> 
> >> The scenario is like this:
> >> 
> >> The guest virtio-net driver submits multiple requests thru vhost-net
> >> backend driver to the kernel. And the requests are queued and then
> >> completed after corresponding actions in h/w are done.
> >> 
> >> For read, user space buffers are dispensed to NIC driver for rx when
> >> a page constructor API is invoked. Means NICs can allocate user buffers
> >> from a page constructor. We add a hook in netif_receive_skb() function
> >> to intercept the incoming packets, and notify the zero-copy device.
> >> 
> >> For write, the zero-copy deivce may allocates a new host skb and puts
> >> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
> >> The request remains pending until the skb is transmitted by h/w.
> >> 
> >> Here, we have ever considered 2 ways to utilize the page constructor
> >> API to dispense the user buffers.
> >> 
> >> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> >> 	structure of sk_buff, and the data pointer is pointing to a 
> >> 	user buffer which is coming from a page constructor API.
> >> 	Then the shinfo of the skb is also from guest.
> >> 	When packet is received from hardware, the skb->data is filled
> >> 	directly by h/w. What we have done is in this way.
> >> 
> >> 	Pros:	We can avoid any copy here.
> >> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> >> 		the same method with the host NIC drivers, say the size
> >> 		of netdev_alloc_skb() and the same reserved space in the
> >> 		head of skb. Many NIC drivers are the same with guest and
> >> 		ok for this. But some lastest NIC drivers reserves special
> >> 		room in skb head. To deal with it, we suggest to provide
> >> 		a method in guest virtio-net driver to ask for parameter
> >> 		we interest from the NIC driver when we know which device 
> >> 		we have bind to do zero-copy. Then we ask guest to do so.
> >> 		Is that reasonable?
> 
> >Unfortunately, this would break compatibility with existing virtio.
> >This also complicates migration. 
> 
> You mean any modification to the guest virtio-net driver will break the
> compatibility? We tried to enlarge the virtio_net_config to contains the
> 2 parameter, and add one VIRTIO_NET_F_PASSTHRU flag, virtionet_probe()
> will check the feature flag, and get the parameters, then virtio-net driver use
> it to allocate buffers. How about this?

This means that we can't, for example, live-migrate between different systems
without flushing outstanding buffers.

> >What is the room in skb head used for?
> I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
> NET_IP_ALIGN.

Looking at code, this seems to do with alignment - could just be
a performance optimization.

> >> Two:	Modify driver to get user buffer allocated from a page constructor
> >> 	API(to substitute alloc_page()), the user buffer are used as payload
> >> 	buffers and filled by h/w directly when packet is received. Driver
> >> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> >> 	the head buffer side, let host allocates skb, and h/w fills it. 
> >> 	After that, the data filled in host skb header will be copied into
> >> 	guest header buffer which is submitted together with the payload buffer.
> >> 
> >> 	Pros:	We could less care the way how guest or host allocates their
> >> 		buffers.
> >> 	Cons:	We still need a bit copy here for the skb header.
> >> 
> >> We are not sure which way is the better here.
> 
> >The obvious question would be whether you see any speed difference
> >with the two approaches. If no, then the second approach would be
> >better.
> 
> I remember the second approach is a bit slower in 1500MTU. 
> But we did not tested too much.

Well, that's an important datapoint. By the way, you'll need
header copy to activate LRO in host, so that's a good
reason to go with option 2 as well.

> >> This is the first thing we want
> >> to get comments from the community. We wish the modification to the network
> >> part will be generic which not used by vhost-net backend only, but a user
> >> application may use it as well when the zero-copy device may provides async
> >> read/write operations later.
> >> 
> >> Please give comments especially for the network part modifications.
> >> 
> >> 
> >> We provide multiple submits and asynchronous notifiicaton to 
> >>vhost-net too.
> >> 
> >> Our goal is to improve the bandwidth and reduce the CPU usage.
> >> Exact performance data will be provided later. But for simple
> >> test with netperf, we found bindwidth up and CPU % up too,
> >> but the bindwidth up ratio is much more than CPU % up ratio.
> >> 
> >> What we have not done yet:
> >> 	packet split support
> 
> >What does this mean, exactly?
> We can support 1500MTU, but for jumbo frame, since vhost driver before don't 
> support mergeable buffer, we cannot try it for multiple sg.

I do not see why, vhost currently supports 64K buffers with indirect
descriptors.

> A jumbo frame will split 5
> frags and hook them once a descriptor, so the user buffer allocation is greatly dependent
> on how guest virtio-net drivers submits buffers. We think mergeable buffer is suitable for it. 
> 
> >> 	To support GRO
> Actually, I think if the mergeable buffer may get good performance, then GRO is not 
> so important then.
> >And TSO/GSO?
> Do we really need them?

My guess would be yes. Mergeable buffers is a memory saving
optimization, not a performance optimization, I don't see
that it can help. And I think you can't solely rely on jumbo frames
in hardware, not everyone can enable them.

Having said that, number one priority is getting decent performance
out of the driver, in whatever way you find fit. I was just
suggesting obvious ways to do this.


> >> 	Performance tuning
> >> 
> >> what we have done in v1:
> >> 	polish the RCU usage
> >> 	deal with write logging in asynchroush mode in vhost
> >> 	add notifier block for mp device
> >> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> >> 	add mp_dev_change_flags() for mp device to change NIC state
> >> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> >> 	a small fix for missing dev_put when fail
> >> 	using dynamic minor instead of static minor number
> >> 	a __KERNEL__ protect to mp_get_sock()
> >> 
> >> what we have done in v2:
> >> 	
> >> 	remove most of the RCU usage, since the ctor pointer is only
> >> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> >> 	stopped to get good cleanup(all outstanding requests are finished),
> >> 	so the ctor pointer cannot be raced into wrong situation.
> >> 
> >> 	Remove the struct vhost_notifier with struct kiocb.
> >> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> >> 	via sendmsg/recvmsg.
> >> 
> >> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> >> 
> >> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> >> 
> >> 
> >> Comments not addressed yet in this time:
> >> 	the async write logging is not satified by vhost-net
> >> 	Qemu needs a sync write
> >> 	a limit for locked pages from get_user_pages_fast()
> >> 	
> >> 		
> >> performance:
> >> 	using netperf with GSO/TSO disabled, 10G NIC, 
> >> 	disabled packet split mode, with raw socket case compared to vhost.
> >> 
> >> 	bindwidth will be from 1.1Gbps to 1.7Gbps
> >> 	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* Re: [PATCH] Speedup link loss detection for 3c59x
From: Steffen Klassert @ 2010-04-15  9:59 UTC (permalink / raw)
  To: Martin Buck; +Cc: netdev, linux-kernel
In-Reply-To: <20100415091134.GA9574@gromit.at.home>

On Thu, Apr 15, 2010 at 11:11:34AM +0200, Martin Buck wrote:
> From: Martin Buck <mb-tmp-yvahk-argqri@gromit.dyndns.org>
> 
> Change the timer used for link status checking to check link every 5s,
> regardless of the current link state. This way, link loss is detected as
> fast as new link, whereas this took up to 60s previously (which is pretty
> inconvenient when trying to react on link loss using e.g. ifplugd). This
> also matches behaviour of most other Ethernet drivers which typically have
> link check intervals in the low second range.
> 

We discussed this issue already some years ago. The 3c59x does polling
for external environment changes which is quite expensive. Firing a
timer that disables the interrupts on a running interface every 5
seconds is not reasonable for checking for external environment changes.
So we decided to let it depend on the link status, 5 seconds if the link
is down and 60 seconds if the link is up.

Steffen

^ permalink raw reply

* RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-15  9:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <20100414152519.GA10792@redhat.com>


Michael,
>> The idea is simple, just to pin the guest VM user space and then
>> let host NIC driver has the chance to directly DMA to it. 
>> The patches are based on vhost-net backend driver. We add a device
>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. KVM guest who use the
>> vhost-net backend may bind any ethX interface in the host side to
>> get copyless data transfer thru guest virtio-net frontend.
>> 
>> The scenario is like this:
>> 
>> The guest virtio-net driver submits multiple requests thru vhost-net
>> backend driver to the kernel. And the requests are queued and then
>> completed after corresponding actions in h/w are done.
>> 
>> For read, user space buffers are dispensed to NIC driver for rx when
>> a page constructor API is invoked. Means NICs can allocate user buffers
>> from a page constructor. We add a hook in netif_receive_skb() function
>> to intercept the incoming packets, and notify the zero-copy device.
>> 
>> For write, the zero-copy deivce may allocates a new host skb and puts
>> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
>> The request remains pending until the skb is transmitted by h/w.
>> 
>> Here, we have ever considered 2 ways to utilize the page constructor
>> API to dispense the user buffers.
>> 
>> One:	Modify __alloc_skb() function a bit, it can only allocate a 
>> 	structure of sk_buff, and the data pointer is pointing to a 
>> 	user buffer which is coming from a page constructor API.
>> 	Then the shinfo of the skb is also from guest.
>> 	When packet is received from hardware, the skb->data is filled
>> 	directly by h/w. What we have done is in this way.
>> 
>> 	Pros:	We can avoid any copy here.
>> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
>> 		the same method with the host NIC drivers, say the size
>> 		of netdev_alloc_skb() and the same reserved space in the
>> 		head of skb. Many NIC drivers are the same with guest and
>> 		ok for this. But some lastest NIC drivers reserves special
>> 		room in skb head. To deal with it, we suggest to provide
>> 		a method in guest virtio-net driver to ask for parameter
>> 		we interest from the NIC driver when we know which device 
>> 		we have bind to do zero-copy. Then we ask guest to do so.
>> 		Is that reasonable?

>Unfortunately, this would break compatibility with existing virtio.
>This also complicates migration. 

You mean any modification to the guest virtio-net driver will break the
compatibility? We tried to enlarge the virtio_net_config to contains the
2 parameter, and add one VIRTIO_NET_F_PASSTHRU flag, virtionet_probe()
will check the feature flag, and get the parameters, then virtio-net driver use
it to allocate buffers. How about this?

>What is the room in skb head used for?
I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
NET_IP_ALIGN.

>> Two:	Modify driver to get user buffer allocated from a page constructor
>> 	API(to substitute alloc_page()), the user buffer are used as payload
>> 	buffers and filled by h/w directly when packet is received. Driver
>> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
>> 	the head buffer side, let host allocates skb, and h/w fills it. 
>> 	After that, the data filled in host skb header will be copied into
>> 	guest header buffer which is submitted together with the payload buffer.
>> 
>> 	Pros:	We could less care the way how guest or host allocates their
>> 		buffers.
>> 	Cons:	We still need a bit copy here for the skb header.
>> 
>> We are not sure which way is the better here.

>The obvious question would be whether you see any speed difference
>with the two approaches. If no, then the second approach would be
>better.

I remember the second approach is a bit slower in 1500MTU. 
But we did not tested too much.

>> This is the first thing we want
>> to get comments from the community. We wish the modification to the network
>> part will be generic which not used by vhost-net backend only, but a user
>> application may use it as well when the zero-copy device may provides async
>> read/write operations later.
>> 
>> Please give comments especially for the network part modifications.
>> 
>> 
>> We provide multiple submits and asynchronous notifiicaton to 
>>vhost-net too.
>> 
>> Our goal is to improve the bandwidth and reduce the CPU usage.
>> Exact performance data will be provided later. But for simple
>> test with netperf, we found bindwidth up and CPU % up too,
>> but the bindwidth up ratio is much more than CPU % up ratio.
>> 
>> What we have not done yet:
>> 	packet split support

>What does this mean, exactly?
We can support 1500MTU, but for jumbo frame, since vhost driver before don't 
support mergeable buffer, we cannot try it for multiple sg. A jumbo frame will split 5
frags and hook them once a descriptor, so the user buffer allocation is greatly dependent
on how guest virtio-net drivers submits buffers. We think mergeable buffer is suitable for it. 

>> 	To support GRO
Actually, I think if the mergeable buffer may get good performance, then GRO is not 
so important then.
>And TSO/GSO?
Do we really need them?

>> 	Performance tuning
>> 
>> what we have done in v1:
>> 	polish the RCU usage
>> 	deal with write logging in asynchroush mode in vhost
>> 	add notifier block for mp device
>> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
>> 	add mp_dev_change_flags() for mp device to change NIC state
>> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
>> 	a small fix for missing dev_put when fail
>> 	using dynamic minor instead of static minor number
>> 	a __KERNEL__ protect to mp_get_sock()
>> 
>> what we have done in v2:
>> 	
>> 	remove most of the RCU usage, since the ctor pointer is only
>> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
>> 	stopped to get good cleanup(all outstanding requests are finished),
>> 	so the ctor pointer cannot be raced into wrong situation.
>> 
>> 	Remove the struct vhost_notifier with struct kiocb.
>> 	Let vhost-net backend to alloc/free the kiocb and transfer them
>> 	via sendmsg/recvmsg.
>> 
>> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
>> 
>> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
>> 
>> 
>> Comments not addressed yet in this time:
>> 	the async write logging is not satified by vhost-net
>> 	Qemu needs a sync write
>> 	a limit for locked pages from get_user_pages_fast()
>> 	
>> 		
>> performance:
>> 	using netperf with GSO/TSO disabled, 10G NIC, 
>> 	disabled packet split mode, with raw socket case compared to vhost.
>> 
>> 	bindwidth will be from 1.1Gbps to 1.7Gbps
>> 	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* [PATCH] Speedup link loss detection for 3c59x
From: Martin Buck @ 2010-04-15  9:11 UTC (permalink / raw)
  To: Steffen Klassert, netdev; +Cc: linux-kernel

From: Martin Buck <mb-tmp-yvahk-argqri@gromit.dyndns.org>

Change the timer used for link status checking to check link every 5s,
regardless of the current link state. This way, link loss is detected as
fast as new link, whereas this took up to 60s previously (which is pretty
inconvenient when trying to react on link loss using e.g. ifplugd). This
also matches behaviour of most other Ethernet drivers which typically have
link check intervals in the low second range.

Signed-off-by: Martin Buck <mb-tmp-yvahk-argqri@gromit.dyndns.org>
---

--- linux-2.6.31.6/drivers/net/3c59x.c.orig	2010-04-13 17:46:07.000000000 +0200
+++ linux-2.6.31.6/drivers/net/3c59x.c	2010-04-13 17:55:31.000000000 +0200
@@ -1761,7 +1761,7 @@ vortex_timer(unsigned long data)
 	struct net_device *dev = (struct net_device *)data;
 	struct vortex_private *vp = netdev_priv(dev);
 	void __iomem *ioaddr = vp->ioaddr;
-	int next_tick = 60*HZ;
+	int next_tick = 5*HZ;
 	int ok = 0;
 	int media_status, old_window;

@@ -1807,9 +1807,6 @@ vortex_timer(unsigned long data)
 		ok = 1;
 	}

-	if (!netif_carrier_ok(dev))
-		next_tick = 5*HZ;
-
 	if (vp->medialock)
 		goto leave_media_alone;

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox