Netdev List


* Re: rps perfomance WAS(Re: rps: question
From: David Miller @ 2010-04-15  8:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: therbert, hadi, shemminger, netdev, robert, xiaosuo, andi
In-Reply-To: <1271278661.16881.1761.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 22:57:41 +0200

> RPS can be tuned (Changli wants a finer tuning...), it would be
> interesting to tune multiqueue devices too. I don't know if it's possible
> right now.

Only NIU allows really detailed control over queue selection and
stuff like that, because the hardware has a real TCAM for
packet matching, and packets which match TCAM entries can
be steered to different collections of queues.

We have ethtool interfaces for this (ETHTOOL_GRXCLS*), so you can
change it.

For most other chips we only have interfaces for modifying the
RX hashing algorithm or what the RX hash covers, stuff like
that.

See also ETHTOOL_GRXFH, ETHTOOL_SRXFH, ETHTOOL_SRXNTUPLE, and
ETHTOOL_GRXNTUPLE, the latter two of which were added for Intel
NICs.


* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon: caller is netif_rx
From: Eric Dumazet @ 2010-04-15  8:58 UTC (permalink / raw)
  To: David Miller; +Cc: xiaosuo, therbert, eparis, netdev
In-Reply-To: <20100415.013347.98375530.davem@davemloft.net>

Le jeudi 15 avril 2010 à 01:33 -0700, David Miller a écrit :

> Yes, this looks more reasonable.  Eric if you agree please (re-)submit
> this formally, I must have missed this somehow, sorry.
> 
> And this is a bug fix in any kernel, not just ones that have RPS
> patches applied.
> 
> If we are not called from some interrupt context, there is no sure
> trigger to make sure software interrupts will be executed after the
> packet is queued locally.  netif_rx_ni() makes sure that any pending
> software interrupts will run in such cases.

Our mails crossed ;)

Yes I think it's more reasonable to fix it like this, I'll submit a
patch after fully testing it :)





* RE: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-15  9:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mst@redhat.com, mingo@elte.hu,
	davem@davemloft.net, jdike@linux.intel.com
In-Reply-To: <201004141655.21885.arnd@arndb.de>

Arnd,
>> From: Xin Xiaohui <xiaohui.xin@intel.com>
>> 
>> Add a device to utilize the vhost-net backend driver for
>> copy-less data transfer between guest FE and host NIC.
>> It pins the guest user space to the host memory and
>> provides proto_ops as sendmsg/recvmsg to vhost-net.

>Sorry for taking so long before finding the time to look
>at your code in more detail.

>It seems that you are duplicating a lot of functionality that
>is already in macvtap. I've asked about this before but then
>didn't look at your newer versions. Can you explain the value
>of introducing another interface to user land?

>I'm still planning to add zero-copy support to macvtap,
>hopefully reusing parts of your code, but do you think there
>is value in having both?

I have not looked into your macvtap code in detail before.
Are the two interfaces exactly the same? We just want to create a simple
way to do zero-copy. For now it only supports vhost, but in the future
we also want it to support direct read/write operations from user space.

Basically, I'm less worried about the interface than about the modifications
we have made to the net core to implement zero-copy. If this hardest part
can be done, then any user space interface modifications or integration
will be easier to do afterwards.

>> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
>> new file mode 100644
>> index 0000000..86d2525
>> --- /dev/null
>> +++ b/drivers/vhost/mpassthru.c
>> @@ -0,0 +1,1264 @@
>> +
>> +#ifdef MPASSTHRU_DEBUG
>> +static int debug;
>> +
>> +#define DBG  if (mp->debug) printk
>> +#define DBG1 if (debug == 2) printk
>> +#else
>> +#define DBG(a...)
>> +#define DBG1(a...)
>> +#endif

>This should probably just use the existing dev_dbg/pr_debug infrastructure.

	Thanks. Will try that.
> [... skipping buffer management code for now]

>> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
>> +			struct msghdr *m, size_t total_len)
>> +{
> >[...]

>This function looks like we should be able to easily include it into
>macvtap and get zero-copy transmits without introducing the new
>user-level interface.

>> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
>> +			struct msghdr *m, size_t total_len,
>> +			int flags)
>> +{
>> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
>> +	struct page_ctor *ctor;
>> +	struct vhost_virtqueue *vq = (struct vhost_virtqueue *)(iocb->private);

>It smells like a layering violation to look at the iocb->private field
>from a lower-level driver. I would have hoped that it's possible to implement
>this without having this driver know about the higher-level vhost driver
>internals. Can you explain why this is needed?

I don't like this either, but the kiocb is maintained by vhost with a list_head,
and the mp device is responsible for collecting the kiocb into that list_head,
so we need something known to both vhost and mp.
 
>> +	spin_lock_irqsave(&ctor->read_lock, flag);
>> +	list_add_tail(&info->list, &ctor->readq);
>> +	spin_unlock_irqrestore(&ctor->read_lock, flag);
>> +
>> +	if (!vq->receiver) {
>> +		vq->receiver = mp_recvmsg_notify;
>> +		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
>> +				   vq->num * 4096,
>> +				   vq->num * 4096);
>> +	}
>> +
>> +	return 0;
>> +}

>Not sure what I'm missing, but who calls the vq->receiver? This seems
>to be neither in the upstream version of vhost nor introduced by your
>patch.

See Patch v3 2/3 I have sent out, it is called by handle_rx() in vhost.

>> +static void __mp_detach(struct mp_struct *mp)
>> +{
>> +	mp->mfile = NULL;
>> +
>> +	mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
>> +	page_ctor_detach(mp);
>> +	mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
>> +
>> +	/* Drop the extra count on the net device */
>> +	dev_put(mp->dev);
>> +}
>> +
>> +static DEFINE_MUTEX(mp_mutex);
>> +
>> +static void mp_detach(struct mp_struct *mp)
>> +{
>> +	mutex_lock(&mp_mutex);
>> +	__mp_detach(mp);
>> +	mutex_unlock(&mp_mutex);
>> +}
>> +
>> +static void mp_put(struct mp_file *mfile)
>> +{
>> +	if (atomic_dec_and_test(&mfile->count))
>> +		mp_detach(mfile->mp);
>> +}
>> +
>> +static int mp_release(struct socket *sock)
>> +{
>> +	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
>> +	struct mp_file *mfile = mp->mfile;
>> +
>> +	mp_put(mfile);
>> +	sock_put(mp->socket.sk);
>> +	put_net(mfile->net);
>> +
>> +	return 0;
>> +}

>Doesn't this prevent the underlying interface from going away while the chardev
>is open? You also have logic to handle that case, so why do you keep the extra
>reference on the netdev?

Let me think.

>> +/* Ops structure to mimic raw sockets with mp device */
>> +static const struct proto_ops mp_socket_ops = {
>> +	.sendmsg = mp_sendmsg,
>> +	.recvmsg = mp_recvmsg,
>> +	.release = mp_release,
>> +};

>> +static int mp_chr_open(struct inode *inode, struct file * file)
>> +{
>> +	struct mp_file *mfile;
>> +	cycle_kernel_lock();

>I don't think you really want to use the BKL here, just kill that line.

>> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
>> +		unsigned long arg)
>> +{
>> +	struct mp_file *mfile = file->private_data;
>> +	struct mp_struct *mp;
>> +	struct net_device *dev;
>> +	void __user* argp = (void __user *)arg;
>> +	struct ifreq ifr;
>> +	struct sock *sk;
>> +	int ret;
>> +
>> +	ret = -EINVAL;
>> +
>> +	switch (cmd) {
>> +	case MPASSTHRU_BINDDEV:
>> +		ret = -EFAULT;
>> +		if (copy_from_user(&ifr, argp, sizeof ifr))
>> +			break;

>This is broken for 32 bit compat mode ioctls, because struct ifreq
>is different between 32 and 64 bit systems. Since you are only
>using the device name anyway, a fixed length string or just the
>interface index would be simpler and work better.

 Thanks, will look into this.

>> +		ifr.ifr_name[IFNAMSIZ-1] = '\0';
>> +
>> +		ret = -EBUSY;
>> +
>> +		if (ifr.ifr_flags & IFF_MPASSTHRU_EXCL)
>> +			break;

>Your current use of the IFF_MPASSTHRU* flags does not seem to make
>any sense whatsoever. You check that this flag is never set, but set
>it later yourself and then ignore all flags.

That flag is meant to prevent anyone else from binding the same device
again. But I will check whether it really ignores all other flags.

>> +		ret = -ENODEV;
>> +		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
>> +		if (!dev)
>> +			break;

>There is no permission checking on who can access what device, which
>seems a bit simplistic. Any user that has access to the mpassthru device
>seems to be able to bind to any network interface in the namespace.
>This is one point where the macvtap model seems more appropriate, it
>separates the permissions for creating logical interfaces and using them.

Yes, that's a problem I have not addressed yet.

>> +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
>> +				unsigned long count, loff_t pos)
>> +{
>> +	struct file *file = iocb->ki_filp;
>> +	struct mp_struct *mp = mp_get(file->private_data);
>> +	struct sock *sk = mp->socket.sk;
>> +	struct sk_buff *skb;
>> +	int len, err;
>> +	ssize_t result;

>Can you explain what this function is even there for? AFAICT, vhost-net
>doesn't call it, the interface is incompatible with the existing
>tap interface, and you don't provide a read function.

 I saw that Michael has already given the answer.

>> diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
>> new file mode 100644
>> index 0000000..2be21c5
>> --- /dev/null
>> +++ b/include/linux/mpassthru.h
>> @@ -0,0 +1,29 @@
>> +#ifndef __MPASSTHRU_H
>> +#define __MPASSTHRU_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/if_ether.h>
>> +
>> +/* ioctl defines */
>> +#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
>> +#define MPASSTHRU_UNBINDDEV    _IOW('M', 214, int)

>These definitions are slightly wrong, because you pass more than just an 'int'.

 Thanks. I wrote them too quickly. :-(
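
One possible shape for the fix, sketched in userspace under the assumption
that the bind request carries only an interface name. The names
struct mp_bind_req, MP_BINDDEV_SKETCH, MP_IFNAMSIZ and mp_bind_req_init()
are hypothetical and not part of the patch. A fixed-length string has the
same layout for 32 and 64 bit callers, and passing the struct type to _IOW
makes the encoded size match what is actually copied:

```c
#include <assert.h>
#include <string.h>
#include <linux/ioctl.h>

#define MP_IFNAMSIZ 16	/* same value as IFNAMSIZ; the name is hypothetical */

/* Fixed-layout ioctl argument: only chars, so no 32/64 bit differences. */
struct mp_bind_req {
	char ifname[MP_IFNAMSIZ];
};

/* The size encoded in the command number is sizeof(struct mp_bind_req). */
#define MP_BINDDEV_SKETCH	_IOW('M', 213, struct mp_bind_req)

/* Fill a request; returns -1 if the name is empty or does not fit. */
int mp_bind_req_init(struct mp_bind_req *req, const char *ifname)
{
	size_t n;

	assert(req != NULL && ifname != NULL);
	n = strlen(ifname);
	if (n == 0 || n >= MP_IFNAMSIZ)
		return -1;
	memset(req, 0, sizeof(*req));	/* guarantees NUL termination */
	memcpy(req->ifname, ifname, n);
	return 0;
}
```

With this layout the kernel side could copy exactly sizeof(struct mp_bind_req)
bytes and would not need a separate compat handler for the argument.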

>> +/* MPASSTHRU ifc flags */
>> +#define IFF_MPASSTHRU		0x0001
>> +#define IFF_MPASSTHRU_EXCL	0x0002

>As mentioned above, these flags don't make any sense with your current code.

I used them to try to prevent anyone from binding the same device again.
	Arnd


* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon: caller is netif_rx
From: David Miller @ 2010-04-15  9:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: xiaosuo, therbert, eparis, netdev
In-Reply-To: <1271321358.16881.2240.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 15 Apr 2010 10:49:18 +0200

> Maybe we should add a new function after all...
> 
> int netif_rx_any(struct sk_buff *skb) 
> {
>        if (in_interrupt())
>                return netif_rx(skb);
> 
> 	return netif_rx_ni(skb);
> }

Ok, thanks for the analysis.

Since we keep coming back to this issue why don't we simply
solve it forever?  Let's make netif_rx() work in all contexts
and get rid of netif_rx_ni().

I think this is the thing to do because this whole netif_rx_ni()
vs. netif_rx() thing was meant to be an optimization of sorts (this
goes back to like 8+ years ago :-), and really I doubt it really
matters on that level any more.

What do you think?
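
The difference between the two entry points can be illustrated with a small
userspace model. It is only a sketch: struct cpu_model and the model_*
functions are made-up stand-ins, not kernel APIs, and the model assumes that
pending softirqs run either at interrupt exit or when flushed explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU; none of these names exist in the kernel. */
struct cpu_model {
	bool in_interrupt;	/* stands in for in_interrupt() */
	bool softirq_pending;	/* RX softirq raised but not yet run */
	int backlog;		/* packets queued locally */
	int delivered;		/* packets processed by the softirq */
};

/* Run the pending RX softirq, if any. */
void model_run_softirqs(struct cpu_model *cpu)
{
	if (!cpu->softirq_pending)
		return;
	cpu->delivered += cpu->backlog;
	cpu->backlog = 0;
	cpu->softirq_pending = false;
}

/* Like netif_rx(): queue and raise the softirq, rely on irq exit to run it. */
void model_netif_rx(struct cpu_model *cpu)
{
	cpu->backlog++;
	cpu->softirq_pending = true;
}

/* Like netif_rx_ni(): queue, then run pending softirqs right away. */
void model_netif_rx_ni(struct cpu_model *cpu)
{
	assert(!cpu->in_interrupt);
	model_netif_rx(cpu);
	model_run_softirqs(cpu);
}

/* The dispatch proposed as netif_rx_any() in the quoted mail. */
void model_netif_rx_any(struct cpu_model *cpu)
{
	if (cpu->in_interrupt)
		model_netif_rx(cpu);
	else
		model_netif_rx_ni(cpu);
}

/* Leaving interrupt context is the "sure trigger" that runs softirqs. */
void model_irq_exit(struct cpu_model *cpu)
{
	cpu->in_interrupt = false;
	model_run_softirqs(cpu);
}
```

In the model, calling model_netif_rx() from process context leaves the packet
in the backlog with nothing scheduled to pick it up, which is the situation
the quoted analysis describes.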


* Re: [RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-15  9:03 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: Arnd Bergmann, netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, davem@davemloft.net,
	jdike@linux.intel.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C18969A5@shzsmsx502.ccr.corp.intel.com>

On Thu, Apr 15, 2010 at 05:01:10PM +0800, Xin, Xiaohui wrote:
> >It smells like a layering violation to look at the iocb->private field
> >from a lower-level driver. I would have hoped that it's possible to implement
> >this without having this driver know about the higher-level vhost driver
> >internals. Can you explain why this is needed?
> 
> I don't like this either, but the kiocb is maintained by vhost with a list_head,
> and the mp device is responsible for collecting the kiocb into that list_head,
> so we need something known to both vhost and mp.

Can't vhost supply a kiocb completion callback that will handle the list?

-- 
MST
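
A rough userspace sketch of the callback idea, assuming the request object
can carry a completion hook. All names here (struct model_req,
struct model_upper and the model_* functions) are hypothetical; the real
kiocb and vhost structures look different. The point is that the lower
layer only calls the hook and never touches the upper layer's list:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a request passed between the layers. */
struct model_req {
	void (*complete)(struct model_req *req, long result);
	void *owner;			/* opaque to the lower layer */
	long result;
	struct model_req *next;		/* list linkage, owned by the upper layer */
};

/* Stand-in for the upper layer (vhost), which owns the completion list. */
struct model_upper {
	struct model_req *done_head;
	int done_count;
};

/* Completion hook supplied by the upper layer; it alone handles the list. */
void model_upper_complete(struct model_req *req, long result)
{
	struct model_upper *up = req->owner;

	req->result = result;
	req->next = up->done_head;
	up->done_head = req;
	up->done_count++;
}

/* The upper layer prepares a request before handing it down. */
void model_upper_prepare(struct model_upper *up, struct model_req *req)
{
	req->complete = model_upper_complete;
	req->owner = up;
	req->result = 0;
	req->next = NULL;
}

/* Stand-in for the lower layer (mp device): no knowledge of model_upper. */
void model_lower_finish(struct model_req *req, long bytes)
{
	assert(req != NULL && req->complete != NULL);
	req->complete(req, bytes);
}
```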


* Re: NULL pointer dereference panic in stable (2.6.33.2), amd64
From: David Miller @ 2010-04-15  9:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: krkumar2, netdev, nuclearcat
In-Reply-To: <1271321507.16881.2244.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 15 Apr 2010 10:51:47 +0200

> In any case, I think there is a fundamental problem with this sk
> caching. Because one packet can travel in many stacked devices before
> hitting the wire.
> 
> (bonding, vlan, ethernet) for example.
> 
> Socket cache is meaningful for one level only...

We were talking the other day about that 'tun' change to orphan the
SKB on TX, and I mentioned the possibility of just doing this in some
generic location before we give the packet to the device ->xmit()
method.

Such a scheme could help with this problem too.


* [PATCH 1/2 resend] igb: dobule increment nr_frags
From: Koki Sanagi @ 2010-04-15  9:06 UTC (permalink / raw)
  To: netdev, e1000-devel
  Cc: davem, jeffrey.t.kirsher, jesse.brandeburg, bruce.w.allan,
	peter.p.waskiewicz.jr, john.ronciak, Taku Izumi

The previous patch had a mail format problem.
I hope it is fixed in this resend.

There is no need to increment nr_frags because skb_fill_page_desc increments
it.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 drivers/net/igb/igb_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 78cc742..8bde6c3 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -5254,7 +5254,7 @@ static bool igb_clean_rx_irq_adv(struct igb_q_vector *q_vector,
 				       PAGE_SIZE / 2, PCI_DMA_FROMDEVICE);
 			buffer_info->page_dma = 0;
 
-			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags++,
+			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
 						buffer_info->page,
 						buffer_info->page_offset,
 						length);
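
A toy userspace model of why the caller's `++` is unnecessary. It assumes,
as the description above says, that skb_fill_page_desc() itself stores the
new fragment count, modelled here as nr_frags = i + 1. The model_* names are
stand-ins and not the kernel structures:

```c
#include <assert.h>

#define MODEL_MAX_FRAGS 4

struct model_frag {
	int page_id;
	int offset;
	int size;
};

struct model_shinfo {
	int nr_frags;
	struct model_frag frags[MODEL_MAX_FRAGS];
};

/* Stand-in for skb_fill_page_desc(): fill slot i, then set the count. */
void model_fill_page_desc(struct model_shinfo *sh, int i,
			  int page_id, int offset, int size)
{
	assert(i >= 0 && i < MODEL_MAX_FRAGS);
	sh->frags[i].page_id = page_id;
	sh->frags[i].offset = offset;
	sh->frags[i].size = size;
	sh->nr_frags = i + 1;	/* the helper already updates the count */
}
```

Under that assumption, calling the helper with `sh.nr_frags++` or with plain
`sh.nr_frags` as the index fills the same slot and leaves the same count, so
the post-increment in the caller does redundant work.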



* [PATCH net-2.6] packet : remove init_net restriction
From: Daniel Lezcano @ 2010-04-15  9:11 UTC (permalink / raw)
  To: davem; +Cc: netdev

The af_packet protocol is used by Perl to do ioctls, as reported by
Stephane Riviere:

"Net::RawIP relies on SIOCGIFADDR and SIOCGIFHWADDR to get the IP and MAC
addresses of the network interface."

But in a new network namespace these ioctls fail because they are disabled for
any namespace other than the init_net_ns.

These two lines should not be there, as af_inet and af_packet have been
namespace aware for a long time now. I suppose we forgot to remove these
lines because we sent the af_packet changes first, before af_inet was supported.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Reported-by: Stephane Riviere <stephane.riviere@regis-dgac.net>
---
 net/packet/af_packet.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index cc90363..243946d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2169,8 +2169,6 @@ static int packet_ioctl(struct socket *sock, unsigned int cmd,
 	case SIOCGIFDSTADDR:
 	case SIOCSIFDSTADDR:
 	case SIOCSIFFLAGS:
-		if (!net_eq(sock_net(sk), &init_net))
-			return -ENOIOCTLCMD;
 		return inet_dgram_ops.ioctl(sock, cmd, arg);
 #endif

-- 
1.6.3.3


* Re: NULL pointer dereference panic in stable (2.6.33.2), amd64
From: Denys Fedorysychenko @ 2010-04-15  9:11 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, krkumar2, netdev
In-Reply-To: <20100415.020619.00349859.davem@davemloft.net>

Btw, I have an application using tun.

On Thursday 15 April 2010 12:06:19 David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 15 Apr 2010 10:51:47 +0200
> 
> > In any case, I think there is a fundamental problem with this sk
> > caching. Because one packet can travel in many stacked devices before
> > hitting the wire.
> >
> > (bonding, vlan, ethernet) for example.
> >
> > Socket cache is meaningful for one level only...
> 
> We were talking the other day about that 'tun' change to orphan the
> SKB on TX, and I mentioned the possibility of just doing this in some
> generic location before we give the packet to the device ->xmit()
> method.
> 
> Such a scheme could help with this problem too.


* [PATCH] Speedup link loss detection for 3c59x
From: Martin Buck @ 2010-04-15  9:11 UTC (permalink / raw)
  To: Steffen Klassert, netdev; +Cc: linux-kernel

From: Martin Buck <mb-tmp-yvahk-argqri@gromit.dyndns.org>

Change the timer used for link status checking to check link every 5s,
regardless of the current link state. This way, link loss is detected as
fast as new link, whereas this took up to 60s previously (which is pretty
inconvenient when trying to react to link loss using e.g. ifplugd). This
also matches behaviour of most other Ethernet drivers which typically have
link check intervals in the low second range.

Signed-off-by: Martin Buck <mb-tmp-yvahk-argqri@gromit.dyndns.org>
---

--- linux-2.6.31.6/drivers/net/3c59x.c.orig	2010-04-13 17:46:07.000000000 +0200
+++ linux-2.6.31.6/drivers/net/3c59x.c	2010-04-13 17:55:31.000000000 +0200
@@ -1761,7 +1761,7 @@ vortex_timer(unsigned long data)
 	struct net_device *dev = (struct net_device *)data;
 	struct vortex_private *vp = netdev_priv(dev);
 	void __iomem *ioaddr = vp->ioaddr;
-	int next_tick = 60*HZ;
+	int next_tick = 5*HZ;
 	int ok = 0;
 	int media_status, old_window;

@@ -1807,9 +1807,6 @@ vortex_timer(unsigned long data)
 		ok = 1;
 	}

-	if (!netif_carrier_ok(dev))
-		next_tick = 5*HZ;
-
 	if (vp->medialock)
 		goto leave_media_alone;


* RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Xin, Xiaohui @ 2010-04-15  9:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <20100414152519.GA10792@redhat.com>


Michael,
>> The idea is simple: pin the guest VM user space and then
>> let the host NIC driver DMA directly to it.
>> The patches are based on the vhost-net backend driver. We add a device
>> which provides proto_ops such as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. A KVM guest which uses the
>> vhost-net backend may bind any ethX interface on the host side to
>> get copyless data transfer through the guest virtio-net frontend.
>> 
>> The scenario is like this:
>> 
>> The guest virtio-net driver submits multiple requests thru vhost-net
>> backend driver to the kernel. And the requests are queued and then
>> completed after corresponding actions in h/w are done.
>> 
>> For read, user space buffers are dispensed to the NIC driver for rx when
>> a page constructor API is invoked. This means NICs can allocate user buffers
>> from a page constructor. We add a hook in the netif_receive_skb() function
>> to intercept the incoming packets and notify the zero-copy device.
>> 
>> For write, the zero-copy device allocates a new host skb, puts the
>> payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
>> The request remains pending until the skb is transmitted by h/w.
>> 
>> We have considered 2 ways to utilize the page constructor
>> API to dispense the user buffers.
>> 
>> One:	Modify the __alloc_skb() function a bit so that it only allocates
>> 	the sk_buff structure, with the data pointer pointing to a
>> 	user buffer which comes from a page constructor API.
>> 	Then the shinfo of the skb is also from the guest.
>> 	When a packet is received from hardware, skb->data is filled
>> 	directly by h/w. This is the way we have implemented it.
>> 
>> 	Pros:	We can avoid any copy here.
>> 	Cons:	The guest virtio-net driver needs to allocate skbs in almost
>> 		the same way as the host NIC drivers, i.e. with the size
>> 		of netdev_alloc_skb() and the same reserved space in the
>> 		head of the skb. Many NIC drivers match the guest and are
>> 		ok with this. But some of the latest NIC drivers reserve special
>> 		room in the skb head. To deal with it, we suggest providing
>> 		a method in the guest virtio-net driver to ask the NIC driver
>> 		for the parameters we are interested in once we know which
>> 		device we have bound for zero-copy. Then we ask the guest to do so.
>> 		Is that reasonable?

>Unfortunately, this would break compatibility with existing virtio.
>This also complicates migration. 

You mean any modification to the guest virtio-net driver will break
compatibility? We tried enlarging virtio_net_config to contain the
2 parameters and adding a VIRTIO_NET_F_PASSTHRU flag; virtionet_probe()
checks the feature flag and gets the parameters, then the virtio-net driver uses
them to allocate buffers. How about this?

>What is the room in skb head used for?
I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
NET_IP_ALIGN.

>> Two:	Modify the driver to get user buffers allocated from a page constructor
>> 	API (as a substitute for alloc_page()). The user buffers are used as payload
>> 	buffers and filled by h/w directly when a packet is received. The driver
>> 	should associate the pages with the skb (skb_shinfo(skb)->frags). For
>> 	the head buffer side, let the host allocate the skb, and h/w fills it.
>> 	After that, the data filled in the host skb header is copied into the
>> 	guest header buffer which is submitted together with the payload buffer.
>> 
>> 	Pros:	We need to care less about how the guest or host allocates their
>> 		buffers.
>> 	Cons:	We still need a small copy here for the skb header.
>> 
>> We are not sure which way is better here.

>The obvious question would be whether you see any speed difference
>with the two approaches. If no, then the second approach would be
>better.

I remember the second approach being a bit slower with a 1500 MTU.
But we did not test it much.
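
To make the cost of the second approach concrete, here is a toy userspace
model. It is a sketch only: struct model_guest_buf and
model_rx_option_two() are made-up names, and the model assumes the payload
has already been written into pinned guest pages by the NIC, so the only CPU
copy left is the header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Buffers submitted by the guest: one for the header, one for the payload. */
struct model_guest_buf {
	unsigned char *hdr;
	size_t hdr_cap;
	unsigned char *payload;		/* pinned guest pages, filled by DMA */
	size_t payload_cap;
};

/* Deliver one received packet under option two.
 * Returns the number of bytes the CPU had to copy, or -1 if it does not fit. */
long model_rx_option_two(const unsigned char *host_hdr, size_t hdr_len,
			 size_t payload_len, struct model_guest_buf *g)
{
	assert(host_hdr != NULL && g != NULL);
	if (hdr_len > g->hdr_cap || payload_len > g->payload_cap)
		return -1;
	/* Only the header, which h/w wrote into the host skb, is copied. */
	memcpy(g->hdr, host_hdr, hdr_len);
	return (long)hdr_len;
}
```

In this model the copied byte count depends only on the header length and
not on the payload size, which matches the "small copy for the skb header"
listed under Cons.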

>> This is the first thing we want
>> to get comments on from the community. We hope the modification to the network
>> part will be generic, so that it is not used by the vhost-net backend only, but a user
>> application may use it as well once the zero-copy device provides async
>> read/write operations later.
>> 
>> Please give comments especially for the network part modifications.
>> 
>> 
>> We provide multiple submits and asynchronous notification to
>> vhost-net too.
>> 
>> Our goal is to improve the bandwidth and reduce the CPU usage.
>> Exact performance data will be provided later. But in a simple
>> test with netperf, we found that both bandwidth and CPU % went up,
>> but the bandwidth increase ratio is much larger than the CPU % increase ratio.
>> 
>> What we have not done yet:
>> 	packet split support

>What does this mean, exactly?
We can support 1500MTU, but for jumbo frames, since the vhost driver did not previously
support mergeable buffers, we cannot try it with multiple sg. A jumbo frame will be split into 5
frags hooked to one descriptor, so the user buffer allocation depends greatly
on how the guest virtio-net driver submits buffers. We think mergeable buffers are suitable for it.

>> 	To support GRO
Actually, I think if mergeable buffers give good performance, then GRO is not
so important.
>And TSO/GSO?
Do we really need them?

>> 	Performance tuning
>> 
>> what we have done in v1:
>> 	polish the RCU usage
>> 	deal with write logging in asynchronous mode in vhost
>> 	add notifier block for mp device
>> 	rename page_ctor to mp_port in netdevice.h to make it look generic
>> 	add mp_dev_change_flags() for mp device to change NIC state
>> 	add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not loaded
>> 	a small fix for missing dev_put when fail
>> 	using dynamic minor instead of static minor number
>> 	a __KERNEL__ protect to mp_get_sock()
>> 
>> what we have done in v2:
>> 	
>> 	remove most of the RCU usage, since the ctor pointer is only
>> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
>> 	stopped to get good cleanup(all outstanding requests are finished),
>> 	so the ctor pointer cannot be raced into wrong situation.
>> 
>> 	Remove the struct vhost_notifier with struct kiocb.
>> 	Let vhost-net backend to alloc/free the kiocb and transfer them
>> 	via sendmsg/recvmsg.
>> 
>> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
>> 
>> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
>> 
>> 
>> Comments not addressed yet in this time:
>> 	the async write logging is not satisfied by vhost-net
>> 	Qemu needs a sync write
>> 	a limit for locked pages from get_user_pages_fast()
>> 	
>> 		
>> performance:
>> 	using netperf with GSO/TSO disabled, 10G NIC, 
>> 	disabled packet split mode, with raw socket case compared to vhost.
>> 
>> 	bandwidth goes from 1.1Gbps to 1.7Gbps
>> 	CPU % from 120%-140% to 140%-160%


* Re: [PATCH] Speedup link loss detection for 3c59x
From: Steffen Klassert @ 2010-04-15  9:59 UTC (permalink / raw)
  To: Martin Buck; +Cc: netdev, linux-kernel
In-Reply-To: <20100415091134.GA9574@gromit.at.home>

On Thu, Apr 15, 2010 at 11:11:34AM +0200, Martin Buck wrote:
> From: Martin Buck <mb-tmp-yvahk-argqri@gromit.dyndns.org>
> 
> Change the timer used for link status checking to check link every 5s,
> regardless of the current link state. This way, link loss is detected as
> fast as new link, whereas this took up to 60s previously (which is pretty
> inconvenient when trying to react to link loss using e.g. ifplugd). This
> also matches behaviour of most other Ethernet drivers which typically have
> link check intervals in the low second range.
> 

We discussed this issue already some years ago. The 3c59x does polling
for external environment changes which is quite expensive. Firing a
timer that disables the interrupts on a running interface every 5
seconds is not reasonable for checking for external environment changes.
So we decided to let it depend on the link status, 5 seconds if the link
is down and 60 seconds if the link is up.

Steffen
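
The trade-off described above can be put into numbers with a short sketch.
The function names are illustrative only; the 5 and 60 second values are the
ones quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	LINK_POLL_DOWN_SECS = 5,	/* link down: poll often to catch new link */
	LINK_POLL_UP_SECS   = 60,	/* link up: poll rarely, polling is expensive */
};

/* Interval chosen by the existing scheme, which depends on link state. */
int link_poll_interval_secs(bool carrier_ok)
{
	return carrier_ok ? LINK_POLL_UP_SECS : LINK_POLL_DOWN_SECS;
}

/* Number of timer runs per hour for a given polling interval. */
int link_polls_per_hour(int interval_secs)
{
	assert(interval_secs > 0);
	return 3600 / interval_secs;
}
```

On a running interface the existing scheme fires the timer 60 times per
hour, while a fixed 5 second interval would fire it 720 times per hour, in
exchange for detecting link loss within 5 seconds instead of up to 60.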


* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Michael S. Tsirkin @ 2010-04-15 10:05 UTC (permalink / raw)
  To: Xin, Xiaohui
  Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu,
	jdike@linux.intel.com, davem@davemloft.net
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C18969CC@shzsmsx502.ccr.corp.intel.com>

On Thu, Apr 15, 2010 at 05:36:07PM +0800, Xin, Xiaohui wrote:
> 
> Michael,
> >> The idea is simple, just to pin the guest VM user space and then
> >> let host NIC driver has the chance to directly DMA to it. 
> >> The patches are based on vhost-net backend driver. We add a device
> >> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> >> send/recv directly to/from the NIC driver. KVM guest who use the
> >> vhost-net backend may bind any ethX interface in the host side to
> >> get copyless data transfer thru guest virtio-net frontend.
> >> 
> >> The scenario is like this:
> >> 
> >> The guest virtio-net driver submits multiple requests thru vhost-net
> >> backend driver to the kernel. And the requests are queued and then
> >> completed after corresponding actions in h/w are done.
> >> 
> >> For read, user space buffers are dispensed to NIC driver for rx when
> >> a page constructor API is invoked. Means NICs can allocate user buffers
> >> from a page constructor. We add a hook in netif_receive_skb() function
> >> to intercept the incoming packets, and notify the zero-copy device.
> >> 
> >> For write, the zero-copy device allocates a new host skb, puts the
> >> payload on skb_shinfo(skb)->frags, and copies the header to skb->data.
> >> The request remains pending until the skb is transmitted by h/w.
> >> 
> >> Here, we have ever considered 2 ways to utilize the page constructor
> >> API to dispense the user buffers.
> >> 
> >> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> >> 	structure of sk_buff, and the data pointer is pointing to a 
> >> 	user buffer which is coming from a page constructor API.
> >> 	Then the shinfo of the skb is also from guest.
> >> 	When packet is received from hardware, the skb->data is filled
> >> 	directly by h/w. What we have done is in this way.
> >> 
> >> 	Pros:	We can avoid any copy here.
> >> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> >> 		the same method with the host NIC drivers, say the size
> >> 		of netdev_alloc_skb() and the same reserved space in the
> >> 		head of skb. Many NIC drivers are the same with guest and
> >> 		ok for this. But some of the latest NIC drivers reserve special
> >> 		room in skb head. To deal with it, we suggest to provide
> >> 		a method in guest virtio-net driver to ask for parameter
> >> 		we interest from the NIC driver when we know which device 
> >> 		we have bind to do zero-copy. Then we ask guest to do so.
> >> 		Is that reasonable?
> 
> >Unfortunately, this would break compatibility with existing virtio.
> >This also complicates migration. 
> 
> You mean any modification to the guest virtio-net driver will break the
> compatibility? We tried to enlarge the virtio_net_config to contains the
> 2 parameter, and add one VIRTIO_NET_F_PASSTHRU flag, virtionet_probe()
> will check the feature flag, and get the parameters, then virtio-net driver use
> it to allocate buffers. How about this?

This means that we can't, for example, live-migrate between different systems
without flushing outstanding buffers.

> >What is the room in skb head used for?
> I'm not sure, but the latest ixgbe driver does this, it reserves 32 bytes compared to
> NET_IP_ALIGN.

Looking at code, this seems to do with alignment - could just be
a performance optimization.

> >> Two:	Modify driver to get user buffer allocated from a page constructor
> >> 	API(to substitute alloc_page()), the user buffer are used as payload
> >> 	buffers and filled by h/w directly when packet is received. Driver
> >> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> >> 	the head buffer side, let host allocates skb, and h/w fills it. 
> >> 	After that, the data filled in host skb header will be copied into
> >> 	guest header buffer which is submitted together with the payload buffer.
> >> 
> >> 	Pros:	We could less care the way how guest or host allocates their
> >> 		buffers.
> >> 	Cons:	We still need a bit copy here for the skb header.
> >> 
> >> We are not sure which way is the better here.
> 
> >The obvious question would be whether you see any speed difference
> >with the two approaches. If no, then the second approach would be
> >better.
> 
> I remember the second approach is a bit slower in 1500MTU. 
> But we did not tested too much.

Well, that's an important datapoint. By the way, you'll need
header copy to activate LRO in host, so that's a good
reason to go with option 2 as well.

> >> This is the first thing we want
> >> to get comments on from the community. We hope the modification to the network
> >> part will be generic, so that it is not used by the vhost-net backend only, but
> >> a user application may use it as well once the zero-copy device provides async
> >> read/write operations later.
> >> 
> >> Please give comments especially for the network part modifications.
> >> 
> >> 
> >> We provide multiple submits and asynchronous notification to
> >> vhost-net too.
> >> 
> >> Our goal is to improve the bandwidth and reduce the CPU usage.
> >> Exact performance data will be provided later. But in a simple
> >> test with netperf, we found bandwidth up and CPU % up too,
> >> but the bandwidth increase is much larger than the CPU % increase.
> >> 
> >> What we have not done yet:
> >> 	packet split support
> 
> >What does this mean, exactly?
> We can support 1500MTU, but for jumbo frames, since the vhost driver did not previously
> support mergeable buffers, we cannot try it with multiple sg.

I do not see why, vhost currently supports 64K buffers with indirect
descriptors.

> A jumbo frame will be split into 5
> frags hooked to a single descriptor, so the user buffer allocation depends greatly
> on how the guest virtio-net driver submits buffers. We think mergeable buffers are suitable for it.
> 
> >> 	To support GRO
> Actually, I think if mergeable buffers get good performance, then GRO is not
> so important.
> >And TSO/GSO?
> Do we really need them?

My guess would be yes. Mergeable buffers are a memory saving
optimization, not a performance optimization, so I don't see
how they can help. And I think you can't rely solely on jumbo frames
in hardware, not everyone can enable them.

Having said that, number one priority is getting decent performance
out of the driver, in whatever way you find fit. I was just
suggesting obvious ways to do this.


> >> 	Performance tuning
> >> 
> >> what we have done in v1:
> >> 	polish the RCU usage
> >> 	deal with write logging in asynchronous mode in vhost
> >> 	add notifier block for mp device
> >> 	rename page_ctor to mp_port in netdevice.h to make it look generic
> >> 	add mp_dev_change_flags() for mp device to change NIC state
> >> 	add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not loaded
> >> 	a small fix for missing dev_put when fail
> >> 	using dynamic minor instead of static minor number
> >> 	a __KERNEL__ protect to mp_get_sock()
> >> 
> >> what we have done in v2:
> >> 	
> >> 	remove most of the RCU usage, since the ctor pointer is only
> >> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> >> 	stopped to get good cleanup(all outstanding requests are finished),
> >> 	so the ctor pointer cannot be raced into wrong situation.
> >> 
> >> 	Remove the struct vhost_notifier with struct kiocb.
> >> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> >> 	via sendmsg/recvmsg.
> >> 
> >> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> >> 
> >> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> >> 
> >> 
> >> Comments not addressed yet in this time:
> >> 	the async write logging is not satisfied by vhost-net
> >> 	Qemu needs a sync write
> >> 	a limit for locked pages from get_user_pages_fast()
> >> 	
> >> 		
> >> performance:
> >> 	using netperf with GSO/TSO disabled, 10G NIC, 
> >> 	disabled packet split mode, with raw socket case compared to vhost.
> >> 
> >> 	bandwidth goes from 1.1Gbps to 1.7Gbps
> >> 	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* Re: BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon: caller is netif_rx
From: Eric Dumazet @ 2010-04-15 10:29 UTC (permalink / raw)
  To: David Miller; +Cc: xiaosuo, therbert, eparis, netdev
In-Reply-To: <20100415.020246.218622820.davem@davemloft.net>

Le jeudi 15 avril 2010 à 02:02 -0700, David Miller a écrit :

> Since we keep coming back to this issue why don't we simply
> solve it forever?  Let's make netif_rx() work in all contexts
> and get rid of netif_rx_ni().
> 
> I think this is the thing to do because this whole netif_rx_ni()
> vs. netif_rx() thing was meant to be an optimization of sorts (this
> goes back to like 8+ years ago :-), and really I doubt it really
> matters on that level any more.
> 
> What do you think?

I was about to come to the same idea indeed.

netif_receive_skb() is supposed to be used for modern devices anyway,
avoiding netif_rx() overhead...





^ permalink raw reply

* Re: NULL pointer dereference panic in stable (2.6.33.2), amd64
From: Eric Dumazet @ 2010-04-15 10:37 UTC (permalink / raw)
  To: Denys Fedorysychenko; +Cc: David Miller, krkumar2, netdev
In-Reply-To: <201004151211.28315.nuclearcat@nuclearcat.com>

Le jeudi 15 avril 2010 à 12:11 +0300, Denys Fedorysychenko a écrit :
> Btw i have application using tun.

Could you add the following sanity test to catch the error?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..b67274a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -988,6 +988,7 @@ static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
+	WARN_ON(index >= dev->num_tx_queues);
 	return &dev->_tx[index];
 }
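
For readers following along outside a kernel tree, the idea behind the
sanity test can be sketched in userspace C. This is only an
illustration under stated assumptions: struct demo_dev, struct
demo_txq, demo_warn_on() and demo_get_tx_queue() are hypothetical
stand-ins, not kernel APIs. Unlike the WARN_ON() above, which only
reports and then still returns the out-of-range pointer, this sketch
also refuses the lookup by returning NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for struct netdev_queue and struct net_device. */
struct demo_txq {
	int unused;
};

struct demo_dev {
	unsigned int num_tx_queues;
	struct demo_txq *tx;
};

/* Counts how many times the check fired, so callers can inspect it. */
static unsigned int demo_warn_count;

/* Report a failed sanity check and hand the condition back to the caller. */
static int demo_warn_on(int cond, const char *expr, const char *file, int line)
{
	if (cond) {
		demo_warn_count++;
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	}
	return cond;
}

#define DEMO_WARN_ON(cond) demo_warn_on(!!(cond), #cond, __FILE__, __LINE__)

/* Bounds-checked queue lookup: warn and return NULL on a bad index. */
static struct demo_txq *demo_get_tx_queue(const struct demo_dev *dev,
					  unsigned int index)
{
	assert(dev != NULL);
	if (DEMO_WARN_ON(index >= dev->num_tx_queues))
		return NULL;
	return &dev->tx[index];
}
```

With a device that has two queues, index 1 resolves normally while
index 2 triggers the warning, which is the kind of caller bug the
test is meant to expose.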
 



^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 11:55 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, therbert, netdev, robert, xiaosuo, andi
In-Reply-To: <20100415.014857.168270765.davem@davemloft.net>

On Thu, 2010-04-15 at 01:48 -0700, David Miller wrote:

> A single-queue NIC is actually not a requirement, 
> RPS helps also in cases where you have 'N' application threads 
> and N is less than the number of CPUs your multi-queue NIC is 
> distributing traffic to.

sure..

> Moving the bulk of the input packet processing to the cpus where
> the applications actually sit had a non-trivial benefit.  

This is true regardless of rps though. 

> RFS takes this aspect to yet another level.

rfs looks quite interesting;-> I think with some twist it could be
used with multiqueue nics as well

> I think for the case where application locality is important,
> RPS/RFS can help regardless of cache details.

Generally true, as long as there's not much shared data across the cpus
or the cost of a cache miss is reasonably tolerable. The socket layer
just happens to be not sharing much with ingress packet path and
for a single processor Nehalem, the caching system works so well that
the cost of cache misses is not as important a variable. Everything
is on the same die including the MM controller etc.
I am speculating (didnt get any answer to the question i asked) that
people running rps use such hardware;->

I speculate again that it may be too costly to run rps on something like
a tigerton or intel clovertown where you have cores sharing/contending
for an FSB. If I can get answers to the question: "What h/ware are
people running?" i could be proven wrong.
[Note: I am not against RPS - i think it has its place; so i hope my
desire to find out when to use rps doesnt show as hostility towards
rps.]

cheers,
jamal

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 12:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Stephen Hemminger, netdev, robert, David Miller,
	Changli Gao, Andi Kleen
In-Reply-To: <1271278661.16881.1761.camel@edumazet-laptop>

On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:

> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
> 10Gigabit has 16 queues. It might be good to use less queues according
> to your results on some workloads, and eventually use RPS on a second
> layering.

Ok Eric, you seem to be running a system with two Nehalems
interconnected by QPI.
Is there any difference, performance-wise, between redirecting from
coreX to coreY when they are on the same Nehalem vs when you
are going across QPI?

cheers,
jamal


^ permalink raw reply

* [PATCH] net: small cleanup of lib8390
From: Nikanth Karthikesan @ 2010-04-15 12:21 UTC (permalink / raw)
  To: Paul Gortmaker, David S. Miller; +Cc: netdev, Al Viro, Jeff Garzik

Remove the always true #if 1. Also remove the unnecessary re-test of
ei_local->irqlock and the unreachable printk format string.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

diff --git a/drivers/net/lib8390.c b/drivers/net/lib8390.c
index 56f66f4..6e1dcd3 100644
--- a/drivers/net/lib8390.c
+++ b/drivers/net/lib8390.c
@@ -445,14 +445,14 @@ static irqreturn_t __ei_interrupt(int irq, void *dev_id)
 
 	if (ei_local->irqlock)
 	{
-#if 1 /* This might just be an interrupt for a PCI device sharing this line */
-		/* The "irqlock" check is only for testing. */
-		printk(ei_local->irqlock
-			   ? "%s: Interrupted while interrupts are masked! isr=%#2x imr=%#2x.\n"
-			   : "%s: Reentering the interrupt handler! isr=%#2x imr=%#2x.\n",
+		/*
+		 * This might just be an interrupt for a PCI device sharing
+		 * this line
+		 */
+		printk("%s: Interrupted while interrupts are masked!"
+			   " isr=%#2x imr=%#2x.\n",
 			   dev->name, ei_inb_p(e8390_base + EN0_ISR),
 			   ei_inb_p(e8390_base + EN0_IMR));
-#endif
 		spin_unlock(&ei_local->page_lock);
 		return IRQ_NONE;
 	}

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-15 12:32 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Tom Herbert, Stephen Hemminger, netdev, robert,
	David Miller, Andi Kleen
In-Reply-To: <1271333428.23780.3.camel@bigi>

On Thu, Apr 15, 2010 at 8:10 PM, jamal <hadi@cyberus.ca> wrote:
> On Wed, 2010-04-14 at 22:57 +0200, Eric Dumazet wrote:
>
>> On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
>> 10Gigabit has 16 queues. It might be good to use less queues according
>> to your results on some workloads, and eventually use RPS on a second
>> layering.
>

For historical reasons, we use Linux-2.6.18. Our company has several
products with CPU Xen, P4, or i7. Some of them are SMP, Multi-Core and
Multi-Threaded. We use a mechanism similar to dynamic weighted
RPS. The total throughput increases nearly linearly with the number
of worker threads (one worker thread per CPU).

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [PATCH 1/3] ipv4: ipmr: fix IP_MROUTE_MULTIPLE_TABLES Kconfig dependencies
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

IP_MROUTE_MULTIPLE_TABLES should depend on IP_MROUTE.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index be59774..8e3a1fd 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -252,7 +252,7 @@ config IP_MROUTE
 
 config IP_MROUTE_MULTIPLE_TABLES
 	bool "IP: multicast policy routing"
-	depends on IP_ADVANCED_ROUTER
+	depends on IP_MROUTE && IP_ADVANCED_ROUTER
 	select FIB_RULES
 	help
 	  Normally, a multicast router runs a userspace daemon and decides
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 3/3] ipv4: ipmr: fix NULL pointer deref during unres queue destruction
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

Fix an oversight in ipmr_destroy_unres() - the net pointer is
unconditionally initialized to NULL, resulting in a NULL pointer
dereference later on.

Fix by adding a net pointer to struct mr_table and using it in
ipmr_destroy_unres().

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 0643fb6..7d8a2bc 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -71,6 +71,9 @@
 
 struct mr_table {
 	struct list_head	list;
+#ifdef CONFIG_NET_NS
+	struct net		*net;
+#endif
 	u32			id;
 	struct sock		*mroute_sk;
 	struct timer_list	ipmr_expire_timer;
@@ -308,6 +311,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 	mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
 	if (mrt == NULL)
 		return NULL;
+	write_pnet(&mrt->net, net);
 	mrt->id = id;
 
 	/* Forwarding cache */
@@ -580,7 +584,7 @@ static inline void ipmr_cache_free(struct mfc_cache *c)
 
 static void ipmr_destroy_unres(struct mr_table *mrt, struct mfc_cache *c)
 {
-	struct net *net = NULL; //mrt->net;
+	struct net *net = read_pnet(&mrt->net);
 	struct sk_buff *skb;
 	struct nlmsgerr *e;
 
-- 
1.7.0.4
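
The accessor pattern used by this patch can be sketched in userspace C
as follows. It is a minimal illustration, not the kernel code: the
DEMO_NET_NS switch stands in for CONFIG_NET_NS, struct demo_table and
its helpers are hypothetical, and the write_pnet()/read_pnet()
definitions are modeled on my understanding of the kernel helpers, so
treat them as an assumption. The point is that the back pointer is
stored when the table is created and read through the accessor later,
instead of starting from a local initialized to NULL.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NET_NS 1	/* stand-in for CONFIG_NET_NS; 0 drops the field */

struct net {
	int id;
};

/* Fallback namespace used when per-object pointers are compiled out. */
struct net init_net = { 0 };

#if DEMO_NET_NS
static inline void write_pnet(struct net **pnet, struct net *net)
{
	*pnet = net;
}

static inline struct net *read_pnet(struct net *const *pnet)
{
	return *pnet;
}
#else
/* The macros never evaluate pnet, so the field does not need to exist. */
#define write_pnet(pnet, net)	do { (void)(net); } while (0)
#define read_pnet(pnet)		(&init_net)
#endif

/* Hypothetical table type carrying an optional namespace back pointer. */
struct demo_table {
#if DEMO_NET_NS
	struct net *net;
#endif
	unsigned int id;
};

/* Record the owning namespace at creation time. */
static void demo_table_init(struct demo_table *t, struct net *net,
			    unsigned int id)
{
	assert(t != NULL);
	write_pnet(&t->net, net);
	t->id = id;
}

/* Later users fetch the namespace from the table itself. */
static struct net *demo_table_net(const struct demo_table *t)
{
	return read_pnet(&t->net);
}
```

With DEMO_NET_NS set to 0 the same callers compile unchanged and
always get &init_net back, which is why the field can sit under an
#ifdef in the struct.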


^ permalink raw reply related

* [PATCH 0/3]: fixes for multicast routing rules
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev

Hi Dave,

the following three patches fix a few bugs introduced by the multicast routing
rule patches:

- a missing Kconfig dependency on IP_MROUTE

- a bug introduced by the list conversion, causing the list head to be treated
  as an element

- a NULL pointer dereference: the net pointer in ipmr_destroy_unres() was
  initialized to NULL. This patch was actually intended to be folded into
  the patch introducing multicast routing rules, but I missed it when
  rebasing to the current tree.

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6.git master

Thanks!

^ permalink raw reply

* [PATCH 2/3] ipv4: ipmr: fix invalid cache resolving when adding a non-matching entry
From: Patrick McHardy @ 2010-04-15 12:47 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271335678-20961-1-git-send-email-kaber@trash.net>

The patch to convert struct mfc_cache to list_heads (ipv4: ipmr: convert
struct mfc_cache to struct list_head) introduced a bug when adding new
cache entries that don't match any unresolved entries.

The unres queue is searched for a matching entry, which is then resolved.
When no matching entry is present, the iterator points to the head of the
list, but is treated as a matching entry. Use a separate variable to
indicate that a matching entry was found.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ipmr.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5df5fd7..0643fb6 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1089,12 +1089,14 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 	 *	Check to see if we resolved a queued list. If so we
 	 *	need to send on the frames and tidy up.
 	 */
+	found = false;
 	spin_lock_bh(&mfc_unres_lock);
 	list_for_each_entry(uc, &mrt->mfc_unres_queue, list) {
 		if (uc->mfc_origin == c->mfc_origin &&
 		    uc->mfc_mcastgrp == c->mfc_mcastgrp) {
 			list_del(&uc->list);
 			atomic_dec(&mrt->cache_resolve_queue_len);
+			found = true;
 			break;
 		}
 	}
@@ -1102,7 +1104,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 		del_timer(&mrt->ipmr_expire_timer);
 	spin_unlock_bh(&mfc_unres_lock);
 
-	if (uc) {
+	if (found) {
 		ipmr_cache_resolve(net, mrt, uc, c);
 		ipmr_cache_free(uc);
 	}
-- 
1.7.0.4
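
The pitfall fixed here is easy to reproduce outside the kernel. The
sketch below is a userspace illustration under stated assumptions:
the list helpers are minimal re-implementations rather than the
kernel's <linux/list.h>, and struct unres_entry and unres_take() are
hypothetical names. It shows why the loop cursor cannot serve as the
"found" indicator: when the walk of a circular list finishes without
a match, the cursor has wrapped around to the list head and is not
NULL, so a separate flag is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical queue entry; only the fields needed for matching. */
struct unres_entry {
	unsigned int origin;
	unsigned int mcastgrp;
	struct list_head list;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static void list_del(struct list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->next = item->prev = item;
}

/*
 * Unlink the first entry matching (origin, mcastgrp). Returns true and
 * stores the entry in *out only on a match; *out is left untouched
 * otherwise. The cursor "pos" equals "head" after an unsuccessful walk,
 * so it is never used to decide whether something was found.
 */
static bool unres_take(struct list_head *head, unsigned int origin,
		       unsigned int mcastgrp, struct unres_entry **out)
{
	struct list_head *pos;
	bool found = false;

	assert(head != NULL && out != NULL);
	for (pos = head->next; pos != head; pos = pos->next) {
		struct unres_entry *uc =
			container_of(pos, struct unres_entry, list);

		if (uc->origin == origin && uc->mcastgrp == mcastgrp) {
			list_del(&uc->list);
			*out = uc;
			found = true;
			break;
		}
	}
	return found;
}
```

A caller then acts on the entry only when unres_take() returns true,
mirroring the "if (found)" test in the patch.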


^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-15 12:50 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, Tom Herbert, netdev
In-Reply-To: <s2p412e6f7f1004150532y4b13a0bfgadab3e6f2f4aecd@mail.gmail.com>

On Thu, 2010-04-15 at 20:32 +0800, Changli Gao wrote:

> For historical reason, we use Linux-2.6.18. Our company have several
> products with CPU Xen, P4, or i7. Some of them are SMP, Multi-Core and
> Multi-Threaded. 

Thanks for sharing. How much more can you say? ;-> Do you have a paper
or description of some sort somewhere?

> We use the similar mechanism like dynamic weighted
> RPS. The total throughput is increased nearly linear with the number
> of the worker threads(one worker thread per CPU).

Other than the i7 - have you tried to run rps on the P4?

cheers,
jamal



^ permalink raw reply

* [net-next-2.6 PATCH 1/2] ipv6: cancel to setting local_df in ip6_xmit()
From: Shan Wei @ 2010-04-15 13:04 UTC (permalink / raw)
  To: David Miller, Herbert Xu, emils.tantilov
  Cc: kuznet, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, eric.dumazet, sri, netdev@vger.kernel.org,
	Shan Wei

commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has dropped ipfragok and set the local_df value properly.

So the change made by commit 77e2f1(ipv6: Fix ip6_xmit to
send fragments if ipfragok is true) is no longer needed.
This patch removes it.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 net/ipv6/ip6_output.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 16c4391..f3a847e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -231,10 +231,6 @@ int ip6_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
 	skb_reset_network_header(skb);
 	hdr = ipv6_hdr(skb);
 
-	/* Allow local fragmentation. */
-	if (ipfragok)
-		skb->local_df = 1;
-
 	/*
 	 *	Fill in the IPv6 header
 	 */
--
1.6.3.3 

^ permalink raw reply related
