Netdev List
 help / color / mirror / Atom feed
* Re: [PATCHv3] drivers/net/usb: Add new driver ipheth
From: Ben Hutchings @ 2010-04-02 17:21 UTC (permalink / raw)
  To: "L. Alberto Giménez"
  Cc: linux-kernel, netdev, linux-usb, oliver, linville, j.dumon,
	steve.glendinning, davem, gregkh, dgiagio, dborca
In-Reply-To: <4BB62620.3070807@sysvalve.es>

On Fri, 2010-04-02 at 19:15 +0200, "L. Alberto Giménez" wrote:
> On 04/01/2010 01:18 AM, Ben Hutchings wrote:
> > On Wed, 2010-03-31 at 21:42 +0200, L. Alberto Giménez wrote:
> > [...]
> >> --- /dev/null
> >> +++ b/drivers/net/usb/ipheth.c
> > [...]
> 
> Hi Ben,
> 
> Upstream has fixed several errors pointed out by you and Oliver (thanks
> for that), but some of them are still pending.
> 
> I will send patches on top of my last driver submission (if the proper
> way would be resubmit the whole code, please tell me. Anyway I need to
> clarify some doubts...

Since David Miller has not merged your original patch, you should send a
single new patch with the changes incorporated.

> >> +	usb_fill_bulk_urb(dev->tx_urb, udev,
> >> +			  usb_sndbulkpipe(udev, dev->bulk_out),
> >> +			  dev->tx_buf, IPHETH_BUF_SIZE,
> >> +			  ipheth_sndbulk_callback,
> >> +			  dev);
> >> +	dev->tx_urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
> >> +
> >> +	retval = usb_submit_urb(dev->tx_urb, GFP_ATOMIC);
> >> +	if (retval) {
> >> +		err("%s: usb_submit_urb: %d", __func__, retval);
> >> +		dev->stats.tx_errors++;
> >> +		dev_kfree_skb_irq(skb);
> >> +	} else {
> >> +		net->trans_start = jiffies;
> > 
> > No longer needed.
> 
> What is not longer needed? The assignment, the whole "else" branch? If
> the assignment is what is not needed, can I just remove that line, right?

The assignment is not needed.

> > [...]
> >> +#ifdef HAVE_NET_DEVICE_OPS
> >> +static const struct net_device_ops ipheth_netdev_ops = {
> >> +       .ndo_open = &ipheth_open,
> >> +       .ndo_stop = &ipheth_close,
> >> +       .ndo_start_xmit = &ipheth_tx,
> >> +       .ndo_tx_timeout = &ipheth_tx_timeout,
> >> +       .ndo_get_stats = &ipheth_stats,
> >> +};
> >> +#endif
> > 
> > Remove the #ifdef, there is no question whether we have net_device_ops.
> 
> Ok, I will just remove both #ifdefs, but why is that? Maybe in previous
> versions of the kernel the net_device_ops struct was introduced and now
> it's present no matter how you configure your kernel?
[...]

Correct, it is now mandatory.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] ethtool: RXHASH flag support (v2)
From: Jeff Garzik @ 2010-04-02 17:26 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20100402081515.1b963562@nehalam>

Both patches look great, thanks.  Will apply after Easter holiday...

	Jeff




^ permalink raw reply

* Receive steering and hash and cache misses
From: Stephen Hemminger @ 2010-04-02 17:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Although Receive Packet Steering can use a hardware generated receive hash
the device driver still causes an unnecessary cache miss on the interrupt
processing CPU.  The current Ethernet network device driver receive processing
has the device driver calling eth_type_trans() which causes a the
interrupt CPU to read the received frame header.

Is there some way the hardware receive hash value could be used to
steer to the receive CPU, then have the receive CPU find the Ethernet
type field (eth_type_trans)?

^ permalink raw reply

* [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
From: Sridhar Samudrala @ 2010-04-02 17:31 UTC (permalink / raw)
  To: Michael S. Tsirkin, Tom Lendacky; +Cc: netdev, kvm@vger.kernel.org

Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.

I am seeing better aggregated througput/latency when running netperf
across multiple guests or multiple interfaces in a guest in parallel
with this patch.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a6a88df..29aa80f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return r;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			&n->dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			&n->dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
 	f->private_data = n;
@@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7bd7a1e..243f4d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,8 +36,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	if (!((unsigned long)key & poll->mask))
 		return 0;
 
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 	return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+		     unsigned long mask, struct vhost_dev *dev)
 {
 	INIT_WORK(&poll->work, func);
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
+	poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	int i;
+
+	dev->wq = create_singlethread_workqueue("vhost");
+	if (!dev->wq)
+		return -ENOMEM;
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick,
-					POLLIN);
+					POLLIN, dev);
 	}
 	return 0;
 }
@@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1095,16 +1101,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..60fefd0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,10 +29,11 @@ struct vhost_poll {
 	/* struct which will handle all actual work. */
 	struct work_struct        work;
 	unsigned long		  mask;
+	struct vhost_dev	 *dev;
 };
 
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask);
+		     unsigned long mask, struct vhost_dev *dev);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -110,6 +111,7 @@ struct vhost_dev {
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
+	struct workqueue_struct *wq;
 };
 
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
@@ -136,9 +138,6 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
 
-int vhost_init(void);
-void vhost_cleanup(void);
-
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \





^ permalink raw reply related

* Re: [PATCH 1/2] phylib: Support phy module autoloading
From: Andy Fleming @ 2010-04-02 17:51 UTC (permalink / raw)
  To: David Woodhouse; +Cc: davem, netdev, ben
In-Reply-To: <1270206327.3101.2436.camel@macbook.infradead.org>

On Fri, Apr 2, 2010 at 6:05 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> We don't use the normal hotplug mechanism because it doesn't work. It will
> load the module some time after the device appears, but that's not good
> enough for us -- we need the driver loaded _immediately_ because otherwise
> the NIC driver may just abort and then the phy 'device' goes away.
>
> [bwh: s/phy/mdio/ in module alias, kerneldoc for struct mdio_device_id]
>
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Neat!

Acked-by: Andy Fleming <afleming@freescale.com>

^ permalink raw reply

* Re: [PATCHv3] drivers/net/usb: Add new driver ipheth
From: "L. Alberto Giménez" @ 2010-04-02 17:53 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: "L. Alberto Giménez", linux-kernel, netdev,
	linux-usb, oliver, linville, j.dumon, steve.glendinning, davem,
	gregkh, dgiagio, Daniel Borca
In-Reply-To: <1270228883.12516.140.camel@localhost>

On 04/02/2010 07:21 PM, Ben Hutchings wrote:
> On Fri, 2010-04-02 at 19:15 +0200, "L. Alberto Giménez" wrote:
>> On 04/01/2010 01:18 AM, Ben Hutchings wrote:
>>> On Wed, 2010-03-31 at 21:42 +0200, L. Alberto Giménez wrote:
>>> [...]
>>>> --- /dev/null
>>>> +++ b/drivers/net/usb/ipheth.c
>>> [...]
>>>> +	usb_fill_bulk_urb(dev->tx_urb, udev,
>>>> +			  usb_sndbulkpipe(udev, dev->bulk_out),
>>>> +			  dev->tx_buf, IPHETH_BUF_SIZE,
>>>> +			  ipheth_sndbulk_callback,
>>>> +			  dev);
>>>> +	dev->tx_urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
>>>> +
>>>> +	retval = usb_submit_urb(dev->tx_urb, GFP_ATOMIC);
>>>> +	if (retval) {
>>>> +		err("%s: usb_submit_urb: %d", __func__, retval);
>>>> +		dev->stats.tx_errors++;
>>>> +		dev_kfree_skb_irq(skb);
>>>> +	} else {
>>>> +		net->trans_start = jiffies;
>>> No longer needed.
>> What is not longer needed? The assignment, the whole "else" branch? If
>> the assignment is what is not needed, can I just remove that line, right?
> 
> The assignment is not needed.

Hi,

I've been looking into this and it seems that the net_device.trans_start
field is now deprecated in favor of net_device.rx_queue.trans_start
-rx_queue is a struct netdev_queue- (file include/linux/netdevice.h), as
states the comment:

512         /*
513          * please use this field instead of dev->trans_start
514          */
515         unsigned long           trans_start;

Reading LDD3 book, it says that the driver is reponsible for updating
trans_start (as well as trans_rx, but we're not talking about that one).

So, I guess that the LDD book is outdated (again ;) ), but what I don't
understand at all is wether the driver should keep updating the right
field (dev->rx_queue.trans_start) or if the fact that it's inside a
"queue" implies that the code that is responsible for that queue would
update the trans_start field by itself?

Thanks in advance.

Best regards,
-- 
L. Alberto Giménez
JabberID agimenez@jabber.sysvalve.es
GnuPG key ID 0x3BAABDE1

^ permalink raw reply

* Re: Receive steering and hash and cache misses
From: Tom Herbert @ 2010-04-02 17:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, netdev
In-Reply-To: <20100402102650.5bdb5b52@nehalam>

On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
>
> Although Receive Packet Steering can use a hardware generated receive hash
> the device driver still causes an unnecessary cache miss on the interrupt
> processing CPU.  The current Ethernet network device driver receive processing
> has the device driver calling eth_type_trans() which causes a the
> interrupt CPU to read the received frame header.
>

It should be possible to deduce the values set by eth_type_trans from
the RX descriptor along with the RX hash.  I'll post the patch getting
rxhash from bnx2x which does this.

> Is there some way the hardware receive hash value could be used to
> steer to the receive CPU, then have the receive CPU find the Ethernet
> type field (eth_type_trans)?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Any ideas about a crash on reboot with igb and intel_iommu?
From: Roland Dreier @ 2010-04-02 18:13 UTC (permalink / raw)
  To: netdev; +Cc: iommu, David Woodhouse
In-Reply-To: <adabpe3f3c4.fsf@roland-alpha.cisco.com>

So actually I found the following change that went into 2.6.31:

    commit 91615f765a2935b6cbae424b9eee1585ed681ae6
    Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Date:   Tue Jun 30 12:45:15 2009 +0000

        igb: fix unmap length bug
        
        driver was mixing NET_IP_ALIGN count bytes in map/unmap calls
        unevenly. Only map the bytes that the hardware might dma into

igb in 2.6.30.y is doing pci_map_single(<some length>) and doing
pci_unmap_single(<some other length>).

However I haven't been able to provoke the crash yet, even by bouncing
an igb interface (with VT-d turned on and with CONFIG_DMAR_DEFAULT_ON enabled).
Does this make sense as the sort of thing that might corrupt the iova
rbtree and lead to a crash in the __free_iova / rb_erase code?

Thanks,
  Roland
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply

* Re: [PATCH 5/5] netfilter: xt_TEE: have cloned packet travel through Xtables too
From: Jan Engelhardt @ 2010-04-02 18:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netfilter-devel, netdev
In-Reply-To: <4BB4A796.8010708@trash.net>


On Thursday 2010-04-01 16:03, Patrick McHardy wrote:
>>>>[detecting teed packets getting teed again by means of
>>>> iptables -A OUTPUT -j TEE]
>>>
>>> What's wrong with adding a reentrancy counter?
>> 
>> Sounds like a plan.

Should we be using a percpu variable, or is a simplistic
array ok too?

static bool tee_active;

target(...)
{
	if (tee_active[smp_processor_id()])
		return XT_CONTINUE;
	...
	if (tee_tg4_route(...)) {
		tee_active[cpu] = true;
		ip_local_out(skb);
		tee_active[cpu] = false;
	}
}



^ permalink raw reply

* Re: [PATCHv3] drivers/net/usb: Add new driver ipheth
From: "L. Alberto Giménez" @ 2010-04-02 18:23 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: linux-kernel, netdev, linux-usb, linville, j.dumon,
	steve.glendinning, davem, gregkh, dgiagio, dborca
In-Reply-To: <201003312233.26130.oliver@neukum.org>

On 03/31/2010 10:33 PM, Oliver Neukum wrote:
> Am Mittwoch, 31. März 2010 21:42:07 schrieb L. Alberto Giménez:

Hi Oliver,

Just like with Ben's comments I still have a couple of doubts about your
comments.


>> +
>> +static int ipheth_open(struct net_device *net)
>> +{
>> +	struct ipheth_device *dev = netdev_priv(net);
>> +	struct usb_device *udev = dev->udev;
>> +	int retval = 0;
>> +
>> +	usb_set_interface(udev, IPHETH_INTFNUM, IPHETH_ALT_INTFNUM);
>> +	usb_clear_halt(udev, usb_rcvbulkpipe(udev, dev->bulk_in));
>> +	usb_clear_halt(udev, usb_sndbulkpipe(udev, dev->bulk_out));
> 
> Is this really needed? If so, please add a comment.

I understand that usb_clear_halt is only needed when the device has
transmitted data, and as it is "open" time, we can assume that no
transmissions ere made, so we don't need to clear anything (aka: remove
both lines), am I right?


>> +
>> +	retval = ipheth_carrier_set(dev);
>> +	if (retval)
>> +		goto error;
>> +
>> +	retval = ipheth_rx_submit(dev, GFP_KERNEL);
>> +	if (retval)
>> +		goto error;
>> +
>> +	schedule_delayed_work(&dev->carrier_work, IPHETH_CARRIER_CHECK_TIMEOUT);
> 
> Does it make sense to start rx while you have no carrier?

Well, I have no clue about this one. I think that upstream developers
should take a look into this (Dario, Daniel, could you?) since I don't
have the knowledge to decide what to do about it.

But I assume that as with the previous one, we have just opened the
device and we aren't (yet) doing anything with it, so we shouldn't start rx?

>> +static void ipheth_disconnect(struct usb_interface *intf)
>> +{
>> +	struct ipheth_device *dev;
>> +
>> +	dev = usb_get_intfdata(intf);
>> +	if (dev != NULL) {
> 
> is this check needed?

Does usb_get_infdata always return not NULL? I haven't found anything
about it (just manual pages for the function, but can't spot if it
cannot return NULL). We disconnected the device, but I understand that
the kernel still has the information and the allocated memory, so the
cleanup code is still needed, isn't it?


> 
>> +static struct usb_driver ipheth_driver = {
>> +	.name =		"ipheth",
>> +	.probe =	ipheth_probe,
>> +	.disconnect =	ipheth_disconnect,
>> +	.id_table =	ipheth_table,
>> +	.supports_autosuspend = 0,
> 
> redundant

Why?

Thanks a lot in advance.


Best regards,
-- 
L. Alberto Giménez
JabberID agimenez@jabber.sysvalve.es
GnuPG key ID 0x3BAABDE1

^ permalink raw reply

* Re: [PATCH] rfs: Receive Flow Steering
From: Rick Jones @ 2010-04-02 18:25 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, Changli Gao, davem, netdev
In-Reply-To: <g2i65634d661004021045uff7c0e25ge7dfd17929bc9ee9@mail.gmail.com>

Tom Herbert wrote:
> 
> 
> On Fri, Apr 2, 2010 at 10:01 AM, Rick Jones <rick.jones2@hp.com 
> <mailto:rick.jones2@hp.com>> wrote:
> 
>     Eric Dumazet wrote:
> 
> 
>         Your claim of RPS being not good for applications is wrong, our test
>         results show an improvement as is. Maybe your applications dont
>         scale,
>         because of bad habits, or collidings heuristics, I dont know.
> 
> 
>     The progression in HP-UX was IPS (10.20) (aka RPS) then TOPS (11.0)
>     (aka RFS). We found that IPS was great for
>     single-flow-per-thread-of-execution stuff and that TOPS was better
>     for multiple-flow-per-thread-of-execution stuff.  It was long enough
>     ago now that I can safely say for one system-level benchmark not
>     known to be a "networking" benchmark, and without a massive kernel
>     component, TOPS was a 10% win.  Not too shabby.
> 
>     It wasn't that IPS wasn't good in its context - just that TOPS was
>     even better.
> 
> I would assume that with IPS threads would migrate to where packets were 
> being delivered thus giving the same sort of locality TOPS was 
> providing?  That would work great without any other constraints 
> (multiple flows per thread, thread CPU bindings, etc.).

Well... that depended - at the time, and still, we were and are also encouraging 
users and app designers to make copious use of processor/locality affinity (SMP 
and NUMA going back far longer in the RISC et al space than the x86 space).  So, 
it was and is entirely possible that the application thread of execution is 
hard-bound to a specific core/locality.  Also, I do not recall if HP-UX was as 
aggressive about waking a process/thread on the processor from which the wake-up 
came vs on the processor on which it last ran.

>     We also preferred the concept of the scheduler giving networking
>     clues as to where to process an application's packets rather than
>     networking trying to tell the scheduler.  There was some discussion
>     of out of order worries, but we were willing to trust to the basic
>     soundness of the scheduler - if it was moving threads around willy
>     nilly at a rate able to cause big packet reordering it had
>     fundamental problems that would have to be addressed anyway.
> 
> 
> I also think scheduler leading networking, like in RPS,  is generally 
> more scalable.  As for OOO packets, I've spent way to much time trying 
> to convince the bean-counters that a small number of them aren't 
> problematic :-), in the end it's just easier to not introduce new 
> mechanisms that will cause them!

So long as it doesn't drive you to produce new mechanisms heavier than they 
would have otherwise been.

The irony in the case of HP-UX IPS was that it was put in place in response to 
the severe out of order packet problems in HP-UX in 10.X before 10.20 - there 
were multiple netisr processes and only one netisr queue.  The other little 
tweak that came along in 10.20 with IPS, was inaddition to having a per 
processor (well, per core in today's parlance) netisr queue, the netisr would 
grab the entire queue under the one spinlock and work off of that.  That was 
nice because the code path became more efficient under load - more packets 
processed per spinlock/unlock pair.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: [PATCHv3] drivers/net/usb: Add new driver ipheth
From: Ben Hutchings @ 2010-04-02 18:35 UTC (permalink / raw)
  To: "L. Alberto Giménez"
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	oliver-GvhC2dPhHPQdnm+yROfE0A, linville-2XuSBdqkA4R54TAoqtyWWQ,
	j.dumon-x9gZzRpC1QbQT0dZR+AlfA, steve.glendinning-sdUf+H5yV5I,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gregkh-l3A5Bk7waGM,
	dgiagio-Re5JQEeQqe8AvxtiuMwx3w, Daniel Borca
In-Reply-To: <4BB62F33.1020507-lqZFv/KUvpAxAGwisGp4zA@public.gmane.org>

On Fri, 2010-04-02 at 19:53 +0200, "L. Alberto Giménez" wrote:
> On 04/02/2010 07:21 PM, Ben Hutchings wrote:
> > On Fri, 2010-04-02 at 19:15 +0200, "L. Alberto Giménez" wrote:
> >> On 04/01/2010 01:18 AM, Ben Hutchings wrote:
> >>> On Wed, 2010-03-31 at 21:42 +0200, L. Alberto Giménez wrote:
> >>> [...]
> >>>> --- /dev/null
> >>>> +++ b/drivers/net/usb/ipheth.c
> >>> [...]
> >>>> +	usb_fill_bulk_urb(dev->tx_urb, udev,
> >>>> +			  usb_sndbulkpipe(udev, dev->bulk_out),
> >>>> +			  dev->tx_buf, IPHETH_BUF_SIZE,
> >>>> +			  ipheth_sndbulk_callback,
> >>>> +			  dev);
> >>>> +	dev->tx_urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP;
> >>>> +
> >>>> +	retval = usb_submit_urb(dev->tx_urb, GFP_ATOMIC);
> >>>> +	if (retval) {
> >>>> +		err("%s: usb_submit_urb: %d", __func__, retval);
> >>>> +		dev->stats.tx_errors++;
> >>>> +		dev_kfree_skb_irq(skb);
> >>>> +	} else {
> >>>> +		net->trans_start = jiffies;
> >>> No longer needed.
> >> What is not longer needed? The assignment, the whole "else" branch? If
> >> the assignment is what is not needed, can I just remove that line, right?
> > 
> > The assignment is not needed.
> 
> Hi,
> 
> I've been looking into this and it seems that the net_device.trans_start
> field is now deprecated in favor of net_device.rx_queue.trans_start
[...]

The networking core updates netdev_queue::trans_start by calling
txq_trans_update() where necessary.  Drivers should not touch either
net_device::trans_start or netdev_queue::trans_start any more.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Receive steering and hash and cache misses
From: Stephen Hemminger @ 2010-04-02 18:54 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, netdev
In-Reply-To: <g2l65634d661004021059z94214a43v82409d15a0fb09b6@mail.gmail.com>

On Fri, 2 Apr 2010 10:59:43 -0700
Tom Herbert <therbert@google.com> wrote:

> On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> >
> > Although Receive Packet Steering can use a hardware generated receive hash
> > the device driver still causes an unnecessary cache miss on the interrupt
> > processing CPU.  The current Ethernet network device driver receive processing
> > has the device driver calling eth_type_trans() which causes a the
> > interrupt CPU to read the received frame header.
> >
> 
> It should be possible to deduce the values set by eth_type_trans from
> the RX descriptor along with the RX hash.  I'll post the patch getting
> rxhash from bnx2x which does this.
> 

On sky2, I get only RSS, Checksum, and length from descriptor info.

^ permalink raw reply

* Re: Receive steering and hash and cache misses
From: Eric Dumazet @ 2010-04-02 19:36 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Tom Herbert, netdev
In-Reply-To: <20100402115439.348575c9@nehalam>

Le vendredi 02 avril 2010 à 11:54 -0700, Stephen Hemminger a écrit :
> On Fri, 2 Apr 2010 10:59:43 -0700
> Tom Herbert <therbert@google.com> wrote:
> 
> > On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
> > <shemminger@vyatta.com> wrote:
> > >
> > > Although Receive Packet Steering can use a hardware generated receive hash
> > > the device driver still causes an unnecessary cache miss on the interrupt
> > > processing CPU.  The current Ethernet network device driver receive processing
> > > has the device driver calling eth_type_trans() which causes a the
> > > interrupt CPU to read the received frame header.
> > >
> > 
> > It should be possible to deduce the values set by eth_type_trans from
> > the RX descriptor along with the RX hash.  I'll post the patch getting
> > rxhash from bnx2x which does this.
> > 
> 
> On sky2, I get only RSS, Checksum, and length from descriptor info.

Doesnt sky2 also provide vlan id (OP_RXVLAN/OP_RXCHKSVLAN) ?

A future version of hardware could provide more info perhaps...

Must eth_type_trans() be done *before* netif_receive_skb() ?

If a device provides a rxhash, maybe we can delay eth_type_trans() too.
If not, we need to access IP header anyway in the first cpu.



^ permalink raw reply

* Re: [PATCH] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-02 19:43 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <1270225735.11099.20.camel@edumazet-laptop>

Le vendredi 02 avril 2010 à 18:28 +0200, Eric Dumazet a écrit :
> Some more thoughts ...
> 
> Do we really want to call inet_rps_record_flow(sk) from inet_sendmsg() &
> inet_sendpage() ?
> 
> This seems not necessary to me...
> 

I think I get it, you want to catch unidirectional flows (apps that only
send data), and let ACK packets be processed by the sender cpu :=)


I did following patch to remove one conditional branch :

 net/core/dev.c |   20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)


diff --git a/net/core/dev.c b/net/core/dev.c
index 0a9ced8..cfe46d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2225,8 +2225,6 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	u16 tcpu;
 	u32 addr1, addr2, ports, ihl;
 
-	*rflowp = NULL;
-
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
@@ -2443,7 +2441,7 @@ int netif_rx(struct sk_buff *skb)
 {
 	unsigned int qtail;
 #ifdef CONFIG_RPS
-	struct rps_dev_flow *rflow;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
 	int cpu, err;
 #endif
 
@@ -2461,10 +2459,7 @@ int netif_rx(struct sk_buff *skb)
 	if (cpu < 0)
 		cpu = smp_processor_id();
 
-	err = enqueue_to_backlog(skb, cpu, &qtail);
-
-	if (rflow)
-		rflow->last_qtail = qtail;
+	err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 
 	rcu_read_unlock();
 
@@ -2839,7 +2834,7 @@ out:
 int netif_receive_skb(struct sk_buff *skb)
 {
 #ifdef CONFIG_RPS
-	struct rps_dev_flow *rflow;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
 	int cpu, err;
 
 	rcu_read_lock();
@@ -2848,13 +2843,8 @@ int netif_receive_skb(struct sk_buff *skb)
 
 	if (cpu < 0)
 		err = __netif_receive_skb(skb);
-	else {
-		unsigned int qtail;
-
-		err = enqueue_to_backlog(skb, cpu, &qtail);
-		if (rflow)
-			rflow->last_qtail = qtail;
-	}
+	else
+		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 
 	rcu_read_unlock();
 



^ permalink raw reply related

* Unaligned access in xfrm_user:copy_to_user_state
From: Jan Engelhardt @ 2010-04-02 20:18 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller

Hi,


since we seem to be dealing with unaligned access quite recently, here's 
my turn in reporting one:

22:09 ares:/etc # uname -a
Linux ares 2.6.34-rc1 #17 SMP Thu Mar 25 00:08:55 CET 2010 sparc64 
sparc64 sparc64 GNU/Linux
(This is kaber/nf-next)

Apr  2 22:09:53 ares kernel: Kernel unaligned access at TPC[101a0c18] 
copy_to_user_state+0x18/0x120 [xfrm_user]

0000000000000c00 <copy_to_user_state>:
     c00:       9d e3 bf 50     save  %sp, -176, %sp
     c04:       ce 5e 20 80     ldx  [ %i0 + 0x80 ], %g7
     c08:       86 06 20 80     add  %i0, 0x80, %g3
     c0c:       84 06 60 38     add  %i1, 0x38, %g2
     c10:       82 06 20 98     add  %i0, 0x98, %g1
     c14:       90 06 60 60     add  %i1, 0x60, %o0
     c18:       ce 76 60 38     stx  %g7, [ %i1 + 0x38 ]

That happens when strongswan is trying to handle a new incoming tunnel 
request between two IPv6 endpoints (it does not seem to get triggered
for IPv4).

^ permalink raw reply

* Re: [BUG] latest net-next-2.6 doesnt fly
From: David Miller @ 2010-04-02 20:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, fujita.tomonori
In-Reply-To: <1270202304.1989.14.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Apr 2010 11:58:24 +0200

> [PATCH net-next-2.6] net: illegal_highdma() fix
> 
> Followup to commit 5acbbd428db47b12f137a8a2aa96b3c0a96b744e
> (net: change illegal_highdma to use dma_mask)
> 
> If dev->dev.parent is NULL, we should not try to dereference it.
> 
> Dont force inline illegal_highdma() as its pretty big now.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks for tracking this down.

^ permalink raw reply

* Re: Unaligned access in xfrm_user:copy_to_user_state
From: David Miller @ 2010-04-02 21:03 UTC (permalink / raw)
  To: jengelh; +Cc: netdev
In-Reply-To: <alpine.LSU.2.01.1004022210270.30875@obet.zrqbmnf.qr>

From: Jan Engelhardt <jengelh@medozas.de>
Date: Fri, 2 Apr 2010 22:18:59 +0200 (CEST)

> since we seem to be dealing with unaligned access quite recently, here's 
> my turn in reporting one:
> 
> 22:09 ares:/etc # uname -a
> Linux ares 2.6.34-rc1 #17 SMP Thu Mar 25 00:08:55 CET 2010 sparc64 
> sparc64 sparc64 GNU/Linux
> (This is kaber/nf-next)
> 
> Apr  2 22:09:53 ares kernel: Kernel unaligned access at TPC[101a0c18] 
> copy_to_user_state+0x18/0x120 [xfrm_user]
> 
> 0000000000000c00 <copy_to_user_state>:
>      c00:       9d e3 bf 50     save  %sp, -176, %sp
>      c04:       ce 5e 20 80     ldx  [ %i0 + 0x80 ], %g7
>      c08:       86 06 20 80     add  %i0, 0x80, %g3
>      c0c:       84 06 60 38     add  %i1, 0x38, %g2
>      c10:       82 06 20 98     add  %i0, 0x98, %g1
>      c14:       90 06 60 60     add  %i1, 0x60, %o0
>      c18:       ce 76 60 38     stx  %g7, [ %i1 + 0x38 ]
> 
> That happens when strongswan is trying to handle a new incoming tunnel 
> request between two IPv6 endpoints (it does not seem to get triggered
> for IPv4).

Yes, we need to "void *" untype the arguments to memcpy so that
GCC doesn't inline the thing.

Patches welcome.

^ permalink raw reply

* Re: [PATCH 1/2] phylib: Support phy module autoloading
From: David Miller @ 2010-04-02 21:31 UTC (permalink / raw)
  To: dwmw2; +Cc: netdev, ben
In-Reply-To: <1270206327.3101.2436.camel@macbook.infradead.org>

From: David Woodhouse <dwmw2@infradead.org>
Date: Fri, 02 Apr 2010 12:05:27 +0100

> We don't use the normal hotplug mechanism because it doesn't work. It will
> load the module some time after the device appears, but that's not good
> enough for us -- we need the driver loaded _immediately_ because otherwise
> the NIC driver may just abort and then the phy 'device' goes away.
> 
> [bwh: s/phy/mdio/ in module alias, kerneldoc for struct mdio_device_id]
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] phylib: Add module table to all existing phy drivers
From: David Miller @ 2010-04-02 21:31 UTC (permalink / raw)
  To: dwmw2; +Cc: netdev, ben
In-Reply-To: <1270206356.3101.2437.camel@macbook.infradead.org>

From: David Woodhouse <dwmw2@infradead.org>
Date: Fri, 02 Apr 2010 12:05:56 +0100

> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

Applied, thanks for doing the notes and providing backporting
notes :-)

^ permalink raw reply

* [PATCH] SCTP: Change to use ipv6_addr_copy()
From: Brian Haley @ 2010-04-02 21:38 UTC (permalink / raw)
  To: vladislav.yasevich, davem; +Cc: netdev, linux-sctp

Change SCTP IPv6 code to use ipv6_addr_copy()

Signed-off-by: Brian Haley <brian.haley@hp.com>
---
 net/sctp/ipv6.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 216d88f..db1c767 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -364,7 +364,7 @@ static void sctp_v6_copy_addrlist(struct list_head *addrlist,
 		if (addr) {
 			addr->a.v6.sin6_family = AF_INET6;
 			addr->a.v6.sin6_port = 0;
-			addr->a.v6.sin6_addr = ifp->addr;
+			ipv6_addr_copy(&addr->a.v6.sin6_addr, &ifp->addr);
 			addr->a.v6.sin6_scope_id = dev->ifindex;
 			addr->valid = 1;
 			INIT_LIST_HEAD(&addr->list);
@@ -405,7 +405,7 @@ static void sctp_v6_from_sk(union sctp_addr *addr, struct sock *sk)
 {
 	addr->v6.sin6_family = AF_INET6;
 	addr->v6.sin6_port = 0;
-	addr->v6.sin6_addr = inet6_sk(sk)->rcv_saddr;
+	ipv6_addr_copy(&addr->v6.sin6_addr, &inet6_sk(sk)->rcv_saddr);
 }
 
 /* Initialize sk->sk_rcv_saddr from sctp_addr. */
@@ -418,7 +418,7 @@ static void sctp_v6_to_sk_saddr(union sctp_addr *addr, struct sock *sk)
 		inet6_sk(sk)->rcv_saddr.s6_addr32[3] =
 			addr->v4.sin_addr.s_addr;
 	} else {
-		inet6_sk(sk)->rcv_saddr = addr->v6.sin6_addr;
+		ipv6_addr_copy(&inet6_sk(sk)->rcv_saddr, &addr->v6.sin6_addr);
 	}
 }
 
@@ -431,7 +431,7 @@ static void sctp_v6_to_sk_daddr(union sctp_addr *addr, struct sock *sk)
 		inet6_sk(sk)->daddr.s6_addr32[2] = htonl(0x0000ffff);
 		inet6_sk(sk)->daddr.s6_addr32[3] = addr->v4.sin_addr.s_addr;
 	} else {
-		inet6_sk(sk)->daddr = addr->v6.sin6_addr;
+		ipv6_addr_copy(&inet6_sk(sk)->daddr, &addr->v6.sin6_addr);
 	}
 }
 
-- 
1.5.4.3


^ permalink raw reply related

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Sridhar Samudrala @ 2010-04-02 23:51 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, mst, jdike, davem
In-Reply-To: <1270193100-6769-1-git-send-email-xiaohui.xin@intel.com>

On Fri, 2010-04-02 at 15:25 +0800, xiaohui.xin@intel.com wrote:
> The idea is simple, just to pin the guest VM user space and then
> let host NIC driver has the chance to directly DMA to it. 
> The patches are based on vhost-net backend driver. We add a device
> which provides proto_ops as sendmsg/recvmsg to vhost-net to
> send/recv directly to/from the NIC driver. KVM guest who use the
> vhost-net backend may bind any ethX interface in the host side to
> get copyless data transfer thru guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

Thanks
Sridhar
> 
> The scenario is like this:
> 
> The guest virtio-net driver submits multiple requests thru vhost-net
> backend driver to the kernel. And the requests are queued and then
> completed after corresponding actions in h/w are done.
> 
> For read, user space buffers are dispensed to NIC driver for rx when
> a page constructor API is invoked. Means NICs can allocate user buffers
> from a page constructor. We add a hook in netif_receive_skb() function
> to intercept the incoming packets, and notify the zero-copy device.
> 
> For write, the zero-copy deivce may allocates a new host skb and puts
> payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
> The request remains pending until the skb is transmitted by h/w.
> 
> Here, we have ever considered 2 ways to utilize the page constructor
> API to dispense the user buffers.
> 
> One:	Modify __alloc_skb() function a bit, it can only allocate a 
> 	structure of sk_buff, and the data pointer is pointing to a 
> 	user buffer which is coming from a page constructor API.
> 	Then the shinfo of the skb is also from guest.
> 	When packet is received from hardware, the skb->data is filled
> 	directly by h/w. What we have done is in this way.
> 
> 	Pros:	We can avoid any copy here.
> 	Cons:	Guest virtio-net driver needs to allocate skb as almost
> 		the same method with the host NIC drivers, say the size
> 		of netdev_alloc_skb() and the same reserved space in the
> 		head of skb. Many NIC drivers are the same with guest and
> 		ok for this. But some lastest NIC drivers reserves special
> 		room in skb head. To deal with it, we suggest to provide
> 		a method in guest virtio-net driver to ask for parameter
> 		we interest from the NIC driver when we know which device 
> 		we have bind to do zero-copy. Then we ask guest to do so.
> 		Is that reasonable?
> 
> Two:	Modify driver to get user buffer allocated from a page constructor
> 	API(to substitute alloc_page()), the user buffer are used as payload
> 	buffers and filled by h/w directly when packet is received. Driver
> 	should associate the pages with skb (skb_shinfo(skb)->frags). For 
> 	the head buffer side, let host allocates skb, and h/w fills it. 
> 	After that, the data filled in host skb header will be copied into
> 	guest header buffer which is submitted together with the payload buffer.
> 
> 	Pros:	We could less care the way how guest or host allocates their
> 		buffers.
> 	Cons:	We still need a bit copy here for the skb header.
> 
> We are not sure which way is the better here. This is the first thing we want
> to get comments from the community. We wish the modification to the network
> part will be generic which not used by vhost-net backend only, but a user
> application may use it as well when the zero-copy device may provides async
> read/write operations later.
> 
> Please give comments especially for the network part modifications.
> 
> 
> We provide multiple submits and asynchronous notifiicaton to 
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But for simple
> test with netperf, we found bindwidth up and CPU % up too,
> but the bindwidth up ratio is much more than CPU % up ratio.
> 
> What we have not done yet:
> 	packet split support
> 	To support GRO
> 	Performance tuning
> 
> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchroush mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it looks generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> what we have done in v2:
> 	
> 	remove most of the RCU usage, since the ctor pointer is only
> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> 	stopped to get good cleanup(all outstanding requests are finished),
> 	so the ctor pointer cannot be raced into wrong situation.
> 
> 	Remove the struct vhost_notifier with struct kiocb.
> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> 	via sendmsg/recvmsg.
> 
> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> 
> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> 
> 
> Comments not addressed yet in this time:
> 	the async write logging is not satified by vhost-net
> 	Qemu needs a sync write
> 	a limit for locked pages from get_user_pages_fast()
> 	
> 		
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC, 
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bindwidth will be from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH 0/6] tagged sysfs support
From: Ben Hutchings @ 2010-04-03  0:58 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Greg KH, linux-kernel,
	Tejun Heo, Cornelia Huck, linux-fsdevel, Eric Dumazet,
	Benjamin LaHaise, Serge Hallyn, netdev
In-Reply-To: <s2hac3eb2511003302251rcbae8767ne21e9daf1546c849@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2414 bytes --]

On Wed, 2010-03-31 at 07:51 +0200, Kay Sievers wrote:
> On Wed, Mar 31, 2010 at 01:04, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Kay Sievers <kay.sievers@vrfy.org> writes:
> >> On Tue, Mar 30, 2010 at 20:30, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>>
> >>> The main short coming of using multiple network namespaces today
> >>> is that only network devices for the primary network namespaces
> >>> can be put in the kobject layer and sysfs.
> >>>
> >>> This is essentially the earlier version of this patchset that was
> >>> reviewed before, just now on top of a version of sysfs that doesn't
> >>> need cleanup patches to support it.
> >>
> >> Just to check if we are not in conflict with planned changes, and how
> >> to possibly handle them:
> >>
> >> There is the plan and ongoing work to unify classes and buses, export
> >> them at /sys/subsystem in the same layout of the current /sys/bus/.
> >> The decision to export buses and classes as two different things
> >> (which they aren't) is the last major piece in the sysfs layout which
> >> needs to be fixed.
> >
> > Interesting.  We will symlinks ie:
> > /sys/class -> /sys/subsystem
> > /sys/bus -> /sys/subsystem
> > to keep from breaking userspace.
> 
> Yeah, /sys/bus/, which is the only sane layout of the needlessly
> different 3 versions of the same thing (bus, class, block).
[...]

block vs class/block is arguable, but as for abstracting the difference
between bus and class... why?

Each bus defines a device interface covering enumeration,
identification, power management and various aspects of their connection
to the host.  This interface is implemented by the bus driver.

Each class defines a device interface covering functionality provided to
user-space or higher level kernel components (block interface to
filesystems, net driver interface to the networking core, etc).  This
interface is implemented by multiple device-specific drivers.

So while buses and classes both define device interfaces, they are
fundamentally different types of interface.  And there are 'subsystems'
that don't have devices at all (time, RCU, perf, ...).  If you're going
to expose the set of subsystems, don't they belong in there?  But then,
what would you put in their directories?

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* Re: Receive steering and hash and cache misses
From: Stephen Hemminger @ 2010-04-02 22:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, netdev
In-Reply-To: <1270236991.1978.17.camel@edumazet-laptop>

On Fri, 02 Apr 2010 21:36:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 02 avril 2010 à 11:54 -0700, Stephen Hemminger a écrit :
> > On Fri, 2 Apr 2010 10:59:43 -0700
> > Tom Herbert <therbert@google.com> wrote:
> > 
> > > On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
> > > <shemminger@vyatta.com> wrote:
> > > >
> > > > Although Receive Packet Steering can use a hardware generated receive hash
> > > > the device driver still causes an unnecessary cache miss on the interrupt
> > > > processing CPU.  The current Ethernet network device driver receive processing
> > > > has the device driver calling eth_type_trans() which causes a the
> > > > interrupt CPU to read the received frame header.
> > > >
> > > 
> > > It should be possible to deduce the values set by eth_type_trans from
> > > the RX descriptor along with the RX hash.  I'll post the patch getting
> > > rxhash from bnx2x which does this.
> > > 
> > 
> > On sky2, I get only RSS, Checksum, and length from descriptor info.
> 
> Doesnt sky2 also provide vlan id (OP_RXVLAN/OP_RXCHKSVLAN) ?
> 
> A future version of hardware could provide more info perhaps...

I have only some information from Marvell and no idea what they might
do with future hardware. 

> Must eth_type_trans() be done *before* netif_receive_skb() ?

In current arch yes, because netif_receive_skb is used for multiple
hardware types and the backlog queue could theoretically contain
skb's of different hardware types.  Also GRO works against RPS
since it does lookup work on the initial CPU and dirties the skb.

This is mostly theoretical at this point the bigger performance bottlenecks
are farther down the packet processing chain.




^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-03  3:38 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <1270126340-30181-2-git-send-email-timo.teras@iki.fi>

On Thu, Apr 01, 2010 at 03:52:17PM +0300, Timo Teras wrote:
> This allows to validate the cached object before returning it.
> It also allows to destruct object properly, if the last reference
> was held in flow cache. This is also a prepartion for caching
> bundles in the flow cache.
> 
> In return for virtualizing the methods, we save on:
> - not having to regenerate the whole flow cache on policy removal:
>   each flow matching a killed policy gets refreshed as the getter
>   function notices it smartly.
> - we do not have to call flow_cache_flush from policy gc, since the
>   flow cache now properly deletes the object if it had any references
> 
> Signed-off-by: Timo Teras <timo.teras@iki.fi>

With repsect to removing the cache flush upon policy removal,
what takes care of the timely purging of the corresponding cache
entries if no new traffic comes through?

The concern is that if they're not purged in the absence of new
traffic, then we may hold references on all sorts of objects,
leading to consequences such as the inability to unregister net
devices.
  
>  struct flow_cache_entry {
> -	struct flow_cache_entry	*next;
> -	u16			family;
> -	u8			dir;
> -	u32			genid;
> -	struct flowi		key;
> -	void			*object;
> -	atomic_t		*object_ref;
> +	struct flow_cache_entry *	next;

Please follow the existing coding style.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox