Netdev List

* Re: Receive steering and hash and cache misses
From: Stephen Hemminger @ 2010-04-02 18:54 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, netdev
In-Reply-To: <g2l65634d661004021059z94214a43v82409d15a0fb09b6@mail.gmail.com>

On Fri, 2 Apr 2010 10:59:43 -0700
Tom Herbert <therbert@google.com> wrote:

> On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> >
> > Although Receive Packet Steering can use a hardware generated receive hash
> > the device driver still causes an unnecessary cache miss on the interrupt
> > processing CPU.  The current Ethernet network device driver receive processing
> > has the device driver calling eth_type_trans() which causes the
> > interrupt CPU to read the received frame header.
> >
> 
> It should be possible to deduce the values set by eth_type_trans from
> the RX descriptor along with the RX hash.  I'll post the patch getting
> rxhash from bnx2x which does this.
> 

On sky2, I get only RSS, Checksum, and length from descriptor info.

^ permalink raw reply

* Re: Receive steering and hash and cache misses
From: Eric Dumazet @ 2010-04-02 19:36 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Tom Herbert, netdev
In-Reply-To: <20100402115439.348575c9@nehalam>

On Friday 02 April 2010 at 11:54 -0700, Stephen Hemminger wrote:
> On Fri, 2 Apr 2010 10:59:43 -0700
> Tom Herbert <therbert@google.com> wrote:
> 
> > On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
> > <shemminger@vyatta.com> wrote:
> > >
> > > Although Receive Packet Steering can use a hardware generated receive hash
> > > the device driver still causes an unnecessary cache miss on the interrupt
> > > processing CPU.  The current Ethernet network device driver receive processing
> > > > has the device driver calling eth_type_trans() which causes the
> > > interrupt CPU to read the received frame header.
> > >
> > 
> > It should be possible to deduce the values set by eth_type_trans from
> > the RX descriptor along with the RX hash.  I'll post the patch getting
> > rxhash from bnx2x which does this.
> > 
> 
> On sky2, I get only RSS, Checksum, and length from descriptor info.

Doesn't sky2 also provide the VLAN ID (OP_RXVLAN/OP_RXCHKSVLAN)?

A future version of hardware could provide more info perhaps...

Must eth_type_trans() be done *before* netif_receive_skb() ?

If a device provides an rxhash, maybe we can delay eth_type_trans() too.
If not, we need to access the IP header on the first CPU anyway.



^ permalink raw reply

* Re: [PATCH] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-02 19:43 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <1270225735.11099.20.camel@edumazet-laptop>

On Friday 02 April 2010 at 18:28 +0200, Eric Dumazet wrote:
> Some more thoughts ...
> 
> Do we really want to call inet_rps_record_flow(sk) from inet_sendmsg() &
> inet_sendpage() ?
> 
> This seems not necessary to me...
> 

I think I get it, you want to catch unidirectional flows (apps that only
send data), and let ACK packets be processed by the sender cpu :=)


I wrote the following patch to remove one conditional branch:

 net/core/dev.c |   20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)


diff --git a/net/core/dev.c b/net/core/dev.c
index 0a9ced8..cfe46d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2225,8 +2225,6 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	u16 tcpu;
 	u32 addr1, addr2, ports, ihl;
 
-	*rflowp = NULL;
-
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
 		if (unlikely(index >= dev->num_rx_queues)) {
@@ -2443,7 +2441,7 @@ int netif_rx(struct sk_buff *skb)
 {
 	unsigned int qtail;
 #ifdef CONFIG_RPS
-	struct rps_dev_flow *rflow;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
 	int cpu, err;
 #endif
 
@@ -2461,10 +2459,7 @@ int netif_rx(struct sk_buff *skb)
 	if (cpu < 0)
 		cpu = smp_processor_id();
 
-	err = enqueue_to_backlog(skb, cpu, &qtail);
-
-	if (rflow)
-		rflow->last_qtail = qtail;
+	err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 
 	rcu_read_unlock();
 
@@ -2839,7 +2834,7 @@ out:
 int netif_receive_skb(struct sk_buff *skb)
 {
 #ifdef CONFIG_RPS
-	struct rps_dev_flow *rflow;
+	struct rps_dev_flow voidflow, *rflow = &voidflow;
 	int cpu, err;
 
 	rcu_read_lock();
@@ -2848,13 +2843,8 @@ int netif_receive_skb(struct sk_buff *skb)
 
 	if (cpu < 0)
 		err = __netif_receive_skb(skb);
-	else {
-		unsigned int qtail;
-
-		err = enqueue_to_backlog(skb, cpu, &qtail);
-		if (rflow)
-			rflow->last_qtail = qtail;
-	}
+	else
+		err = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 
 	rcu_read_unlock();
 



^ permalink raw reply related
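
The patch above drops the `if (rflow)` test by pointing `rflow` at a throwaway stack object, so the store of the queue tail becomes unconditional. A minimal user-space sketch of that pattern follows. The names `struct flow`, `pick_flow()`, `enqueue()` and `receive()` are hypothetical stand-ins for illustration, not the kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rps_dev_flow. */
struct flow {
	unsigned int last_qtail;
};

/* Hypothetical stand-in for enqueue_to_backlog(): reports the queue tail. */
static int enqueue(unsigned int *qtail)
{
	static unsigned int tail;

	*qtail = ++tail;
	return 0;
}

/*
 * Hypothetical stand-in for get_rps_cpu(): overwrites *flowp only when a
 * real flow entry exists and otherwise leaves the caller's default alone.
 */
static void pick_flow(struct flow *table, int have_flow, struct flow **flowp)
{
	if (have_flow)
		*flowp = table;
}

static int receive(struct flow *table, int have_flow)
{
	/* Default to a throwaway object so the store below needs no branch. */
	struct flow voidflow, *flow = &voidflow;

	pick_flow(table, have_flow, &flow);
	assert(flow != NULL);
	return enqueue(&flow->last_qtail);
}
```

The trade-off in this sketch is one extra stack slot and a wasted store when no flow matches, in exchange for removing the conditional on the hot path.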

* Unaligned access in xfrm_user:copy_to_user_state
From: Jan Engelhardt @ 2010-04-02 20:18 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller

Hi,


Since we seem to have been dealing with unaligned accesses quite a lot recently,
here's my turn to report one:

22:09 ares:/etc # uname -a
Linux ares 2.6.34-rc1 #17 SMP Thu Mar 25 00:08:55 CET 2010 sparc64 
sparc64 sparc64 GNU/Linux
(This is kaber/nf-next)

Apr  2 22:09:53 ares kernel: Kernel unaligned access at TPC[101a0c18] 
copy_to_user_state+0x18/0x120 [xfrm_user]

0000000000000c00 <copy_to_user_state>:
     c00:       9d e3 bf 50     save  %sp, -176, %sp
     c04:       ce 5e 20 80     ldx  [ %i0 + 0x80 ], %g7
     c08:       86 06 20 80     add  %i0, 0x80, %g3
     c0c:       84 06 60 38     add  %i1, 0x38, %g2
     c10:       82 06 20 98     add  %i0, 0x98, %g1
     c14:       90 06 60 60     add  %i1, 0x60, %o0
     c18:       ce 76 60 38     stx  %g7, [ %i1 + 0x38 ]

That happens when strongswan is trying to handle a new incoming tunnel 
request between two IPv6 endpoints (it does not seem to get triggered
for IPv4).

^ permalink raw reply

* Re: [BUG] latest net-next-2.6 doesnt fly
From: David Miller @ 2010-04-02 20:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, fujita.tomonori
In-Reply-To: <1270202304.1989.14.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Apr 2010 11:58:24 +0200

> [PATCH net-next-2.6] net: illegal_highdma() fix
> 
> Followup to commit 5acbbd428db47b12f137a8a2aa96b3c0a96b744e
> (net: change illegal_highdma to use dma_mask)
> 
> If dev->dev.parent is NULL, we should not try to dereference it.
> 
> Don't force inline illegal_highdma() as it's pretty big now.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks for tracking this down.

^ permalink raw reply

* Re: Unaligned access in xfrm_user:copy_to_user_state
From: David Miller @ 2010-04-02 21:03 UTC (permalink / raw)
  To: jengelh; +Cc: netdev
In-Reply-To: <alpine.LSU.2.01.1004022210270.30875@obet.zrqbmnf.qr>

From: Jan Engelhardt <jengelh@medozas.de>
Date: Fri, 2 Apr 2010 22:18:59 +0200 (CEST)

> Since we seem to have been dealing with unaligned accesses quite a lot recently,
> here's my turn to report one:
> 
> 22:09 ares:/etc # uname -a
> Linux ares 2.6.34-rc1 #17 SMP Thu Mar 25 00:08:55 CET 2010 sparc64 
> sparc64 sparc64 GNU/Linux
> (This is kaber/nf-next)
> 
> Apr  2 22:09:53 ares kernel: Kernel unaligned access at TPC[101a0c18] 
> copy_to_user_state+0x18/0x120 [xfrm_user]
> 
> 0000000000000c00 <copy_to_user_state>:
>      c00:       9d e3 bf 50     save  %sp, -176, %sp
>      c04:       ce 5e 20 80     ldx  [ %i0 + 0x80 ], %g7
>      c08:       86 06 20 80     add  %i0, 0x80, %g3
>      c0c:       84 06 60 38     add  %i1, 0x38, %g2
>      c10:       82 06 20 98     add  %i0, 0x98, %g1
>      c14:       90 06 60 60     add  %i1, 0x60, %o0
>      c18:       ce 76 60 38     stx  %g7, [ %i1 + 0x38 ]
> 
> That happens when strongswan is trying to handle a new incoming tunnel 
> request between two IPv6 endpoints (it does not seem to get triggered
> for IPv4).

Yes, we need to "void *" untype the arguments to memcpy so that
GCC doesn't inline the thing.

Patches welcome.

^ permalink raw reply
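
The trap in the report above is a 64-bit `stx` to an address that is not 8-byte aligned, emitted because the pointer types let the compiler assume alignment when it expanded the copy inline. As a sketch of the general idea only (not the actual xfrm_user fix), the hypothetical helpers below move a 64-bit value through untyped pointers with `memcpy()`, so no alignment is promised to the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers, not kernel API: move a 64-bit value to or from a
 * location that may not be 8-byte aligned.  Callers should pass a
 * byte-typed address rather than a cast uint64_t pointer.
 */
static void store_u64_unaligned(void *dst, uint64_t val)
{
	assert(dst != NULL);
	memcpy(dst, &val, sizeof(val));
}

static uint64_t load_u64_unaligned(const void *src)
{
	uint64_t val;

	assert(src != NULL);
	memcpy(&val, src, sizeof(val));
	return val;
}
```

Whether hiding the type behind `void *` is enough to stop a given GCC version from emitting an aligned store depends on what it can still prove about the pointer, so this should be read as an illustration of the intent rather than a guaranteed remedy.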

* Re: [PATCH 1/2] phylib: Support phy module autoloading
From: David Miller @ 2010-04-02 21:31 UTC (permalink / raw)
  To: dwmw2; +Cc: netdev, ben
In-Reply-To: <1270206327.3101.2436.camel@macbook.infradead.org>

From: David Woodhouse <dwmw2@infradead.org>
Date: Fri, 02 Apr 2010 12:05:27 +0100

> We don't use the normal hotplug mechanism because it doesn't work. It will
> load the module some time after the device appears, but that's not good
> enough for us -- we need the driver loaded _immediately_ because otherwise
> the NIC driver may just abort and then the phy 'device' goes away.
> 
> [bwh: s/phy/mdio/ in module alias, kerneldoc for struct mdio_device_id]
> 
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] phylib: Add module table to all existing phy drivers
From: David Miller @ 2010-04-02 21:31 UTC (permalink / raw)
  To: dwmw2; +Cc: netdev, ben
In-Reply-To: <1270206356.3101.2437.camel@macbook.infradead.org>

From: David Woodhouse <dwmw2@infradead.org>
Date: Fri, 02 Apr 2010 12:05:56 +0100

> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>

Applied, thanks for doing the notes and providing backporting
notes :-)

^ permalink raw reply

* [PATCH] SCTP: Change to use ipv6_addr_copy()
From: Brian Haley @ 2010-04-02 21:38 UTC (permalink / raw)
  To: vladislav.yasevich, davem; +Cc: netdev, linux-sctp

Change SCTP IPv6 code to use ipv6_addr_copy()

Signed-off-by: Brian Haley <brian.haley@hp.com>
---
 net/sctp/ipv6.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 216d88f..db1c767 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -364,7 +364,7 @@ static void sctp_v6_copy_addrlist(struct list_head *addrlist,
 		if (addr) {
 			addr->a.v6.sin6_family = AF_INET6;
 			addr->a.v6.sin6_port = 0;
-			addr->a.v6.sin6_addr = ifp->addr;
+			ipv6_addr_copy(&addr->a.v6.sin6_addr, &ifp->addr);
 			addr->a.v6.sin6_scope_id = dev->ifindex;
 			addr->valid = 1;
 			INIT_LIST_HEAD(&addr->list);
@@ -405,7 +405,7 @@ static void sctp_v6_from_sk(union sctp_addr *addr, struct sock *sk)
 {
 	addr->v6.sin6_family = AF_INET6;
 	addr->v6.sin6_port = 0;
-	addr->v6.sin6_addr = inet6_sk(sk)->rcv_saddr;
+	ipv6_addr_copy(&addr->v6.sin6_addr, &inet6_sk(sk)->rcv_saddr);
 }
 
 /* Initialize sk->sk_rcv_saddr from sctp_addr. */
@@ -418,7 +418,7 @@ static void sctp_v6_to_sk_saddr(union sctp_addr *addr, struct sock *sk)
 		inet6_sk(sk)->rcv_saddr.s6_addr32[3] =
 			addr->v4.sin_addr.s_addr;
 	} else {
-		inet6_sk(sk)->rcv_saddr = addr->v6.sin6_addr;
+		ipv6_addr_copy(&inet6_sk(sk)->rcv_saddr, &addr->v6.sin6_addr);
 	}
 }
 
@@ -431,7 +431,7 @@ static void sctp_v6_to_sk_daddr(union sctp_addr *addr, struct sock *sk)
 		inet6_sk(sk)->daddr.s6_addr32[2] = htonl(0x0000ffff);
 		inet6_sk(sk)->daddr.s6_addr32[3] = addr->v4.sin_addr.s_addr;
 	} else {
-		inet6_sk(sk)->daddr = addr->v6.sin6_addr;
+		ipv6_addr_copy(&inet6_sk(sk)->daddr, &addr->v6.sin6_addr);
 	}
 }
 
-- 
1.5.4.3


^ permalink raw reply related

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Sridhar Samudrala @ 2010-04-02 23:51 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, mst, jdike, davem
In-Reply-To: <1270193100-6769-1-git-send-email-xiaohui.xin@intel.com>

On Fri, 2010-04-02 at 15:25 +0800, xiaohui.xin@intel.com wrote:
> The idea is simple: pin the guest VM user space and then
> let the host NIC driver DMA directly into it.
> The patches are based on the vhost-net backend driver. We add a device
> which provides proto_ops (sendmsg/recvmsg) to vhost-net to
> send/recv directly to/from the NIC driver. A KVM guest that uses the
> vhost-net backend may bind any ethX interface on the host side to
> get copyless data transfer through the guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

Thanks
Sridhar
> 
> The scenario is like this:
> 
> The guest virtio-net driver submits multiple requests thru vhost-net
> backend driver to the kernel. And the requests are queued and then
> completed after corresponding actions in h/w are done.
> 
> For read, user space buffers are dispensed to the NIC driver for rx when
> a page constructor API is invoked. This means NICs can allocate user buffers
> from a page constructor. We add a hook in the netif_receive_skb() function
> to intercept incoming packets and notify the zero-copy device.
> 
> For write, the zero-copy device may allocate a new host skb, put the
> payload in skb_shinfo(skb)->frags, and copy the header to skb->data.
> The request remains pending until the skb is transmitted by h/w.
> 
> We have considered two ways to use the page constructor
> API to dispense the user buffers.
> 
> One:	Modify the __alloc_skb() function a bit so that it allocates only
> 	the sk_buff structure, with the data pointer pointing to a
> 	user buffer that comes from a page constructor API.
> 	The shinfo of the skb then also comes from the guest.
> 	When a packet is received from hardware, skb->data is filled
> 	directly by h/w. This is the approach we have implemented.
> 
> 	Pros:	We can avoid any copy here.
> 	Cons:	The guest virtio-net driver needs to allocate skbs in almost
> 		the same way as the host NIC drivers, i.e. with the size
> 		used by netdev_alloc_skb() and the same reserved space at the
> 		head of the skb. Many NIC drivers match the guest here and
> 		are fine. But some of the latest NIC drivers reserve special
> 		room in the skb head. To deal with that, we suggest providing
> 		a method in the guest virtio-net driver to ask the NIC driver
> 		for the parameters we are interested in once we know which
> 		device we have bound to for zero-copy. Then we ask the guest
> 		to do so. Is that reasonable?
> 
> Two:	Modify the driver to get user buffers allocated from a page constructor
> 	API (substituting for alloc_page()). The user buffers are used as payload
> 	buffers and filled by h/w directly when a packet is received. The driver
> 	should associate the pages with the skb (skb_shinfo(skb)->frags). For
> 	the head buffer, let the host allocate the skb and h/w fill it.
> 	After that, the data in the host skb header is copied into the
> 	guest header buffer, which is submitted together with the payload buffer.
> 
> 	Pros:	We need to care less about how the guest or host allocates
> 		its buffers.
> 	Cons:	We still need a small copy here for the skb header.
> 
> We are not sure which way is better. This is the first thing on which we
> want comments from the community. We would like the modifications to the
> network part to be generic, usable not only by the vhost-net backend but also
> by a user application, once the zero-copy device provides async
> read/write operations later.
> 
> Please give comments especially for the network part modifications.
> 
> 
> We provide multiple submits and asynchronous notification to
> vhost-net too.
> 
> Our goal is to improve the bandwidth and reduce the CPU usage.
> Exact performance data will be provided later. But in a simple
> test with netperf, we found that both bandwidth and CPU % went up,
> but the bandwidth increase ratio is much larger than the CPU % increase ratio.
> 
> What we have not done yet:
> 	packet split support
> 	To support GRO
> 	Performance tuning
> 
> what we have done in v1:
> 	polish the RCU usage
> 	deal with write logging in asynchronous mode in vhost
> 	add notifier block for mp device
> 	rename page_ctor to mp_port in netdevice.h to make it look generic
> 	add mp_dev_change_flags() for mp device to change NIC state
> 	add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not loaded
> 	a small fix for missing dev_put when fail
> 	using dynamic minor instead of static minor number
> 	a __KERNEL__ protect to mp_get_sock()
> 
> what we have done in v2:
> 	
> 	remove most of the RCU usage, since the ctor pointer is only
> 	changed by BIND/UNBIND ioctl, and during that time, NIC will be
> 	stopped to get good cleanup(all outstanding requests are finished),
> 	so the ctor pointer cannot be raced into wrong situation.
> 
> 	Remove the struct vhost_notifier with struct kiocb.
> 	Let vhost-net backend to alloc/free the kiocb and transfer them
> 	via sendmsg/recvmsg.
> 
> 	use get_user_pages_fast() and set_page_dirty_lock() when read.
> 
> 	Add some comments for netdev_mp_port_prep() and handle_mpassthru().
> 
> 
> Comments not yet addressed:
> 	the async write logging is not satisfied by vhost-net
> 	Qemu needs a sync write
> 	a limit for locked pages from get_user_pages_fast()
> 	
> 		
> performance:
> 	using netperf with GSO/TSO disabled, 10G NIC,
> 	disabled packet split mode, with raw socket case compared to vhost.
> 
> 	bandwidth goes from 1.1Gbps to 1.7Gbps
> 	CPU % from 120%-140% to 140%-160%
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH 0/6] tagged sysfs support
From: Ben Hutchings @ 2010-04-03  0:58 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Greg KH, linux-kernel,
	Tejun Heo, Cornelia Huck, linux-fsdevel, Eric Dumazet,
	Benjamin LaHaise, Serge Hallyn, netdev
In-Reply-To: <s2hac3eb2511003302251rcbae8767ne21e9daf1546c849@mail.gmail.com>

On Wed, 2010-03-31 at 07:51 +0200, Kay Sievers wrote:
> On Wed, Mar 31, 2010 at 01:04, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Kay Sievers <kay.sievers@vrfy.org> writes:
> >> On Tue, Mar 30, 2010 at 20:30, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >>>
> >>> The main shortcoming of using multiple network namespaces today
> >>> is that only network devices for the primary network namespace
> >>> can be put in the kobject layer and sysfs.
> >>>
> >>> This is essentially the earlier version of this patchset that was
> >>> reviewed before, just now on top of a version of sysfs that doesn't
> >>> need cleanup patches to support it.
> >>
> >> Just to check if we are not in conflict with planned changes, and how
> >> to possibly handle them:
> >>
> >> There is the plan and ongoing work to unify classes and buses, export
> >> them at /sys/subsystem in the same layout of the current /sys/bus/.
> >> The decision to export buses and classes as two different things
> >> (which they aren't) is the last major piece in the sysfs layout which
> >> needs to be fixed.
> >
> > Interesting.  We will need symlinks, i.e.:
> > /sys/class -> /sys/subsystem
> > /sys/bus -> /sys/subsystem
> > to keep from breaking userspace.
> 
> Yeah, /sys/bus/, which is the only sane layout of the needlessly
> different 3 versions of the same thing (bus, class, block).
[...]

block vs class/block is arguable, but as for abstracting the difference
between bus and class... why?

Each bus defines a device interface covering enumeration,
identification, power management and various aspects of their connection
to the host.  This interface is implemented by the bus driver.

Each class defines a device interface covering functionality provided to
user-space or higher level kernel components (block interface to
filesystems, net driver interface to the networking core, etc).  This
interface is implemented by multiple device-specific drivers.

So while buses and classes both define device interfaces, they are
fundamentally different types of interface.  And there are 'subsystems'
that don't have devices at all (time, RCU, perf, ...).  If you're going
to expose the set of subsystems, don't they belong in there?  But then,
what would you put in their directories?

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

^ permalink raw reply

* Re: Receive steering and hash and cache misses
From: Stephen Hemminger @ 2010-04-02 22:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tom Herbert, netdev
In-Reply-To: <1270236991.1978.17.camel@edumazet-laptop>

On Fri, 02 Apr 2010 21:36:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Friday 02 April 2010 at 11:54 -0700, Stephen Hemminger wrote:
> > On Fri, 2 Apr 2010 10:59:43 -0700
> > Tom Herbert <therbert@google.com> wrote:
> > 
> > > On Fri, Apr 2, 2010 at 10:26 AM, Stephen Hemminger
> > > <shemminger@vyatta.com> wrote:
> > > >
> > > > Although Receive Packet Steering can use a hardware generated receive hash
> > > > the device driver still causes an unnecessary cache miss on the interrupt
> > > > processing CPU.  The current Ethernet network device driver receive processing
> > > > > has the device driver calling eth_type_trans() which causes the
> > > > interrupt CPU to read the received frame header.
> > > >
> > > 
> > > It should be possible to deduce the values set by eth_type_trans from
> > > the RX descriptor along with the RX hash.  I'll post the patch getting
> > > rxhash from bnx2x which does this.
> > > 
> > 
> > On sky2, I get only RSS, Checksum, and length from descriptor info.
> 
> Doesn't sky2 also provide the VLAN ID (OP_RXVLAN/OP_RXCHKSVLAN)?
> 
> A future version of hardware could provide more info perhaps...

I have only some information from Marvell and no idea what they might
do with future hardware. 

> Must eth_type_trans() be done *before* netif_receive_skb() ?

In the current architecture, yes, because netif_receive_skb is used for multiple
hardware types and the backlog queue could theoretically contain
skbs of different hardware types.  Also, GRO works against RPS
since it does lookup work on the initial CPU and dirties the skb.

This is mostly theoretical at this point; the bigger performance bottlenecks
are farther down the packet processing chain.




^ permalink raw reply
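
The cache miss discussed in this thread comes from the interrupt CPU having to read the Ethernet header to classify the frame. The following is a much simplified user-space sketch of that kind of header inspection. `classify_frame()` and `struct frame_info` are hypothetical names, and the real eth_type_trans() does more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14

/* Hypothetical result type; the kernel records such facts in the skb. */
struct frame_info {
	uint16_t ethertype;	/* host byte order, 0 if the frame is too short */
	int is_multicast;	/* group bit set in the destination MAC */
};

static struct frame_info classify_frame(const uint8_t *frame, size_t len)
{
	struct frame_info info = { 0, 0 };

	assert(frame != NULL);
	if (len < ETH_HDR_LEN)
		return info;

	/* Byte 0 of the destination MAC carries the group (multicast) bit. */
	info.is_multicast = frame[0] & 0x01;

	/* Bytes 12-13 hold the EtherType in network (big-endian) order. */
	info.ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	return info;
}
```

Every byte this function touches lives in the packet data itself, which is why the same facts would have to come from the RX descriptor if the header read is to be avoided on the first CPU.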

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-03  3:38 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <1270126340-30181-2-git-send-email-timo.teras@iki.fi>

On Thu, Apr 01, 2010 at 03:52:17PM +0300, Timo Teras wrote:
> This allows validating the cached object before returning it.
> It also allows destructing the object properly if the last reference
> was held in the flow cache. This is also preparation for caching
> bundles in the flow cache.
> 
> In return for virtualizing the methods, we save on:
> - not having to regenerate the whole flow cache on policy removal:
>   each flow matching a killed policy gets refreshed as the getter
>   function notices it smartly.
> - we do not have to call flow_cache_flush from policy gc, since the
>   flow cache now properly deletes the object if it had any references
> 
> Signed-off-by: Timo Teras <timo.teras@iki.fi>

With respect to removing the cache flush upon policy removal,
what takes care of the timely purging of the corresponding cache
entries if no new traffic comes through?

The concern is that if they're not purged in the absence of new
traffic, then we may hold references on all sorts of objects,
leading to consequences such as the inability to unregister net
devices.
  
>  struct flow_cache_entry {
> -	struct flow_cache_entry	*next;
> -	u16			family;
> -	u8			dir;
> -	u32			genid;
> -	struct flowi		key;
> -	void			*object;
> -	atomic_t		*object_ref;
> +	struct flow_cache_entry *	next;

Please follow the existing coding style.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] ethtool: add names of newer Marvell chips
From: Mark Ryden @ 2010-04-03  7:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jeff Garzik, netdev
In-Reply-To: <20100402081632.5401cf2f@nehalam>

Hi,

> +       case 0xba:      printf("Yukon Ultra 2"); break;
> +       case 0xbc:      printf("Yukon Optima"); break;

What about 0xbb?
Is there any reason for not using 0xbb for Yukon Optima?

Is it something with blackberry (bb)?  :-)
Mark


On Fri, Apr 2, 2010 at 6:16 PM, Stephen Hemminger <shemminger@vyatta.com> wrote:
> Fill in names of newer chips.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
>
> diff --git a/marvell.c b/marvell.c
> index 6696e0a..af38c21 100644
> --- a/marvell.c
> +++ b/marvell.c
> @@ -184,6 +184,9 @@ static void dump_mac(const u8 *r)
>        case 0xb6:      printf("Yukon-2 EC");   break;
>        case 0xb7:      printf("Yukon-2 FE");   break;
>        case 0xb8:      printf("Yukon-2 FE Plus"); break;
> +       case 0xb9:      printf("Yukon Supreme"); break;
> +       case 0xba:      printf("Yukon Ultra 2"); break;
> +       case 0xbc:      printf("Yukon Optima"); break;
>        default:        printf("(Unknown)");    break;
>        }
>
> --
> 1.6.3.3
>

^ permalink raw reply

* Re: [PATCH 0/6] tagged sysfs support
From: Kay Sievers @ 2010-04-03  8:35 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Greg KH, linux-kernel,
	Tejun Heo, Cornelia Huck, linux-fsdevel, Eric Dumazet,
	Benjamin LaHaise, Serge Hallyn, netdev
In-Reply-To: <1270256303.12516.234.camel@localhost>

On Sat, Apr 3, 2010 at 02:58, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Wed, 2010-03-31 at 07:51 +0200, Kay Sievers wrote:
>> Yeah, /sys/bus/, which is the only sane layout of the needlessly
>> different 3 versions of the same thing (bus, class, block).
> [...]
>
> block vs class/block is arguable,

That's already done long ago.

> but as for abstracting the difference
> between bus and class... why?

There is absolutely no need to needlessly export two versions of the
same thing. These directories serve no other purpose than to collect
all devices of the same subsystem. There is no useful information that
belongs to the type class or bus, they are both the same. Like
"inputX" is implemented as a class, but is much more like a bus. And
"usb" are devices, which are more a class of devices, and the
interfaces and controllers belong to a bus.

There is really no point in making userspace needlessly complicated to
distinguish between the two.

We also already have a bunch of subsystems which moved from class to
bus because they needed to express hierarchy between the same devices.
So the goal is to have only one type of subsystem to solve these
problems.

> Each bus defines a device interface covering enumeration,
> identification, power management and various aspects of their connection
> to the host.  This interface is implemented by the bus driver.

Sure, but that does not mean that class is a useful layout, or that
class devices can not do the same.

> Each class defines a device interface covering functionality provided to
> user-space or higher level kernel components (block interface to
> filesystems, net driver interface to the networking core, etc).  This
> interface is implemented by multiple device-specific drivers.

That's absolutely wrong. Classes are just too simple uses of the same
thing. We have many class devices which are not "interfaces", and we
have bus devices which are interfaces.

> So while buses and classes both define device interfaces, they are
> fundamentally different types of interface.

No, they are not. They are just "devices". There is no useful
difference these two different types expose. And the class layout is
fundamentally broken, and not extendable. People mix lists of devices
with custom subsystem-wide attributes, which we need to stop them from
doing. The bus layout can carry custom directories, which is why
we want that by default for all "classifications".

> And there are 'subsystems'
> that don't have devices at all (time, RCU, perf, ...).  If you're going
> to expose the set of subsystems, don't they belong in there?
> But then,

We are talking about the current users in /sys, and the difference in
the sysfs export between /sys/bus and /sys/class.

> what would you put in their directories?

We are not talking about anything not in /sys currently.

Kay

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-03  8:36 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <20100403033857.GA2205@gondor.apana.org.au>

On Sat, Apr 03, 2010 at 11:38:57AM +0800, Herbert Xu wrote:
> 
> With respect to removing the cache flush upon policy removal,
> what takes care of the timely purging of the corresponding cache
> entries if no new traffic comes through?

In fact this change would seem to render the existing bundle
pruning mechanism when devices are unregistered ineffective.

xfrm_prune_bundles walks through active policy lists to find
the bundles to purge.  However, if a policy has been deleted
while a bundle referencing it is still in the cache, that bundle
will not be pruned.

Cheers,

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-03 13:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100403083609.GA3654@gondor.apana.org.au>

Herbert Xu wrote:
> With respect to removing the cache flush upon policy removal,
> what takes care of the timely purging of the corresponding cache
> entries if no new traffic comes through?
> 
> The concern is that if they're not purged in the absence of new
> traffic, then we may hold references on all sorts of objects,
> leading to consequences such as the inability to unregister net
> devices.

The flow cache is randomized every ten minutes. Thus all flow
cache entries get recreated regularly.

> On Sat, Apr 03, 2010 at 11:38:57AM +0800, Herbert Xu wrote:
>> With respect to removing the cache flush upon policy removal,
>> what takes care of the timely purging of the corresponding cache
>> entries if no new traffic comes through?
> 
> In fact this change would seem to render the existing bundle
> pruning mechanism when devices are unregistered ineffective.
> 
> xfrm_prune_bundles walks through active policy lists to find
> the bundles to purge.  However, if a policy has been deleted
> while a bundle referencing it is still in the cache, that bundle
> will not be pruned.

When a policy is killed, policy->genid is incremented, which
makes the xfrm_bundle_ok check fail and causes the bundle to be pruned
immediately on flush.

^ permalink raw reply
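
The generation-counter argument in the message above can be sketched in user space as follows. This assumes the validity check amounts to comparing a generation value stored in the bundle with the policy's current one, and it assumes the flush visits every cached entry, which is exactly the point still under discussion in this thread. All names (`struct policy`, `struct bundle`, `bundle_ok()`, `flush_stale()`) are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; field names loosely follow the thread. */
struct policy {
	unsigned int genid;	/* bumped when the policy is killed */
};

struct bundle {
	struct policy *pol;
	unsigned int genid;	/* policy generation seen at creation time */
};

static void bundle_init(struct bundle *b, struct policy *pol)
{
	assert(b != NULL && pol != NULL);
	b->pol = pol;
	b->genid = pol->genid;
}

/* Validity check: a bundle is stale once the generations differ. */
static int bundle_ok(const struct bundle *b)
{
	return b->genid == b->pol->genid;
}

static void policy_kill(struct policy *pol)
{
	pol->genid++;
}

/* Flush pass: drop every stale bundle and return how many were pruned. */
static size_t flush_stale(struct bundle **slots, size_t n)
{
	size_t pruned = 0;

	for (size_t i = 0; i < n; i++) {
		if (slots[i] && !bundle_ok(slots[i])) {
			slots[i] = NULL;
			pruned++;
		}
	}
	return pruned;
}
```

Note that nothing is pruned until a flush actually runs, so the sketch says nothing about how promptly references are released when no traffic and no flush occur.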

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-03 14:17 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB74790.7070109@iki.fi>

On Sat, Apr 03, 2010 at 04:50:08PM +0300, Timo Teräs wrote:
>
> The flow cache is randomized every ten minutes. Thus all flow
> cache entries get recreated regularly.

Having rmmod <netdrv> block for up to ten minutes is hardly
ideal.

Besides, in future we may want to get rid of the regular reseeding
just like we did for IPv4 routes.

Cheers,

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-03 14:26 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100403141709.GA5165@gondor.apana.org.au>

Herbert Xu wrote:
> On Sat, Apr 03, 2010 at 04:50:08PM +0300, Timo Teräs wrote:
>> The flow cache is randomized every ten minutes. Thus all flow
>> cache entries get recreated regularly.
> 
> Having rmmod <netdrv> block for up to ten minutes is hardly
> ideal.

Why would this block? The device down hook calls flow cache
flush. On flush all bundles with non-up devices get pruned
immediately (via stale_bundle check).

> Besides, in future we may want to get rid of the regular reseeding
> just like we did for IPv4 routes.

Right. It certainly sounds good. And needs a separate change
then.

If this is done, we will still need some sort of periodic gc
for the flow cache, just like ipv4 routes have. On each tick
the gc would scan just some of the flow cache hash chains.
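
A rough sketch of what such an incremental gc could look like, in plain user-space C. Everything here is an assumption made for illustration (HASH_SIZE, CHAINS_PER_TICK, struct cache, cache_add(), cache_gc_tick(), and the stale flag standing in for a real validity check); it is not taken from the flow cache or the ipv4 route code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define HASH_SIZE	8	/* illustrative table size */
#define CHAINS_PER_TICK	2	/* chains examined per gc tick */

struct entry {
	struct entry *next;
	bool stale;		/* stand-in for a real validity check */
};

struct cache {
	struct entry *chain[HASH_SIZE];
	size_t gc_cursor;	/* next chain the gc will look at */
};

/* Insert a new entry at the head of chain 'hash'; NULL if out of memory. */
static struct entry *cache_add(struct cache *c, size_t hash, bool stale)
{
	struct entry *e;

	assert(hash < HASH_SIZE);
	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->stale = stale;
	e->next = c->chain[hash];
	c->chain[hash] = e;
	return e;
}

/* One gc tick: prune stale entries from the next few chains only. */
static size_t cache_gc_tick(struct cache *c)
{
	size_t freed = 0;
	int i;

	for (i = 0; i < CHAINS_PER_TICK; i++) {
		struct entry **pp = &c->chain[c->gc_cursor];

		while (*pp) {
			struct entry *e = *pp;

			if (e->stale) {
				*pp = e->next;	/* unlink, then free */
				free(e);
				freed++;
			} else {
				pp = &e->next;
			}
		}
		c->gc_cursor = (c->gc_cursor + 1) % HASH_SIZE;
	}
	return freed;
}
```

With these numbers a full pass over the table takes HASH_SIZE / CHAINS_PER_TICK ticks, so the worst-case lifetime of a stale entry is bounded by the tick period times that ratio.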

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-03 15:53 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB74FF8.2020303@iki.fi>

On Sat, Apr 03, 2010 at 05:26:00PM +0300, Timo Teräs wrote:
>
> Why would this block? The device down hook calls flow cache
> flush. On flush all bundles with non-up devices get pruned
> immediately (via stale_bundle check).

Perhaps I missed something in your patch, but the flush that
we currently perform is limited to the bundles from hashed policies.
So if a policy has just recently been removed, then its bundles
won't be flushed.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/6] tagged sysfs support
From: Ben Hutchings @ 2010-04-03 16:05 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Greg KH, linux-kernel,
	Tejun Heo, Cornelia Huck, linux-fsdevel, Eric Dumazet,
	Benjamin LaHaise, Serge Hallyn, netdev
In-Reply-To: <h2gac3eb2511004030135q2a2f5002z912974c1d2ab8853@mail.gmail.com>

On Sat, 2010-04-03 at 10:35 +0200, Kay Sievers wrote:
> On Sat, Apr 3, 2010 at 02:58, Ben Hutchings <ben@decadent.org.uk> wrote:
> > On Wed, 2010-03-31 at 07:51 +0200, Kay Sievers wrote:
> >> Yeah, /sys/bus/, which is the only sane layout of the needlessly
> >> different 3 versions of the same thing (bus, class, block).
> > [...]
> >
> > block vs class/block is arguable,
> 
> That's already done long ago.
> 
> > but as for abstracting the difference
> > between bus and class... why?
> 
> There is absolutely no need to needlessly export two versions of the
> same thing. These directories serve no other purpose than to collect
> all devices of the same subsystem. There is no useful information that
> belongs to the type class or bus, they are both the same. Like
> "inputX" is implemented as a class, but is much more like a bus.

Really, how do you enumerate 'input' buses?

> And "usb" are devices, which are more a class of devices, and the
> interfaces and controllers belong to a bus.

What common higher-level functionality do USB devices provide?

> There is really no point to make userspace needlessly complicated to
> distinguish the both.
> 
> We also already have a bunch of subsystems which moved from class to
> bus because they needed to express hierarchy between the same devices.
> So the goal is to have only one type of subsystem to solve these
> problems.

That's interesting.  Which were those?

[...]
> > So while buses and classes both define device interfaces, they are
> > fundamentally different types of interface.
> 
> No, they are not. They are just "devices". There is no useful
> difference these two different types expose. And the class layout is
> fundamentally broken, and not extendable. People mix lists of devices
> with custom subsystem-wide attributes, which we need to stop them from
> doing. The bus layout can carry custom directories, which is why
> we want that by default for all "classifications".
[...]

I understand that you want to clean up a mess, but how do you know
you're not going to break user-space that depends on some of this mess?

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

^ permalink raw reply

* Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
From: Avi Kivity @ 2010-04-03 16:32 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mingo, mst, jdike, davem
In-Reply-To: <1270252268.13897.14.camel@w-sridhar.beaverton.ibm.com>

On 04/03/2010 02:51 AM, Sridhar Samudrala wrote:
> On Fri, 2010-04-02 at 15:25 +0800, xiaohui.xin@intel.com wrote:
>    
>> The idea is simple, just to pin the guest VM user space and then
>> let the host NIC driver have the chance to DMA directly to it.
>> The patches are based on the vhost-net backend driver. We add a device
>> which provides proto_ops as sendmsg/recvmsg to vhost-net to
>> send/recv directly to/from the NIC driver. A KVM guest that uses the
>> vhost-net backend may bind any ethX interface on the host side to
>> get copyless data transfer through the guest virtio-net frontend.
>>      
> What is the advantage of this approach compared to PCI-passthrough
> of the host NIC to the guest?
>    

swapping/ksm/etc
independence from host hardware
live migration

> Does this require pinning of the entire guest memory? Or only the
> send/receive buffers?
>    

If done correctly, just the send/receive buffers.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply

* Re: [PATCH 0/6] tagged sysfs support
From: Kay Sievers @ 2010-04-03 16:35 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Greg KH, linux-kernel,
	Tejun Heo, Cornelia Huck, linux-fsdevel, Eric Dumazet,
	Benjamin LaHaise, Serge Hallyn, netdev
In-Reply-To: <1270310756.12516.308.camel@localhost>

On Sat, Apr 3, 2010 at 18:05, Ben Hutchings <ben@decadent.org.uk> wrote:
> On Sat, 2010-04-03 at 10:35 +0200, Kay Sievers wrote:
>> On Sat, Apr 3, 2010 at 02:58, Ben Hutchings <ben@decadent.org.uk> wrote:
>> > On Wed, 2010-03-31 at 07:51 +0200, Kay Sievers wrote:
>> >> Yeah, /sys/bus/, which is the only sane layout of the needlessly
>> >> different 3 versions of the same thing (bus, class, block).
>> > [...]
>> >
>> > block vs class/block is arguable,
>>
>> That's already done long ago.
>>
>> > but as for abstracting the difference
>> > between bus and class... why?
>>
>> There is absolutely no need to needlessly export two versions of the
>> same thing. These directories serve no other purpose than to collect
>> all devices of the same subsystem. There is no useful information that
>> belongs to the type class or bus, they are both the same. Like
>> "inputX" is implemented as a class, but is much more like a bus.
>
> Really, how do you enumerate 'input' buses?

The current inputX devices, unlike eventX and mouseX, are like "bus devices".

>> And "usb" are devices, which are more a class of devices, and the
>> interfaces and controllers belong to a bus.
>
> What common higher-level functionality do USB devices provide?

A device file, for example, which can do anything to the device. :)

>> There is really no point to make userspace needlessly complicated to
>> distinguish the both.
>>
>> We also already have a bunch of subsystems which moved from class to
>> bus because they needed to express hierarchy between the same devices.
>> So the goal is to have only one type of subsystem to solve these
>> problems.
>
> That's interesting.  Which were those?

i2c, iio, and a few that were out-of-tree and got changed before the
merge, because we knew they would not work as class devices: they
need to have children, or need to add additional properties at the
subsystem directory level. pci, for example, has a "slots" directory
in its subsystem directory. Such things are not possible with the
overly simple class layout.

> [...]
>> > So while buses and classes both define device interfaces, they are
>> > fundamentally different types of interface.
>>
>> No, they are not. They are just "devices". There is no useful
>> difference these two different types expose. And the class layout is
>> fundamentally broken, and not extendable. People mix lists of devices
>> with custom subsystem-wide attributes, which we need to stop them from
>> doing. The bus layout can carry custom directories, which is why
>> we want that by default for all "classifications".
> [...]
>
> I understand that you want to clean up a mess, but how do you know
> you're not going to break user-space that depends on some of this mess?

Just as /sys/block does, /sys/class and /sys/bus will stay as
symlinks, and not go away.

Kay

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-03 20:19 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100403155353.GA5618@gondor.apana.org.au>

Herbert Xu wrote:
> On Sat, Apr 03, 2010 at 05:26:00PM +0300, Timo Teräs wrote:
>> Why would this block? The device down hook calls flow cache
>> flush. On flush all bundles with non-up devices get pruned
>> immediately (via stale_bundle check).
> 
> Perhaps I missed something in your patch, but the flush that
> we currently perform is limited to the bundles from hashed policies.
> So if a policy has just recently been removed, then its bundles
> won't be flushed.

If a policy is removed, policy->genid is incremented, invalidating
the bundles. Those bundles get freed when:
 - the specific flow gets hit
 - the cache is flushed, due to a GC call or an interface going down
 - the flow cache is re-randomized

If someone is then removing a net driver, we still execute
flush on the 'device down' hook, and all stale bundles
get flushed.
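
A small sketch of why removing a policy first should not matter for the flush, under the assumption described above that the flush visits the cache entries themselves rather than walking policy lists. The names (struct flow_cache, bundle_stale(), cache_flush(), cache_lookup()) and the fixed-size slot array are hypothetical simplifications, not the real flow cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 4	/* illustrative size */

struct policy { unsigned int genid; };
struct device { bool up; };

struct bundle {
	bool in_use;
	const struct policy *pol;	/* caller keeps this memory alive */
	const struct device *dev;
	unsigned int genid;		/* policy generation at creation */
};

struct flow_cache { struct bundle slot[CACHE_SLOTS]; };

/* Stale if the policy's generation moved on or the device is not up. */
static bool bundle_stale(const struct bundle *b)
{
	return b->genid != b->pol->genid || !b->dev->up;
}

/* Flush: visit every cache slot, so it does not matter whether the
 * policy is still on any policy list. Returns the number pruned. */
static size_t cache_flush(struct flow_cache *fc)
{
	size_t i, pruned = 0;

	for (i = 0; i < CACHE_SLOTS; i++) {
		struct bundle *b = &fc->slot[i];

		if (b->in_use && bundle_stale(b)) {
			b->in_use = false;
			pruned++;
		}
	}
	return pruned;
}

/* Lookup hit: a stale entry is dropped lazily and reported as a miss. */
static struct bundle *cache_lookup(struct flow_cache *fc, size_t i)
{
	struct bundle *b;

	assert(i < CACHE_SLOTS);
	b = &fc->slot[i];
	if (!b->in_use)
		return NULL;
	if (bundle_stale(b)) {
		b->in_use = false;
		return NULL;
	}
	return b;
}
```

In this sketch the only cost of a removed policy is that its slot (and the policy memory it points at) lingers until one of the two paths above touches it.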

But yes, this means that an xfrm_policy struct can now stay
allocated for up to ten extra minutes. But it's only memory that
it's holding, not any extra refs. And it's still reclaimable
by the GC.

If this feels troublesome, we could add an asynchronous flush
request that would be called on policy removal. Or even stick
to the synchronous one.

^ permalink raw reply

* [PATCH  kernel 2.6.34-rc3] smc91c92_cs: fix the problem of "Unable to find hardware address"
From: Ken Kawasaki @ 2010-04-03 21:14 UTC (permalink / raw)
  To: netdev; +Cc: ken_kawasaki
In-Reply-To: <20100328055537.13ef6c01.ken_kawasaki@spring.nifty.jp>


smc91c92_cs:
 * cvt_ascii_address returns 0 on success, so check for 0.
 * call free_netdev if we can't find the hardware address.

Signed-off-by: Ken Kawasaki <ken_kawasaki@spring.nifty.jp>

---
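
For illustration only, not part of the patch: the first two hunks hinge on the return convention. Below is a minimal user-space sketch, assuming a hypothetical parse_ascii_mac() that returns 0 on success as the changelog says cvt_ascii_address does. The helper and get_mac() are made up for the example and are not the driver's code. Testing such a return value for truth would treat every failure as success, which is why the caller compares against 0.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: parse exactly 12 hex digits into 6 bytes.
 * Returns 0 on success and -1 on malformed input. */
static int parse_ascii_mac(unsigned char mac[6], const char *s)
{
	int i, j;

	assert(mac != NULL && s != NULL);
	if (strlen(s) != 12)
		return -1;
	for (i = 0; i < 6; i++) {
		unsigned int byte = 0;

		for (j = 0; j < 2; j++) {
			int c = (unsigned char)*s++;

			if (!isxdigit(c))
				return -1;
			byte <<= 4;
			byte |= isdigit(c) ? (unsigned int)(c - '0')
					   : (unsigned int)(tolower(c) - 'a' + 10);
		}
		mac[i] = (unsigned char)byte;
	}
	return 0;
}

/* Caller: success is signalled by 0, so compare against 0 explicitly. */
static int get_mac(unsigned char mac[6], const char *s)
{
	if (parse_ascii_mac(mac, s) == 0)
		return 0;
	return -1;	/* the patched driver code returns -EINVAL here */
}
```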

--- linux-2.6.34-rc3/drivers/net/pcmcia/smc91c92_cs.c.orig	2010-04-03 09:41:20.000000000 +0900
+++ linux-2.6.34-rc3/drivers/net/pcmcia/smc91c92_cs.c	2010-04-03 20:34:06.000000000 +0900
@@ -493,13 +493,14 @@ static int pcmcia_get_versmac(struct pcm
 {
 	struct net_device *dev = priv;
 	cisparse_t parse;
+	u8 *buf;
 
 	if (pcmcia_parse_tuple(tuple, &parse))
 		return -EINVAL;
 
-	if ((parse.version_1.ns > 3) &&
-	    (cvt_ascii_address(dev,
-			       (parse.version_1.str + parse.version_1.ofs[3]))))
+	buf = parse.version_1.str + parse.version_1.ofs[3];
+
+	if ((parse.version_1.ns > 3) && (cvt_ascii_address(dev, buf) == 0))
 		return 0;
 
 	return -EINVAL;
@@ -528,7 +529,7 @@ static int mhz_setup(struct pcmcia_devic
     len = pcmcia_get_tuple(link, 0x81, &buf);
     if (buf && len >= 13) {
 	    buf[12] = '\0';
-	    if (cvt_ascii_address(dev, buf))
+	    if (cvt_ascii_address(dev, buf) == 0)
 		    rc = 0;
     }
     kfree(buf);
@@ -910,7 +911,7 @@ static int smc91c92_config(struct pcmcia
 
     if (i != 0) {
 	printk(KERN_NOTICE "smc91c92_cs: Unable to find hardware address.\n");
-	goto config_undo;
+	goto config_failed;
     }
 
     smc->duplex = 0;
@@ -998,6 +999,7 @@ config_undo:
     unregister_netdev(dev);
 config_failed:
     smc91c92_release(link);
+    free_netdev(dev);
     return -ENODEV;
 } /* smc91c92_config */
 

^ permalink raw reply
