Netdev List


* [PATCH] pkt_sched: Fix qdisc_create on stab error handling
From: Jarek Poplawski @ 2009-09-15 18:46 UTC
  To: David Miller; +Cc: Patrick McHardy, netdev

If qdisc_get_stab() returns an error in qdisc_create(), the qdisc's
ops->destroy is skipped. It is necessary at this point because the error
occurs after ops->init, so memory leaks are quite probable.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/sched/sch_api.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 692d9a4..329be0c 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -804,7 +804,7 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 			stab = qdisc_get_stab(tca[TCA_STAB]);
 			if (IS_ERR(stab)) {
 				err = PTR_ERR(stab);
-				goto err_out3;
+				goto err_out4;
 			}
 			sch->stab = stab;
 		}
@@ -833,7 +833,6 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 		return sch;
 	}
 err_out3:
-	qdisc_put_stab(sch->stab);
 	dev_put(dev);
 	kfree((char *) sch - sch->padded);
 err_out2:
@@ -847,6 +846,7 @@ err_out4:
 	 * Any broken qdiscs that would require a ops->reset() here?
 	 * The qdisc was never in action so it shouldn't be necessary.
 	 */
+	qdisc_put_stab(sch->stab);
 	if (ops->destroy)
 		ops->destroy(sch);
 	goto err_out3;
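
The unwinding rule the patch restores can be sketched outside the kernel.
This is a minimal user-space illustration, not the sch_api.c code: demo_obj,
demo_init(), demo_destroy() and the allocation counter are hypothetical
stand-ins for the qdisc, ops->init, ops->destroy and the leak being fixed.
A failure after the init step has to jump to the label that also undoes
init, rather than to the label that only frees the object.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a qdisc: private state plus an optional table. */
struct demo_obj {
	void *priv;	/* set up by demo_init(), released by demo_destroy() */
	void *table;	/* plays the role of sch->stab */
};

static int live_allocs;	/* counts allocations that were not freed yet */

static void *demo_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Stand-in for ops->init: allocates private state. */
static int demo_init(struct demo_obj *o)
{
	o->priv = demo_alloc(16);
	return o->priv ? 0 : -ENOMEM;
}

/* Stand-in for ops->destroy: releases what demo_init() allocated. */
static void demo_destroy(struct demo_obj *o)
{
	demo_free(o->priv);
	o->priv = NULL;
}

/* Returns 0 and stores the object in *out, or a negative errno. */
int demo_create(int fail_table, struct demo_obj **out)
{
	struct demo_obj *o;
	int err;

	assert(out != NULL);
	o = demo_alloc(sizeof(*o));
	if (!o)
		return -ENOMEM;
	o->priv = NULL;
	o->table = NULL;

	err = demo_init(o);
	if (err)
		goto err_free;		/* init failed, nothing of init to undo */

	if (fail_table) {		/* simulated table lookup failure */
		err = -EINVAL;
		goto err_destroy;	/* after init: init must be undone too */
	}
	o->table = demo_alloc(8);
	if (!o->table) {
		err = -ENOMEM;
		goto err_destroy;
	}
	*out = o;
	return 0;

err_destroy:
	demo_free(o->table);
	demo_destroy(o);
err_free:
	demo_free(o);
	return err;
}

int demo_live_allocs(void)
{
	return live_allocs;
}
```

Calling demo_create(1, &obj) fails with -EINVAL and leaves
demo_live_allocs() at 0. Jumping to err_free from the table failure instead
would leave the private state allocated, which is the kind of leak the
patch description refers to.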


* Re: [crash] kernel BUG at net/core/pktgen.c:3503!
From: Cyrill Gorcunov @ 2009-09-15 18:51 UTC
  To: Ingo Molnar; +Cc: David Miller, torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20090915183647.GA11628@elte.hu>

[Ingo Molnar - Tue, Sep 15, 2009 at 08:36:47PM +0200]
| 
| not sure which merge caused this, but i got this boot crash with latest 
| -git:
| 
| calling  flow_cache_init+0x0/0x1b9 @ 1
| initcall flow_cache_init+0x0/0x1b9 returned 0 after 64 usecs
| calling  pg_init+0x0/0x37c @ 1
| pktgen 2.72: Packet Generator for packet performance testing.
| ------------[ cut here ]------------
| kernel BUG at net/core/pktgen.c:3503!
| invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
| last sysfs file: 
| 

Hi Ingo,

just curious, will the following patch fix the problem?
I've been fixing a problem with similar symptoms on
a system with a custom virtual cpu implementation, so
it may not help in mainline, but anyway :)

	-- Cyrill
---
 net/core/pktgen.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.git/net/core/pktgen.c
=====================================================================
--- linux-2.6.git.orig/net/core/pktgen.c
+++ linux-2.6.git/net/core/pktgen.c
@@ -3511,7 +3511,7 @@ static int pktgen_thread_worker(void *ar
 	struct pktgen_dev *pkt_dev = NULL;
 	int cpu = t->cpu;
 
-	BUG_ON(smp_processor_id() != cpu);
+	BUG_ON(task_cpu(current) != cpu);
 
 	init_waitqueue_head(&t->queue);
 	complete(&t->start_done);


* Re: [PATCH net-next-2.6] bonding: make ab_arp select active slaves as other modes
From: Jiri Pirko @ 2009-09-15 18:55 UTC
  To: Jay Vosburgh; +Cc: netdev, davem, bonding-devel, nicolas.2p.debian
In-Reply-To: <30429.1253031653@death.nxdomain.ibm.com>

Tue, Sep 15, 2009 at 06:20:53PM CEST, fubar@us.ibm.com wrote:
>Jiri Pirko <jpirko@redhat.com> wrote:
>
>>Fri, Sep 11, 2009 at 02:32:18AM CEST, fubar@us.ibm.com wrote:
>>>Jiri Pirko <jpirko@redhat.com> wrote:
>>>
>>>>When I was implementing primary_passive option (formely named primary_lazy) I've
>>>>run into troubles with ab_arp. This is the only mode which is not using
>>>>bond_select_active_slave() function to select active slave and instead it
>>>>selects it itself. This seems to be not the right behaviour and it would be
>>>>better to do it in bond_select_active_slave() for all cases. This patch makes
>>>>this happen. Please review.
>>>
>>>	Sorry for the delay in response; was out of the office.
>>>
>>>	My first question is whether this affect the "current_arp_slave"
>>>behavior, i.e., the round-robining of the ARP probes when no slaves are
>>>active.  Is that something you checked?
>>
>>Yes, according to my tests this behaves the same way as original code.
>>How about your tests?
>
>	Yah, it seems to work like it should.  I just have this nagging
>feeling I'm forgetting something; that there was a reason that the ab
>ARP was doing things differently.  I sure don't remember, though;
>probably just getting old.
>
>	The only nitpicks I see are a couple of changes that appear to
>be just for style ("break" changed to "continue"; some code rearranged
>in bond_find_best_slave, which is noted below) and one locking nit:
>strictly speaking, curr_slave_lock should be held for read when
>inspecting curr_active_slave.  The place it happens, though, already
>holds rtnl, and all changes to curr_active_slave happen under rtnl, so
>it won't actually fail, but it's different than everywhere else.

Well, I changed bond_ab_arp_commit to be similar to bond_miimon_commit. Therefore
the breaks changed to continues, no curr_active_lock around
bond_select_active_slave, etc.

I adjusted bond_find_best_slave to work with slave->link so that it can be used
with arp too. I believe the changes in this function are correct, and my test
results agree.

I hope I cleared all your comments :)

Jirka

>
>	I've been gnawing on getting rid of curr_slave_lock for a while;
>I think it can go away, and be subsumed into the general bond->lock.
>The curr_active_slave is (today, this didn't used to be true) only
>changed under rtnl, but some other code does inspect it outside of rtnl.

I have looked at this several times, but I haven't found the courage to actually
eliminate this lock... (and use rculists etc...) Maybe later :)

Jirka

>
>	-J
>
>
>
>>Jirka
>>
>>>
>>>	I'll give this a test tomorrow as well.
>>>
>>>	-J
>>>
>>>---
>>>	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>>>
>>>>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>>>
>>>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>>index 7c0e0bd..6ebd88d 100644
>>>>--- a/drivers/net/bonding/bond_main.c
>>>>+++ b/drivers/net/bonding/bond_main.c
>>>>@@ -1093,15 +1093,8 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>>>> 			return NULL; /* still no slave, return NULL */
>>>> 	}
>>>>
>>>>-	/*
>>>>-	 * first try the primary link; if arping, a link must tx/rx
>>>>-	 * traffic before it can be considered the curr_active_slave.
>>>>-	 * also, we would skip slaves between the curr_active_slave
>>>>-	 * and primary_slave that may be up and able to arp
>>>>-	 */
>>>> 	if ((bond->primary_slave) &&
>>>>-	    (!bond->params.arp_interval) &&
>>>>-	    (IS_UP(bond->primary_slave->dev))) {
>>>>+	    bond->primary_slave->link == BOND_LINK_UP) {
>>>> 		new_active = bond->primary_slave;
>>>> 	}
>>>>
>>>>@@ -1109,15 +1102,14 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>>>> 	old_active = new_active;
>>>>
>>>> 	bond_for_each_slave_from(bond, new_active, i, old_active) {
>>>>-		if (IS_UP(new_active->dev)) {
>>>>-			if (new_active->link == BOND_LINK_UP) {
>>>>-				return new_active;
>>>>-			} else if (new_active->link == BOND_LINK_BACK) {
>>>>-				/* link up, but waiting for stabilization */
>>>>-				if (new_active->delay < mintime) {
>>>>-					mintime = new_active->delay;
>>>>-					bestslave = new_active;
>>>>-				}
>>>>+		if (new_active->link == BOND_LINK_UP) {
>>>>+			return new_active;
>>>>+		} else if (new_active->link == BOND_LINK_BACK &&
>>>>+			   IS_UP(new_active->dev)) {
>>>>+			/* link up, but waiting for stabilization */
>>>>+			if (new_active->delay < mintime) {
>>>>+				mintime = new_active->delay;
>>>>+				bestslave = new_active;
>>>
>>>	Is there a functional reason for rearranging this (i.e., did the
>>>use of select_active_slave need this for some reason)?
>>>
>>>
>>>> 			}
>>>> 		}
>>>> 	}
>>>>@@ -2929,18 +2921,6 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
>>>> 		}
>>>> 	}
>>>>
>>>>-	read_lock(&bond->curr_slave_lock);
>>>>-
>>>>-	/*
>>>>-	 * Trigger a commit if the primary option setting has changed.
>>>>-	 */
>>>>-	if (bond->primary_slave &&
>>>>-	    (bond->primary_slave != bond->curr_active_slave) &&
>>>>-	    (bond->primary_slave->link == BOND_LINK_UP))
>>>>-		commit++;
>>>>-
>>>>-	read_unlock(&bond->curr_slave_lock);
>>>>-
>>>> 	return commit;
>>>> }
>>>>
>>>>@@ -2961,90 +2941,58 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
>>>> 			continue;
>>>>
>>>> 		case BOND_LINK_UP:
>>>>-			write_lock_bh(&bond->curr_slave_lock);
>>>>-
>>>>-			if (!bond->curr_active_slave &&
>>>>-			    time_before_eq(jiffies, dev_trans_start(slave->dev) +
>>>>-					   delta_in_ticks)) {
>>>>+			if ((!bond->curr_active_slave &&
>>>>+			     time_before_eq(jiffies,
>>>>+					    dev_trans_start(slave->dev) +
>>>>+					    delta_in_ticks)) ||
>>>>+			    bond->curr_active_slave != slave) {
>>>> 				slave->link = BOND_LINK_UP;
>>>>-				bond_change_active_slave(bond, slave);
>>>> 				bond->current_arp_slave = NULL;
>>>>
>>>> 				pr_info(DRV_NAME
>>>>-				       ": %s: %s is up and now the "
>>>>-				       "active interface\n",
>>>>-				       bond->dev->name, slave->dev->name);
>>>>-
>>>>-			} else if (bond->curr_active_slave != slave) {
>>>>-				/* this slave has just come up but we
>>>>-				 * already have a current slave; this can
>>>>-				 * also happen if bond_enslave adds a new
>>>>-				 * slave that is up while we are searching
>>>>-				 * for a new slave
>>>>-				 */
>>>>-				slave->link = BOND_LINK_UP;
>>>>-				bond_set_slave_inactive_flags(slave);
>>>>-				bond->current_arp_slave = NULL;
>>>>+					": %s: link status definitely "
>>>>+					"up for interface %s.\n",
>>>>+					bond->dev->name, slave->dev->name);
>>>>
>>>>-				pr_info(DRV_NAME
>>>>-				       ": %s: backup interface %s is now up\n",
>>>>-				       bond->dev->name, slave->dev->name);
>>>>-			}
>>>>+				if (!bond->curr_active_slave ||
>>>>+				    (slave == bond->primary_slave))
>>>>+					goto do_failover;
>>>>
>>>>-			write_unlock_bh(&bond->curr_slave_lock);
>>>>+			}
>>>>
>>>>-			break;
>>>>+			continue;
>>>>
>>>> 		case BOND_LINK_DOWN:
>>>> 			if (slave->link_failure_count < UINT_MAX)
>>>> 				slave->link_failure_count++;
>>>>
>>>> 			slave->link = BOND_LINK_DOWN;
>>>>+			bond_set_slave_inactive_flags(slave);
>>>>
>>>>-			if (slave == bond->curr_active_slave) {
>>>>-				pr_info(DRV_NAME
>>>>-				       ": %s: link status down for active "
>>>>-				       "interface %s, disabling it\n",
>>>>-				       bond->dev->name, slave->dev->name);
>>>>-
>>>>-				bond_set_slave_inactive_flags(slave);
>>>>-
>>>>-				write_lock_bh(&bond->curr_slave_lock);
>>>>-
>>>>-				bond_select_active_slave(bond);
>>>>-				if (bond->curr_active_slave)
>>>>-					bond->curr_active_slave->jiffies =
>>>>-						jiffies;
>>>>-
>>>>-				write_unlock_bh(&bond->curr_slave_lock);
>>>>+			pr_info(DRV_NAME
>>>>+				": %s: link status definitely down for "
>>>>+				"interface %s, disabling it\n",
>>>>+				bond->dev->name, slave->dev->name);
>>>>
>>>>+			if (slave == bond->curr_active_slave) {
>>>> 				bond->current_arp_slave = NULL;
>>>>-
>>>>-			} else if (slave->state == BOND_STATE_BACKUP) {
>>>>-				pr_info(DRV_NAME
>>>>-				       ": %s: backup interface %s is now down\n",
>>>>-				       bond->dev->name, slave->dev->name);
>>>>-
>>>>-				bond_set_slave_inactive_flags(slave);
>>>>+				goto do_failover;
>>>> 			}
>>>>-			break;
>>>>+
>>>>+			continue;
>>>>
>>>> 		default:
>>>> 			pr_err(DRV_NAME
>>>> 			       ": %s: impossible: new_link %d on slave %s\n",
>>>> 			       bond->dev->name, slave->new_link,
>>>> 			       slave->dev->name);
>>>>+			continue;
>>>> 		}
>>>>-	}
>>>>
>>>>-	/*
>>>>-	 * No race with changes to primary via sysfs, as we hold rtnl.
>>>>-	 */
>>>>-	if (bond->primary_slave &&
>>>>-	    (bond->primary_slave != bond->curr_active_slave) &&
>>>>-	    (bond->primary_slave->link == BOND_LINK_UP)) {
>>>>+do_failover:
>>>>+		ASSERT_RTNL();
>>>> 		write_lock_bh(&bond->curr_slave_lock);
>>>>-		bond_change_active_slave(bond, bond->primary_slave);
>>>>+		bond_select_active_slave(bond);
>>>> 		write_unlock_bh(&bond->curr_slave_lock);
>>>> 	}
>>>>
>>>>--
>>>>To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>>the body of a message to majordomo@vger.kernel.org
>>>>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>


* Re: UDP regression with packets rates < 10k per sec
From: Eric Dumazet @ 2009-09-15 17:26 UTC
  To: Christoph Lameter; +Cc: netdev
In-Reply-To: <alpine.DEB.1.10.0909151000230.20318@V090114053VZO-1>

Christoph Lameter wrote:
> On Tue, 15 Sep 2009, Eric Dumazet wrote:
> 
>> 2.6.31 is actually faster than 2.6.22 on the bench you provided.
> 
> Well at high packet rates which were not the topic.
> 
>> Must be specific to the hardware I guess ?
> 
> Huh? Even your loopback numbers did show the regression up to 10k.
> 
>> As text size presumably is bigger in 2.6.31, fetching code
>> in cpu caches to handle 10 packets per second is what we call
>> a cold path anyway.
> 
> Ok so its an accepted regression? This is a significant reason not to use
> newer versions of kernels for latency critical applications that may have
> to send a packet once in a while for notification. The latency is doubled
> (1G) / tripled / quadrupled (IB) vs 2.6.22.
> 
>> If you want to make it a fast path, you want to make sure code its
>> always hot in cpu caches, and find a way to inject packets into
>> the kernel to make sure cpu keep the path hot.
> 
> Oh, gosh.

There seems to be a lot of confusion on this topic, so here is a full recap:

Once I understood that my 2.6.31 kernel had many more features than 2.6.22, I tuned
it to:

- Let the cpus run at full speed (3GHz instead of 2GHz): before tuning, 2.6.31 was
using the "ondemand" governor and my cpus were running at 2GHz, while they were
running at 3GHz on my 2.6.22 config

- Not let the cpus enter C2/C3 wait states (idle=mwait)

- Correctly affine the ethX irq to a cpu (2.6.22 was running the ethX irq on one cpu, while
 on 2.6.31, irqs were distributed to all online cpus)


Then your mcast test gives the same results at 10pps, 100pps, 1000pps and 10000pps.

When sniffing on the receiving side, I can see:

- Answer to an icmp ping (served by softirq only): 6 us between request and reply

- Answer to one 'give timestamp' request from the mcast client: 11 us between request and reply,
  regardless of kernel version (2.6.22 or 2.6.31)

So there is a 5 us cost to actually wake up a process and let it do the recvfrom() and sendto() pair,
which is quite OK, and this time did not change significantly between 2.6.22 and 2.6.31.

Hope this helps


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-15 20:08 UTC
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AAFACB5.9050808@redhat.com>


Avi Kivity wrote:
> On 09/15/2009 04:50 PM, Gregory Haskins wrote:
>>> Why?  vhost will call get_user_pages() or copy_*_user() which ought to
>>> do the right thing.
>>>      
>> I was speaking generally, not specifically to Ira's architecture.  What
>> I mean is that vbus was designed to work without assuming that the
>> memory is pageable.  There are environments in which the host is not
>> capable of mapping hvas/*page, but the memctx->copy_to/copy_from
>> paradigm could still work (think rdma, for instance).
>>    
> 
> Sure, vbus is more flexible here.
> 
>>>> As an aside: a bigger issue is that, iiuc, Ira wants more than a single
>>>> ethernet channel in his design (multiple ethernets, consoles, etc).  A
>>>> vhost solution in this environment is incomplete.
>>>>
>>>>        
>>> Why?  Instantiate as many vhost-nets as needed.
>>>      
>> a) what about non-ethernets?
>>    
> 
> There's virtio-console, virtio-blk etc.  None of these have kernel-mode
> servers, but these could be implemented if/when needed.

IIUC, Ira already needs at least ethernet and console capability.

> 
>> b) what do you suppose this protocol to aggregate the connections would
>> look like? (hint: this is what a vbus-connector does).
>>    
> 
> You mean multilink?  You expose the device as a multiqueue.

No, what I mean is how do you surface multiple ethernet and consoles to
the guests?  For Ira's case, I think he needs at least one of
each, and he mentioned possibly having two unique ethernets at one point.

His slave boards surface themselves as PCI devices to the x86
host.  So how do you use that to make multiple vhost-based devices (say
two virtio-nets, and a virtio-console) communicate across the transport?

There are multiple ways to do this, but what I am saying is that
whatever is conceived will start to look eerily like a vbus-connector,
since this is one of its primary purposes ;)

> 
>> c) how do you manage the configuration, especially on a per-board basis?
>>    
> 
> pci (for kvm/x86).

Ok, for kvm understood (and I would also add "qemu" to that mix).  But
we are talking about vhost's application in a non-kvm environment here,
right?

So if the vhost-X devices are in the "guest", and the x86 board is just
a slave...How do you tell each ppc board how many devices and what
config (e.g. MACs, etc) to instantiate?  Do you assume that they should
all be symmetric and based on positional (e.g. slot) data?  What if you
want asymmetric configurations (if not here, perhaps in a different
environment)?

> 
>> Actually I have patches queued to allow vbus to be managed via ioctls as
>> well, per your feedback (and it solves the permissions/lifetime
>> critisims in alacrityvm-v0.1).
>>    
> 
> That will make qemu integration easier.
> 
>>>   The only difference is the implementation.  vhost-net
>>> leaves much more to userspace, that's the main difference.
>>>      
>> Also,
>>
>> *) vhost is virtio-net specific, whereas vbus is a more generic device
>> model where thing like virtio-net or venet ride on top.
>>    
> 
> I think vhost-net is separated into vhost and vhost-net.

That's good.

> 
>> *) vhost is only designed to work with environments that look very
>> similar to a KVM guest (slot/hva translatable).  vbus can bridge various
>> environments by abstracting the key components (such as memory access).
>>    
> 
> Yes.  virtio is really virtualization oriented.

I would say that it's vhost in particular that is virtualization
oriented.  virtio, as a concept, generally should work in physical
systems, if perhaps with some minor modifications.  The biggest "limit"
is having "virt" in its name ;)

> 
>> *) vhost requires an active userspace management daemon, whereas vbus
>> can be driven by transient components, like scripts (ala udev)
>>    
> 
> vhost by design leaves configuration and handshaking to userspace.  I
> see it as an advantage.

The misconception here is that vbus by design _doesn't define_ where
configuration/handshaking happens.  It is primarily implemented by a
modular component called a "vbus-connector", and _I_ see this
flexibility as an advantage.  vhost on the other hand depends on an
active userspace component and a slots/hva memory design, which is more
limiting in where it can be used and forces you to split the logic.
However, I think we both more or less agree on this point already.

For the record, vbus itself is simply a resource container for
virtual-devices, which provides abstractions for the various points of
interest to generalizing PV (memory, signals, etc) and the proper
isolation and protection guarantees.  What you do with it is defined by
the modular virtual-devices (e.g. virtio-net, venet, sched, hrt, scsi,
rdma, etc) and vbus-connectors (vbus-kvm, etc) you plug into it.

As an example, you could emulate the vhost design in vbus by writing a
"vbus-vhost" connector.  This connector would be very thin and terminate
locally in QEMU.  It would provide an ioctl-based verb namespace similar
to the existing vhost verbs we have today.  QEMU would then similarly
reflect the vbus-based virtio device as a PCI device to the guest, so
that virtio-pci works unmodified.

You would then have most of the advantages of the work I have done for
commoditizing/abstracting the key points for in-kernel PV, like the
memctx.  In addition, much of the work could be reused in multiple
environments since any vbus-compliant device model that is plugged into
the framework would work with any connector that is plugged in (e.g.
vbus-kvm (alacrityvm), vbus-vhost (KVM), and "vbus-ira").

The only tradeoff is in features offered by the connector (e.g.
vbus-vhost has the advantage that existing PV guests can continue to
work unmodified, vbus-kvm has the advantage that it supports new
features like generic shared memory, non-virtio based devices,
prioritizable interrupts, no dependencies on PCI for non PCI guests, etc).

Kind Regards,
-Greg



* Re: fanotify as syscalls
From: Evgeniy Polyakov @ 2009-09-15 20:16 UTC
  To: Eric Paris
  Cc: Jamie Lokier, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch, torvalds
In-Reply-To: <1252955295.2246.35.camel@dhcp231-106.rdu.redhat.com>

On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis@redhat.com) wrote:
> Just this week I got another request to look at syscalls.  So I did, I
> haven't prototyped it, but I can do it with syscalls, they would look
> like this:
> 
> int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);

...

> Are there demands that I convert to syscalls?  Do I really gain anything
> using 9 new inextensible syscalls over socket(), bind(), and 8
> setsockopt() calls?

In my personal opinion sockets are much simpler and more obvious than
syscalls. Also, one does not have to edit hundreds of arch-dependent headers,
conflicting with other poor syscall authors.

But implementing af_fanotify opens the door to zillions of other
af_something families, when the existing infrastructure, namely netlink,
would likely cover 99% of the potential use cases.

And frankly, I was not convinced that using netlink is
impossible in your scenario if some things are simplified, namely
event merging. Moreover, you can implement a pool of worker threads and
defer all the work to them and the appropriate event queues, which will
allow using rlimits for the listeners and opening files 'kind of' on
behalf of those processes.

But that is quite different from the approach you selected, which is
indeed more obvious. So if you ask whether fanotify should
use sockets or syscalls, I would prefer sockets, but I still recommend
rethinking your decision to move away from netlink, to be 100% sure that
it cannot meet your needs.

-- 
	Evgeniy Polyakov


* Re: UDP regression with packets rates < 10k per sec
From: Christoph Lameter @ 2009-09-15 20:25 UTC
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <4AAFCE3A.8060102@gmail.com>

On Tue, 15 Sep 2009, Eric Dumazet wrote:

> Once I understood my 2.6.31 kernel had much more features than 2.6.22 and that I tuned
> it to :
>
> - Let cpu run at full speed (3GHz instead of 2GHz) : before tuning, 2.6.31 was
> using "ondemand" governor and my cpus were running at 2GHz, while they where
> running at 3GHz on my 2.6.22 config

My kernel did not have support for any governors compiled in.

> - Dont let cpus enter C2/C3 wait states (idle=mwait)

Ok. Trying idle=mwait.

> - Correctly affine cpu to ethX irq (2.6.22 was running ethX irq on one cpu, while
>  on 2.6.31, irqs were distributed to all online cpus)

Interrupts of both 2.6.22 and 2.6.31 go to cpu 0. Does it matter for
loopback?

> Then, your mcast test gives same results, at 10pps, 100pps, 1000pps, 10000pps

loopback via mcast -Ln1 -r <rate>

		10pps	100pps	1000pps	10000pps
2.6.22(32bit)	7.36	7.28	7.15	7.16
2.6.31(64bit)	9.28	10.27	9.70	9.79

What a difference. Now the initial latency rampup for 2.6.31 is gone. So
even w/o governors the kernel does something to increase the latencies.

We sacrificed 2 - 3 microseconds per message to kernel features, bloat and
64 bitness?
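
The "2 - 3 microseconds" figure can be checked against the table above. This
is a small sketch that assumes the table values are latencies in
microseconds (the table itself gives no unit); the array and function names
are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NRATES 4	/* columns: 10pps, 100pps, 1000pps, 10000pps */

/* Values copied from the loopback table; unit assumed to be microseconds. */
static const double lat_2_6_22[NRATES] = { 7.36, 7.28, 7.15, 7.16 };
static const double lat_2_6_31[NRATES] = { 9.28, 10.27, 9.70, 9.79 };

/* Extra latency of 2.6.31 (64bit) over 2.6.22 (32bit) at rate column i. */
double latency_delta(size_t i)
{
	assert(i < NRATES);
	return lat_2_6_31[i] - lat_2_6_22[i];
}

/* Same comparison expressed as a ratio. */
double latency_ratio(size_t i)
{
	assert(i < NRATES);
	return lat_2_6_31[i] / lat_2_6_22[i];
}
```

The per-column differences come out at 1.92, 2.99, 2.55 and 2.63, so the
newer kernel is between roughly 2 and 3 units slower at every rate tested.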


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-15 20:40 UTC
  To: Gregory Haskins
  Cc: Avi Kivity, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AAFF437.7060100@gmail.com>

On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
> No, what I mean is how do you surface multiple ethernet and consoles to
> the guests?  For Ira's case, I think he needs at minimum at least one of
> each, and he mentioned possibly having two unique ethernets at one point.
> 
> His slave boards surface themselves as PCI devices to the x86
> host.  So how do you use that to make multiple vhost-based devices (say
> two virtio-nets, and a virtio-console) communicate across the transport?
> 
> There are multiple ways to do this, but what I am saying is that
> whatever is conceived will start to look eerily like a vbus-connector,
> since this is one of its primary purposes ;)

Can't all this be in userspace?



* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-15 20:43 UTC
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <20090915204036.GA27954@redhat.com>


Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
>> No, what I mean is how do you surface multiple ethernet and consoles to
>> the guests?  For Ira's case, I think he needs at minimum at least one of
>> each, and he mentioned possibly having two unique ethernets at one point.
>>
>> His slave boards surface themselves as PCI devices to the x86
>> host.  So how do you use that to make multiple vhost-based devices (say
>> two virtio-nets, and a virtio-console) communicate across the transport?
>>
>> There are multiple ways to do this, but what I am saying is that
>> whatever is conceived will start to look eerily like a vbus-connector,
>> since this is one of its primary purposes ;)
> 
> Can't all this be in userspace?

Can you outline your proposal?

-Greg




* Re: Fwd: [RFC v3] net: Introduce recvmmsg socket syscall
From: Arnaldo Carvalho de Melo @ 2009-09-15 20:52 UTC
  To: Nir Tzachar; +Cc: Linux Networking Development Mailing List, Ziv Ayalon
In-Reply-To: <9b2db90b0909151120l71498303w26cfd657c9f18092@mail.gmail.com>

On Tue, Sep 15, 2009 at 09:20:13PM +0300, Nir Tzachar wrote:
> >> Setup:
> >> linux 2.6.29.2 with the third version of the patch, running on an
> >> Intel Xeon X3220 2.4GHz quad core, with 4Gbyte of ram, running Ubuntu
> >> 9.04
> >
> > Which NIC? 10 Gbit/s?
> 
> 1G. We do not care as much for throughput as we do about latency...

OK, but anyway the 10 Gbit/s cards I've briefly played with all
exhibited lower latencies than all the 1 Gbit/s ones; in fact I've heard
about people moving to 10 Gbit/s not for the bw, but for the lower
latencies :-)
 
> 
> >> Results:
> >> On general, the recvmmsg beats the pants off the regular recvmsg by a
> >> whole millisecond (which might not sound much, but is _really_ a lot
> >> for us ;). The exact distribution fluctuates between half a milli and
> >> 2 millis, but the average is 1 milli.
> >
> > Do you have any testcase using publicly available software? Like qpidd,
> > etc? I'll eventually have to do that, for now I'm just using that
> > recvmmsg tool I posted, now with a recvmsg mode, then collecting 'perf
> > record' with and without callgraphs to post here. The client is just
> > pktgen spitting datagrams as if there is no tomorrow :-)
> 
> No. This was on a live, production system.

Wow :-)
> 
> > Showing that we get latency improvements is complementary to what I'm
> > doing, that is for now just showing the performance improvements and
> > showing what gives this improvement (perf counters runs).
> 
> We are more latency oriented, and, naturally, concentrate on this
> aspect of the problem. Producing numbers here is much more easier....
> I can easily come up with a test application which just measures the
> latency of processing packets, by employing a sending loop between two
> hosts.
> 
> > If you could come up with a testcase that you could share with us,
> > perhaps using one of these AMQP implementations, that would be great
> > too.
> 
> Well, in our experience, AMQP and other solutions have latency issues.
> Moreover, the receiving end of our application is a regular multicast
> stream. I will implement the simple latency test I mentioned earlier,
> and post some results soon.

OK.

And here are some callgraphs for a very simple app (attached) that stops
after receiving 1 million datagrams, collected with the perf tools in
the kernel. No packets per sec numbers, besides the fact that the
recvmmsg test run collected far fewer samples (finished quicker).

Client is pktgen sending 100 byte datagrams over a single tg3 1 Gbit/s
NIC, server is running over a bnx2 1 Gbit/s link as well, just a sink:

With recvmmsg, batch of 8 datagrams, no timeout:

http://oops.ghostprotocols.net:81/acme/perf.recvmmsg.step1.cg.data.txt.bz2

And with recvmsg:

http://oops.ghostprotocols.net:81/acme/perf.recvmsg.step1.cg.data.txt.bz2

Notice where we are spending time in the recvmmsg case... :-)
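
For readers who have not seen the call, a batched receive of 8 datagrams
with no timeout might look roughly like the sketch below. It is not the
attached test app. It assumes the recvmmsg() wrapper as later exposed by
glibc (under _GNU_SOURCE), and the helper names, the 2048 byte buffer size
and the loopback self test are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

#define BATCH 8		/* datagrams per recvmmsg() call */
#define DGRAM_MAX 2048	/* assumed upper bound on datagram size */

/*
 * Receive up to BATCH datagrams with a single system call.
 * Flags 0 and a NULL timeout: block until the batch is filled.
 * Returns the number of datagrams received, or -1 on error.
 */
int recv_batch(int fd, char bufs[BATCH][DGRAM_MAX], unsigned int lens[BATCH])
{
	struct mmsghdr msgs[BATCH];
	struct iovec iov[BATCH];
	int i, n;

	assert(bufs != NULL && lens != NULL);
	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = DGRAM_MAX;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	n = recvmmsg(fd, msgs, BATCH, 0, NULL);
	for (i = 0; i < n; i++)
		lens[i] = msgs[i].msg_len;	/* bytes in datagram i */
	return n;
}

/* Send BATCH 100 byte datagrams to ourselves over loopback and read them back. */
int loopback_selftest(void)
{
	static char bufs[BATCH][DGRAM_MAX];
	unsigned int lens[BATCH];
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	char payload[100];
	int fd, i, n;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &alen) < 0) {
		close(fd);
		return -1;
	}

	memset(payload, 'x', sizeof(payload));
	for (i = 0; i < BATCH; i++) {
		if (sendto(fd, payload, sizeof(payload), 0,
			   (struct sockaddr *)&addr, sizeof(addr)) !=
		    (ssize_t)sizeof(payload)) {
			close(fd);
			return -1;
		}
	}

	n = recv_batch(fd, bufs, lens);
	for (i = 0; i < n; i++) {
		if (lens[i] != sizeof(payload))
			n = -1;		/* unexpected datagram length */
	}
	close(fd);
	return n;
}
```

With flags 0 and no timeout the call waits for the whole batch, so a sink
like this only makes sense when the sender keeps the socket busy, as pktgen
does in the test described above.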

- Arnaldo


* Re: UDP regression with packets rates < 10k per sec
From: Eric Dumazet @ 2009-09-15 19:02 UTC
  To: Christoph Lameter; +Cc: netdev
In-Reply-To: <alpine.DEB.1.10.0909151550200.3340@V090114053VZO-1>

Christoph Lameter wrote:
> On Tue, 15 Sep 2009, Eric Dumazet wrote:
> 
>> Once I understood my 2.6.31 kernel had much more features than 2.6.22 and that I tuned
>> it to :
>>
>> - Let cpu run at full speed (3GHz instead of 2GHz) : before tuning, 2.6.31 was
>> using "ondemand" governor and my cpus were running at 2GHz, while they where
>> running at 3GHz on my 2.6.22 config
> 
> My kernel did not have support for any governors compiled in.
> 
>> - Dont let cpus enter C2/C3 wait states (idle=mwait)
> 
> Ok. Trying idle=mwait.
> 
>> - Correctly affine cpu to ethX irq (2.6.22 was running ethX irq on one cpu, while
>>  on 2.6.31, irqs were distributed to all online cpus)
> 
> Interrupts of both 2.6.22 and 2.6.31 go to cpu 0. Does it matter for
> loopback?

No, of course not: loopback triggers the softirq on the local cpu, so no
special setup is needed.

> 
>> Then, your mcast test gives same results, at 10pps, 100pps, 1000pps, 10000pps
> 
> loopback via mcast -Ln1 -r <rate>
> 
> 		10pps	100pps	1000pps	10000pps
> 2.6.22(32bit)	7.36	7.28	7.15	7.16
> 2.6.31(64bit)	9.28	10.27	9.70	9.79
> 
> What a difference. Now the initial latency rampup for 2.6.31 is gone. So
> even w/o governors the kernel does something to increase the latencies.
> 
> We sacrificed 2 - 3 microseconds per message to kernel features, bloat and
> 64 bitness?

Well, I don't know. I mainly use 32-bit kernels, but yes, using 64 bits has a cost,
since skbs, for example, are bigger and sockets are bigger, so we touch more cache lines
per transaction...


You could precisely compute the number of cycles per transaction with the "perf" tools
(only on 2.6.31), comparing 64-bit and 32-bit kernels: bench at 100000 pps, for example,
and count the number of perf counter irqs per second.
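
A rough sketch of that arithmetic, assuming the cycle counter is programmed to
raise one irq every `sample_period` cycles (the function and its parameter
names are made up for illustration, they are not part of the perf tools):

```c
#include <assert.h>

/*
 * Estimate the CPU cycles spent per packet from a sampling cycle counter.
 * Assumption: every counter irq stands for `sample_period` cycles, so
 * cycles per second = irqs_per_sec * sample_period.
 */
double cycles_per_transaction(double irqs_per_sec, double sample_period,
			      double pps)
{
	assert(pps > 0.0);	/* a zero packet rate makes the ratio meaningless */
	return irqs_per_sec * sample_period / pps;
}
```

Running the same benchmark on both kernels and comparing the two results
would give the per-packet cost of the 64-bit build in cycles rather than in
microseconds.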

^ permalink raw reply

* Re: alloc skb based on a given data buffer
From: David Miller @ 2009-09-15 21:16 UTC (permalink / raw)
  To: johannes-cdvu00un1VgdHxzADdlk8Q
  Cc: yi.zhu-ral2JQCrhuEAvxtiuMwx3w, mel-wPRd99KPJ+uzQB+pC5nmwQ,
	reinette.chatre-ral2JQCrhuEAvxtiuMwx3w,
	elendil-EIBgga6/0yRmR6Xm/wNWPw,
	Larry.Finger-tQ5ms3gMjBLk1uMJSBkQmQ,
	linville-2XuSBdqkA4R54TAoqtyWWQ, penberg-bbCR+/B0CizivPeTLB3BmA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	ipw3945-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	assaf.krauss-ral2JQCrhuEAvxtiuMwx3w,
	mohamed.abbas-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1253028631.23427.55.camel-YfaajirXv2244ywRPIzf9A@public.gmane.org>

From: Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
Date: Tue, 15 Sep 2009 08:30:31 -0700

> Hold, mac80211 can't cope with that at this point for sw crypto and
> possibly other things.

Then it should skb_linearize() at input, or similar.
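
For readers not familiar with the term, here is a userspace model of what
linearizing a fragmented packet means. It is only a sketch: `struct pkt`,
`struct frag` and `pkt_linearize()` are hypothetical stand-ins, not the
kernel's sk_buff API, and the real skb_linearize() works on the skb's own
paged data.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FRAGS 4

/* Hypothetical stand-in for a paged buffer: payload scattered over fragments. */
struct frag {
	const unsigned char *data;
	size_t len;
};

struct pkt {
	struct frag frags[MAX_FRAGS];
	size_t nr_frags;
	unsigned char *linear;	/* contiguous copy, NULL until linearized */
	size_t len;		/* total payload length once linearized */
};

/*
 * Copy all fragments into one contiguous buffer, so that code which only
 * understands flat buffers (software crypto, for instance) can process it.
 * Returns 0 on success, -1 on allocation failure.
 */
int pkt_linearize(struct pkt *p)
{
	size_t off = 0, total = 0, i;

	assert(p->nr_frags <= MAX_FRAGS);
	if (p->linear)
		return 0;	/* already linear, nothing to do */

	for (i = 0; i < p->nr_frags; i++)
		total += p->frags[i].len;

	p->linear = malloc(total ? total : 1);
	if (!p->linear)
		return -1;

	for (i = 0; i < p->nr_frags; i++) {
		memcpy(p->linear + off, p->frags[i].data, p->frags[i].len);
		off += p->frags[i].len;
	}
	p->len = total;
	return 0;
}
```

The cost is one allocation and one copy per fragmented packet, which is the
price of doing the conversion at input.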
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/5] Phonet: basic routing support
From: David Miller @ 2009-09-15 21:22 UTC (permalink / raw)
  To: remi; +Cc: netdev
In-Reply-To: <a9972d3fe3d2bf09387bee7827b571c0@chewa.net>

From: Rémi Denis-Courmont <remi@remlab.net>
Date: Tue, 15 Sep 2009 13:30:19 +0200

> I am not sure if feature patches are still allowed. If not, I can just
> repost this at a more convenient time.

Please resend these when net-next-2.6 opens back up.

Thank you.

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-15 21:25 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AAFFC8E.9010404@gmail.com>

On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
> >> No, what I mean is how do you surface multiple ethernet and consoles to
> >> the guests?  For Ira's case, I think he needs at minimum at least one of
> >> each, and he mentioned possibly having two unique ethernets at one point.
> >>
> >> His slave boards surface themselves as PCI devices to the x86
> >> host.  So how do you use that to make multiple vhost-based devices (say
> >> two virtio-nets, and a virtio-console) communicate across the transport?
> >>
> >> There are multiple ways to do this, but what I am saying is that
> >> whatever is conceived will start to look eerily like a vbus-connector,
> >> since this is one of its primary purposes ;)
> > 
> > Can't all this be in userspace?
> 
> Can you outline your proposal?
> 
> -Greg
> 

Userspace in x86 maps a PCI region, uses it for communication with ppc?
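
A minimal sketch of the host-side mapping, assuming the board's BAR is
exposed through the usual sysfs resourceN file. `map_region()` is a made-up
helper, and the device address in the comment is a placeholder:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of the file at `path` read/write and shared.
 * For a PCI BAR, `path` would be a sysfs resource file such as
 * /sys/bus/pci/devices/<bdf>/resource0 (<bdf> is a placeholder).
 * Returns the mapping, or NULL on failure.
 */
void *map_region(const char *path, size_t len)
{
	void *p;
	int fd;

	assert(path != NULL && len > 0);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	return p == MAP_FAILED ? NULL : p;
}
```

Whatever protocol runs between the two sides then lives in the layout of
that shared region.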

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-15 21:38 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB0098F.9030207@gmail.com>

On Tue, Sep 15, 2009 at 05:39:27PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
> >> Michael S. Tsirkin wrote:
> >>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
> >>>> No, what I mean is how do you surface multiple ethernet and consoles to
> >>>> the guests?  For Ira's case, I think he needs at minimum at least one of
> >>>> each, and he mentioned possibly having two unique ethernets at one point.
> >>>>
> >>>> His slave boards surface themselves as PCI devices to the x86
> >>>> host.  So how do you use that to make multiple vhost-based devices (say
> >>>> two virtio-nets, and a virtio-console) communicate across the transport?
> >>>>
> >>>> There are multiple ways to do this, but what I am saying is that
> >>>> whatever is conceived will start to look eerily like a vbus-connector,
> >>>> since this is one of its primary purposes ;)
> >>> Can't all this be in userspace?
> >> Can you outline your proposal?
> >>
> >> -Greg
> >>
> > 
> > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > 
> 
> And what do you propose this communication to look like?

Who cares? Implement vbus protocol there if you like.

> -Greg
> 



^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-15 21:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <20090915212545.GC27954@redhat.com>


Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
>>>> No, what I mean is how do you surface multiple ethernet and consoles to
>>>> the guests?  For Ira's case, I think he needs at minimum at least one of
>>>> each, and he mentioned possibly having two unique ethernets at one point.
>>>>
>>>> His slave boards surface themselves as PCI devices to the x86
>>>> host.  So how do you use that to make multiple vhost-based devices (say
>>>> two virtio-nets, and a virtio-console) communicate across the transport?
>>>>
>>>> There are multiple ways to do this, but what I am saying is that
>>>> whatever is conceived will start to look eerily like a vbus-connector,
>>>> since this is one of its primary purposes ;)
>>> Can't all this be in userspace?
>> Can you outline your proposal?
>>
>> -Greg
>>
> 
> Userspace in x86 maps a PCI region, uses it for communication with ppc?
> 

And what do you propose this communication to look like?

-Greg



^ permalink raw reply

* [PATCH -next] ssb/sdio: fix printk format warnings
From: Randy Dunlap @ 2009-09-15 21:52 UTC (permalink / raw)
  To: linux-mmc, Michael Buesch; +Cc: netdev, akpm

From: Randy Dunlap <randy.dunlap@oracle.com>

Fix printk format warnings:

drivers/ssb/sdio.c:336: warning: format '%u' expects type 'unsigned int', but argument 7 has type 'size_t'
drivers/ssb/sdio.c:443: warning: format '%u' expects type 'unsigned int', but argument 7 has type 'size_t'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 drivers/ssb/sdio.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-next-20090914.orig/drivers/ssb/sdio.c
+++ linux-next-20090914/drivers/ssb/sdio.c
@@ -333,7 +333,7 @@ static void ssb_sdio_block_read(struct s
 		goto out;
 
 err_out:
-	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%u), error %d\n",
+	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%zu), error %d\n",
 		bus->sdio_sbaddr >> 16, offset, reg_width, saved_count, error);
 out:
 	sdio_release_host(bus->host_sdio);
@@ -440,7 +440,7 @@ static void ssb_sdio_block_write(struct 
 		goto out;
 
 err_out:
-	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%u), error %d\n",
+	dev_dbg(ssb_sdio_dev(bus), "%04X:%04X (width=%u, len=%zu), error %d\n",
 		bus->sdio_sbaddr >> 16, offset, reg_width, saved_count, error);
 out:
 	sdio_release_host(bus->host_sdio);




---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/
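
For reference, a small userspace sketch of the same format fix.
`format_len()` is a hypothetical helper written for this note; only the
`%u` versus `%zu` distinction comes from the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * size_t is wider than unsigned int on most 64-bit targets, so "%u" is
 * the wrong conversion for it there, while "%zu" is right everywhere.
 * Returns the number of characters written, as snprintf() does.
 */
int format_len(char *buf, size_t bufsz, unsigned int width, size_t len)
{
	assert(buf != NULL && bufsz > 0);
	return snprintf(buf, bufsz, "(width=%u, len=%zu)", width, len);
}
```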

^ permalink raw reply

* Re: fanotify as syscalls
From: Eric Paris @ 2009-09-15 21:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Jamie Lokier, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch, torvalds
In-Reply-To: <20090915201620.GB32192@ioremap.net>

On Wed, 2009-09-16 at 00:16 +0400, Evgeniy Polyakov wrote:
> On Mon, Sep 14, 2009 at 03:08:15PM -0400, Eric Paris (eparis@redhat.com) wrote:
> > Just this week I got another request to look at syscalls.  So I did, I
> > haven't prototyped it, but I can do it with syscalls, they would look
> > like this:
> > 
> > int fanotify_init(int flags, int f_flags, __u64 mask, unsigned int priority);
> 
> ...
> 
> > Are there demands that I convert to syscalls?  Do I really gain anything
> > using 9 new inextensible syscalls over socket(), bind(), and 8
> > setsockopt() calls?
> 
> In my personal opinion sockets are much simpler and more obvious than
> syscalls. Also, one does not have to edit hundreds of arch-dependent headers,
> conflicting with other pity syscallers.
> 
> But implementing af_fanotify opens the door to zillions of other
> af_something families, which could be implemented using existing infrastructure;
> namely, netlink will likely solve 99% of potential use cases.
> 
> And frankly I did not find it way too convincing that using netlink is
> impossible in your scenario if some things will be simplified, namely
> event merging.

Nothing's impossible, but is netlink a square peg for this round hole?
One of the great benefits of netlink, the attribute matching and
filtering, although possibly useful isn't some panacea as we have to do
that well before netlink to have anything like decent performance.
Imagine every single fs event creating an skb and sending it with
netlink only to have most of them dropped.

The only other benefit to netlink that I know of is the reasonable,
easy, and clean addition of information later in time with backwards
compatibility as needed.  That's really cool, I admit, but with the
limited amount of additional info that users have wanted out of inotify
I think my data type extensibility should be enough.

>  Moreover you can implement a pool of working threads and
> postpone all the work to them and appropriate event queues, which will
> allow to use rlimits for the listeners and open files 'kind of' on
> behalf of those processes.

I'm sorry, I don't understand.  I don't see how worker threads help
anything here.  Can you explain what you are thinking?

> But it is quite different from the approach you selected and which is
> more obvious indeed. So if you ask a question whether fanotify should
> use sockets or syscalls, I would prefer sockets

I've heard someone else off list say this as well.  I'm not certain why.
I actually spent the day yesterday on this and have fanotify working over 5 new
syscalls (good thing I wrote the code with separate back and front
ends for just this purpose).  And I really don't hate it.  I think 3
might be enough.

fanotify_init() ---- very much like inotify_init
fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
fanotify_modify_mark_fd() --- same but with an fd instead of a path
fanotify_response() --- userspace tells the kernel what to do if requested/allowed
   (could probably be done using write() to the fanotify fd)
fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

If there is a strong need for syscalls I'm ready and willing.  I'd love
to hear Linus say socket+setsockopt() is a merge blocker and I have to
go to syscalls if he sees it that way.  If davem and friends say I'm not
networky enough to use socket()+setsockopt() I guess I have to look at
netlink (which I'm still not convinced buys us anything vs the crazy
complexity) or go with syscalls.  I'm perfectly willing to admit this is
a /dev+ioctl type interface only implemented on socket+setsockopt().  If
that's a no go, someone say it now please.

> but I still recommend
> to rethink your decision to move away from netlink to be 100% sure that
> it will not solve your needs.

I don't see what's gained using netlink.  I am already reusing the
fsnotify code to do all my queuing.  Someone help me understand the
benefit of netlink and help me understand how we can reasonably meet the
needs and I'll try to prototype it.

1) fds must be opened in the receiving process
2) reliability: if an event is lost, the sending side must know

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-15 21:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Ira W. Snyder, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <20090915213854.GE27954@redhat.com>


Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2009 at 05:39:27PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
>>>> Michael S. Tsirkin wrote:
>>>>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
>>>>>> No, what I mean is how do you surface multiple ethernet and consoles to
>>>>>> the guests?  For Ira's case, I think he needs at minimum at least one of
>>>>>> each, and he mentioned possibly having two unique ethernets at one point.
>>>>>>
>>>>>> His slave boards surface themselves as PCI devices to the x86
>>>>>> host.  So how do you use that to make multiple vhost-based devices (say
>>>>>> two virtio-nets, and a virtio-console) communicate across the transport?
>>>>>>
>>>>>> There are multiple ways to do this, but what I am saying is that
>>>>>> whatever is conceived will start to look eerily like a vbus-connector,
>>>>>> since this is one of its primary purposes ;)
>>>>> Can't all this be in userspace?
>>>> Can you outline your proposal?
>>>>
>>>> -Greg
>>>>
>>> Userspace in x86 maps a PCI region, uses it for communication with ppc?
>>>
>> And what do you propose this communication to look like?
> 
> Who cares? Implement vbus protocol there if you like.
> 

Exactly.  My point is that you need something like a vbus protocol there. ;)

Here is the protocol I run over PCI in AlacrityVM:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025

And I guess to your point, yes the protocol can technically be in
userspace (outside of whatever you need for the in-kernel portion of the
communication transport, if any).

The vbus-connector design does not specify where the protocol needs to
take place, per se.  Note, however, for performance reasons some parts
of the protocol may want to be in the kernel (such as DEVCALL and
SHMSIGNAL).  It is for this reason that I just run all of it there,
because IMO it's simpler than splitting it up.  The slow path stuff just
rides on infrastructure that I need for fast-path anyway, so it doesn't
really cost me anything additional.

Kind Regards,
-Greg



^ permalink raw reply

* [GIT PULL 0/7] Small IEEE 802.15.4 updates
From: Dmitry Eremin-Solenikov @ 2009-09-15 22:12 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev

Hi, Dave,

Could you please pull these small updates for IEEE 802.15.4, targeted
at 2.6.32-rcX? Thank you.


The following changes since commit 4142e0d1def2c0176c27fd2e810243045a62eb6d:
  Linus Torvalds (1):
        Merge branch 'osync_cleanup' of git://git.kernel.org/.../jack/linux-fs-2.6

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/lowpan/lowpan.git for-next

Dmitry Eremin-Solenikov (7):
      af_ieee802154: setsockopt optlen arg isn't __user
      ieee802154: add locking for seq numbers
      ieee802154: add a helper to put the wpan_phy device
      ieee802154: add wpan-phy iteration functions
      ieee802154: merge nl802154 and wpan-class in single module
      nl802154: split away MAC commands implementation
      ieee802154: add LIST_PHY command support

 include/linux/nl802154.h    |    1 +
 include/net/wpan-phy.h      |    8 +
 net/ieee802154/Makefile     |    4 +-
 net/ieee802154/dgram.c      |    2 +-
 net/ieee802154/ieee802154.h |   48 ++++
 net/ieee802154/netlink.c    |  616 ++-----------------------------------------
 net/ieee802154/nl-mac.c     |  609 ++++++++++++++++++++++++++++++++++++++++++
 net/ieee802154/nl-phy.c     |  175 ++++++++++++
 net/ieee802154/raw.c        |    2 +-
 net/ieee802154/wpan-class.c |   48 ++++-
 10 files changed, 908 insertions(+), 605 deletions(-)
 create mode 100644 net/ieee802154/ieee802154.h
 create mode 100644 net/ieee802154/nl-mac.c
 create mode 100644 net/ieee802154/nl-phy.c


-- 
With best wishes
Dmitry


^ permalink raw reply

* [PATCH 1/7] af_ieee802154: setsockopt optlen arg isn't __user
From: Dmitry Eremin-Solenikov @ 2009-09-15 22:12 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253052785-26190-1-git-send-email-dbaryshkov@gmail.com>

Remove __user annotation from optlen arg as it's bogus.

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 net/ieee802154/dgram.c |    2 +-
 net/ieee802154/raw.c   |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ieee802154/dgram.c b/net/ieee802154/dgram.c
index 77ae685..51593a4 100644
--- a/net/ieee802154/dgram.c
+++ b/net/ieee802154/dgram.c
@@ -414,7 +414,7 @@ static int dgram_getsockopt(struct sock *sk, int level, int optname,
 }
 
 static int dgram_setsockopt(struct sock *sk, int level, int optname,
-		    char __user *optval, int __user optlen)
+		    char __user *optval, int optlen)
 {
 	struct dgram_sock *ro = dgram_sk(sk);
 	int val;
diff --git a/net/ieee802154/raw.c b/net/ieee802154/raw.c
index 4681501..1319885 100644
--- a/net/ieee802154/raw.c
+++ b/net/ieee802154/raw.c
@@ -244,7 +244,7 @@ static int raw_getsockopt(struct sock *sk, int level, int optname,
 }
 
 static int raw_setsockopt(struct sock *sk, int level, int optname,
-		    char __user *optval, int __user optlen)
+		    char __user *optval, int optlen)
 {
 	return -EOPNOTSUPP;
 }
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 2/7] ieee802154: add locking for seq numbers
From: Dmitry Eremin-Solenikov @ 2009-09-15 22:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253052785-26190-2-git-send-email-dbaryshkov@gmail.com>

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 net/ieee802154/netlink.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ieee802154/netlink.c b/net/ieee802154/netlink.c
index 2106ecb..ca767bd 100644
--- a/net/ieee802154/netlink.c
+++ b/net/ieee802154/netlink.c
@@ -35,6 +35,7 @@
 #include <net/ieee802154_netdev.h>
 
 static unsigned int ieee802154_seq_num;
+static DEFINE_SPINLOCK(ieee802154_seq_lock);
 
 static struct genl_family ieee802154_coordinator_family = {
 	.id		= GENL_ID_GENERATE,
@@ -57,12 +58,15 @@ static struct sk_buff *ieee802154_nl_create(int flags, u8 req)
 {
 	void *hdr;
 	struct sk_buff *msg = nlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
+	unsigned long f;
 
 	if (!msg)
 		return NULL;
 
+	spin_lock_irqsave(&ieee802154_seq_lock, f);
 	hdr = genlmsg_put(msg, 0, ieee802154_seq_num++,
 			&ieee802154_coordinator_family, flags, req);
+	spin_unlock_irqrestore(&ieee802154_seq_lock, f);
 	if (!hdr) {
 		nlmsg_free(msg);
 		return NULL;
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 3/7] ieee802154: add a helper to put the wpan_phy device
From: Dmitry Eremin-Solenikov @ 2009-09-15 22:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253052785-26190-3-git-send-email-dbaryshkov@gmail.com>

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 include/net/wpan-phy.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/net/wpan-phy.h b/include/net/wpan-phy.h
index 547b1e2..5e803a0 100644
--- a/include/net/wpan-phy.h
+++ b/include/net/wpan-phy.h
@@ -56,6 +56,12 @@ static inline void *wpan_phy_priv(struct wpan_phy *phy)
 }
 
 struct wpan_phy *wpan_phy_find(const char *str);
+
+static inline void wpan_phy_put(struct wpan_phy *phy)
+{
+	put_device(&phy->dev);
+}
+
 static inline const char *wpan_phy_name(struct wpan_phy *phy)
 {
 	return dev_name(&phy->dev);
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 5/7] ieee802154: merge nl802154 and wpan-class in single module
From: Dmitry Eremin-Solenikov @ 2009-09-15 22:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253052785-26190-5-git-send-email-dbaryshkov@gmail.com>

There is no real need to have the ieee802154 interfaces separated
into several small modules, as none of them has its own use.

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 net/ieee802154/Makefile     |    4 ++--
 net/ieee802154/netlink.c    |    9 ++-------
 net/ieee802154/wpan-class.c |   23 ++++++++++++++++++++---
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/net/ieee802154/Makefile b/net/ieee802154/Makefile
index 4068a9f..42b1f0d 100644
--- a/net/ieee802154/Makefile
+++ b/net/ieee802154/Makefile
@@ -1,5 +1,5 @@
-obj-$(CONFIG_IEEE802154) +=	nl802154.o af_802154.o wpan-class.o
-nl802154-y		:= netlink.o nl_policy.o
+obj-$(CONFIG_IEEE802154) +=	ieee802154.o af_802154.o
+ieee802154-y		:= netlink.o nl_policy.o wpan-class.o
 af_802154-y		:= af_ieee802154.o raw.o dgram.o
 
 ccflags-y += -Wall -DDEBUG
diff --git a/net/ieee802154/netlink.c b/net/ieee802154/netlink.c
index ca767bd..0fadd6b 100644
--- a/net/ieee802154/netlink.c
+++ b/net/ieee802154/netlink.c
@@ -643,7 +643,7 @@ static struct genl_ops ieee802154_coordinator_ops[] = {
 							ieee802154_dump_iface),
 };
 
-static int __init ieee802154_nl_init(void)
+int __init ieee802154_nl_init(void)
 {
 	int rc;
 	int i;
@@ -676,14 +676,9 @@ fail:
 	genl_unregister_family(&ieee802154_coordinator_family);
 	return rc;
 }
-module_init(ieee802154_nl_init);
 
-static void __exit ieee802154_nl_exit(void)
+void __exit ieee802154_nl_exit(void)
 {
 	genl_unregister_family(&ieee802154_coordinator_family);
 }
-module_exit(ieee802154_nl_exit);
-
-MODULE_LICENSE("GPL v2");
-MODULE_DESCRIPTION("ieee 802.15.4 configuration interface");
 
diff --git a/net/ieee802154/wpan-class.c b/net/ieee802154/wpan-class.c
index 0cec138..f8479c6 100644
--- a/net/ieee802154/wpan-class.c
+++ b/net/ieee802154/wpan-class.c
@@ -22,6 +22,8 @@
 
 #include <net/wpan-phy.h>
 
+#include "ieee802154.h"
+
 #define MASTER_SHOW_COMPLEX(name, format_string, args...)		\
 static ssize_t name ## _show(struct device *dev,			\
 			    struct device_attribute *attr, char *buf)	\
@@ -169,16 +171,31 @@ EXPORT_SYMBOL(wpan_phy_free);
 
 static int __init wpan_phy_class_init(void)
 {
-	return class_register(&wpan_phy_class);
+	int rc;
+	rc = class_register(&wpan_phy_class);
+	if (rc)
+		goto err;
+
+	rc = ieee802154_nl_init();
+	if (rc)
+		goto err_nl;
+
+	return 0;
+err_nl:
+	class_unregister(&wpan_phy_class);
+err:
+	return rc;
 }
-subsys_initcall(wpan_phy_class_init);
+module_init(wpan_phy_class_init);
 
 static void __exit wpan_phy_class_exit(void)
 {
+	ieee802154_nl_exit();
 	class_unregister(&wpan_phy_class);
 }
 module_exit(wpan_phy_class_exit);
 
-MODULE_DESCRIPTION("IEEE 802.15.4 device class");
 MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("IEEE 802.15.4 configuration interface");
+MODULE_AUTHOR("Dmitry Eremin-Solenikov");
 
-- 
1.6.3.3
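
The init/exit ordering in wpan_phy_class_init() above follows the usual
goto-unwind pattern. A self-contained userspace sketch of it, where the
`fake_*` functions and the `fail_*` switches are hypothetical stand-ins for
class_register() and ieee802154_nl_init():

```c
#include <assert.h>

int class_up, nl_up;		/* which parts are currently registered */
int fail_class, fail_nl;	/* set to 1 to force a failure */

static int fake_class_register(void)
{
	if (fail_class)
		return -1;
	class_up = 1;
	return 0;
}

static void fake_class_unregister(void)
{
	assert(class_up);	/* only undo what was actually set up */
	class_up = 0;
}

static int fake_nl_init(void)
{
	if (fail_nl)
		return -1;
	nl_up = 1;
	return 0;
}

static void fake_nl_exit(void)
{
	assert(nl_up);
	nl_up = 0;
}

/* Bring the parts up in order; on failure undo only what succeeded. */
int subsystem_init(void)
{
	int rc;

	rc = fake_class_register();
	if (rc)
		goto err;

	rc = fake_nl_init();
	if (rc)
		goto err_nl;

	return 0;
err_nl:
	fake_class_unregister();
err:
	return rc;
}

/* Tear down in the reverse order of initialization. */
void subsystem_exit(void)
{
	fake_nl_exit();
	fake_class_unregister();
}
```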


^ permalink raw reply related

* [PATCH 4/7] ieee802154: add wpan-phy iteration functions
From: Dmitry Eremin-Solenikov @ 2009-09-15 22:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-zigbee-devel, Sergey Lapin, netdev
In-Reply-To: <1253052785-26190-4-git-send-email-dbaryshkov@gmail.com>

Add API to iterate over the wpan-phy instances.

Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 include/net/wpan-phy.h      |    2 ++
 net/ieee802154/wpan-class.c |   25 +++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/include/net/wpan-phy.h b/include/net/wpan-phy.h
index 5e803a0..3367dd9 100644
--- a/include/net/wpan-phy.h
+++ b/include/net/wpan-phy.h
@@ -48,6 +48,8 @@ struct wpan_phy *wpan_phy_alloc(size_t priv_size);
 int wpan_phy_register(struct device *parent, struct wpan_phy *phy);
 void wpan_phy_unregister(struct wpan_phy *phy);
 void wpan_phy_free(struct wpan_phy *phy);
+/* Same semantics as for class_for_each_device */
+int wpan_phy_for_each(int (*fn)(struct wpan_phy *phy, void *data), void *data);
 
 static inline void *wpan_phy_priv(struct wpan_phy *phy)
 {
diff --git a/net/ieee802154/wpan-class.c b/net/ieee802154/wpan-class.c
index f306604..0cec138 100644
--- a/net/ieee802154/wpan-class.c
+++ b/net/ieee802154/wpan-class.c
@@ -91,6 +91,31 @@ struct wpan_phy *wpan_phy_find(const char *str)
 }
 EXPORT_SYMBOL(wpan_phy_find);
 
+struct wpan_phy_iter_data {
+	int (*fn)(struct wpan_phy *phy, void *data);
+	void *data;
+};
+
+static int wpan_phy_iter(struct device *dev, void *_data)
+{
+	struct wpan_phy_iter_data *wpid = _data;
+	struct wpan_phy *phy = container_of(dev, struct wpan_phy, dev);
+	return wpid->fn(phy, wpid->data);
+}
+
+int wpan_phy_for_each(int (*fn)(struct wpan_phy *phy, void *data),
+		void *data)
+{
+	struct wpan_phy_iter_data wpid = {
+		.fn = fn,
+		.data = data,
+	};
+
+	return class_for_each_device(&wpan_phy_class, NULL,
+			&wpid, wpan_phy_iter);
+}
+EXPORT_SYMBOL(wpan_phy_for_each);
+
 static int wpan_phy_idx_valid(int idx)
 {
 	return idx >= 0;
-- 
1.6.3.3
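
A userspace sketch of the wrapper technique used by wpan_phy_for_each():
a generic walker that only knows `struct device` calls a small trampoline,
which recovers the outer object with container_of() and forwards to the
typed callback. All types and names here are simplified stand-ins, not the
driver-core API.

```c
#include <assert.h>
#include <stddef.h>

/* Same idea as the kernel macro: recover the outer struct from a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct device { int id; };

struct phy {
	const char *name;
	struct device dev;	/* embedded, as in struct wpan_phy */
};

/* Generic walker: stops at the first non-zero return and passes it up. */
int for_each_device(struct device **devs, size_t n, void *data,
		    int (*fn)(struct device *dev, void *data))
{
	size_t i;
	int rc;

	for (i = 0; i < n; i++) {
		rc = fn(devs[i], data);
		if (rc)
			return rc;
	}
	return 0;
}

struct phy_iter_data {
	int (*fn)(struct phy *phy, void *data);
	void *data;
};

/* Trampoline: convert the generic device back into a phy. */
static int phy_iter(struct device *dev, void *_data)
{
	struct phy_iter_data *pid = _data;

	return pid->fn(container_of(dev, struct phy, dev), pid->data);
}

int phy_for_each(struct device **devs, size_t n,
		 int (*fn)(struct phy *phy, void *data), void *data)
{
	struct phy_iter_data pid = { .fn = fn, .data = data };

	assert(fn != NULL);
	return for_each_device(devs, n, &pid, phy_iter);
}

/* Example callback: count the phys visited. */
int count_phy(struct phy *phy, void *data)
{
	(void)phy;
	(*(int *)data)++;
	return 0;
}
```

The caller's callback never sees `struct device`, which is the point of the
extra iteration struct.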


^ permalink raw reply related
