Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] ipv6: Add IFA_F_DADFAILED flag
From: Jens Rosenboom @ 2009-09-08 13:57 UTC (permalink / raw)
  To: Brian Haley; +Cc: david Miller, netdev@vger.kernel.org, YOSHIFUJI Hideaki
In-Reply-To: <4AA1C0FF.4030109@hp.com>

On Fri, 2009-09-04 at 21:38 -0400, Brian Haley wrote:
> [Note: if this is accepted I'll send out a patch for iproute,
>  if you'd prefer to not use the last ifa_flag I'll send a
>  much larger patch that does this differently :) ]
> 
> 
> Add IFA_F_DADFAILED flag to denote an IPv6 address that has
> failed Duplicate Address Detection, that way tools like
> /sbin/ip can be more informative.
> 
> 3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
>     inet6 2001:db8::1/64 scope global tentative dadfailed 
>        valid_lft forever preferred_lft forever
> 
> Signed-off-by: Brian Haley <brian.haley@hp.com>
> ---
> 
> diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
> index a60c821..fd97404 100644
> --- a/include/linux/if_addr.h
> +++ b/include/linux/if_addr.h
> @@ -41,6 +41,7 @@ enum
>  
>  #define	IFA_F_NODAD		0x02
>  #define IFA_F_OPTIMISTIC	0x04
> +#define IFA_F_DADFAILED		0x08
>  #define	IFA_F_HOMEADDRESS	0x10
>  #define IFA_F_DEPRECATED	0x20
>  #define IFA_F_TENTATIVE		0x40
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 43b3c9f..6532966 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -1376,7 +1376,7 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp)
>  	if (ifp->flags&IFA_F_PERMANENT) {
>  		spin_lock_bh(&ifp->lock);
>  		addrconf_del_timer(ifp);
> -		ifp->flags |= IFA_F_TENTATIVE;
> +		ifp->flags |= IFA_F_DADFAILED;

I think you still have to set IFA_F_TENTATIVE here, too, otherwise
ipv6_dev_get_saddr() will use this address. 		


^ permalink raw reply

* Re: [iproute2] tc action mirred question
From: jamal @ 2009-09-08 14:10 UTC (permalink / raw)
  To: thomas yang; +Cc: netdev
In-Reply-To: <f4f837ab0909080650t343efbmeb2f121def40bd9f@mail.gmail.com>

On Tue, 2009-09-08 at 21:50 +0800, thomas yang wrote:

> He want to route the mirroring packets.
> 
> " - Mirror takes a copy of the packet and sends it to specified
>  dev ("port" in ethernet switch/bridging terminology)
>  - redirect
>  steals the packet and redirects to specified destination dev. "
> 
> So,'mirror' is different from 'redirect'.  Change the line 'action
> mirred egress redirect dev eth0' to 'action mirred egress mirror dev
> eth0' .
> Both 'mirror' and 'redirect'  can transmit the packets to otner node,
> but mirror make a copy, then transmit it;  redirect steals the packet,
> right  ?
> 

Yes, of course. That was an example on how to use pedit. If you want
to be pedantic then note that no eth1 device is being used in the
original example and neither is itsensible to make changes to the MAC
address on ingress ;->
In any case, please go and run some experiments to test the theories.

cheers,
jamal


^ permalink raw reply

* Re: [iproute2] tc action mirred question
From: thomas yang @ 2009-09-08 14:35 UTC (permalink / raw)
  To: hadi; +Cc: netdev
In-Reply-To: <1252419044.5244.17.camel@dogo.mojatatu.com>

>
> Yes, of course. That was an example on how to use pedit. If you want
> to be pedantic then note that no eth1 device is being used in the
> original example and neither is itsensible to make changes to the MAC
> address on ingress ;->
> In any case, please go and run some experiments to test the theories.
>

I think the idea of the original example is good,  'tc' is very useful.
I will try some experiments to test the theories.   : )

------
regards,
thomas

^ permalink raw reply

* Re: [PATCH] ipv6: Add IFA_F_DADFAILED flag
From: Brian Haley @ 2009-09-08 15:18 UTC (permalink / raw)
  To: Jens Rosenboom; +Cc: david Miller, netdev@vger.kernel.org, YOSHIFUJI Hideaki
In-Reply-To: <1252418247.5827.8.camel@fnki-nb00130>

Jens Rosenboom wrote:
>> --- a/net/ipv6/addrconf.c
>> +++ b/net/ipv6/addrconf.c
>> @@ -1376,7 +1376,7 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp)
>>  	if (ifp->flags&IFA_F_PERMANENT) {
>>  		spin_lock_bh(&ifp->lock);
>>  		addrconf_del_timer(ifp);
>> -		ifp->flags |= IFA_F_TENTATIVE;
>> +		ifp->flags |= IFA_F_DADFAILED;
> 
> I think you still have to set IFA_F_TENTATIVE here, too, otherwise
> ipv6_dev_get_saddr() will use this address. 		

The tentative bit is still set from when this address was added back
in ipv6_add_addr() from what I can tell, re-setting it here is actually
unnecessary.  At least /sbin/ip was still showing it set during my
testing.

-Brian


^ permalink raw reply

* Re: [PATCH] ipv6: Add IFA_F_DADFAILED flag
From: Jens Rosenboom @ 2009-09-08 15:43 UTC (permalink / raw)
  To: Brian Haley; +Cc: david Miller, netdev@vger.kernel.org, YOSHIFUJI Hideaki
In-Reply-To: <4AA675D4.8030406@hp.com>

On Tue, 2009-09-08 at 11:18 -0400, Brian Haley wrote:
> Jens Rosenboom wrote:
> >> --- a/net/ipv6/addrconf.c
> >> +++ b/net/ipv6/addrconf.c
> >> @@ -1376,7 +1376,7 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp)
> >>  	if (ifp->flags&IFA_F_PERMANENT) {
> >>  		spin_lock_bh(&ifp->lock);
> >>  		addrconf_del_timer(ifp);
> >> -		ifp->flags |= IFA_F_TENTATIVE;
> >> +		ifp->flags |= IFA_F_DADFAILED;
> > 
> > I think you still have to set IFA_F_TENTATIVE here, too, otherwise
> > ipv6_dev_get_saddr() will use this address. 		
> 
> The tentative bit is still set from when this address was added back
> in ipv6_add_addr() from what I can tell, re-setting it here is actually
> unnecessary.  At least /sbin/ip was still showing it set during my
> testing.

There is the possibility of a race when the dad_timer expires at the
same time the NA triggering DAD failure is received. There isn't a big
chance to see that during real world testing, though.


^ permalink raw reply

* Re: net_sched 00/07: classful multiqueue dummy scheduler
From: Patrick McHardy @ 2009-09-08 15:53 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20090908.023140.210471397.davem@davemloft.net>

David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Mon, 07 Sep 2009 16:23:27 +0200
>
>   
>> This is a patch I used for testing, but I'll come up with
>> something more elegant (I hope) as a final fix :)
>>     
>
> Thanks for figuring this out Patrick.
>
> Let me know when you have a final patch
>   

Will do. I'm having some trouble with my test system, so might take until
tommorrow.

^ permalink raw reply

* [PATCH] genetlink: fix netns vs. netlink table locking
From: Johannes Berg @ 2009-09-08 15:59 UTC (permalink / raw)
  To: netdev; +Cc: linux-wireless

Since my commits introducing netns awareness into
genetlink we can get this problem:

BUG: scheduling while atomic: modprobe/1178/0x00000002
2 locks held by modprobe/1178:
 #0:  (genl_mutex){+.+.+.}, at: [<ffffffff8135ee1a>] genl_register_mc_grou
 #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff8135eeb5>] genl_register_mc_g
Pid: 1178, comm: modprobe Not tainted 2.6.31-rc8-wl-34789-g95cb731-dirty #
Call Trace:
 [<ffffffff8103e285>] __schedule_bug+0x85/0x90
 [<ffffffff81403138>] schedule+0x108/0x588
 [<ffffffff8135b131>] netlink_table_grab+0xa1/0xf0
 [<ffffffff8135c3a7>] netlink_change_ngroups+0x47/0x100
 [<ffffffff8135ef0f>] genl_register_mc_group+0x12f/0x290

because I overlooked that netlink_table_grab() will
schedule, thinking it was just the rwlock. However,
in the contention case, that isn't actually true.

Fix this by letting the code grab the netlink table
lock first and then the RCU for netns protection.

Signed-off-by: Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
---
 include/linux/netlink.h  |    4 +++
 net/netlink/af_netlink.c |   51 ++++++++++++++++++++++++++---------------------
 net/netlink/genetlink.c  |    5 +++-
 3 files changed, 37 insertions(+), 23 deletions(-)

--- wireless-testing.orig/include/linux/netlink.h	2009-09-08 17:49:54.000000000 +0200
+++ wireless-testing/include/linux/netlink.h	2009-09-08 17:50:29.000000000 +0200
@@ -176,12 +176,16 @@ struct netlink_skb_parms
 #define NETLINK_CREDS(skb)	(&NETLINK_CB((skb)).creds)
 
 
+extern void netlink_table_grab(void);
+extern void netlink_table_ungrab(void);
+
 extern struct sock *netlink_kernel_create(struct net *net,
 					  int unit,unsigned int groups,
 					  void (*input)(struct sk_buff *skb),
 					  struct mutex *cb_mutex,
 					  struct module *module);
 extern void netlink_kernel_release(struct sock *sk);
+extern int __netlink_change_ngroups(struct sock *sk, unsigned int groups);
 extern int netlink_change_ngroups(struct sock *sk, unsigned int groups);
 extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
--- wireless-testing.orig/net/netlink/af_netlink.c	2009-09-08 17:37:57.000000000 +0200
+++ wireless-testing/net/netlink/af_netlink.c	2009-09-08 17:52:45.000000000 +0200
@@ -177,9 +177,11 @@ static void netlink_sock_destruct(struct
  * this, _but_ remember, it adds useless work on UP machines.
  */
 
-static void netlink_table_grab(void)
+void netlink_table_grab(void)
 	__acquires(nl_table_lock)
 {
+	might_sleep();
+
 	write_lock_irq(&nl_table_lock);
 
 	if (atomic_read(&nl_table_users)) {
@@ -200,7 +202,7 @@ static void netlink_table_grab(void)
 	}
 }
 
-static void netlink_table_ungrab(void)
+void netlink_table_ungrab(void)
 	__releases(nl_table_lock)
 {
 	write_unlock_irq(&nl_table_lock);
@@ -1549,37 +1551,21 @@ static void netlink_free_old_listeners(s
 	kfree(lrh->ptr);
 }
 
-/**
- * netlink_change_ngroups - change number of multicast groups
- *
- * This changes the number of multicast groups that are available
- * on a certain netlink family. Note that it is not possible to
- * change the number of groups to below 32. Also note that it does
- * not implicitly call netlink_clear_multicast_users() when the
- * number of groups is reduced.
- *
- * @sk: The kernel netlink socket, as returned by netlink_kernel_create().
- * @groups: The new number of groups.
- */
-int netlink_change_ngroups(struct sock *sk, unsigned int groups)
+int __netlink_change_ngroups(struct sock *sk, unsigned int groups)
 {
 	unsigned long *listeners, *old = NULL;
 	struct listeners_rcu_head *old_rcu_head;
 	struct netlink_table *tbl = &nl_table[sk->sk_protocol];
-	int err = 0;
 
 	if (groups < 32)
 		groups = 32;
 
-	netlink_table_grab();
 	if (NLGRPSZ(tbl->groups) < NLGRPSZ(groups)) {
 		listeners = kzalloc(NLGRPSZ(groups) +
 				    sizeof(struct listeners_rcu_head),
 				    GFP_ATOMIC);
-		if (!listeners) {
-			err = -ENOMEM;
-			goto out_ungrab;
-		}
+		if (!listeners)
+			return -ENOMEM;
 		old = tbl->listeners;
 		memcpy(listeners, old, NLGRPSZ(tbl->groups));
 		rcu_assign_pointer(tbl->listeners, listeners);
@@ -1597,8 +1583,29 @@ int netlink_change_ngroups(struct sock *
 	}
 	tbl->groups = groups;
 
- out_ungrab:
+	return 0;
+}
+
+/**
+ * netlink_change_ngroups - change number of multicast groups
+ *
+ * This changes the number of multicast groups that are available
+ * on a certain netlink family. Note that it is not possible to
+ * change the number of groups to below 32. Also note that it does
+ * not implicitly call netlink_clear_multicast_users() when the
+ * number of groups is reduced.
+ *
+ * @sk: The kernel netlink socket, as returned by netlink_kernel_create().
+ * @groups: The new number of groups.
+ */
+int netlink_change_ngroups(struct sock *sk, unsigned int groups)
+{
+	int err;
+
+	netlink_table_grab();
+	err = __netlink_change_ngroups(sk, groups);
 	netlink_table_ungrab();
+
 	return err;
 }
 EXPORT_SYMBOL(netlink_change_ngroups);
--- wireless-testing.orig/net/netlink/genetlink.c	2009-09-08 17:50:38.000000000 +0200
+++ wireless-testing/net/netlink/genetlink.c	2009-09-08 17:58:50.000000000 +0200
@@ -176,9 +176,10 @@ int genl_register_mc_group(struct genl_f
 	if (family->netnsok) {
 		struct net *net;
 
+		netlink_table_grab();
 		rcu_read_lock();
 		for_each_net_rcu(net) {
-			err = netlink_change_ngroups(net->genl_sock,
+			err = __netlink_change_ngroups(net->genl_sock,
 					mc_groups_longs * BITS_PER_LONG);
 			if (err) {
 				/*
@@ -188,10 +189,12 @@ int genl_register_mc_group(struct genl_f
 				 * increased on some sockets which is ok.
 				 */
 				rcu_read_unlock();
+				netlink_table_ungrab();
 				goto out;
 			}
 		}
 		rcu_read_unlock();
+		netlink_table_ungrab();
 	} else {
 		err = netlink_change_ngroups(init_net.genl_sock,
 					     mc_groups_longs * BITS_PER_LONG);


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Stop using tasklets for bottom halves
From: Stephen Hemminger @ 2009-09-08 16:11 UTC (permalink / raw)
  To: rostedt
  Cc: Luis R. Rodriguez, Ingo Molnar, Michael Buesch, John W. Linville,
	linux-wireless, linux-kernel, netdev, Matt Smith, Kevin Hayes,
	Bob Copeland, Jouni Malinen, Ivan Seskar, ic.felix
In-Reply-To: <1252376254.21261.2052.camel@gandalf.stny.rr.com>

On Mon, 07 Sep 2009 22:17:34 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 2009-09-07 at 17:14 -0700, Stephen Hemminger wrote:
> > On Mon, 7 Sep 2009 15:58:50 -0700
> > "Luis R. Rodriguez" <mcgrof@gmail.com> wrote:
> > 
> > > A while ago I had read about an effort to consider removing tasklets
> > > [1] or at least trying to not use them. I'm unaware of the progress in
> > > this respect but since reading that article have always tried to
> > > evaluate whether or not we need tasklets on wireless drivers. I have
> > > also wondered whether work in irq context in other parts of the kernel
> > > can be moved to process context, a curious example being timers. I'll
> > > personally be trying to using only process context on bottom halves on
> > > future drivers but I figured it may be a good time to ask how serious
> > > was avoiding tasklets or using wrappers in the future to avoid irq
> > > context is or is it advised. Do we have a general agreement this is a
> > > good step forward to take? Has anyone made tests or changes on a
> > > specific driver from irq context to process context and proven there
> > > are no significant advantages of using irq context where you would
> > > have expected it?
> > > 
> > > Wireless in particular should IMHO not require taskets for anything
> > > time sensitive that I can think about except perhaps changing channels
> > > quickly and to do that appropriately also process pending RX frames
> > > prior to a switch. It remains to be seen experimentally whether or not
> > > using a workqueue for RX processing would affect the time to switch
> > > channels negatively but I doubt it would be significant. I hope to
> > > test that with ath9k_htc.
> > > 
> > > What about gigabit or 10 Gigabit Ethernet drivers ? Do they face any
> > > challenges which would yet need to be proven would not face issues
> > > when processing bottom halves in process context?
> > > 
> > > [1] http://lwn.net/Articles/239633/
> > > 
> > >   Luis
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > Why not use NAPI, which is soft irq? Almost all 1G and 10G drivers
> > use NAPI.
> > 
> > Process context is too slow.
> 
> Well, I'm hoping to prove the opposite. I'm working on some stuff that I
> plan to present at Linux Plumbers. I've been too distracted by other
> things, but hopefully I'll have some good numbers to present by then.
> 


That's great, does it keep the good properties of NAPI (irq disabling
and throttling?)



-- 

^ permalink raw reply

* Re: Stop using tasklets for bottom halves
From: Stephen Hemminger @ 2009-09-08 16:12 UTC (permalink / raw)
  To: rostedt
  Cc: Luis R. Rodriguez, Ingo Molnar, Michael Buesch, John W. Linville,
	linux-wireless, linux-kernel, netdev, Matt Smith, Kevin Hayes,
	Bob Copeland, Jouni Malinen, Ivan Seskar, ic.felix
In-Reply-To: <1252376254.21261.2052.camel@gandalf.stny.rr.com>

On Mon, 07 Sep 2009 22:17:34 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 2009-09-07 at 17:14 -0700, Stephen Hemminger wrote:
> > On Mon, 7 Sep 2009 15:58:50 -0700
> > "Luis R. Rodriguez" <mcgrof@gmail.com> wrote:
> > 
> > > A while ago I had read about an effort to consider removing tasklets
> > > [1] or at least trying to not use them. I'm unaware of the progress in
> > > this respect but since reading that article have always tried to
> > > evaluate whether or not we need tasklets on wireless drivers. I have
> > > also wondered whether work in irq context in other parts of the kernel
> > > can be moved to process context, a curious example being timers. I'll
> > > personally be trying to using only process context on bottom halves on
> > > future drivers but I figured it may be a good time to ask how serious
> > > was avoiding tasklets or using wrappers in the future to avoid irq
> > > context is or is it advised. Do we have a general agreement this is a
> > > good step forward to take? Has anyone made tests or changes on a
> > > specific driver from irq context to process context and proven there
> > > are no significant advantages of using irq context where you would
> > > have expected it?
> > > 
> > > Wireless in particular should IMHO not require taskets for anything
> > > time sensitive that I can think about except perhaps changing channels
> > > quickly and to do that appropriately also process pending RX frames
> > > prior to a switch. It remains to be seen experimentally whether or not
> > > using a workqueue for RX processing would affect the time to switch
> > > channels negatively but I doubt it would be significant. I hope to
> > > test that with ath9k_htc.
> > > 
> > > What about gigabit or 10 Gigabit Ethernet drivers ? Do they face any
> > > challenges which would yet need to be proven would not face issues
> > > when processing bottom halves in process context?
> > > 
> > > [1] http://lwn.net/Articles/239633/
> > > 
> > >   Luis
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > Why not use NAPI, which is soft irq? Almost all 1G and 10G drivers
> > use NAPI.
> > 
> > Process context is too slow.
> 
> Well, I'm hoping to prove the opposite. I'm working on some stuff that I
> plan to present at Linux Plumbers. I've been too distracted by other
> things, but hopefully I'll have some good numbers to present by then.
> 
> -- Steve
> 
> 

A good performance test is changing the behaviour of loopback
device and running lmbench.  This checks overhead without the specter
of real hardware.

-- 

^ permalink raw reply

* Re: Stop using tasklets for bottom halves
From: Steven Rostedt @ 2009-09-08 16:40 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Luis R. Rodriguez, Ingo Molnar, Michael Buesch, John W. Linville,
	linux-wireless, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Matt Smith, Kevin Hayes,
	Bob Copeland, Jouni Malinen, Ivan Seskar,
	ic.felix-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20090908091143.1e613963@nehalam>

On Tue, 2009-09-08 at 09:11 -0700, Stephen Hemminger wrote:

> > > Process context is too slow.
> > 
> > Well, I'm hoping to prove the opposite. I'm working on some stuff that I
> > plan to present at Linux Plumbers. I've been too distracted by other
> > things, but hopefully I'll have some good numbers to present by then.
> > 
> 
> 
> That's great, does it keep the good properties of NAPI (irq disabling
> and throttling?)

Not exactly sure what you mean by throttling, but I'm assuming it will.

As for irqs disabling, I'm trying to avoid doing that. Note, the device
will have its interrupts disabled, but not all other devices will.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Stop using tasklets for bottom halves
From: Stephen Hemminger @ 2009-09-08 17:01 UTC (permalink / raw)
  To: rostedt
  Cc: Luis R. Rodriguez, Ingo Molnar, Michael Buesch, John W. Linville,
	linux-wireless, linux-kernel, netdev, Matt Smith, Kevin Hayes,
	Bob Copeland, Jouni Malinen, Ivan Seskar, ic.felix
In-Reply-To: <1252428023.15626.30.camel@gandalf.stny.rr.com>

On Tue, 08 Sep 2009 12:40:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 2009-09-08 at 09:11 -0700, Stephen Hemminger wrote:
> 
> > > > Process context is too slow.
> > > 
> > > Well, I'm hoping to prove the opposite. I'm working on some stuff that I
> > > plan to present at Linux Plumbers. I've been too distracted by other
> > > things, but hopefully I'll have some good numbers to present by then.
> > > 
> > 
> > 
> > That's great, does it keep the good properties of NAPI (irq disabling
> > and throttling?)
> 
> Not exactly sure what you mean by throttling, but I'm assuming it will.
> 
> As for irqs disabling, I'm trying to avoid doing that. Note, the device
> will have its interrupts disabled, but not all other devices will.
> 
> -- Steve
> 
> 

The way NAPI works is that in irq routine, the device disables interrupts
then schedules processing packets, when processing is done irq's are re-enabled.
This means that if machine is being flooded, irq's stay off, and the packets
get discarded (because device hardware ring is full), rather than in software
(because software receive queue is full).


-- 

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Ira W. Snyder @ 2009-09-08 17:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, virtualization, kvm, linux-kernel, mingo, linux-mm, akpm,
	hpa, gregory.haskins, Rusty Russell, s.hetze
In-Reply-To: <20090907101537.GH3031@redhat.com>

On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
> > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> > > What it is: vhost net is a character device that can be used to reduce
> > > the number of system calls involved in virtio networking.
> > > Existing virtio net code is used in the guest without modification.
> > > 
> > > There's similarity with vringfd, with some differences and reduced scope
> > > - uses eventfd for signalling
> > > - structures can be moved around in memory at any time (good for migration)
> > > - support memory table and not just an offset (needed for kvm)
> > > 
> > > common virtio related code has been put in a separate file vhost.c and
> > > can be made into a separate module if/when more backends appear.  I used
> > > Rusty's lguest.c as the source for developing this part : this supplied
> > > me with witty comments I wouldn't be able to write myself.
> > > 
> > > What it is not: vhost net is not a bus, and not a generic new system
> > > call. No assumptions are made on how guest performs hypercalls.
> > > Userspace hypervisors are supported as well as kvm.
> > > 
> > > How it works: Basically, we connect virtio frontend (configured by
> > > userspace) to a backend. The backend could be a network device, or a
> > > tun-like device. In this version I only support raw socket as a backend,
> > > which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> > > also configured by userspace, including vlan/mac etc.
> > > 
> > > Status:
> > > This works for me, and I haven't see any crashes.
> > > I have done some light benchmarking (with v4), compared to userspace, I
> > > see improved latency (as I save up to 4 system calls per packet) but not
> > > bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> > > ping benchmark (where there's no TSO) troughput is also improved.
> > > 
> > > Features that I plan to look at in the future:
> > > - tap support
> > > - TSO
> > > - interrupt mitigation
> > > - zero copy
> > > 
> > 
> > Hello Michael,
> > 
> > I've started looking at vhost with the intention of using it over PCI to
> > connect physical machines together.
> > 
> > The part that I am struggling with the most is figuring out which parts
> > of the rings are in the host's memory, and which parts are in the
> > guest's memory.
> 
> All rings are in guest's memory, to match existing virtio code.

Ok, this makes sense.

> vhost
> assumes that the memory space of the hypervisor userspace process covers
> the whole of guest memory.

Is this necessary? Why? The assumption seems very wrong when you're
doing data transport between two physical systems via PCI.

I know vhost has not been designed for this specific situation, but it
is good to be looking toward other possible uses.

> And there's a translation table.
> Ring addresses are userspace addresses, they do not undergo translation.
> 
> > If I understand everything correctly, the rings are all userspace
> > addresses, which means that they can be moved around in physical memory,
> > and get pushed out to swap.
> 
> Unless they are locked, yes.
> 
> > AFAIK, this is impossible to handle when
> > connecting two physical systems, you'd need the rings available in IO
> > memory (PCI memory), so you can ioreadXX() them instead. To the best of
> > my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
> > Also, having them migrate around in memory would be a bad thing.
> > 
> > Also, I'm having trouble figuring out how the packet contents are
> > actually copied from one system to the other. Could you point this out
> > for me?
> 
> The code in net/packet/af_packet.c does it when vhost calls sendmsg.
> 

Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
to make this use a DMA engine instead? I know this was suggested in an
earlier thread.

> > Is there somewhere I can find the userspace code (kvm, qemu, lguest,
> > etc.) code needed for interacting with the vhost misc device so I can
> > get a better idea of how userspace is supposed to work?
> 
> Look in archives for kvm@vger.kernel.org. the subject is qemu-kvm: vhost net.
> 
> > (Features
> > negotiation, etc.)
> > 
> 
> That's not yet implemented as there are no features yet.  I'm working on
> tap support, which will add a feature bit.  Overall, qemu does an ioctl
> to query supported features, and then acks them with another ioctl.  I'm
> also trying to avoid duplicating functionality available elsewhere.  So
> that to check e.g. TSO support, you'd just look at the underlying
> hardware device you are binding to.
> 

Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
the future? I found that this made an enormous improvement in throughput
on my virtio-net <-> virtio-net system. Perhaps it isn't needed with
vhost-net.

Thanks for replying,
Ira

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Stop using tasklets for bottom halves
From: Steven Rostedt @ 2009-09-08 17:27 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Luis R. Rodriguez, Ingo Molnar, Michael Buesch, John W. Linville,
	linux-wireless, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Matt Smith, Kevin Hayes,
	Bob Copeland, Jouni Malinen, Ivan Seskar,
	ic.felix-Re5JQEeQqe8AvxtiuMwx3w, Thomas Gleixner
In-Reply-To: <20090908100144.6e06872b@nehalam>

[ added Thomas Gleixner to Cc]

On Tue, 2009-09-08 at 10:01 -0700, Stephen Hemminger wrote:
> On Tue, 08 Sep 2009 12:40:23 -0400
> Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
> 
> > On Tue, 2009-09-08 at 09:11 -0700, Stephen Hemminger wrote:
> > 
> > > > > Process context is too slow.
> > > > 
> > > > Well, I'm hoping to prove the opposite. I'm working on some stuff that I
> > > > plan to present at Linux Plumbers. I've been too distracted by other
> > > > things, but hopefully I'll have some good numbers to present by then.
> > > > 
> > > 
> > > 
> > > That's great, does it keep the good properties of NAPI (irq disabling
> > > and throttling?)
> > 
> > Not exactly sure what you mean by throttling, but I'm assuming it will.
> > 
> > As for irqs disabling, I'm trying to avoid doing that. Note, the device
> > will have its interrupts disabled, but not all other devices will.
> > 
> > -- Steve
> > 
> > 
> 
> The way NAPI works is that in irq routine, the device disables interrupts
> then schedules processing packets, when processing is done irq's are re-enabled.
> This means that if machine is being flooded, irq's stay off, and the packets
> get discarded (because device hardware ring is full), rather than in software
> (because software receive queue is full).

That sounds exactly like what threaded IRQs will do. When an interrupt
comes in, the device driver will disable the device interrupts, and then
the device irq thread handler is awoken.

The device irq handler will handle all the packets. If new packets come
in, and the hardware ring buffer is full, those packets will be dropped.
When the irq handler thread is done processing all pending packets, it
will re-enable the device's interrupts and go to sleep.

Yeah, looking at the NAPI code, it does seem to follow what threaded
interrupts do.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/1] netpoll: fix race between skb_queue_len and skb_dequeue
From: Matt Mackall @ 2009-09-08 17:27 UTC (permalink / raw)
  To: DDD; +Cc: David Miller, netdev
In-Reply-To: <1252396189.16528.19.camel@dengdd-desktop>

On Tue, 2009-09-08 at 15:49 +0800, DDD wrote:
> This race will break the messages order.
> 
> Sequence of events in the problem case:
> 
> Assume there is one skb "A" in skb_queue and the next action of
> netpoll_send_skb() is: sending out "B" skb.
> The right order of messages should be: send A first, then send B.
> 
> But as following orders, it will send B first, then send A.

I would say no, the order of messages A and B queued on different CPUs
is undefined. The only issue is that we can queue a message A on CPU0,
then causally trigger a message on CPU1 B that arrives first. But bear
in mind that the message A >>may never arrive<< because the message is
about a lockup that kills processing of delayed work.

Generally speaking, queueing should be a last ditch effort. We should
instead aim to deliver all messages immediately, even if they might be
out of order. Because out of order is better than not arriving at all.

-- 
http://selenic.com : development and support for Mercurial and Linux

^ permalink raw reply

* [PATCH 0/5] Adds implementation of TFRC-SP on DCCP test tree
From: Ivo Calado @ 2009-09-08 18:28 UTC (permalink / raw)
  To: dccp; +Cc: netdev

Due to the problems in the previuos patch pointed by Gerrit Renker and all, I'm
resending the patches.

These patches adds implementation of TFRC-SP at the receiver side, and
are targeted at the DCCP branch

Patch #1: First Patch on TFRC-SP. Copy base files from TFRC
Patch #2: Implement loss counting on TFRC-SP receiver
Patch #3: Implement TFRC-SP calc of mean length of loss intervals
accordingly to section 3 of RFC 4828
Patch #4: Adds options DROPPED PACKETS and LOSS INTERVALS to receiver
Patch #5: Updating documentation accordingly

Following this patches, we'll be sending the sender side of TFRC-SP.
Once this code is integrated on the branch, we can proceed adding the
CCID4 code that uses this new TFRC-SP.

--
Ivo Augusto Andrade Rocha Calado
MSc. Candidate
Embedded Systems and Pervasive Computing Lab -
http://embedded.ufcg.edu.br
Systems and Computing Department - http://www.dsc.ufcg.edu.br
Electrical Engineering and Informatics Center -
http://www.ceei.ufcg.edu.br
Federal University of Campina Grande - http://www.ufcg.edu.br

PGP: 0x03422935
Quidquid latine dictum sit, altum viditur.

^ permalink raw reply

* [PATCH 1/5] First Patch on TFRC-SP. Copy base files from TFRC
From: Ivo Calado @ 2009-09-08 18:28 UTC (permalink / raw)
  To: dccp; +Cc: netdev

First Patch on TFRC-SP. Does a copy from TFRC and adjust symbols name
with infix "_sp".
Also updates Kconfig and init/exit code. An #ifndef was added to headers
that
have commom symbols with TFRC that were not changed, so they don't get
included twice.

Following the rule #8 in Documentation/SubmittingPatches the patch is
stored at
http://embedded.ufcg.edu.br/~ivocalado/dccp/patches_tfrc_sp/tfrc_sp_receiver_01.patch

^ permalink raw reply

* [PATCH 3/5] Implement TFRC-SP calc of mean length of loss intervals, accordingly to section 3 of RFC 4828
From: Ivo Calado @ 2009-09-08 18:28 UTC (permalink / raw)
  To: dccp; +Cc: netdev

Implement TFRC-SP calc of mean length of loss intervals accordingly to section 3 of RFC 4828

Changes:
 - Modify tfrc_sp_lh_calc_i_mean header, now receiving the current ccval, so it can determine
   if a loss interval is too recent
 - Consider number of losses in each loss interval
 - Only consider open loss interval if it is at least 2 rtt old
 - Changes function signatures as necessary

Signed-off-by: Ivo Calado, Erivaldo Xavier, Leandro Sales <ivocalado@embedded.ufcg.edu.br>, <desadoc@gmail.com>, <leandroal@gmail.com>

Index: dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-08 10:37:16.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-08 10:42:30.000000000 -0300
@@ -67,10 +67,11 @@
 		}
 }
 
-static void tfrc_sp_lh_calc_i_mean(struct tfrc_loss_hist *lh)
+static void tfrc_sp_lh_calc_i_mean(struct tfrc_loss_hist *lh, __u8 curr_ccval)
 {
 	u32 i_i, i_tot0 = 0, i_tot1 = 0, w_tot = 0;
 	int i, k = tfrc_lh_length(lh) - 1; /* k is as in rfc3448bis, 5.4 */
+	u32 losses;
 
 	if (k <= 0)
 		return;
@@ -78,6 +79,15 @@
 	for (i = 0; i <= k; i++) {
 		i_i = tfrc_lh_get_interval(lh, i);
 
+		if (SUB16(curr_ccval,
+		    tfrc_lh_get_loss_interval(lh, i)->li_ccval) <= 8) {
+
+			losses = tfrc_lh_get_loss_interval(lh, i)->li_losses;
+
+			if (losses > 0)
+				i_i = div64_u64(i_i, losses);
+		}
+
 		if (i < k) {
 			i_tot0 += i_i * tfrc_lh_weights[i];
 			w_tot  += tfrc_lh_weights[i];
@@ -87,6 +97,11 @@
 	}
 
 	lh->i_mean = max(i_tot0, i_tot1) / w_tot;
+	BUG_ON(w_tot == 0);
+	if (SUB16(curr_ccval, tfrc_lh_get_loss_interval(lh, 0)->li_ccval) > 8)
+		lh->i_mean = max(i_tot0, i_tot1) / w_tot;
+	else
+		lh->i_mean = i_tot1 / w_tot;
 }
 
 /*
@@ -127,7 +142,7 @@
 		return;
 
 	cur->li_length = len;
-	tfrc_sp_lh_calc_i_mean(lh);
+	tfrc_sp_lh_calc_i_mean(lh, dccp_hdr(skb)->dccph_ccval);
 }
 
 /* RFC 4342, 10.2: test for the existence of packet with sequence number S */
@@ -148,7 +163,8 @@
 bool tfrc_sp_lh_interval_add(struct tfrc_loss_hist *lh,
 			     struct tfrc_rx_hist *rh,
 			     u32 (*calc_first_li)(struct sock *),
-			     struct sock *sk)
+			     struct sock *sk,
+			     __u8 ccval)
 {
 	struct tfrc_loss_interval *cur = tfrc_lh_peek(lh);
 	struct tfrc_rx_hist_entry *cong_evt;
@@ -217,7 +233,7 @@
 		if (lh->counter > (2*LIH_SIZE))
 			lh->counter -= LIH_SIZE;
 
-		tfrc_sp_lh_calc_i_mean(lh);
+		tfrc_sp_lh_calc_i_mean(lh, ccval);
 	}
 
 	return true;
Index: dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-08 10:37:16.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-08 10:42:30.000000000 -0300
@@ -73,7 +73,8 @@
 extern bool tfrc_sp_lh_interval_add(struct tfrc_loss_hist *,
 				    struct tfrc_rx_hist *,
 				    u32 (*first_li)(struct sock *),
-				    struct sock *);
+				    struct sock *,
+				    __u8 ccval);
 extern void tfrc_sp_lh_update_i_mean(struct tfrc_loss_hist *lh,
 				     struct sk_buff *);
 extern void tfrc_sp_lh_cleanup(struct tfrc_loss_hist *lh);
Index: dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/packet_history_sp.c	2009-09-08 10:37:16.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.c	2009-09-08 10:42:30.000000000 -0300
@@ -369,7 +369,8 @@
 		/*
 		* Update Loss Interval database and recycle RX records
 		*/
-		new_event = tfrc_sp_lh_interval_add(lh, h, first_li, sk);
+		new_event = tfrc_sp_lh_interval_add(lh, h, first_li, sk,
+						dccp_hdr(skb)->dccph_ccval);
 		__three_after_loss(h);
 
 	} else if (dccp_data_packet(skb) && dccp_skb_is_ecn_ce(skb)) {
@@ -378,7 +379,8 @@
 		* the RFC considers ECN marks - a future implementation may
 		* find it useful to also check ECN marks on non-data packets.
 		*/
-		new_event = tfrc_sp_lh_interval_add(lh, h, first_li, sk);
+		new_event = tfrc_sp_lh_interval_add(lh, h, first_li, sk,
+						dccp_hdr(skb)->dccph_ccval);
 		/*
 		* Also combinations of loss and ECN-marks (as per the warning)
 		* are not supported. The permutations of loss combined with or



^ permalink raw reply

* [PATCH 2/5] Implement loss counting on TFRC-SP receiver
From: Ivo Calado @ 2009-09-08 18:28 UTC (permalink / raw)
  To: dccp; +Cc: netdev

Implement loss counting on TFRC-SP receiver. Consider transmission's hole size as loss count.

Changes:
 - Adds field li_losses to tfrc_loss_interval to track loss count per interval
 - Adds field num_losses to tfrc_rx_hist, used to store loss count per loss event
 - Adds dccp_loss_count function to net/dccp/dccp.h, responsible for loss count using sequence numbers

Signed-off-by: Ivo Calado, Erivaldo Xavier, Leandro Sales <ivocalado@embedded.ufcg.edu.br>, <desadoc@gmail.com>, <leandroal@gmail.com>

Index: dccp_tree_work4/net/dccp/ccids/lib/loss_interval_sp.c
===================================================================
--- dccp_tree_work4.orig/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-03 22:58:17.000000000 -0300
+++ dccp_tree_work4/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-03 23:00:24.000000000 -0300
@@ -187,6 +187,7 @@
 		s64 len = dccp_delta_seqno(cur->li_seqno, cong_evt_seqno);
 		if ((len <= 0) ||
 		    (!tfrc_lh_closed_check(cur, cong_evt->tfrchrx_ccval))) {
+			cur->li_losses += rh->num_losses;
 			return false;
 		}
 
@@ -204,6 +205,7 @@
 	cur->li_seqno	  = cong_evt_seqno;
 	cur->li_ccval	  = cong_evt->tfrchrx_ccval;
 	cur->li_is_closed = false;
+	cur->li_losses	  = rh->num_losses;
 
 	if (++lh->counter == 1)
 		lh->i_mean = cur->li_length = (*calc_first_li)(sk);
Index: dccp_tree_work4/net/dccp/ccids/lib/loss_interval_sp.h
===================================================================
--- dccp_tree_work4.orig/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-03 22:58:17.000000000 -0300
+++ dccp_tree_work4/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-03 23:00:24.000000000 -0300
@@ -30,12 +30,14 @@
  *  @li_ccval:		The CCVal belonging to @li_seqno
  *  @li_is_closed:	Whether @li_seqno is older than 1 RTT
  *  @li_length:		Loss interval sequence length
+ *  @li_losses: 	Number of losses counted on this interval
  */
 struct tfrc_loss_interval {
 	u64		 li_seqno:48,
 			 li_ccval:4,
 			 li_is_closed:1;
 	u32		 li_length;
+	u32		 li_losses;
 };
 
 /*
Index: dccp_tree_work4/net/dccp/ccids/lib/packet_history_sp.c
===================================================================
--- dccp_tree_work4.orig/net/dccp/ccids/lib/packet_history_sp.c	2009-09-03 22:58:17.000000000 -0300
+++ dccp_tree_work4/net/dccp/ccids/lib/packet_history_sp.c	2009-09-03 23:00:24.000000000 -0300
@@ -244,6 +244,7 @@
 		h->loss_count = 3;
 		tfrc_sp_rx_hist_entry_from_skb(tfrc_rx_hist_entry(h, 3),
 					       skb, n3);
+		h->num_losses = dccp_loss_count(s2, s3, n3);
 		return 1;
 	}
 
@@ -257,6 +258,7 @@
 		tfrc_sp_rx_hist_entry_from_skb(tfrc_rx_hist_entry(h, 2),
 					       skb, n3);
 		h->loss_count = 3;
+		h->num_losses = dccp_loss_count(s1, s3, n3);
 		return 1;
 	}
 
@@ -293,6 +295,7 @@
 	h->loss_start = tfrc_rx_hist_index(h, 3);
 	tfrc_sp_rx_hist_entry_from_skb(tfrc_rx_hist_entry(h, 1), skb, n3);
 	h->loss_count = 3;
+	h->num_losses = dccp_loss_count(s0, s3, n3);
 
 	return 1;
 }
Index: dccp_tree_work4/net/dccp/ccids/lib/packet_history_sp.h
===================================================================
--- dccp_tree_work4.orig/net/dccp/ccids/lib/packet_history_sp.h	2009-09-03 22:58:17.000000000 -0300
+++ dccp_tree_work4/net/dccp/ccids/lib/packet_history_sp.h	2009-09-03 22:58:29.000000000 -0300
@@ -113,6 +113,7 @@
 	u32			  packet_size,
 				  bytes_recvd;
 	ktime_t			  bytes_start;
+	u8			  num_losses;
 };
 
 /*
Index: dccp_tree_work4/net/dccp/dccp.h
===================================================================
--- dccp_tree_work4.orig/net/dccp/dccp.h	2009-09-03 22:58:17.000000000 -0300
+++ dccp_tree_work4/net/dccp/dccp.h	2009-09-03 22:58:29.000000000 -0300
@@ -168,6 +168,21 @@
 	return (u64)delta <= ndp + 1;
 }
 
+static inline u64 dccp_loss_count(const u64 s1, const u64 s2, const u64 ndp)
+{
+	s64 delta, count;
+
+	delta = dccp_delta_seqno(s1, s2);
+	WARN_ON(delta < 0);
+
+	count = ndp + 1;
+	count -= delta;
+
+	count = (count > 0) ? count : 0;
+
+	return (u64) count;
+}
+
 enum {
 	DCCP_MIB_NUM = 0,
 	DCCP_MIB_ACTIVEOPENS,			/* ActiveOpens */



^ permalink raw reply

* [PATCH 4/5] Adds options DROPPED PACKETS and LOSS INTERVALS to receiver
From: Ivo Calado @ 2009-09-08 18:28 UTC (permalink / raw)
  To: dccp; +Cc: netdev

Adds options DROPPED PACKETS and LOSS INTERVALS to receiver. In this patch is added the
mechanism of gathering information about loss intervals and storing it, for later
construction of these two options.

Changes:
 - Adds tfrc_loss_data and tfrc_loss_data_entry, structures that register loss intervals info
 - Adds dccp_skb_is_ecn_ect0 and dccp_skb_is_ecn_ect1 as necessary, so ecn can be verified and
   used in loss intervals option, that reports ecn nonce sum
 - Adds tfrc_sp_update_li_data that updates information about loss intervals
 - Adds tfrc_sp_ld_prepare_data, that fills fields on tfrc_loss_data with current options values
 - And adds a field of type struct tfrc_loss_data to struct tfrc_hc_rx_sock

Signed-off-by: Ivo Calado, Erivaldo Xavier, Leandro Sales <ivocalado@embedded.ufcg.edu.br>, <desadoc@gmail.com>, <leandroal@gmail.com>

Index: dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/packet_history_sp.c	2009-09-08 10:42:30.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.c	2009-09-08 10:42:37.000000000 -0300
@@ -233,7 +233,9 @@
 }
 
 /* return 1 if a new loss event has been identified */
-static int __two_after_loss(struct tfrc_rx_hist *h, struct sk_buff *skb, u32 n3)
+static int __two_after_loss(struct tfrc_rx_hist *h,
+			    struct sk_buff *skb, u32 n3,
+			    bool *new_loss)
 {
 	u64 s0 = tfrc_rx_hist_loss_prev(h)->tfrchrx_seqno,
 	    s1 = tfrc_rx_hist_entry(h, 1)->tfrchrx_seqno,
@@ -245,6 +247,7 @@
 		tfrc_sp_rx_hist_entry_from_skb(tfrc_rx_hist_entry(h, 3),
 					       skb, n3);
 		h->num_losses = dccp_loss_count(s2, s3, n3);
+		*new_loss = 1;
 		return 1;
 	}
 
@@ -259,6 +262,7 @@
 					       skb, n3);
 		h->loss_count = 3;
 		h->num_losses = dccp_loss_count(s1, s3, n3);
+		*new_loss = 1;
 		return 1;
 	}
 
@@ -284,6 +288,7 @@
 			tfrc_sp_rx_hist_entry_from_skb(
 					tfrc_rx_hist_loss_prev(h), skb, n3);
 
+		*new_loss = 0;
 		return 0;
 	}
 
@@ -297,6 +302,7 @@
 	h->loss_count = 3;
 	h->num_losses = dccp_loss_count(s0, s3, n3);
 
+	*new_loss = 1;
 	return 1;
 }
 
@@ -348,11 +354,14 @@
  *  operations when loss_count is greater than 0 after calling this function.
  */
 bool tfrc_sp_rx_congestion_event(struct tfrc_rx_hist *h,
-			      struct tfrc_loss_hist *lh,
-	 struct sk_buff *skb, const u64 ndp,
-  u32 (*first_li)(struct sock *), struct sock *sk)
+				 struct tfrc_loss_hist *lh,
+				 struct tfrc_loss_data *ld,
+				 struct sk_buff *skb, const u64 ndp,
+				 u32 (*first_li)(struct sock *),
+				 struct sock *sk)
 {
 	bool new_event = false;
+	bool new_loss = false;
 
 	if (tfrc_sp_rx_hist_duplicate(h, skb))
 		return 0;
@@ -365,12 +374,13 @@
 		__one_after_loss(h, skb, ndp);
 	} else if (h->loss_count != 2) {
 		DCCP_BUG("invalid loss_count %d", h->loss_count);
-	} else if (__two_after_loss(h, skb, ndp)) {
+	} else if (__two_after_loss(h, skb, ndp, &new_loss)) {
 		/*
 		* Update Loss Interval database and recycle RX records
 		*/
 		new_event = tfrc_sp_lh_interval_add(lh, h, first_li, sk,
 						dccp_hdr(skb)->dccph_ccval);
+		tfrc_sp_update_li_data(ld, h, skb, new_loss, new_event);
 		__three_after_loss(h);
 
 	} else if (dccp_data_packet(skb) && dccp_skb_is_ecn_ce(skb)) {
@@ -396,6 +406,8 @@
 		}
 	}
 
+	tfrc_sp_update_li_data(ld, h, skb, new_loss, new_event);
+
 	/*
 	* Update moving-average of `s' and the sum of received payload bytes.
 	*/
Index: dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-08 10:42:30.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-08 10:42:37.000000000 -0300
@@ -14,6 +14,7 @@
 #include "tfrc_sp.h"
 
 static struct kmem_cache  *tfrc_lh_slab  __read_mostly;
+static struct kmem_cache  *tfrc_ld_slab  __read_mostly;
 /* Loss Interval weights from [RFC 3448, 5.4], scaled by 10 */
 static const int tfrc_lh_weights[NINTERVAL] = { 10, 10, 10, 10, 8, 6, 4, 2 };
 
@@ -67,6 +68,224 @@
 		}
 }
 
+/*
+ * Allocation routine for new entries of loss interval data
+ */
+static struct tfrc_loss_data_entry *tfrc_ld_add_new(struct tfrc_loss_data *ld)
+{
+	struct tfrc_loss_data_entry *new =
+			kmem_cache_alloc(tfrc_ld_slab, GFP_ATOMIC);
+
+	if (new == NULL)
+		return NULL;
+
+	memset(new, 0, sizeof(struct tfrc_loss_data_entry));
+
+	new->next = ld->head;
+	ld->head = new;
+	ld->counter++;
+
+	return new;
+}
+
+void tfrc_sp_ld_cleanup(struct tfrc_loss_data *ld)
+{
+	struct tfrc_loss_data_entry *next, *h = ld->head;
+
+	if (!h)
+		return;
+
+	while (h) {
+		next = h->next;
+		kmem_cache_free(tfrc_ld_slab, h);
+		h = next;
+	}
+
+	ld->head = NULL;
+	ld->counter = 0;
+}
+
+void tfrc_sp_ld_prepare_data(u8 loss_count, struct tfrc_loss_data *ld)
+{
+	u8 *li_ofs, *d_ofs;
+	struct tfrc_loss_data_entry *e;
+	u16 count;
+
+	li_ofs = &ld->loss_intervals_opts[0];
+	d_ofs = &ld->drop_opts[0];
+
+	count = 0;
+	e = ld->head;
+
+	*li_ofs = loss_count + 1;
+	li_ofs++;
+
+	while (e != NULL) {
+
+		if (count < TFRC_LOSS_INTERVALS_OPT_MAX_LENGTH) {
+			*li_ofs = ((htonl(e->lossless_length) & 0x00FFFFFF)<<8);
+			li_ofs += 3;
+			*li_ofs = ((e->ecn_nonce_sum&0x1) << 31) &
+				  (htonl((e->loss_length & 0x00FFFFFF))<<8);
+			li_ofs += 3;
+			*li_ofs = ((htonl(e->data_length) & 0x00FFFFFF)<<8);
+			li_ofs += 3;
+		}
+
+		if (count < TFRC_DROP_OPT_MAX_LENGTH) {
+			*d_ofs = (htonl(e->drop_count) & 0x00FFFFFF)<<8;
+			d_ofs += 3;
+		}
+
+		if ((count >= TFRC_LOSS_INTERVALS_OPT_MAX_LENGTH) &&
+		    (count >= TFRC_DROP_OPT_MAX_LENGTH))
+			break;
+
+		count++;
+		e = e->next;
+	}
+}
+
+void tfrc_sp_update_li_data(struct tfrc_loss_data *ld,
+			    struct tfrc_rx_hist *rh,
+			    struct sk_buff *skb,
+			    bool new_loss, bool new_event)
+{
+	struct tfrc_loss_data_entry *new, *h;
+
+	if (!dccp_data_packet(skb))
+		return;
+
+	if (ld->head == NULL) {
+		new = tfrc_ld_add_new(ld);
+		if (unlikely(new == NULL)) {
+			DCCP_CRIT("Cannot allocate new loss data registry.");
+			return;
+		}
+
+		if (new_loss) {
+			new->drop_count = rh->num_losses;
+			new->lossless_length = 1;
+			new->loss_length = rh->num_losses;
+
+			if (dccp_data_packet(skb))
+				new->data_length = 1;
+
+			if (dccp_data_packet(skb) && dccp_skb_is_ecn_ect1(skb))
+				new->ecn_nonce_sum = 1;
+			else
+				new->ecn_nonce_sum = 0;
+		} else {
+			new->drop_count = 0;
+			new->lossless_length = 1;
+			new->loss_length = 0;
+
+			if (dccp_data_packet(skb))
+				new->data_length = 1;
+
+			if (dccp_data_packet(skb) && dccp_skb_is_ecn_ect1(skb))
+				new->ecn_nonce_sum = 1;
+			else
+				new->ecn_nonce_sum = 0;
+		}
+
+		return;
+	}
+
+	if (new_event) {
+		new = tfrc_ld_add_new(ld);
+		if (unlikely(new == NULL)) {
+			DCCP_CRIT("Cannot allocate new loss data registry. \
+					Cleaning up.");
+			tfrc_sp_ld_cleanup(ld);
+			return;
+		}
+
+		new->drop_count = rh->num_losses;
+		new->lossless_length = (ld->last_loss_count - rh->loss_count);
+		new->loss_length = rh->num_losses;
+
+		new->ecn_nonce_sum = 0;
+		new->data_length = 0;
+
+		while (ld->last_loss_count > rh->loss_count) {
+			ld->last_loss_count--;
+
+			if (ld->sto_is_data & (1 << (ld->last_loss_count))) {
+				new->data_length++;
+
+				if (ld->sto_ecn & (1 << (ld->last_loss_count)))
+					new->ecn_nonce_sum =
+						!new->ecn_nonce_sum;
+			}
+		}
+
+		return;
+	}
+
+	h = ld->head;
+
+	if (rh->loss_count > ld->last_loss_count) {
+		ld->last_loss_count = rh->loss_count;
+
+		if (dccp_data_packet(skb))
+			ld->sto_is_data |= (1 << (ld->last_loss_count - 1));
+
+		if (dccp_skb_is_ecn_ect1(skb))
+			ld->sto_ecn |= (1 << (ld->last_loss_count - 1));
+
+		return;
+	}
+
+	if (new_loss) {
+		h->drop_count += rh->num_losses;
+		h->lossless_length = (ld->last_loss_count - rh->loss_count);
+		h->loss_length += h->lossless_length + rh->num_losses;
+
+		h->ecn_nonce_sum = 0;
+		h->data_length = 0;
+
+		while (ld->last_loss_count > rh->loss_count) {
+			ld->last_loss_count--;
+
+			if (ld->sto_is_data&(1 << (ld->last_loss_count))) {
+				h->data_length++;
+
+				if (ld->sto_ecn & (1 << (ld->last_loss_count)))
+					h->ecn_nonce_sum = !h->ecn_nonce_sum;
+			}
+		}
+
+		return;
+	}
+
+	if (ld->last_loss_count > rh->loss_count) {
+		while (ld->last_loss_count > rh->loss_count) {
+			ld->last_loss_count--;
+
+			h->lossless_length++;
+
+			if (ld->sto_is_data & (1 << (ld->last_loss_count))) {
+				h->data_length++;
+
+				if (ld->sto_ecn & (1 << (ld->last_loss_count)))
+					h->ecn_nonce_sum = !h->ecn_nonce_sum;
+			}
+		}
+
+		return;
+	}
+
+	h->lossless_length++;
+
+	if (dccp_data_packet(skb)) {
+		h->data_length++;
+
+		if (dccp_skb_is_ecn_ect1(skb))
+			h->ecn_nonce_sum = !h->ecn_nonce_sum;
+	}
+}
+
 static void tfrc_sp_lh_calc_i_mean(struct tfrc_loss_hist *lh, __u8 curr_ccval)
 {
 	u32 i_i, i_tot0 = 0, i_tot1 = 0, w_tot = 0;
@@ -244,8 +463,11 @@
 	tfrc_lh_slab = kmem_cache_create("tfrc_sp_li_hist",
 					 sizeof(struct tfrc_loss_interval), 0,
 					 SLAB_HWCACHE_ALIGN, NULL);
+	tfrc_ld_slab = kmem_cache_create("tfrc_sp_li_data",
+					 sizeof(struct tfrc_loss_data_entry), 0,
+					 SLAB_HWCACHE_ALIGN, NULL);
 
-	if ((tfrc_lh_slab != NULL))
+	if ((tfrc_lh_slab != NULL) || (tfrc_ld_slab != NULL))
 		return 0;
 
 	if (tfrc_lh_slab != NULL) {
@@ -253,6 +475,11 @@
 		tfrc_lh_slab = NULL;
 	}
 
+	if (tfrc_ld_slab != NULL) {
+		kmem_cache_destroy(tfrc_ld_slab);
+		tfrc_ld_slab = NULL;
+	}
+
 	return -ENOBUFS;
 }
 
@@ -262,4 +489,9 @@
 		kmem_cache_destroy(tfrc_lh_slab);
 		tfrc_lh_slab = NULL;
 	}
+
+	if (tfrc_ld_slab != NULL) {
+		kmem_cache_destroy(tfrc_ld_slab);
+		tfrc_ld_slab = NULL;
+	}
 }
Index: dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-08 10:42:30.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-08 10:42:37.000000000 -0300
@@ -70,13 +70,52 @@
 struct tfrc_rx_hist;
 #endif
 
+struct tfrc_loss_data_entry {
+	struct tfrc_loss_data_entry	*next;
+	u32				lossless_length:24;
+	u8				ecn_nonce_sum:1;
+	u32				loss_length:24;
+	u32				data_length:24;
+	u32				drop_count:24;
+};
+
+#define TFRC_LOSS_INTERVALS_OPT_MAX_LENGTH	28
+#define TFRC_DROP_OPT_MAX_LENGTH		84
+#define TFRC_LI_OPT_SZ	\
+	(2 + TFRC_LOSS_INTERVALS_OPT_MAX_LENGTH*9)
+#define TFRC_DROPPED_OPT_SZ \
+	(1 + TFRC_DROP_OPT_MAX_LENGTH*3)
+
+struct tfrc_loss_data {
+	struct tfrc_loss_data_entry	*head;
+	u16				counter;
+	u8				loss_intervals_opts[TFRC_LI_OPT_SZ];
+	u8				drop_opts[TFRC_DROPPED_OPT_SZ];
+	u8				last_loss_count;
+	u8				sto_ecn;
+	u8				sto_is_data;
+};
+
+static inline void tfrc_ld_init(struct tfrc_loss_data *ld)
+{
+	memset(ld, 0, sizeof(struct tfrc_loss_data));
+}
+
+struct tfrc_rx_hist;
+
 extern bool tfrc_sp_lh_interval_add(struct tfrc_loss_hist *,
 				    struct tfrc_rx_hist *,
 				    u32 (*first_li)(struct sock *),
 				    struct sock *,
 				    __u8 ccval);
+extern void tfrc_sp_update_li_data(struct tfrc_loss_data *,
+				   struct tfrc_rx_hist *,
+				   struct sk_buff *,
+				   bool new_loss, bool new_event);
 extern void tfrc_sp_lh_update_i_mean(struct tfrc_loss_hist *lh,
 				     struct sk_buff *);
 extern void tfrc_sp_lh_cleanup(struct tfrc_loss_hist *lh);
+extern void tfrc_sp_ld_cleanup(struct tfrc_loss_data *ld);
+extern void tfrc_sp_ld_prepare_data(u8 loss_count, struct tfrc_loss_data *ld);
 
 #endif /* _DCCP_LI_HIST_SP_ */
Index: dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/packet_history_sp.h	2009-09-08 10:42:30.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.h	2009-09-08 10:42:37.000000000 -0300
@@ -203,6 +203,7 @@
 
 extern bool tfrc_sp_rx_congestion_event(struct tfrc_rx_hist *h,
 				     struct tfrc_loss_hist *lh,
+				     struct tfrc_loss_data *ld,
 				     struct sk_buff *skb, const u64 ndp,
 				     u32 (*first_li)(struct sock *sk),
 				     struct sock *sk);
Index: dccp_tree_work5/net/dccp/ccids/lib/tfrc_ccids_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/tfrc_ccids_sp.h	2009-09-08 10:42:30.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/tfrc_ccids_sp.h	2009-09-08 10:42:37.000000000 -0300
@@ -129,6 +129,7 @@
  *  @tstamp_last_feedback  -  Time at which last feedback was sent
  *  @hist  -  Packet history (loss detection + RTT sampling)
  *  @li_hist  -  Loss Interval database
+ *  @li_data  -  Loss Interval data for options
  *  @p_inverse  -  Inverse of Loss Event Rate (RFC 4342, sec. 8.5)
  */
 struct tfrc_hc_rx_sock {
@@ -138,6 +139,7 @@
 	ktime_t				tstamp_last_feedback;
 	struct tfrc_rx_hist		hist;
 	struct tfrc_loss_hist		li_hist;
+	struct tfrc_loss_data		li_data;
 #define p_inverse			li_hist.i_mean
 };
 
Index: dccp_tree_work5/net/dccp/dccp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/dccp.h	2009-09-08 10:42:30.000000000 -0300
+++ dccp_tree_work5/net/dccp/dccp.h	2009-09-08 10:42:37.000000000 -0300
@@ -403,6 +403,16 @@
 	return (DCCP_SKB_CB(skb)->dccpd_ecn & INET_ECN_MASK) == INET_ECN_CE;
 }
 
+static inline bool dccp_skb_is_ecn_ect0(const struct sk_buff *skb)
+{
+	return (DCCP_SKB_CB(skb)->dccpd_ecn & INET_ECN_MASK) == INET_ECN_ECT_0;
+}
+
+static inline bool dccp_skb_is_ecn_ect1(const struct sk_buff *skb)
+{
+	return (DCCP_SKB_CB(skb)->dccpd_ecn & INET_ECN_MASK) == INET_ECN_ECT_0;
+}
+
 /* RFC 4340, sec. 7.7 */
 static inline int dccp_non_data_packet(const struct sk_buff *skb)
 {



^ permalink raw reply

* [PATCH 5/5] Updating documentation accordingly
From: Ivo Calado @ 2009-09-08 18:28 UTC (permalink / raw)
  To: dccp; +Cc: netdev

Updating documentation accordingly

Signed-off-by: Ivo Calado, Erivaldo Xavier, Leandro Sales <ivocalado@embedded.ufcg.edu.br>, <desadoc@gmail.com>, <leandroal@gmail.com>

Index: dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-08 10:42:37.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.c	2009-09-08 11:03:15.000000000 -0300
@@ -1,4 +1,6 @@
 /*
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
  *  Copyright (c) 2007   The University of Aberdeen, Scotland, UK
  *  Copyright (c) 2005-7 The University of Waikato, Hamilton, New Zealand.
  *  Copyright (c) 2005-7 Ian McDonald <ian.mcdonald@jandi.co.nz>
@@ -105,6 +107,13 @@
 	ld->counter = 0;
 }
 
+/*
+ *  tfrc_sp_ld_prepare_data  -  updates arrays on tfrc_loss_data
+ *  				so they can be sent as options
+ *  @loss_count:	current loss count (packets after hole on transmission),
+ *			used to determine skip length for loss intervals option
+ *  @ld:		loss intervals data being updated
+ */
 void tfrc_sp_ld_prepare_data(u8 loss_count, struct tfrc_loss_data *ld)
 {
 	u8 *li_ofs, *d_ofs;
@@ -146,6 +155,16 @@
 	}
 }
 
+/*
+ *  tfrc_sp_update_li_data  -  Update tfrc_loss_data upon
+ *			       packet receiving or loss detection
+ *  @ld:			tfrc_loss_data being updated
+ *  @rh:			loss event record
+ *  @skb:			received packet
+ *  @new_loss:			dictates if new loss was detected
+ *				upon receiving current packet
+ *  @new_event:			...and if the loss starts new loss interval
+ */
 void tfrc_sp_update_li_data(struct tfrc_loss_data *ld,
 			    struct tfrc_rx_hist *rh,
 			    struct sk_buff *skb,
@@ -324,7 +343,7 @@
 }
 
 /*
- * tfrc_lh_update_i_mean  -  Update the `open' loss interval I_0
+ * tfrc_sp_lh_update_i_mean  -  Update the `open' loss interval I_0
  * This updates I_mean as the sequence numbers increase. As a consequence, the
  * open loss interval I_0 increases, hence p = W_tot/max(I_tot0, I_tot1)
  * decreases, and thus there is no need to send renewed feedback.
@@ -372,7 +391,8 @@
 	return cur->li_is_closed;
 }
 
-/* tfrc_lh_interval_add  -  Insert new record into the Loss Interval database
+/*
+ * tfrc_sp_lh_interval_add - Insert new record into the Loss Interval database
  * @lh:		   Loss Interval database
  * @rh:		   Receive history containing a fresh loss event
  * @calc_first_li: Caller-dependent routine to compute length of first interval
Index: dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-08 10:42:37.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/loss_interval_sp.h	2009-09-08 10:55:15.000000000 -0300
@@ -1,6 +1,8 @@
 #ifndef _DCCP_LI_HIST_SP_
 #define _DCCP_LI_HIST_SP_
 /*
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
  *  Copyright (c) 2007   The University of Aberdeen, Scotland, UK
  *  Copyright (c) 2005-7 The University of Waikato, Hamilton, New Zealand.
  *  Copyright (c) 2005-7 Ian McDonald <ian.mcdonald@jandi.co.nz>
@@ -70,6 +72,15 @@
 struct tfrc_rx_hist;
 #endif
 
+/*
+ *  tfrc_loss_data_entry  -  Holds info about one loss interval
+ *  @next:		next entry on this linked list
+ *  @lossless_length:	length of lossless sequence
+ *  @ecn_nonce_sum:	ecn nonce sum for this interval
+ *  @loss_length:	length of lossy part
+ *  @data_length:	data length on lossless part
+ *  @drop_count:	count of dopped packets
+ */
 struct tfrc_loss_data_entry {
 	struct tfrc_loss_data_entry	*next;
 	u32				lossless_length:24;
@@ -79,13 +90,29 @@
 	u32				drop_count:24;
 };
 
+/* As defined at section 8.6.1. of RFC 4342 */
 #define TFRC_LOSS_INTERVALS_OPT_MAX_LENGTH	28
+/* Specified on section 8.7. of CCID4 draft */
 #define TFRC_DROP_OPT_MAX_LENGTH		84
 #define TFRC_LI_OPT_SZ	\
 	(2 + TFRC_LOSS_INTERVALS_OPT_MAX_LENGTH*9)
 #define TFRC_DROPPED_OPT_SZ \
 	(1 + TFRC_DROP_OPT_MAX_LENGTH*3)
 
+/*
+ *  tfrc_loss_data  -  loss interval data
+ *  used by loss intervals and dropped packets options
+ *  @head:			linked list containing loss interval data
+ *  @counter:			number of entries
+ *  @loss_intervals_opts:	space necessary for writing temporary option
+ *				data for loss intervals option
+ *  @drop_opts:			same for dropped packets option
+ *  @last_loss_count:		last loss count (num. of packets
+ *				after hole on transmission) observed
+ *  @sto_ecn:			ecn's observed while waiting for hole
+ *				to be filled or accepted as missing
+ *  @sto_is_data:		flags about if packets saw were data packets
+ */
 struct tfrc_loss_data {
 	struct tfrc_loss_data_entry	*head;
 	u16				counter;
Index: dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/packet_history_sp.c	2009-09-08 10:42:37.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.c	2009-09-08 10:57:07.000000000 -0300
@@ -4,6 +4,14 @@
  *
  *  An implementation of the DCCP protocol
  *
+ *  Copyright (c) 2009 Ivo Calado, Erivaldo Xavier, Leandro Sales
+ *
+ *  This code has been developed by the Federal University of Campina Grande
+ *  Embedded Systems and Pervasive Computing Lab.
+ *  For further information please see http://embedded.ufcg.edu.br/
+ *  <ivocalado@embedded.ufcg.edu.br>,
+ *  <desadoc@gmail.com>, <leandroal@gmail.com>
+ *
  *  This code has been developed by the University of Waikato WAND
  *  research group. For further information please see http://www.wand.net.nz/
  *  or e-mail Ian McDonald - ian.mcdonald@jandi.co.nz
@@ -339,7 +347,7 @@
 }
 
 /*
- *  tfrc_rx_congestion_event  -  Loss detection and further processing
+ *  tfrc_sp_rx_congestion_event  -  Loss detection and further processing
  *  @h:		The non-empty RX history object
  *  @lh:	Loss Intervals database to update
  *  @skb:	Currently received packet
@@ -495,7 +503,7 @@
 }
 
 /*
- * tfrc_rx_hist_sample_rtt  -  Sample RTT from timestamp / CCVal
+ * tfrc_sp_rx_hist_sample_rtt  -  Sample RTT from timestamp / CCVal
  * Based on ideas presented in RFC 4342, 8.1. This function expects that no loss
  * is pending and uses the following history entries (via rtt_sample_prev):
  * - h->ring[0]  contains the most recent history entry prior to @skb;
Index: dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/packet_history_sp.h	2009-09-08 10:42:37.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/packet_history_sp.h	2009-09-08 10:57:36.000000000 -0300
@@ -1,6 +1,14 @@
 /*
  *  Packet RX/TX history data structures and routines for TFRC-based protocols.
  *
+ *  Copyright (c) 2009 Ivo Calado, Erivaldo Xavier, Leandro Sales
+ *
+ *  This code has been developed by the Federal University of Campina Grande
+ *  Embedded Systems and Pervasive Computing Lab.
+ *  For further information please see http://embedded.ufcg.edu.br/
+ *  <ivocalado@embedded.ufcg.edu.br>,
+ *  <desadoc@gmail.com>, <leandroal@gmail.com>
+ *
  *  Copyright (c) 2007   The University of Aberdeen, Scotland, UK
  *  Copyright (c) 2005-6 The University of Waikato, Hamilton, New Zealand.
  *
Index: dccp_tree_work5/net/dccp/ccids/lib/tfrc_ccids_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/tfrc_ccids_sp.c	2009-09-08 10:26:38.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/tfrc_ccids_sp.c	2009-09-08 11:00:12.000000000 -0300
@@ -1,4 +1,6 @@
 /*
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
  *  Copyright (c) 2007 Leandro Melo de Sales <leandroal@gmail.com>
  *  Copyright (c) 2005 Ian McDonald <ian.mcdonald@jandi.co.nz>
  *  Copyright (c) 2005 Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Index: dccp_tree_work5/net/dccp/ccids/lib/tfrc_ccids_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/tfrc_ccids_sp.h	2009-09-08 10:42:37.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/tfrc_ccids_sp.h	2009-09-08 11:00:31.000000000 -0300
@@ -1,4 +1,6 @@
 /*
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
  *  Copyright (c) 2007 Leandro Melo de Sales <leandroal@gmail.com>
  *  Copyright (c) 2005 Ian McDonald <ian.mcdonald@jandi.co.nz>
  *  Copyright (c) 2005 Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Index: dccp_tree_work5/net/dccp/ccids/lib/tfrc_equation_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/tfrc_equation_sp.c	2009-09-08 10:26:38.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/tfrc_equation_sp.c	2009-09-08 11:01:45.000000000 -0300
@@ -1,4 +1,6 @@
 /*
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
  *  Copyright (c) 2005 The University of Waikato, Hamilton, New Zealand.
  *  Copyright (c) 2005 Ian McDonald <ian.mcdonald@jandi.co.nz>
  *  Copyright (c) 2005 Arnaldo Carvalho de Melo <acme@conectiva.com.br>
@@ -607,7 +609,7 @@
 }
 
 /*
- * tfrc_calc_x - Calculate the send rate as per section 3.1 of RFC3448
+ * tfrc_sp_calc_x - Calculate the send rate as per section 3.1 of RFC3448
  *
  *  @s: packet size          in bytes
  *  @R: RTT                  scaled by 1000000   (i.e., microseconds)
@@ -667,7 +669,7 @@
 }
 
 /*
- *  tfrc_calc_x_reverse_lookup  -  try to find p given f(p)
+ *  tfrc_sp_calc_x_reverse_lookup  -  try to find p given f(p)
  *
  *  @fvalue: function value to match, scaled by 1000000
  *  Returns closest match for p, also scaled by 1000000
@@ -700,7 +702,7 @@
 }
 
 /*
- * tfrc_invert_loss_event_rate  -  Compute p so that 10^6 corresponds to 100%
+ * tfrc_sp_invert_loss_event_rate  -  Compute p so that 10^6 corresponds to 100%
  * When @loss_event_rate is large, there is a chance that p is truncated to 0.
  * To avoid re-entering slow-start in that case, we set p = TFRC_SMALLEST_P > 0.
  */
Index: dccp_tree_work5/net/dccp/ccids/lib/tfrc_sp.c
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/tfrc_sp.c	2009-09-08 10:26:38.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/tfrc_sp.c	2009-09-08 11:02:15.000000000 -0300
@@ -1,6 +1,10 @@
 /*
  * TFRC library initialisation
  *
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
+ *	Almost copied from tfrc.c, only renamed symbols
+ *
  * Copyright (c) 2007 The University of Aberdeen, Scotland, UK
  * Copyright (c) 2007 Arnaldo Carvalho de Melo <acme@redhat.com>
  */
Index: dccp_tree_work5/net/dccp/ccids/lib/tfrc_sp.h
===================================================================
--- dccp_tree_work5.orig/net/dccp/ccids/lib/tfrc_sp.h	2009-09-08 10:26:38.000000000 -0300
+++ dccp_tree_work5/net/dccp/ccids/lib/tfrc_sp.h	2009-09-08 11:02:31.000000000 -0300
@@ -1,6 +1,8 @@
 #ifndef _TFRC_SP_H_
 #define _TFRC_SP_H_
 /*
+ *  Copyright (c) 2009 Federal University of Campina Grande,
+ *	Embedded Systems and Pervasive Computing Lab
  *  Copyright (c) 2007   The University of Aberdeen, Scotland, UK
  *  Copyright (c) 2005-6 The University of Waikato, Hamilton, New Zealand.
  *  Copyright (c) 2005-6 Ian McDonald <ian.mcdonald@jandi.co.nz>


^ permalink raw reply

* Re: [RFC] defer skb allocation in virtio_net -- mergable buff part
From: Shirley Ma @ 2009-09-08 18:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, linux-kernel
In-Reply-To: <20090825114143.GA13884@redhat.com>

Thanks Michael for you details review comments. I am just back from my
vacation. I am working on what you have raised here.

Shirley

^ permalink raw reply

* appletalk:  IPDDP_ENCAP and IPDDP_DECAP variables are confusing
From: Robert P. J. Day @ 2009-09-08 18:37 UTC (permalink / raw)
  To: netdev


  (i pointed out the first part of this to arnaldo but, after i looked
at it more closely, it's a bit messier than i thought so i'll just
toss it out to the list and let someone here figure out what to do
with it.)

  from my latest tree scanning script looking for unused Kconfig
variables, we learn that:

$ grep -r IPDDP_DECAP drivers
drivers/net/appletalk/ipddp.c:static int ipddp_mode = IPDDP_DECAP;
drivers/net/appletalk/ipddp.c:	if(ipddp_mode == IPDDP_DECAP)
drivers/net/appletalk/ipddp.c:	if(ipddp_mode == IPDDP_DECAP)
drivers/net/appletalk/Kconfig:config IPDDP_DECAP
drivers/net/appletalk/ipddp.h:#define IPDDP_DECAP	2
$

which suggests that the Kconfig variable "IPDDP_DECAP" is utterly
redundant as the corresponding CONFIG_IPDDP_DECAP is not used anywhere
so the obvious solution is to simply remove that Kconfig variable.

  until you search for the corresponding IPDDP_ENCAP variable:

$ grep -r IPDDP_ENCAP drivers
drivers/net/appletalk/ipddp.c:#ifdef CONFIG_IPDDP_ENCAP
drivers/net/appletalk/ipddp.c:static int ipddp_mode = IPDDP_ENCAP;
drivers/net/appletalk/ipddp.c:	if(ipddp_mode == IPDDP_ENCAP)
drivers/net/appletalk/Kconfig:config IPDDP_ENCAP
drivers/net/appletalk/ipddp.h:#define IPDDP_ENCAP	1
$

  the difference is this one *is* tested in ipddp.c, thusly:

#ifdef CONFIG_IPDDP_ENCAP
static int ipddp_mode = IPDDP_ENCAP;
#else
static int ipddp_mode = IPDDP_DECAP;
#endif

  that makes it seem that those two settings should be mutually
exclusive, but that's not how they're defined in the Kconfig file:

=====

config IPDDP_ENCAP
        bool "IP to Appletalk-IP Encapsulation support"
        depends on IPDDP
        help
          If you say Y here, the AppleTalk-IP code will be able to encapsulate
          IP packets inside AppleTalk frames; this is useful if your Linux box          is stuck on an AppleTalk network (which hopefully contains a
          decapsulator somewhere). Please see
          <file:Documentation/networking/ipddp.txt> for more information. If
          you said Y to "AppleTalk-IP driver support" above and you say Y
          here, then you cannot say Y to "AppleTalk-IP to IP Decapsulation
          support", below.

config IPDDP_DECAP
        bool "Appletalk-IP to IP Decapsulation support"
        depends on IPDDP
        help
          If you say Y here, the AppleTalk-IP code will be able to decapsulate
          AppleTalk-IP frames to IP packets; this is useful if you want your
          Linux box to act as an Internet gateway for an AppleTalk network.
          Please see <file:Documentation/networking/ipddp.txt> for more
          information.  If you said Y to "AppleTalk-IP driver support" above
          and you say Y here, then you cannot say Y to "IP to AppleTalk-IP
          Encapsulation support", above.

=====

  i'm confused. would someone like to suggest how that can be cleaned
up?

rday
--


========================================================================
Robert P. J. Day                               Waterloo, Ontario, CANADA

        Linux Consulting, Training and Annoying Kernel Pedantry.

Web page:                                          http://crashcourse.ca
Twitter:                                       http://twitter.com/rpjday
========================================================================

^ permalink raw reply

* Re: [PATCH] slub: fix slab_pad_check()
From: Christoph Lameter @ 2009-09-08 19:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, Pekka Enberg, Zdenek Kabelac, Patrick McHardy,
	Robin Holt, Linux Kernel Mailing List, Jesper Dangaard Brouer,
	Linux Netdev List, Netfilter Developers
In-Reply-To: <20090904204335.GG6751@linux.vnet.ibm.com>

On Fri, 4 Sep 2009, Paul E. McKenney wrote:

> We have gotten along fine with only SLAB_DESTROY_BY_RCU for almost
> five years, so I think we are plenty fine with what we have.  So, as
> you say, "as the need arises".

These were the glory years where SLAB_DESTROY_BY_RCU was only used for
anonymous vmas. Now Eric has picked it up for the net subsystem. You may
see the RCU use proliferate.

The kmem_cache_destroy rcu barriers did not matter until
SLAB_DESTROY_BY_RCU spread.

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Michael S. Tsirkin @ 2009-09-08 20:14 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: netdev, virtualization, kvm, linux-kernel, mingo, linux-mm, akpm,
	hpa, gregory.haskins, Rusty Russell, s.hetze
In-Reply-To: <20090908172035.GB319@ovro.caltech.edu>

On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote:
> On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
> > > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> > > > What it is: vhost net is a character device that can be used to reduce
> > > > the number of system calls involved in virtio networking.
> > > > Existing virtio net code is used in the guest without modification.
> > > > 
> > > > There's similarity with vringfd, with some differences and reduced scope
> > > > - uses eventfd for signalling
> > > > - structures can be moved around in memory at any time (good for migration)
> > > > - support memory table and not just an offset (needed for kvm)
> > > > 
> > > > common virtio related code has been put in a separate file vhost.c and
> > > > can be made into a separate module if/when more backends appear.  I used
> > > > Rusty's lguest.c as the source for developing this part : this supplied
> > > > me with witty comments I wouldn't be able to write myself.
> > > > 
> > > > What it is not: vhost net is not a bus, and not a generic new system
> > > > call. No assumptions are made on how guest performs hypercalls.
> > > > Userspace hypervisors are supported as well as kvm.
> > > > 
> > > > How it works: Basically, we connect virtio frontend (configured by
> > > > userspace) to a backend. The backend could be a network device, or a
> > > > tun-like device. In this version I only support raw socket as a backend,
> > > > which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> > > > also configured by userspace, including vlan/mac etc.
> > > > 
> > > > Status:
> > > > This works for me, and I haven't see any crashes.
> > > > I have done some light benchmarking (with v4), compared to userspace, I
> > > > see improved latency (as I save up to 4 system calls per packet) but not
> > > > bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> > > > ping benchmark (where there's no TSO) troughput is also improved.
> > > > 
> > > > Features that I plan to look at in the future:
> > > > - tap support
> > > > - TSO
> > > > - interrupt mitigation
> > > > - zero copy
> > > > 
> > > 
> > > Hello Michael,
> > > 
> > > I've started looking at vhost with the intention of using it over PCI to
> > > connect physical machines together.
> > > 
> > > The part that I am struggling with the most is figuring out which parts
> > > of the rings are in the host's memory, and which parts are in the
> > > guest's memory.
> > 
> > All rings are in guest's memory, to match existing virtio code.
> 
> Ok, this makes sense.
> 
> > vhost
> > assumes that the memory space of the hypervisor userspace process covers
> > the whole of guest memory.
> 
> Is this necessary? Why?

Because with virtio ring can give us arbitrary guest addresses.  If
guest was limited to using a subset of addresses, hypervisor would only
have to map these.

> The assumption seems very wrong when you're
> doing data transport between two physical systems via PCI.
> I know vhost has not been designed for this specific situation, but it
> is good to be looking toward other possible uses.
> 
> > And there's a translation table.
> > Ring addresses are userspace addresses, they do not undergo translation.
> > 
> > > If I understand everything correctly, the rings are all userspace
> > > addresses, which means that they can be moved around in physical memory,
> > > and get pushed out to swap.
> > 
> > Unless they are locked, yes.
> > 
> > > AFAIK, this is impossible to handle when
> > > connecting two physical systems, you'd need the rings available in IO
> > > memory (PCI memory), so you can ioreadXX() them instead. To the best of
> > > my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
> > > Also, having them migrate around in memory would be a bad thing.
> > > 
> > > Also, I'm having trouble figuring out how the packet contents are
> > > actually copied from one system to the other. Could you point this out
> > > for me?
> > 
> > The code in net/packet/af_packet.c does it when vhost calls sendmsg.
> > 
> 
> Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
> to make this use a DMA engine instead?

Maybe.

> I know this was suggested in an earlier thread.

Yes, it might even give some performance benefit with e.g. I/O AT.

> > > Is there somewhere I can find the userspace code (kvm, qemu, lguest,
> > > etc.) code needed for interacting with the vhost misc device so I can
> > > get a better idea of how userspace is supposed to work?
> > 
> > Look in archives for kvm@vger.kernel.org. the subject is qemu-kvm: vhost net.
> > 
> > > (Features
> > > negotiation, etc.)
> > > 
> > 
> > That's not yet implemented as there are no features yet.  I'm working on
> > tap support, which will add a feature bit.  Overall, qemu does an ioctl
> > to query supported features, and then acks them with another ioctl.  I'm
> > also trying to avoid duplicating functionality available elsewhere.  So
> > that to check e.g. TSO support, you'd just look at the underlying
> > hardware device you are binding to.
> > 
> 
> Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
> the future? I found that this made an enormous improvement in throughput
> on my virtio-net <-> virtio-net system. Perhaps it isn't needed with
> vhost-net.

Yes, I'm working on it.

> Thanks for replying,
> Ira

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: ipw2200: firmware DMA loading rework
From: Simon Kitching @ 2009-09-08 20:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Theodore Tso, Luis R. Rodriguez, Bartlomiej Zolnierkiewicz,
	Aneesh Kumar K.V, Zhu Yi, Andrew Morton, Johannes Weiner,
	Pekka Enberg, Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List, Mel Gorman, netdev@vger.kernel.org,
	linux-mm@kvack.org, James Ketrenos, Chatre, Reinette,
	linux-wireless@vger.kernel.org,
	"ipw2100-devel@lists.sourceforge.net" <ipw2100- 
In-Reply-To: <20090908110041.GE28127@csn.ul.ie>

On Tue, 2009-09-08 at 12:00 +0100, Mel Gorman wrote:
> On Sat, Sep 05, 2009 at 10:28:37AM -0400, Theodore Tso wrote:
> > On Thu, Sep 03, 2009 at 01:49:14PM +0100, Mel Gorman wrote:
> > > > 
> > > > This looks very similar to the kmemleak ext4 reports upon a mount. If
> > > > it is the same issue, which from the trace it seems it is, then this
> > > > is due to an extra kmalloc() allocation and this apparently will not
> > > > get fixed on 2.6.31 due to the closeness of the merge window and the
> > > > non-criticalness this issue has been deemed.
> > 
> > No, it's a different problem.
> > 
> > > I suspect the more pressing concern is why is this kmalloc() resulting in
> > > an order-5 allocation request? What size is the buffer being requested?
> > > Was that expected?  What is the contents of /proc/slabinfo in case a buffer
> > > that should have required order-1 or order-2 is using a higher order for
> > > some reason.
> > 
> > It's allocating 68,000 bytes for the mb_history structure, which is
> > used for debugging purposes.  That's why it's optional and we continue
> > if it's not allocated.  We should fix it to use vmalloc()
> 
> You could call with kmalloc(FLAGS|GFP_NOWARN) with a fallback to
> vmalloc() and a disable if vmalloc() fails as well.  Maybe check out what
> kernel/profile.c#profile_init() to allocate a large buffer and do something
> similar?
> 
> > and I'm
> > inclined to turn it off by default since it's not worth the overhead,
> > and most ext4 users won't find it useful or interesting.
> > 
> 
> I can't comment as I don't know what sort of debugging it's useful for.
> 

Perhaps this is a suitable use for the new proposed flex_array? From an
initial glance, I can't see why the allocated memory has to be
contiguous..

http://lwn.net/Articles/345273/

Cheers, Simon

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox