Netdev List


* Re: 2.6.38.2, kernel panic, probably related to fragmentation handling
From: Eric Dumazet @ 2011-05-04 20:02 UTC
  To: Denys Fedoryshchenko, David Miller; +Cc: netdev
In-Reply-To: <1304532674.32152.16.camel@edumazet-laptop>

On Wednesday, 4 May 2011 at 20:11 +0200, Eric Dumazet wrote:
> On Wednesday, 4 May 2011 at 19:03 +0200, Eric Dumazet wrote:
> 
> > Hi Denys
> > 
> > Is it reproducible, and possibly on the latest kernel?
> > 
> > We fixed some bugs lately (assuming you also use a bridge?)
> > 
> > Could you send the disassembled code of icmp_send() from your kernel?
> 
> Oh well, I think I found the problem. I am working on a patch and will
> send it shortly.
> 
> Thanks
> 

I believe the bug is one year old (2.6.35); please try the following patch.

Thanks !

[PATCH] net: ip_expire() must revalidate route

Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path)
introduced a bug in IP defragmentation handling, triggered when the
reassembly timeout fires.

When a frame is defragmented, we use the dst field of the last skb when
building the final skb. That dst is valid, since we are inside an RCU
read-side section.

But if a timeout occurs, we take the first queued fragment to build an
ICMP TIME EXCEEDED message. The problem is that all queued skbs have weak
dst pointers, since we left the RCU critical section after queueing them.
icmp_send() might then dereference memory that has already been freed
(and possibly reused).

Calling skb_dst_drop() and then ip_route_input_noref() to revalidate the
route is the only viable option.

Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_fragment.c |   31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a1151b8..b1d282f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -223,31 +223,30 @@ static void ip_expire(unsigned long arg)
 
 	if ((qp->q.last_in & INET_FRAG_FIRST_IN) && qp->q.fragments != NULL) {
 		struct sk_buff *head = qp->q.fragments;
+		const struct iphdr *iph;
+		int err;
 
 		rcu_read_lock();
 		head->dev = dev_get_by_index_rcu(net, qp->iif);
 		if (!head->dev)
 			goto out_rcu_unlock;
 
+		/* skb dst is stale, drop it, and perform route lookup again */
+		skb_dst_drop(head);
+		iph = ip_hdr(head);
+		err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+					   iph->tos, head->dev);
+		if (err)
+			goto out_rcu_unlock;
+
 		/*
-		 * Only search router table for the head fragment,
-		 * when defraging timeout at PRE_ROUTING HOOK.
+		 * Only an end host needs to send an ICMP
+		 * "Fragment Reassembly Timeout" message, per RFC792.
 		 */
-		if (qp->user == IP_DEFRAG_CONNTRACK_IN && !skb_dst(head)) {
-			const struct iphdr *iph = ip_hdr(head);
-			int err = ip_route_input(head, iph->daddr, iph->saddr,
-						 iph->tos, head->dev);
-			if (unlikely(err))
-				goto out_rcu_unlock;
-
-			/*
-			 * Only an end host needs to send an ICMP
-			 * "Fragment Reassembly Timeout" message, per RFC792.
-			 */
-			if (skb_rtable(head)->rt_type != RTN_LOCAL)
-				goto out_rcu_unlock;
+		if (qp->user == IP_DEFRAG_CONNTRACK_IN &&
+		    skb_rtable(head)->rt_type != RTN_LOCAL)
+			goto out_rcu_unlock;
 
-		}
 
 		/* Send an ICMP "Fragment Reassembly Timeout" message. */
 		icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);




* [PATCH] tcp_cubic: limit delayed_ack ratio to prevent divide error
From: Stephen Hemminger @ 2011-05-04 20:04 UTC
  To: David Miller, Sangtae Ha, Injong Rhee
  Cc: Valdis.Kletnieks, rdunlap, lkml, netdev, linux-kernel
In-Reply-To: <20110504.124053.260068550.davem@davemloft.net>

TCP Cubic keeps a metric that estimates the delayed ACK ratio (packets
acknowledged per ACK), which is used in adjusting the window. If an
abnormally large number of packets are acknowledged at once, the update
could wrap the 16-bit delayed_ack field around to zero. This kind of ACK
could only happen when there was a large window and a huge number of
ACKs were lost.

This patch limits the value of the delayed ACK ratio. The limit of 32
packets per ACK is just a conservative choice, since normally the ratio
should be in the range of 1 to 4 packets.
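The wraparound can be reproduced outside the kernel. The following is a minimal userspace sketch, not kernel code: it assumes `delayed_ack` is held in a `uint16_t` as in `struct bictcp`, and the helper names `update_old()` and `update_new()` are hypothetical. The field stores packets per ACK scaled by 16 (`<< ACK_RATIO_SHIFT`), and each update is a moving average that keeps 15/16 of the old value and adds the new `cnt`.

```c
#include <assert.h>
#include <stdint.h>

#define ACK_RATIO_SHIFT 4
#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)

/* Pre-patch update: the sum is truncated to 16 bits on assignment,
 * so a huge cnt can wrap delayed_ack back to zero. */
static uint16_t update_old(uint16_t delayed_ack, uint32_t cnt)
{
	cnt -= delayed_ack >> ACK_RATIO_SHIFT;
	delayed_ack += cnt;
	return delayed_ack;
}

/* Patched update: compute in 32 bits, then clamp to 32 packets/ACK
 * (512 in fixed point) before storing into the 16-bit field. */
static uint16_t update_new(uint16_t delayed_ack, uint32_t cnt)
{
	uint32_t ratio = delayed_ack;

	ratio -= delayed_ack >> ACK_RATIO_SHIFT;
	ratio += cnt;
	return (uint16_t)(ratio < ACK_RATIO_LIMIT ? ratio : ACK_RATIO_LIMIT);
}
```

Starting from 32 (2 packets per ACK), a single update with `cnt = 65506` makes `update_old()` return 0, while `update_new()` returns 512. For ordinary inputs such as `cnt = 2` both return 32, so the clamp only changes behavior in the pathological case.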

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
Patch against 2.6.39-rc5+


--- a/net/ipv4/tcp_cubic.c	2011-05-04 11:58:49.666027155 -0700
+++ b/net/ipv4/tcp_cubic.c	2011-05-04 12:52:34.716767304 -0700
@@ -93,6 +93,7 @@ struct bictcp {
 	u32	ack_cnt;	/* number of acks */
 	u32	tcp_cwnd;	/* estimated tcp cwnd */
 #define ACK_RATIO_SHIFT	4
+#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)
 	u16	delayed_ack;	/* estimate the ratio of Packets/ACKs << 4 */
 	u8	sample_cnt;	/* number of samples to decide curr_rtt */
 	u8	found;		/* the exit point is found? */
@@ -398,8 +399,12 @@ static void bictcp_acked(struct sock *sk
 	u32 delay;
 
 	if (icsk->icsk_ca_state == TCP_CA_Open) {
-		cnt -= ca->delayed_ack >> ACK_RATIO_SHIFT;
-		ca->delayed_ack += cnt;
+		u32 ratio = ca->delayed_ack;
+
+		ratio -= ca->delayed_ack >> ACK_RATIO_SHIFT;
+		ratio += cnt;
+
+		ca->delayed_ack = min(ratio, ACK_RATIO_LIMIT);
 	}
 
 	/* Some calls are for duplicates without timetamps */


* Re: ath5k regression associating with APs in 2.6.38
From: Nick Kossifidis @ 2011-05-04 20:09 UTC
  To: John W. Linville, Jiri Slaby, Nick Kossifidis, Luis R. Rodriguez,
	Bob Copeland
In-Reply-To: <20110504192639.GB4551@thinkpad-t410>

2011/5/4 Seth Forshee <seth.forshee@canonical.com>:
> On Wed, May 04, 2011 at 01:27:17PM -0400, John W. Linville wrote:
>> On Wed, May 04, 2011 at 10:38:19AM -0500, Seth Forshee wrote:
>> > I've been investigating some reports of a regression in associating with
>> > APs with AR2413 in 2.6.38. Association repeatedly fails with some
>> > "direct probe to x timed out" messages (see syslog excerpt below),
>> > although it will generally associate eventually, after many tries.
>> >
>> > Bisection identifies 8aec7af (ath5k: Support synth-only channel change
>> > for AR2413/AR5413) as offending commit. Prior to this commit there are
>> > no direct probe messages at all in the logs. I've also found that
>> > forcing fast to false at the top of ath5k_hw_reset() fixes the issue.
>> > I'm not sure what the connection is between this commit and the
>> > timeouts. Any suggestions?
>>
>> Have you tried reverting that commit on top of 2.6.38?  Can you
>> recreate the issue with 2.6.39-rc6 (or later)?
>
> I started to revert that commit, but it wasn't straight-forward due to
> later changes. Forcing fast to false in ath5k_hw_reset() acts as a
> functional revert of sorts since that should force it back to a full
> reset for all channel changes, and it's much simpler than working out
> the right way to revert the commit. I think the results suggest strongly
> that a revert is likely to fix the problem. I can finish the work to
> revert if you'd still like to see the results.
>
> Testing a previous .39-rc kernel still exhibited the failure. I don't
> recall which one it was and apparently forgot to make note of it. I'll
> request testing against rc6.
>
> Thanks,
> Seth
>

Do you get scan results?
Can you enable ATH5K_DEBUG_RESET and see what you get?

-- 
GPG ID: 0xD21DB2DB
As you read this post global entropy rises. Have Fun ;-)
Nick


* [PATCH 0/5] More rt->rt_{src,dst} removals.
From: David Miller @ 2011-05-04 20:19 UTC
  To: netdev

After this set, the number of remaining uses of rt->rt_{src,dst} is very small.

One case that still has to be handled will need a slight rearrangement
of how we do SRR handling. SRR handling on input pops the nexthop, which
creates situations where iph->daddr != rt->rt_dst, and that rules out
simply converting the rt->rt_dst access in ip_forward().

Another class of remaining cases occurs in situations where we can
fetch the saddr/daddr from the inet socket.

Soon, all that will be left are the assignments and the netlink dump
code in net/ipv4/route.c.


* [PATCH 1/5] ipv4: Use flowi4->{daddr,saddr} in ipip_tunnel_xmit().
From: David Miller @ 2011-05-04 20:19 UTC
  To: netdev


Instead of rt->rt_{dst,src}

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ipip.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 88d96bd..bfa0b98 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -550,8 +550,8 @@ static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 	iph->frag_off		=	df;
 	iph->protocol		=	IPPROTO_IPIP;
 	iph->tos		=	INET_ECN_encapsulate(tos, old_iph->tos);
-	iph->daddr		=	rt->rt_dst;
-	iph->saddr		=	rt->rt_src;
+	iph->daddr		=	fl4.daddr;
+	iph->saddr		=	fl4.saddr;
 
 	if ((iph->ttl = tiph->ttl) == 0)
 		iph->ttl	=	old_iph->ttl;
-- 
1.7.4.5



* [PATCH 2/5] ipv4: In ip_build_and_send_pkt() use 'saddr' and 'daddr' args passed in.
From: David Miller @ 2011-05-04 20:19 UTC
  To: netdev


Instead of rt->rt_{dst,src}

The only tricky part is source route option handling.

If the source route option is enabled, we can't just use plain 'daddr';
we have to use opt->opt.faddr.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_output.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3aa4c31..db38c18 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -158,8 +158,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 	else
 		iph->frag_off = 0;
 	iph->ttl      = ip_select_ttl(inet, &rt->dst);
-	iph->daddr    = rt->rt_dst;
-	iph->saddr    = rt->rt_src;
+	iph->daddr    = (opt && opt->opt.srr ? opt->opt.faddr : daddr);
+	iph->saddr    = saddr;
 	iph->protocol = sk->sk_protocol;
 	ip_select_ident(iph, &rt->dst, sk);
 
-- 
1.7.4.5



* [PATCH 3/5] ipv4: Pass explicit saddr/daddr args to ipmr_get_route().
From: David Miller @ 2011-05-04 20:19 UTC
  To: netdev


This eliminates the need to use rt->rt_{src,dst}.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/mroute.h |    1 +
 net/ipv4/ipmr.c        |   16 ++++++++--------
 net/ipv4/route.c       |    4 +++-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index b21d567..46caaf4 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -244,6 +244,7 @@ struct mfc_cache {
 #ifdef __KERNEL__
 struct rtmsg;
 extern int ipmr_get_route(struct net *net, struct sk_buff *skb,
+			  __be32 saddr, __be32 daddr,
 			  struct rtmsg *rtm, int nowait);
 #endif
 
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 86033b7..30a7763 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2041,20 +2041,20 @@ rtattr_failure:
 	return -EMSGSIZE;
 }
 
-int ipmr_get_route(struct net *net,
-		   struct sk_buff *skb, struct rtmsg *rtm, int nowait)
+int ipmr_get_route(struct net *net, struct sk_buff *skb,
+		   __be32 saddr, __be32 daddr,
+		   struct rtmsg *rtm, int nowait)
 {
-	int err;
-	struct mr_table *mrt;
 	struct mfc_cache *cache;
-	struct rtable *rt = skb_rtable(skb);
+	struct mr_table *mrt;
+	int err;
 
 	mrt = ipmr_get_table(net, RT_TABLE_DEFAULT);
 	if (mrt == NULL)
 		return -ENOENT;
 
 	rcu_read_lock();
-	cache = ipmr_cache_find(mrt, rt->rt_src, rt->rt_dst);
+	cache = ipmr_cache_find(mrt, saddr, daddr);
 
 	if (cache == NULL) {
 		struct sk_buff *skb2;
@@ -2087,8 +2087,8 @@ int ipmr_get_route(struct net *net,
 		skb_reset_network_header(skb2);
 		iph = ip_hdr(skb2);
 		iph->ihl = sizeof(struct iphdr) >> 2;
-		iph->saddr = rt->rt_src;
-		iph->daddr = rt->rt_dst;
+		iph->saddr = saddr;
+		iph->daddr = daddr;
 		iph->version = 0;
 		err = ipmr_cache_unresolved(mrt, vif, skb2);
 		read_unlock(&mrt_lock);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 3bc6854..6a83840 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2857,7 +2857,9 @@ static int rt_fill_info(struct net *net,
 
 		if (ipv4_is_multicast(dst) && !ipv4_is_local_multicast(dst) &&
 		    IPV4_DEVCONF_ALL(net, MC_FORWARDING)) {
-			int err = ipmr_get_route(net, skb, r, nowait);
+			int err = ipmr_get_route(net, skb,
+						 rt->rt_src, rt->rt_dst,
+						 r, nowait);
 			if (err <= 0) {
 				if (!nowait) {
 					if (err == 0)
-- 
1.7.4.5



* [PATCH 4/5] ipv4: Kill rt->rt_{src, dst} usage in IP GRE tunnels.
From: David Miller @ 2011-05-04 20:19 UTC
  To: netdev


First, make callers pass on-stack flowi4 to ip_route_output_gre()
so they can get at the fully resolved flow key.

Next, use that in ipgre_tunnel_xmit() to avoid the need to use
rt->rt_{dst,src}.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h |   19 +++++++++----------
 net/ipv4/ip_gre.c   |   37 +++++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 8c02c87..9f8070b 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -152,19 +152,18 @@ static inline struct rtable *ip_route_output_ports(struct net *net, struct flowi
 	return ip_route_output_flow(net, fl4, sk);
 }
 
-static inline struct rtable *ip_route_output_gre(struct net *net,
+static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4 *fl4,
 						 __be32 daddr, __be32 saddr,
 						 __be32 gre_key, __u8 tos, int oif)
 {
-	struct flowi4 fl4 = {
-		.flowi4_oif = oif,
-		.daddr = daddr,
-		.saddr = saddr,
-		.flowi4_tos = tos,
-		.flowi4_proto = IPPROTO_GRE,
-		.fl4_gre_key = gre_key,
-	};
-	return ip_route_output_key(net, &fl4);
+	memset(fl4, 0, sizeof(*fl4));
+	fl4->flowi4_oif = oif;
+	fl4->daddr = daddr;
+	fl4->saddr = saddr;
+	fl4->flowi4_tos = tos;
+	fl4->flowi4_proto = IPPROTO_GRE;
+	fl4->fl4_gre_key = gre_key;
+	return ip_route_output_key(net, fl4);
 }
 
 extern int ip_route_input_common(struct sk_buff *skb, __be32 dst, __be32 src,
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 24efd35..10e9b5a 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -699,6 +699,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 	struct pcpu_tstats *tstats;
 	const struct iphdr  *old_iph = ip_hdr(skb);
 	const struct iphdr  *tiph;
+	struct flowi4 fl4;
 	u8     tos;
 	__be16 df;
 	struct rtable *rt;     			/* Route to the other host */
@@ -769,7 +770,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 			tos = ipv6_get_dsfield((const struct ipv6hdr *)old_iph);
 	}
 
-	rt = ip_route_output_gre(dev_net(dev), dst, tiph->saddr,
+	rt = ip_route_output_gre(dev_net(dev), &fl4, dst, tiph->saddr,
 				 tunnel->parms.o_key, RT_TOS(tos),
 				 tunnel->parms.link);
 	if (IS_ERR(rt)) {
@@ -873,8 +874,8 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 	iph->frag_off		=	df;
 	iph->protocol		=	IPPROTO_GRE;
 	iph->tos		=	ipgre_ecn_encapsulate(tos, old_iph, skb);
-	iph->daddr		=	rt->rt_dst;
-	iph->saddr		=	rt->rt_src;
+	iph->daddr		=	fl4.daddr;
+	iph->saddr		=	fl4.saddr;
 
 	if ((iph->ttl = tiph->ttl) == 0) {
 		if (skb->protocol == htons(ETH_P_IP))
@@ -938,12 +939,14 @@ static int ipgre_tunnel_bind_dev(struct net_device *dev)
 	/* Guess output device to choose reasonable mtu and needed_headroom */
 
 	if (iph->daddr) {
-		struct rtable *rt = ip_route_output_gre(dev_net(dev),
-							iph->daddr, iph->saddr,
-							tunnel->parms.o_key,
-							RT_TOS(iph->tos),
-							tunnel->parms.link);
-
+		struct flowi4 fl4;
+		struct rtable *rt;
+
+		rt = ip_route_output_gre(dev_net(dev), &fl4,
+					 iph->daddr, iph->saddr,
+					 tunnel->parms.o_key,
+					 RT_TOS(iph->tos),
+					 tunnel->parms.link);
 		if (!IS_ERR(rt)) {
 			tdev = rt->dst.dev;
 			ip_rt_put(rt);
@@ -1196,13 +1199,15 @@ static int ipgre_open(struct net_device *dev)
 	struct ip_tunnel *t = netdev_priv(dev);
 
 	if (ipv4_is_multicast(t->parms.iph.daddr)) {
-		struct rtable *rt = ip_route_output_gre(dev_net(dev),
-							t->parms.iph.daddr,
-							t->parms.iph.saddr,
-							t->parms.o_key,
-							RT_TOS(t->parms.iph.tos),
-							t->parms.link);
-
+		struct flowi4 fl4;
+		struct rtable *rt;
+
+		rt = ip_route_output_gre(dev_net(dev), &fl4,
+					 t->parms.iph.daddr,
+					 t->parms.iph.saddr,
+					 t->parms.o_key,
+					 RT_TOS(t->parms.iph.tos),
+					 t->parms.link);
 		if (IS_ERR(rt))
 			return -EADDRNOTAVAIL;
 		dev = rt->dst.dev;
-- 
1.7.4.5



* [PATCH 5/5] ipv6: Use flowi4->{daddr,saddr} in ipip6_tunnel_xmit().
From: David Miller @ 2011-05-04 20:19 UTC
  To: netdev


Instead of rt->rt_{dst,src}

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv6/sit.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index a24fb14..c53abcf 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -828,8 +828,8 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 	iph->frag_off		=	df;
 	iph->protocol		=	IPPROTO_IPV6;
 	iph->tos		=	INET_ECN_encapsulate(tos, ipv6_get_dsfield(iph6));
-	iph->daddr		=	rt->rt_dst;
-	iph->saddr		=	rt->rt_src;
+	iph->daddr		=	fl4.daddr;
+	iph->saddr		=	fl4.saddr;
 
 	if ((iph->ttl = tiph->ttl) == 0)
 		iph->ttl	=	iph6->hop_limit;
-- 
1.7.4.5



* [PATCHv2] virtio-spec: 64 bit features, used/avail event
From: Michael S. Tsirkin @ 2011-05-04 20:32 UTC
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	virtualization, netdev, linux-s390, kvm, Krishna Kumar,
	Tom Lendacky, steved, habanero

I'm working on a patchset (to follow shortly)
that modifies the notification hand-off in virtio to be basically
like Xen: each side publishes an index, and the other side only triggers
an event when it crosses that index value
(Xen event indexes start at 1, ours start at 0 for
backward compatibility, but that's minor).

Especially for testing, it is very convenient to have
separate feature bits for this change in the used and available
rings. Since we've run out of bits in the 32-bit field,
I added another 32 bits, and bit 31 enables them.
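A minimal sketch of how a driver might combine the two feature words under this scheme, assuming the existing field holds bits 0:31 and the new field holds bits 32:63. The macro `FEATURES_HI_ENABLE` and the helper `effective_features()` are hypothetical names, not taken from the spec. When bit 31 is clear, the high word is ignored, matching the rule that all feature bits above 31 must then be treated as unset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for feature bit 31, which enables bits 32:63. */
#define FEATURES_HI_ENABLE (1u << 31)

/* Combine the low (bits 0:31) and high (bits 32:63) feature words.
 * If bit 31 is clear, every bit above 31 is treated as unset. */
static uint64_t effective_features(uint32_t lo, uint32_t hi)
{
	if (!(lo & FEATURES_HI_ENABLE))
		return lo;
	return ((uint64_t)hi << 32) | lo;
}
```

With this either-or rule, an old device that never sets bit 31 can leave the second field at any value without the guest misreading it.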

I started by using both flags and indexes in parallel,
but switched to an either-or scheme: this means we do
not need to tweak memory access ordering, as index access simply
replaces flags access.

A note on naming: the index replacing avail->flags is named
used_event, and the index replacing used->flags is named
avail_event, to stress the fact that these actually
point into the other side of the ring:
an event is triggered when avail->idx == used->avail_event + 1
and when used->idx == avail->used_event + 1, respectively.
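The equality conditions above describe a single-entry update. When a side adds several entries at once, moving idx from old_idx to new_idx, the check has to ask whether the event index was passed anywhere in that range, modulo 2^16. The spec text in this patch states that as `(u16)(new_idx - event - 1) < (u16)(new_idx - old_idx)`. A minimal C sketch follows; the helper name `need_event()` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return nonzero if an update that moved a ring index from old_idx to
 * new_idx (modulo 2^16) wrote the entry at event_idx, i.e. if event_idx
 * lies in [old_idx, new_idx). The casts make the subtraction wrap. */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}
```

With new_idx == old_idx + 1 this reduces to the equality form above (new_idx == event_idx + 1). The same helper could serve both directions: the guest checking avail_event before notifying, and the device checking used_event before interrupting.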

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v1:
	- minor wording changes to address comments
	- fill a couple of places I missed
	- add text about access ordering

 virtio-spec.lyx |  719 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 700 insertions(+), 19 deletions(-)

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index f7c9c38..95fd926 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -1,4 +1,4 @@
-#LyX 1.6.7 created this file. For more info see http://www.lyx.org/
+#LyX 1.6.8 created this file. For more info see http://www.lyx.org/
 \lyxformat 345
 \begin_document
 \begin_header
@@ -36,7 +36,7 @@
 \paperpagestyle default
 \tracking_changes true
 \output_changes true
-\author "" 
+\author "Michael S. Tsirkin" 
 \author "" 
 \end_header
 
@@ -953,6 +953,10 @@ ISR
 
 \size footnotesize
 Features
+\change_inserted 0 1304329091
+ bits 0:31
+\change_unchanged
+
 \end_layout
 
 \end_inset
@@ -964,6 +968,10 @@ Features
 
 \size footnotesize
 Features
+\change_inserted 0 1304329086
+ bits 0:31
+\change_unchanged
+
 \end_layout
 
 \end_inset
@@ -1186,6 +1194,177 @@ Vector
 \end_layout
 
 \begin_layout Standard
+
+\change_inserted 0 1304328924
+Finally, if feature bit 31 (VIRTIO_F_FEATURES_HI) is set, this is immediately followed
+ by two additional fields:
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1304328925
+\begin_inset Tabular
+<lyxtabular version="3" rows="4" columns="3">
+<features>
+<column alignment="left" valignment="top" width="0">
+<column alignment="left" valignment="top" width="0">
+<column alignment="left" valignment="top" width="0">
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+Bits
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+32
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+32
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+Read/Write
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+R
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+R+W
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+Purpose
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+
+\size footnotesize
+Device
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+
+\size footnotesize
+Guest
+\end_layout
+
+\end_inset
+</cell>
+</row>
+<row>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" rightline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304328925
+
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304329099
+
+\size footnotesize
+Features bits 32:63
+\end_layout
+
+\end_inset
+</cell>
+<cell alignment="center" valignment="top" bottomline="true" leftline="true" usebox="none">
+\begin_inset Text
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304329102
+
+\size footnotesize
+Features bits 32:63
+\end_layout
+
+\end_inset
+</cell>
+</row>
+</lyxtabular>
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
 Immediately following these general headers, there may be device-specific
  headers:
 \end_layout
@@ -1348,7 +1527,20 @@ Feature Bits
 The least significant 31 bits of the first configuration field indicates
  the features that the device supports (the high bit is reserved, and will
  be used to indicate the presence of future feature bits elsewhere).
- The bits are allocated as follows:
+ 
+\change_inserted 0 1304331636
+If more than 31 feature bits are supported, the device indicates so by setting
+ feature bit 31 (see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Reserved-Feature-Bits"
+
+\end_inset
+
+).
+ 
+\change_unchanged
+The bits are allocated as follows:
 \end_layout
 
 \begin_layout Description
@@ -1372,7 +1564,33 @@ to
 \begin_inset space ~
 \end_inset
 
-30 Feature bits reserved for extensions to the queue mechanism
+
+\change_inserted 0 1304329326
+4
+\change_deleted 0 1304329325
+3
+\change_unchanged
+0 Feature bits reserved for extensions to the queue 
+\change_inserted 0 1304540448
+and feature negotiation 
+\change_unchanged
+mechanism
+\change_inserted 0 1304540449
+s
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 0 1304329398
+41
+\begin_inset space ~
+\end_inset
+
+to
+\begin_inset space ~
+\end_inset
+
+63 Feature bits reserved for future extensions
 \end_layout
 
 \begin_layout Standard
@@ -1407,6 +1625,19 @@ This allows for forwards and backwards compatibility: if the device is enhanced
  support, it will not see that feature bit in the Device Features field
  and can go into backwards compatibility mode (or, for poor implementations,
  set the FAILED Device Status bit).
+\change_inserted 0 1304329423
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1304331742
+Access to feature bits 32 to 63 is enabled by Guest by setting feature bit
+ 31.
+ If this bit is unset, Device must assume that all feature bits > 31 are
+ unset.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Subsubsection
@@ -1891,7 +2122,38 @@ flags
 
  field is currently 0 or 1: 1 indicating that we do not need an interrupt
  when the device consumes a descriptor from the available ring.
- This interrupt suppression is merely an optimization; it may not suppress
+ 
+\change_inserted 0 1304540481
+Alternatively, the guest can ask the device to delay interrupts until an entry
+ with an index specified by the 
+\begin_inset Quotes eld
+\end_inset
+
+used_event
+\begin_inset Quotes erd
+\end_inset
+
+ field is written in the used ring (equivalently, until the 
+\emph on
+idx
+\emph default
+ field in the used ring will reach the value 
+\emph on
+used_event + 1
+\emph default
+).
+ The method employed by the device is controlled by the VIRTIO_RING_F_USED_EVENT
+_IDX feature bit (see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Reserved-Feature-Bits"
+
+\end_inset
+
+).
+ 
+\change_unchanged
+This interrupt suppression is merely an optimization; it may not suppress
  interrupts entirely.
 \end_layout
 
@@ -1940,6 +2202,17 @@ struct vring_avail {
 \begin_layout Plain Layout
 
    u16 ring[qsz]; /* qsz is the Queue Size field read from device */
+\change_inserted 0 1304329945
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304329957
+
+   u16 used_event;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -1963,8 +2236,71 @@ The used ring is where the device returns buffers once it is done with them.
 \emph on
 available
 \emph default
- ring (the flag is kept here because this is the only part of the virtqueue
- written by the device).
+ ring
+\change_inserted 0 1304540575
+.
+ Alternatively, the 
+\begin_inset Quotes eld
+\end_inset
+
+avail_event
+\begin_inset Quotes erd
+\end_inset
+
+ field can be used by the device to hint that no notification is necessary
+ until an entry with an index specified by the 
+\begin_inset Quotes eld
+\end_inset
+
+avail_event
+\begin_inset Quotes erd
+\end_inset
+
+ is written in the available ring (equivalently, until the 
+\emph on
+idx
+\emph default
+ field in the available ring will reach the value 
+\emph on
+avail_event + 1
+\emph default
+).
+
+\change_unchanged
+ 
+\change_inserted 0 1304540614
+The method employed by the device is controlled by the guest through the
+ VIRTIO_RING_F_AVAIL_EVENT_IDX feature bit (see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "cha:Reserved-Feature-Bits"
+
+\end_inset
+
+).
+ 
+\change_deleted 0 1304331235
+(the flag is kept here because this is the only part of the virtqueue written
+ by the device)
+\change_inserted 0 1304540560
+
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304331235
+These fields are kept here because this is the only part of the virtqueue
+ written by the device
+\change_unchanged
+
+\end_layout
+
+\end_inset
+
+
+\change_unchanged
+.
 \end_layout
 
 \begin_layout Standard
@@ -2046,6 +2382,17 @@ struct vring_used {
 \begin_layout Plain Layout
 
     struct vring_used_elem ring[qsz];
+\change_inserted 0 1304330369
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304330380
+
+    u16 avail_event;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -2065,9 +2412,13 @@ Helpers for Managing Virtqueues
 \begin_layout Standard
 The Linux Kernel Source code contains the definitions above and helper routines
  in a more usable form, in include/linux/virtio_ring.h.
- This was explicitly licensed by IBM under the (3-clause) BSD license so
- that it can be freely used by all other projects, and is reproduced (with
- slight variation to remove Linux assumptions) in Appendix A.
+ This was explicitly licensed by IBM 
+\change_inserted 0 1304342159
+and Red Hat 
+\change_unchanged
+under the (3-clause) BSD license so that it can be freely used by all other
+ projects, and is reproduced (with slight variation to remove Linux assumptions)
+ in Appendix A.
 \end_layout
 
 \begin_layout Section
@@ -2374,12 +2725,61 @@ before
 \emph default
  checking the suppression flag: it's OK to notify gratuitously, but not
  to omit a required notification.
- So again, we use a memory barrier here before reading the flags.
+ So again, we use a memory barrier here before reading the flags
+\change_inserted 0 1304336099
+ or the avail_event field
+\change_unchanged
+.
+\end_layout
+
+\begin_layout Standard
+If 
+\change_inserted 0 1304336234
+the VIRTIO_F_RING_AVAIL_EVENT_IDX feature is not negotiated, and if 
+\change_unchanged
+the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to the PCI
+ configuration space.
+\change_inserted 0 1304336255
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1304336617
+If the VIRTIO_F_RING_AVAIL_EVENT_IDX feature is negotiated, we read the
+ avail_event field in the used ring structure.
+ If the available index crossed the 
+\emph on
+avail_event
+\emph default
+ field value since the last notification, we go ahead and write to the PCI
+ configuration space.
+ The 
+\emph on
+avail_event
+\emph default
+ field wraps naturally at 65536 as well:
 \end_layout
 
 \begin_layout Standard
-If the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to the
- PCI configuration space.
+
+\change_inserted 0 1304336524
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304336569
+
+(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
+\end_layout
+
+\end_inset
+
+
+\change_unchanged
+
 \end_layout
 
 \begin_layout Subsection
@@ -2408,8 +2808,66 @@ Update the used ring idx.
 \end_layout
 
 \begin_layout Enumerate
-If the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail\SpecialChar \nobreakdash-
->flags:
+
+\change_inserted 0 1304336736
+Determine whether an interrupt is necessary:
+\end_layout
+
+\begin_deeper
+\begin_layout Enumerate
+
+\change_inserted 0 1304336780
+If the VIRTIO_F_RING_USED_IDX is not negotiated: check if 
+\change_deleted 0 1304336781
+I
+\change_unchanged
+f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail\SpecialChar \nobreakdash-
+>flags
+\change_inserted 0 1304336788
+
+\end_layout
+
+\begin_layout Enumerate
+
+\change_deleted 0 1304336785
+:
+\change_inserted 0 1304336896
+If the VIRTIO_F_RING_USED_IDX is negotiated: check whether the used index
+ crossed the 
+\emph on
+used_event
+\emph default
+ field value since the last update.
+ The 
+\emph on
+used_event
+\emph default
+ field wraps naturally at 65536 as well:
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304336902
+
+(u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)
+\end_layout
+
+\end_inset
+
+
+\change_unchanged
+
+\end_layout
+
+\end_deeper
+\begin_layout Enumerate
+
+\change_inserted 0 1304336714
+If an interrupt is necessary:
+\change_unchanged
+
 \end_layout
 
 \begin_deeper
@@ -2464,13 +2922,87 @@ If MSI-X capability is enabled: look through the used rings of each virtqueue
 \end_layout
 
 \begin_layout Standard
+
+\change_inserted 0 1304341856
+For each ring, the guest should then disable interrupts, if required, by
+ setting the VRING_AVAIL_F_NO_INTERRUPT flag in the available structure.
+ It can then process used ring entries, and finally re-enable interrupts by
+ clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the used_event
+ field in the available structure; the guest should then execute a memory
+ barrier and recheck the ring empty condition.
+ This is necessary to handle the case where, after the last check and before
+ interrupts were enabled, an interrupt was suppressed by the device:
+\end_layout
+
+\begin_layout Standard
 \begin_inset listings
 inline false
 status open
 
 \begin_layout Plain Layout
 
-while (vq->last_seen_used != vring->used.idx) {
+\change_inserted 0 1304342051
+
+vring_disable_interrupts(vq);
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341878
+
+for (;;) {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341880
+
+    if 
+\change_deleted 0 1304341882
+while 
+\change_unchanged
+(vq->last_seen_used != vring->used.idx) {
+\change_inserted 0 1304341888
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304342047
+
+		vring_enable_interrupts(vq);
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341986
+
+		mb();
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341964
+
+		if (vq->last_seen_used != vring->used.idx)
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341974
+
+			break;
+\change_unchanged
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341887
+
+    }
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -2721,6 +3253,15 @@ status open
 \begin_layout Plain Layout
 
  * Copyright 2007, 2009, IBM Corporation
+\change_inserted 0 1304341032
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304341075
+
+ * Copyright 2011, Red Hat, Inc
 \end_layout
 
 \begin_layout Plain Layout
@@ -3019,6 +3560,17 @@ struct vring_avail {
 \begin_layout Plain Layout
 
         uint16_t ring[];
+\change_inserted 0 1304340808
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304340816
+
+        uint16_t used_event;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -3090,6 +3642,17 @@ struct vring_used {
 \begin_layout Plain Layout
 
         struct vring_used_elem ring[];
+\change_inserted 0 1304340824
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304340831
+
+        uint16_t avail_event;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -3326,12 +3889,58 @@ static inline unsigned vring_size(unsigned int num, unsigned long align)
 
 \begin_layout Plain Layout
 
-                + sizeof(uint16_t)*2 + sizeof(struct vring_used_elem)*num;
+                + sizeof(uint16_t)*
+\change_deleted 0 1304340844
+2
+\change_inserted 0 1304340844
+3
+\change_unchanged
+ + sizeof(struct vring_used_elem)*num;
+\end_layout
+
+\begin_layout Plain Layout
+
+}
+\change_inserted 0 1304340918
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304340918
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304340987
+
+static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
+ uint16_t old_idx)
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304340944
+
+{
 \end_layout
 
 \begin_layout Plain Layout
 
+\change_inserted 0 1304341001
+
+         return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx
+ - old_idx); 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1304340938
+
 }
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -3355,7 +3964,13 @@ Appendix B: Reserved Feature Bits
 \end_layout
 
 \begin_layout Standard
-Currently there are three device-independent feature bits defined:
+Currently there are 
+\change_inserted 0 1304540655
+six
+\change_deleted 0 1304330657
+three
+\change_unchanged
+ device-independent feature bits defined:
 \end_layout
 
 \begin_layout Description
@@ -3365,7 +3980,11 @@ VIRTIO_F_NOTIFY_ON_EMPTY
 
 (24) Negotiating this feature indicates that the driver wants an interrupt
  if the device runs out of available descriptors on a virtqueue, even though
- interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT flag.
+ interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT flag
+\change_inserted 0 1304341161
+ or the used_event field
+\change_unchanged
+.
  An example of this is the networking driver: it doesn't need to know every
  time a packet is transmitted, but it does need to free the transmitted
  packets a finite time after they are transmitted.
@@ -3390,6 +4009,31 @@ reference "sub:Indirect-Descriptors"
 \end_layout
 
 \begin_layout Description
+
+\change_inserted 0 1304331394
+VIRTIO_F_RING_USED_EVENT_IDX(29) This feature indicates that the device
+ should ignore the 
+\emph on
+flags
+\emph default
+ field in the available ring structure.
+ Instead, the
+\emph on
+ used_event
+\emph default
+ field in this structure is used by the guest to suppress device interrupts.
+ If unset, the device should ignore the 
+\emph on
+used_event
+\emph default
+ field; the 
+\emph on
+flags
+\emph default
+ field is used.
+\end_layout
+
+\begin_layout Description
 VIRTIO_F_BAD_FEATURE(30) This feature should never be negotiated by the
  guest; doing so is an indication that the guest is faulty
 \begin_inset Foot
@@ -3403,6 +4047,43 @@ An experimental virtio PCI driver contained in Linux version 2.6.25 had this
 \end_inset
 
 
+\change_inserted 0 1304330854
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 0 1304330961
+VIRTIO_F_FEATURES_HIGH(31) This feature indicates that the device supports
+ feature bits 32:63.
+ If unset, feature bits 32:63 are unset.
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 0 1304331390
+VIRTIO_F_RING_AVAIL_EVENT_IDX(32) This feature indicates that the driver
+ should ignore the 
+\emph on
+flags
+\emph default
+ field in the used ring structure.
+ Instead, the 
+\emph on
+avail_event
+\emph default
+ field in this structure is used by the device to suppress notifications.
+ If unset, the driver should ignore the 
+\emph on
+avail_event
+\emph default
+ field; the 
+\emph on
+flags
+\emph default
+ field is used.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Chapter*
-- 
1.7.5.53.gc233e
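The wrap-around comparison used in the spec text above can be exercised on its own. Below is a minimal C sketch of the same formula as the vring_need_event() helper added to the header appendix. It returns nonzero when event_idx falls in [old_idx, new_idx) modulo 65536, i.e. when the index moved past the published event value since the last check. The sample values in the checks are illustrative, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same comparison as the spec's vring_need_event(): nonzero when event_idx
 * lies in [old_idx, new_idx) modulo 65536. Casting both differences to
 * uint16_t makes the test correct across the 65535 -> 0 wrap.
 */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

Since the event index starts at 0, the very first update (old 0, new 1) triggers a notification; an index range such as 65534..1 that wraps is handled without special cases.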

* Re: [net-next-2.6 PATCH] can: make struct can_proto const
From: Oliver Hartkopp @ 2011-05-04 20:34 UTC (permalink / raw)
  To: socketcan-core, netdev
In-Reply-To: <20110504044057.GC278@e-circ.dyndns.org>

On 04.05.2011 06:40, Kurt Van Dijck wrote:
> commit 53914b67993c724cec585863755c9ebc8446e83b had the
> same message. That commit did put everything in place but
> did not make can_proto const itself.
> 
> Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be>
> 

Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>

Thanks Kurt!

> diff --git a/include/linux/can/core.h b/include/linux/can/core.h
> index 6f70a6d..5ce6b5d 100644
> --- a/include/linux/can/core.h
> +++ b/include/linux/can/core.h
> @@ -44,8 +44,8 @@ struct can_proto {
>  
>  /* function prototypes for the CAN networklayer core (af_can.c) */
>  
> -extern int  can_proto_register(struct can_proto *cp);
> -extern void can_proto_unregister(struct can_proto *cp);
> +extern int  can_proto_register(const struct can_proto *cp);
> +extern void can_proto_unregister(const struct can_proto *cp);
>  
>  extern int  can_rx_register(struct net_device *dev, canid_t can_id,
>  			    canid_t mask,
> diff --git a/net/can/af_can.c b/net/can/af_can.c
> index a8dcaa4..5b52762 100644
> --- a/net/can/af_can.c
> +++ b/net/can/af_can.c
> @@ -84,7 +84,7 @@ static DEFINE_SPINLOCK(can_rcvlists_lock);
>  static struct kmem_cache *rcv_cache __read_mostly;
>  
>  /* table of registered CAN protocols */
> -static struct can_proto *proto_tab[CAN_NPROTO] __read_mostly;
> +static const struct can_proto *proto_tab[CAN_NPROTO] __read_mostly;
>  static DEFINE_MUTEX(proto_tab_lock);
>  
>  struct timer_list can_stattimer;   /* timer for statistics update */
> @@ -115,9 +115,9 @@ static void can_sock_destruct(struct sock *sk)
>  	skb_queue_purge(&sk->sk_receive_queue);
>  }
>  
> -static struct can_proto *can_try_module_get(int protocol)
> +static const struct can_proto *can_try_module_get(int protocol)
>  {
> -	struct can_proto *cp;
> +	const struct can_proto *cp;
>  
>  	rcu_read_lock();
>  	cp = rcu_dereference(proto_tab[protocol]);
> @@ -132,7 +132,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol,
>  		      int kern)
>  {
>  	struct sock *sk;
> -	struct can_proto *cp;
> +	const struct can_proto *cp;
>  	int err = 0;
>  
>  	sock->state = SS_UNCONNECTED;
> @@ -691,7 +691,7 @@ drop:
>   *  -EBUSY  protocol already in use
>   *  -ENOBUF if proto_register() fails
>   */
> -int can_proto_register(struct can_proto *cp)
> +int can_proto_register(const struct can_proto *cp)
>  {
>  	int proto = cp->protocol;
>  	int err = 0;
> @@ -728,7 +728,7 @@ EXPORT_SYMBOL(can_proto_register);
>   * can_proto_unregister - unregister CAN transport protocol
>   * @cp: pointer to CAN protocol structure
>   */
> -void can_proto_unregister(struct can_proto *cp)
> +void can_proto_unregister(const struct can_proto *cp)
>  {
>  	int proto = cp->protocol;
>  
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index 8a6a05e..cced806 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -1601,7 +1601,7 @@ static struct proto bcm_proto __read_mostly = {
>  	.init       = bcm_init,
>  };
>  
> -static struct can_proto bcm_can_proto __read_mostly = {
> +static const struct can_proto bcm_can_proto = {
>  	.type       = SOCK_DGRAM,
>  	.protocol   = CAN_BCM,
>  	.ops        = &bcm_ops,
> diff --git a/net/can/raw.c b/net/can/raw.c
> index 0eb39a7..dea99a6 100644
> --- a/net/can/raw.c
> +++ b/net/can/raw.c
> @@ -774,7 +774,7 @@ static struct proto raw_proto __read_mostly = {
>  	.init       = raw_init,
>  };
>  
> -static struct can_proto raw_can_proto __read_mostly = {
> +static const struct can_proto raw_can_proto = {
>  	.type       = SOCK_RAW,
>  	.protocol   = CAN_RAW,
>  	.ops        = &raw_ops,
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
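A minimal sketch of why the const conversion works: once the protocol table stores pointers to const and the register/unregister functions take const pointers, each protocol object can itself be declared const (and so placed in read-only data). The demo_* names and the -EINVAL range check are hypothetical simplifications for illustration, not the af_can.c code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct can_proto, reduced to two fields. */
struct demo_proto {
    const char *name;
    int protocol;
};

#define DEMO_NPROTO 8

/* Table of pointers to const objects, like proto_tab after the patch. */
static const struct demo_proto *proto_tab[DEMO_NPROTO];

static int demo_proto_register(const struct demo_proto *cp)
{
    int proto = cp->protocol;

    if (proto < 0 || proto >= DEMO_NPROTO)
        return -EINVAL;         /* hypothetical range check */
    if (proto_tab[proto])
        return -EBUSY;          /* protocol already in use */
    proto_tab[proto] = cp;
    return 0;
}

static void demo_proto_unregister(const struct demo_proto *cp)
{
    int proto = cp->protocol;

    if (proto >= 0 && proto < DEMO_NPROTO && proto_tab[proto] == cp)
        proto_tab[proto] = NULL;
}

/* Only compiles without casts because the API takes const pointers. */
static const struct demo_proto raw_demo = { .name = "raw", .protocol = 1 };
```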

* Re: [net-next-2.6 PATCH] can: rename can_try_module_get to can_get_proto
From: Oliver Hartkopp @ 2011-05-04 20:35 UTC (permalink / raw)
  To: socketcan-core, netdev
In-Reply-To: <20110504044204.GD278@e-circ.dyndns.org>

On 04.05.2011 06:42, Kurt Van Dijck wrote:
> can: rename can_try_module_get to can_get_proto
> 
> can_try_module_get does return a struct can_proto.
> The name explains what is done in so much detail that a caller
> may not notice that a struct can_proto is locked/unlocked.
> 
> Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be>

Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>

> 
> diff --git a/net/can/af_can.c b/net/can/af_can.c
> index 5b52762..094fc53 100644
> --- a/net/can/af_can.c
> +++ b/net/can/af_can.c
> @@ -115,7 +115,7 @@ static void can_sock_destruct(struct sock *sk)
>  	skb_queue_purge(&sk->sk_receive_queue);
>  }
>  
> -static const struct can_proto *can_try_module_get(int protocol)
> +static const struct can_proto *can_get_proto(int protocol)
>  {
>  	const struct can_proto *cp;
>  
> @@ -128,6 +128,11 @@ static const struct can_proto *can_try_module_get(int protocol)
>  	return cp;
>  }
>  
> +static inline void can_put_proto(const struct can_proto *cp)
> +{
> +	module_put(cp->prot->owner);
> +}
> +
>  static int can_create(struct net *net, struct socket *sock, int protocol,
>  		      int kern)
>  {
> @@ -143,7 +148,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol,
>  	if (!net_eq(net, &init_net))
>  		return -EAFNOSUPPORT;
>  
> -	cp = can_try_module_get(protocol);
> +	cp = can_get_proto(protocol);
>  
>  #ifdef CONFIG_MODULES
>  	if (!cp) {
> @@ -160,7 +165,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol,
>  			printk(KERN_ERR "can: request_module "
>  			       "(can-proto-%d) failed.\n", protocol);
>  
> -		cp = can_try_module_get(protocol);
> +		cp = can_get_proto(protocol);
>  	}
>  #endif
>  
> @@ -195,7 +200,7 @@ static int can_create(struct net *net, struct socket *sock, int protocol,
>  	}
>  
>   errout:
> -	module_put(cp->prot->owner);
> +	can_put_proto(cp);
>  	return err;
>  }
>  

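The rename pairs can_get_proto() with a new can_put_proto(), so every successful lookup is visibly balanced by a release. Below is a hypothetical userspace model of that pairing: the demo_owner reference count stands in for the module reference (released with module_put() in the patch, presumably taken with try_module_get() inside the lookup), and none of these names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a module: a refcount plus an "unloading" flag. */
struct demo_owner {
    int refcnt;
    int live;               /* 0 once the owner is going away */
};

struct demo_proto {
    int protocol;
    struct demo_owner *owner;
};

#define DEMO_NPROTO 4

static struct demo_owner demo_mod = { .refcnt = 0, .live = 1 };
static const struct demo_proto demo_raw = { .protocol = 1, .owner = &demo_mod };
static const struct demo_proto *demo_tab[DEMO_NPROTO] = { [1] = &demo_raw };

/* Look up a protocol and pin its owner; NULL if absent or unloading. */
static const struct demo_proto *demo_get_proto(int protocol)
{
    const struct demo_proto *cp;

    if (protocol < 0 || protocol >= DEMO_NPROTO)
        return NULL;
    cp = demo_tab[protocol];
    if (!cp || !cp->owner->live)
        return NULL;
    cp->owner->refcnt++;    /* held until demo_put_proto() */
    return cp;
}

/* Release the reference taken by a successful demo_get_proto(). */
static void demo_put_proto(const struct demo_proto *cp)
{
    cp->owner->refcnt--;
}
```

The get/put naming makes an unbalanced path easier to spot in review than a bare module_put() on a pointer obtained from a differently named helper.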

* [PATCH 00/18] virtio and vhost-net performance enhancements
From: Michael S. Tsirkin @ 2011-05-04 20:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero

OK, here's a large patchset that implements the virtio spec update that I
sent earlier. It supersedes the PUBLISH_USED_IDX patches
I sent out earlier.

I know it's a lot to ask but please test, and please consider for 2.6.40 :)

I see nice performance improvements: one run showed host-to-guest
netperf throughput going from 12 to 18 Gbit/s. I did not spend a lot
of time testing performance, though, so there is no guarantee it's not
a fluke; I hope others will try this out and report.
Please note I will be away from the keyboard for the next week.

Essentially, we change the virtio ring notification
hand-off to work like the one in Xen:
each side publishes an event index, and the other side
notifies when it reaches that value.
The one difference is that the event index starts at 0,
same as the request index (in Xen, the event index starts at 1).

Each side of the handoff has a feature bit independent
of the other one, so we can have e.g. interrupts
handled in the new way and exits in the old one.

This is actually what made the patchset larger:
we ran out of feature bits, so I had to add more.
I tested various combinations of hosts and guests, and
this code seems to be solid.

With the indexes in place, it becomes possible to request an event after
many requests (and not just on the next one, as done now). This should fix
the TX queue overrun, which currently triggers a storm of interrupts.
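To illustrate requesting an event after many requests rather than on the next one, here is a hypothetical simulation: the guest publishes used_event = last_seen + batch - 1, the device checks it with the same wrap-safe test as vring_need_event(), and the guest re-arms after each interrupt. count_interrupts() and its re-arming policy are assumptions for illustration, not code from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe event check: nonzero when event_idx is in [old_idx, new_idx). */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/*
 * Device completes `total` buffers one at a time, starting at used index
 * `start`. The guest asks to be interrupted only after `batch` completions
 * and re-arms the same way each time its handler runs. Returns the number
 * of interrupts the device would raise.
 */
static unsigned count_interrupts(uint16_t start, unsigned total, uint16_t batch)
{
    uint16_t used_idx = start;
    uint16_t used_event = (uint16_t)(start + batch - 1);
    unsigned interrupts = 0;

    for (unsigned i = 0; i < total; i++) {
        uint16_t old_idx = used_idx;

        used_idx++;                       /* device publishes one entry */
        if (vring_need_event(used_event, used_idx, old_idx)) {
            interrupts++;
            /* handler consumed everything; ask again after `batch` more */
            used_event = (uint16_t)(used_idx + batch - 1);
        }
    }
    return interrupts;
}
```

With batch 32, 64 completions raise 2 interrupts instead of 64, and the count stays the same when the used index wraps from 65535 to 0 in the middle of the run.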

The patches are mostly independent and can also be cherry-picked,
though hopefully there won't be much need for that.

One dependency I'd like to note is on two cleanup patches:
the patch removing batching of available index updates
and the patch fixing ring capability checks in virtio-net.
These simplified the code a bit and made the following patches simpler.

I could untangle the dependency but prefer to keep it as is.

The patchset is on top of net-next, which at the time
I last rebased was at 15ecd03, roughly 2.6.39-rc2.

qemu patch will follow shortly.

Rusty, assuming this gets that far, I think it will be easier to
merge the vhost and virtio bits in one go. Everything can go in through
your tree (Dave acked a very similar patch in the past, so that should
not be a problem), or from me to Dave Miller.

Michael S. Tsirkin (17):
  virtio: 64 bit features
  virtio_test: update for 64 bit features
  vhost: fix 64 bit features
  virtio: don't delay avail index update
  virtio: used event index interface
  virtio_ring: avail event index interface
  virtio ring: inline function to check for events
  virtio_ring: support for used_event idx feature
  virtio: use avail_event index
  vhost: utilize used_event index
  vhost: support avail_event idx
  virtio_test: support used_event index
  virtio_test: avail_event index support
  virtio: add api for delayed callbacks
  virtio_net: delay TX callbacks
  virtio_net: fix TX capacity checks using new API
  virtio_net: limit xmit polling

Shirley Ma (1):
  virtio_ring: Add capacity check API

 drivers/lguest/lguest_device.c |    8 +-
 drivers/net/virtio_net.c       |   25 ++++---
 drivers/s390/kvm/kvm_virtio.c  |    8 +-
 drivers/vhost/net.c            |   12 ++--
 drivers/vhost/test.c           |    6 +-
 drivers/vhost/vhost.c          |  139 ++++++++++++++++++++++++++++++----------
 drivers/vhost/vhost.h          |   30 ++++++---
 drivers/virtio/virtio.c        |    8 +-
 drivers/virtio/virtio_pci.c    |   34 ++++++++--
 drivers/virtio/virtio_ring.c   |  105 +++++++++++++++++++++++++++---
 include/linux/virtio.h         |   16 ++++-
 include/linux/virtio_config.h  |   15 +++--
 include/linux/virtio_pci.h     |    9 ++-
 include/linux/virtio_ring.h    |   30 ++++++++-
 tools/virtio/virtio_test.c     |   39 ++++++++++-
 15 files changed, 377 insertions(+), 107 deletions(-)

-- 
1.7.5.53.gc233e

* [PATCH 01/18] virtio: 64 bit features
From: Michael S. Tsirkin @ 2011-05-04 20:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Extend features to 64 bit so we can use more
transport bits.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/lguest/lguest_device.c |    8 ++++----
 drivers/s390/kvm/kvm_virtio.c  |    8 ++++----
 drivers/virtio/virtio.c        |    8 ++++----
 drivers/virtio/virtio_pci.c    |   34 ++++++++++++++++++++++++++++------
 drivers/virtio/virtio_ring.c   |    2 ++
 include/linux/virtio.h         |    2 +-
 include/linux/virtio_config.h  |   15 +++++++++------
 include/linux/virtio_pci.h     |    9 ++++++++-
 8 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index 69c84a1..d2d6953 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -93,17 +93,17 @@ static unsigned desc_size(const struct lguest_device_desc *desc)
 }
 
 /* This gets the device's feature bits. */
-static u32 lg_get_features(struct virtio_device *vdev)
+static u64 lg_get_features(struct virtio_device *vdev)
 {
 	unsigned int i;
-	u32 features = 0;
+	u64 features = 0;
 	struct lguest_device_desc *desc = to_lgdev(vdev)->desc;
 	u8 *in_features = lg_features(desc);
 
 	/* We do this the slow but generic way. */
-	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
+	for (i = 0; i < min(desc->feature_len * 8, 64); i++)
 		if (in_features[i / 8] & (1 << (i % 8)))
-			features |= (1 << i);
+			features |= (1ull << i);
 
 	return features;
 }
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index 414427d..c56293c 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -79,16 +79,16 @@ static unsigned desc_size(const struct kvm_device_desc *desc)
 }
 
 /* This gets the device's feature bits. */
-static u32 kvm_get_features(struct virtio_device *vdev)
+static u64 kvm_get_features(struct virtio_device *vdev)
 {
 	unsigned int i;
-	u32 features = 0;
+	u64 features = 0;
 	struct kvm_device_desc *desc = to_kvmdev(vdev)->desc;
 	u8 *in_features = kvm_vq_features(desc);
 
-	for (i = 0; i < min(desc->feature_len * 8, 32); i++)
+	for (i = 0; i < min(desc->feature_len * 8, 64); i++)
 		if (in_features[i / 8] & (1 << (i % 8)))
-			features |= (1 << i);
+			features |= (1ull << i);
 	return features;
 }
 
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index efb35aa..52b24d7 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -112,7 +112,7 @@ static int virtio_dev_probe(struct device *_d)
 	struct virtio_device *dev = container_of(_d,struct virtio_device,dev);
 	struct virtio_driver *drv = container_of(dev->dev.driver,
 						 struct virtio_driver, driver);
-	u32 device_features;
+	u64 device_features;
 
 	/* We have a driver! */
 	add_status(dev, VIRTIO_CONFIG_S_DRIVER);
@@ -124,14 +124,14 @@ static int virtio_dev_probe(struct device *_d)
 	memset(dev->features, 0, sizeof(dev->features));
 	for (i = 0; i < drv->feature_table_size; i++) {
 		unsigned int f = drv->feature_table[i];
-		BUG_ON(f >= 32);
-		if (device_features & (1 << f))
+		BUG_ON(f >= 64);
+		if (device_features & (1ull << f))
 			set_bit(f, dev->features);
 	}
 
 	/* Transport features always preserved to pass to finalize_features. */
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++)
-		if (device_features & (1 << i))
+		if (device_features & (1ull << i))
 			set_bit(i, dev->features);
 
 	dev->config->finalize_features(dev);
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 4fb5b2b..04b216f 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -44,6 +44,8 @@ struct virtio_pci_device
 	spinlock_t lock;
 	struct list_head virtqueues;
 
+	/* 64 bit features */
+	int features_hi;
 	/* MSI-X support */
 	int msix_enabled;
 	int intx_enabled;
@@ -103,26 +105,46 @@ static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
 }
 
 /* virtio config->get_features() implementation */
-static u32 vp_get_features(struct virtio_device *vdev)
+static u64 vp_get_features(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	u32 flo, fhi;
 
-	/* When someone needs more than 32 feature bits, we'll need to
+	/* When someone needs more than 32 feature bits, we need to
 	 * steal a bit to indicate that the rest are somewhere else. */
-	return ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES);
+	flo = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES);
+	if (flo & (0x1 << VIRTIO_F_FEATURES_HI)) {
+		vp_dev->features_hi = 1;
+		iowrite32(0x1 << VIRTIO_F_FEATURES_HI,
+			  vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES);
+		fhi = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES_HI);
+	} else {
+		vp_dev->features_hi = 0;
+		fhi = 0;
+	}
+	return (((u64)fhi) << 32) | flo;
 }
 
 /* virtio config->finalize_features() implementation */
 static void vp_finalize_features(struct virtio_device *vdev)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	u32 flo, fhi;
 
 	/* Give virtio_ring a chance to accept features. */
 	vring_transport_features(vdev);
 
-	/* We only support 32 feature bits. */
-	BUILD_BUG_ON(ARRAY_SIZE(vdev->features) != 1);
-	iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
+	/* We only support 64 feature bits. */
+	BUILD_BUG_ON(ARRAY_SIZE(vdev->features) != 64 / BITS_PER_LONG);
+	flo = vdev->features[0];
+	fhi = vdev->features[64 / BITS_PER_LONG - 1] >> (BITS_PER_LONG - 32);
+	iowrite32(flo, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES);
+	if (flo & (0x1 << VIRTIO_F_FEATURES_HI)) {
+		vp_dev->features_hi = 1;
+		iowrite32(fhi, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES_HI);
+	} else {
+		vp_dev->features_hi = 0;
+	}
 }
 
 /* virtio config->get() implementation */
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index cc2f73e..059e02d 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -469,6 +469,8 @@ void vring_transport_features(struct virtio_device *vdev)
 
 	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
 		switch (i) {
+		case VIRTIO_F_FEATURES_HI:
+			break;
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
 		default:
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index aff5b4f..718336b 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -105,7 +105,7 @@ struct virtio_device {
 	struct virtio_config_ops *config;
 	struct list_head vqs;
 	/* Note that this is a Linux set_bit-style bitmap. */
-	unsigned long features[1];
+	unsigned long features[64 / BITS_PER_LONG];
 	void *priv;
 };
 
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 800617b..b1a1981 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -18,16 +18,19 @@
 /* We've given up on this device. */
 #define VIRTIO_CONFIG_S_FAILED		0x80
 
-/* Some virtio feature bits (currently bits 28 through 31) are reserved for the
+/* Some virtio feature bits (currently bits 28 through 39) are reserved for the
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START	28
-#define VIRTIO_TRANSPORT_F_END		32
+#define VIRTIO_TRANSPORT_F_END		40
 
 /* Do we get callbacks when the ring is completely used, even if we've
  * suppressed them? */
 #define VIRTIO_F_NOTIFY_ON_EMPTY	24
 
+/* Enables feature bits 32 to 63 (only really required for virtio_pci). */
+#define VIRTIO_F_FEATURES_HI		31
+
 #ifdef __KERNEL__
 #include <linux/err.h>
 #include <linux/virtio.h>
@@ -72,7 +75,7 @@
  * @del_vqs: free virtqueues found by find_vqs().
  * @get_features: get the array of feature bits for this device.
  *	vdev: the virtio_device
- *	Returns the first 32 feature bits (all we currently need).
+ *	Returns the first 64 feature bits (all we currently need).
  * @finalize_features: confirm what device features we'll be using.
  *	vdev: the virtio_device
  *	This gives the final feature bits for the device: it can change
@@ -92,7 +95,7 @@ struct virtio_config_ops {
 			vq_callback_t *callbacks[],
 			const char *names[]);
 	void (*del_vqs)(struct virtio_device *);
-	u32 (*get_features)(struct virtio_device *vdev);
+	u64 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 };
 
@@ -110,9 +113,9 @@ static inline bool virtio_has_feature(const struct virtio_device *vdev,
 {
 	/* Did you forget to fix assumptions on max features? */
 	if (__builtin_constant_p(fbit))
-		BUILD_BUG_ON(fbit >= 32);
+		BUILD_BUG_ON(fbit >= 64);
 	else
-		BUG_ON(fbit >= 32);
+		BUG_ON(fbit >= 64);
 
 	if (fbit < VIRTIO_TRANSPORT_F_START)
 		virtio_check_driver_offered_feature(vdev, fbit);
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index 9a3d7c4..90f9725 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -55,9 +55,16 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+/* An extended 32-bit r/o bitmask of the features supported by the host */
+#define VIRTIO_PCI_HOST_FEATURES_HI	24
+
+/* An extended 32-bit r/w bitmask of features activated by the guest */
+#define VIRTIO_PCI_GUEST_FEATURES_HI	28
+
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
-#define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#define VIRTIO_PCI_CONFIG(dev)		((dev)->features_hi ? 32 : \
+						(dev)->msix_enabled ? 24 : 20)
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION		0
-- 
1.7.5.53.gc233e
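The two recurring changes in this patch, switching shifts to 1ull and splitting a 64-bit feature value into two 32-bit words, can be sketched in isolation. parse_features() mirrors the byte-array loop in lg_get_features()/kvm_get_features(); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bit i of the feature set lives in byte i / 8, position i % 8, capped at
 * 64 bits. The shift needs a 64-bit constant: with a 32-bit int, (1 << i)
 * is undefined for i >= 31, which is why the patch switches to 1ull.
 */
static uint64_t parse_features(const uint8_t *bytes, unsigned feature_len)
{
    uint64_t features = 0;
    unsigned nbits = feature_len * 8 < 64 ? feature_len * 8 : 64;

    for (unsigned i = 0; i < nbits; i++)
        if (bytes[i / 8] & (1u << (i % 8)))
            features |= 1ull << i;
    return features;
}

/* Split/join for the two 32-bit registers virtio_pci uses in this patch
 * (VIRTIO_PCI_HOST_FEATURES and VIRTIO_PCI_HOST_FEATURES_HI). */
static uint32_t features_lo(uint64_t f) { return (uint32_t)f; }
static uint32_t features_hi(uint64_t f) { return (uint32_t)(f >> 32); }

static uint64_t features_join(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

features_join() has the same shape as the `(((u64)fhi) << 32) | flo` expression in vp_get_features(); the cast before shifting is what keeps the high word from being lost.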

* [PATCH 02/18] virtio_test: update for 64 bit features
From: Michael S. Tsirkin @ 2011-05-04 20:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Extend the virtio_test tool so it can work with
64 bit features.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/virtio_test.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index df0c6d2..9e65e6d 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -55,7 +55,6 @@ void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
 {
 	struct vhost_vring_state state = { .index = info->idx };
 	struct vhost_vring_file file = { .index = info->idx };
-	unsigned long long features = dev->vdev.features[0];
 	struct vhost_vring_addr addr = {
 		.index = info->idx,
 		.desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
@@ -63,6 +62,10 @@ void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
 		.used_user_addr = (uint64_t)(unsigned long)info->vring.used,
 	};
 	int r;
+	unsigned long long features = dev->vdev.features[0];
+	if (sizeof features > sizeof dev->vdev.features[0])
+		features |= ((unsigned long long)dev->vdev.features[1]) << 32;
+
 	r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
 	assert(r >= 0);
 	state.num = info->vring.num;
@@ -107,7 +110,8 @@ static void vdev_info_init(struct vdev_info* dev, unsigned long long features)
 	int r;
 	memset(dev, 0, sizeof *dev);
 	dev->vdev.features[0] = features;
-	dev->vdev.features[1] = features >> 32;
+	if (sizeof features > sizeof dev->vdev.features[0])
+		dev->vdev.features[1] = features >> 32;
 	dev->buf_size = 1024;
 	dev->buf = malloc(dev->buf_size);
 	assert(dev->buf);
-- 
1.7.5.53.gc233e
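This patch (like vp_finalize_features() in patch 01) has to move between the unsigned long set_bit()-style bitmap and a single 64-bit value, whether unsigned long is 32 or 64 bits wide. A minimal sketch under that assumption, with hypothetical helper names: bit n lives in word n / BITS_PER_LONG, so on ILP32 the high half sits in features[1], while on LP64 everything fits in features[0].

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Width of unsigned long: 32 on ILP32, 64 on LP64. */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Words needed for 64 feature bits, as in the patched struct virtio_device. */
#define DEMO_FEATURE_WORDS (64 / DEMO_BITS_PER_LONG)

/* Test bit n of the bitmap (bit n in word n / BITS_PER_LONG). */
static int demo_test_bit(unsigned n, const unsigned long *map)
{
    return (map[n / DEMO_BITS_PER_LONG] >> (n % DEMO_BITS_PER_LONG)) & 1UL;
}

/* Pack the bitmap into one 64-bit value. */
static uint64_t bitmap_to_u64(const unsigned long *map)
{
    uint64_t v = map[0];              /* all bits on LP64, low half on ILP32 */

    if (DEMO_FEATURE_WORDS > 1)
        v |= (uint64_t)map[1] << 32;  /* ILP32 only: high half */
    return v;
}

/* Unpack a 64-bit value into the bitmap. */
static void u64_to_bitmap(uint64_t v, unsigned long *map)
{
    map[0] = (unsigned long)v;
    if (DEMO_FEATURE_WORDS > 1)
        map[1] = (unsigned long)(v >> 32);
}
```

The `DEMO_FEATURE_WORDS > 1` test plays the same role as the `sizeof features > sizeof dev->vdev.features[0]` checks in the patch: the second word is only touched where it exists.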

* [PATCH 03/18] vhost: fix 64 bit features
From: Michael S. Tsirkin @ 2011-05-04 20:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Update vhost_has_feature to make it work correctly for bit >= 32.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..0f1bf33 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -117,7 +117,7 @@ struct vhost_dev {
 	struct vhost_memory __rcu *memory;
 	struct mm_struct *mm;
 	struct mutex mutex;
-	unsigned acked_features;
+	u64 acked_features;
 	struct vhost_virtqueue *vqs;
 	int nvqs;
 	struct file *log_file;
@@ -169,14 +169,14 @@ enum {
 			 (1 << VIRTIO_NET_F_MRG_RXBUF),
 };
 
-static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
+static inline bool vhost_has_feature(struct vhost_dev *dev, int bit)
 {
-	unsigned acked_features;
+	u64 acked_features;
 
 	/* TODO: check that we are running from vhost_worker or dev mutex is
 	 * held? */
 	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
-	return acked_features & (1 << bit);
+	return acked_features & (1ull << bit);
 }
 
 #endif
-- 
1.7.5.53.gc233e
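Widening acked_features to u64 would not be enough on its own if vhost_has_feature() kept returning int: the masked 64-bit value would be converted to int on return, losing any bit at position 32 or above. A small sketch with hypothetical function names shows why the return type also moves to bool. The int result for high bits is implementation-defined; GCC and Clang keep the low 32 bits, which gives 0 here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: u64 field but int return type. Bits >= 32 do not survive
 * the conversion to int (implementation-defined; 0 on GCC and Clang). */
static int has_feature_as_int(uint64_t acked, int bit)
{
    return acked & (1ull << bit);
}

/* As in the patch: converting to bool yields true for any nonzero value. */
static bool has_feature_as_bool(uint64_t acked, int bit)
{
    return acked & (1ull << bit);
}
```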

* [PATCH 04/18] virtio: don't delay avail index update
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Update avail index immediately instead of upon kick:
for virtio-net RX this helps parallelism with the host.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 059e02d..507d6eb 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -86,8 +86,6 @@ struct vring_virtqueue
 	unsigned int num_free;
 	/* Head of free buffer list. */
 	unsigned int free_head;
-	/* Number we've added since last sync. */
-	unsigned int num_added;
 
 	/* Last used index we've seen. */
 	u16 last_used_idx;
@@ -224,8 +222,12 @@ add_head:
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync).  FIXME: avoid modulus here? */
-	avail = (vq->vring.avail->idx + vq->num_added++) % vq->vring.num;
+	avail = vq->vring.avail->idx % vq->vring.num;
 	vq->vring.avail->ring[avail] = head;
+	/* Descriptors and available array need to be set before we expose the
+	 * new available array entries. */
+	virtio_wmb();
+	vq->vring.avail->idx++;
 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
@@ -238,12 +240,6 @@ void virtqueue_kick(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	START_USE(vq);
-	/* Descriptors and available array need to be set before we expose the
-	 * new available array entries. */
-	virtio_wmb();
-
-	vq->vring.avail->idx += vq->num_added;
-	vq->num_added = 0;
 
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
@@ -430,7 +426,6 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	vq->notify = notify;
 	vq->broken = false;
 	vq->last_used_idx = 0;
-	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
 	vq->in_use = false;
-- 
1.7.5.53.gc233e
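
The ordering that this patch moves into virtqueue_add_buf_gfp() can be sketched in userspace C11. struct toy_avail, TOY_RING_NUM, toy_init and toy_publish are hypothetical names. A release store stands in for virtio_wmb() followed by the idx increment. The only point illustrated is the order: fill the ring slot first, then make the new index visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_RING_NUM 8	/* hypothetical ring size */
static_assert((TOY_RING_NUM & (TOY_RING_NUM - 1)) == 0,
	      "ring size must be a power of two");

/* Hypothetical stand-in for the avail ring shared with the host. */
struct toy_avail {
	_Atomic uint16_t idx;
	uint16_t ring[TOY_RING_NUM];
};

static void toy_init(struct toy_avail *avail)
{
	atomic_init(&avail->idx, 0);
}

/* Publish one descriptor head immediately: write the ring slot first,
 * then release-store the incremented idx, so a reader that observes the
 * new idx with an acquire load also observes the slot contents. */
static void toy_publish(struct toy_avail *avail, uint16_t head)
{
	/* Only the producer writes idx, so a relaxed load is enough here. */
	uint16_t idx = atomic_load_explicit(&avail->idx, memory_order_relaxed);

	avail->ring[idx % TOY_RING_NUM] = head;
	atomic_store_explicit(&avail->idx, (uint16_t)(idx + 1),
			      memory_order_release);
}
```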

* [PATCH 05/18] virtio: used event index interface
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Define a new feature bit for the guest to utilize a used_event index
(like Xen) instead of a flag bit to enable/disable interrupts.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_ring.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index e4d144b..f5c1b75 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -29,6 +29,10 @@
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28
 
+/* The Guest publishes the used index for which it expects an interrupt
+ * at the end of the avail ring. Host should ignore the avail->flags field. */
+#define VIRTIO_RING_F_USED_EVENT_IDX	29
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
 	/* Address (guest-physical). */
@@ -83,6 +87,7 @@ struct vring {
  *	__u16 avail_flags;
  *	__u16 avail_idx;
  *	__u16 available[num];
+ *	__u16 used_event_idx;
  *
  *	// Padding to the next align boundary.
  *	char pad[];
@@ -93,6 +98,10 @@ struct vring {
  *	struct vring_used_elem used[num];
  * };
  */
+/* We publish the used event index at the end of the available ring.
+ * It is at the end for backwards compatibility. */
+#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
+
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 			      unsigned long align)
 {
-- 
1.7.5.53.gc233e
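
This sketch shows where used_event lives. It assumes a userspace copy of the avail ring header. struct toy_vring_avail, toy_used_event, toy_avail_alloc and toy_used_event_offset are hypothetical names. Like vring_used_event(), the field is just ring slot num, so the ring must be allocated with one extra __u16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace copy of the avail ring header. */
struct toy_vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[];
};

/* Same trick as vring_used_event(): the extra field is slot num of the
 * ring, appended after the existing layout. */
#define toy_used_event(avail, num) ((avail)->ring[(num)])

/* Allocate an avail ring with room for the trailing used_event slot. */
static struct toy_vring_avail *toy_avail_alloc(unsigned int num)
{
	assert(num > 0);
	return calloc(1, sizeof(struct toy_vring_avail) +
			 (num + 1) * sizeof(uint16_t));
}

/* Byte offset of used_event from the start of the avail ring. */
static size_t toy_used_event_offset(unsigned int num)
{
	return offsetof(struct toy_vring_avail, ring) + num * sizeof(uint16_t);
}
```

For num = 256 the used_event slot sits 4 + 2 * 256 = 516 bytes into the avail ring.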

* [PATCH 06/18] virtio_ring: avail event index interface
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Define a new feature bit for the host to
declare that it uses an avail_event index
(like Xen) instead of a flag bit
to enable/disable notifications.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_ring.h |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index f5c1b75..f791772 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -32,6 +32,9 @@
 /* The Guest publishes the used index for which it expects an interrupt
  * at the end of the avail ring. Host should ignore the avail->flags field. */
 #define VIRTIO_RING_F_USED_EVENT_IDX	29
+/* The Host publishes the avail index for which it expects a kick
+ * at the end of the used ring. Guest should ignore the used->flags field. */
+#define VIRTIO_RING_F_AVAIL_EVENT_IDX	32
 
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
@@ -96,11 +99,13 @@ struct vring {
  *	__u16 used_flags;
  *	__u16 used_idx;
  *	struct vring_used_elem used[num];
+ *	__u16 avail_event_idx;
  * };
  */
-/* We publish the used event index at the end of the available ring.
- * It is at the end for backwards compatibility. */
+/* We publish the used event index at the end of the available ring, and vice
+ * versa. They are at the end for backwards compatibility. */
 #define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
+#define vring_avail_event(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
 
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 			      unsigned long align)
@@ -116,7 +121,7 @@ static inline unsigned vring_size(unsigned int num, unsigned long align)
 {
 	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
 		 + align - 1) & ~(align - 1))
-		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
+		+ sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
 }
 
 #ifdef __KERNEL__
-- 
1.7.5.53.gc233e
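
Here is a worked instance of the vring_size() change. It uses hypothetical local copies of the element types (16-byte descriptors, 8-byte used elements). For num = 256 and align = 4096, the descriptor table is 4096 bytes and the avail ring is 2 * (2 + 256) = 516 bytes, so the first part rounds up to 8192. The used ring adds 3 * 2 + 8 * 256 = 2054 bytes, for 10246 in total (10244 with the previous "* 2").

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local copies of the on-ring element types. */
struct toy_vring_desc {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};			/* 16 bytes */

struct toy_vring_used_elem {
	uint32_t id;
	uint32_t len;
};			/* 8 bytes */

/* Same arithmetic as vring_size() after this patch: descriptor table
 * plus avail ring, rounded up to align, then the used ring with three
 * 16-bit fields (flags, idx, avail_event) instead of two. */
static unsigned toy_vring_size(unsigned int num, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
	return ((sizeof(struct toy_vring_desc) * num +
		 sizeof(uint16_t) * (2 + num) + align - 1) & ~(align - 1))
		+ sizeof(uint16_t) * 3 +
		sizeof(struct toy_vring_used_elem) * num;
}
```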

* [PATCH 07/18] virtio ring: inline function to check for events
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rusty Russell, Carsten Otte, Christian Borntraeger, linux390,
	Martin Schwidefsky, Heiko Carstens, Shirley Ma, lguest,
	linux-kernel, virtualization, netdev, linux-s390, kvm,
	Krishna Kumar, Tom Lendacky, steved, habanero
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

With the new used_event and avail_event features, both
host and guest need similar logic to check whether events are
enabled, so it helps to put the common code in the header.

Note that Xen has similar logic for notification hold-off
in include/xen/interface/io/ring.h with req_event and req_prod
corresponding to event_idx + 1 and new_idx respectively.
The +1 comes from the fact that req_event and req_prod in Xen start
at 1, while event indexes in virtio start at 0.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/virtio_ring.h |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index f791772..2a3b0ea 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -124,6 +124,20 @@ static inline unsigned vring_size(unsigned int num, unsigned long align)
 		+ sizeof(__u16) * 3 + sizeof(struct vring_used_elem) * num;
 }
 
+/* The following is used with USED_EVENT_IDX and AVAIL_EVENT_IDX */
+/* Assuming a given event_idx value from the other side, if
+ * we have just incremented index from old to new_idx,
+ * should we trigger an event? */
+static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
+{
+	/* Note: Xen has similar logic for notification hold-off
+	 * in include/xen/interface/io/ring.h with req_event and req_prod
+	 * corresponding to event_idx + 1 and new_idx respectively.
+	 * Note also that req_event and req_prod in Xen start at 1,
+	 * event indexes in virtio start at 0. */
+	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
+}
+
 #ifdef __KERNEL__
 #include <linux/irqreturn.h>
 struct virtio_device;
-- 
1.7.5.53.gc233e
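
vring_need_event() asks whether event_idx falls in the window [old, new_idx) of entries just published, using 16-bit wrap-around arithmetic. Below is a userspace copy under a hypothetical name, toy_need_event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True iff event_idx is one of the indexes old, old + 1, ..., new_idx - 1
 * (mod 2^16), i.e. the other side asked to be told about an entry that
 * was just published. */
static bool toy_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}
```

For example, toy_need_event(5, 6, 5) is true because the requested entry was just published, and toy_need_event(10, 6, 5) is false. toy_need_event(65535, 1, 65534) is true because the window {65534, 65535, 0} contains 65535.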

* [PATCH 08/18] virtio_ring: support for used_event idx feature
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Add support for the used_event idx feature: when enabling
interrupts, publish the current used index (last_used_idx) to
the host so that we get an interrupt on the next update.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 507d6eb..3a3ed75 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -320,6 +320,14 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	ret = vq->data[i];
 	detach_buf(vq, i);
 	vq->last_used_idx++;
+	/* If we expect an interrupt for the next entry, tell host
+	 * by writing event index and flush out the write before
+	 * the read in the next get_buf call. */
+	if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
+		vring_used_event(&vq->vring) = vq->last_used_idx;
+		virtio_mb();
+	}
+
 	END_USE(vq);
 	return ret;
 }
@@ -341,7 +349,11 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
+	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+	 * either clear the flags bit or point the event index at the next
+	 * entry. Always do both to keep code simple. */
 	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
+	vring_used_event(&vq->vring) = vq->last_used_idx;
 	virtio_mb();
 	if (unlikely(more_used(vq))) {
 		END_USE(vq);
@@ -468,6 +480,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
+		case VIRTIO_RING_F_USED_EVENT_IDX:
+			break;
 		default:
 			/* We don't understand this bit. */
 			clear_bit(i, vdev->features);
-- 
1.7.5.53.gc233e
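
The following simulation shows how the guest's published used_event suppresses host interrupts. It assumes a hypothetical shared struct toy_shared and helpers toy_guest_rearm and toy_host_add_used. It is single-threaded and has no barriers, whereas the real code needs the virtio_mb() above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool toy_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Hypothetical shared state. */
struct toy_shared {
	uint16_t used_event;	/* written by the guest */
	uint16_t used_idx;	/* written by the host */
};

/* Guest side: after consuming up to last_used_idx, ask for an interrupt
 * when the host publishes that entry. */
static void toy_guest_rearm(struct toy_shared *s, uint16_t last_used_idx)
{
	s->used_event = last_used_idx;
}

/* Host side: add one used entry and report whether to interrupt. */
static bool toy_host_add_used(struct toy_shared *s)
{
	uint16_t old = s->used_idx;
	uint16_t new_idx = ++s->used_idx;

	return toy_need_event(s->used_event, new_idx, old);
}
```

With used_idx = 3 and used_event re-armed to 3, the first toy_host_add_used() call returns true and the second returns false, because the guest has not asked for another interrupt yet.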

* [PATCH 09/18] virtio: use avail_event index
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Use the new avail_event feature to reduce the number
of exits from the guest.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_ring.c |   39 ++++++++++++++++++++++++++++++++++++++-
 1 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3a3ed75..262dfe6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -82,6 +82,15 @@ struct vring_virtqueue
 	/* Host supports indirect buffers */
 	bool indirect;
 
+	/* Host publishes avail event idx */
+	bool event;
+
+	/* Is kicked_avail below valid? */
+	bool kicked_avail_valid;
+
+	/* avail idx value we already kicked. */
+	u16 kicked_avail;
+
 	/* Number of free buffers */
 	unsigned int num_free;
 	/* Head of free buffer list. */
@@ -228,6 +237,12 @@ add_head:
 	 * new available array entries. */
 	virtio_wmb();
 	vq->vring.avail->idx++;
+	/* If the driver never bothers to kick in a very long while,
+	 * avail index might wrap around. If that happens, invalidate
+	 * kicked_avail index we stored. TODO: make sure all drivers
+	 * kick at least once in 2^16 and remove this. */
+	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
+		vq->kicked_avail_valid = false;
 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
@@ -236,6 +251,23 @@ add_head:
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
+
+static bool vring_notify(struct vring_virtqueue *vq)
+{
+	u16 old, new;
+	bool v;
+	if (!vq->event)
+		return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
+
+	v = vq->kicked_avail_valid;
+	old = vq->kicked_avail;
+	new = vq->kicked_avail = vq->vring.avail->idx;
+	vq->kicked_avail_valid = true;
+	if (unlikely(!v))
+		return true;
+	return vring_need_event(vring_avail_event(&vq->vring), new, old);
+}
+
 void virtqueue_kick(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
@@ -244,7 +276,7 @@ void virtqueue_kick(struct virtqueue *_vq)
 	/* Need to update avail index before checking if we should notify */
 	virtio_mb();
 
-	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
+	if (vring_notify(vq))
 		/* Prod other side to tell it about changes. */
 		vq->notify(&vq->vq);
 
@@ -437,6 +469,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
+	vq->kicked_avail_valid = false;
+	vq->kicked_avail = 0;
 	vq->last_used_idx = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -444,6 +478,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 #endif
 
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_AVAIL_EVENT_IDX);
 
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
@@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
 			break;
 		case VIRTIO_RING_F_USED_EVENT_IDX:
 			break;
+		case VIRTIO_RING_F_AVAIL_EVENT_IDX:
+			break;
 		default:
 			/* We don't understand this bit. */
 			clear_bit(i, vdev->features);
-- 
1.7.5.53.gc233e
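
The guest-side kick decision in vring_notify() can be sketched the same way. struct toy_vq and toy_should_kick are hypothetical names, and the event index feature is assumed to be negotiated. The first kick is unconditional because there is no previous value to compare against. After that, a kick is sent only if the host's avail_event lies among the entries added since the last kick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool toy_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Hypothetical guest-side bookkeeping mirroring the new vq fields. */
struct toy_vq {
	uint16_t avail_idx;		/* guest's published avail->idx */
	uint16_t avail_event;		/* host's published avail_event */
	uint16_t kicked_avail;		/* avail_idx at the last kick check */
	bool kicked_avail_valid;
};

static bool toy_should_kick(struct toy_vq *vq)
{
	bool valid = vq->kicked_avail_valid;
	uint16_t old = vq->kicked_avail;
	uint16_t new_idx = vq->avail_idx;

	vq->kicked_avail = new_idx;
	vq->kicked_avail_valid = true;
	if (!valid)
		return true;	/* nothing to compare against: kick */
	return toy_need_event(vq->avail_event, new_idx, old);
}
```

The sequence runs as follows. The guest adds 2 buffers and kicks (true). The host re-arms with avail_event = 2. One more buffer is added and a kick is sent (true). Another is added before the host re-arms, so no kick is sent (false).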

* [PATCH 10/18] vhost: utilize used_event index
From: Michael S. Tsirkin @ 2011-05-04 20:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Support the new used_event index. When acked,
use it to reduce the number of interrupts sent to the guest.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c |   74 +++++++++++++++++++++++++++++++++++++------------
 drivers/vhost/vhost.h |    7 ++++
 2 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..e33d5a3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,6 +37,8 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
+#define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
+
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -161,6 +163,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->last_avail_idx = 0;
 	vq->avail_idx = 0;
 	vq->last_used_idx = 0;
+	vq->signalled_used = 0;
+	vq->signalled_used_valid = false;
 	vq->used_flags = 0;
 	vq->log_used = false;
 	vq->log_addr = -1ull;
@@ -489,14 +493,15 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 	return 1;
 }
 
-static int vq_access_ok(unsigned int num,
+static int vq_access_ok(struct vhost_dev *d, unsigned int num,
 			struct vring_desc __user *desc,
 			struct vring_avail __user *avail,
 			struct vring_used __user *used)
 {
+	size_t sa = vhost_has_feature(d, VIRTIO_RING_F_USED_EVENT_IDX) ? 2 : 0;
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
-			 sizeof *avail + num * sizeof *avail->ring) &&
+			 sizeof *avail + num * sizeof *avail->ring + sa) &&
 	       access_ok(VERIFY_WRITE, used,
 			sizeof *used + num * sizeof *used->ring);
 }
@@ -531,7 +536,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-	return vq_access_ok(vq->num, vq->desc, vq->avail, vq->used) &&
+	return vq_access_ok(vq->dev, vq->num, vq->desc, vq->avail, vq->used) &&
 		vq_log_access_ok(vq, vq->log_base);
 }
 
@@ -577,6 +582,7 @@ static int init_used(struct vhost_virtqueue *vq,
 
 	if (r)
 		return r;
+	vq->signalled_used_valid = false;
 	return get_user(vq->last_used_idx, &used->idx);
 }
 
@@ -674,7 +680,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		 * If it is not, we don't as size might not have been setup.
 		 * We will verify when backend is configured. */
 		if (vq->private_data) {
-			if (!vq_access_ok(vq->num,
+			if (!vq_access_ok(d, vq->num,
 				(void __user *)(unsigned long)a.desc_user_addr,
 				(void __user *)(unsigned long)a.avail_user_addr,
 				(void __user *)(unsigned long)a.used_user_addr)) {
@@ -1267,6 +1273,12 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 			eventfd_signal(vq->log_ctx, 1);
 	}
 	vq->last_used_idx++;
+	/* If the driver never bothers to signal in a very long while,
+	 * used index might wrap around. If that happens, invalidate
+	 * signalled_used index we stored. TODO: make sure driver
+	 * signals at least once in 2^16 and remove this. */
+	if (unlikely(vq->last_used_idx == vq->signalled_used))
+		vq->signalled_used_valid = false;
 	return 0;
 }
 
@@ -1275,6 +1287,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			    unsigned count)
 {
 	struct vring_used_elem __user *used;
+	u16 old, new;
 	int start;
 
 	start = vq->last_used_idx % vq->num;
@@ -1292,7 +1305,14 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 			   ((void __user *)used - (void __user *)vq->used),
 			  count * sizeof *used);
 	}
-	vq->last_used_idx += count;
+	old = vq->last_used_idx;
+	new = (vq->last_used_idx += count);
+	/* If the driver never bothers to signal in a very long while,
+	 * used index might wrap around. If that happens, invalidate
+	 * signalled_used index we stored. TODO: make sure driver
+	 * signals at least once in 2^16 and remove this. */
+	if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
+		vq->signalled_used_valid = false;
 	return 0;
 }
 
@@ -1331,29 +1351,47 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 	return r;
 }
 
-/* This actually signals the guest, using eventfd. */
-void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-	__u16 flags;
-
+	__u16 old, new, event;
+	bool v;
 	/* Flush out used index updates. This is paired
 	 * with the barrier that the Guest executes when enabling
 	 * interrupts. */
 	smp_mb();
 
-	if (__get_user(flags, &vq->avail->flags)) {
-		vq_err(vq, "Failed to get flags");
-		return;
+	if (vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
+	    unlikely(vq->avail_idx == vq->last_avail_idx))
+		return true;
+
+	if (!vhost_has_feature(dev, VIRTIO_RING_F_USED_EVENT_IDX)) {
+		__u16 flags;
+		if (__get_user(flags, &vq->avail->flags)) {
+			vq_err(vq, "Failed to get flags");
+			return true;
+		}
+		return !(flags & VRING_AVAIL_F_NO_INTERRUPT);
 	}
+	old = vq->signalled_used;
+	v = vq->signalled_used_valid;
+	new = vq->signalled_used = vq->last_used_idx;
+	vq->signalled_used_valid = true;
 
-	/* If they don't want an interrupt, don't signal, unless empty. */
-	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
-	    (vq->avail_idx != vq->last_avail_idx ||
-	     !vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY)))
-		return;
+	if (unlikely(!v))
+		return true;
+
+	if (get_user(event, vhost_used_event(vq))) {
+		vq_err(vq, "Failed to get used event idx");
+		return true;
+	}
+	return vring_need_event(event, new, old);
+}
 
+/* This actually signals the guest, using eventfd. */
+void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
 	/* Signal the Guest tell them we used something up. */
-	if (vq->call_ctx)
+	if (vq->call_ctx && vhost_notify(dev, vq))
 		eventfd_signal(vq->call_ctx, 1);
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0f1bf33..5825ac6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -84,6 +84,12 @@ struct vhost_virtqueue {
 	/* Used flags */
 	u16 used_flags;
 
+	/* Last used index value we have signalled on */
+	u16 signalled_used;
+
+	/* Is signalled_used above valid? */
+	bool signalled_used_valid;
+
 	/* Log writes to used structure. */
 	bool log_used;
 	u64 log_addr;
@@ -164,6 +170,7 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 enum {
 	VHOST_FEATURES = (1 << VIRTIO_F_NOTIFY_ON_EMPTY) |
 			 (1 << VIRTIO_RING_F_INDIRECT_DESC) |
+			 (1 << VIRTIO_RING_F_USED_EVENT_IDX) |
 			 (1 << VHOST_F_LOG_ALL) |
 			 (1 << VHOST_NET_F_VIRTIO_NET_HDR) |
 			 (1 << VIRTIO_NET_F_MRG_RXBUF),
-- 
1.7.5.53.gc233e
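
The wrap-around guard in __vhost_add_used_n() tests whether signalled_used lies in (old, new], again using 16-bit arithmetic. Below is a stand-alone version under the hypothetical name toy_passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True iff mark is one of old + 1, ..., new_idx (mod 2^16), i.e. the
 * used index passed over or landed on mark while advancing. */
static bool toy_passed(uint16_t old, uint16_t new_idx, uint16_t mark)
{
	return (uint16_t)(new_idx - mark) < (uint16_t)(new_idx - old);
}
```

For example, adding 10 entries starting at 65530 wraps to 4. That passes over 2 and lands on 4, but does not count 65530 itself.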

* [PATCH 11/18] vhost: support avail_event idx
From: Michael S. Tsirkin @ 2011-05-04 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Add support for the new avail_event feature to the vhost core,
and update the vhost_net and vhost test modules for the new
vhost_enable_notify()/vhost_disable_notify() signatures.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/net.c   |   12 ++++----
 drivers/vhost/test.c  |    6 ++--
 drivers/vhost/vhost.c |   65 +++++++++++++++++++++++++++++++++++++------------
 drivers/vhost/vhost.h |   17 +++++++------
 4 files changed, 67 insertions(+), 33 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..e224a92 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -144,7 +144,7 @@ static void handle_tx(struct vhost_net *net)
 	}
 
 	mutex_lock(&vq->mutex);
-	vhost_disable_notify(vq);
+	vhost_disable_notify(&net->dev, vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
@@ -166,8 +166,8 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
-			if (unlikely(vhost_enable_notify(vq))) {
-				vhost_disable_notify(vq);
+			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+				vhost_disable_notify(&net->dev, vq);
 				continue;
 			}
 			break;
@@ -315,7 +315,7 @@ static void handle_rx(struct vhost_net *net)
 		return;
 
 	mutex_lock(&vq->mutex);
-	vhost_disable_notify(vq);
+	vhost_disable_notify(&net->dev, vq);
 	vhost_hlen = vq->vhost_hlen;
 	sock_hlen = vq->sock_hlen;
 
@@ -334,10 +334,10 @@ static void handle_rx(struct vhost_net *net)
 			break;
 		/* OK, now we need to know about added descriptors. */
 		if (!headcount) {
-			if (unlikely(vhost_enable_notify(vq))) {
+			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
-				vhost_disable_notify(vq);
+				vhost_disable_notify(&net->dev, vq);
 				continue;
 			}
 			/* Nothing new?  Wait for eventfd to tell us
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 099f302..734e1d7 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -49,7 +49,7 @@ static void handle_vq(struct vhost_test *n)
 		return;
 
 	mutex_lock(&vq->mutex);
-	vhost_disable_notify(vq);
+	vhost_disable_notify(&n->dev, vq);
 
 	for (;;) {
 		head = vhost_get_vq_desc(&n->dev, vq, vq->iov,
@@ -61,8 +61,8 @@ static void handle_vq(struct vhost_test *n)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			if (unlikely(vhost_enable_notify(vq))) {
-				vhost_disable_notify(vq);
+			if (unlikely(vhost_enable_notify(&n->dev, vq))) {
+				vhost_disable_notify(&n->dev, vq);
 				continue;
 			}
 			break;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e33d5a3..2aea4cb 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -38,6 +38,7 @@ enum {
 };
 
 #define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
+#define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
 
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
@@ -499,11 +500,12 @@ static int vq_access_ok(struct vhost_dev *d, unsigned int num,
 			struct vring_used __user *used)
 {
 	size_t sa = vhost_has_feature(d, VIRTIO_RING_F_USED_EVENT_IDX) ? 2 : 0;
+	size_t su = vhost_has_feature(d, VIRTIO_RING_F_AVAIL_EVENT_IDX) ? 2 : 0;
 	return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
 	       access_ok(VERIFY_READ, avail,
 			 sizeof *avail + num * sizeof *avail->ring + sa) &&
 	       access_ok(VERIFY_WRITE, used,
-			sizeof *used + num * sizeof *used->ring);
+			sizeof *used + num * sizeof *used->ring + su);
 }
 
 /* Can we log writes? */
@@ -519,9 +521,11 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 
 /* Verify access for write logging. */
 /* Caller should have vq mutex and device mutex */
-static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
+static int vq_log_access_ok(struct vhost_dev *d, struct vhost_virtqueue *vq,
+			    void __user *log_base)
 {
 	struct vhost_memory *mp;
+	size_t s = vhost_has_feature(d, VIRTIO_RING_F_AVAIL_EVENT_IDX) ? 2 : 0;
 
 	mp = rcu_dereference_protected(vq->dev->memory,
 				       lockdep_is_held(&vq->mutex));
@@ -529,7 +533,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 			    vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
 		(!vq->log_used || log_access_ok(log_base, vq->log_addr,
 					sizeof *vq->used +
-					vq->num * sizeof *vq->used->ring));
+					vq->num * sizeof *vq->used->ring + s));
 }
 
 /* Can we start vq? */
@@ -537,7 +541,7 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
 	return vq_access_ok(vq->dev, vq->num, vq->desc, vq->avail, vq->used) &&
-		vq_log_access_ok(vq, vq->log_base);
+		vq_log_access_ok(vq->dev, vq, vq->log_base);
 }
 
 static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
@@ -824,7 +828,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 			vq = d->vqs + i;
 			mutex_lock(&vq->mutex);
 			/* If ring is inactive, will check when it's enabled. */
-			if (vq->private_data && !vq_log_access_ok(vq, base))
+			if (vq->private_data && !vq_log_access_ok(d, vq, base))
 				r = -EFAULT;
 			else
 				vq->log_base = base;
@@ -1225,6 +1229,10 @@ int vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 
 	/* On success, increment avail index. */
 	vq->last_avail_idx++;
+
+	/* Assume notifications from guest are disabled at this point;
+	 * if they aren't, we would need to update the avail_event index. */
+	BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
 	return head;
 }
 
@@ -1414,7 +1422,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 
 /* OK, now we need to know about added descriptors. */
-bool vhost_enable_notify(struct vhost_virtqueue *vq)
+bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	u16 avail_idx;
 	int r;
@@ -1422,11 +1430,34 @@ bool vhost_enable_notify(struct vhost_virtqueue *vq)
 	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
 		return false;
 	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
-	r = put_user(vq->used_flags, &vq->used->flags);
-	if (r) {
-		vq_err(vq, "Failed to enable notification at %p: %d\n",
-		       &vq->used->flags, r);
-		return false;
+	if (!vhost_has_feature(dev, VIRTIO_RING_F_AVAIL_EVENT_IDX)) {
+		r = put_user(vq->used_flags, &vq->used->flags);
+		if (r) {
+			vq_err(vq, "Failed to enable notification at %p: %d\n",
+			       &vq->used->flags, r);
+			return false;
+		}
+	} else {
+		r = put_user(vq->last_avail_idx, vhost_avail_event(vq));
+		if (r) {
+			vq_err(vq, "Failed to update avail event index at %p: %d\n",
+			       vhost_avail_event(vq), r);
+			return false;
+		}
+	}
+	if (unlikely(vq->log_used)) {
+		void __user *used;
+		/* Make sure data is seen before log. */
+		smp_wmb();
+		used = vhost_has_feature(dev, VIRTIO_RING_F_AVAIL_EVENT_IDX) ?
+			vhost_avail_event(vq) : &vq->used->flags;
+		/* Log used flags or event index entry write. Both are 16 bit
+		 * fields. */
+		log_write(vq->log_base, vq->log_addr +
+			   (used - (void __user *)vq->used),
+			  sizeof(u16));
+		if (vq->log_ctx)
+			eventfd_signal(vq->log_ctx, 1);
 	}
 	/* They could have slipped one in as we were doing that: make
 	 * sure it's written, then check again. */
@@ -1442,15 +1473,17 @@ bool vhost_enable_notify(struct vhost_virtqueue *vq)
 }
 
 /* We don't need to be notified again. */
-void vhost_disable_notify(struct vhost_virtqueue *vq)
+void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	int r;
 
 	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
 		return;
 	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
-	r = put_user(vq->used_flags, &vq->used->flags);
-	if (r)
-		vq_err(vq, "Failed to enable notification at %p: %d\n",
-		       &vq->used->flags, r);
+	if (!vhost_has_feature(dev, VIRTIO_RING_F_AVAIL_EVENT_IDX)) {
+		r = put_user(vq->used_flags, &vq->used->flags);
+		if (r)
+			vq_err(vq, "Failed to disable notification at %p: %d\n",
+			       &vq->used->flags, r);
+	}
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 5825ac6..edf84be 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -155,8 +155,8 @@ void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
 void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
 			       struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
-void vhost_disable_notify(struct vhost_virtqueue *);
-bool vhost_enable_notify(struct vhost_virtqueue *);
+void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
@@ -168,12 +168,13 @@ int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 	} while (0)
 
 enum {
-	VHOST_FEATURES = (1 << VIRTIO_F_NOTIFY_ON_EMPTY) |
-			 (1 << VIRTIO_RING_F_INDIRECT_DESC) |
-			 (1 << VIRTIO_RING_F_USED_EVENT_IDX) |
-			 (1 << VHOST_F_LOG_ALL) |
-			 (1 << VHOST_NET_F_VIRTIO_NET_HDR) |
-			 (1 << VIRTIO_NET_F_MRG_RXBUF),
+	VHOST_FEATURES = (1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) |
+			 (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+			 (1ULL << VIRTIO_RING_F_USED_EVENT_IDX) |
+			 (1ULL << VIRTIO_RING_F_AVAIL_EVENT_IDX) |
+			 (1ULL << VHOST_F_LOG_ALL) |
+			 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+			 (1ULL << VIRTIO_NET_F_MRG_RXBUF),
 };
 
 static inline bool vhost_has_feature(struct vhost_dev *dev, int bit)
-- 
1.7.5.53.gc233e
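
Which 16-bit field vhost_enable_notify() writes depends on the negotiated feature, and the dirty log entry must cover whichever field was actually written. The sketch below computes the two offsets relative to the used ring. It assumes hypothetical userspace copies of the used ring types, and toy_notify_field_offset is not a vhost function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace copies of the used ring types. */
struct toy_vring_used_elem {
	uint32_t id;
	uint32_t len;
};

struct toy_vring_used {
	uint16_t flags;
	uint16_t idx;
	struct toy_vring_used_elem ring[];
};

/* Offset of the field written when enabling notifications: used->flags
 * without the event index feature, avail_event (the slot just past
 * used->ring[num - 1]) with it. */
static size_t toy_notify_field_offset(bool avail_event_idx, unsigned int num)
{
	if (!avail_event_idx)
		return offsetof(struct toy_vring_used, flags);
	return offsetof(struct toy_vring_used, ring) +
	       num * sizeof(struct toy_vring_used_elem);
}
```

For num = 256 the offset is 0 (used->flags) without the feature and 4 + 8 * 256 = 2052 (avail_event) with it.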

* [PATCH 12/18] virtio_test: support used_event index
From: Michael S. Tsirkin @ 2011-05-04 20:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Krishna Kumar, Carsten Otte, lguest, Shirley Ma, kvm, linux-s390,
	netdev, habanero, Heiko Carstens, linux-kernel, virtualization,
	steved, Christian Borntraeger, Tom Lendacky, Martin Schwidefsky,
	linux390
In-Reply-To: <cover.1304541918.git.mst@redhat.com>

Add the ability to test the new used_event feature,
and enable it by default.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/virtio_test.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index 9e65e6d..157ec68 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -210,18 +210,29 @@ const struct option longopts[] = {
 		.val = 'i',
 	},
 	{
+		.name = "used-event-idx",
+		.val = 'U',
+	},
+	{
+		.name = "no-used-event-idx",
+		.val = 'u',
+	},
+	{
 	}
 };
 
 static void help()
 {
-	fprintf(stderr, "Usage: virtio_test [--help] [--no-indirect]\n");
+	fprintf(stderr, "Usage: virtio_test [--help]"
+		" [--no-indirect]"
+		" [--no-used-event-idx]\n");
 }
 
 int main(int argc, char **argv)
 {
 	struct vdev_info dev;
-	unsigned long long features = 1ULL << VIRTIO_RING_F_INDIRECT_DESC;
+	unsigned long long features = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+		(1ULL << VIRTIO_RING_F_USED_EVENT_IDX);
 	int o;
 
 	for (;;) {
@@ -238,6 +249,9 @@ int main(int argc, char **argv)
 		case 'i':
 			features &= ~(1ULL << VIRTIO_RING_F_INDIRECT_DESC);
 			break;
+		case 'u':
+			features &= ~(1ULL << VIRTIO_RING_F_USED_EVENT_IDX);
+			break;
 		default:
 			assert(0);
 			break;
-- 
1.7.5.53.gc233e
