Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 13/13] TProxy: use the interface primary IP address as a default value for --on-ip
From: Balazs Scheidler @ 2009-09-22  6:38 UTC (permalink / raw)
  To: Brian Haley; +Cc: netfilter-devel, netdev
In-Reply-To: <4AB7BF47.2030404@hp.com>

On Mon, 2009-09-21 at 14:00 -0400, Brian Haley wrote:
> Balazs Scheidler wrote: 
> >  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> > +
> > +static inline const struct in6_addr *
> > +tproxy_laddr6(struct sk_buff *skb, const struct in6_addr *user_laddr, const struct in6_addr *daddr)
> > +{
> > +	struct inet6_dev *indev;
> > +	struct inet6_ifaddr *ifa;
> > +	struct in6_addr *laddr;
> > +	
> > +        if (!ipv6_addr_any(user_laddr))
> > +                return user_laddr;
> > +	
> > +        laddr = NULL;
> > +        rcu_read_lock();
> > +        indev = __in6_dev_get(skb->dev);
> > +        if (indev && (ifa = indev->addr_list)) {
> > +		laddr = &ifa->addr;
> > +	}
> > +        rcu_read_unlock();
> > +        
> > +        return laddr ? laddr : daddr;
> > +}
> 
> You should call ipv6_dev_get_saddr() to get a source address based on the target
> destination address.

Thanks for this hint, however this is not selecting a source address for
a given destination, rather it selects the address where tproxy is
redirecting the connection in case the user specified no --on-ip
parameter.

e.g. 

ip6tables -A PREROUTING -p tcp --dport 80 -j TPROXY --on-port 50080

This should redirect the connection to the primary IP address of the
incoming interface. In fact I spent 2 hours to figure out how to find
the proper address, and at the end I used the first IP address
configured to the interface, seeing that those addresses are sorted in
'scope' order, e.g. link-local and site-local addresses are at the end
of the list, thus the front should be ok.

Since I'm not that much into IPv6, I'd appreciate some help, is
ipv6_dev_get_saddr(client_ip_address) indeed the best solution here?

-- 
Bazsi

^ permalink raw reply

* Re: [PATCH 02/13] TProxy: add lookup type checks for UDP in nf_tproxy_get_sock_v4()
From: Balazs Scheidler @ 2009-09-22  6:40 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: netfilter-devel, netdev
In-Reply-To: <alpine.LSU.2.00.0909212320150.31023@obet.zrqbmnf.qr>

On Mon, 2009-09-21 at 23:20 +0200, Jan Engelhardt wrote:
> On Saturday 2009-08-15 14:01, Balazs Scheidler wrote:
> 
> >+	case IPPROTO_UDP:
> >+		sk = udp4_lib_lookup(net, saddr, sport, daddr, dport,
> >+				     in->ifindex);
> 
> You might want to add IPPROTO_UDPLITE in all places.

Well, I preferred not to add those, as I'm unable to test it, and I'd
prefer to submit a patch that really works for UDP and TCP at the
minimum. Further protocols (like SCTP) should be added later IMHO.

-- 
Bazsi


^ permalink raw reply

* Re: [PATCH][RESEND] IPv6: 6rd tunnel mode
From: Alexandre Cassen @ 2009-09-22  6:59 UTC (permalink / raw)
  To: Brian Haley; +Cc: netdev
In-Reply-To: <4AB838F1.1090704@hp.com>

Hi Brian,

On Mon, 2009-09-21 at 22:39 -0400, Brian Haley wrote:
> Hi Alexandre,
> 
> Alexandre Cassen wrote:
> > This patch add support to 6rd tunnel mode currently targetting
> > standard track at the IETF.
> > 
> > IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
> > to enable a service provider to rapidly deploy IPv6 unicast service
> > to IPv4 sites to which it provides customer premise equipment.  Like
> > 6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
> > transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
> > provider uses an IPv6 prefix of its own in place of the fixed 6to4
> > prefix.
> 
> I couldn't find RFC 5569 (delayed due to IPR rights?), although I did find
> the latest 6rd draft, -03.  It was showing as Informational, not Standards
> track, is that right?  Just curious.

In fact there is currently two draft :

1) https://datatracker.ietf.org/idtracker/draft-despres-6rd/

   This draft is targeting informational RFC as an independent
submission. It is currently queued and has been delayed since may for
IPR.

2) http://tools.ietf.org/html/draft-townsley-ipv6-6rd-01

   This draft is targeting standard track so work is in progress here.

A good sum up has been done by Mark Townsley at last IETF meeting :

http://www.ietf.org/proceedings/75/slides/dhc-4.pdf

> > +		case SIOCADD6RD:
> > +		case SIOCCHG6RD:
> > +			if (ip6rd.prefixlen >= 95) {
> > +				err = -EINVAL;
> > +				goto done;
> > +			}
> > +			t->ip6rd_prefix.addr = ip6rd.addr;
> 
> ipv6_addr_copy(&t->ip6rd_prefix.addr, &ip6rd.addr); is the preferred way to
> copy the address.

agreed. will fix and resend.

regs,
Alexandre


^ permalink raw reply

* Re: [PATCH 10/13] TProxy: added IPv6 socket lookup function to nf_tproxy_core
From: Jan Engelhardt @ 2009-09-22  8:30 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: netfilter-devel, netdev
In-Reply-To: <1253548005.12519.10.camel@bzorp.balabit>


On Monday 2009-08-24 14:51, Balazs Scheidler wrote:
>+		case NFT_LOOKUP_LISTENER:
>+			sk = inet6_lookup_listener(net, &tcp_hashinfo,
>+                                                   daddr, ntohs(dport),
>+                                                   in->ifindex);
>+
>+                        /* NOTE: we return listeners even if bound to
>+                         * 0.0.0.0, those are filtered out in

s/0.0.0.0/::/g  :-)

^ permalink raw reply

* Re: [PATCH 12/13] TProxy: added IPv6 support to the socket match
From: Jan Engelhardt @ 2009-09-22  8:33 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: netfilter-devel, netdev
In-Reply-To: <1253548005.12519.12.camel@bzorp.balabit>

On Monday 2009-08-24 14:52, Balazs Scheidler wrote:

>+static bool
>+socket_mt6_v1(const struct sk_buff *skb, const struct xt_match_param *par)
>+{
>+	struct ipv6hdr *iph = ipv6_hdr(skb);
>+	struct udphdr _hdr, *hp = NULL;
>+	struct sock *sk;
>+	struct in6_addr *daddr, *saddr;
>+	__be16 dport, sport;
>+        int thoff;
>+	u8 tproto;
>+        const struct xt_socket_mtinfo1 *info = (struct xt_socket_mtinfo1 *) par->matchinfo;
>+        
>+        tproto = ipv6_find_hdr(skb, &thoff, -1, NULL);
>+        if (tproto < 0) {
>+		pr_debug("socket match: Unable to find transport header in IPv6 packet, dropping\n");
>+		return NF_DROP;
>+        }
>+
>+	if (tproto == IPPROTO_UDP || tproto == IPPROTO_TCP) {

The tabbing seems off (also noticed this in other patches,
pcregrep for '^ {8}' )

^ permalink raw reply

* [PATCH][RESEND 2] IPv6: 6rd tunnel mode
From: Alexandre Cassen @ 2009-09-22  8:51 UTC (permalink / raw)
  To: netdev

This patch add support to 6rd tunnel mode currently targetting
standard track at the IETF.

Patch history :
* http://patchwork.ozlabs.org/patch/26870/
* http://patchwork.ozlabs.org/patch/34026/

IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
to enable a service provider to rapidly deploy IPv6 unicast service
to IPv4 sites to which it provides customer premise equipment.  Like
6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
provider uses an IPv6 prefix of its own in place of the fixed 6to4
prefix.

Signed-off-by: Alexandre Cassen <acassen@freebox.fr>
---
 include/linux/if_tunnel.h |   10 +++++
 include/net/ipip.h        |    2 +
 net/ipv6/Kconfig          |   13 +++++++
 net/ipv6/sit.c            |   84 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 109 insertions(+), 0 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 5eb9b0f..0d44376 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -15,6 +15,10 @@
 #define SIOCADDPRL      (SIOCDEVPRIVATE + 5)
 #define SIOCDELPRL      (SIOCDEVPRIVATE + 6)
 #define SIOCCHGPRL      (SIOCDEVPRIVATE + 7)
+#define SIOCGET6RD      (SIOCDEVPRIVATE + 8)
+#define SIOCADD6RD      (SIOCDEVPRIVATE + 9)
+#define SIOCDEL6RD      (SIOCDEVPRIVATE + 10)
+#define SIOCCHG6RD      (SIOCDEVPRIVATE + 11)
 
 #define GRE_CSUM	__cpu_to_be16(0x8000)
 #define GRE_ROUTING	__cpu_to_be16(0x4000)
@@ -51,6 +55,12 @@ struct ip_tunnel_prl {
 /* PRL flags */
 #define	PRL_DEFAULT		0x0001
 
+/* 6RD parms */
+struct ip_tunnel_6rd {
+	struct in6_addr		addr;
+	__u8			prefixlen;
+};
+
 enum
 {
 	IFLA_GRE_UNSPEC,
diff --git a/include/net/ipip.h b/include/net/ipip.h
index 5d3036f..fa92c41 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -26,6 +26,8 @@ struct ip_tunnel
 
 	struct ip_tunnel_prl_entry	*prl;		/* potential router list */
 	unsigned int			prl_count;	/* # of entries in PRL */
+
+	struct ip_tunnel_6rd	ip6rd_prefix;	/* 6RD SP prefix */
 };
 
 /* ISATAP: default interval between RS in secondy */
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index ead6c7a..78a565b 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -170,6 +170,19 @@ config IPV6_SIT
 
 	  Saying M here will produce a module called sit. If unsure, say Y.
 
+config IPV6_SIT_6RD
+	bool "IPv6: 6rd tunnel mode (EXPERIMENTAL)"
+	depends on IPV6_SIT && EXPERIMENTAL
+	default n
+	---help---
+	IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
+	to enable a service provider to rapidly deploy IPv6 unicast service
+	to IPv4 sites to which it provides customer premise equipment.  Like
+	6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
+	transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
+	provider uses an IPv6 prefix of its own in place of the fixed 6to4
+	prefix.
+
 config IPV6_NDISC_NODETYPE
 	bool
 
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0ae4f64..034acdc 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -604,6 +604,30 @@ static inline __be32 try_6to4(struct in6_addr *v6dst)
 	return dst;
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+/* Returns the embedded IPv4 address if the IPv6 address comes from
+   6rd rule */
+
+static inline __be32 try_6rd(struct in6_addr *addr, u8 prefix_len, struct in6_addr *v6dst)
+{
+	__be32 dst = 0;
+
+	/* isolate addr according to mask */
+	if (ipv6_prefix_equal(v6dst, addr, prefix_len)) {
+		unsigned int d32_off, bits;
+
+		d32_off = prefix_len >> 5;
+		bits = (prefix_len & 0x1f);
+
+		dst = (ntohl(v6dst->s6_addr32[d32_off]) << bits);
+		if (bits)
+			dst |= ntohl(v6dst->s6_addr32[d32_off + 1]) >> (32 - bits);
+		dst = htonl(dst);
+	}
+	return dst;
+}
+#endif
+
 /*
  *	This function assumes it is being called from dev_queue_xmit()
  *	and that skb is filled properly by that function.
@@ -657,6 +681,13 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			goto tx_error;
 	}
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (!dst && tunnel->ip6rd_prefix.prefixlen)
+		dst = try_6rd(&tunnel->ip6rd_prefix.addr,
+			      tunnel->ip6rd_prefix.prefixlen,
+			      &iph6->daddr);
+       else
+#endif
 	if (!dst)
 		dst = try_6to4(&iph6->daddr);
 
@@ -848,6 +879,9 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	int err = 0;
 	struct ip_tunnel_parm p;
 	struct ip_tunnel_prl prl;
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 	struct ip_tunnel *t;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
@@ -987,6 +1021,56 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		netdev_state_change(dev);
 		break;
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	case SIOCGET6RD:
+		err = -EINVAL;
+		if (dev == sitn->fb_tunnel_dev)
+			goto done;
+		err = -ENOENT;
+		if (!(t = netdev_priv(dev)))
+			goto done;
+		memcpy(&ip6rd, &t->ip6rd_prefix, sizeof(ip6rd));
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &ip6rd, sizeof(ip6rd)))
+			err = -EFAULT;
+		else
+			err = 0;
+		break;
+
+	case SIOCADD6RD:
+	case SIOCDEL6RD:
+	case SIOCCHG6RD:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+		err = -EINVAL;
+		if (dev == sitn->fb_tunnel_dev)
+			goto done;
+		err = -EFAULT;
+		if (copy_from_user(&ip6rd, ifr->ifr_ifru.ifru_data, sizeof(ip6rd)))
+			goto done;
+		err = -ENOENT;
+		if (!(t = netdev_priv(dev)))
+			goto done;
+
+		err = 0;
+		switch (cmd) {
+		case SIOCDEL6RD:
+			memset(&t->ip6rd_prefix, 0, sizeof(ip6rd));
+			break;
+		case SIOCADD6RD:
+		case SIOCCHG6RD:
+			if (ip6rd.prefixlen >= 95) {
+				err = -EINVAL;
+				goto done;
+			}
+			ipv6_addr_copy(&t->ip6rd_prefix.addr, &ip6rd.addr);
+			t->ip6rd_prefix.prefixlen = ip6rd.prefixlen;
+			break;
+		}
+		netdev_state_change(dev);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 	}
-- 
1.6.0.4


^ permalink raw reply related

* Resend: [PATCH] TCP Early Retransmit: reduce required dupacks for triggering fast retrans
From: Christian Samsel @ 2009-09-22  8:59 UTC (permalink / raw)
  To: netdev

This patch implements draft-ietf-tcpm-early-rexmt. The early retransmit 
mechanism allows the transport to reduce the number of duplicate
acknowledgments required to trigger a fast retransmission in case we
don't expect enough dupacks, (e.g. because there are not enough
packets inflight and nothing to send). This allows the transport to use
fast retransmit to recover packet losses that would otherwise require
a lengthy retransmission timeout.

See: http://tools.ietf.org/html/draft-ietf-tcpm-early-rexmt-01

Signed-off-by: Christian Samsel <christian.samsel@rwth-aachen.de>

---
 net/ipv4/tcp_input.c |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af6d6fa..c0cc4fd 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2913,6 +2913,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
  int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
                                     (tcp_fackets_out(tp) > tp->reordering));
  int fast_rexmit = 0, mib_idx;
+ u32 in_flight;
 
  if (WARN_ON(!tp->packets_out && tp->sacked_out))
          tp->sacked_out = 0;
@@ -3062,6 +3063,21 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
  if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
          tcp_update_scoreboard(sk, fast_rexmit);
  tcp_cwnd_down(sk, flag);
+       
+
+ /* draft-ietf-tcpm-early-rexmt: lowers dup ack threshold to prevent rto
+         * in case we don't expect enough dup ack. if number of outstanding
+         * packets is less than four and there is either no unsent data ready
+         * for transmission or the advertised window does not permit new
+         * segments.
+         */
+ in_flight = tcp_packets_in_flight(tp);
+ if ( in_flight < 4 && (skb_queue_empty(&sk->sk_write_queue) ||
+         tcp_may_send_now(sk) == 0) )
+         tp->reordering = in_flight - 1;
+ else if (tp->reordering != sysctl_tcp_reordering)
+         tp->reordering = sysctl_tcp_reordering;
+
  tcp_xmit_retransmit_queue(sk);
 }
 
-- 
1.6.4.1


^ permalink raw reply related

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-22  9:43 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Gregory Haskins, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <20090921214312.GJ7182@ovro.caltech.edu>

On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
>
>> Sure, virtio-ira and he is on his own to make a bus-model under that, or
>> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
>> model can work, I agree.
>>
>>      
> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
> virtio-s390. It isn't especially easy. I can steal lots of code from the
> lguest bus model, but sometimes it is good to generalize, especially
> after the fourth implemention or so. I think this is what GHaskins tried
> to do.
>    

Yes.  vbus is more finely layered so there is less code duplication.

The virtio layering was more or less dictated by Xen which doesn't have 
shared memory (it uses grant references instead).  As a matter of fact 
lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that 
part is duplicated.  It's probably possible to add a virtio-shmem.ko 
library that people who do have shared memory can reuse.

> I've given it some thought, and I think that running vhost-net (or
> similar) on the ppc boards, with virtio-net on the x86 crate server will
> work. The virtio-ring abstraction is almost good enough to work for this
> situation, but I had to re-invent it to work with my boards.
>
> I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
> Remember that this is the "host" system. I used each 4K block as a
> "device descriptor" which contains:
>
> 1) the type of device, config space, etc. for virtio
> 2) the "desc" table (virtio memory descriptors, see virtio-ring)
> 3) the "avail" table (available entries in the desc table)
>    

Won't access from x86 be slow to this memory (on the other hand, if you 
change it to main memory access from ppc will be slow... really depends 
on how your system is tuned.

> Parts 2 and 3 are repeated three times, to allow for a maximum of three
> virtqueues per device. This is good enough for all current drivers.
>    

The plan is to switch to multiqueue soon.  Will not affect you if your 
boards are uniprocessor or small smp.

> I've gotten plenty of email about this from lots of interested
> developers. There are people who would like this kind of system to just
> work, while having to write just some glue for their device, just like a
> network driver. I hunch most people have created some proprietary mess
> that basically works, and left it at that.
>    

So long as you keep the system-dependent features hookable or 
configurable, it should work.

> So, here is a desperate cry for help. I'd like to make this work, and
> I'd really like to see it in mainline. I'm trying to give back to the
> community from which I've taken plenty.
>    

Not sure who you're crying for help to.  Once you get this working, post 
patches.  If the patches are reasonably clean and don't impact 
performance for the main use case, and if you can show the need, I 
expect they'll be merged.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
From: Michael S. Tsirkin @ 2009-09-22 10:38 UTC (permalink / raw)
  To: Chris Wright
  Cc: Stephen Hemminger, Rusty Russell, virtualization, Xin, Xiaohui,
	kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, hpa@zytor.com,
	mingo@elte.hu, akpm@linux-foundation.org
In-Reply-To: <20090921162718.GM26034@sequoia.sous-sol.org>

On Mon, Sep 21, 2009 at 09:27:18AM -0700, Chris Wright wrote:
> * Stephen Hemminger (shemminger@vyatta.com) wrote:
> > On Mon, 21 Sep 2009 16:37:22 +0930
> > Rusty Russell <rusty@rustcorp.com.au> wrote:
> > 
> > > > > Actually this framework can apply to traditional network adapters which have
> > > > > just one tx/rx queue pair. And applications using the same user/kernel interface
> > > > > can utilize this framework to send/receive network traffic directly thru a tx/rx
> > > > > queue pair in a network adapter.
> > > > > 
> > 
> > More importantly, when virtualizations is used with multi-queue
> > NIC's the virtio-net NIC is a single CPU bottleneck. The virtio-net
> > NIC should preserve the parallelism (lock free) using multiple
> > receive/transmit queues. The number of queues should equal the
> > number of CPUs.
> 
> Yup, multiqueue virtio is on todo list ;-)
> 
> thanks,
> -chris

Note we'll need multiqueue tap for that to help.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] drivers/net/wireless: Use usb_endpoint_dir_out
From: Julia Lawall @ 2009-09-22 11:45 UTC (permalink / raw)
  To: John W. Linville, Ulrich Kunitz, Daniel Drake, linux-wireless,
	netdev, linux-kernel

From: Julia Lawall <julia@diku.dk>

Use the usb_endpoint_dir_out API function.  Note that the use of
USB_TYPE_MASK in the original code is incorrect; it results in a test that
is always false.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
struct usb_endpoint_descriptor *endpoint;
expression E;
@@

- (endpoint->bEndpointAddress & E) == USB_DIR_OUT
+ usb_endpoint_dir_out(endpoint)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>

---
 drivers/net/wireless/zd1211rw/zd_usb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -u -p a/drivers/net/wireless/zd1211rw/zd_usb.c b/drivers/net/wireless/zd1211rw/zd_usb.c
--- a/drivers/net/wireless/zd1211rw/zd_usb.c
+++ b/drivers/net/wireless/zd1211rw/zd_usb.c
@@ -1070,7 +1070,7 @@ static int eject_installer(struct usb_in
 
 	/* Find bulk out endpoint */
 	endpoint = &iface_desc->endpoint[1].desc;
-	if ((endpoint->bEndpointAddress & USB_TYPE_MASK) == USB_DIR_OUT &&
+	if (usb_endpoint_dir_out(endpoint) &&
 	    usb_endpoint_xfer_bulk(endpoint)) {
 		bulk_out_ep = endpoint->bEndpointAddress;
 	} else {

^ permalink raw reply

* Re: r8169 64-bit DMA support
From: Francois Romieu @ 2009-09-22 11:53 UTC (permalink / raw)
  To: Robert Hancock; +Cc: netdev
In-Reply-To: <4AB6BCEC.3070001@gmail.com>

Robert Hancock <hancockrwd@gmail.com> :
[...]
> It's not clear (from the mails I've read) exactly what was going on in  
> the case that caused this to be added.

Some AMD + r8169 systems simply did not work.

> Normally these days the PCI subsystem is supposed to detect that DAC
> isn't usable on a machine and refuse setting 64-bit DMA masks, it's
> not the driver's responsibility to handle this.
> I'm guessing that when this change was made that detection didn't exist
> though.

Not exactly. It was required for DAC to be explicitely enabled through
the CPlusCmd register.

> Thoughts on whether this default can be changed now ?

The 8168 does not seem to need the CPlusCmd stuff. I'll check it but it
should be possible to enable high DMA without condition for it.

-- 
Ueimor

^ permalink raw reply

* Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
From: Arnd Bergmann @ 2009-09-22 11:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Chris Wright, Stephen Hemminger, Rusty Russell, virtualization,
	Xin, Xiaohui, kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, hpa@zytor.com,
	mingo@elte.hu, akpm@linux-foundation.org
In-Reply-To: <20090922103807.GA2555@redhat.com>

On Tuesday 22 September 2009, Michael S. Tsirkin wrote:
> > > More importantly, when virtualizations is used with multi-queue
> > > NIC's the virtio-net NIC is a single CPU bottleneck. The virtio-net
> > > NIC should preserve the parallelism (lock free) using multiple
> > > receive/transmit queues. The number of queues should equal the
> > > number of CPUs.
> > 
> > Yup, multiqueue virtio is on todo list ;-)
> > 
> 
> Note we'll need multiqueue tap for that to help.

My idea for that was to open multiple file descriptors to the same
macvtap device and let the kernel figure out the  right thing to
do with that. You can do the same with raw packed sockets in case
of vhost_net, but I wouldn't want to add more complexity to the
tun/tap driver for this.

	Arnd <><

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] skge: request IRQ on activating the interface
From: Michal Schmidt @ 2009-09-22 12:01 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger

skge requests IRQ in its probe function. This causes a problem in
the following real-life scenario with two different NICs in the machine:

1. modprobe skge
   The card is detected as eth0 and requests IRQ 17. Directory
   /proc/irq/17/eth0 is created.
2. There is an udev rule which says this interface should be called
   eth1, so udev renames eth0 -> eth1.
3. modprobe 8139too
   The Realtek card is detected as eth0. It will be using IRQ 17 too.
4. ip link set eth0 up
   Now 8139too requests IRQ 17.

The result is:
WARNING: at fs/proc/generic.c:590 proc_register ...
proc_dir_entry '17/eth0' already registered
...

And "ls /proc/irq/17" shows two subdirectories, both called eth0.

Fix it by requesting the IRQ in skge when the interface is activated.
This works, because interfaces can be renamed only while they are down.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
---

 drivers/net/skge.c |   27 +++++++++++++++------------
 1 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index 62e852e..7e90f27 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -105,6 +105,7 @@ static void yukon_init(struct skge_hw *hw, int port);
 static void genesis_mac_init(struct skge_hw *hw, int port);
 static void genesis_link_up(struct skge_port *skge);
 static void skge_set_multicast(struct net_device *dev);
+static irqreturn_t skge_intr(int irq, void *dev_id);
 
 /* Avoid conditionals by using array */
 static const int txqaddr[] = { Q_XA1, Q_XA2 };
@@ -2572,18 +2573,26 @@ static int skge_up(struct net_device *dev)
 	if (netif_msg_ifup(skge))
 		printk(KERN_INFO PFX "%s: enabling interface\n", dev->name);
 
+	err = request_irq(dev->irq, skge_intr, IRQF_SHARED, dev->name, hw);
+	if (err) {
+		dev_err(&hw->pdev->dev, "%s: cannot assign irq %d\n",
+			dev->name, dev->irq);
+		return err;
+	}
+
 	if (dev->mtu > RX_BUF_SIZE)
 		skge->rx_buf_size = dev->mtu + ETH_HLEN;
 	else
 		skge->rx_buf_size = RX_BUF_SIZE;
 
-
 	rx_size = skge->rx_ring.count * sizeof(struct skge_rx_desc);
 	tx_size = skge->tx_ring.count * sizeof(struct skge_tx_desc);
 	skge->mem_size = tx_size + rx_size;
 	skge->mem = pci_alloc_consistent(hw->pdev, skge->mem_size, &skge->dma);
-	if (!skge->mem)
-		return -ENOMEM;
+	if (!skge->mem) {
+		err = -ENOMEM;
+		goto free_irq;
+	}
 
 	BUG_ON(skge->dma & 7);
 
@@ -2646,6 +2655,8 @@ static int skge_up(struct net_device *dev)
  free_pci_mem:
 	pci_free_consistent(hw->pdev, skge->mem_size, skge->mem, skge->dma);
 	skge->mem = NULL;
+ free_irq:
+	free_irq(dev->irq, hw);
 
 	return err;
 }
@@ -2733,6 +2744,7 @@ static int skge_down(struct net_device *dev)
 	kfree(skge->tx_ring.start);
 	pci_free_consistent(hw->pdev, skge->mem_size, skge->mem, skge->dma);
 	skge->mem = NULL;
+	free_irq(dev->irq, hw);
 	return 0;
 }
 
@@ -3974,12 +3986,6 @@ static int __devinit skge_probe(struct pci_dev *pdev,
 		goto err_out_free_netdev;
 	}
 
-	err = request_irq(pdev->irq, skge_intr, IRQF_SHARED, dev->name, hw);
-	if (err) {
-		dev_err(&pdev->dev, "%s: cannot assign irq %d\n",
-		       dev->name, pdev->irq);
-		goto err_out_unregister;
-	}
 	skge_show_addr(dev);
 
 	if (hw->ports > 1 && (dev1 = skge_devinit(hw, 1, using_dac))) {
@@ -3996,8 +4002,6 @@ static int __devinit skge_probe(struct pci_dev *pdev,
 
 	return 0;
 
-err_out_unregister:
-	unregister_netdev(dev);
 err_out_free_netdev:
 	free_netdev(dev);
 err_out_led_off:
@@ -4041,7 +4045,6 @@ static void __devexit skge_remove(struct pci_dev *pdev)
 	skge_write16(hw, B0_LED, LED_STAT_OFF);
 	skge_write8(hw, B0_CTST, CS_RST_SET);
 
-	free_irq(pdev->irq, hw);
 	pci_release_regions(pdev);
 	pci_disable_device(pdev);
 	if (dev1)


^ permalink raw reply related

* [PATCH] 8139cp: fix duplicate loglevel in module load message
From: Alan Jenkins @ 2009-09-22 14:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-kernel, Alexander Beregalov

This was introduced by b93d58 "8139*: convert printk() to pr_<foo>()":

[ 2256252443 ] <6>8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004)

The "version" string is printed using pr_info(), so it doesn't need to
include a loglevel.

Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
CC: Alexander Beregalov <a.beregalov@gmail.com>
---
 drivers/net/8139cp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
index d0dbbf3..6841a9a 100644
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -87,7 +87,7 @@
 
 /* These identify the driver base version and may not be removed. */
 static char version[] =
-KERN_INFO DRV_NAME ": 10/100 PCI Ethernet driver v" DRV_VERSION " (" DRV_RELDATE ")\n";
+DRV_NAME ": 10/100 PCI Ethernet driver v" DRV_VERSION " (" DRV_RELDATE ")\n";
 
 MODULE_AUTHOR("Jeff Garzik <jgarzik@pobox.com>");
 MODULE_DESCRIPTION("RealTek RTL-8139C+ series 10/100 PCI Ethernet driver");
-- 
1.6.3.2

^ permalink raw reply related

* [PATCH] smsc95xx: fix transmission where ZLP is expected
From: Steve Glendinning @ 2009-09-22 14:00 UTC (permalink / raw)
  To: netdev; +Cc: Ian Saturley, David Miller, Vlad Lyalikov, Steve Glendinning

Usbnet framework assumes USB hardware doesn't handle zero length
packets, but SMSC LAN95xx requires these to be sent for correct
operation.

This patch fixes an easily reproducible tx lockup when sending a frame
that results in exactly 512 bytes in a USB transmission (e.g. a UDP
frame with 458 data bytes, due to IP headers and our USB headers).  It
adds an extra flag to usbnet for the hardware driver to indicate that
it can handle and requires the zero length packets.

This patch should not affect other usbnet users, please also consider
for -stable.

Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
---
 drivers/net/usb/smsc95xx.c |    2 +-
 drivers/net/usb/usbnet.c   |    2 +-
 include/linux/usb/usbnet.h |    1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index 938fb35..6e9410f 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -1227,7 +1227,7 @@ static const struct driver_info smsc95xx_info = {
 	.rx_fixup	= smsc95xx_rx_fixup,
 	.tx_fixup	= smsc95xx_tx_fixup,
 	.status		= smsc95xx_status,
-	.flags		= FLAG_ETHER,
+	.flags		= FLAG_ETHER | FLAG_SEND_ZLP,
 };
 
 static const struct usb_device_id products[] = {
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 24b36f7..ca5ca5a 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1049,7 +1049,7 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	 * NOTE:  strictly conforming cdc-ether devices should expect
 	 * the ZLP here, but ignore the one-byte packet.
 	 */
-	if ((length % dev->maxpacket) == 0) {
+	if (!(info->flags & FLAG_SEND_ZLP) && (length % dev->maxpacket) == 0) {
 		urb->transfer_buffer_length++;
 		if (skb_tailroom(skb)) {
 			skb->data[skb->len] = 0;
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index bb69e25..f814730 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -89,6 +89,7 @@ struct driver_info {
 #define FLAG_FRAMING_AX 0x0040		/* AX88772/178 packets */
 #define FLAG_WLAN	0x0080		/* use "wlan%d" names */
 #define FLAG_AVOID_UNLINK_URBS 0x0100	/* don't unlink urbs at usbnet_stop() */
+#define FLAG_SEND_ZLP	0x0200		/* hw requires ZLPs are sent */
 
 
 	/* init device ... can sleep, or cause probe() failure */
-- 
1.6.2.5


^ permalink raw reply related

* Re: [PATCH 13/13] TProxy: use the interface primary IP address as a default value for --on-ip
From: Brian Haley @ 2009-09-22 14:17 UTC (permalink / raw)
  To: Balazs Scheidler; +Cc: netfilter-devel, netdev
In-Reply-To: <1253601509.6883.5.camel@bzorp.balabit>

Balazs Scheidler wrote:
> On Mon, 2009-09-21 at 14:00 -0400, Brian Haley wrote:
>> Balazs Scheidler wrote: 
>>>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>>> +
>>> +static inline const struct in6_addr *
>>> +tproxy_laddr6(struct sk_buff *skb, const struct in6_addr *user_laddr, const struct in6_addr *daddr)
>>> +{
>>> +	struct inet6_dev *indev;
>>> +	struct inet6_ifaddr *ifa;
>>> +	struct in6_addr *laddr;
>>> +	
>>> +        if (!ipv6_addr_any(user_laddr))
>>> +                return user_laddr;
>>> +	
>>> +        laddr = NULL;
>>> +        rcu_read_lock();
>>> +        indev = __in6_dev_get(skb->dev);
>>> +        if (indev && (ifa = indev->addr_list)) {
>>> +		laddr = &ifa->addr;
>>> +	}
>>> +        rcu_read_unlock();
>>> +        
>>> +        return laddr ? laddr : daddr;
>>> +}
>> You should call ipv6_dev_get_saddr() to get a source address based on the target
>> destination address.
> 
> Thanks for this hint, however this is not selecting a source address for
> a given destination, rather it selects the address where tproxy is
> redirecting the connection in case the user specified no --on-ip
> parameter.
> 
> e.g. 
> 
> ip6tables -A PREROUTING -p tcp --dport 80 -j TPROXY --on-port 50080
> 
> This should redirect the connection to the primary IP address of the
> incoming interface. In fact I spent 2 hours to figure out how to find
> the proper address, and at the end I used the first IP address
> configured to the interface, seeing that those addresses are sorted in
> 'scope' order, e.g. link-local and site-local addresses are at the end
> of the list, thus the front should be ok.

Yes, the addresses are sorted by scope, but just because they're in the
list doesn't mean they can be used, for example that address might have
failed DAD or be Deprecated.  ipv6_dev_get_saddr() will follow the rules
from RFC 3484 in picking the best address to use, or none if there isn't
anything appropriate.

> Since I'm not that much into IPv6, I'd appreciate some help, is
> ipv6_dev_get_saddr(client_ip_address) indeed the best solution here?

Probably.  An alternative might be to use ip6_dst_lookup() (see tcp_v6_connect()),
but a lot more code for you.

-Brian

^ permalink raw reply

* Re: fanotify as syscalls
From: Davide Libenzi @ 2009-09-22 14:51 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andreas Gruenbacher, Eric Paris, Linus Torvalds, Evgeniy Polyakov,
	David Miller, Linux Kernel Mailing List, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <20090921231227.GJ14700@shareable.org>

On Tue, 22 Sep 2009, Jamie Lokier wrote:

> I don't mind at all if fanotify is replaced by a general purpose "take
> over the system call table" solution ...

That was not what I meant ;)
You'd register/unregister as syscall interceptor, receiving syscall number 
and parameters, you'd be able to return status/error codes directly, and 
you'd have the ability to eventually change the parameters. All this 
should be pretty trivial code, and at the same time give full syscall 
visibility to the modules.
The complexity would be left to the interceptors, as they already do it 
today.

> But I can't help noticing that we _already_ have quite well placed
> hooks for intercepting system calls, called security_this and
> security_that (SELinux etc), ...

That has "some" limits WRT non-GPL modules and relative static linkage.

> However, being a little kinder, I suspect even the anti-malware
> vendors would rather not slow down everything with race-prone
> complicated tracking of everything every process does...  which is why
> fanotify allows it's "interest set" to be reduced from everything to a
> subset of files, and it's results to be cached, and let the races be
> handled in the normal way by VFS.

They are already doing it today, since they are forced to literally find 
and hack the syscall table.
They already have things like process whitelists, path whitelists, scan 
caches, and all the whistles, in their code.
Of course, some of them might be interested in pushing given complexity 
inside the kernel, since they won't have to maintain it.
Some other OTOH, might be interested in keeping a syscall-based access, 
since they already have working code based on that abstraction.
The good part of this would be that all the userspace communication API, 
whitelists, caches, etc...  would be left to the module implementors, and 
not pushed inside the kernel.
That, and the flexibility of being able to intercept all the userspace 
entrances into the kernel.

- Davide

^ permalink raw reply

* [PATCH] smsc95xx: add additional USB product IDs
From: Steve Glendinning @ 2009-09-22 15:13 UTC (permalink / raw)
  To: netdev; +Cc: Ian Saturley, David Miller, Vlad Lyalikov, Steve Glendinning

Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
---
 drivers/net/usb/smsc95xx.c |   65 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index 938fb35..3aafebd 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -1237,10 +1237,75 @@ static const struct usb_device_id products[] = {
 		.driver_info = (unsigned long) &smsc95xx_info,
 	},
 	{
+		/* SMSC9505 USB Ethernet Device */
+		USB_DEVICE(0x0424, 0x9505),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9500A USB Ethernet Device */
+		USB_DEVICE(0x0424, 0x9E00),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9505A USB Ethernet Device */
+		USB_DEVICE(0x0424, 0x9E01),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
 		/* SMSC9512/9514 USB Hub & Ethernet Device */
 		USB_DEVICE(0x0424, 0xec00),
 		.driver_info = (unsigned long) &smsc95xx_info,
 	},
+	{
+		/* SMSC9500 USB Ethernet Device (SAL10) */
+		USB_DEVICE(0x0424, 0x9900),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9505 USB Ethernet Device (SAL10) */
+		USB_DEVICE(0x0424, 0x9901),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9500A USB Ethernet Device (SAL10) */
+		USB_DEVICE(0x0424, 0x9902),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9505A USB Ethernet Device (SAL10) */
+		USB_DEVICE(0x0424, 0x9903),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9512/9514 USB Hub & Ethernet Device (SAL10) */
+		USB_DEVICE(0x0424, 0x9904),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9500A USB Ethernet Device (HAL) */
+		USB_DEVICE(0x0424, 0x9905),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9505A USB Ethernet Device (HAL) */
+		USB_DEVICE(0x0424, 0x9906),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9500 USB Ethernet Device (Alternate ID) */
+		USB_DEVICE(0x0424, 0x9907),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9500A USB Ethernet Device (Alternate ID) */
+		USB_DEVICE(0x0424, 0x9908),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
+	{
+		/* SMSC9512/9514 USB Hub & Ethernet Device (Alternate ID) */
+		USB_DEVICE(0x0424, 0x9909),
+		.driver_info = (unsigned long) &smsc95xx_info,
+	},
 	{ },		/* END */
 };
 MODULE_DEVICE_TABLE(usb, products);
-- 
1.6.2.5


^ permalink raw reply related

* [PATCH] net: xilinx_emaclite: Fix problem with first incoming packet
From: John Linn @ 2009-09-22 15:24 UTC (permalink / raw)
  To: netdev, davem, linuxppc-dev, grant.likely, jwboyer,
	sadanand.mutyala
  Cc: Michal Simek

From: Michal Simek <monstr@monstr.eu>

You can't ping the board or connect to it unless you send
any packet out from board.

Tested-by: John Williams <john.williams@petalogix.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>
Acked-by: John Linn <john.linn@xilinx.com>
---
 drivers/net/xilinx_emaclite.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/xilinx_emaclite.c b/drivers/net/xilinx_emaclite.c
index dc22782..83a044d 100644
--- a/drivers/net/xilinx_emaclite.c
+++ b/drivers/net/xilinx_emaclite.c
@@ -134,18 +134,15 @@ static void xemaclite_enable_interrupts(struct net_local *drvdata)
 	}
 
 	/* Enable the Rx interrupts for the first buffer */
-	reg_data = in_be32(drvdata->base_addr + XEL_RSR_OFFSET);
 	out_be32(drvdata->base_addr + XEL_RSR_OFFSET,
-		 reg_data | XEL_RSR_RECV_IE_MASK);
+		 XEL_RSR_RECV_IE_MASK);
 
 	/* Enable the Rx interrupts for the second Buffer if
 	 * configured in HW */
 	if (drvdata->rx_ping_pong != 0) {
-		reg_data = in_be32(drvdata->base_addr + XEL_BUFFER_OFFSET +
-				   XEL_RSR_OFFSET);
 		out_be32(drvdata->base_addr + XEL_BUFFER_OFFSET +
 			 XEL_RSR_OFFSET,
-			 reg_data | XEL_RSR_RECV_IE_MASK);
+			 XEL_RSR_RECV_IE_MASK);
 	}
 
 	/* Enable the Global Interrupt Enable */
-- 
1.6.2.1



This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.



^ permalink raw reply related

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Ira W. Snyder @ 2009-09-22 15:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4AB89C48.4020903@redhat.com>

On Tue, Sep 22, 2009 at 12:43:36PM +0300, Avi Kivity wrote:
> On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
> >
> >> Sure, virtio-ira and he is on his own to make a bus-model under that, or
> >> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
> >> model can work, I agree.
> >>
> >>      
> > Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
> > virtio-s390. It isn't especially easy. I can steal lots of code from the
> > lguest bus model, but sometimes it is good to generalize, especially
> > after the fourth implemention or so. I think this is what GHaskins tried
> > to do.
> >    
> 
> Yes.  vbus is more finely layered so there is less code duplication.
> 
> The virtio layering was more or less dictated by Xen which doesn't have 
> shared memory (it uses grant references instead).  As a matter of fact 
> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that 
> part is duplicated.  It's probably possible to add a virtio-shmem.ko 
> library that people who do have shared memory can reuse.
> 

Seems like a nice benefit of vbus.

> > I've given it some thought, and I think that running vhost-net (or
> > similar) on the ppc boards, with virtio-net on the x86 crate server will
> > work. The virtio-ring abstraction is almost good enough to work for this
> > situation, but I had to re-invent it to work with my boards.
> >
> > I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
> > Remember that this is the "host" system. I used each 4K block as a
> > "device descriptor" which contains:
> >
> > 1) the type of device, config space, etc. for virtio
> > 2) the "desc" table (virtio memory descriptors, see virtio-ring)
> > 3) the "avail" table (available entries in the desc table)
> >    
> 
> Won't access from x86 be slow to this memory (on the other hand, if you 
> change it to main memory access from ppc will be slow... really depends 
> on how your system is tuned.
> 

Writes across the bus are fast, reads across the bus are slow. These are
just the descriptor tables for memory buffers, not the physical memory
buffers themselves.

These only need to be written by the guest (x86), and read by the host
(ppc). The host never changes the tables, so we can cache a copy in the
guest, for a fast detach_buf() implementation (see virtio-ring, which
I'm copying the design from).

The only accesses are writes across the PCI bus. There is never a need
to do a read (except for slow-path configuration).

> > Parts 2 and 3 are repeated three times, to allow for a maximum of three
> > virtqueues per device. This is good enough for all current drivers.
> >    
> 
> The plan is to switch to multiqueue soon.  Will not affect you if your 
> boards are uniprocessor or small smp.
> 

Everything I have is UP. I don't need extreme performance, either.
40MB/sec is the minimum I need to reach, though I'd like to have some
headroom.

For reference, using the CPU to handle data transfers, I get ~2MB/sec
transfers. Using the DMA engine, I've hit about 60MB/sec with my
"crossed-wires" virtio-net.

> > I've gotten plenty of email about this from lots of interested
> > developers. There are people who would like this kind of system to just
> > work, while having to write just some glue for their device, just like a
> > network driver. I hunch most people have created some proprietary mess
> > that basically works, and left it at that.
> >    
> 
> So long as you keep the system-dependent features hookable or 
> configurable, it should work.
> 
> > So, here is a desperate cry for help. I'd like to make this work, and
> > I'd really like to see it in mainline. I'm trying to give back to the
> > community from which I've taken plenty.
> >    
> 
> Not sure who you're crying for help to.  Once you get this working, post 
> patches.  If the patches are reasonably clean and don't impact 
> performance for the main use case, and if you can show the need, I 
> expect they'll be merged.
> 

In the spirit of "post early and often", I'm making my code available,
that's all. I'm asking anyone interested for some review, before I have
to re-code this for about the fifth time now. I'm trying to avoid
Haskins' situation, where he's invented and debugged a lot of new code,
and then been told to do it completely differently.

Yes, the code I posted is only compile-tested, because quite a lot of
code (kernel and userspace) must be working before anything works at
all. I hate to design the whole thing, then be told that something
fundamental about it is wrong, and have to completely re-write it.

Thanks for the comments,
Ira

> -- 
> error compiling committee.c: too many arguments to function
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: fanotify as syscalls
From: Andreas Gruenbacher @ 2009-09-22 15:31 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Jamie Lokier, Eric Paris, Linus Torvalds, Evgeniy Polyakov,
	David Miller, Linux Kernel Mailing List, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <alpine.DEB.2.00.0909211816020.1116@makko.or.mcafeemobile.com>

On Tuesday, 22 September 2009 16:51:39 Davide Libenzi wrote:
> On Tue, 22 Sep 2009, Jamie Lokier wrote:
> > I don't mind at all if fanotify is replaced by a general purpose "take
> > over the system call table" solution ...
>
> That was not what I meant ;)
> You'd register/unregister as syscall interceptor, receiving syscall number
> and parameters, you'd be able to return status/error codes directly, and
> you'd have the ability to eventually change the parameters. All this
> should be pretty trivial code, and at the same time give full syscall
> visibility to the modules.

The fatal flaw of syscall interception is race conditions: you look up a 
pathname in your interception layer; then when you call into the proper 
syscall, the kernel again looks up the same pathname. There is no way to 
guarantee that you end up at the same object in both lookups. The security 
and fsnotify hooks are placed in the appropriate spots to avoid exactly that.

Andreas

^ permalink raw reply

* Re: igb VF allocation with quirk_i82576_sriov
From: Alexander Duyck @ 2009-09-22 15:41 UTC (permalink / raw)
  To: Chris Wright
  Cc: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	Ronciak, John
In-Reply-To: <20090922051910.GC1035@sequoia.sous-sol.org>

Chris Wright wrote:
> Is this known to work?  During recent virt testing for upcoming Fedora 12,
> a box w/out SR-IOV support in BIOS was using quirk to create VF BAR space,
> VF allocation worked enough to assign a device to the guest, but igbvf
> was not actually functioning properly in the guest.
> 
> Is it worth debugging this further, or is it already a known issue?

You could be experiencing one of a couple different issues.

First when you say you started SR-IOV on a box w/out SR-IOV support I 
assume you are using "pci=assign-busses" in order to reserve the bus 
space for the VFs, is that correct?  Also while your system may not 
support SR-IOV does it at least support VT-d?  Without VT-d support you 
won't be able to assign a device to the guest.

My recommendations for further testing would be to test a VF on the host 
kernel to see if that works.  If it does then you could also try direct 
assigning an entire port to see if that works.  If the entire port 
doesn't work then you probably don't have VT-d enabled.

Thanks,

Alex

------------------------------------------------------------------------------
Come build with us! The BlackBerry&reg; Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay 
ahead of the curve. Join us from November 9&#45;12, 2009. Register now&#33;
http://p.sf.net/sfu/devconf

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-22 15:56 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Gregory Haskins, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <20090922152520.GA9154@ovro.caltech.edu>

On 09/22/2009 06:25 PM, Ira W. Snyder wrote:
>
>> Yes.  vbus is more finely layered so there is less code duplication.
>>
>> The virtio layering was more or less dictated by Xen which doesn't have
>> shared memory (it uses grant references instead).  As a matter of fact
>> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
>> part is duplicated.  It's probably possible to add a virtio-shmem.ko
>> library that people who do have shared memory can reuse.
>>
>>      
> Seems like a nice benefit of vbus.
>    

Yes, it is.  With some work virtio can gain that too (virtio-shmem.ko).

>>> I've given it some thought, and I think that running vhost-net (or
>>> similar) on the ppc boards, with virtio-net on the x86 crate server will
>>> work. The virtio-ring abstraction is almost good enough to work for this
>>> situation, but I had to re-invent it to work with my boards.
>>>
>>> I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
>>> Remember that this is the "host" system. I used each 4K block as a
>>> "device descriptor" which contains:
>>>
>>> 1) the type of device, config space, etc. for virtio
>>> 2) the "desc" table (virtio memory descriptors, see virtio-ring)
>>> 3) the "avail" table (available entries in the desc table)
>>>
>>>        
>> Won't access from x86 be slow to this memory (on the other hand, if you
>> change it to main memory access from ppc will be slow... really depends
>> on how your system is tuned.
>>
>>      
> Writes across the bus are fast, reads across the bus are slow. These are
> just the descriptor tables for memory buffers, not the physical memory
> buffers themselves.
>
> These only need to be written by the guest (x86), and read by the host
> (ppc). The host never changes the tables, so we can cache a copy in the
> guest, for a fast detach_buf() implementation (see virtio-ring, which
> I'm copying the design from).
>
> The only accesses are writes across the PCI bus. There is never a need
> to do a read (except for slow-path configuration).
>    

Okay, sounds like what you're doing it optimal then.

> In the spirit of "post early and often", I'm making my code available,
> that's all. I'm asking anyone interested for some review, before I have
> to re-code this for about the fifth time now. I'm trying to avoid
> Haskins' situation, where he's invented and debugged a lot of new code,
> and then been told to do it completely differently.
>
> Yes, the code I posted is only compile-tested, because quite a lot of
> code (kernel and userspace) must be working before anything works at
> all. I hate to design the whole thing, then be told that something
> fundamental about it is wrong, and have to completely re-write it.
>    

Understood.  Best to get a review from Rusty then.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] netfilter: nf_nat_helper: tidy up adjust_tcp_sequence
From: Hannes Eder @ 2009-09-22 16:00 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel, linux-kernel

The variable 'other_way' gets initialized but is not read afterwards,
so remove it.  Pass the right arguments to a pr_debug call.

While being at tidy up a bit and it fix this checkpatch warning:
  WARNING: suspect code indent for conditional statements

Signed-off-by: Hannes Eder <heder@google.com>

 net/ipv4/netfilter/nf_nat_helper.c |   21 ++++++++-------------
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/netfilter/nf_nat_helper.c b/net/ipv4/netfilter/nf_nat_helper.c
index 09172a6..56daa1b 100644
--- a/net/ipv4/netfilter/nf_nat_helper.c
+++ b/net/ipv4/netfilter/nf_nat_helper.c
@@ -41,18 +41,13 @@ adjust_tcp_sequence(u32 seq,
 		    struct nf_conn *ct,
 		    enum ip_conntrack_info ctinfo)
 {
-	int dir;
-	struct nf_nat_seq *this_way, *other_way;
+	int dir = CTINFO2DIR(ctinfo);
 	struct nf_conn_nat *nat = nfct_nat(ct);
+	struct nf_nat_seq *this_way = &nat->seq[dir];
 
-	pr_debug("adjust_tcp_sequence: seq = %u, sizediff = %d\n", seq, seq);
-
-	dir = CTINFO2DIR(ctinfo);
-
-	this_way = &nat->seq[dir];
-	other_way = &nat->seq[!dir];
+	pr_debug("%s(): seq = %u, sizediff = %d\n", __func__, seq, sizediff);
 
-	pr_debug("nf_nat_resize_packet: Seq_offset before: ");
+	pr_debug("%s(): Seq_offset before: ", __func__);
 	DUMP_OFFSET(this_way);
 
 	spin_lock_bh(&nf_nat_seqofs_lock);
@@ -63,13 +58,13 @@ adjust_tcp_sequence(u32 seq,
 	 * retransmit */
 	if (this_way->offset_before == this_way->offset_after ||
 	    before(this_way->correction_pos, seq)) {
-		   this_way->correction_pos = seq;
-		   this_way->offset_before = this_way->offset_after;
-		   this_way->offset_after += sizediff;
+		this_way->correction_pos = seq;
+		this_way->offset_before = this_way->offset_after;
+		this_way->offset_after += sizediff;
 	}
 	spin_unlock_bh(&nf_nat_seqofs_lock);
 
-	pr_debug("nf_nat_resize_packet: Seq_offset after: ");
+	pr_debug("%s(): Seq_offset after: ", __func__);
 	DUMP_OFFSET(this_way);
 }
 


^ permalink raw reply related

* Re: fanotify as syscalls
From: Davide Libenzi @ 2009-09-22 16:04 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Jamie Lokier, Eric Paris, Linus Torvalds, Evgeniy Polyakov,
	David Miller, Linux Kernel Mailing List, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <200909221731.34717.agruen@suse.de>

On Tue, 22 Sep 2009, Andreas Gruenbacher wrote:

> The fatal flaw of syscall interception is race conditions: you look up a 
> pathname in your interception layer; then when you call into the proper 
> syscall, the kernel again looks up the same pathname. There is no way to 
> guarantee that you end up at the same object in both lookups. The security 
> and fsnotify hooks are placed in the appropriate spots to avoid exactly that.

Fatal? You mean, for this corner case that the anti-malware industry lived 
with for so much time (in Linux and Windows), you're prepared in pushing 
all the logic that is currently implemented into their modules, into the 
kernel?
This includes process whitelisting, path whitelisting, caches, userspace 
access API definition, and so on? On top of providing a generally more 
limited interception.
Why don't we instead offer a lower and broader level of interception, 
letting the users decide if such fatal flaw needs to be addressed or 
not, in their modules?
They get a broader inteception layer, with the option to decide if or if 
not address certain scenarios, and we get less code inside the kernel.
A win/win situation, if you ask me.

- Davide

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox