Netdev List
 help / color / mirror / Atom feed
* Re: rps testing questions
From: Ben Hutchings @ 2011-01-18 18:34 UTC (permalink / raw)
  To: Rick Jones; +Cc: mi wake, netdev
In-Reply-To: <4D35DAB0.9030201@hp.com>

On Tue, 2011-01-18 at 10:23 -0800, Rick Jones wrote:
> Ben Hutchings wrote:
> > On Mon, 2011-01-17 at 17:43 +0800, mi wake wrote:
[...]
> >>I do ab and tbench testing also find there is less tps with enable
> >>rps.but,there is more cpu using when with enable rps.when with enable
> >>rps ,softirqs is blanced  on cpus.
> >>
> >>is there something wrong with my test?
> > 
> > 
> > In addition to what Eric said, check the interrupt moderation settings
> > (ethtool -c/-C options).  One-way latency for a single request/response
> > test will be at least the interrupt moderation value.
> > 
> > I haven't tested RPS by itself (Solarflare NICs have plenty of hardware
> > queues) so I don't know whether it can improve latency.  However, RFS
> > certainly does when there are many flows.
> 
> Is there actually an expectation that either RPS or RFS would improve *latency*? 
>   Multiple-stream throughput certainly, but with the additional work done to 
> spread things around, I wouldn't expect either to improve latency.

Yes, it seems to make a big improvement to latency when many flows are
active.  Tom told me that one of his benchmarks was 200 * netperf TCP_RR
in parallel, and I've seen over 40% reduction in latency for that.  That
said, allocating more RX queues might also help (sfc currently defaults
to one per processor package rather than one per processor thread, due
to concerns about CPU efficiency).

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: rps testing questions
From: Rick Jones @ 2011-01-18 18:23 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: mi wake, netdev
In-Reply-To: <1295269713.3700.5.camel@localhost>

Ben Hutchings wrote:
> On Mon, 2011-01-17 at 17:43 +0800, mi wake wrote:
> 
>>I do a rps(Receive Packet Steering) testing on centos 5.5 with  kernel 2.6.37.
>>cpu: 8 core Intel.
>>ethernet adapter: bnx2x
>>
>>Problem statement:
>>enable rps with:
>>echo "ff" > /sys/class/net/eth2/queues/rx-0/rps_cpus.
>>
>>running 1 instances of netperf TCP_RR: netperf  -t TCP_RR -H 192.168.0.1 -c -C
>>without rps: 9963.48(Trans Rate per sec)
>>with rps:  9387.59(Trans Rate per sec)

Presumably there was an increase in service demand corresponding with the drop 
in transactions per second.

Also, an unsolicited benchmarking style tip or two.  I find it helpful to either 
do several discrete runs, or use the confidence intervals (global -i and -I 
options) with the TCP_RR tests when I am looking to compare two settings.  I 
find a bit more "variability" in the _RR tests than the _STREAM tests.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002dI_002c-Global-26

Pinning netperf/netserver is also something I tend to do, but combining that 
with  confidence intervals, RPS is kind of difficult - the successive data 
connections made while running the iterations of the confidence intervals will 
have different port numbers and so different hashing.  That would cause RPS to 
put the connections on different cores in turn, which would, in conjunction with 
netperf/netserver being pinned to a core cause the relationship between where 
netperf runs and where netserver runs to change.  That will likely result in 
cache to cache (processor cache) transfers which will definitely up the service 
demand and drop the single-stream transactions per second.

In theory :) with RFS that should not be an issue since where netperf/netserver 
are pinned controls where the inbound processing takes place.

We are in a maze of twisty heuristics... :)

>>I do ab and tbench testing also find there is less tps with enable
>>rps.but,there is more cpu using when with enable rps.when with enable
>>rps ,softirqs is blanced  on cpus.
>>
>>is there something wrong with my test?
> 
> 
> In addition to what Eric said, check the interrupt moderation settings
> (ethtool -c/-C options).  One-way latency for a single request/response
> test will be at least the interrupt moderation value.
> 
> I haven't tested RPS by itself (Solarflare NICs have plenty of hardware
> queues) so I don't know whether it can improve latency.  However, RFS
> certainly does when there are many flows.

Is there actually an expectation that either RPS or RFS would improve *latency*? 
  Multiple-stream throughput certainly, but with the additional work done to 
spread things around, I wouldn't expect either to improve latency.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:28 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alessandro Suardi, jamal, David Miller, pablo, arthur.marsh,
	eric.dumazet, netdev
In-Reply-To: <alpine.LNX.2.01.1101181921320.17010@obet.zrqbmnf.qr>

On Tue, Jan 18, 2011 at 07:24:40PM +0100, Jan Engelhardt wrote:
> 
> On Tuesday 2011-01-18 19:10, Alessandro Suardi wrote:
> >On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> >>
> >> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> >> tests message type to verify this. Since genetlink can't do the same
> >> use "practical" test for ops->dumpit (assuming NEW request won't be
> >> mixed with GET).
...
> >2.6.37-git18 + netlink revert + this patch
> > - fixes Avahi
> > - breaks acpid
> >Starting acpi daemon: RTNETLINK1 answers: Operation not supported
> >acpid: error talking to the kernel via netlink
> 
> Deducing from that, it is a GET-like request that was sent by acpid, 
> and the message type is one that has both a dumpit and a doit function.
> So if EOPNOTSUPP now occurs on all message types that have both dumpit 
> and doit, you should have broken a lot more than just acpid.

Right, we need something better here.

Jarek P.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jan Engelhardt @ 2011-01-18 18:24 UTC (permalink / raw)
  To: Alessandro Suardi
  Cc: Jarek Poplawski, jamal, David Miller, pablo, arthur.marsh,
	eric.dumazet, netdev
In-Reply-To: <AANLkTikUPRfYupw1-Py5mHeN+pFv-zMKhmK3DKixxsH4@mail.gmail.com>


On Tuesday 2011-01-18 19:10, Alessandro Suardi wrote:
>On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
>>
>> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
>> tests message type to verify this. Since genetlink can't do the same
>> use "practical" test for ops->dumpit (assuming NEW request won't be
>> mixed with GET).
>>
>> -       if (nlh->nlmsg_flags & NLM_F_DUMP) {
>> -               if (ops->dumpit == NULL)
>> +       if (ops->dumpit) {
>> +               if (nlh->nlmsg_flags & NLM_F_DUMP) {
>> +                       genl_unlock();
>> +                       err = netlink_dump_start(net->genl_sock, skb, nlh,
>> +                                                ops->dumpit, ops->done);
>> +                       genl_lock();
>> +                       return err;
>> +               } else {
>>                        return -EOPNOTSUPP;
>> -
>> -               genl_unlock();
>> -               err = netlink_dump_start(net->genl_sock, skb, nlh,
>> -                                        ops->dumpit, ops->done);
>> -               genl_lock();
>> -               return err;
>> +               }
>
>2.6.37-git18 + netlink revert + this patch
> - fixes Avahi
> - breaks acpid
>Starting acpi daemon: RTNETLINK1 answers: Operation not supported
>acpid: error talking to the kernel via netlink

Deducing from that, it is a GET-like request that was sent by acpid, 
and the message type is one that has both a dumpit and a doit function.
So if EOPNOTSUPP now occurs on all message types that have both dumpit 
and doit, you should have broken a lot more than just acpid.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:23 UTC (permalink / raw)
  To: Alessandro Suardi
  Cc: jamal, David Miller, pablo, arthur.marsh, jengelh, eric.dumazet,
	netdev
In-Reply-To: <AANLkTikUPRfYupw1-Py5mHeN+pFv-zMKhmK3DKixxsH4@mail.gmail.com>

On Tue, Jan 18, 2011 at 07:10:35PM +0100, Alessandro Suardi wrote:
> On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > [PATCH] netlink: Fix possible NLM_F_DUMP misuse in genetlink
> >
> > NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> > tests message type to verify this. Since genetlink can't do the same
> > use "practical" test for ops->dumpit (assuming NEW request won't be
> > mixed with GET).
> >
> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> > Cc: Jan Engelhardt <jengelh@medozas.de>
> > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > Cc: Jamal Hadi Salim <hadi@cyberus.ca>
> > ---
> > Not for stable before testing!
> >
> > diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> > --- a/net/netlink/genetlink.c   2011-01-18 16:58:16.000000000 +0100
> > +++ b/net/netlink/genetlink.c   2011-01-18 17:08:43.000000000 +0100
> > @@ -519,15 +519,16 @@ static int genl_rcv_msg(struct sk_buff *
...
> 2.6.37-git18 + netlink revert + this patch
>  - fixes Avahi
>  - breaks acpid

Well, then fixing genetlink needs more than this and I withdraw this
patch. Anyway, netlink revert is IMHO needed to prevent the regression.

...
> If more debugging/testing is needed, do ping me. Thanks,

Many thanks for such a fast response!
Jarek P.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:11 UTC (permalink / raw)
  To: jamal; +Cc: David Miller, pablo, arthur.marsh, jengelh, eric.dumazet, netdev
In-Reply-To: <1295359503.2066.10.camel@mojatatu>

On Tue, Jan 18, 2011 at 09:05:03AM -0500, jamal wrote:
> On Tue, 2011-01-18 at 10:24 +0000, Jarek Poplawski wrote:
> 
> > Do you all expect all users manage to upgrade avahi app before
> > changing their stable kernel? I mean "own distro" users especially.
> 
> Unfortunately if that app is widely deployed, it is not pragmatic
> to break it in the name of pedanticity. i.e.
> Be conservative in what you send (clearly Avahi is farting all over the
> place) but more importantly be a nice liberal in what you accept.
> Maybe tell them if you have the cycles about their bad behavior.
> The important part is the GET (kind = 2). The DUMP as Jarek says
> is merely a utility to extrapolate that we need you to
> get everything.. So it is not such a big deal if someone passes
> extraneous senseless flags. Wrong? yes.

Thanks for the confirmation of my suspicions wrt. the RFC & rtnetlink.
But then Avahi seems right and we should get back to the written law.

> Jarek - For the record although i coauthored the doc, I was merely
> a messenger in putting together some artist painting.

Well, ain't pictures better than words ;-)

Cheers,
Jarek P.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Alessandro Suardi @ 2011-01-18 18:10 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: jamal, David Miller, pablo, arthur.marsh, jengelh, eric.dumazet,
	netdev
In-Reply-To: <20110118172340.GB1843@del.dom.local>

On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> [PATCH] netlink: Fix possible NLM_F_DUMP misuse in genetlink
>
> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> tests message type to verify this. Since genetlink can't do the same
> use "practical" test for ops->dumpit (assuming NEW request won't be
> mixed with GET).
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> Cc: Jan Engelhardt <jengelh@medozas.de>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Jamal Hadi Salim <hadi@cyberus.ca>
> ---
> Not for stable before testing!
>
> diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> --- a/net/netlink/genetlink.c   2011-01-18 16:58:16.000000000 +0100
> +++ b/net/netlink/genetlink.c   2011-01-18 17:08:43.000000000 +0100
> @@ -519,15 +519,16 @@ static int genl_rcv_msg(struct sk_buff *
>            security_netlink_recv(skb, CAP_NET_ADMIN))
>                return -EPERM;
>
> -       if (nlh->nlmsg_flags & NLM_F_DUMP) {
> -               if (ops->dumpit == NULL)
> +       if (ops->dumpit) {
> +               if (nlh->nlmsg_flags & NLM_F_DUMP) {
> +                       genl_unlock();
> +                       err = netlink_dump_start(net->genl_sock, skb, nlh,
> +                                                ops->dumpit, ops->done);
> +                       genl_lock();
> +                       return err;
> +               } else {
>                        return -EOPNOTSUPP;
> -
> -               genl_unlock();
> -               err = netlink_dump_start(net->genl_sock, skb, nlh,
> -                                        ops->dumpit, ops->done);
> -               genl_lock();
> -               return err;
> +               }
>        }
>
>        if (ops->doit == NULL)
>

2.6.37-git18 + netlink revert + this patch
 - fixes Avahi
 - breaks acpid

Upon startup I have:

Starting acpi daemon: RTNETLINK1 answers: Operation not supported
acpid: error talking to the kernel via netlink




From strace output:

open("/dev/input/event6", O_RDONLY|O_NONBLOCK) = 8
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
ioctl(8, 0x80604520, 0x7fffb3418550)    = 8
ioctl(8, 0x80604521, 0x7fffb34185b0)    = 96
open("/dev/input/event8", O_RDONLY|O_NONBLOCK) = 9
fcntl(9, F_SETFD, FD_CLOEXEC)           = 0
ioctl(9, 0x80604520, 0x7fffb3418550)    = 8
ioctl(9, 0x80604532, 0x7fffb3418c10)    = 8
close(9)                                = 0
inotify_init()                          = 9
inotify_add_watch(9, "/dev/input", IN_CREATE) = 1
socket(PF_NETLINK, SOCK_RAW, 16)        = 10
setsockopt(10, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(10, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
bind(10, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(10, {sa_family=AF_NETLINK, pid=4298, groups=00000000}, [12]) = 0
sendmsg(10, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"$\0\0\0\20\0\5\0\346\3265M\0\0\0\0\3\0\0\0\17\0\2\0acpi_eve"...,
36}], msg_controllen=0, msg_flags=0}, 0) = 36
recvmsg(10, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"8\0\0\0\2\0\0\0\346\3265M\312\20\0\0\241\377\377\377$\0\0\0\20\0\5\0\346\3265M"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 56
dup(2)                                  = 11
fcntl(11, F_GETFL)                      = 0x1 (flags O_WRONLY)
close(11)                               = 0
write(2, "RTNETLINK1 answers: Operation no"..., 44RTNETLINK1 answers:
Operation not supported
) = 44
write(2, "acpid: error talking to the kern"..., 47acpid: error talking
to the kernel via netlink
) = 47
close(10)                               = 0
socket(PF_NETLINK, SOCK_RAW, 16)        = 10
setsockopt(10, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(10, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
bind(10, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(10, {sa_family=AF_NETLINK, pid=4298, groups=00000000}, [12]) = 0
unlink("/var/run/acpid.socket")         = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 11
bind(11, {sa_family=AF_FILE, path="/var/run/acpid.socket"}, 110) = 0
listen(11, 10)                          = 0



If more debugging/testing is needed, do ping me. Thanks,

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")

^ permalink raw reply

* [PATCHv2] vhost: rcu annotation fixup
From: Michael S. Tsirkin @ 2011-01-18 18:08 UTC (permalink / raw)
  To: Paul E. McKenney, jasowang
  Cc: Michael S. Tsirkin, kvm, virtualization, netdev, linux-kernel

When built with rcu checks enabled, vhost triggers
bogus warnings as vhost features are read without
dev->mutex sometimes, and private pointer is read
with our kind of rcu where work serves as a
read side critical section.

Fixing it properly is not trivial.
Disable the warnings by stubbing out the checks for now.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v1: add TODO, fix more warnings.

 drivers/vhost/net.c   |    9 +++++----
 drivers/vhost/vhost.h |    6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..f616cef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -128,8 +128,7 @@ static void handle_tx(struct vhost_net *net)
 	size_t hdr_size;
 	struct socket *sock;
 
-	/* TODO: check that we are running from vhost_worker?
-	 * Not sure it's worth it, it's straight-forward enough. */
+	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock)
 		return;
@@ -306,7 +305,8 @@ static void handle_rx_big(struct vhost_net *net)
 	size_t len, total_len = 0;
 	int err;
 	size_t hdr_size;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
 
@@ -415,7 +415,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 	int err, headcount;
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2af44b7..b3363ae 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -173,9 +173,9 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features;
 
-	acked_features =
-		rcu_dereference_index_check(dev->acked_features,
-					    lockdep_is_held(&dev->mutex));
+	/* TODO: check that we are running from vhost_worker or dev mutex is
+	 * held? */
+	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
 	return acked_features & (1 << bit);
 }
 
-- 
1.7.3.2.91.g446ac

^ permalink raw reply related

* Re: [PATCH] af_unix: implement socket filter
From: Alban Crequy @ 2011-01-18 17:51 UTC (permalink / raw)
  To: Eric Dumazet, Ian Molton
  Cc: netdev, linux-kernel, davem, ebiederm, xemul, davidel
In-Reply-To: <1295371361.3290.15.camel@edumazet-laptop>

Le Tue, 18 Jan 2011 18:22:41 +0100,
Eric Dumazet <eric.dumazet@gmail.com> a écrit :

> Le mardi 18 janvier 2011 à 16:39 +0000, Ian Molton a écrit :
> > From: Alban Crequy <alban.crequy@collabora.co.uk>
> > 
> > Linux Socket Filters can already be successfully attached and
> > detached on unix sockets with setsockopt(sockfd, SOL_SOCKET,
> > SO_{ATTACH,DETACH}_FILTER, ....). See:
> > Documentation/networking/filter.txt
> > 
> > But the filter was never used in the unix socket code so it did not
> > work. This patch uses sk_filter() to filter buffers before delivery.
> > 
> > This short program demonstrates the problem on SOCK_DGRAM.

By the way, the patch implements socket filters on SOCK_DGRAM and
SOCK_SEQPACKET but not SOCK_STREAM. Socket filters does not make sense
to me when there is no packet boundaries. But if there is a need for
it, the code for SOCK_STREAM could be added easily.

> Any idea on performance cost adding sk_filter() call ?

Ian will write a performance test and repost the patch with some stats.
I don't know about the performance cost.

> Hmm, looking at it, I have no idea why sk_filter() needs to block BH.

I don't know neither.

> I'll send a patch to relax this requirement.

Thanks for your review!

Alban

^ permalink raw reply

* Re: bnx2 cards intermittantly going offline
From: Michael Chan @ 2011-01-18 17:55 UTC (permalink / raw)
  To: Mills, Tony; +Cc: netdev@vger.kernel.org
In-Reply-To: <6DD3782C33561D44B47071B09946026405F63853AB@exchange1>


On Tue, 2011-01-18 at 02:54 -0800, Mills, Tony wrote:
> Last night i setup a machine to monitor overnight and at 3:52 this
> morning it became unresponsive. 
> 

When it becomes unresponsive, please send some packets to the NIC (such
as ping) and monitor statistics with ethtool -S.  See if the packets are
being received or discarded.  Also, run tcpdump on the machine to see if
the packets are properly received by the stack.  Thanks.



^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Kirill Smelkov @ 2011-01-18 17:56 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: Nicolas de Pesloüan, John Fastabend, Jay Vosburgh,
	David S. Miller, netdev@vger.kernel.org,
	Sébastien Barré, Christophe Paasch
In-Reply-To: <4D35B1B0.2090905@yandex-team.ru>

On Tue, Jan 18, 2011 at 06:28:48PM +0300, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> Le 18/01/2011 13:40, Oleg V. Ukhno a écrit :
[...]

>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to setup test environment,  
> several days may be.
> You mean following topology:
>           switch 1
>        /           \
> host A                host B
>        \  switch 2 /
>

FYI: I'm in the process of developing new redundancy mode for bonding,
and while at it, the following script is maybe useful for you too, so
that bonding testing can be done entirely on one host:

http://repo.or.cz/w/linux-2.6/kirr.git/blob/refs/heads/x/etherdup:/tools/bonding/mk-tap-loops.sh


Sorry for maybe being offtopic,
Kirill

^ permalink raw reply

* Re: [PATCH] vhost: rcu annotation fixup
From: Michael S. Tsirkin @ 2011-01-18 17:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20110118174834.GF2193@linux.vnet.ibm.com>

On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > When built with rcu checks enabled, vhost triggers
> > bogus warnings as vhost features are read without
> > dev->mutex sometimes.
> > Fixing it properly is not trivial as vhost.h does not
> > know which lockdep classes it will be used under.
> > Disable the warning by stubbing out the check for now.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  drivers/vhost/vhost.h |    4 +---
> >  1 files changed, 1 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 2af44b7..2d03a31 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> >  {
> >  	unsigned acked_features;
> > 
> > -	acked_features =
> > -		rcu_dereference_index_check(dev->acked_features,
> > -					    lockdep_is_held(&dev->mutex));
> > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
> 
> Ouch!!!
> 
> Could you please at least add a comment?

Yes, OK.

> Alternatively, pass in the lock that is held and check for that?  Given
> that this is a static inline, the compiler should be able to optimize
> the argument away when !PROVE_RCU, correct?
> 
> 							Thanx, Paul

Hopefully, yes. We don't always have a lock: the idea was
to create a lockdep for these cases. But we can't pass
the pointer to that ...

> >  	return acked_features & (1 << bit);
> >  }
> > 
> > -- 
> > 1.7.3.2.91.g446ac
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH] vhost: rcu annotation fixup
From: Paul E. McKenney @ 2011-01-18 17:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20110118110845.GA11555@redhat.com>

On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> When built with rcu checks enabled, vhost triggers
> bogus warnings as vhost features are read without
> dev->mutex sometimes.
> Fixing it properly is not trivial as vhost.h does not
> know which lockdep classes it will be used under.
> Disable the warning by stubbing out the check for now.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vhost/vhost.h |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 2af44b7..2d03a31 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
>  {
>  	unsigned acked_features;
> 
> -	acked_features =
> -		rcu_dereference_index_check(dev->acked_features,
> -					    lockdep_is_held(&dev->mutex));
> +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);

Ouch!!!

Could you please at least add a comment?

Alternatively, pass in the lock that is held and check for that?  Given
that this is a static inline, the compiler should be able to optimize
the argument away when !PROVE_RCU, correct?

							Thanx, Paul

>  	return acked_features & (1 << bit);
>  }
> 
> -- 
> 1.7.3.2.91.g446ac
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* [PATCH] net: filter: dont block softirqs in sk_run_filter()
From: Eric Dumazet @ 2011-01-18 17:46 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-kernel, ebiederm, xemul, davidel, Alban Crequy,
	Ian Molton
In-Reply-To: <1295371361.3290.15.camel@edumazet-laptop>

Packet filter (BPF) doesnt need to disable softirqs, being fully
re-entrant and lock-less.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h     |    2 +-
 net/core/filter.c      |    6 +++---
 net/packet/af_packet.c |    6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d884d26..ba6465b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1189,7 +1189,7 @@ extern void sk_filter_release_rcu(struct rcu_head *rcu);
 static inline void sk_filter_release(struct sk_filter *fp)
 {
 	if (atomic_dec_and_test(&fp->refcnt))
-		call_rcu_bh(&fp->rcu, sk_filter_release_rcu);
+		call_rcu(&fp->rcu, sk_filter_release_rcu);
 }
 
 static inline void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
diff --git a/net/core/filter.c b/net/core/filter.c
index afc5837..232b187 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -142,14 +142,14 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		return err;
 
-	rcu_read_lock_bh();
-	filter = rcu_dereference_bh(sk->sk_filter);
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
 	if (filter) {
 		unsigned int pkt_len = sk_run_filter(skb, filter->insns);
 
 		err = pkt_len ? pskb_trim(skb, pkt_len) : -EPERM;
 	}
-	rcu_read_unlock_bh();
+	rcu_read_unlock();
 
 	return err;
 }
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 91cb1d7..c3fc7b7 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -523,11 +523,11 @@ static inline unsigned int run_filter(const struct sk_buff *skb,
 {
 	struct sk_filter *filter;
 
-	rcu_read_lock_bh();
-	filter = rcu_dereference_bh(sk->sk_filter);
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL)
 		res = sk_run_filter(skb, filter->insns);
-	rcu_read_unlock_bh();
+	rcu_read_unlock();
 
 	return res;
 }



^ permalink raw reply related

* Re: [PATCH] net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
From: Michael Chan @ 2011-01-18 17:21 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <AANLkTikmadouBTgkf4OZy30=Zg6uZ+qSng=rWhTjjm8e@mail.gmail.com>


On Mon, 2011-01-17 at 23:12 -0800, Jesse Gross wrote:
> On Tue, Jan 18, 2011 at 1:55 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le lundi 17 janvier 2011 à 22:46 -0800, Jesse Gross a écrit :
> >> In netif_skb_features() we return only the features that are valid for vlans
> >> if we have a vlan packet.  However, we should not mask out NETIF_F_HW_VLAN_TX
> >> since it enables transmission of vlan tags and is obviously valid.
> >>
> >> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> >> Signed-off-by: Jesse Gross <jesse@nicira.com>
> >
> > Thanks Jesse
> >
> > Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Now back to the "ethtool -K eth0 txvlan off" problem on bnx2
> >
> > Is it a driver/software problem or hardware/firmware one ?
> 
> CC'ing Michael Chan
> 
> It looks like bnx2 is storing the offsets of various headers so the
> hardware can find them for TSO.  The parsing logic doesn't do anything
> for vlan tags, so the hardware gets confused if one is present in the
> packet itself.

Yeah, I don't think the hardware/firmware can replicate the vlan tag +
headers properly for TSO if there is a VLAN tag in the packet.  Even
simple tx checksum offload may have problem, but I'll need to check.

> 
> Quick fix is to simply disallow disabling TX vlan offload or disable
> TSO at the same time or some other Ethtool game.  However, if the
> hardware supports it then it would be nicer to fix up the TSO setup
> logic.  Maybe we can just add the size of the vlan tag to the offset
> but I am not certain.  Michael, do you know if this is possible?
> 

I doubt it, as the VLAN tag needs to be replicated for each transmitted
packet.  I'll check with the firmware/hardware guys.

Thanks.



^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 17:23 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, pablo, arthur.marsh, jengelh, eric.dumazet, netdev,
	Alessandro Suardi

[PATCH] netlink: Fix possible NLM_F_DUMP misuse in genetlink

NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
tests message type to verify this. Since genetlink can't do the same
use "practical" test for ops->dumpit (assuming NEW request won't be
mixed with GET).

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
---
Not for stable before testing!

diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
--- a/net/netlink/genetlink.c	2011-01-18 16:58:16.000000000 +0100
+++ b/net/netlink/genetlink.c	2011-01-18 17:08:43.000000000 +0100
@@ -519,15 +519,16 @@ static int genl_rcv_msg(struct sk_buff *
 	    security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (nlh->nlmsg_flags & NLM_F_DUMP) {
-		if (ops->dumpit == NULL)
+	if (ops->dumpit) {
+		if (nlh->nlmsg_flags & NLM_F_DUMP) {
+			genl_unlock();
+			err = netlink_dump_start(net->genl_sock, skb, nlh,
+						 ops->dumpit, ops->done);
+			genl_lock();
+			return err;
+		} else {
 			return -EOPNOTSUPP;
-
-		genl_unlock();
-		err = netlink_dump_start(net->genl_sock, skb, nlh,
-					 ops->dumpit, ops->done);
-		genl_lock();
-		return err;
+		}
 	}
 
 	if (ops->doit == NULL)

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 17:22 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, pablo, arthur.marsh, jengelh, eric.dumazet, netdev,
	Alessandro Suardi
In-Reply-To: <1295359653.2066.11.camel@mojatatu>

On Tue, Jan 18, 2011 at 09:07:33AM -0500, jamal wrote:
> On Tue, 2011-01-18 at 09:05 -0500, jamal wrote:
> > On Tue, 2011-01-18 at 10:24 +0000, Jarek Poplawski wrote:
> > 
> > > Do you all expect all users manage to upgrade avahi app before
> > > changing their stable kernel? I mean "own distro" users especially.
> > 
> > Unfortunately if that app is widely deployed, it is not pragmatic
> > to break it in the name of pedanticity. 
> 
> And to complete that thought - if avahi continues to work and merely
> whines, i dont see why this patch should be reverted..

Alas it looks like more than whines ;-)
Cheers,
Jarek P.

[PATCH] netlink: Revert test for all flags of the NLM_F_DUMP composite

Arthur Marsh reported:
> With kernel 2.6.37-git9 and later inbound telnetd-ssl connections
> failed, and on machine shut-down, there were warning messages about
> daemons not return status.
and bisected the bug.
 
After looking at net/core/rtnetlink.c:

static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
{
	...
        kind = type&3;

        if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN))
                return -EPERM;

        if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {

I'm sure the fix in commit 0ab03c2b1478f24 is simply wrong. NLM_F_DUMP
flags can't be mistaken with NLM_F_EXCL if there is a check for GET
request like above (ie. kind == 2). So there is no reason to limit 3
various dump cases from the RFC (not counting the atomic flag) to one
and only NLM_F_DUMP.

That's why I propose this reverting patch here and a separate fix to
genetlink (soon).

Reverts commit 0ab03c2b1478f2438d2c80204f7fef65b1bca9cf

Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Bisected-by: Arthur Marsh <arthur.marsh@internode.on.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
---

 net/core/rtnetlink.c                 |    2 +-
 net/ipv4/inet_diag.c                 |    2 +-
 net/netfilter/nf_conntrack_netlink.c |    4 ++--
 net/netlink/genetlink.c              |    2 +-
 net/xfrm/xfrm_user.c                 |    2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a5f7535..750db57 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1820,7 +1820,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (kind == 2 && (nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
 		struct sock *rtnl;
 		rtnl_dumpit_func dumpit;
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 2746c1f..2ada171 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -858,7 +858,7 @@ static int inet_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	    nlmsg_len(nlh) < hdrlen)
 		return -EINVAL;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		if (nlmsg_attrlen(nlh, hdrlen)) {
 			struct nlattr *attr;
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 2b7eef3..93297aa 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -924,7 +924,7 @@ ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	u16 zone;
 	int err;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP)
+	if (nlh->nlmsg_flags & NLM_F_DUMP)
 		return netlink_dump_start(ctnl, skb, nlh, ctnetlink_dump_table,
 					  ctnetlink_done);
 
@@ -1787,7 +1787,7 @@ ctnetlink_get_expect(struct sock *ctnl, struct sk_buff *skb,
 	u16 zone;
 	int err;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		return netlink_dump_start(ctnl, skb, nlh,
 					  ctnetlink_exp_dump_table,
 					  ctnetlink_exp_done);
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f83cb37..1781d99 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -519,7 +519,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	    security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		if (ops->dumpit == NULL)
 			return -EOPNOTSUPP;
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index d5e1e0b..6129196 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2189,7 +2189,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 	if ((type == (XFRM_MSG_GETSA - XFRM_MSG_BASE) ||
 	     type == (XFRM_MSG_GETPOLICY - XFRM_MSG_BASE)) &&
-	    (nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	    (nlh->nlmsg_flags & NLM_F_DUMP)) {
 		if (link->dump == NULL)
 			return -EINVAL;
 

^ permalink raw reply related

* Re: [PATCH] af_unix: implement socket filter
From: Eric Dumazet @ 2011-01-18 17:22 UTC (permalink / raw)
  To: Ian Molton
  Cc: netdev, linux-kernel, davem, ebiederm, xemul, davidel,
	Alban Crequy
In-Reply-To: <1295368755-20931-1-git-send-email-ian.molton@collabora.co.uk>

Le mardi 18 janvier 2011 à 16:39 +0000, Ian Molton a écrit :
> From: Alban Crequy <alban.crequy@collabora.co.uk>
> 
> Linux Socket Filters can already be successfully attached and detached on unix
> sockets with setsockopt(sockfd, SOL_SOCKET, SO_{ATTACH,DETACH}_FILTER, ...).
> See: Documentation/networking/filter.txt
> 
> But the filter was never used in the unix socket code so it did not work. This
> patch uses sk_filter() to filter buffers before delivery.
> 
> This short program demonstrates the problem on SOCK_DGRAM.


Any idea on performance cost adding sk_filter() call ?

Hmm, looking at it, I have no idea why sk_filter() needs to block BH.

I'll send a patch to relax this requirement.

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 17:21 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jay Vosburgh, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <4D35C2D7.6090008@intel.com>

On 01/18/2011 07:41 PM, John Fastabend wrote:

>>
>> John, what is you opinion on such load balancing method in general,
>> without referring to particular use cases?
>>
>
> This seems reasonable to me, but I'll defer to Jay on this. As long as the
> limitations are documented and it looks like they are this may be fine.
>
> Mostly I was interested to know what led you down this path and why MPIO
> was not working as at least I expected it should. When I get some time I'll
> see if we can address at least some of these issues. Even so it seems like
> this bonding mode may still be useful for some use cases perhaps even none
> storage use cases.
>
>>

I was adressing several problems with my patch:
  - I was unable to consume whole bandwidth with multipath - with four 
1Gbit "paths" it was slightly above 2Gbit/s
  - Link failures caused quite often disk failures, which led to Oracle 
ASM rebalance, especially with versions below 11.
  - It is not always possible to autogenerate multipathd.conf with 
human-readable device names because of iscsi session id and scsi device 
bus/channel/etc mismatch(usually it differs by 1, but not necessarily), 
with bonding solution I can just look into /dev/disk/by-path to find out 
where physically is device, let's  say, /dev/sdab, located(it's just a 
free bonus I've got, so to say:)) .



-- 
С уважением,
руководитель службы
эксплуатации коммерческих и финансовых сервисов
ООО Яндекс

Олег Юхно



^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 16:57 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: John Fastabend, Jay Vosburgh, David S. Miller,
	netdev@vger.kernel.org, Sébastien Barré,
	Christophe Paasch
In-Reply-To: <4D35BED5.7040301@gmail.com>

On 01/18/2011 07:24 PM, Nicolas de Pesloüan wrote:
> Le 18/01/2011 16:28, Oleg V. Ukhno a écrit :
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, for as far as I remember),
>>> where two hosts were connected through two distinct VLANs. In such
>>> topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>> miimon.
>>> - changing the destination MAC address of egress packets are not
>>> necessary, because egress path selection force ingress path selection
>>> due to the VLAN.
>>
>> In case with two VLANs - yes, this shouldn't be necessary(but needs to
>> be tested, I am not sure), but within one - it is essential for correct
>> rx load striping.
>
> Changing the destination MAC address is definitely not required if you
> segregate each path in a distinct VLAN.
Yes, such L2 network topology should provide necessary high-availability 
and load striping without need to change MAC addresses. But it is more 
difficult to maintain and to understand, in my opinion(when there are 
just several configurations like this - it's ok, but when you have 50 or 
more?) - this is why I've chosen 802.3ad.

> Even in the present of ISL between some switches, packet sent through
> host A interface connected to vlan 100 will only enter host B using the
> interface connected to vlan 100. So every slaves of the bonding
> interface can use the same MAC address.
>
> Of course, changing the destination address would be required in order
> to achieve ingress load balancing on a *single* LAN. But, as Jay noted
> at the beginning of this thread, this would violate 802.3ad.
>

I think receiving same MAC-addresses on different ports on same host 
will just make any troubleshooting much harder, won't it? With different 
MACs it takes little time to find out where the problem is, usually.
I think that implementing choice for choosing whether use single MAC 
address in etherchannel or just use slave's real MAC adresses, won't 
harm anything for both 802.3ad and balance-rr modes, but will simplify 
it's usage without doing any evil, when documented properly.

>
> You are right, but such LAN setup need to be carefully designed and
> built. I'm not sure that an automatic channel aggregation system is the
> right way to do it. Hence the reason why I suggest to use balance-rr
> with VLANs.
>
>>> Oleg, would you mind trying the above "two VLAN" topology" with
>>> mode=balance-rr and report any results ? For high-availability purpose,
>>> it's obviously necessary to setup those VLAN on distinct switches.
>> I'll do it, but it will take some time to setup test environment,
>> several days may be.
>
> Thanks. For testing purpose, it is enough to setup those VLAN on a
> single switch if it is easier for you to do.
Well, I'll do it with 2 switches :)
>
>> You mean following topology:
>
> See above.
>
>> (i'm sure it will work as desired if each host is connected to each
>> switch with only one slave link, if there are more slaves in each switch
>> - unsure)?
>
> If you want to use more than 2 slaves per host, then you need more than
> 2 VLAN.

That's what I don't like in this solution. Within one LAN is is simplier 
and requires less configuration efforts.

You also need to have the exact same number of slaves on all
> hosts, as egress path selection cause ingress path selection at the
> other side.
>

Well, and here's one difference from bonding with my patch. In case of 
my patch applied, it is not required to have equal number of slaves, it 
is enough to have *even* number of slaves, this almost always(so far I 
haven't seen opposite) gurarntees good rx(ingress) load striping.

>
> Nicolas.
>


-- 
С уважением,
руководитель службы
эксплуатации коммерческих и финансовых сервисов
ООО Яндекс

Олег Юхно



^ permalink raw reply

* Re: [PATCH] ns83820: Avoid bad pointer deref in ns83820_init_one().
From: Benjamin LaHaise @ 2011-01-18 16:42 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: netdev, linux-ns83820, linux-kernel, Tejun Heo, Kulikov Vasiliy,
	Denis Kirjanov, David S. Miller
In-Reply-To: <alpine.LNX.2.00.1101172116330.27021@swampdragon.chaosbits.net>

On Mon, Jan 17, 2011 at 09:24:57PM +0100, Jesper Juhl wrote:
> In drivers/net/ns83820.c::ns83820_init_one() we dynamically allocate 
> memory via alloc_etherdev(). We then call PRIV() on the returned storage 
> which is 'return netdev_priv()'. netdev_priv() takes the pointer it is 
> passed and adds 'ALIGN(sizeof(struct net_device), NETDEV_ALIGN)' to it and 
> returns it. Then we test the resulting pointer for NULL, which it is 
> unlikely to be at this point, and later dereference it. This will go bad 
> if alloc_etherdev() actually returned NULL.
> 
> This patch reworks the code slightly so that we test for a NULL pointer 
> (and return -ENOMEM) directly after calling alloc_etherdev().

Looks good.

		-ben

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

> Signed-off-by: Jesper Juhl <jj@chaosbits.net>
> ---
>  ns83820.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
>   Compile tested only. I have no way to test this for real.
> 
> diff --git a/drivers/net/ns83820.c b/drivers/net/ns83820.c
> index 84134c7..a41b2cf 100644
> --- a/drivers/net/ns83820.c
> +++ b/drivers/net/ns83820.c
> @@ -1988,12 +1988,11 @@ static int __devinit ns83820_init_one(struct pci_dev *pci_dev,
>  	}
>  
>  	ndev = alloc_etherdev(sizeof(struct ns83820));
> -	dev = PRIV(ndev);
> -
>  	err = -ENOMEM;
> -	if (!dev)
> +	if (!ndev)
>  		goto out;
>  
> +	dev = PRIV(ndev);
>  	dev->ndev = ndev;
>  
>  	spin_lock_init(&dev->rx_info.lock);
> 
> 
> -- 
> Jesper Juhl <jj@chaosbits.net>            http://www.chaosbits.net/
> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please.

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: John Fastabend @ 2011-01-18 16:41 UTC (permalink / raw)
  To: Oleg V. Ukhno; +Cc: Jay Vosburgh, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <4D358A47.4020009@yandex-team.ru>

On 1/18/2011 4:40 AM, Oleg V. Ukhno wrote:
> On 01/18/2011 06:16 AM, John Fastabend wrote:
>> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>>> 	Can somebody (John?) more knowledgable than I about dm-multipath
>>> comment on the above?
>>
>> Here I'll give it a go.
>>
>> I don't think detecting L2 link failure this way is very robust. If there
>> is a failure farther away then your immediate link your going to break
>> completely? Your bonding hash will continue to round robin the iscsi
>> packets and half them will get dropped on the floor. dm-multipath handles
>> this reasonably gracefully. Also in this bonding environment you seem to
>> be very sensitive to RTT times on the network. Maybe not bad out right but
>> I wouldn't consider this robust either.
> 
> John, I agree - this bonding mode should be used in quite limited number 
> of situations, but as for failure farther away then immediate link - 
> every bonding mode will suffer same problems in this case - bonding 
> detects only L2 failures, other is done by upper-layer mechanisms. And 
> almost all bonding modes depend on equal RTT on slaves. And, there is 
> already similar load balancing mode - balance-alb - what I did is 
> approximately the same, but for 802.3ad bonding mode and provides 
> "better"(more equal and non-conditional layser2) load striping for tx 
> and _rx_ .
> 
> I think I shouldn't mention the particular use case of this patch - when 
> I wrote it I tried to make a more general solution - my goal was "make 
> equal or near-equal load striping for TX and (most important part) RX 
> within single ethernet(layer 2) domain for  TCP transmission". This 
> bonding mode  just introduces ability to stripe rx and tx load for 
> single TCP connection between hosts inside of one ethernet segment. 
> iSCSI is just an example. It is possible to stripe load between a 
> linux-based router and linux-based web/ftp/etc server as well in the 
> same manner. I think this feature will be useful in some number of 
> network configurations.
> 
>   Also, I looked into net-next code - it seems to me that it can be 
> implemented(adapted to net-next bonding code) without any difficulties 
> and hashing function change makes no problem here.
> 
> What I've written below is just my personal experience and opinion after 
> 5 years of using Oracle +iSCSI +mpath(later - patched bonding).
> 
>  From my personal experience I just can say that most iSCSI failures are 
> caused by link failures, and also I would never send any significant 
> iSCSI traffic via router - router would be a bottleneck in this case.
> So, in my case iSCSI traffic flows within one ethernet domain and in 
> case of link failure bonding driver simply fails one slave(in case of 
> bonding) , instead of checking and failing hundreths of paths (in case 
> of mpath) and first case significantly less cpu, net and time 
> consuming(if using default mpath checker - readsector0).
> Mpath is good for me, when I use it to "merge" drbd mirrors from 
> different hosts, but for just doing simple load striping within single 
> L2 network switch  between 2 .. 16 hosts is some overkill(particularly 
> in maintaining human-readable device naming) :).
> 
> John, what is you opinion on such load balancing method in general, 
> without referring to particular use cases?
> 

This seems reasonable to me, but I'll defer to Jay on this. As long as the
limitations are documented and it looks like they are this may be fine.

Mostly I was interested to know what led you down this path and why MPIO
was not working as at least I expected it should. When I get some time I'll
see if we can address at least some of these issues. Even so it seems like
this bonding mode may still be useful for some use cases perhaps even none
storage use cases.

> 
>>
>> You could tweak your scsi timeout values and fail_fast values, set the io
>> retry to 0 to cause the fail over to occur faster. I suspect you already
>> did this and still it is too slow? Maybe adding a checker in multipathd to
>> listen for link events would be fast enough. The checker could then fail
>> the path immediately.
>>
>> I'll try to address your comments from the other thread here. In general I
>> wonder if it would be better to solve the problems in dm-multipath rather than
>> add another bonding mode?
> Of course I did this, but mpath is fine when device quantity is below 
> 30-40 devices with two paths, 150-200 devices with 2+ paths can make 
> life far more interesting :)

OK admittedly this gets ugly fast.

>>
>> OVU - it is slow(I am using ISCSI for Oracle , so I need to minimize latency)
>>
>> The dm-multipath layer is adding latency? How much? If this is really true
>> maybe its best to the address the real issue here and not avoid it by
>> using the bonding layer.
> 
> I do not remember exact number now, but switching one of my databases , 
> about 2 years ago to bonding increased read throughput for the entire db 
> from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and 
> 8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged in 
> one switch) because of "full" bandwidth use. Also, bonding usage 
> simplifies network and application setup greatly(compared to mpath)
> 
>>
>> OVU - it handles any link failures bad, because of it's command queue
>> limitation(all queued commands above 32 are discarded in case of path
>> failure, as I remember)
>>
>> Maybe true but only link failures with the immediate peer are handled
>> with a bonding strategy. By working at the block layer we can detect
>> failures throughout the path. I would need to look into this again I
>> know when we were looking at this sometime ago there was some talk about
>> improving this behavior. I need to take some time to go back through the
>> error recovery stuff to remember how this works.
>>
>> OVU - it performs very bad when there are many devices and maтy paths(I was
>> unable to utilize more that 2Gbps of 4 even with 100 disks with 4 paths
>> per each disk)
> 
> Well, I think that behavior can be explained in such a way:
> when balancing by I/Os number per path(rr_min_io), and there is a huge 
> number of devices, mpath is doing load-balaning per-device, and it is 
> not possible to quarantee equal device use for all devices, so there 
> will be imbalance over network interface(mpath is unaware of it's 
> existence, etc), and it is likely it becomes more imbalanced when there 
> are many devices. Also, counting I/O's for many devices and paths 
> consumes some CPU resources and also can cause excessive context switches.
> 

hmm I'll get something setup here and see if this is the case.

>>
>> Hmm well that seems like something is broken. I'll try this setup when
>> I get some time next few days. This really shouldn't be the case dm-multipath
>> should not add a bunch of extra latency or effect throughput significantly.
>> By the way what are you seeing without mpio?
> 
> And one more obsevation from my 2-years old tests - reading device(using 
> dd) (rhel 5 update 1 kernel, ramdisk via ISCSI via loopback ) as mpath 
> device with single path was done at approximately 120-150mb/s, and same 
> test on non-mpath device at 800-900mb/s. Here I am quite sure, it was a 
> kind of revelation to me that time.
> 

Similarly I'll have a look. Thanks for the info.

>>
>> Thanks,
>> John
>>
> 
> 
 

^ permalink raw reply

* [PATCH] af_unix: implement socket filter
From: Ian Molton @ 2011-01-18 16:39 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, davem, eric.dumazet, ebiederm, xemul, davidel,
	Alban Crequy

From: Alban Crequy <alban.crequy@collabora.co.uk>

Linux Socket Filters can already be successfully attached and detached on unix
sockets with setsockopt(sockfd, SOL_SOCKET, SO_{ATTACH,DETACH}_FILTER, ...).
See: Documentation/networking/filter.txt

But the filter was never used in the unix socket code so it did not work. This
patch uses sk_filter() to filter buffers before delivery.

This short program demonstrates the problem on SOCK_DGRAM.

int main(void) {
  int i, j, ret;
  int sv[2];
  struct pollfd fds[2];
  char *message = "Hello world!";
  char buffer[64];
  struct sock_filter ins[32] = {{0,},};
  struct sock_fprog filter;

  socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);

  for (i = 0 ; i < 2 ; i++) {
    fds[i].fd = sv[i];
    fds[i].events = POLLIN;
    fds[i].revents = 0;
  }

  for(j = 1 ; j < 13 ; j++) {

    /* Set a socket filter to truncate the message */
    memset(ins, 0, sizeof(ins));
    ins[0].code = BPF_RET|BPF_K;
    ins[0].k = j;
    filter.len = 1;
    filter.filter = ins;
    setsockopt(sv[1], SOL_SOCKET, SO_ATTACH_FILTER, &filter, sizeof(filter));

    /* send a message */
    send(sv[0], message, strlen(message) + 1, 0);

    /* The filter should let the message pass but truncated. */
    poll(fds, 2, 0);

    /* Receive the truncated message*/
    ret = recv(sv[1], buffer, 64, 0);
    printf("received %d bytes, expected %d\n", ret, j);
  }

    for (i = 0 ; i < 2 ; i++)
      close(sv[i]);

  return 0;
}

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 net/unix/af_unix.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index dd419d2..8d9bbba 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1475,6 +1475,12 @@ restart:
 			goto out_free;
 	}
 
+	if (sk_filter(other, skb) < 0) {
+		/* Toss the packet but do not return any error to the sender */
+		err = len;
+		goto out_free;
+	}
+
 	unix_state_lock(other);
 	err = -EPERM;
 	if (!unix_may_send(sk, other))
-- 
1.7.2.3

^ permalink raw reply related

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Nicolas de Pesloüan @ 2011-01-18 16:24 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: John Fastabend, Jay Vosburgh, David S. Miller,
	netdev@vger.kernel.org, Sébastien Barré,
	Christophe Paasch
In-Reply-To: <4D35B1B0.2090905@yandex-team.ru>

Le 18/01/2011 16:28, Oleg V. Ukhno a écrit :
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> I remember a topology (described by Jay, for as far as I remember),
>> where two hosts were connected through two distinct VLANs. In such
>> topology:
>> - it is possible to detect path failure using arp monitoring instead of
>> miimon.
>> - changing the destination MAC address of egress packets are not
>> necessary, because egress path selection force ingress path selection
>> due to the VLAN.
>
> In case with two VLANs - yes, this shouldn't be necessary(but needs to
> be tested, I am not sure), but within one - it is essential for correct
> rx load striping.

Changing the destination MAC address is definitely not required if you segregate each path in a 
distinct VLAN.

             +-------------------+     +-------------------+
     +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
     |       +-------------------+     +-------------------+       |
+------+              |                         |              +------+
|host A|              |                         |              |host B|
+------+              |                         |              +------+
     |       +-------------------+     +-------------------+       |
     +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
             +-------------------+     +-------------------+

Even in the present of ISL between some switches, packet sent through host A interface connected to 
vlan 100 will only enter host B using the interface connected to vlan 100. So every slaves of the 
bonding interface can use the same MAC address.

Of course, changing the destination address would be required in order to achieve ingress load 
balancing on a *single* LAN. But, as Jay noted at the beginning of this thread, this would violate 
802.3ad.

>> I think the only point is whether we need a new xmit_hash_policy for
>> mode=802.3ad or whether mode=balance-rr could be enough.
> May by, but it seems to me fair enough not to restrict this feature only
> to non-LACP aggregate links; dynamic aggregation may be useful(it helps
> to avoid switch misconfiguration(misconfigured slaves on switch side)
> sometimes without loss of service).

You are right, but such LAN setup need to be carefully designed and built. I'm not sure that an 
automatic channel aggregation system is the right way to do it. Hence the reason why I suggest to 
use balance-rr with VLANs.

>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to setup test environment,
> several days may be.

Thanks. For testing purpose, it is enough to setup those VLAN on a single switch if it is easier for 
you to do.

> You mean following topology:

See above.

> (i'm sure it will work as desired if each host is connected to each
> switch with only one slave link, if there are more slaves in each switch
> - unsure)?

If you want to use more than 2 slaves per host, then you need more than 2 VLAN. You also need to 
have the exact same number of slaves on all hosts, as egress path selection cause ingress path 
selection at the other side.

             +-------------------+     +-------------------+
     +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
     |       +-------------------+     +-------------------+       |
+------+              |                         |              +------+
|host A|              |                         |              |host B|
+------+              |                         |              +------+
   | |       +-------------------+     +-------------------+       | |
   | +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
   |         +-------------------+     +-------------------+         |
   |                   |                         |                   |
   |                   |                         |                   |
   |         +-------------------+     +-------------------+         |
   +---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
             +-------------------+     +-------------------+

Of course, you can add others host to vlan 100, 200 and 300, with the exact same configuration at 
host A or host B.

	Nicolas.

^ permalink raw reply

* Re: [PATCH net-next 5/8] vmxnet3: Make ethtool handlers multiqueue aware
From: Ben Hutchings @ 2011-01-18 16:05 UTC (permalink / raw)
  To: Shreyas N Bhatewara; +Cc: netdev, linux-kernel, pv-drivers
In-Reply-To: <20110115005946.1064.97955.stgit@sbhatewara-dev1.eng.vmware.com>

On Fri, 2011-01-14 at 16:59 -0800, Shreyas N Bhatewara wrote:
> Show per-queue stats in ethtool -S output for vmxnet3 interface. Register dump
> of ethtool should dump registers for all tx and rx queues.
> 
> Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
> ---
>  drivers/net/vmxnet3/vmxnet3_ethtool.c |  259 ++++++++++++++++++---------------
>  1 files changed, 145 insertions(+), 114 deletions(-)
> 
> diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> index 8e17fc8..d70cee1 100644
> --- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
> +++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> @@ -68,76 +68,78 @@ vmxnet3_set_rx_csum(struct net_device *netdev, u32 val)
>  static const struct vmxnet3_stat_desc
>  vmxnet3_tq_dev_stats[] = {
>  	/* description,         offset */
> -	{ "TSO pkts tx",        offsetof(struct UPT1_TxStats, TSOPktsTxOK) },
> -	{ "TSO bytes tx",       offsetof(struct UPT1_TxStats, TSOBytesTxOK) },
> -	{ "ucast pkts tx",      offsetof(struct UPT1_TxStats, ucastPktsTxOK) },
> -	{ "ucast bytes tx",     offsetof(struct UPT1_TxStats, ucastBytesTxOK) },
> -	{ "mcast pkts tx",      offsetof(struct UPT1_TxStats, mcastPktsTxOK) },
> -	{ "mcast bytes tx",     offsetof(struct UPT1_TxStats, mcastBytesTxOK) },
> -	{ "bcast pkts tx",      offsetof(struct UPT1_TxStats, bcastPktsTxOK) },
> -	{ "bcast bytes tx",     offsetof(struct UPT1_TxStats, bcastBytesTxOK) },
> -	{ "pkts tx err",        offsetof(struct UPT1_TxStats, pktsTxError) },
> -	{ "pkts tx discard",    offsetof(struct UPT1_TxStats, pktsTxDiscard) },
> +	{ "Tx Queue#",        0 },
> +	{ "  TSO pkts tx",	offsetof(struct UPT1_TxStats, TSOPktsTxOK) },
> +	{ "  TSO bytes tx",	offsetof(struct UPT1_TxStats, TSOBytesTxOK) },
[...]

I really don't like this.  You're making the assumption that these stats
will always be displayed as they are now by the ethtool command, but
that is not the only user of the ethtool API.

I expect that some people have scripts that involve reading ethtool
stats into a hash/dictionary.  (In fact, I wrote a diagnostic script for
Solarflare that does that.)  After this change to your driver, they
would get results from only one TX queue (with different names from
before).

So please:
- Don't use leading or trailing spaces in names
- Keep the global statistics, as most users will be more interested in
these
- If you think users actually want per-queue statistics, add them with
unique names (like bnx2x does)

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox