Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] CHOKe flow scheduler (0.9)
From: Eric Dumazet @ 2011-01-18 19:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Patrick McHardy, David Miller, netdev
In-Reply-To: <20110118110634.7386c757@nehalam>

Le mardi 18 janvier 2011 à 11:06 -0800, Stephen Hemminger a écrit :

> +static bool choke_match_flow(struct sk_buff *skb1, struct sk_buff *skb2)
> +{
> +	int off1, off2, poff;
> +	u8 ip_proto;
> +	u32 ihl;
> +
> +	if (skb1->protocol != skb2->protocol)
> +		return false;
> +
> +	off1 = skb_network_offset(skb1);
> +	off2 = skb_network_offset(skb2);

> +
> +	switch (skb1->protocol) {
> +	case __constant_htons(ETH_P_IP): {
> +		struct iphdr *ip1, *ip2;
> +
> +		if (!pskb_may_pull(skb1, sizeof(struct iphdr) + off1))
> +			return false;
> +
	pskb_network_may_pull() might be cleaner


> +		ip1 = (struct iphdr *) (skb1->data + off1);
	ip1 = ip_hdr(skb);

> +		if (ip1->frag_off & htons(IP_MF | IP_OFFSET))
> +			return false;	/* don't compare fragments */
> +

Hmm, we should compare fragments if possible.

saddr/daddr are available, not the ports.

> +		if (!pskb_may_pull(skb2, sizeof(struct iphdr) + off2))
> +			return false;
> +
> +		ip2 = (struct iphdr *) (skb2->data + off2);
> +		if (ip2->frag_off & htons(IP_MF | IP_OFFSET))
> +			return false;
> +
> +		if (ip1->protocol != ip2->protocol ||
> +		    ip1->saddr != ip2->saddr || ip1->daddr != ip2->daddr)
> +			return false;
> +


	What happens if ip1->ihl != ip2->ihl  here ?

Here I would add the fragment test :

	if ((ip1->frag_off | ip2->frag_off)) & htons(IP_MF | IP_OFFSET))
		return true;

> +		ip_proto = ip1->protocol;
> +		ihl = ip1->ihl;
> +		break;
> +	}
> +
> +	case __constant_htons(ETH_P_IPV6): {
> +		struct ipv6hdr *ip1, *ip2;
> +
> +		if (!pskb_may_pull(skb1, sizeof(struct ipv6hdr *) + off1))
> +			return false;

ouch... sizeof(sizeof(struct ipv6hdr *) is not what you want but
sizeof(struct ipv6hdr) is.

So just use :

	pskb_network_may_pull(skb1, sizeof(*ip1))
> +
> +		if (!pskb_may_pull(skb2, sizeof(struct ipv6hdr *) + off2))
> +			return false;
> +
> +		ip1 = (struct ipv6hdr *) (skb1->data + off1);
	ip1 = ipv6_hdr(skb1);
> +		ip2 = (struct ipv6hdr *) (skb2->data + off2);
> +
> +		if (ip1->nexthdr != ip2->nexthdr ||
> +		    ipv6_addr_cmp(&ip1->saddr, &ip2->saddr) != 0 ||
> +		    ipv6_addr_cmp(&ip1->daddr, &ip2->daddr))
> +			return false;
> +
> +		ihl = (40 >> 2);
> +		ip_proto = ip1->nexthdr;
> +		break;
> +	}
> +
> +	default:
> +		return false;
> +	}
> +




^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Alessandro Suardi @ 2011-01-18 19:26 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Jan Engelhardt, jamal, David Miller, pablo, arthur.marsh,
	eric.dumazet, netdev
In-Reply-To: <20110118184730.GD4202@del.dom.local>

On Tue, Jan 18, 2011 at 7:47 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> On Tue, Jan 18, 2011 at 07:28:52PM +0100, Jarek Poplawski wrote:
>> On Tue, Jan 18, 2011 at 07:24:40PM +0100, Jan Engelhardt wrote:
>> >
>> > On Tuesday 2011-01-18 19:10, Alessandro Suardi wrote:
>> > >On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
>> > >>
>> > >> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
>> > >> tests message type to verify this. Since genetlink can't do the same
>> > >> use "practical" test for ops->dumpit (assuming NEW request won't be
>> > >> mixed with GET).
>> ...
>> > >2.6.37-git18 + netlink revert + this patch
>> > > - fixes Avahi
>> > > - breaks acpid
>> > >Starting acpi daemon: RTNETLINK1 answers: Operation not supported
>> > >acpid: error talking to the kernel via netlink
>> >
>> > Deducing from that, it is a GET-like request that was sent by acpid,
>> > and the message type is one that has both a dumpit and a doit function.
>> > So if EOPNOTSUPP now occurs on all message types that have both dumpit
>> > and doit, you should have broken a lot more than just acpid.
>>
>> Right, we need something better here.
>
> On the other hand, until there is something better, we might try to
> fix it at least for "normal" dumpit cases?
>
> Alessandro, could you try (with the netlink revert)?
>
> Thanks,
> Jarek P.
>
> ---
> diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> --- a/net/netlink/genetlink.c   2011-01-18 16:58:16.000000000 +0100
> +++ b/net/netlink/genetlink.c   2011-01-18 19:36:25.000000000 +0100
> @@ -519,15 +519,14 @@ static int genl_rcv_msg(struct sk_buff *
>            security_netlink_recv(skb, CAP_NET_ADMIN))
>                return -EPERM;
>
> -       if (nlh->nlmsg_flags & NLM_F_DUMP) {
> -               if (ops->dumpit == NULL)
> -                       return -EOPNOTSUPP;
> -
> -               genl_unlock();
> -               err = netlink_dump_start(net->genl_sock, skb, nlh,
> -                                        ops->dumpit, ops->done);
> -               genl_lock();
> -               return err;
> +       if (ops->dumpit) {
> +               if (nlh->nlmsg_flags & NLM_F_DUMP) {
> +                       genl_unlock();
> +                       err = netlink_dump_start(net->genl_sock, skb, nlh,
> +                                                ops->dumpit, ops->done);
> +                       genl_lock();
> +                       return err;
> +               }
>        }
>
>        if (ops->doit == NULL)
>

Sure enough :)


2.6.37-git18 + netlink revert + this 2nd attempt

 appears to be good for me - both Avahi and acpid start up fine and I
 can't see any other program misbehaving.


Thanks, ciao,

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")

^ permalink raw reply

* Re: rps testing questions
From: Rick Jones @ 2011-01-18 19:10 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: mi wake, netdev
In-Reply-To: <1295375676.3537.83.camel@bwh-desktop>

Ben Hutchings wrote:
> On Tue, 2011-01-18 at 10:23 -0800, Rick Jones wrote:
> 
>>Ben Hutchings wrote:
>>
>>>On Mon, 2011-01-17 at 17:43 +0800, mi wake wrote:
> 
> [...]
> 
>>>>I do ab and tbench testing also find there is less tps with enable
>>>>rps.but,there is more cpu using when with enable rps.when with enable
>>>>rps ,softirqs is blanced  on cpus.
>>>>
>>>>is there something wrong with my test?
>>>
>>>
>>>In addition to what Eric said, check the interrupt moderation settings
>>>(ethtool -c/-C options).  One-way latency for a single request/response
>>>test will be at least the interrupt moderation value.
>>>
>>>I haven't tested RPS by itself (Solarflare NICs have plenty of hardware
>>>queues) so I don't know whether it can improve latency.  However, RFS
>>>certainly does when there are many flows.
>>
>>Is there actually an expectation that either RPS or RFS would improve *latency*? 
>>  Multiple-stream throughput certainly, but with the additional work done to 
>>spread things around, I wouldn't expect either to improve latency.
> 
> 
> Yes, it seems to make a big improvement to latency when many flows are
> active. 

OK, you and I were using different definitions.  I was speaking to single-stream 
latency, but didn't say it explicitly (I may have subconsciously thought it was 
implicit given the OP used a single instance of netperf :).

happy benchmarking,

rick jones

> Tom told me that one of his benchmarks was 200 * netperf TCP_RR
> in parallel, and I've seen over 40% reduction in latency for that. That
> said, allocating more RX queues might also help (sfc currently defaults
> to one per processor package rather than one per processor thread, due
> to concerns about CPU efficiency).
> 
> Ben.
> 


^ permalink raw reply

* Re: [PATCH] CHOKe flow scheduler (0.9)
From: Stephen Hemminger @ 2011-01-18 19:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Patrick McHardy, David Miller, netdev
In-Reply-To: <1295286851.3335.36.camel@edumazet-laptop>

On Mon, 17 Jan 2011 18:54:11 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le lundi 17 janvier 2011 à 09:25 -0800, Stephen Hemminger a écrit :
> 
> > I rolled in your changes. But there is one more change I want to make.
> > The existing flow match based on hash is vulnerable to side-channel DoS attack.
> > It is possible for a hostile flow to send packets that match the same
> > hash value which would effectively kill a targeted flow.
> > 
> > The solution is to match based on full source and destination, not hash value.
> > Still coding that up.
> 
> I see, but you only want to make this full test if (!q->filter_list)  ?
> 
> (or precisely only if skb_get_rxhash() was used to get the cookie )

This is what I am starting to retest. The code can probably be simplified
to avoid the may_pull() on the packet already in queue.

Subject: sched: CHOKe flow scheduler

CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
packet scheduler based on the Random Exponential Drop (RED) algorithm.

The core idea is:
  For every packet arrival:
  	Calculate Qave
	if (Qave < minth) 
	     Queue the new packet
	else 
	     Select randomly a packet from the queue 
	     if (both packets from same flow)
	     then Drop both the packets
	     else if (Qave > maxth)
	          Drop packet
	     else
	       	  Admit packet with proability p (same as RED)

See also:
  Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
   queue management scheme for approximating fair bandwidth allocation", 
  Proceeding of INFOCOM'2000, March 2000.

Help from:
     Eric Dumazet <eric.dumazet@gmail.com>
     Patrick McHardy <kaber@trash.net>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
This version is based on net-next, and assumes Eric's patch for
corrected bstats is already applied.

0.9 incorporate patches from Patrick/Eric
    rework the peek_random and drop code to simplify and fix bug where
    random_N needs to called with full length (including holes).

 include/linux/pkt_sched.h |   29 ++
 net/sched/Kconfig         |   11 
 net/sched/Makefile        |    1 
 net/sched/sch_choke.c     |  579 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 620 insertions(+)

--- a/net/sched/Kconfig	2011-01-14 10:43:19.062537393 -0800
+++ b/net/sched/Kconfig	2011-01-16 13:42:45.938919517 -0800
@@ -205,6 +205,17 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_CHOKE
+	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
+	help
+	  Say Y here if you want to use the CHOKe packet scheduler (CHOose
+	  and Keep for responsive flows, CHOose and Kill for unresponsive
+	  flows). This is a variation of RED which trys to penalize flows
+	  that monopolize the queue.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_choke.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
--- a/net/sched/Makefile	2011-01-14 10:43:19.072538228 -0800
+++ b/net/sched/Makefile	2011-01-16 13:42:45.946919793 -0800
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_mult
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_choke.c	2011-01-17 09:18:42.271211633 -0800
@@ -0,0 +1,686 @@
+/*
+ * net/sched/sch_choke.c	CHOKE scheduler
+ *
+ * Copyright (c) 2011 Stephen Hemminger <shemminger@vyatta.com>
+ * Copyright (c) 2011 Eric Dumazet <eric.dumazet@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/reciprocal_div.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <net/red.h>
+#include <linux/ip.h>
+#include <net/ip.h>
+#include <linux/ipv6.h>
+#include <net/ipv6.h>
+
+/*
+   CHOKe stateless AQM for fair bandwidth allocation
+   =================================================
+
+   CHOKe (CHOose and Keep for responsive flows, CHOose and Kill for
+   unresponsive flows) is a variant of RED that penalizes misbehaving flows but
+   maintains no flow state. The difference from RED is an additional step
+   during the enqueuing process. If average queue size is over the
+   low threshold (qmin), a packet is chosen at random from the queue.
+   If both the new and chosen packet are from the same flow, both
+   are dropped. Unlike RED, CHOKe is not really a "classful" qdisc because it
+   needs to access packets in queue randomly. It has a minimal class
+   interface to allow overriding the builtin flow classifier with
+   filters.
+
+   Source:
+   R. Pan, B. Prabhakar, and K. Psounis, "CHOKe, A Stateless
+   Active Queue Management Scheme for Approximating Fair Bandwidth Allocation",
+   IEEE INFOCOM, 2000.
+
+   A. Tang, J. Wang, S. Low, "Understanding CHOKe: Throughput and Spatial
+   Characteristics", IEEE/ACM Transactions on Networking, 2004
+
+ */
+
+/* Upper bound on size of sk_buff table (packets) */
+#define CHOKE_MAX_QUEUE	(128*1024 - 1)
+
+struct choke_sched_data {
+/* Parameters */
+	u32		 limit;
+	unsigned char	 flags;
+
+	struct red_parms parms;
+
+/* Variables */
+	struct tcf_proto *filter_list;
+	struct {
+		u32	prob_drop;	/* Early probability drops */
+		u32	prob_mark;	/* Early probability marks */
+		u32	forced_drop;	/* Forced drops, qavg > max_thresh */
+		u32	forced_mark;	/* Forced marks, qavg > max_thresh */
+		u32	pdrop;          /* Drops due to queue limits */
+		u32	other;          /* Drops due to drop() calls */
+		u32	matched;	/* Drops to flow match */
+	} stats;
+
+	unsigned int	 head;
+	unsigned int	 tail;
+
+	unsigned int	 tab_mask; /* size - 1 */
+
+	struct sk_buff **tab;
+};
+
+/* deliver a random number between 0 and N - 1 */
+static u32 random_N(unsigned int N)
+{
+	return reciprocal_divide(random32(), N);
+}
+
+/* number of elements in queue including holes */
+static unsigned int choke_len(const struct choke_sched_data *q)
+{
+	return (q->tail - q->head) & q->tab_mask;
+}
+
+/* Is ECN parameter configured */
+static int use_ecn(const struct choke_sched_data *q)
+{
+	return q->flags & TC_RED_ECN;
+}
+
+/* Should packets over max just be dropped (versus marked) */
+static int use_harddrop(const struct choke_sched_data *q)
+{
+	return q->flags & TC_RED_HARDDROP;
+}
+
+/* Move head pointer forward to skip over holes */
+static void choke_zap_head_holes(struct choke_sched_data *q)
+{
+	do {
+		q->head = (q->head + 1) & q->tab_mask;
+		if (q->head == q->tail)
+			break;
+	} while (q->tab[q->head] == NULL);
+}
+
+/* Move tail pointer backwards to reuse holes */
+static void choke_zap_tail_holes(struct choke_sched_data *q)
+{
+	do {
+		q->tail = (q->tail - 1) & q->tab_mask;
+		if (q->head == q->tail)
+			break;
+	} while (q->tab[q->tail] == NULL);
+}
+
+/* Drop packet from queue array by creating a "hole" */
+static void choke_drop_by_idx(struct Qdisc *sch, unsigned int idx)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb = q->tab[idx];
+
+	q->tab[idx] = NULL;
+
+	if (idx == q->head)
+		choke_zap_head_holes(q);
+	if (idx == q->tail)
+		choke_zap_tail_holes(q);
+
+	sch->qstats.backlog -= qdisc_pkt_len(skb);
+	qdisc_drop(skb, sch);
+	qdisc_tree_decrease_qlen(sch, 1);
+	--sch->q.qlen;
+}
+
+/*
+ * Compare flow of two packets
+ *  Returns true only if source and destination address and port match.
+ *          false for special cases
+ */
+static bool choke_match_flow(struct sk_buff *skb1, struct sk_buff *skb2)
+{
+	int off1, off2, poff;
+	u8 ip_proto;
+	u32 ihl;
+
+	if (skb1->protocol != skb2->protocol)
+		return false;
+
+	off1 = skb_network_offset(skb1);
+	off2 = skb_network_offset(skb2);
+
+	switch (skb1->protocol) {
+	case __constant_htons(ETH_P_IP): {
+		struct iphdr *ip1, *ip2;
+
+		if (!pskb_may_pull(skb1, sizeof(struct iphdr) + off1))
+			return false;
+
+		ip1 = (struct iphdr *) (skb1->data + off1);
+		if (ip1->frag_off & htons(IP_MF | IP_OFFSET))
+			return false;	/* don't compare fragments */
+
+		if (!pskb_may_pull(skb2, sizeof(struct iphdr) + off2))
+			return false;
+
+		ip2 = (struct iphdr *) (skb2->data + off2);
+		if (ip2->frag_off & htons(IP_MF | IP_OFFSET))
+			return false;
+
+		if (ip1->protocol != ip2->protocol ||
+		    ip1->saddr != ip2->saddr || ip1->daddr != ip2->daddr)
+			return false;
+
+		ip_proto = ip1->protocol;
+		ihl = ip1->ihl;
+		break;
+	}
+
+	case __constant_htons(ETH_P_IPV6): {
+		struct ipv6hdr *ip1, *ip2;
+
+		if (!pskb_may_pull(skb1, sizeof(struct ipv6hdr *) + off1))
+			return false;
+
+		if (!pskb_may_pull(skb2, sizeof(struct ipv6hdr *) + off2))
+			return false;
+
+		ip1 = (struct ipv6hdr *) (skb1->data + off1);
+		ip2 = (struct ipv6hdr *) (skb2->data + off2);
+
+		if (ip1->nexthdr != ip2->nexthdr ||
+		    ipv6_addr_cmp(&ip1->saddr, &ip2->saddr) != 0 ||
+		    ipv6_addr_cmp(&ip1->daddr, &ip2->daddr))
+			return false;
+
+		ihl = (40 >> 2);
+		ip_proto = ip1->nexthdr;
+		break;
+	}
+
+	default:
+		return false;
+	}
+
+	poff = proto_ports_offset(ip_proto);
+	if (poff >= 0) {
+		u32 *ports1, *ports2;
+
+		off1 += ihl * 4 + poff;
+		if (!pskb_may_pull(skb1, off1 + 4))
+			return false;
+
+		off2 += ihl * 4 + poff;
+		if (!pskb_may_pull(skb2, off2 + 4))
+			return false;
+
+		ports1 = (__force u32 *) (skb1->data + off1);
+		ports2 = (__force u32 *) (skb2->data + off2);
+
+		return *ports1 == *ports2;
+	}
+
+	return true;
+}
+
+static inline void choke_set_classid(struct sk_buff *skb, u16 classid)
+{
+	*(unsigned int *)(qdisc_skb_cb(skb)->data) = classid;
+}
+
+static u16 choke_get_classid(const struct sk_buff *skb)
+{
+	return *(unsigned int *)(qdisc_skb_cb(skb)->data);
+}
+
+/*
+ * Classify flow using either:
+ *  1. pre-existing classification result in skb
+ *  2. fast internal classification
+ *  3. use TC filter based classification
+ */
+static bool choke_classify(struct sk_buff *skb,
+			   struct Qdisc *sch, int *qerr)
+
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tcf_result res;
+	int result;
+
+	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+
+	result = tc_classify(skb, q->filter_list, &res);
+	if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+		case TC_ACT_SHOT:
+			return false;
+		}
+#endif
+		choke_set_classid(skb, TC_H_MIN(res.classid));
+		return true;
+	}
+
+	return false;
+}
+
+/* Select packet a random from queue */
+static struct sk_buff *choke_peek_random(const struct choke_sched_data *q,
+					 unsigned int *pidx)
+{
+	struct sk_buff *skb;
+	int retrys = 3;
+
+	do {
+		*pidx = (q->head + random_N(choke_len(q))) & q->tab_mask;
+		skb = q->tab[*pidx];
+		if (skb)
+			return skb;
+	} while (--retrys > 0);
+
+	/* queue is has lots of holes use the head which is known to exist
+	 * Note : result can still be NULL if q->head == q->tail
+	 */
+	return q->tab[*pidx = q->head];
+}
+
+/* Select a packet at random from the queue and compare flow */
+static bool choke_match_random(const struct choke_sched_data *q,
+			       struct sk_buff *nskb,
+			       unsigned int *pidx)
+{
+	struct sk_buff *oskb;
+
+	if (q->head == q->tail)
+		return false;
+
+	oskb = choke_peek_random(q, pidx);
+	if (q->filter_list)
+		return choke_get_classid(nskb) == choke_get_classid(oskb);
+
+
+	return choke_match_flow(oskb, nskb);
+}
+
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct red_parms *p = &q->parms;
+	int uninitialized_var(ret);
+
+	/* If using external classifiers, get result and record it. */
+	if (q->filter_list &&
+	    !choke_classify(skb, sch, &ret)) {
+		/* Packet was eaten by filter */
+		if (ret & __NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return ret;
+	}
+
+	/* Compute average queue usage (see RED) */
+	p->qavg = red_calc_qavg(p, sch->q.qlen);
+	if (red_is_idling(p))
+		red_end_of_idle_period(p);
+
+	/* Is queue small? */
+	if (p->qavg <= p->qth_min)
+		p->qcount = -1;
+	else {
+		unsigned int idx;
+
+		/* Draw a packet at random from queue and compare flow */
+		if (choke_match_random(q, skb, &idx)) {
+			q->stats.matched++;
+			choke_drop_by_idx(sch, idx);
+			goto congestion_drop;
+		}
+
+		/* Queue is large, always mark/drop */
+		if (p->qavg > p->qth_max) {
+			p->qcount = -1;
+
+			sch->qstats.overlimits++;
+			if (use_harddrop(q) || !use_ecn(q) ||
+			    !INET_ECN_set_ce(skb)) {
+				q->stats.forced_drop++;
+				goto congestion_drop;
+			}
+
+			q->stats.forced_mark++;
+		} else if (++p->qcount) {
+			if (red_mark_probability(p, p->qavg)) {
+				p->qcount = 0;
+				p->qR = red_random(p);
+
+				sch->qstats.overlimits++;
+				if (!use_ecn(q) || !INET_ECN_set_ce(skb)) {
+					q->stats.prob_drop++;
+					goto congestion_drop;
+				}
+
+				q->stats.prob_mark++;
+			}
+		} else
+			p->qR = red_random(p);
+	}
+
+	/* Admit new packet */
+	if (sch->q.qlen < q->limit) {
+		q->tab[q->tail] = skb;
+		q->tail = (q->tail + 1) & q->tab_mask;
+		++sch->q.qlen;
+		sch->qstats.backlog += qdisc_pkt_len(skb);
+		return NET_XMIT_SUCCESS;
+	}
+
+	q->stats.pdrop++;
+	sch->qstats.drops++;
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+
+ congestion_drop:
+	qdisc_drop(skb, sch);
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *choke_dequeue(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	if (q->head == q->tail) {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+		return NULL;
+	}
+
+	skb = q->tab[q->head];
+	q->tab[q->head] = NULL;
+	choke_zap_head_holes(q);
+	--sch->q.qlen;
+	sch->qstats.backlog -= qdisc_pkt_len(skb);
+	qdisc_bstats_update(sch, skb);
+
+	return skb;
+}
+
+static unsigned int choke_drop(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	unsigned int len;
+
+	len = qdisc_queue_drop(sch);
+	if (len > 0)
+		q->stats.other++;
+	else {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return len;
+}
+
+static void choke_reset(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	red_restart(&q->parms);
+}
+
+static const struct nla_policy choke_policy[TCA_CHOKE_MAX + 1] = {
+	[TCA_CHOKE_PARMS]	= { .len = sizeof(struct tc_red_qopt) },
+	[TCA_CHOKE_STAB]	= { .len = RED_STAB_SIZE },
+};
+
+
+static void choke_free(void *addr)
+{
+	if (addr) {
+		if (is_vmalloc_addr(addr))
+			vfree(addr);
+		else
+			kfree(addr);
+	}
+}
+
+static int choke_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_CHOKE_MAX + 1];
+	const struct tc_red_qopt *ctl;
+	int err;
+	struct sk_buff **old = NULL;
+	unsigned int mask;
+
+	if (opt == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_CHOKE_MAX, opt, choke_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_CHOKE_PARMS] == NULL ||
+	    tb[TCA_CHOKE_STAB] == NULL)
+		return -EINVAL;
+
+	ctl = nla_data(tb[TCA_CHOKE_PARMS]);
+
+	if (ctl->limit > CHOKE_MAX_QUEUE)
+		return -EINVAL;
+
+	mask = roundup_pow_of_two(ctl->limit + 1) - 1;
+	if (mask != q->tab_mask) {
+		struct sk_buff **ntab;
+
+		ntab = kcalloc(mask + 1, sizeof(struct sk_buff *), GFP_KERNEL);
+		if (!ntab)
+			ntab = vzalloc((mask + 1) * sizeof(struct sk_buff *));
+		if (!ntab)
+			return -ENOMEM;
+
+		sch_tree_lock(sch);
+		old = q->tab;
+		if (old) {
+			unsigned int oqlen = sch->q.qlen, tail = 0;
+
+			while (q->head != q->tail) {
+				struct sk_buff *skb = q->tab[q->head];
+
+				q->head = (q->head + 1) & q->tab_mask;
+				if (!skb)
+					continue;
+				if (tail < mask) {
+					ntab[tail++] = skb;
+					continue;
+				}
+				sch->qstats.backlog -= qdisc_pkt_len(skb);
+				--sch->q.qlen;
+				qdisc_drop(skb, sch);
+			}
+			qdisc_tree_decrease_qlen(sch, oqlen - sch->q.qlen);
+			q->head = 0;
+			q->tail = tail;
+		}
+
+		q->tab_mask = mask;
+		q->tab = ntab;
+	} else
+		sch_tree_lock(sch);
+
+	q->flags = ctl->flags;
+	q->limit = ctl->limit;
+
+	red_set_parms(&q->parms, ctl->qth_min, ctl->qth_max, ctl->Wlog,
+		      ctl->Plog, ctl->Scell_log,
+		      nla_data(tb[TCA_CHOKE_STAB]));
+
+	if (q->head == q->tail)
+		red_end_of_idle_period(&q->parms);
+
+	sch_tree_unlock(sch);
+	choke_free(old);
+	return 0;
+}
+
+static int choke_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	return choke_change(sch, opt);
+}
+
+static int choke_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts = NULL;
+	struct tc_red_qopt opt = {
+		.limit		= q->limit,
+		.flags		= q->flags,
+		.qth_min	= q->parms.qth_min >> q->parms.Wlog,
+		.qth_max	= q->parms.qth_max >> q->parms.Wlog,
+		.Wlog		= q->parms.Wlog,
+		.Plog		= q->parms.Plog,
+		.Scell_log	= q->parms.Scell_log,
+	};
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (opts == NULL)
+		goto nla_put_failure;
+
+	NLA_PUT(skb, TCA_CHOKE_PARMS, sizeof(opt), &opt);
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static int choke_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tc_choke_xstats st = {
+		.early	= q->stats.prob_drop + q->stats.forced_drop,
+		.marked	= q->stats.prob_mark + q->stats.forced_mark,
+		.pdrop	= q->stats.pdrop,
+		.other	= q->stats.other,
+		.matched = q->stats.matched,
+	};
+
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void choke_destroy(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	tcf_destroy_chain(&q->filter_list);
+	choke_free(q->tab);
+}
+
+static struct Qdisc *choke_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	return NULL;
+}
+
+static unsigned long choke_get(struct Qdisc *sch, u32 classid)
+{
+	return 0;
+}
+
+static void choke_put(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long choke_bind(struct Qdisc *sch, unsigned long parent,
+				u32 classid)
+{
+	return 0;
+}
+
+static struct tcf_proto **choke_find_tcf(struct Qdisc *sch, unsigned long cl)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return &q->filter_list;
+}
+
+static int choke_dump_class(struct Qdisc *sch, unsigned long cl,
+			  struct sk_buff *skb, struct tcmsg *tcm)
+{
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static void choke_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	if (!arg->stop) {
+		if (arg->fn(sch, 1, arg) < 0) {
+			arg->stop = 1;
+			return;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops choke_class_ops = {
+	.leaf		=	choke_leaf,
+	.get		=	choke_get,
+	.put		=	choke_put,
+	.tcf_chain	=	choke_find_tcf,
+	.bind_tcf	=	choke_bind,
+	.unbind_tcf	=	choke_put,
+	.dump		=	choke_dump_class,
+	.walk		=	choke_walk,
+};
+
+static struct sk_buff *choke_peek_head(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	return (q->head != q->tail) ? q->tab[q->head] : NULL;
+}
+
+static struct Qdisc_ops choke_qdisc_ops __read_mostly = {
+	.id		=	"choke",
+	.priv_size	=	sizeof(struct choke_sched_data),
+
+	.enqueue	=	choke_enqueue,
+	.dequeue	=	choke_dequeue,
+	.peek		=	choke_peek_head,
+	.drop		=	choke_drop,
+	.init		=	choke_init,
+	.destroy	=	choke_destroy,
+	.reset		=	choke_reset,
+	.change		=	choke_change,
+	.dump		=	choke_dump,
+	.dump_stats	=	choke_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init choke_module_init(void)
+{
+	return register_qdisc(&choke_qdisc_ops);
+}
+
+static void __exit choke_module_exit(void)
+{
+	unregister_qdisc(&choke_qdisc_ops);
+}
+
+module_init(choke_module_init)
+module_exit(choke_module_exit)
+
+MODULE_LICENSE("GPL");
--- a/include/linux/pkt_sched.h	2011-01-14 10:43:19.092539898 -0800
+++ b/include/linux/pkt_sched.h	2011-01-16 13:42:45.926919103 -0800
@@ -247,6 +247,35 @@ struct tc_gred_sopt {
 	__u16		pad1;
 };
 
+/* CHOKe section */
+
+enum {
+	TCA_CHOKE_UNSPEC,
+	TCA_CHOKE_PARMS,
+	TCA_CHOKE_STAB,
+	__TCA_CHOKE_MAX,
+};
+
+#define TCA_CHOKE_MAX (__TCA_CHOKE_MAX - 1)
+
+struct tc_choke_qopt {
+	__u32		limit;		/* Hard queue length (packets)	*/
+	__u32		qth_min;	/* Min average threshold (packets) */
+	__u32		qth_max;	/* Max average threshold (packets) */
+	unsigned char   Wlog;		/* log(W)		*/
+	unsigned char   Plog;		/* log(P_max/(qth_max-qth_min))	*/
+	unsigned char   Scell_log;	/* cell size for idle damping */
+	unsigned char	flags;		/* see RED flags */
+};
+
+struct tc_choke_xstats {
+	__u32		early;          /* Early drops */
+	__u32		pdrop;          /* Drops due to queue limits */
+	__u32		other;          /* Drops due to drop() calls */
+	__u32		marked;         /* Marked packets */
+	__u32		matched;	/* Drops due to flow match */
+};
+
 /* HTB section */
 #define TC_HTB_NUMPRIO		8
 #define TC_HTB_MAXDEPTH		8

^ permalink raw reply

* Re: [PATCH] vhost: rcu annotation fixup
From: Paul E. McKenney @ 2011-01-18 19:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20110118175500.GA6935@redhat.com>

On Tue, Jan 18, 2011 at 07:55:00PM +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > > When built with rcu checks enabled, vhost triggers
> > > bogus warnings as vhost features are read without
> > > dev->mutex sometimes.
> > > Fixing it properly is not trivial as vhost.h does not
> > > know which lockdep classes it will be used under.
> > > Disable the warning by stubbing out the check for now.
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  drivers/vhost/vhost.h |    4 +---
> > >  1 files changed, 1 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index 2af44b7..2d03a31 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> > >  {
> > >  	unsigned acked_features;
> > > 
> > > -	acked_features =
> > > -		rcu_dereference_index_check(dev->acked_features,
> > > -					    lockdep_is_held(&dev->mutex));
> > > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
> > 
> > Ouch!!!
> > 
> > Could you please at least add a comment?
> 
> Yes, OK.
> 
> > Alternatively, pass in the lock that is held and check for that?  Given
> > that this is a static inline, the compiler should be able to optimize
> > the argument away when !PROVE_RCU, correct?
> > 
> > 							Thanx, Paul
> 
> Hopefully, yes. We don't always have a lock: the idea was
> to create a lockdep for these cases. But we can't pass
> the pointer to that ...

I suppose you could pass a pointer to the lockdep map structure.
Not sure if this makes sense, but it would handle the situation.

Alternatively, create a helper function that checks the possibilities
and screams if none of them are in effect.

							Thanx, Paul

> > >  	return acked_features & (1 << bit);
> > >  }
> > > 
> > > -- 
> > > 1.7.3.2.91.g446ac
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:47 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alessandro Suardi, jamal, David Miller, pablo, arthur.marsh,
	eric.dumazet, netdev
In-Reply-To: <20110118182852.GC4202@del.dom.local>

On Tue, Jan 18, 2011 at 07:28:52PM +0100, Jarek Poplawski wrote:
> On Tue, Jan 18, 2011 at 07:24:40PM +0100, Jan Engelhardt wrote:
> > 
> > On Tuesday 2011-01-18 19:10, Alessandro Suardi wrote:
> > >On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > >>
> > >> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> > >> tests message type to verify this. Since genetlink can't do the same
> > >> use "practical" test for ops->dumpit (assuming NEW request won't be
> > >> mixed with GET).
> ...
> > >2.6.37-git18 + netlink revert + this patch
> > > - fixes Avahi
> > > - breaks acpid
> > >Starting acpi daemon: RTNETLINK1 answers: Operation not supported
> > >acpid: error talking to the kernel via netlink
> > 
> > Deducing from that, it is a GET-like request that was sent by acpid, 
> > and the message type is one that has both a dumpit and a doit function.
> > So if EOPNOTSUPP now occurs on all message types that have both dumpit 
> > and doit, you should have broken a lot more than just acpid.
> 
> Right, we need something better here.

On the other hand, until there is something better, we might try to
fix it at least for "normal" dumpit cases?

Alessandro, could you try (with the netlink revert)?

Thanks,
Jarek P.

---
diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
--- a/net/netlink/genetlink.c	2011-01-18 16:58:16.000000000 +0100
+++ b/net/netlink/genetlink.c	2011-01-18 19:36:25.000000000 +0100
@@ -519,15 +519,14 @@ static int genl_rcv_msg(struct sk_buff *
 	    security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (nlh->nlmsg_flags & NLM_F_DUMP) {
-		if (ops->dumpit == NULL)
-			return -EOPNOTSUPP;
-
-		genl_unlock();
-		err = netlink_dump_start(net->genl_sock, skb, nlh,
-					 ops->dumpit, ops->done);
-		genl_lock();
-		return err;
+	if (ops->dumpit) {
+		if (nlh->nlmsg_flags & NLM_F_DUMP) {
+			genl_unlock();
+			err = netlink_dump_start(net->genl_sock, skb, nlh,
+						 ops->dumpit, ops->done);
+			genl_lock();
+			return err;
+		}
 	}
 
 	if (ops->doit == NULL)

^ permalink raw reply

* Re: rps testing questions
From: Ben Hutchings @ 2011-01-18 18:34 UTC (permalink / raw)
  To: Rick Jones; +Cc: mi wake, netdev
In-Reply-To: <4D35DAB0.9030201@hp.com>

On Tue, 2011-01-18 at 10:23 -0800, Rick Jones wrote:
> Ben Hutchings wrote:
> > On Mon, 2011-01-17 at 17:43 +0800, mi wake wrote:
[...]
> >>I do ab and tbench testing also find there is less tps with enable
> >>rps.but,there is more cpu using when with enable rps.when with enable
> >>rps ,softirqs is blanced  on cpus.
> >>
> >>is there something wrong with my test?
> > 
> > 
> > In addition to what Eric said, check the interrupt moderation settings
> > (ethtool -c/-C options).  One-way latency for a single request/response
> > test will be at least the interrupt moderation value.
> > 
> > I haven't tested RPS by itself (Solarflare NICs have plenty of hardware
> > queues) so I don't know whether it can improve latency.  However, RFS
> > certainly does when there are many flows.
> 
> Is there actually an expectation that either RPS or RFS would improve *latency*? 
>   Multiple-stream throughput certainly, but with the additional work done to 
> spread things around, I wouldn't expect either to improve latency.

Yes, it seems to make a big improvement to latency when many flows are
active.  Tom told me that one of his benchmarks was 200 * netperf TCP_RR
in parallel, and I've seen over 40% reduction in latency for that.  That
said, allocating more RX queues might also help (sfc currently defaults
to one per processor package rather than one per processor thread, due
to concerns about CPU efficiency).

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: rps testing questions
From: Rick Jones @ 2011-01-18 18:23 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: mi wake, netdev
In-Reply-To: <1295269713.3700.5.camel@localhost>

Ben Hutchings wrote:
> On Mon, 2011-01-17 at 17:43 +0800, mi wake wrote:
> 
>>I do a rps(Receive Packet Steering) testing on centos 5.5 with  kernel 2.6.37.
>>cpu: 8 core Intel.
>>ethernet adapter: bnx2x
>>
>>Problem statement:
>>enable rps with:
>>echo "ff" > /sys/class/net/eth2/queues/rx-0/rps_cpus.
>>
>>running 1 instances of netperf TCP_RR: netperf  -t TCP_RR -H 192.168.0.1 -c -C
>>without rps: 9963.48(Trans Rate per sec)
>>with rps:  9387.59(Trans Rate per sec)

Presumably there was an increase in service demand corresponding with the drop 
in transactions per second.

Also, an unsolicited benchmarking style tip or two.  I find it helpful to either 
do several discrete runs, or use the confidence intervals (global -i and -I 
options) with the TCP_RR tests when I am looking to compare two settings.  I 
find a bit more "variability" in the _RR tests than the _STREAM tests.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002dI_002c-Global-26

Pinning netperf/netserver is also something I tend to do, but combining that 
with  confidence intervals, RPS is kind of difficult - the successive data 
connections made while running the iterations of the confidence intervals will 
have different port numbers and so different hashing.  That would cause RPS to 
put the connections on different cores in turn, which would, in conjunction with 
netperf/netserver being pinned to a core cause the relationship between where 
netperf runs and where netserver runs to change.  That will likely result in 
cache to cache (processor cache) transfers which will definitely up the service 
demand and drop the single-stream transactions per second.

In theory :) with RFS that should not be an issue since where netperf/netserver 
are pinned controls where the inbound processing takes place.

We are in a maze of twisty heuristics... :)

>>I do ab and tbench testing also find there is less tps with enable
>>rps.but,there is more cpu using when with enable rps.when with enable
>>rps ,softirqs is blanced  on cpus.
>>
>>is there something wrong with my test?
> 
> 
> In addition to what Eric said, check the interrupt moderation settings
> (ethtool -c/-C options).  One-way latency for a single request/response
> test will be at least the interrupt moderation value.
> 
> I haven't tested RPS by itself (Solarflare NICs have plenty of hardware
> queues) so I don't know whether it can improve latency.  However, RFS
> certainly does when there are many flows.

Is there actually an expectation that either RPS or RFS would improve *latency*? 
  Multiple-stream throughput certainly, but with the additional work done to 
spread things around, I wouldn't expect either to improve latency.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:28 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alessandro Suardi, jamal, David Miller, pablo, arthur.marsh,
	eric.dumazet, netdev
In-Reply-To: <alpine.LNX.2.01.1101181921320.17010@obet.zrqbmnf.qr>

On Tue, Jan 18, 2011 at 07:24:40PM +0100, Jan Engelhardt wrote:
> 
> On Tuesday 2011-01-18 19:10, Alessandro Suardi wrote:
> >On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> >>
> >> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> >> tests message type to verify this. Since genetlink can't do the same
> >> use "practical" test for ops->dumpit (assuming NEW request won't be
> >> mixed with GET).
...
> >2.6.37-git18 + netlink revert + this patch
> > - fixes Avahi
> > - breaks acpid
> >Starting acpi daemon: RTNETLINK1 answers: Operation not supported
> >acpid: error talking to the kernel via netlink
> 
> Deducing from that, it is a GET-like request that was sent by acpid, 
> and the message type is one that has both a dumpit and a doit function.
> So if EOPNOTSUPP now occurs on all message types that have both dumpit 
> and doit, you should have broken a lot more than just acpid.

Right, we need something better here.

Jarek P.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jan Engelhardt @ 2011-01-18 18:24 UTC (permalink / raw)
  To: Alessandro Suardi
  Cc: Jarek Poplawski, jamal, David Miller, pablo, arthur.marsh,
	eric.dumazet, netdev
In-Reply-To: <AANLkTikUPRfYupw1-Py5mHeN+pFv-zMKhmK3DKixxsH4@mail.gmail.com>


On Tuesday 2011-01-18 19:10, Alessandro Suardi wrote:
>On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
>>
>> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
>> tests message type to verify this. Since genetlink can't do the same
>> use "practical" test for ops->dumpit (assuming NEW request won't be
>> mixed with GET).
>>
>> -       if (nlh->nlmsg_flags & NLM_F_DUMP) {
>> -               if (ops->dumpit == NULL)
>> +       if (ops->dumpit) {
>> +               if (nlh->nlmsg_flags & NLM_F_DUMP) {
>> +                       genl_unlock();
>> +                       err = netlink_dump_start(net->genl_sock, skb, nlh,
>> +                                                ops->dumpit, ops->done);
>> +                       genl_lock();
>> +                       return err;
>> +               } else {
>>                        return -EOPNOTSUPP;
>> -
>> -               genl_unlock();
>> -               err = netlink_dump_start(net->genl_sock, skb, nlh,
>> -                                        ops->dumpit, ops->done);
>> -               genl_lock();
>> -               return err;
>> +               }
>
>2.6.37-git18 + netlink revert + this patch
> - fixes Avahi
> - breaks acpid
>Starting acpi daemon: RTNETLINK1 answers: Operation not supported
>acpid: error talking to the kernel via netlink

Deducing from that, it is a GET-like request that was sent by acpid, 
and the message type is one that has both a dumpit and a doit function.
So if EOPNOTSUPP now occurs on all message types that have both dumpit 
and doit, you should have broken a lot more than just acpid.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:23 UTC (permalink / raw)
  To: Alessandro Suardi
  Cc: jamal, David Miller, pablo, arthur.marsh, jengelh, eric.dumazet,
	netdev
In-Reply-To: <AANLkTikUPRfYupw1-Py5mHeN+pFv-zMKhmK3DKixxsH4@mail.gmail.com>

On Tue, Jan 18, 2011 at 07:10:35PM +0100, Alessandro Suardi wrote:
> On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > [PATCH] netlink: Fix possible NLM_F_DUMP misuse in genetlink
> >
> > NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> > tests message type to verify this. Since genetlink can't do the same
> > use "practical" test for ops->dumpit (assuming NEW request won't be
> > mixed with GET).
> >
> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> > Cc: Jan Engelhardt <jengelh@medozas.de>
> > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > Cc: Jamal Hadi Salim <hadi@cyberus.ca>
> > ---
> > Not for stable before testing!
> >
> > diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> > --- a/net/netlink/genetlink.c   2011-01-18 16:58:16.000000000 +0100
> > +++ b/net/netlink/genetlink.c   2011-01-18 17:08:43.000000000 +0100
> > @@ -519,15 +519,16 @@ static int genl_rcv_msg(struct sk_buff *
...
> 2.6.37-git18 + netlink revert + this patch
>  - fixes Avahi
>  - breaks acpid

Well, then fixing genetlink needs more than this and I withdraw this
patch. Anyway, netlink revert is IMHO needed to prevent the regression.

...
> If more debugging/testing is needed, do ping me. Thanks,

Many thanks for such a fast response!
Jarek P.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 18:11 UTC (permalink / raw)
  To: jamal; +Cc: David Miller, pablo, arthur.marsh, jengelh, eric.dumazet, netdev
In-Reply-To: <1295359503.2066.10.camel@mojatatu>

On Tue, Jan 18, 2011 at 09:05:03AM -0500, jamal wrote:
> On Tue, 2011-01-18 at 10:24 +0000, Jarek Poplawski wrote:
> 
> > Do you all expect all users manage to upgrade avahi app before
> > changing their stable kernel? I mean "own distro" users especially.
> 
> Unfortunately if that app is widely deployed, it is not pragmatic
> to break it in the name of pedanticity. i.e.
> Be conservative in what you send (clearly Avahi is farting all over the
> place) but more importantly be a nice liberal in what you accept.
> Maybe tell them if you have the cycles about their bad behavior.
> The important part is the GET (kind = 2). The DUMP as Jarek says
> is merely a utility to extrapolate that we need you to
> get everything.. So it is not such a big deal if someone passes
> extraneous senseless flags. Wrong? yes.

Thanks for the confirmation of my suspicions wrt. the RFC & rtnetlink.
But then Avahi seems right and we should get back to the written law.

> Jarek - For the record although i coauthored the doc, I was merely
> a messenger in putting together some artist painting.

Well, ain't pictures better than words ;-)

Cheers,
Jarek P.

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Alessandro Suardi @ 2011-01-18 18:10 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: jamal, David Miller, pablo, arthur.marsh, jengelh, eric.dumazet,
	netdev
In-Reply-To: <20110118172340.GB1843@del.dom.local>

On Tue, Jan 18, 2011 at 6:23 PM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> [PATCH] netlink: Fix possible NLM_F_DUMP misuse in genetlink
>
> NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
> tests message type to verify this. Since genetlink can't do the same
> use "practical" test for ops->dumpit (assuming NEW request won't be
> mixed with GET).
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> Cc: Jan Engelhardt <jengelh@medozas.de>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Jamal Hadi Salim <hadi@cyberus.ca>
> ---
> Not for stable before testing!
>
> diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> --- a/net/netlink/genetlink.c   2011-01-18 16:58:16.000000000 +0100
> +++ b/net/netlink/genetlink.c   2011-01-18 17:08:43.000000000 +0100
> @@ -519,15 +519,16 @@ static int genl_rcv_msg(struct sk_buff *
>            security_netlink_recv(skb, CAP_NET_ADMIN))
>                return -EPERM;
>
> -       if (nlh->nlmsg_flags & NLM_F_DUMP) {
> -               if (ops->dumpit == NULL)
> +       if (ops->dumpit) {
> +               if (nlh->nlmsg_flags & NLM_F_DUMP) {
> +                       genl_unlock();
> +                       err = netlink_dump_start(net->genl_sock, skb, nlh,
> +                                                ops->dumpit, ops->done);
> +                       genl_lock();
> +                       return err;
> +               } else {
>                        return -EOPNOTSUPP;
> -
> -               genl_unlock();
> -               err = netlink_dump_start(net->genl_sock, skb, nlh,
> -                                        ops->dumpit, ops->done);
> -               genl_lock();
> -               return err;
> +               }
>        }
>
>        if (ops->doit == NULL)
>

2.6.37-git18 + netlink revert + this patch
 - fixes Avahi
 - breaks acpid

Upon startup I have:

Starting acpi daemon: RTNETLINK1 answers: Operation not supported
acpid: error talking to the kernel via netlink




From strace output:

open("/dev/input/event6", O_RDONLY|O_NONBLOCK) = 8
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
ioctl(8, 0x80604520, 0x7fffb3418550)    = 8
ioctl(8, 0x80604521, 0x7fffb34185b0)    = 96
open("/dev/input/event8", O_RDONLY|O_NONBLOCK) = 9
fcntl(9, F_SETFD, FD_CLOEXEC)           = 0
ioctl(9, 0x80604520, 0x7fffb3418550)    = 8
ioctl(9, 0x80604532, 0x7fffb3418c10)    = 8
close(9)                                = 0
inotify_init()                          = 9
inotify_add_watch(9, "/dev/input", IN_CREATE) = 1
socket(PF_NETLINK, SOCK_RAW, 16)        = 10
setsockopt(10, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(10, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
bind(10, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(10, {sa_family=AF_NETLINK, pid=4298, groups=00000000}, [12]) = 0
sendmsg(10, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"$\0\0\0\20\0\5\0\346\3265M\0\0\0\0\3\0\0\0\17\0\2\0acpi_eve"...,
36}], msg_controllen=0, msg_flags=0}, 0) = 36
recvmsg(10, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"8\0\0\0\2\0\0\0\346\3265M\312\20\0\0\241\377\377\377$\0\0\0\20\0\5\0\346\3265M"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 56
dup(2)                                  = 11
fcntl(11, F_GETFL)                      = 0x1 (flags O_WRONLY)
close(11)                               = 0
write(2, "RTNETLINK1 answers: Operation no"..., 44RTNETLINK1 answers:
Operation not supported
) = 44
write(2, "acpid: error talking to the kern"..., 47acpid: error talking
to the kernel via netlink
) = 47
close(10)                               = 0
socket(PF_NETLINK, SOCK_RAW, 16)        = 10
setsockopt(10, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(10, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
bind(10, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(10, {sa_family=AF_NETLINK, pid=4298, groups=00000000}, [12]) = 0
unlink("/var/run/acpid.socket")         = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 11
bind(11, {sa_family=AF_FILE, path="/var/run/acpid.socket"}, 110) = 0
listen(11, 10)                          = 0



If more debugging/testing is needed, do ping me. Thanks,

--alessandro

 "There's always a siren singing you to shipwreck"

   (Radiohead, "There There")

^ permalink raw reply

* [PATCHv2] vhost: rcu annotation fixup
From: Michael S. Tsirkin @ 2011-01-18 18:08 UTC (permalink / raw)
  To: Paul E. McKenney, jasowang
  Cc: Michael S. Tsirkin, kvm, virtualization, netdev, linux-kernel

When built with rcu checks enabled, vhost triggers
bogus warnings as vhost features are read without
dev->mutex sometimes, and private pointer is read
with our kind of rcu where work serves as a
read side critical section.

Fixing it properly is not trivial.
Disable the warnings by stubbing out the checks for now.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v1: add TODO, fix more warnings.

 drivers/vhost/net.c   |    9 +++++----
 drivers/vhost/vhost.h |    6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..f616cef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -128,8 +128,7 @@ static void handle_tx(struct vhost_net *net)
 	size_t hdr_size;
 	struct socket *sock;
 
-	/* TODO: check that we are running from vhost_worker?
-	 * Not sure it's worth it, it's straight-forward enough. */
+	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock)
 		return;
@@ -306,7 +305,8 @@ static void handle_rx_big(struct vhost_net *net)
 	size_t len, total_len = 0;
 	int err;
 	size_t hdr_size;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
 
@@ -415,7 +415,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 	int err, headcount;
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
-	struct socket *sock = rcu_dereference(vq->private_data);
+	/* TODO: check that we are running from vhost_worker? */
+	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2af44b7..b3363ae 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -173,9 +173,9 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 {
 	unsigned acked_features;
 
-	acked_features =
-		rcu_dereference_index_check(dev->acked_features,
-					    lockdep_is_held(&dev->mutex));
+	/* TODO: check that we are running from vhost_worker or dev mutex is
+	 * held? */
+	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
 	return acked_features & (1 << bit);
 }
 
-- 
1.7.3.2.91.g446ac

^ permalink raw reply related

* Re: [PATCH] af_unix: implement socket filter
From: Alban Crequy @ 2011-01-18 17:51 UTC (permalink / raw)
  To: Eric Dumazet, Ian Molton
  Cc: netdev, linux-kernel, davem, ebiederm, xemul, davidel
In-Reply-To: <1295371361.3290.15.camel@edumazet-laptop>

Le Tue, 18 Jan 2011 18:22:41 +0100,
Eric Dumazet <eric.dumazet@gmail.com> a écrit :

> Le mardi 18 janvier 2011 à 16:39 +0000, Ian Molton a écrit :
> > From: Alban Crequy <alban.crequy@collabora.co.uk>
> > 
> > Linux Socket Filters can already be successfully attached and
> > detached on unix sockets with setsockopt(sockfd, SOL_SOCKET,
> > SO_{ATTACH,DETACH}_FILTER, ....). See:
> > Documentation/networking/filter.txt
> > 
> > But the filter was never used in the unix socket code so it did not
> > work. This patch uses sk_filter() to filter buffers before delivery.
> > 
> > This short program demonstrates the problem on SOCK_DGRAM.

By the way, the patch implements socket filters on SOCK_DGRAM and
SOCK_SEQPACKET but not SOCK_STREAM. Socket filters does not make sense
to me when there is no packet boundaries. But if there is a need for
it, the code for SOCK_STREAM could be added easily.

> Any idea on performance cost adding sk_filter() call ?

Ian will write a performance test and repost the patch with some stats.
I don't know about the performance cost.

> Hmm, looking at it, I have no idea why sk_filter() needs to block BH.

I don't know neither.

> I'll send a patch to relax this requirement.

Thanks for your review!

Alban

^ permalink raw reply

* Re: bnx2 cards intermittantly going offline
From: Michael Chan @ 2011-01-18 17:55 UTC (permalink / raw)
  To: Mills, Tony; +Cc: netdev@vger.kernel.org
In-Reply-To: <6DD3782C33561D44B47071B09946026405F63853AB@exchange1>


On Tue, 2011-01-18 at 02:54 -0800, Mills, Tony wrote:
> Last night i setup a machine to monitor overnight and at 3:52 this
> morning it became unresponsive. 
> 

When it becomes unresponsive, please send some packets to the NIC (such
as ping) and monitor statistics with ethtool -S.  See if the packets are
being received or discarded.  Also, run tcpdump on the machine to see if
the packets are properly received by the stack.  Thanks.



^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Kirill Smelkov @ 2011-01-18 17:56 UTC (permalink / raw)
  To: Oleg V. Ukhno
  Cc: Nicolas de Pesloüan, John Fastabend, Jay Vosburgh,
	David S. Miller, netdev@vger.kernel.org,
	Sébastien Barré, Christophe Paasch
In-Reply-To: <4D35B1B0.2090905@yandex-team.ru>

On Tue, Jan 18, 2011 at 06:28:48PM +0300, Oleg V. Ukhno wrote:
> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>> Le 18/01/2011 13:40, Oleg V. Ukhno a écrit :
[...]

>> Oleg, would you mind trying the above "two VLAN" topology" with
>> mode=balance-rr and report any results ? For high-availability purpose,
>> it's obviously necessary to setup those VLAN on distinct switches.
> I'll do it, but it will take some time to setup test environment,  
> several days may be.
> You mean following topology:
>           switch 1
>        /           \
> host A                host B
>        \  switch 2 /
>

FYI: I'm in the process of developing new redundancy mode for bonding,
and while at it, the following script is maybe useful for you too, so
that bonding testing can be done entirely on one host:

http://repo.or.cz/w/linux-2.6/kirr.git/blob/refs/heads/x/etherdup:/tools/bonding/mk-tap-loops.sh


Sorry for maybe being offtopic,
Kirill

^ permalink raw reply

* Re: [PATCH] vhost: rcu annotation fixup
From: Michael S. Tsirkin @ 2011-01-18 17:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20110118174834.GF2193@linux.vnet.ibm.com>

On Tue, Jan 18, 2011 at 09:48:34AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> > When built with rcu checks enabled, vhost triggers
> > bogus warnings as vhost features are read without
> > dev->mutex sometimes.
> > Fixing it properly is not trivial as vhost.h does not
> > know which lockdep classes it will be used under.
> > Disable the warning by stubbing out the check for now.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  drivers/vhost/vhost.h |    4 +---
> >  1 files changed, 1 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 2af44b7..2d03a31 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> >  {
> >  	unsigned acked_features;
> > 
> > -	acked_features =
> > -		rcu_dereference_index_check(dev->acked_features,
> > -					    lockdep_is_held(&dev->mutex));
> > +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);
> 
> Ouch!!!
> 
> Could you please at least add a comment?

Yes, OK.

> Alternatively, pass in the lock that is held and check for that?  Given
> that this is a static inline, the compiler should be able to optimize
> the argument away when !PROVE_RCU, correct?
> 
> 							Thanx, Paul

Hopefully, yes. We don't always have a lock: the idea was
to create a lockdep for these cases. But we can't pass
the pointer to that ...

> >  	return acked_features & (1 << bit);
> >  }
> > 
> > -- 
> > 1.7.3.2.91.g446ac
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH] vhost: rcu annotation fixup
From: Paul E. McKenney @ 2011-01-18 17:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20110118110845.GA11555@redhat.com>

On Tue, Jan 18, 2011 at 01:08:45PM +0200, Michael S. Tsirkin wrote:
> When built with rcu checks enabled, vhost triggers
> bogus warnings as vhost features are read without
> dev->mutex sometimes.
> Fixing it properly is not trivial as vhost.h does not
> know which lockdep classes it will be used under.
> Disable the warning by stubbing out the check for now.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vhost/vhost.h |    4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 2af44b7..2d03a31 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -173,9 +173,7 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
>  {
>  	unsigned acked_features;
> 
> -	acked_features =
> -		rcu_dereference_index_check(dev->acked_features,
> -					    lockdep_is_held(&dev->mutex));
> +	acked_features = rcu_dereference_index_check(dev->acked_features, 1);

Ouch!!!

Could you please at least add a comment?

Alternatively, pass in the lock that is held and check for that?  Given
that this is a static inline, the compiler should be able to optimize
the argument away when !PROVE_RCU, correct?

							Thanx, Paul

>  	return acked_features & (1 << bit);
>  }
> 
> -- 
> 1.7.3.2.91.g446ac
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* [PATCH] net: filter: dont block softirqs in sk_run_filter()
From: Eric Dumazet @ 2011-01-18 17:46 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-kernel, ebiederm, xemul, davidel, Alban Crequy,
	Ian Molton
In-Reply-To: <1295371361.3290.15.camel@edumazet-laptop>

Packet filter (BPF) doesnt need to disable softirqs, being fully
re-entrant and lock-less.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h     |    2 +-
 net/core/filter.c      |    6 +++---
 net/packet/af_packet.c |    6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index d884d26..ba6465b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1189,7 +1189,7 @@ extern void sk_filter_release_rcu(struct rcu_head *rcu);
 static inline void sk_filter_release(struct sk_filter *fp)
 {
 	if (atomic_dec_and_test(&fp->refcnt))
-		call_rcu_bh(&fp->rcu, sk_filter_release_rcu);
+		call_rcu(&fp->rcu, sk_filter_release_rcu);
 }
 
 static inline void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
diff --git a/net/core/filter.c b/net/core/filter.c
index afc5837..232b187 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -142,14 +142,14 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		return err;
 
-	rcu_read_lock_bh();
-	filter = rcu_dereference_bh(sk->sk_filter);
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
 	if (filter) {
 		unsigned int pkt_len = sk_run_filter(skb, filter->insns);
 
 		err = pkt_len ? pskb_trim(skb, pkt_len) : -EPERM;
 	}
-	rcu_read_unlock_bh();
+	rcu_read_unlock();
 
 	return err;
 }
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 91cb1d7..c3fc7b7 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -523,11 +523,11 @@ static inline unsigned int run_filter(const struct sk_buff *skb,
 {
 	struct sk_filter *filter;
 
-	rcu_read_lock_bh();
-	filter = rcu_dereference_bh(sk->sk_filter);
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL)
 		res = sk_run_filter(skb, filter->insns);
-	rcu_read_unlock_bh();
+	rcu_read_unlock();
 
 	return res;
 }



^ permalink raw reply related

* Re: [PATCH] net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
From: Michael Chan @ 2011-01-18 17:21 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <AANLkTikmadouBTgkf4OZy30=Zg6uZ+qSng=rWhTjjm8e@mail.gmail.com>


On Mon, 2011-01-17 at 23:12 -0800, Jesse Gross wrote:
> On Tue, Jan 18, 2011 at 1:55 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le lundi 17 janvier 2011 à 22:46 -0800, Jesse Gross a écrit :
> >> In netif_skb_features() we return only the features that are valid for vlans
> >> if we have a vlan packet.  However, we should not mask out NETIF_F_HW_VLAN_TX
> >> since it enables transmission of vlan tags and is obviously valid.
> >>
> >> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> >> Signed-off-by: Jesse Gross <jesse@nicira.com>
> >
> > Thanks Jesse
> >
> > Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Now back to the "ethtool -K eth0 txvlan off" problem on bnx2
> >
> > Is it a driver/software problem or hardware/firmware one ?
> 
> CC'ing Michael Chan
> 
> It looks like bnx2 is storing the offsets of various headers so the
> hardware can find them for TSO.  The parsing logic doesn't do anything
> for vlan tags, so the hardware gets confused if one is present in the
> packet itself.

Yeah, I don't think the hardware/firmware can replicate the vlan tag +
headers properly for TSO if there is a VLAN tag in the packet.  Even
simple tx checksum offload may have problem, but I'll need to check.

> 
> Quick fix is to simply disallow disabling TX vlan offload or disable
> TSO at the same time or some other Ethtool game.  However, if the
> hardware supports it then it would be nicer to fix up the TSO setup
> logic.  Maybe we can just add the size of the vlan tag to the offset
> but I am not certain.  Michael, do you know if this is possible?
> 

I doubt it, as the VLAN tag needs to be replicated for each transmitted
packet.  I'll check with the firmware/hardware guys.

Thanks.



^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 17:23 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, pablo, arthur.marsh, jengelh, eric.dumazet, netdev,
	Alessandro Suardi

[PATCH] netlink: Fix possible NLM_F_DUMP misuse in genetlink

NLM_F_DUMP flags should be applied to GET requests only, eg. rtnetlink
tests message type to verify this. Since genetlink can't do the same
use "practical" test for ops->dumpit (assuming NEW request won't be
mixed with GET).

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
---
Not for stable before testing!

diff -Nurp a/net/netlink/genetlink.c b/net/netlink/genetlink.c
--- a/net/netlink/genetlink.c	2011-01-18 16:58:16.000000000 +0100
+++ b/net/netlink/genetlink.c	2011-01-18 17:08:43.000000000 +0100
@@ -519,15 +519,16 @@ static int genl_rcv_msg(struct sk_buff *
 	    security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (nlh->nlmsg_flags & NLM_F_DUMP) {
-		if (ops->dumpit == NULL)
+	if (ops->dumpit) {
+		if (nlh->nlmsg_flags & NLM_F_DUMP) {
+			genl_unlock();
+			err = netlink_dump_start(net->genl_sock, skb, nlh,
+						 ops->dumpit, ops->done);
+			genl_lock();
+			return err;
+		} else {
 			return -EOPNOTSUPP;
-
-		genl_unlock();
-		err = netlink_dump_start(net->genl_sock, skb, nlh,
-					 ops->dumpit, ops->done);
-		genl_lock();
-		return err;
+		}
 	}
 
 	if (ops->doit == NULL)

^ permalink raw reply

* Re: inbound connection problems when "netlink: test for all flags of the NLM_F_DUMP composite" commit applied
From: Jarek Poplawski @ 2011-01-18 17:22 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, pablo, arthur.marsh, jengelh, eric.dumazet, netdev,
	Alessandro Suardi
In-Reply-To: <1295359653.2066.11.camel@mojatatu>

On Tue, Jan 18, 2011 at 09:07:33AM -0500, jamal wrote:
> On Tue, 2011-01-18 at 09:05 -0500, jamal wrote:
> > On Tue, 2011-01-18 at 10:24 +0000, Jarek Poplawski wrote:
> > 
> > > Do you all expect all users manage to upgrade avahi app before
> > > changing their stable kernel? I mean "own distro" users especially.
> > 
> > Unfortunately if that app is widely deployed, it is not pragmatic
> > to break it in the name of pedanticity. 
> 
> And to complete that thought - if avahi continues to work and merely
> whines, i dont see why this patch should be reverted..

Alas it looks like more than whines ;-)
Cheers,
Jarek P.

[PATCH] netlink: Revert test for all flags of the NLM_F_DUMP composite

Arthur Marsh reported:
> With kernel 2.6.37-git9 and later inbound telnetd-ssl connections
> failed, and on machine shut-down, there were warning messages about
> daemons not return status.
and bisected the bug.
 
After looking at net/core/rtnetlink.c:

static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
{
	...
        kind = type&3;

        if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN))
                return -EPERM;

        if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {

I'm sure the fix in commit 0ab03c2b1478f24 is simply wrong. NLM_F_DUMP
flags can't be mistaken with NLM_F_EXCL if there is a check for GET
request like above (ie. kind == 2). So there is no reason to limit 3
various dump cases from the RFC (not counting the atomic flag) to one
and only NLM_F_DUMP.

That's why I propose this reverting patch here and a separate fix to
genetlink (soon).

Reverts commit 0ab03c2b1478f2438d2c80204f7fef65b1bca9cf

Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Bisected-by: Arthur Marsh <arthur.marsh@internode.on.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
---

 net/core/rtnetlink.c                 |    2 +-
 net/ipv4/inet_diag.c                 |    2 +-
 net/netfilter/nf_conntrack_netlink.c |    4 ++--
 net/netlink/genetlink.c              |    2 +-
 net/xfrm/xfrm_user.c                 |    2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a5f7535..750db57 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1820,7 +1820,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	if (kind != 2 && security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (kind == 2 && (nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
 		struct sock *rtnl;
 		rtnl_dumpit_func dumpit;
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 2746c1f..2ada171 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -858,7 +858,7 @@ static int inet_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	    nlmsg_len(nlh) < hdrlen)
 		return -EINVAL;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		if (nlmsg_attrlen(nlh, hdrlen)) {
 			struct nlattr *attr;
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 2b7eef3..93297aa 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -924,7 +924,7 @@ ctnetlink_get_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	u16 zone;
 	int err;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP)
+	if (nlh->nlmsg_flags & NLM_F_DUMP)
 		return netlink_dump_start(ctnl, skb, nlh, ctnetlink_dump_table,
 					  ctnetlink_done);
 
@@ -1787,7 +1787,7 @@ ctnetlink_get_expect(struct sock *ctnl, struct sk_buff *skb,
 	u16 zone;
 	int err;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		return netlink_dump_start(ctnl, skb, nlh,
 					  ctnetlink_exp_dump_table,
 					  ctnetlink_exp_done);
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index f83cb37..1781d99 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -519,7 +519,7 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	    security_netlink_recv(skb, CAP_NET_ADMIN))
 		return -EPERM;
 
-	if ((nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
 		if (ops->dumpit == NULL)
 			return -EOPNOTSUPP;
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index d5e1e0b..6129196 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2189,7 +2189,7 @@ static int xfrm_user_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 
 	if ((type == (XFRM_MSG_GETSA - XFRM_MSG_BASE) ||
 	     type == (XFRM_MSG_GETPOLICY - XFRM_MSG_BASE)) &&
-	    (nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP) {
+	    (nlh->nlmsg_flags & NLM_F_DUMP)) {
 		if (link->dump == NULL)
 			return -EINVAL;
 

^ permalink raw reply related

* Re: [PATCH] af_unix: implement socket filter
From: Eric Dumazet @ 2011-01-18 17:22 UTC (permalink / raw)
  To: Ian Molton
  Cc: netdev, linux-kernel, davem, ebiederm, xemul, davidel,
	Alban Crequy
In-Reply-To: <1295368755-20931-1-git-send-email-ian.molton@collabora.co.uk>

Le mardi 18 janvier 2011 à 16:39 +0000, Ian Molton a écrit :
> From: Alban Crequy <alban.crequy@collabora.co.uk>
> 
> Linux Socket Filters can already be successfully attached and detached on unix
> sockets with setsockopt(sockfd, SOL_SOCKET, SO_{ATTACH,DETACH}_FILTER, ...).
> See: Documentation/networking/filter.txt
> 
> But the filter was never used in the unix socket code so it did not work. This
> patch uses sk_filter() to filter buffers before delivery.
> 
> This short program demonstrates the problem on SOCK_DGRAM.


Any idea on performance cost adding sk_filter() call ?

Hmm, looking at it, I have no idea why sk_filter() needs to block BH.

I'll send a patch to relax this requirement.

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing
From: Oleg V. Ukhno @ 2011-01-18 17:21 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jay Vosburgh, netdev@vger.kernel.org, David S. Miller
In-Reply-To: <4D35C2D7.6090008@intel.com>

On 01/18/2011 07:41 PM, John Fastabend wrote:

>>
>> John, what is you opinion on such load balancing method in general,
>> without referring to particular use cases?
>>
>
> This seems reasonable to me, but I'll defer to Jay on this. As long as the
> limitations are documented and it looks like they are this may be fine.
>
> Mostly I was interested to know what led you down this path and why MPIO
> was not working as at least I expected it should. When I get some time I'll
> see if we can address at least some of these issues. Even so it seems like
> this bonding mode may still be useful for some use cases perhaps even none
> storage use cases.
>
>>

I was adressing several problems with my patch:
  - I was unable to consume whole bandwidth with multipath - with four 
1Gbit "paths" it was slightly above 2Gbit/s
  - Link failures caused quite often disk failures, which led to Oracle 
ASM rebalance, especially with versions below 11.
  - It is not always possible to autogenerate multipathd.conf with 
human-readable device names because of iscsi session id and scsi device 
bus/channel/etc mismatch(usually it differs by 1, but not necessarily), 
with bonding solution I can just look into /dev/disk/by-path to find out 
where physically is device, let's  say, /dev/sdab, located(it's just a 
free bonus I've got, so to say:)) .



-- 
С уважением,
руководитель службы
эксплуатации коммерческих и финансовых сервисов
ООО Яндекс

Олег Юхно



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox