* Re: [PATCH v2 1/1] TCP: increase default initial receive window.
From: Nandita Dukkipati @ 2010-12-21 0:23 UTC (permalink / raw)
To: David S. Miller, Stephen Hemminger, Rick Jones
Cc: netdev, Tom Herbert, Laurent Chavey, Yuchung Cheng,
Nandita Dukkipati
In-Reply-To: <1292890556-29904-1-git-send-email-nanditad@google.com>
As per the comments, I have resubmitted the patch with a longer
explanation. Please let me know if the preference is to have a sysctl
for the initial default receive window, as I am ok either way.
Thanks,
-Nandita
On Mon, Dec 20, 2010 at 4:15 PM, Nandita Dukkipati <nanditad@google.com> wrote:
> This patch changes the default initial receive window to 10 mss
> (defined constant). The default window is limited to the maximum
> of 10*1460 and 2*mss (when mss > 1460).
>
> draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
> increasing TCP's initial congestion window to 10 mss or about 15KB.
> Leading up to this proposal were several large-scale live Internet
> experiments with an initial congestion window of 10 mss (IW10), where
> we showed that the average latency of HTTP responses improved by
> approximately 10%. This was accompanied by a slight increase in
> retransmission rate (0.5%), most of which is coming from applications
> opening multiple simultaneous connections. To understand the extreme
> worst case scenarios, and fairness issues (IW10 versus IW3), we further
> conducted controlled testbed experiments. We came away finding minimal
> negative impact even under low link bandwidths (dial-ups) and small
> buffers. These results are extremely encouraging to adopting IW10.
>
> However, an initial congestion window of 10 mss is useless unless a TCP
> receiver advertises an initial receive window of at least 10 mss.
> Fortunately, in the large-scale Internet experiments we found that most
> widely used operating systems advertised large initial receive windows
> of 64KB, allowing us to experiment with a wide range of initial
> congestion windows. Linux systems were among the few exceptions that
> advertised a small receive window of 6KB. The purpose of this patch is
> to fix this shortcoming.
>
> References:
> 1. A comprehensive list of all IW10 references to date.
> http://code.google.com/speed/protocols/tcpm-IW10.html
>
> 2. Paper describing results from large-scale Internet experiments with IW10.
> http://ccr.sigcomm.org/drupal/?q=node/621
>
> 3. Controlled testbed experiments under worst case scenarios and a
> fairness study.
> http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf
>
> 4. Raw test data from testbed experiments (Linux senders/receivers)
> with initial congestion and receive windows of both 10 mss.
> http://research.csc.ncsu.edu/netsrv/?q=content/iw10
>
> 5. Internet-Draft. Increasing TCP's Initial Window.
> https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/
>
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> ---
> Changelog since v1:
> - A longer and better commit message.
>
> include/net/tcp.h | 3 +++
> net/ipv4/tcp_output.c | 11 ++++++++---
> 2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index b448030..38509f0 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -60,6 +60,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
> */
> #define MAX_TCP_WINDOW 32767U
>
> +/* Offer an initial receive window of 10 mss. */
> +#define TCP_DEFAULT_INIT_RCVWND 10
> +
> /* Minimal accepted MSS. It is (60+60+8) - (20+20). */
> #define TCP_MIN_MSS 88U
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 2d39066..dc7c096 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -228,10 +228,15 @@ void tcp_select_initial_window(int __space, __u32 mss,
> }
> }
>
> - /* Set initial window to value enough for senders, following RFC5681. */
> + /* Set initial window to a value enough for senders starting with
> + * initial congestion window of TCP_DEFAULT_INIT_RCVWND. Place
> + * a limit on the initial window when mss is larger than 1460.
> + */
> if (mss > (1 << *rcv_wscale)) {
> - int init_cwnd = rfc3390_bytes_to_packets(mss);
> -
> + int init_cwnd = TCP_DEFAULT_INIT_RCVWND;
> + if (mss > 1460)
> + init_cwnd =
> + max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
> /* when initializing use the value from init_rcv_wnd
> * rather than the default from above
> */
> --
> 1.7.3.1
>
>
^ permalink raw reply
* Re: [PATCH net-2.6] net_sched: always clone skbs
From: Changli Gao @ 2010-12-21 0:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jarek Poplawski, David S. Miller, netdev, Jamal Hadi Salim,
Pawel Staszewski
In-Reply-To: <1292887689.2627.150.camel@edumazet-laptop>
On Tue, Dec 21, 2010 at 7:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Indeed, commit number is wrong. It was :
>
> commit 210d6de78c5d7c785fc532556cea340e517955e1
> Author: Changli Gao <xiaosuo@gmail.com>
> Date: Thu Jun 24 16:25:12 2010 +0000
>
> act_mirred: don't clone skb when skb isn't shared
>
> It triggers now because GRO might be default to on now.
>
> commit 6a08d194ee40806e0ccd5f36ed768e64cbfc979f
> e1000: use GRO for receive
>
> (or other various GRO patches)
>
>
Oops. It is all my fault. Sorry. What can I do to fix this after David
has applied it and pushed it out?
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* [PATCH v2 1/1] TCP: increase default initial receive window.
From: Nandita Dukkipati @ 2010-12-21 0:15 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Tom Herbert, Laurent Chavey, Yuchung Cheng,
Stephen Hemminger, Rick Jones, Nandita Dukkipati
In-Reply-To: <1292642451-892-1-git-send-email-nanditad@google.com>
This patch changes the default initial receive window to 10 mss
(defined constant). The default window is limited to the maximum
of 10*1460 and 2*mss (when mss > 1460).
draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
increasing TCP's initial congestion window to 10 mss or about 15KB.
Leading up to this proposal were several large-scale live Internet
experiments with an initial congestion window of 10 mss (IW10), where
we showed that the average latency of HTTP responses improved by
approximately 10%. This was accompanied by a slight increase in
retransmission rate (0.5%), most of which is coming from applications
opening multiple simultaneous connections. To understand the extreme
worst case scenarios, and fairness issues (IW10 versus IW3), we further
conducted controlled testbed experiments. We came away finding minimal
negative impact even under low link bandwidths (dial-ups) and small
buffers. These results are extremely encouraging to adopting IW10.
However, an initial congestion window of 10 mss is useless unless a TCP
receiver advertises an initial receive window of at least 10 mss.
Fortunately, in the large-scale Internet experiments we found that most
widely used operating systems advertised large initial receive windows
of 64KB, allowing us to experiment with a wide range of initial
congestion windows. Linux systems were among the few exceptions that
advertised a small receive window of 6KB. The purpose of this patch is
to fix this shortcoming.
References:
1. A comprehensive list of all IW10 references to date.
http://code.google.com/speed/protocols/tcpm-IW10.html
2. Paper describing results from large-scale Internet experiments with IW10.
http://ccr.sigcomm.org/drupal/?q=node/621
3. Controlled testbed experiments under worst case scenarios and a
fairness study.
http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf
4. Raw test data from testbed experiments (Linux senders/receivers)
with initial congestion and receive windows of both 10 mss.
http://research.csc.ncsu.edu/netsrv/?q=content/iw10
5. Internet-Draft. Increasing TCP's Initial Window.
https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
Changelog since v1:
- A longer and better commit message.
include/net/tcp.h | 3 +++
net/ipv4/tcp_output.c | 11 ++++++++---
2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b448030..38509f0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -60,6 +60,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
*/
#define MAX_TCP_WINDOW 32767U
+/* Offer an initial receive window of 10 mss. */
+#define TCP_DEFAULT_INIT_RCVWND 10
+
/* Minimal accepted MSS. It is (60+60+8) - (20+20). */
#define TCP_MIN_MSS 88U
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2d39066..dc7c096 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -228,10 +228,15 @@ void tcp_select_initial_window(int __space, __u32 mss,
}
}
- /* Set initial window to value enough for senders, following RFC5681. */
+ /* Set initial window to a value enough for senders starting with
+ * initial congestion window of TCP_DEFAULT_INIT_RCVWND. Place
+ * a limit on the initial window when mss is larger than 1460.
+ */
if (mss > (1 << *rcv_wscale)) {
- int init_cwnd = rfc3390_bytes_to_packets(mss);
-
+ int init_cwnd = TCP_DEFAULT_INIT_RCVWND;
+ if (mss > 1460)
+ init_cwnd =
+ max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
/* when initializing use the value from init_rcv_wnd
* rather than the default from above
*/
--
1.7.3.1
^ permalink raw reply related
* Re: [stable] [STABLE 2.6.32 PATCH] net: release dst entry while cache-hot for GSO case too
From: Greg KH @ 2010-12-21 0:07 UTC (permalink / raw)
To: avagin@gmail.com
Cc: Greg Kroah-Hartman, krkumar2, avagin, eric.dumazet, netdev, mjt,
David Miller, stable
In-Reply-To: <20101209181937.GA24175@kroah.com>
On Thu, Dec 09, 2010 at 10:19:37AM -0800, Greg KH wrote:
> On Thu, Dec 09, 2010 at 09:43:03AM +0300, avagin@gmail.com wrote:
> > I add the patch in attachments
>
> Ok, thanks, I'll queue it up for the next .32 release after this one.
Now applied.
greg k-h
^ permalink raw reply
* Re: [PATCH net-2.6] net_sched: always clone skbs
From: Jarek Poplawski @ 2010-12-20 23:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: Changli Gao, David S. Miller, netdev, Jamal Hadi Salim,
Pawel Staszewski
In-Reply-To: <1292887689.2627.150.camel@edumazet-laptop>
On Tue, Dec 21, 2010 at 12:28:09AM +0100, Eric Dumazet wrote:
> Le mardi 21 décembre 2010 ?? 00:20 +0100, Jarek Poplawski a écrit :
> > On Mon, Dec 20, 2010 at 10:35:30PM +0800, Changli Gao wrote:
> > > Pawel reported a panic related to handling shared skbs in ixgbe
> > > incorrectly. So we need to revert my previous patch to work around
> > > this bug. Instead of reverting the patch completely, I just revert
> > > the essential lines, so we can add the previous optimization
> > > back more easily in future.
> > >
> > > commit 3511c9132f8b1e1b5634e41a3331c44b0c13be70
> > > Author: Changli Gao <xiaosuo@gmail.com>
> > > Date: Sat Oct 16 13:04:08 2010 +0000
> > >
> > > net_sched: remove the unused parameter of qdisc_create_dflt()
> > >
> >
> > Isn't there some mistake in the commit number? IMHO this changelog is
> > mostly wrong. The bug happens outside of ixgbe, probably in
> > dev_hard_start_xmit() -> __skb_linearize(), and even if not, it can,
> > without any driver's fault. The only question is why it didn't triger
> > with 2.6.36.
> >
>
> Indeed, commit number is wrong. It was :
>
> commit 210d6de78c5d7c785fc532556cea340e517955e1
> Author: Changli Gao <xiaosuo@gmail.com>
> Date: Thu Jun 24 16:25:12 2010 +0000
>
> act_mirred: don't clone skb when skb isn't shared
>
> It triggers now because GRO might be default to on now.
>
> commit 6a08d194ee40806e0ccd5f36ed768e64cbfc979f
> e1000: use GRO for receive
>
> (or other various GRO patches)
Well, still some doubt after re-reading Pawel's 1st. message:
host1 (kernel 2.6.36.2)
netperf client -> eth3 (82598EB 10-Gigabit AT CX4) - directly connected to eth2 of host2
ethtool -k eth3
Offload parameters for eth3:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
...
Of course, I know this was another box.
Jarek P.
^ permalink raw reply
* Re: [PATCH net-2.6] net_sched: always clone skbs
From: Eric Dumazet @ 2010-12-20 23:28 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Changli Gao, David S. Miller, netdev, Jamal Hadi Salim,
Pawel Staszewski
In-Reply-To: <20101220232020.GB2052@del.dom.local>
Le mardi 21 décembre 2010 à 00:20 +0100, Jarek Poplawski a écrit :
> On Mon, Dec 20, 2010 at 10:35:30PM +0800, Changli Gao wrote:
> > Pawel reported a panic related to handling shared skbs in ixgbe
> > incorrectly. So we need to revert my previous patch to work around
> > this bug. Instead of reverting the patch completely, I just revert
> > the essential lines, so we can add the previous optimization
> > back more easily in future.
> >
> > commit 3511c9132f8b1e1b5634e41a3331c44b0c13be70
> > Author: Changli Gao <xiaosuo@gmail.com>
> > Date: Sat Oct 16 13:04:08 2010 +0000
> >
> > net_sched: remove the unused parameter of qdisc_create_dflt()
> >
>
> Isn't there some mistake in the commit number? IMHO this changelog is
> mostly wrong. The bug happens outside of ixgbe, probably in
> dev_hard_start_xmit() -> __skb_linearize(), and even if not, it can,
> without any driver's fault. The only question is why it didn't triger
> with 2.6.36.
>
Indeed, commit number is wrong. It was :
commit 210d6de78c5d7c785fc532556cea340e517955e1
Author: Changli Gao <xiaosuo@gmail.com>
Date: Thu Jun 24 16:25:12 2010 +0000
act_mirred: don't clone skb when skb isn't shared
It triggers now because GRO might be default to on now.
commit 6a08d194ee40806e0ccd5f36ed768e64cbfc979f
e1000: use GRO for receive
(or other various GRO patches)
^ permalink raw reply
* Re: [net-next-2.6 PATCH 4/4] net_sched: add MQSAFE flag to qdisc to identify mq like qdiscs
From: Ben Hutchings @ 2010-12-20 23:22 UTC (permalink / raw)
To: John Fastabend
Cc: davem, netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman
In-Reply-To: <20101217153456.12170.17327.stgit@jf-dev1-dcblab>
On Fri, 2010-12-17 at 07:34 -0800, John Fastabend wrote:
> Add a MQSAFE flag to the qdisc schedulers that can be safely
> managed by sch_mclass. Without this flag schedulers that are
> not aware of multiple tx queues can be grafted under the
> mclass qdisc. Allowing incorrect qdiscs to be grafted causes
> an invalid mapping from qdisc's to netdevice queues.
[...]
This should be defined before adding sch_mclass, or at the same time,
not after.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH net-2.6] net_sched: always clone skbs
From: Jarek Poplawski @ 2010-12-20 23:20 UTC (permalink / raw)
To: Changli Gao
Cc: David S. Miller, Eric Dumazet, netdev, Jamal Hadi Salim,
Pawel Staszewski
In-Reply-To: <1292855730-19265-1-git-send-email-xiaosuo@gmail.com>
On Mon, Dec 20, 2010 at 10:35:30PM +0800, Changli Gao wrote:
> Pawel reported a panic related to handling shared skbs in ixgbe
> incorrectly. So we need to revert my previous patch to work around
> this bug. Instead of reverting the patch completely, I just revert
> the essential lines, so we can add the previous optimization
> back more easily in future.
>
> commit 3511c9132f8b1e1b5634e41a3331c44b0c13be70
> Author: Changli Gao <xiaosuo@gmail.com>
> Date: Sat Oct 16 13:04:08 2010 +0000
>
> net_sched: remove the unused parameter of qdisc_create_dflt()
>
Isn't there some mistake in the commit number? IMHO this changelog is
mostly wrong. The bug happens outside of ixgbe, probably in
dev_hard_start_xmit() -> __skb_linearize(), and even if not, it can,
without any driver's fault. The only question is why it didn't triger
with 2.6.36.
Jarek P.
^ permalink raw reply
* [PATCH net-next-2.6] sch_sfq: allow big packets and be fair
From: Eric Dumazet @ 2010-12-20 23:16 UTC (permalink / raw)
To: David Miller; +Cc: Patrick McHardy, netdev, Jarek Poplawski
In-Reply-To: <1292864525.2800.189.camel@edumazet-laptop>
SFQ is currently 'limited' to small packets, because it uses a 16bit
allotment number per flow. Switch it to 18bit, and use appropriate
handling to make sure this allotment is in [1 .. quantum] range before a
new packet is dequeued, so that fairness is respected.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
---
net/sched/sch_sfq.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index c474b4b..878704a 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -67,7 +67,7 @@
IMPLEMENTATION:
This implementation limits maximal queue length to 128;
- maximal mtu to 2^15-1; max 128 flows, number of hash buckets to 1024.
+ maximal mtu to 2^16-1; max 128 flows, number of hash buckets to 1024.
The only goal of this restrictions was that all data
fit into one 4K page on 32bit arches.
@@ -99,9 +99,10 @@ struct sfq_slot {
sfq_index qlen; /* number of skbs in skblist */
sfq_index next; /* next slot in sfq chain */
struct sfq_head dep; /* anchor in dep[] chains */
- unsigned short hash; /* hash value (index in ht[]) */
- short allot; /* credit for this slot */
+ unsigned int hash:14; /* hash value (index in ht[]) */
+ unsigned int allot:18; /* credit for this slot */
};
+#define ALLOT_ZERO (1 << 16)
struct sfq_sched_data
{
@@ -394,7 +395,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
q->tail->next = x;
}
q->tail = slot;
- slot->allot = q->quantum;
+ slot->allot = ALLOT_ZERO + q->quantum;
}
if (++sch->q.qlen <= q->limit) {
sch->bstats.bytes += qdisc_pkt_len(skb);
@@ -430,8 +431,14 @@ sfq_dequeue(struct Qdisc *sch)
if (q->tail == NULL)
return NULL;
+next:
a = q->tail->next;
slot = &q->slots[a];
+ if (slot->allot <= ALLOT_ZERO) {
+ q->tail = slot;
+ slot->allot += q->quantum;
+ goto next;
+ }
skb = slot_dequeue_head(slot);
sfq_dec(q, a);
sch->q.qlen--;
@@ -446,9 +453,8 @@ sfq_dequeue(struct Qdisc *sch)
return skb;
}
q->tail->next = next_a;
- } else if ((slot->allot -= qdisc_pkt_len(skb)) <= 0) {
- q->tail = slot;
- slot->allot += q->quantum;
+ } else {
+ slot->allot -= qdisc_pkt_len(skb);
}
return skb;
}
@@ -610,7 +616,9 @@ static int sfq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
struct sfq_sched_data *q = qdisc_priv(sch);
const struct sfq_slot *slot = &q->slots[q->ht[cl - 1]];
struct gnet_stats_queue qs = { .qlen = slot->qlen };
- struct tc_sfq_xstats xstats = { .allot = slot->allot };
+ struct tc_sfq_xstats xstats = {
+ .allot = slot->allot - ALLOT_ZERO
+ };
struct sk_buff *skb;
slot_queue_walk(slot, skb)
^ permalink raw reply related
* Re: [net-next-2.6 PATCH 2/4] net_sched: Allow multiple mq qdisc to be used as non-root
From: Ben Hutchings @ 2010-12-20 23:12 UTC (permalink / raw)
To: John Fastabend
Cc: davem, netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman
In-Reply-To: <20101217153445.12170.73420.stgit@jf-dev1-dcblab>
On Fri, 2010-12-17 at 07:34 -0800, John Fastabend wrote:
[...]
> diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
> index ecc302f..35ed26d 100644
> --- a/net/sched/sch_mq.c
> +++ b/net/sched/sch_mq.c
> @@ -19,17 +19,39 @@
>
> struct mq_sched {
> struct Qdisc **qdiscs;
> + u8 num_tc;
> };
>
> +static void mq_queues(struct net_device *dev, struct Qdisc *sch,
> + unsigned int *count, unsigned int *offset)
> +{
> + struct mq_sched *priv = qdisc_priv(sch);
> + if (priv->num_tc) {
> + int queue = TC_H_MIN(sch->parent) - 1;
> + if (count)
> + *count = dev->tc_to_txq[queue].count;
> + if (offset)
> + *offset = dev->tc_to_txq[queue].offset;
> + } else {
> + if (count)
> + *count = dev->num_tx_queues;
> + if (offset)
> + *offset = 0;
> + }
> +}
[...]
It looks like num_tc will be set even for the root qdisc if the device
is capable of QoS. Would mq_queues() behave correctly then, i.e. is the
queue range for priority 0 required to be [0, dev->num_tx_queues)?
Also it would be neater to return count and offset together as struct
netdev_tc_txq, rather than through optional out-parameters. Even better
would be to cache these in struct mq_sched, if that's possible.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH v3] kptr_restrict for hiding kernel pointers
From: Dan Rosenberg @ 2010-12-20 23:01 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: linux-kernel, netdev, linux-security-module, jmorris,
eric.dumazet, tgraf, eugeneteo, kees.cook, mingo, davem,
a.p.zijlstra, akpm
In-Reply-To: <50259.1292883991@localhost>
On Mon, 2010-12-20 at 17:26 -0500, Valdis.Kletnieks@vt.edu wrote:
> On Sat, 18 Dec 2010 12:20:34 EST, Dan Rosenberg said:
>
> > @@ -1035,6 +1038,26 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
> > return buf + vsnprintf(buf, end - buf,
> > ((struct va_format *)ptr)->fmt,
> > *(((struct va_format *)ptr)->va));
> > + case 'K':
> > + /*
> > + * %pK cannot be used in IRQ context because it tests
> > + * CAP_SYSLOG.
> > + */
> > + if (in_irq() || in_serving_softirq() || in_nmi())
> > + WARN_ONCE(1, "%%pK used in interrupt context.\n");
>
> Should this then continue on and test CAP_SYSLOG anyhow, or should it
> return a "" or or "<invalid>" or something?
This is a valid point. I'll resend a new version shortly that defaults
to zeroing pointers if it's used incorrectly without relying on
capability checks.
Thanks,
Dan
^ permalink raw reply
* Re: [net-next-2.6 PATCH v2] net: implement mechanism for HW based QOS
From: Eric Dumazet @ 2010-12-20 22:58 UTC (permalink / raw)
To: Ben Hutchings
Cc: John Fastabend, davem, netdev, hadi, shemminger, tgraf, nhorman
In-Reply-To: <1292885287.3055.22.camel@bwh-desktop>
Le lundi 20 décembre 2010 à 22:48 +0000, Ben Hutchings a écrit :
> That seems like a fair amount of data to add to every net device,
> considering that users may create e.g. a lot of VLAN devices and they
> won't use this state at all. Have you considered putting these in a
> structure that is accessed indirectly?
Yes, this was like this in previous patch, but I asked John to put it in
netdevice instead, in order to avoid a false sharing and this
indirection.
This adds 64 bytes on a netdevice. Most machines dont have more than 16
netdevices, it seems a reasonable cost. (We removed the ingress bloat
some time ago, so no significant increase anyway)
^ permalink raw reply
* Re: [net-next-2.6 PATCH v2] net: implement mechanism for HW based QOS
From: Stephen Hemminger @ 2010-12-20 22:58 UTC (permalink / raw)
To: Ben Hutchings
Cc: John Fastabend, davem, netdev, hadi, tgraf, eric.dumazet, nhorman
In-Reply-To: <1292885287.3055.22.camel@bwh-desktop>
On Mon, 20 Dec 2010 22:48:07 +0000
Ben Hutchings <bhutchings@solarflare.com> wrote:
> > + u8 num_tc;
> > + struct netdev_tc_txq tc_to_txq[16];
> > + u8 prio_tc_map[16];
>
> That seems like a fair amount of data to add to every net device,
> considering that users may create e.g. a lot of VLAN devices and they
> won't use this state at all. Have you considered putting these in a
> structure that is accessed indirectly?
The tc stuff should use indirection and only allocate if needed.
--
^ permalink raw reply
* Re: [PATCH v2] net_sched: sch_sfq: better struct layouts
From: Jarek Poplawski @ 2010-12-20 22:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Patrick McHardy, David Miller, netdev
In-Reply-To: <1292864525.2800.189.camel@edumazet-laptop>
On Mon, Dec 20, 2010 at 06:02:05PM +0100, Eric Dumazet wrote:
> Le dimanche 19 décembre 2010 ?? 22:22 +0100, Jarek Poplawski a écrit :
>
> > I think open coding sk_buff_head is a wrong idea. Otherwise, this
> > patch looks OK to me, only a few cosmetic suggestions below.
> >
>
> I completely agree with you but this should be temporary, because David
> really wants to use list_head for skbs, I believe this will be done ;)
>
> I chose to name the list skblist to make clear where we want to plug a
> real list_head once done.
>
> Also, not using sk_buff_head saves at least 8 bytes per slot.
Alas I dumped my 486sx already :-/
> > > - sfq_index max_depth; /* Maximal depth */
> > > + sfq_index max_depth; /* depth of longest slot */
> >
> > depth and/or length? (One dimension should be enough.)
>
> maybe cur_depth ? Its not the maximal possible depth, but depth of
> longest slot, or current max depth...
Hmm... or max_depth? I meant the comment only, sorry ;-)
> > > - /* If selected queue has length q->limit, this means that
> > > - * all another queues are empty and that we do simple tail drop,
> >
> > No reason to remove this line.
>
> Well, the reason we drop this packet is not because other queues are
> empty, but because we reach max depth for this queue. (I have the idea
> to extend SFQ to allow more packets to be queued, still with a 127 limit
> per queue, and 127 flows). With 10Gbs links, a global limit of 127
> packets is short.
Right, but does this line say something else? Of course, you can find
it out by yourself too, but this comment makes reading a bit faster.
> > If you really have to do this, all these: __skb_queue_tail(),
> > __skb_dequeue(), skb_queue_head_init(), skb_peek() etc. used here
> > should stay as (local) inline functions to remain readability.
> >
>
> OK done, thanks a lot for reviewing and very useful comments !
Thanks for using them!
Jarek P.
^ permalink raw reply
* [PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts
From: Eric Dumazet @ 2010-12-20 22:54 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Patrick McHardy, Jarek Poplawski
In-Reply-To: <1292881372.2627.38.camel@edumazet-laptop>
Here is a respin of patch.
I'll send a short patch to make SFQ more fair in presence of large
packets as well.
Thanks
[PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts
This patch shrinks sizeof(struct sfq_sched_data)
from 0x14f8 (or more if spinlocks are bigger) to 0x1180 bytes, and
reduce text size as well.
text data bss dec hex filename
4821 152 0 4973 136d old/net/sched/sch_sfq.o
4627 136 0 4763 129b new/net/sched/sch_sfq.o
All data for a slot/flow is now grouped in a compact and cache friendly
structure, instead of being spreaded in many different points.
struct sfq_slot {
struct sk_buff *skblist_next;
struct sk_buff *skblist_prev;
sfq_index qlen; /* number of skbs in skblist */
sfq_index next; /* next slot in sfq chain */
struct sfq_head dep; /* anchor in dep[] chains */
unsigned short hash; /* hash value (index in ht[]) */
short allot; /* credit for this slot */
};
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
---
v3: respin after two prior patches applied
net/sched/sch_sfq.c | 260 ++++++++++++++++++++++++++----------------
1 file changed, 162 insertions(+), 98 deletions(-)
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 42396c9..c474b4b 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -67,27 +67,42 @@
IMPLEMENTATION:
This implementation limits maximal queue length to 128;
- maximal mtu to 2^15-1; number of hash buckets to 1024.
+ maximal mtu to 2^15-1; max 128 flows, number of hash buckets to 1024.
The only goal of this restrictions was that all data
- fit into one 4K page :-). Struct sfq_sched_data is
- organized in anti-cache manner: all the data for a bucket
- are scattered over different locations. This is not good,
- but it allowed me to put it into 4K.
+ fit into one 4K page on 32bit arches.
It is easy to increase these values, but not in flight. */
-#define SFQ_DEPTH 128
+#define SFQ_DEPTH 128 /* max number of packets per flow */
+#define SFQ_SLOTS 128 /* max number of flows */
+#define SFQ_EMPTY_SLOT 255
#define SFQ_HASH_DIVISOR 1024
-/* This type should contain at least SFQ_DEPTH*2 values */
+/* This type should contain at least SFQ_DEPTH + SFQ_SLOTS values */
typedef unsigned char sfq_index;
+/*
+ * We dont use pointers to save space.
+ * Small indexes [0 ... SFQ_SLOTS - 1] are 'pointers' to slots[] array
+ * while following values [SFQ_SLOTS ... SFQ_SLOTS + SFQ_DEPTH - 1]
+ * are 'pointers' to dep[] array
+ */
struct sfq_head
{
sfq_index next;
sfq_index prev;
};
+struct sfq_slot {
+ struct sk_buff *skblist_next;
+ struct sk_buff *skblist_prev;
+ sfq_index qlen; /* number of skbs in skblist */
+ sfq_index next; /* next slot in sfq chain */
+ struct sfq_head dep; /* anchor in dep[] chains */
+ unsigned short hash; /* hash value (index in ht[]) */
+ short allot; /* credit for this slot */
+};
+
struct sfq_sched_data
{
/* Parameters */
@@ -99,17 +114,24 @@ struct sfq_sched_data
struct tcf_proto *filter_list;
struct timer_list perturb_timer;
u32 perturbation;
- sfq_index tail; /* Index of current slot in round */
- sfq_index max_depth; /* Maximal depth */
+ sfq_index cur_depth; /* depth of longest slot */
+ struct sfq_slot *tail; /* current slot in round */
sfq_index ht[SFQ_HASH_DIVISOR]; /* Hash table */
- sfq_index next[SFQ_DEPTH]; /* Active slots link */
- short allot[SFQ_DEPTH]; /* Current allotment per slot */
- unsigned short hash[SFQ_DEPTH]; /* Hash value indexed by slots */
- struct sk_buff_head qs[SFQ_DEPTH]; /* Slot queue */
- struct sfq_head dep[SFQ_DEPTH*2]; /* Linked list of slots, indexed by depth */
+ struct sfq_slot slots[SFQ_SLOTS];
+ struct sfq_head dep[SFQ_DEPTH]; /* Linked list of slots, indexed by depth */
};
+/*
+ * sfq_head are either in a sfq_slot or in dep[] array
+ */
+static inline struct sfq_head *sfq_dep_head(struct sfq_sched_data *q, sfq_index val)
+{
+ if (val < SFQ_SLOTS)
+ return &q->slots[val].dep;
+ return &q->dep[val - SFQ_SLOTS];
+}
+
static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1)
{
return jhash_2words(h, h1, q->perturbation) & (SFQ_HASH_DIVISOR - 1);
@@ -200,30 +222,41 @@ static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
return 0;
}
+/*
+ * x : slot number [0 .. SFQ_SLOTS - 1]
+ */
static inline void sfq_link(struct sfq_sched_data *q, sfq_index x)
{
sfq_index p, n;
- int d = q->qs[x].qlen + SFQ_DEPTH;
+ int qlen = q->slots[x].qlen;
- p = d;
- n = q->dep[d].next;
- q->dep[x].next = n;
- q->dep[x].prev = p;
- q->dep[p].next = q->dep[n].prev = x;
+ p = qlen + SFQ_SLOTS;
+ n = q->dep[qlen].next;
+
+ q->slots[x].dep.next = n;
+ q->slots[x].dep.prev = p;
+
+ q->dep[qlen].next = x; /* sfq_dep_head(q, p)->next = x */
+ sfq_dep_head(q, n)->prev = x;
}
+#define sfq_unlink(q, x, n, p) \
+ n = q->slots[x].dep.next; \
+ p = q->slots[x].dep.prev; \
+ sfq_dep_head(q, p)->next = n; \
+ sfq_dep_head(q, n)->prev = p
+
+
static inline void sfq_dec(struct sfq_sched_data *q, sfq_index x)
{
sfq_index p, n;
+ int d;
- n = q->dep[x].next;
- p = q->dep[x].prev;
- q->dep[p].next = n;
- q->dep[n].prev = p;
-
- if (n == p && q->max_depth == q->qs[x].qlen + 1)
- q->max_depth--;
+ sfq_unlink(q, x, n, p);
+ d = q->slots[x].qlen--;
+ if (n == p && q->cur_depth == d)
+ q->cur_depth--;
sfq_link(q, x);
}
@@ -232,34 +265,72 @@ static inline void sfq_inc(struct sfq_sched_data *q, sfq_index x)
sfq_index p, n;
int d;
- n = q->dep[x].next;
- p = q->dep[x].prev;
- q->dep[p].next = n;
- q->dep[n].prev = p;
- d = q->qs[x].qlen;
- if (q->max_depth < d)
- q->max_depth = d;
+ sfq_unlink(q, x, n, p);
+ d = ++q->slots[x].qlen;
+ if (q->cur_depth < d)
+ q->cur_depth = d;
sfq_link(q, x);
}
+/* helper functions : might be changed when/if skb use a standard list_head */
+
+/* remove one skb from tail of slot queue */
+static inline struct sk_buff *slot_dequeue_tail(struct sfq_slot *slot)
+{
+ struct sk_buff *skb = slot->skblist_prev;
+
+ slot->skblist_prev = skb->prev;
+ skb->next = skb->prev = NULL;
+ return skb;
+}
+
+/* remove one skb from head of slot queue */
+static inline struct sk_buff *slot_dequeue_head(struct sfq_slot *slot)
+{
+ struct sk_buff *skb = slot->skblist_next;
+
+ slot->skblist_next = skb->next;
+ skb->next = skb->prev = NULL;
+ return skb;
+}
+
+static inline void slot_queue_init(struct sfq_slot *slot)
+{
+ slot->skblist_prev = slot->skblist_next = (struct sk_buff *)slot;
+}
+
+/* add skb to slot queue (tail add) */
+static inline void slot_queue_add(struct sfq_slot *slot, struct sk_buff *skb)
+{
+ skb->prev = slot->skblist_prev;
+ skb->next = (struct sk_buff *)slot;
+ slot->skblist_prev->next = skb;
+ slot->skblist_prev = skb;
+}
+
+#define slot_queue_walk(slot, skb) \
+ for (skb = slot->skblist_next; \
+ skb != (struct sk_buff *)slot; \
+ skb = skb->next)
+
static unsigned int sfq_drop(struct Qdisc *sch)
{
struct sfq_sched_data *q = qdisc_priv(sch);
- sfq_index d = q->max_depth;
+ sfq_index x, d = q->cur_depth;
struct sk_buff *skb;
unsigned int len;
+ struct sfq_slot *slot;
- /* Queue is full! Find the longest slot and
- drop a packet from it */
-
+ /* Queue is full! Find the longest slot and drop tail packet from it */
if (d > 1) {
- sfq_index x = q->dep[d + SFQ_DEPTH].next;
- skb = q->qs[x].prev;
+ x = q->dep[d].next;
+ slot = &q->slots[x];
+drop:
+ skb = slot_dequeue_tail(slot);
len = qdisc_pkt_len(skb);
- __skb_unlink(skb, &q->qs[x]);
- kfree_skb(skb);
sfq_dec(q, x);
+ kfree_skb(skb);
sch->q.qlen--;
sch->qstats.drops++;
sch->qstats.backlog -= len;
@@ -268,18 +339,11 @@ static unsigned int sfq_drop(struct Qdisc *sch)
if (d == 1) {
/* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
- d = q->next[q->tail];
- q->next[q->tail] = q->next[d];
- skb = q->qs[d].prev;
- len = qdisc_pkt_len(skb);
- __skb_unlink(skb, &q->qs[d]);
- kfree_skb(skb);
- sfq_dec(q, d);
- sch->q.qlen--;
- q->ht[q->hash[d]] = SFQ_DEPTH;
- sch->qstats.drops++;
- sch->qstats.backlog -= len;
- return len;
+ x = q->tail->next;
+ slot = &q->slots[x];
+ q->tail->next = slot->next;
+ q->ht[slot->hash] = SFQ_EMPTY_SLOT;
+ goto drop;
}
return 0;
@@ -291,6 +355,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
struct sfq_sched_data *q = qdisc_priv(sch);
unsigned int hash;
sfq_index x;
+ struct sfq_slot *slot;
int uninitialized_var(ret);
hash = sfq_classify(skb, sch, &ret);
@@ -303,30 +368,33 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
hash--;
x = q->ht[hash];
- if (x == SFQ_DEPTH) {
- q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
- q->hash[x] = hash;
+ slot = &q->slots[x];
+ if (x == SFQ_EMPTY_SLOT) {
+ x = q->dep[0].next; /* get a free slot */
+ q->ht[hash] = x;
+ slot = &q->slots[x];
+ slot->hash = hash;
+ slot_queue_init(slot);
}
- /* If selected queue has length q->limit, this means that
- * all another queues are empty and that we do simple tail drop,
+ /* If selected queue has length q->limit, do simple tail drop,
* i.e. drop _this_ packet.
*/
- if (q->qs[x].qlen >= q->limit)
+ if (slot->qlen >= q->limit)
return qdisc_drop(skb, sch);
sch->qstats.backlog += qdisc_pkt_len(skb);
- __skb_queue_tail(&q->qs[x], skb);
+ slot_queue_add(slot, skb);
sfq_inc(q, x);
- if (q->qs[x].qlen == 1) { /* The flow is new */
- if (q->tail == SFQ_DEPTH) { /* It is the first flow */
- q->next[x] = x;
+ if (slot->qlen == 1) { /* The flow is new */
+ if (q->tail == NULL) { /* It is the first flow */
+ slot->next = x;
} else {
- q->next[x] = q->next[q->tail];
- q->next[q->tail] = x;
+ slot->next = q->tail->next;
+ q->tail->next = x;
}
- q->tail = x;
- q->allot[x] = q->quantum;
+ q->tail = slot;
+ slot->allot = q->quantum;
}
if (++sch->q.qlen <= q->limit) {
sch->bstats.bytes += qdisc_pkt_len(skb);
@@ -342,14 +410,12 @@ static struct sk_buff *
sfq_peek(struct Qdisc *sch)
{
struct sfq_sched_data *q = qdisc_priv(sch);
- sfq_index a;
/* No active slots */
- if (q->tail == SFQ_DEPTH)
+ if (q->tail == NULL)
return NULL;
- a = q->next[q->tail];
- return skb_peek(&q->qs[a]);
+ return q->slots[q->tail->next].skblist_next;
}
static struct sk_buff *
@@ -358,31 +424,31 @@ sfq_dequeue(struct Qdisc *sch)
struct sfq_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
sfq_index a, next_a;
+ struct sfq_slot *slot;
/* No active slots */
- if (q->tail == SFQ_DEPTH)
+ if (q->tail == NULL)
return NULL;
- a = q->next[q->tail];
-
- /* Grab packet */
- skb = __skb_dequeue(&q->qs[a]);
+ a = q->tail->next;
+ slot = &q->slots[a];
+ skb = slot_dequeue_head(slot);
sfq_dec(q, a);
sch->q.qlen--;
sch->qstats.backlog -= qdisc_pkt_len(skb);
/* Is the slot empty? */
- if (q->qs[a].qlen == 0) {
- q->ht[q->hash[a]] = SFQ_DEPTH;
- next_a = q->next[a];
+ if (slot->qlen == 0) {
+ q->ht[slot->hash] = SFQ_EMPTY_SLOT;
+ next_a = slot->next;
if (a == next_a) {
- q->tail = SFQ_DEPTH;
+ q->tail = NULL; /* no more active slots */
return skb;
}
- q->next[q->tail] = next_a;
- } else if ((q->allot[a] -= qdisc_pkt_len(skb)) <= 0) {
- q->allot[a] += q->quantum;
- q->tail = a;
+ q->tail->next = next_a;
+ } else if ((slot->allot -= qdisc_pkt_len(skb)) <= 0) {
+ q->tail = slot;
+ slot->allot += q->quantum;
}
return skb;
}
@@ -446,17 +512,16 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
init_timer_deferrable(&q->perturb_timer);
for (i = 0; i < SFQ_HASH_DIVISOR; i++)
- q->ht[i] = SFQ_DEPTH;
+ q->ht[i] = SFQ_EMPTY_SLOT;
for (i = 0; i < SFQ_DEPTH; i++) {
- skb_queue_head_init(&q->qs[i]);
- q->dep[i + SFQ_DEPTH].next = i + SFQ_DEPTH;
- q->dep[i + SFQ_DEPTH].prev = i + SFQ_DEPTH;
+ q->dep[i].next = i + SFQ_SLOTS;
+ q->dep[i].prev = i + SFQ_SLOTS;
}
q->limit = SFQ_DEPTH - 1;
- q->max_depth = 0;
- q->tail = SFQ_DEPTH;
+ q->cur_depth = 0;
+ q->tail = NULL;
if (opt == NULL) {
q->quantum = psched_mtu(qdisc_dev(sch));
q->perturb_period = 0;
@@ -467,7 +532,7 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
return err;
}
- for (i = 0; i < SFQ_DEPTH; i++)
+ for (i = 0; i < SFQ_SLOTS; i++)
sfq_link(q, i);
return 0;
}
@@ -543,13 +608,12 @@ static int sfq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
struct gnet_dump *d)
{
struct sfq_sched_data *q = qdisc_priv(sch);
- sfq_index idx = q->ht[cl-1];
- struct sk_buff_head *list = &q->qs[idx];
- struct gnet_stats_queue qs = { .qlen = list->qlen };
- struct tc_sfq_xstats xstats = { .allot = q->allot[idx] };
+ const struct sfq_slot *slot = &q->slots[q->ht[cl - 1]];
+ struct gnet_stats_queue qs = { .qlen = slot->qlen };
+ struct tc_sfq_xstats xstats = { .allot = slot->allot };
struct sk_buff *skb;
- skb_queue_walk(list, skb)
+ slot_queue_walk(slot, skb)
qs.backlog += qdisc_pkt_len(skb);
if (gnet_stats_copy_queue(d, &qs) < 0)
@@ -566,7 +630,7 @@ static void sfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
return;
for (i = 0; i < SFQ_HASH_DIVISOR; i++) {
- if (q->ht[i] == SFQ_DEPTH ||
+ if (q->ht[i] == SFQ_EMPTY_SLOT ||
arg->count < arg->skip) {
arg->count++;
continue;
^ permalink raw reply related
* Re: [net-next-2.6 PATCH v2] net: implement mechanism for HW based QOS
From: Ben Hutchings @ 2010-12-20 22:48 UTC (permalink / raw)
To: John Fastabend
Cc: davem, netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman
In-Reply-To: <20101217165622.26608.24819.stgit@jf-dev1-dcblab>
On Fri, 2010-12-17 at 08:56 -0800, John Fastabend wrote:
[...]
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cc916c5..c5d7949 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -646,6 +646,12 @@ struct xps_dev_maps {
> (nr_cpu_ids * sizeof(struct xps_map *)))
> #endif /* CONFIG_XPS */
>
> +/* HW offloaded queuing disciplines txq count and offset maps */
> +struct netdev_tc_txq {
> + u16 count;
> + u16 offset;
> +};
> +
> /*
> * This structure defines the management hooks for network devices.
> * The following hooks can be defined; unless noted otherwise, they are
> @@ -1146,6 +1152,9 @@ struct net_device {
> /* Data Center Bridging netlink ops */
> const struct dcbnl_rtnl_ops *dcbnl_ops;
> #endif
> + u8 num_tc;
> + struct netdev_tc_txq tc_to_txq[16];
> + u8 prio_tc_map[16];
That seems like a fair amount of data to add to every net device,
considering that users may create e.g. a lot of VLAN devices and they
won't use this state at all. Have you considered putting these in a
structure that is accessed indirectly?
> #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
> /* max exchange id for FCoE LRO by ddp */
> @@ -1162,6 +1171,57 @@ struct net_device {
> #define NETDEV_ALIGN 32
>
> static inline
> +int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
> +{
> + return dev->prio_tc_map[prio & 15];
> +}
> +
> +static inline
> +int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
> +{
> + if (tc >= dev->num_tc)
> + return -EINVAL;
> +
> + dev->prio_tc_map[prio & 15] = tc & 15;
> + return 0;
> +}
[...]
Please name the magic numbers 15 and 16.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: [PATCH v3] kptr_restrict for hiding kernel pointers
From: Valdis.Kletnieks @ 2010-12-20 22:26 UTC (permalink / raw)
To: Dan Rosenberg
Cc: linux-kernel, netdev, linux-security-module, jmorris,
eric.dumazet, tgraf, eugeneteo, kees.cook, mingo, davem,
a.p.zijlstra, akpm
In-Reply-To: <1292692835.10804.67.camel@dan>
[-- Attachment #1: Type: text/plain, Size: 620 bytes --]
On Sat, 18 Dec 2010 12:20:34 EST, Dan Rosenberg said:
> @@ -1035,6 +1038,26 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
> return buf + vsnprintf(buf, end - buf,
> ((struct va_format *)ptr)->fmt,
> *(((struct va_format *)ptr)->va));
> + case 'K':
> + /*
> + * %pK cannot be used in IRQ context because it tests
> + * CAP_SYSLOG.
> + */
> + if (in_irq() || in_serving_softirq() || in_nmi())
> + WARN_ONCE(1, "%%pK used in interrupt context.\n");
Should this then continue on and test CAP_SYSLOG anyhow, or should it
return a "" or or "<invalid>" or something?
[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]
^ permalink raw reply
* Re: AW: [PATCH] ISDN cmx: Avoid potential NULL deref in dsp_cmx_send_member() and shrink code size.
From: Jesper Juhl @ 2010-12-20 22:10 UTC (permalink / raw)
To: Andreas.Eversberg
Cc: Karsten Keil, David S. Miller, Julia Lawall, Tejun Heo, netdev,
linux-kernel
In-Reply-To: <B16317F2B1592B4492FC414114171CE604B08AEA@FLBVEXCH01.versatel.local>
On Mon, 20 Dec 2010, Andreas.Eversberg wrote:
> hi jesper,
>
> thanx for finding the bug. i think the right solution to solve the problem would be:
>
> if (dsp->conf && dsp->conf->software && dsp->conf->hardware)
> tx_data_only = 1;
> -> if (dsp->echo.software && dsp->echo.hardware)
> tx_data_only = 1;
>
> this is how it looks in the 'socket' branch of mISDN git respository. it has been fixed already. but i cannot tell in which commit. my current head is this commit:
> commit 45a51eed1c554a4891b48b88c270f4f95bd21df0
>
I'm not familiar enough with the code to determine if my fix or the one
you propose is the right one. My fix is functionally equivalent to what
was there before (minus the NULL deref), yours is not. I'll let someone
more knowledgeable about the ISDN code determine what is the right fix.
> what branch do you use?
>
I'm working against Linus' kernel as of today.
--
Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.
^ permalink raw reply
* Re: [PATCH v2] net_sched: sch_sfq: better struct layouts
From: Eric Dumazet @ 2010-12-20 21:42 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, kaber, netdev
In-Reply-To: <20101220.133342.246531313.davem@davemloft.net>
Le lundi 20 décembre 2010 à 13:33 -0800, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 20 Dec 2010 18:02:05 +0100
>
> > [PATCH v2] net_sched: sch_sfq: better struct layouts
>
> Please respin this against current net-next-2.6, it doesn't
> apply cleanly after your other two sfq changes.
>
Sure, will do, thanks David.
^ permalink raw reply
* Re: [PATCH v2] net_sched: sch_sfq: better struct layouts
From: David Miller @ 2010-12-20 21:33 UTC (permalink / raw)
To: eric.dumazet; +Cc: jarkao2, kaber, netdev
In-Reply-To: <1292864525.2800.189.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 20 Dec 2010 18:02:05 +0100
> [PATCH v2] net_sched: sch_sfq: better struct layouts
Please respin this against current net-next-2.6, it doesn't
apply cleanly after your other two sfq changes.
Thanks!
^ permalink raw reply
* Re: [PATCH v3] net_sched: sch_sfq: fix allot handling
From: David Miller @ 2010-12-20 21:18 UTC (permalink / raw)
To: eric.dumazet; +Cc: kaber, netdev, jarkao2
In-Reply-To: <1292434227.3427.377.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 15 Dec 2010 18:30:27 +0100
> [PATCH v3] net_sched: sch_sfq: fix allot handling
...
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied to net-2.6, thanks.
^ permalink raw reply
* Re: [PATCH net-next-2.6] net_sched: sch_sfq: add backlog info in sfq_dump_class_stats()
From: David Miller @ 2010-12-20 21:14 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, jarkao2, kaber
In-Reply-To: <1292437116.3427.386.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 15 Dec 2010 19:18:36 +0100
> We currently return for each active SFQ slot the number of packets in
> queue. We can also give number of bytes accounted for these packets.
>
> tc -s class show dev ifb0
>
> Before patch :
>
> class sfq 11:3d9 parent 11:
> (dropped 0, overlimits 0 requeues 0)
> backlog 0b 3p requeues 0
> allot 1266
>
> After patch :
>
> class sfq 11:3e4 parent 11:
> (dropped 0, overlimits 0 requeues 0)
> backlog 4380b 3p requeues 0
> allot 1212
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH net-next-2.6] bnx2: remove cancel_work_sync() from remove_one
From: David Miller @ 2010-12-20 21:11 UTC (permalink / raw)
To: tj; +Cc: mchan, netdev, linux-kernel
In-Reply-To: <4D08C81D.8020606@kernel.org>
From: Tejun Heo <tj@kernel.org>
Date: Wed, 15 Dec 2010 14:52:29 +0100
> After looking through the code, I don't think this is necessarily
> correct. ->ndo_close() doesn't guarantee that the watchdog timer has
> finished running (the timer is deleted with del_timer() not
> del_timer_sync()). ie. the watchdog timer could still be running
> after ->ndo_close() and may schedule reset_task. If remove_one
> doesn't flush the task, it may still be running when remove_one() is
> called.
>
> David, am I missing something? Wouldn't it cleaner to guarantee that
> ->ndo_close() is called with the guarantee that the watchdog timer is
> not running anymore?
It would but we can't just make the change over to del_timer_sync()
otherwise we'd deadlock on netif_tx_lock().
But I think things might be OK as-is.
The timer is deleted by dev_deactivate_many() which resets the qdisc
to the no-op qdisc. Then it deletes the timer.
Any running timer will complete or see the no-op qdisc attached and
return immediately.
synchronize_rcu() is then executed which guarentees completion.
Since both the watchdog timer itself and the del_timer() call run
with netif_tx_lock() held, this makes sure the timer, once deleted,
will only see the no-op qdisc and return immediately if it is
amidst running, else it has already returned when the timer delete
completes.
So we might be OK here.
^ permalink raw reply
* [net-next-2.6 PATCH 2/3] dcbnl: add appliction tlv handlers
From: John Fastabend @ 2010-12-20 20:50 UTC (permalink / raw)
To: davem; +Cc: netdev, tgraf
In-Reply-To: <20101220205013.24232.3980.stgit@jf-dev1-dcblab>
This patch adds application tlv handlers. Networking stacks
may use the application priority to set the skb priority of
their stack using the negoatiated dcbx priority.
This patch provides the dcb_{get|set}app() routines for the
stack to query these parameters. Notice lower layer drivers
can use the dcbnl_ops routines if additional handling is
needed. Perhaps in the firmware case for example
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/dcbnl.h | 4 +
include/net/dcbnl.h | 9 +++
net/dcb/dcbnl.c | 131 +++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 133 insertions(+), 11 deletions(-)
diff --git a/include/linux/dcbnl.h b/include/linux/dcbnl.h
index 3706a13..9071e1c 100644
--- a/include/linux/dcbnl.h
+++ b/include/linux/dcbnl.h
@@ -71,7 +71,9 @@ struct ieee_pfc {
__u64 indications[IEEE_8021QAZ_MAX_TCS];
};
-/* This structure contains the IEEE 802.1Qaz APP managed object
+/* This structure contains the IEEE 802.1Qaz APP managed object. This
+ * object is also used for the CEE std as well. There is no difference
+ * between the objects.
*
* @selector: protocol identifier type
* @protocol: protocol of type indicated
diff --git a/include/net/dcbnl.h b/include/net/dcbnl.h
index e2d841e..ab7d623 100644
--- a/include/net/dcbnl.h
+++ b/include/net/dcbnl.h
@@ -22,6 +22,15 @@
#include <linux/dcbnl.h>
+struct dcb_app_type {
+ char name[IFNAMSIZ];
+ struct dcb_app app;
+ struct list_head list;
+};
+
+u8 dcb_setapp(struct net_device *, struct dcb_app *);
+u8 dcb_getapp(struct net_device *, struct dcb_app *);
+
/*
* Ops struct for the netlink callbacks. Used by DCB-enabled drivers through
* the netdevice struct.
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index 2ff9084..adc8c8d 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -179,6 +179,9 @@ static const struct nla_policy dcbnl_ieee_app[DCB_ATTR_IEEE_APP_MAX + 1] = {
[DCB_ATTR_IEEE_APP] = {.len = sizeof(struct dcb_app)},
};
+static LIST_HEAD(dcb_app_list);
+static DEFINE_SPINLOCK(dcb_lock);
+
/* standard netlink reply call */
static int dcbnl_reply(u8 value, u8 event, u8 cmd, u8 attr, u32 pid,
u32 seq, u16 flags)
@@ -634,12 +637,12 @@ out:
static int dcbnl_setapp(struct net_device *netdev, struct nlattr **tb,
u32 pid, u32 seq, u16 flags)
{
- int ret = -EINVAL;
+ int err, ret = -EINVAL;
u16 id;
u8 up, idtype;
struct nlattr *app_tb[DCB_APP_ATTR_MAX + 1];
- if (!tb[DCB_ATTR_APP] || !netdev->dcbnl_ops->setapp)
+ if (!tb[DCB_ATTR_APP])
goto out;
ret = nla_parse_nested(app_tb, DCB_APP_ATTR_MAX, tb[DCB_ATTR_APP],
@@ -663,9 +666,18 @@ static int dcbnl_setapp(struct net_device *netdev, struct nlattr **tb,
id = nla_get_u16(app_tb[DCB_APP_ATTR_ID]);
up = nla_get_u8(app_tb[DCB_APP_ATTR_PRIORITY]);
- ret = dcbnl_reply(netdev->dcbnl_ops->setapp(netdev, idtype, id, up),
- RTM_SETDCB, DCB_CMD_SAPP, DCB_ATTR_APP,
- pid, seq, flags);
+ if (netdev->dcbnl_ops->setapp) {
+ err = netdev->dcbnl_ops->setapp(netdev, idtype, id, up);
+ } else {
+ struct dcb_app app;
+ app.selector = idtype;
+ app.protocol = id;
+ app.priority = up;
+ err = dcb_setapp(netdev, &app);
+ }
+
+ ret = dcbnl_reply(err, RTM_SETDCB, DCB_CMD_SAPP, DCB_ATTR_APP,
+ pid, seq, flags);
out:
return ret;
}
@@ -1164,7 +1176,7 @@ static int dcbnl_ieee_set(struct net_device *netdev, struct nlattr **tb,
goto err;
}
- if (ieee[DCB_ATTR_IEEE_APP_TABLE] && ops->ieee_setapp) {
+ if (ieee[DCB_ATTR_IEEE_APP_TABLE]) {
struct nlattr *attr;
int rem;
@@ -1173,7 +1185,10 @@ static int dcbnl_ieee_set(struct net_device *netdev, struct nlattr **tb,
if (nla_type(attr) != DCB_ATTR_IEEE_APP)
continue;
app_data = nla_data(attr);
- err = ops->ieee_setapp(netdev, app_data);
+ if (ops->ieee_setapp)
+ err = ops->ieee_setapp(netdev, app_data);
+ else
+ err = dcb_setapp(netdev, app_data);
if (err)
goto err;
}
@@ -1193,7 +1208,8 @@ static int dcbnl_ieee_get(struct net_device *netdev, struct nlattr **tb,
struct sk_buff *skb;
struct nlmsghdr *nlh;
struct dcbmsg *dcb;
- struct nlattr *ieee;
+ struct nlattr *ieee, *app;
+ struct dcb_app_type *itr;
const struct dcbnl_rtnl_ops *ops = netdev->dcbnl_ops;
int err;
@@ -1230,6 +1246,19 @@ static int dcbnl_ieee_get(struct net_device *netdev, struct nlattr **tb,
NLA_PUT(skb, DCB_ATTR_IEEE_PFC, sizeof(pfc), &pfc);
}
+ app = nla_nest_start(skb, DCB_ATTR_IEEE_APP_TABLE);
+ if (!app)
+ goto nla_put_failure;
+
+ spin_lock(&dcb_lock);
+ list_for_each_entry(itr, &dcb_app_list, list) {
+ if (strncmp(itr->name, netdev->name, IFNAMSIZ) == 0)
+ NLA_PUT(skb, DCB_ATTR_IEEE_APP,
+ sizeof(itr->app), &itr->app);
+ }
+ spin_unlock(&dcb_lock);
+ nla_nest_end(skb, app);
+
nla_nest_end(skb, ieee);
nlmsg_end(skb, nlh);
@@ -1364,8 +1393,91 @@ out:
return ret;
}
+/**
+ * dcb_getapp - retrieve the DCBX application user priority
+ *
+ * On success returns a non-zero 802.1p user priority bitmap
+ * otherwise returns 0 as the invalid user priority bitmap to
+ * indicate an error.
+ */
+u8 dcb_getapp(struct net_device *dev, struct dcb_app *app)
+{
+ struct dcb_app_type *itr;
+ u8 prio = 0;
+
+ spin_lock(&dcb_lock);
+ list_for_each_entry(itr, &dcb_app_list, list) {
+ if (itr->app.selector == app->selector &&
+ itr->app.protocol == app->protocol &&
+ (strncmp(itr->name, dev->name, IFNAMSIZ) == 0)) {
+ prio = itr->app.priority;
+ break;
+ }
+ }
+ spin_unlock(&dcb_lock);
+
+ return prio;
+}
+EXPORT_SYMBOL(dcb_getapp);
+
+/**
+ * ixgbe_dcbnl_setapp - add dcb application data to app list
+ *
+ * Priority 0 is the default priority this removes applications
+ * from the app list if the priority is set to zero.
+ */
+u8 dcb_setapp(struct net_device *dev, struct dcb_app *new)
+{
+ struct dcb_app_type *itr;
+
+ spin_lock(&dcb_lock);
+ /* Search for existing match and replace */
+ list_for_each_entry(itr, &dcb_app_list, list) {
+ if (itr->app.selector == new->selector &&
+ itr->app.protocol == new->protocol &&
+ (strncmp(itr->name, dev->name, IFNAMSIZ) == 0)) {
+ if (new->priority)
+ itr->app.priority = new->priority;
+ else {
+ list_del(&itr->list);
+ kfree(itr);
+ }
+ goto out;
+ }
+ }
+ /* App type does not exist add new application type */
+ if (new->priority) {
+ struct dcb_app_type *entry;
+ entry = kmalloc(sizeof(struct dcb_app_type), GFP_ATOMIC);
+ if (!entry)
+ return -ENOMEM;
+
+ memcpy(&entry->app, new, sizeof(*new));
+ strncpy(entry->name, dev->name, IFNAMSIZ);
+ list_add(&entry->list, &dcb_app_list);
+ }
+out:
+ spin_unlock(&dcb_lock);
+ return 0;
+}
+EXPORT_SYMBOL(dcb_setapp);
+
+void dcb_flushapp(void)
+{
+ struct dcb_app_type *app;
+
+ spin_lock(&dcb_lock);
+ list_for_each_entry(app, &dcb_app_list, list) {
+ list_del(&app->list);
+ kfree(app);
+ }
+ spin_unlock(&dcb_lock);
+}
+
static int __init dcbnl_init(void)
{
+ INIT_LIST_HEAD(&dcb_app_list);
+
rtnl_register(PF_UNSPEC, RTM_GETDCB, dcb_doit, NULL);
rtnl_register(PF_UNSPEC, RTM_SETDCB, dcb_doit, NULL);
@@ -1377,7 +1489,6 @@ static void __exit dcbnl_exit(void)
{
rtnl_unregister(PF_UNSPEC, RTM_GETDCB);
rtnl_unregister(PF_UNSPEC, RTM_SETDCB);
+ dcb_flushapp();
}
module_exit(dcbnl_exit);
-
-
^ permalink raw reply related
* [net-next-2.6 PATCH 3/3] net_dcb: add application notifiers
From: John Fastabend @ 2010-12-20 20:50 UTC (permalink / raw)
To: davem; +Cc: netdev, tgraf
In-Reply-To: <20101220205013.24232.3980.stgit@jf-dev1-dcblab>
DCBx applications priorities can be changed dynamically. If
application stacks are expected to keep the skb priority
consistent with the dcbx priority the stack will need to
be notified when these changes occur.
This patch adds application notifiers for the stack to register
with.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/net/dcbevent.h | 31 +++++++++++++++++++++++++++++++
net/dcb/Makefile | 2 +-
net/dcb/dcbevent.c | 40 ++++++++++++++++++++++++++++++++++++++++
net/dcb/dcbnl.c | 2 ++
4 files changed, 74 insertions(+), 1 deletions(-)
create mode 100644 include/net/dcbevent.h
create mode 100644 net/dcb/dcbevent.c
diff --git a/include/net/dcbevent.h b/include/net/dcbevent.h
new file mode 100644
index 0000000..bc1e7ef
--- /dev/null
+++ b/include/net/dcbevent.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (c) 2010, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Author: John Fastabend <john.r.fastabend@intel.com>
+ */
+
+#ifndef _DCB_EVENT_H
+#define _DCB_EVENT_H
+
+enum dcbevent_notif_type {
+ DCB_APP_EVENT = 1,
+};
+
+extern int register_dcbevent_notifier(struct notifier_block *nb);
+extern int unregister_dcbevent_notifier(struct notifier_block *nb);
+extern int call_dcbevent_notifiers(unsigned long val, void *v);
+
+#endif
diff --git a/net/dcb/Makefile b/net/dcb/Makefile
index 9930f4c..c1282c9 100644
--- a/net/dcb/Makefile
+++ b/net/dcb/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_DCB) += dcbnl.o
+obj-$(CONFIG_DCB) += dcbnl.o dcbevent.o
diff --git a/net/dcb/dcbevent.c b/net/dcb/dcbevent.c
new file mode 100644
index 0000000..665a880
--- /dev/null
+++ b/net/dcb/dcbevent.c
@@ -0,0 +1,40 @@
+/*
+ * Copyright (c) 2010, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Author: John Fastabend <john.r.fastabend@intel.com>
+ */
+
+#include <linux/rtnetlink.h>
+#include <linux/notifier.h>
+
+static ATOMIC_NOTIFIER_HEAD(dcbevent_notif_chain);
+
+int register_dcbevent_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_register(&dcbevent_notif_chain, nb);
+}
+EXPORT_SYMBOL(register_dcbevent_notifier);
+
+int unregister_dcbevent_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_unregister(&dcbevent_notif_chain, nb);
+}
+EXPORT_SYMBOL(unregister_dcbevent_notifier);
+
+int call_dcbevent_notifiers(unsigned long val, void *v)
+{
+ return atomic_notifier_call_chain(&dcbevent_notif_chain, val, v);
+}
diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index adc8c8d..5f1d0ee 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -23,6 +23,7 @@
#include <net/netlink.h>
#include <net/rtnetlink.h>
#include <linux/dcbnl.h>
+#include <net/dcbevent.h>
#include <linux/rtnetlink.h>
#include <net/sock.h>
@@ -1458,6 +1459,7 @@ u8 dcb_setapp(struct net_device *dev, struct dcb_app *new)
}
out:
spin_unlock(&dcb_lock);
+ call_dcbevent_notifiers(DCB_APP_EVENT, new);
return 0;
}
EXPORT_SYMBOL(dcb_setapp);
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox