* Re: Open vSwitch Design
From: jamal @ 2011-11-24 22:30 UTC (permalink / raw)
To: Jesse Gross
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
Eric Dumazet, netdev, John Fastabend, Stephen Hemminger,
David Miller
In-Reply-To: <CAEP_g=_2L1xFWtDXh_6YyXz1Mt9TR3zvjLzix+SpO6yzeOLsSQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
Jesse,
I am going to try and respond to your comments below.
On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:
>
> * Switching infrastructure: As the name implies, Open vSwitch is
> intended to be a network switch, focused on
> virtualization/OpenFlow/software defined networking. This means that
> what we are modeling is not actually a collection of flows but a
> switch which contains a group of related ports, a software virtual
> device, etc. The switch model is used in a variety of places, such as
> to measure traffic that actually flows through it in order to
> implement monitoring and sampling protocols.
Can you explain why you couldn't use the current bridge code (likely with
some mods)? I can see you want to isolate the VMs via the virtual ports,
maybe even with VLANs on the virtual ports; the current bridge code should
be able to handle that.
> * Flow lookup: Although used to implement OpenFlow, the kernel flow
> table does not actually directly contain OpenFlow flows. This is
> because OpenFlow tables can contain wildcards, multiple pipeline
> stages, etc. and we did not want to push that complexity into the
> kernel fast path (nor tie it to a specific version of OpenFlow).
> Instead an exact match flow table is populated on-demand from
> userspace based on the more complex rules stored there. Although it
> might seem limiting, this design has allowed significant new
> functionality to be added without modifications to the kernel or
> performance impact.
This can be achieved easily with zero changes to the kernel code.
You need to have default filters that redirect flows to user space
when you fail to match.
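To make the pattern under discussion concrete, here is a rough user-space sketch of an exact-match table whose misses are handed to a slower decision function and then cached. It is neither the Open vSwitch datapath nor a tc filter setup; every name in it (flow_key, flow_lookup, userspace_decide, forward, ...) is hypothetical and the policy is made up.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 256u /* power of two; illustrative size */

struct flow_key { /* hypothetical exact-match key */
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t proto;
};

struct flow_entry {
	struct flow_key key;
	int out_port;
	int in_use;
};

static struct flow_entry flow_table[TABLE_SIZE];
static unsigned int upcalls; /* counts misses handed to "userspace" */

static int key_equal(const struct flow_key *a, const struct flow_key *b)
{
	/* Compare field by field; memcmp() would also compare padding. */
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port &&
	       a->proto == b->proto;
}

static uint32_t key_hash(const struct flow_key *k)
{
	uint32_t h = k->src_ip * 2654435761u;

	h ^= k->dst_ip * 2246822519u;
	h ^= ((uint32_t)k->src_port << 16) | k->dst_port;
	h ^= k->proto;
	return h & (TABLE_SIZE - 1);
}

/* Exact-match lookup with linear probing; returns -1 on a miss. */
static int flow_lookup(const struct flow_key *k)
{
	uint32_t i, slot = key_hash(k);

	for (i = 0; i < TABLE_SIZE; i++) {
		const struct flow_entry *e =
			&flow_table[(slot + i) & (TABLE_SIZE - 1)];

		if (!e->in_use)
			return -1;
		if (key_equal(&e->key, k))
			return e->out_port;
	}
	return -1;
}

/* Install an exact-match entry; returns -1 if the table is full. */
static int flow_insert(const struct flow_key *k, int out_port)
{
	uint32_t i, slot = key_hash(k);

	for (i = 0; i < TABLE_SIZE; i++) {
		struct flow_entry *e =
			&flow_table[(slot + i) & (TABLE_SIZE - 1)];

		if (!e->in_use || key_equal(&e->key, k)) {
			e->key = *k;
			e->out_port = out_port;
			e->in_use = 1;
			return 0;
		}
	}
	return -1;
}

/* Stand-in for the wildcard rules kept in userspace (made-up policy). */
static int userspace_decide(const struct flow_key *k)
{
	return k->dst_port == 80 ? 1 : 2;
}

/* Fast path: hit -> forward; miss -> ask "userspace", cache the answer. */
static int forward(const struct flow_key *k)
{
	int port;

	assert(k);
	port = flow_lookup(k);
	if (port < 0) {
		upcalls++;
		port = userspace_decide(k);
		flow_insert(k, port);
	}
	return port;
}
```

Calling forward() twice with the same key should trigger one upcall only; the second packet is served from the table.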
> * Packet execution: Once a flow is matched it can be output,
> enqueued to a particular qdisc, etc. Some of these operations are
> specific to Open vSwitch, such as sampling, whereas others we leverage
> existing infrastructure (including tc for QoS) by simply marking the
> packet for further processing.
The tc classifier-action-qdisc infrastructure handles this.
The sampler needs a new action defined.
> * Userspace interfaces: One of the difficulties of having a
> specialized, exact match flow lookup engine is maintaining
> compatibility across differing kernel/userspace versions. This
> compatibility shows up heavily in the userspace interfaces and is
> achieved by passing the kernel's version of the flow along with packet
> information. This allows userspace to install appropriate flows even
> if its interpretation of a packet differs from the kernel's without
> version checks or maintaining multiple implementations of the flow
> extraction code in the kernel.
I didn't quite follow - are we talking about backward/forward
compatibility?
> It's obviously possible to put this code anywhere, whether it is an
> independent module, in the bridge, or tc. Regardless, however, it's
> largely new code that is geared towards this particular model so it
> seems better not to add to the complexity of existing components if at
> all possible.
I am still not seeing why this could not be done with the
infrastructure that exists. Granted, the user space brains - that's where
everything else resides - but you are not pushing that, I think.
cheers,
jamal
* Re: [PATCH net-next 1/2] netem: rate-latency extension
From: Hagen Paul Pfeifer @ 2011-11-24 22:31 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1322172898.2872.7.camel@edumazet-laptop>
* Eric Dumazet | 2011-11-24 23:14:58 [+0100]:
>Only point is why you chose ratelatency instead of rate ?
Not sure why. It was called rate in v1, then somebody suggested ratelatency
and I found it more descriptive, so in v2 it became ratelatency. I have no
strong opinion here - should I generate v3?
Hagen
* Re: MPLS for Linux kernel
From: Glen Turner @ 2011-11-24 23:39 UTC (permalink / raw)
To: David Miller; +Cc: igorm, netdev
In-Reply-To: <20111122.164909.156852889818363753.davem@davemloft.net>
On Tue, 2011-11-22 at 16:49 -0500, David Miller wrote:
> I frankly don't care very much about MPLS personally, it's such a
> fringe facility. So if people just argue themselves into oblivion and
> no forward progress is made, just like last time an MPLS submission
> was attempted, that's also fine with me :-)
Hello David,
Oh dear. I don't know how I can express my years of frustration at the
lack of MPLS in the stock Linux kernel and the thought that it may be
another five years away.
Maybe that I've just spent $50K on a router to terminate MPLS tunnels
into ethernet VLANs, just so Linux can be a recording device for Juniper
routers intercepting traffic. You might well ask why Juniper chose MPLS
rather than GRE, and the answer there is that MPLS is so fundamental to
modern networking that it is implemented in the forwarding silicon of
the interface, making MPLS the obvious choice for copying every packet
matching a flowspec into a tunnel.
Maybe that MPLS is the technology used to transform a router into a
network. Which is why all of the Linux router and software-defined
networking devices have kernels with MPLS patches. Middleboxes lacking
MPLS are increasingly difficult to integrate into a modern ISP network
-- firewalls, SIP session border controllers, etc. Linux is, of course,
the dominant OS on those middleboxes.
Maybe that without host MPLS we are only left with the architectural
mess which is data centre ethernet when wanting to add advanced
networking features to hosts. There's a good argument that data centre
ethernet only exists because MPLS isn't widespread on hosts.
Maybe that hosting virtual machines forces servers to become
part of the network -- servers aren't edge devices anymore. Terminating
MPLS on an edge subnet is difficult when the edge subnet exists within a
server which doesn't implement MPLS.
Maybe that the server is changing so that advanced networking is part of
its brief. Large sites have redundant data centres. Small sites
outsource to cloud providers to gain the advantages of large sites. The
pool outside of those two is shrinking.
I don't know enough to say if these MPLS patches are any good or not --
I haven't spent my life working on the Linux kernel. I have spent my
life building Internet networks, so I do know enough to say that if you
want Linux to continue to be attractive to ISPs and large enterprises as
the Swiss Army Knife for network services then it is well time that some
MPLS implementation was in the stock kernel.
Otherwise the Linux networking implementation will simply become
irrelevant. People with deep networking needs -- ISPs, enterprise data
centres, large content sites -- will simply use Linux to implement the
interfaces and attach them to a VM running more capable networking
software from C or J. You're already seeing software from these people
now, because the thought of them getting revenue for existing software
with no hardware development makes them drool.
I've long admired the quality of the Linux networking implementation.
For example, router manufacturers could learn a lot from the deep
thought and clear design of the "tc" subsystem. The work which was done
to make TCP perform well is outstanding. I very much want Linux
networking to continue to succeed. So please take this suggestion that
it is well time for forward progress on MPLS in the spirit it is meant.
My apologies where you may feel I have vented excessively above.
Best wishes, Glen
--
Glen Turner <http://www.gdt.id.au/~gdt/>
* Re: [PATCH net-next 1/2] netem: rate-latency extension
From: Bill Fink @ 2011-11-25 1:06 UTC (permalink / raw)
To: Hagen Paul Pfeifer; +Cc: Eric Dumazet, netdev, Stephen Hemminger
In-Reply-To: <20111124223118.GH2673@hell>
On Thu, 24 Nov 2011, Hagen Paul Pfeifer wrote:
> * Eric Dumazet | 2011-11-24 23:14:58 [+0100]:
>
> >Only point is why you chose ratelatency instead of rate ?
>
>
> Not sure why, it was called rate in v1, then somebody said ratelatency and I
> found it more stating. So in v2 it become ratelatency. I have no strong
> opinion here - should I generate v3?
From the user perspective, I also find rate much more natural.
No need to add further to tc obscurity.
I would ask for an update to the netem man page, but I guess
there isn't a netem man page. :-(
-Bill
* Re: [PATCH net-next 1/2] netem: rate-latency extension
From: Hagen Paul Pfeifer @ 2011-11-25 1:23 UTC (permalink / raw)
To: Bill Fink; +Cc: Eric Dumazet, netdev, Stephen Hemminger
In-Reply-To: <20111124200650.c1f609ef.billfink@mindspring.com>
* Bill Fink | 2011-11-24 20:06:50 [-0500]:
>From the user perspective, I also find rate much more natural.
>No need to add further to tc obscurity.
ok, then I will respin the patch.
>I would ask for an update to the netem man page, but I guess
>there isn't a netem man page. :-(
Someone wrote a man page, but it was never committed to iproute2. I will have a
look.
Hagen
* [PATCH v2 net-next 1/2] netem: rate extension
From: Hagen Paul Pfeifer @ 2011-11-25 2:22 UTC (permalink / raw)
To: netdev; +Cc: Stephen Hemminger, Hagen Paul Pfeifer
In-Reply-To: <1322156378-23257-1-git-send-email-hagen@jauu.net>
Currently netem is not able to emulate channel bandwidth. Only static
delay (and optional random jitter) can be configured.
To emulate the channel rate the token bucket filter (sch_tbf) can be used. But
TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot
be 0. Also, the idea behind TBF is that the credit (tokens in the bucket) fills
while no packet is transmitted, so that there is always a "positive" credit for
new packets. In real life this behavior contradicts the laws of nature, where
nothing can travel faster than the speed of light. E.g.: on an emulated
1000 byte/s link a small IPv4/TCP SYN packet of ~50 bytes requires ~0.05
seconds - not 0 seconds.
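The arithmetic behind that example can be written out as a small stand-alone helper. This is only a user-space sketch of the serialization-delay formula (length times nanoseconds per second, divided by the rate in byte/s); tx_time_ns is a hypothetical name and NSEC_PER_SEC is defined locally here rather than taken from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL /* local definition for this sketch */

/*
 * Time in nanoseconds needed to serialize len bytes onto a link that
 * carries rate bytes per second. The multiplication is done in 64 bit
 * so that len * NSEC_PER_SEC cannot overflow for 32-bit lengths.
 */
static uint64_t tx_time_ns(uint32_t len, uint32_t rate)
{
	assert(rate > 0); /* a rate of 0 means "no rate limit", not a divisor */
	return (uint64_t)len * NSEC_PER_SEC / rate;
}
```

For the SYN example, tx_time_ns(50, 1000) evaluates to 50,000,000 ns, i.e. 0.05 seconds.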
Netem is an excellent place to implement a rate limiting feature: static
delay is already implemented, tfifo already has time information and the
user can skip TBF configuration completely.
This patch implements a rate feature which can be configured via tc, e.g.:
tc qdisc add dev eth0 root netem rate 10kbit
To emulate a link of 5000byte/s and add an additional static delay of 10ms:
tc qdisc add dev eth0 root netem delay 10ms rate 5KBps
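As an illustration of what such a configuration means for back-to-back packets, the following user-space model serializes packets at the link rate and then adds the static delay. It is a deliberately simplified sketch, not the bookkeeping done in netem_enqueue(); struct link_model and schedule_packet are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL /* local definition for this sketch */

struct link_model { /* hypothetical, user space only */
	uint32_t rate;         /* link rate in bytes per second */
	uint64_t delay_ns;     /* static delay added to every packet */
	uint64_t link_free_ns; /* time at which the previous packet is done */
};

/*
 * Returns the time at which a packet of len bytes, arriving at now_ns,
 * would be delivered: it has to wait for the packet in front of it,
 * is then serialized at the link rate, and finally sees the static delay.
 */
static uint64_t schedule_packet(struct link_model *l, uint64_t now_ns,
				uint32_t len)
{
	uint64_t start;

	assert(l && l->rate > 0);
	start = now_ns > l->link_free_ns ? now_ns : l->link_free_ns;
	l->link_free_ns = start + (uint64_t)len * NSEC_PER_SEC / l->rate;
	return l->link_free_ns + l->delay_ns;
}
```

With rate 5000 byte/s and a 10 ms delay, two 1000-byte packets arriving together at time 0 would be delivered at 210 ms and 410 ms in this model.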
Note: similar to TBF, the rate extension is bound to the kernel timing
system. Depending on the architecture's timer granularity, higher rates (e.g.
10mbit/s and higher) tend to produce transmission bursts. Also note that
further queues live in network adapters; see ethtool(8).
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
include/linux/pkt_sched.h | 5 +++++
net/sched/sch_netem.c | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index c533670..26c37ca 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -465,6 +465,7 @@ enum {
TCA_NETEM_REORDER,
TCA_NETEM_CORRUPT,
TCA_NETEM_LOSS,
+ TCA_NETEM_RATE,
__TCA_NETEM_MAX,
};
@@ -495,6 +496,10 @@ struct tc_netem_corrupt {
__u32 correlation;
};
+struct tc_netem_rate {
+ __u32 rate; /* byte/s */
+};
+
enum {
NETEM_LOSS_UNSPEC,
NETEM_LOSS_GI, /* General Intuitive - 4 state model */
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index eb3b9a8..9b7af9f 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -79,6 +79,7 @@ struct netem_sched_data {
u32 duplicate;
u32 reorder;
u32 corrupt;
+ u32 rate;
struct crndstate {
u32 last;
@@ -298,6 +299,11 @@ static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
return x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
}
+static psched_time_t packet_len_2_sched_time(unsigned int len, u32 rate)
+{
+ return PSCHED_NS2TICKS((u64)len * NSEC_PER_SEC / rate);
+}
+
/*
* Insert one skb into qdisc.
* Note: parent depends on return value to account for queue length.
@@ -371,6 +377,24 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
&q->delay_cor, q->delay_dist);
now = psched_get_time();
+
+ if (q->rate) {
+ struct sk_buff_head *list = &q->qdisc->q;
+
+ delay += packet_len_2_sched_time(skb->len, q->rate);
+
+ if (!skb_queue_empty(list)) {
+ /*
+ * Last packet in queue is reference point (now).
+ * First packet in queue is already in flight,
+ * calculate this time bonus and subtract
+ * from delay.
+ */
+ delay -= now - netem_skb_cb(skb_peek(list))->time_to_send;
+ now = netem_skb_cb(skb_peek_tail(list))->time_to_send;
+ }
+ }
+
cb->time_to_send = now + delay;
++q->counter;
ret = qdisc_enqueue(skb, q->qdisc);
@@ -535,6 +559,14 @@ static void get_corrupt(struct Qdisc *sch, const struct nlattr *attr)
init_crandom(&q->corrupt_cor, r->correlation);
}
+static void get_rate(struct Qdisc *sch, const struct nlattr *attr)
+{
+ struct netem_sched_data *q = qdisc_priv(sch);
+ const struct tc_netem_rate *r = nla_data(attr);
+
+ q->rate = r->rate;
+}
+
static int get_loss_clg(struct Qdisc *sch, const struct nlattr *attr)
{
struct netem_sched_data *q = qdisc_priv(sch);
@@ -594,6 +626,7 @@ static const struct nla_policy netem_policy[TCA_NETEM_MAX + 1] = {
[TCA_NETEM_CORR] = { .len = sizeof(struct tc_netem_corr) },
[TCA_NETEM_REORDER] = { .len = sizeof(struct tc_netem_reorder) },
[TCA_NETEM_CORRUPT] = { .len = sizeof(struct tc_netem_corrupt) },
+ [TCA_NETEM_RATE] = { .len = sizeof(struct tc_netem_rate) },
[TCA_NETEM_LOSS] = { .type = NLA_NESTED },
};
@@ -666,6 +699,9 @@ static int netem_change(struct Qdisc *sch, struct nlattr *opt)
if (tb[TCA_NETEM_CORRUPT])
get_corrupt(sch, tb[TCA_NETEM_CORRUPT]);
+ if (tb[TCA_NETEM_RATE])
+ get_rate(sch, tb[TCA_NETEM_RATE]);
+
q->loss_model = CLG_RANDOM;
if (tb[TCA_NETEM_LOSS])
ret = get_loss_clg(sch, tb[TCA_NETEM_LOSS]);
@@ -846,6 +882,7 @@ static int netem_dump(struct Qdisc *sch, struct sk_buff *skb)
struct tc_netem_corr cor;
struct tc_netem_reorder reorder;
struct tc_netem_corrupt corrupt;
+ struct tc_netem_rate rate;
qopt.latency = q->latency;
qopt.jitter = q->jitter;
@@ -868,6 +905,9 @@ static int netem_dump(struct Qdisc *sch, struct sk_buff *skb)
corrupt.correlation = q->corrupt_cor.rho;
NLA_PUT(skb, TCA_NETEM_CORRUPT, sizeof(corrupt), &corrupt);
+ rate.rate = q->rate;
+ NLA_PUT(skb, TCA_NETEM_RATE, sizeof(rate), &rate);
+
if (dump_loss_model(q, skb) != 0)
goto nla_put_failure;
--
1.7.7
* [PATCH v2 net-next 2/2] netem: add cell concept to simulate special MAC behavior
From: Hagen Paul Pfeifer @ 2011-11-25 2:22 UTC (permalink / raw)
To: netdev; +Cc: Stephen Hemminger, Hagen Paul Pfeifer
In-Reply-To: <1322187773-27768-1-git-send-email-hagen@jauu.net>
This extension can be used to simulate special link layer
characteristics. Simulate because packet data is not modified, only the
calculation base is changed to delay a packet based on the original
packet size and artificial cell information.
packet_overhead can be used to simulate a link layer header compression
scheme (e.g. set packet_overhead to -20) or with a positive
packet_overhead value an additional MAC header can be simulated. It is
also possible to "replace" the 14 byte Ethernet header with something
else.
cell_size and cell_overhead can be used to simulate cell-based link layer
schemes, like some TDMA schemes. Another application area is MAC
schemes using link layer fragmentation with a (small) header per fragment.
Cell size is the maximum number of data bytes within one cell. Cell
overhead is an additional variable to change the per-cell overhead (e.g.
5 byte header per fragment).
Example (5 kbit/s, 20 byte per packet overhead, cellsize 100 byte, per
cell overhead 5 byte):
tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
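One possible reading of these parameters, written as a stand-alone helper: the packet overhead is added first, the result is split into cells of at most cell_size data bytes, a final partial cell is padded to a full cell, and every cell costs cell_overhead extra bytes. The padding rule and the name cell_adjusted_len are assumptions made for this sketch; the arithmetic in the patch may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective length used for the delay calculation, under the
 * assumptions stated above. Negative overheads are allowed (e.g. to
 * model header compression); the result is clamped at 0.
 */
static uint32_t cell_adjusted_len(uint32_t len, int32_t packet_overhead,
				  uint32_t cell_size, int32_t cell_overhead)
{
	int64_t total = (int64_t)len + packet_overhead;
	int64_t cells;

	if (total < 0)
		total = 0;
	if (cell_size == 0) /* no cell scheme configured */
		return (uint32_t)total;

	cells = (total + cell_size - 1) / cell_size; /* round up */
	total = cells * ((int64_t)cell_size + cell_overhead);
	assert(total <= UINT32_MAX); /* sketch only handles sane sizes */
	return total < 0 ? 0 : (uint32_t)total;
}
```

For the example above, a 1000-byte packet becomes 1020 bytes with the packet overhead, occupies 11 cells, and is accounted as 11 * 105 = 1155 bytes under this reading.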
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
include/linux/pkt_sched.h | 3 +++
net/sched/sch_netem.c | 30 +++++++++++++++++++++++++++---
2 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 26c37ca..63845cf 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -498,6 +498,9 @@ struct tc_netem_corrupt {
struct tc_netem_rate {
__u32 rate; /* byte/s */
+ __s32 packet_overhead;
+ __u32 cell_size;
+ __s32 cell_overhead;
};
enum {
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 9b7af9f..11ca527 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -80,6 +80,9 @@ struct netem_sched_data {
u32 reorder;
u32 corrupt;
u32 rate;
+ s32 packet_overhead;
+ u32 cell_size;
+ s32 cell_overhead;
struct crndstate {
u32 last;
@@ -299,9 +302,24 @@ static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
return x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
}
-static psched_time_t packet_len_2_sched_time(unsigned int len, u32 rate)
+static psched_time_t packet_len_2_sched_time(unsigned int len,
+ struct netem_sched_data *q)
{
- return PSCHED_NS2TICKS((u64)len * NSEC_PER_SEC / rate);
+ len += q->packet_overhead;
+
+ if (q->cell_size) {
+ u32 carry = len % q->cell_size;
+ len += carry;
+
+ if (q->cell_overhead) {
+ u32 cells = len / q->cell_size;
+ if (carry)
+ cells += 1;
+ len += cells * q->cell_overhead;
+ }
+ }
+
+ return PSCHED_NS2TICKS((u64)len * NSEC_PER_SEC / q->rate);
}
/*
@@ -381,7 +399,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
if (q->rate) {
struct sk_buff_head *list = &q->qdisc->q;
- delay += packet_len_2_sched_time(skb->len, q->rate);
+ delay += packet_len_2_sched_time(skb->len, q);
if (!skb_queue_empty(list)) {
/*
@@ -565,6 +583,9 @@ static void get_rate(struct Qdisc *sch, const struct nlattr *attr)
const struct tc_netem_rate *r = nla_data(attr);
q->rate = r->rate;
+ q->packet_overhead = r->packet_overhead;
+ q->cell_size = r->cell_size;
+ q->cell_overhead = r->cell_overhead;
}
static int get_loss_clg(struct Qdisc *sch, const struct nlattr *attr)
@@ -906,6 +927,9 @@ static int netem_dump(struct Qdisc *sch, struct sk_buff *skb)
NLA_PUT(skb, TCA_NETEM_CORRUPT, sizeof(corrupt), &corrupt);
rate.rate = q->rate;
+ rate.packet_overhead = q->packet_overhead;
+ rate.cell_size = q->cell_size;
+ rate.cell_overhead = q->cell_overhead;
NLA_PUT(skb, TCA_NETEM_RATE, sizeof(rate), &rate);
if (dump_loss_model(q, skb) != 0)
--
1.7.7
* [PATCH v2 iproute2 1/2] utils: add s32 parser
From: Hagen Paul Pfeifer @ 2011-11-25 2:23 UTC (permalink / raw)
To: netdev; +Cc: Stephen Hemminger, Hagen Paul Pfeifer
In-Reply-To: <1322156378-23257-1-git-send-email-hagen@jauu.net>
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
include/utils.h | 1 +
lib/utils.c | 14 ++++++++++++++
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/include/utils.h b/include/utils.h
index 47f8e07..496db68 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -85,6 +85,7 @@ extern int get_time_rtt(unsigned *val, const char *arg, int *raw);
#define get_short get_s16
extern int get_u64(__u64 *val, const char *arg, int base);
extern int get_u32(__u32 *val, const char *arg, int base);
+extern int get_s32(__s32 *val, const char *arg, int base);
extern int get_u16(__u16 *val, const char *arg, int base);
extern int get_s16(__s16 *val, const char *arg, int base);
extern int get_u8(__u8 *val, const char *arg, int base);
diff --git a/lib/utils.c b/lib/utils.c
index efaf377..6788dd9 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -198,6 +198,20 @@ int get_u8(__u8 *val, const char *arg, int base)
return 0;
}
+int get_s32(__s32 *val, const char *arg, int base)
+{
+ long res;
+ char *ptr;
+
+ if (!arg || !*arg)
+ return -1;
+ res = strtol(arg, &ptr, base);
+ if (!ptr || ptr == arg || *ptr || res > INT32_MAX || res < INT32_MIN)
+ return -1;
+ *val = res;
+ return 0;
+}
+
int get_s16(__s16 *val, const char *arg, int base)
{
long res;
--
1.7.7
* [PATCH v2 iproute2 2/2] tc: netem rate shaping and cell extension
From: Hagen Paul Pfeifer @ 2011-11-25 2:23 UTC (permalink / raw)
To: netdev; +Cc: Stephen Hemminger, Hagen Paul Pfeifer
In-Reply-To: <1322187831-27846-1-git-send-email-hagen@jauu.net>
This patch adds rate shaping as well as cell support. The link rate can be
specified via the rate option. Three optional arguments control the cell
knobs: packet-overhead, cell-size, cell-overhead. To rate-limit the eth0 root
queue to 5kbit/s, with a 20 byte packet overhead, 100 byte cell size and
a 5 byte per cell overhead:
tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
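The optional-positional-argument scheme can be sketched in isolation as follows. This is not the q_netem.c parser: struct rate_opts, parse_s32, looks_numeric and parse_rate_opts are hypothetical names, and the rate is taken as a plain number of bytes per second instead of a tc rate string such as 5kbit.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct rate_opts { /* hypothetical mirror of the option set */
	uint32_t rate; /* bytes per second, plain number in this sketch */
	int32_t packet_overhead;
	uint32_t cell_size;
	int32_t cell_overhead;
};

/* Parse a signed 32-bit integer; returns 0 on success, -1 on error. */
static int parse_s32(int32_t *val, const char *arg)
{
	char *end;
	long res;

	if (!arg || !*arg)
		return -1;
	errno = 0;
	res = strtol(arg, &end, 0);
	if (errno || end == arg || *end || res > INT32_MAX || res < INT32_MIN)
		return -1;
	*val = (int32_t)res;
	return 0;
}

static int looks_numeric(const char *arg)
{
	int32_t tmp;

	return parse_s32(&tmp, arg) == 0;
}

/*
 * argv[0] is the token following "rate". Up to three further numeric
 * tokens are consumed as packet overhead, cell size and cell overhead;
 * the first non-numeric token ends the option. Returns the number of
 * tokens consumed, or -1 on error.
 */
static int parse_rate_opts(struct rate_opts *o, int argc, char **argv)
{
	int32_t v;
	int used;

	assert(o && argv);
	if (argc < 1 || parse_s32(&v, argv[0]) || v <= 0)
		return -1;
	o->rate = (uint32_t)v;
	o->packet_overhead = 0;
	o->cell_size = 0;
	o->cell_overhead = 0;
	used = 1;

	if (used < argc && looks_numeric(argv[used]))
		parse_s32(&o->packet_overhead, argv[used++]);
	if (used < argc && looks_numeric(argv[used])) {
		parse_s32(&v, argv[used]);
		if (v < 0) /* a cell size cannot be negative */
			return -1;
		o->cell_size = (uint32_t)v;
		used++;
	}
	if (used < argc && looks_numeric(argv[used]))
		parse_s32(&o->cell_overhead, argv[used++]);
	return used;
}
```

Because the arguments are positional, a cell size can only be given after a packet overhead in this scheme; a keyword-based syntax would avoid that coupling at the cost of a longer command line.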
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
include/linux/pkt_sched.h | 8 ++++++
tc/q_netem.c | 53 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 60 insertions(+), 1 deletions(-)
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index c533670..eaf4e9e 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -465,6 +465,7 @@ enum {
TCA_NETEM_REORDER,
TCA_NETEM_CORRUPT,
TCA_NETEM_LOSS,
+ TCA_NETEM_RATE,
__TCA_NETEM_MAX,
};
@@ -495,6 +496,13 @@ struct tc_netem_corrupt {
__u32 correlation;
};
+struct tc_netem_rate {
+ __u32 rate; /* byte/s */
+ __s32 packet_overhead;
+ __u32 cell_size;
+ __s32 cell_overhead;
+};
+
enum {
NETEM_LOSS_UNSPEC,
NETEM_LOSS_GI, /* General Intuitive - 4 state model */
diff --git a/tc/q_netem.c b/tc/q_netem.c
index 6dc40bd..1fdfa44 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -34,7 +34,8 @@ static void explain(void)
" [ drop PERCENT [CORRELATION]] \n" \
" [ corrupt PERCENT [CORRELATION]] \n" \
" [ duplicate PERCENT [CORRELATION]]\n" \
-" [ reorder PRECENT [CORRELATION] [ gap DISTANCE ]]\n");
+" [ reorder PRECENT [CORRELATION] [ gap DISTANCE ]]\n" \
+" [ rate RATE [PACKETOVERHEAD] [CELLSIZE] [CELLOVERHEAD]]\n");
}
static void explain1(const char *arg)
@@ -131,6 +132,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
struct tc_netem_corr cor;
struct tc_netem_reorder reorder;
struct tc_netem_corrupt corrupt;
+ struct tc_netem_rate rate;
__s16 *dist_data = NULL;
int present[__TCA_NETEM_MAX];
@@ -139,6 +141,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
memset(&cor, 0, sizeof(cor));
memset(&reorder, 0, sizeof(reorder));
memset(&corrupt, 0, sizeof(corrupt));
+ memset(&rate, 0, sizeof(rate));
memset(present, 0, sizeof(present));
while (argc > 0) {
@@ -244,6 +247,34 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
free(dist_data);
return -1;
}
+ } else if (matches(*argv, "rate") == 0) {
+ ++present[TCA_NETEM_RATE];
+ NEXT_ARG();
+ if (get_rate(&rate.rate, *argv)) {
+ explain1("rate");
+ return -1;
+ }
+ if (NEXT_IS_NUMBER()) {
+ NEXT_ARG();
+ if (get_s32(&rate.packet_overhead, *argv, 0)) {
+ explain1("rate");
+ return -1;
+ }
+ }
+ if (NEXT_IS_NUMBER()) {
+ NEXT_ARG();
+ if (get_u32(&rate.cell_size, *argv, 0)) {
+ explain1("rate");
+ return -1;
+ }
+ }
+ if (NEXT_IS_NUMBER()) {
+ NEXT_ARG();
+ if (get_s32(&rate.cell_overhead, *argv, 0)) {
+ explain1("rate");
+ return -1;
+ }
+ }
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -290,6 +321,10 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, char **argv,
addattr_l(n, 1024, TCA_NETEM_CORRUPT, &corrupt, sizeof(corrupt)) < 0)
return -1;
+ if (present[TCA_NETEM_RATE] &&
+ addattr_l(n, 1024, TCA_NETEM_RATE, &rate, sizeof(rate)) < 0)
+ return -1;
+
if (dist_data) {
if (addattr_l(n, MAX_DIST * sizeof(dist_data[0]),
TCA_NETEM_DELAY_DIST,
@@ -306,6 +341,7 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
const struct tc_netem_corr *cor = NULL;
const struct tc_netem_reorder *reorder = NULL;
const struct tc_netem_corrupt *corrupt = NULL;
+ const struct tc_netem_rate *rate = NULL;
struct tc_netem_qopt qopt;
int len = RTA_PAYLOAD(opt) - sizeof(qopt);
SPRINT_BUF(b1);
@@ -339,6 +375,11 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
return -1;
corrupt = RTA_DATA(tb[TCA_NETEM_CORRUPT]);
}
+ if (tb[TCA_NETEM_RATE]) {
+ if (RTA_PAYLOAD(tb[TCA_NETEM_RATE]) < sizeof(*rate))
+ return -1;
+ rate = RTA_DATA(tb[TCA_NETEM_RATE]);
+ }
}
fprintf(f, "limit %d", qopt.limit);
@@ -382,6 +423,16 @@ static int netem_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
sprint_percent(corrupt->correlation, b1));
}
+ if (rate && rate->rate) {
+ fprintf(f, " rate %s", sprint_rate(rate->rate, b1));
+ if (rate->packet_overhead)
+ fprintf(f, " packetoverhead %d", rate->packet_overhead);
+ if (rate->cell_size)
+ fprintf(f, " cellsize %u", rate->cell_size);
+ if (rate->cell_overhead)
+ fprintf(f, " celloverhead %d", rate->cell_overhead);
+ }
+
if (qopt.gap)
fprintf(f, " gap %lu", (unsigned long)qopt.gap);
--
1.7.7
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-25 2:58 UTC (permalink / raw)
To: jasowang
Cc: arnd, davem, jeffrey.t.kirsher, levinsasha928, Michael S. Tsirkin,
netdev, virtualization
In-Reply-To: <4ECE4004.8010107@redhat.com>
jasowang <jasowang@redhat.com> wrote on 11/24/2011 06:30:52 PM:
>
> >> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
> >>> It was reported that the macvtap device selects a
> >>> different vhost (when used with multiqueue feature)
> >>> for incoming packets of a single connection. Use
> >>> packet hash first. Patch tested on MQ virtio_net.
> >> So this is sure to address the problem, why exactly does this happen?
> >> Does your device spread a single flow across multiple RX queues? Would
> >> not that cause trouble in the TCP layer?
> >> It would seem that using the recorded queue should be faster with
> >> less cache misses. Before we give up on that, I'd
> >> like to understand why it's wrong. Do you know?
> > I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
> > calls skb_record_rx_queue when a skb is allocated. When a packet
> > is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
> > recorded value is different for most skbs when I ran a single
> > stream TCP stream test (does skbs move between the rx_rings?).
>
> Yes, it moves. It depends on last processor or tx queue who transmits
> the packets of this stream. Because ixgbe select tx queue based on the
> processor id, so if vhost thread transmits skbs on different processors,
> the skb of a single stream may comes from different rx ring.
But I don't see transmit going on different queues,
only incoming.
- KK
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-25 3:07 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: arnd, davem, jasowang, levinsasha928, netdev, virtualization
In-Reply-To: <20111124161430.GC26770@redhat.com>
"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/24/2011 09:44:31 PM:
> > As far as I can see, ixgbe binds queues to physical cpu, so let consider:
> >
> > vhost thread transmits packets of flow A on processor M
> > during packet transmission, ixgbe driver programs the card to
> > deliver the packet of flow A to queue/cpu M through flow director
> > (see ixgbe_atr())
> > vhost thread then receives packet of flow A with from M
> > ...
> > vhost thread transmits packets of flow A on processor N
> > ixgbe driver programs the flow director to change the delivery of
> > flow A to queue N ( cpu N )
> > vhost thread then receives packet of flow A with from N
> > ...
> >
> > So, for a single flow A, we may get different queue mappings. Using
> > rxhash instead may solve this issue.
>
> Or better, transmit a single flow from a single vhost thread.
>
> If packets of a single flow get spread over different CPUs,
> they will get reordered and things are not going to work well.
My testing so far shows that the guest sends on (e.g.) TXQ#2
only, which is handled by vhost#2, and this doesn't change
for the entire duration of the test. The incoming queue keeps
changing for different packets but becomes the same with
this patch. To reiterate, I have not seen the following:
"
vhost thread transmits packets of flow A on processor M
...
vhost thread transmits packets of flow A on processor N
"
- KK
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-25 3:09 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Krishna Kumar, arnd, netdev, virtualization, levinsasha928, davem
In-Reply-To: <20111124161430.GC26770@redhat.com>
On 11/25/2011 12:14 AM, Michael S. Tsirkin wrote:
> On Thu, Nov 24, 2011 at 08:56:45PM +0800, jasowang wrote:
>> On 11/24/2011 06:34 PM, Michael S. Tsirkin wrote:
>>> On Thu, Nov 24, 2011 at 06:13:41PM +0800, jasowang wrote:
>>>> On 11/24/2011 05:59 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>>>>>> It was reported that the macvtap device selects a
>>>>>> different vhost (when used with multiqueue feature)
>>>>>> for incoming packets of a single connection. Use
>>>>>> packet hash first. Patch tested on MQ virtio_net.
>>>>> So this is sure to address the problem, why exactly does this happen?
>>>> Ixgbe has flow director and bind queue to host cpu, so it can make
>>>> sure the packet of a flow to be handled by the same queue/cpu. So
>>>> when vhost thread moves from one host cpu to another, ixgbe would
>>>> therefore send the packet to the new cpu/queue.
>>> Confused. How does ixgbe know about vhost thread moving?
>>
>> As far as I can see, ixgbe binds queues to physical cpu, so let consider:
>>
>> vhost thread transmits packets of flow A on processor M
>> during packet transmission, ixgbe driver programs the card to
>> deliver the packet of flow A to queue/cpu M through flow director
>> (see ixgbe_atr())
>> vhost thread then receives packet of flow A with from M
>> ...
>> vhost thread transmits packets of flow A on processor N
>> ixgbe driver programs the flow director to change the delivery of
>> flow A to queue N ( cpu N )
>> vhost thread then receives packet of flow A with from N
>> ...
>>
>> So, for a single flow A, we may get different queue mappings. Using
>> rxhash instead may solve this issue.
> Or better, transmit a single flow from a single vhost thread.
It already works this way, as the tx queue is chosen based on the tx
hash in the guest, but the vhost thread can move among processors.
>
> If packets of a single flow get spread over different CPUs,
> they will get reordered and things are not going to work well.
>
The problem is that vhost does not handle TCP itself, but the ixgbe driver
assumes it does, so the NIC delivers packets of a single flow to different
CPUs when the vhost thread that does the transmission moves.
So, in conclusion, if we set aside the features of the underlying NIC,
using rxhash instead of queue mappings to identify a flow is better.
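The difference between the two selection strategies can be shown with a toy model. This is not the macvtap or ixgbe code: NUM_QUEUES, struct steering_model and the helper names are hypothetical, and the "flow director" here is reduced to remembering the CPU of the last transmit.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_QUEUES 4 /* illustrative queue count */

/* Toy steering state: rx queue follows the CPU of the last transmit. */
struct steering_model {
	int last_tx_cpu;
};

/* The sender thread transmits a packet of the flow while on this CPU. */
static void model_transmit(struct steering_model *s, int cpu)
{
	assert(s && cpu >= 0 && cpu < NUM_QUEUES);
	s->last_tx_cpu = cpu;
}

/* Queue recorded on a received packet: changes when the sender moves. */
static int recorded_rx_queue(const struct steering_model *s)
{
	return s->last_tx_cpu;
}

/* Queue derived from the flow hash: the same for every packet of a flow. */
static int queue_from_rxhash(uint32_t rxhash)
{
	return (int)(rxhash % NUM_QUEUES);
}
```

In this model, a thread that transmits on CPU 1 and later on CPU 3 sees the recorded rx queue change from 1 to 3, while queue_from_rxhash() keeps returning the same value for the flow's hash.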
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-25 3:18 UTC (permalink / raw)
To: Krishna Kumar2
Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
davem, jeffrey.t.kirsher
In-Reply-To: <OF2886C7FB.135CFC23-ON65257953.000F9C8A-65257953.00101CBF@in.ibm.com>
On 11/25/2011 10:58 AM, Krishna Kumar2 wrote:
> jasowang<jasowang@redhat.com> wrote on 11/24/2011 06:30:52 PM:
>
>>>> On Thu, Nov 24, 2011 at 01:47:14PM +0530, Krishna Kumar wrote:
>>>>> It was reported that the macvtap device selects a
>>>>> different vhost (when used with multiqueue feature)
>>>>> for incoming packets of a single connection. Use
>>>>> packet hash first. Patch tested on MQ virtio_net.
>>>> So this is sure to address the problem, why exactly does this happen?
>>>> Does your device spread a single flow across multiple RX queues? Would
>>>> not that cause trouble in the TCP layer?
>>>> It would seem that using the recorded queue should be faster with
>>>> less cache misses. Before we give up on that, I'd
>>>> like to understand why it's wrong. Do you know?
>>> I am using ixgbe. From what I briefly saw, ixgbe_alloc_rx_buffers
>>> calls skb_record_rx_queue when a skb is allocated. When a packet
>>> is received (ixgbe_alloc_rx_buffers), it sets rxhash. The
>>> recorded value is different for most skbs when I ran a single
>>> stream TCP stream test (does skbs move between the rx_rings?).
>> Yes, it moves. It depends on last processor or tx queue who transmits
>> the packets of this stream. Because ixgbe select tx queue based on the
>> processor id, so if vhost thread transmits skbs on different processors,
>> the skb of a single stream may comes from different rx ring.
> But I don't see transmit going on different queues,
> only incoming.
>
> - KK
>
Maybe I was not clear enough: I mean the host processor and the tx queue
of ixgbe. For a single vhost thread, as it moves among host CPUs, it
uses different ixgbe tx queues. I think if you pin the vhost thread to a
host CPU, you may get a consistent rx queue number.
^ permalink raw reply
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Jason Wang @ 2011-11-25 3:21 UTC (permalink / raw)
To: Krishna Kumar2
Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
davem
In-Reply-To: <OF86B15006.7F4BE3FB-ON65257953.00104C3E-65257953.0010EFCA@in.ibm.com>
On 11/25/2011 11:07 AM, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"<mst@redhat.com> wrote on 11/24/2011 09:44:31 PM:
>
>>> As far as I can see, ixgbe binds queues to physical cpu, so let
> consider:
>>> vhost thread transmits packets of flow A on processor M
>>> during packet transmission, ixgbe driver programs the card to
>>> deliver the packet of flow A to queue/cpu M through flow director
>>> (see ixgbe_atr())
>>> vhost thread then receives packet of flow A with from M
>>> ...
>>> vhost thread transmits packets of flow A on processor N
>>> ixgbe driver programs the flow director to change the delivery of
>>> flow A to queue N ( cpu N )
>>> vhost thread then receives packet of flow A with from N
>>> ...
>>>
>>> So, for a single flow A, we may get different queue mappings. Using
>>> rxhash instead may solve this issue.
>> Or better, transmit a single flow from a single vhost thread.
>>
>> If packets of a single flow get spread over different CPUs,
>> they will get reordered and things are not going to work well.
> My testing so far shows that guest sends on (e.g.) TXQ#2
> only, which is handled by vhost#2; and this doesn't change
> for the entire duration of the test. Incoming keeps
> changing for different packets but become same with
> this patch. To iterate, I have not seen the following:
Yes, because the guest chooses the txq of virtio-net based on the hash.
>
> "
> vhost thread transmits packets of flow A on processor M
> ...
> vhost thread transmits packets of flow A on processor N
> "
My description was unclear again :(
I mean the same vhost thread:
vhost thread #0 transmits packets of flow A on processor M
...
vhost thread #0 moves to another processor N and starts to transmit
packets of flow A
> - KK
>
^ permalink raw reply
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: Krishna Kumar2 @ 2011-11-25 4:09 UTC (permalink / raw)
To: Jason Wang
Cc: arnd, Michael S. Tsirkin, netdev, virtualization, levinsasha928,
davem
In-Reply-To: <4ECF09D5.4010700@redhat.com>
Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
>
> My description is not clear again :(
> I mean the same vhost thread:
>
> vhost thread #0 transmits packets of flow A on processor M
> ...
> vhost thread #0 moves to another processor N and starts to transmit
> packets of flow A
Thanks for clarifying. Yes, binding vhosts to CPUs
makes incoming packets go to the same vhost each
time. BTW, are you doing any binding and/or irqbalance
when you run your tests? I am not running either at
this time, but thought both might be useful.
- KK
^ permalink raw reply
* Re: [PATCH net-next 1/2] netem: rate-latency extension
From: Stephen Hemminger @ 2011-11-25 5:09 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Hagen Paul Pfeifer, netdev
In-Reply-To: <1322172898.2872.7.camel@edumazet-laptop>
On Thu, 24 Nov 2011 23:14:58 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I like this patch, this is a useful extension.
>
> Only point is why you chose ratelatency instead of rate ?
>
> We want to emulate a real link, and yes, a 1000 bytes packet must be
> delayed _before_ we deliver it to the device, but its a detail of how
> works netem.
>
> The usual word we use to describe a 1Mbps link is "1Mbps rate" ;
I would rather have a new qdisc than add more features to the already
complex netem. Initially, there was a rate control built into netem, but
the consensus was to use stacking to do it.
^ permalink raw reply
* Re: Open vSwitch Design
From: Stephen Hemminger @ 2011-11-25 5:20 UTC (permalink / raw)
To: jhs-jkUAjuhPggJWk0Htik3J/w
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu,
Eric Dumazet, netdev, hadi-fAAogVwAN2Kw5LPnMra/2Q, Fastabend,
John-/PVsmBQoxgPKo9QCiBeYKEEOCMrvLtNR, David Miller
In-Reply-To: <1322173833.1944.5.camel@mojatatu>
On Thu, 24 Nov 2011 17:30:33 -0500
jamal <hadi-fAAogVwAN2Kw5LPnMra/2Q@public.gmane.org> wrote:
> Jesse,
>
> I am going to try and respond to your comments below.
>
> On Thu, 2011-11-24 at 12:10 -0800, Jesse Gross wrote:
>
> >
> > * Switching infrastructure: As the name implies, Open vSwitch is
> > intended to be a network switch, focused on
> > virtualization/OpenFlow/software defined networking. This means that
> > what we are modeling is not actually a collection of flows but a
> > switch which contains a group of related ports, a software virtual
> > device, etc. The switch model is used in a variety of places, such as
> > to measure traffic that actually flows through it in order to
> > implement monitoring and sampling protocols.
>
> Can you explain why you couldnt use the current bridge code (likely with
> some mods)? I can see you want to isolate the VMs via the virtual ports;
> maybe even vlans on the virtual ports - the current bridge code should
> be able to handle that.
The way Open vSwitch works is that the flow table is populated
by user space. The kernel bridge works completely differently (it learns
MAC addresses).
> > * Flow lookup: Although used to implement OpenFlow, the kernel flow
> > table does not actually directly contain OpenFlow flows. This is
> > because OpenFlow tables can contain wildcards, multiple pipeline
> > stages, etc. and we did not want to push that complexity into the
> > kernel fast path (nor tie it to a specific version of OpenFlow).
> > Instead an exact match flow table is populated on-demand from
> > userspace based on the more complex rules stored there. Although it
> > might seem limiting, this design has allowed significant new
> > functionality to be added without modifications to the kernel or
> > performance impact.
>
> This can be achieved easily with zero changes to the kernel code.
> You need to have default filters that redirect flows to user space
> when you fail to match.
Actually, this is what puts me off the current implementation.
I would prefer that the kernel implementation was just a software
implementation of a hardware OpenFlow switch. That way it would
be transparent to the control plane in user space whether it was
talking to the kernel or to hardware.
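The exact-match design quoted above (a fast-path table filled on demand,
with misses handed to user space) can be sketched roughly as follows. This
is a minimal illustration, not Open vSwitch or tc code: `ExactMatchTable`,
`userspace_lookup`, `USERSPACE_RULES` and the (in_port, dst_mac) key format
are all hypothetical.

```python
WILDCARD = None  # field value that matches anything in a userspace rule

# Hypothetical userspace rules: (match on (in_port, dst_mac), action).
# The first matching rule wins.
USERSPACE_RULES = [
    ((1, WILDCARD), "output:2"),
    ((WILDCARD, "aa:bb:cc:dd:ee:ff"), "output:3"),
]


def userspace_lookup(key):
    """Slow path: evaluate wildcard rules against an exact packet key."""
    for match, action in USERSPACE_RULES:
        if all(m is WILDCARD or m == k for m, k in zip(match, key)):
            return action
    return "drop"


class ExactMatchTable:
    """Fast path: a plain dictionary keyed by the full packet key."""

    def __init__(self):
        self.flows = {}
        self.misses = 0

    def process(self, key):
        action = self.flows.get(key)
        if action is None:
            # Miss: ask the slow path, then cache the exact-match entry.
            self.misses += 1
            action = userspace_lookup(key)
            self.flows[key] = action
        return action


table = ExactMatchTable()
packets = [
    (1, "11:22:33:44:55:66"),
    (1, "11:22:33:44:55:66"),
    (2, "aa:bb:cc:dd:ee:ff"),
]
actions = [table.process(p) for p in packets]
print(actions, table.misses)
```

Only the first packet of each distinct key reaches the slow path here; the
wildcard logic never has to exist in the fast-path table.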
> > * Packet execution: Once a flow is matched it can be output,
> > enqueued to a particular qdisc, etc. Some of these operations are
> > specific to Open vSwitch, such as sampling, whereas others we leverage
> > existing infrastructure (including tc for QoS) by simply marking the
> > packet for further processing.
>
> The tc classifier-action-qdisc infrastructure handles this.
> The sampler needs a new action defined.
There are too many damn layers in the software path already.
> > * Userspace interfaces: One of the difficulties of having a
> > specialized, exact match flow lookup engine is maintaining
> > compatibility across differing kernel/userspace versions. This
> > compatibility shows up heavily in the userspace interfaces and is
> > achieved by passing the kernel's version of the flow along with packet
> > information. This allows userspace to install appropriate flows even
> > if its interpretation of a packet differs from the kernel's without
> > version checks or maintaining multiple implementations of the flow
> > extraction code in the kernel.
>
> I didnt quiet follow - are we talking about backward/forward
> compatibility?
The problem is that there are two flow classifiers, one in Open vSwitch
in the kernel, and the other in the user space flow manager. I think the
issue is that the two have different code.
Is the kernel/userspace API for Open vSwitch nailed down and documented
well enough that alternative control plane software could be built?
> > It's obviously possible to put this code anywhere, whether it is an
> > independent module, in the bridge, or tc. Regardless, however, it's
> > largely new code that is geared towards this particular model so it
> > seems better not to add to the complexity of existing components if at
> > all possible.
>
> I am still not seeing how this could not be done without the
> infrastructure that exists. Granted, the user space brains - thats where
> everything else resides - but you are not pushing that i think.
^ permalink raw reply
* Re: MPLS for Linux kernel
From: David Miller @ 2011-11-25 5:43 UTC (permalink / raw)
To: gdt; +Cc: igorm, netdev
In-Reply-To: <1322177970.3236.57.camel@ilion>
From: Glen Turner <gdt@gdt.id.au>
Date: Fri, 25 Nov 2011 10:09:30 +1030
> On Tue, 2011-11-22 at 16:49 -0500, David Miller wrote:
>> I frankly don't care very much about MPLS personally, it's such a
>> fringe facility. So if people just argue themselves into oblivion and
>> no forward progress is made, just like last time an MPLS submission
>> was attempted, that's also fine with me :-)
>
> Hello David,
Just wanted to let you know that with the workload of patch review and
coding I have, I basically have zero time for verbose editorials like
this and, without exception, I never read them.
^ permalink raw reply
* Re: [PATCH net-next 1/2] netem: rate-latency extension
From: Eric Dumazet @ 2011-11-25 6:13 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Hagen Paul Pfeifer, netdev
In-Reply-To: <20111124210926.3e4b7567@s6510.linuxnetplumber.net>
Le jeudi 24 novembre 2011 à 21:09 -0800, Stephen Hemminger a écrit :
> I would rather have a new qdisc than add more features to the already
> complex netem. Initially, there was a rate control built into netem, but
> the consensus was to use stacking to do it.
Yes, but Hagen's change adds only a few lines to netem, and netem already
handles throttling. This is why I believe it's a nice enhancement.
Being able to simulate a rate limit (in bits per second by the way, the
usual bandwidth unit, not bytes per second...) in a very easy way seems a
good thing, even if it handles only the egress side.
As Hagen mentioned, a standard qdisc is able to rate limit, but the
first packet sent has zero delay, even if it's a 64 Kbyte packet. It
doesn't mimic a true link.
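As a rough worked example of the point about the first packet: on a real
link every packet, including the first, is held for its serialization time
(length in bits divided by the rate). The sketch below is not the netem
patch; `serialization_delay`, `departure_times` and the assumption that all
packets are queued at time zero are illustrative only.

```python
def serialization_delay(packet_bytes, rate_bps):
    """Seconds needed to clock a packet onto a link of rate_bps bits per second."""
    return packet_bytes * 8 / rate_bps


def departure_times(packet_sizes, rate_bps):
    """Departure time of each packet, assuming all are queued at time zero.

    A packet leaves once it and everything queued before it has been
    serialized, so even the first packet is delayed.
    """
    now = 0.0
    times = []
    for size in packet_sizes:
        now += serialization_delay(size, rate_bps)
        times.append(now)
    return times


RATE = 1_000_000  # a "1Mbps rate", in bits per second

small = serialization_delay(1000, RATE)       # 1000 byte packet
large = serialization_delay(64 * 1024, RATE)  # 64 Kbyte packet
print(small, large)
print(departure_times([1000, 1000, 1000], RATE))
```

With these numbers a 1000 byte packet is held for 8 ms and a 64 Kbyte packet
for roughly half a second, which is the delay a plain rate limiter would
skip for the first packet.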
^ permalink raw reply
* Re: Open vSwitch Design
From: Eric Dumazet @ 2011-11-25 6:18 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Herbert Xu, netdev,
hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
John Fastabend, David Miller
In-Reply-To: <20111124212021.2ae2fb7f-QE31Isp8l5DVJhW05BI4jyWSNWFUUkiGXqFh9Ls21Oc@public.gmane.org>
Le jeudi 24 novembre 2011 à 21:20 -0800, Stephen Hemminger a écrit :
> The problem is that there are two flow classifiers, one in OpenVswitch
> in the kernel, and the other in the user space flow manager. I think the
> issue is that the two have different code.
We already have the same kind of duplication in the kernel :)
__skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
logic...
Maybe it's time to factorize the thing, and eventually use it in a third
component (Open vSwitch...)
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
^ permalink raw reply
* Re: Open vSwitch Design
From: David Miller @ 2011-11-25 6:25 UTC (permalink / raw)
To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w
Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
netdev-u79uwXL29TY76Z2rM5mHXA, hadi-fAAogVwAN2Kw5LPnMra/2Q,
jhs-jkUAjuhPggJWk0Htik3J/w,
john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
shemminger-ZtmgI6mnKB3QT0dZR+AlfA
In-Reply-To: <1322201883.2872.19.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Fri, 25 Nov 2011 07:18:03 +0100
> Le jeudi 24 novembre 2011 à 21:20 -0800, Stephen Hemminger a écrit :
>
>> The problem is that there are two flow classifiers, one in OpenVswitch
>> in the kernel, and the other in the user space flow manager. I think the
>> issue is that the two have different code.
>
> We have kind of same duplication in kernel already :)
>
> __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> logic...
>
> Maybe its time to factorize the thing, eventually use it in a third
> component (Open vSwitch...)
Yes.
^ permalink raw reply
* Re: [PATCH net] net: Revert ARCNET and PHYLIB to tristate options
From: David Miller @ 2011-11-25 6:31 UTC (permalink / raw)
To: ben; +Cc: jeffrey.t.kirsher, netdev, debian-kernel
In-Reply-To: <1322119410.2839.262.camel@deadeye>
From: Ben Hutchings <ben@decadent.org.uk>
Date: Thu, 24 Nov 2011 07:23:30 +0000
> Commit 88491d8103498a6166f70d5999902fec70924314 ("drivers/net: Kconfig
> & Makefile cleanup") changed the type of these options to bool, but
> they select code that could (and still can) be built as modules.
>
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
> ---
> I consider the inability to build arcnet.o and libphy.o as modules to be
> a regression in 3.2 for general-purpose distribution kernels.
> Therefore, please apply this to the net tree.
I challenge you to get PHYLIB set to 'm' in a configuration such as
"allmodconfig" which is clearly in line with the kind of configuration
which "distribution kernels" use.
Did you really even try to see if it mattered? Or did you just test
your change with a stripped down configuration you happened to have
readily available in some build tree?
I really consider this change entirely pointless, because for the
situation where you say it matters, it actually doesn't as far as I
can tell.
I'm therefore rejecting this patch.
^ permalink raw reply
* Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
From: David Miller @ 2011-11-25 6:35 UTC (permalink / raw)
To: krkumar2; +Cc: arnd, mst, netdev, virtualization, levinsasha928
In-Reply-To: <OF835A6FF5.7D752AD1-ON65257953.0016350C-65257953.00169AF4@in.ibm.com>
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 25 Nov 2011 09:39:11 +0530
> Jason Wang <jasowang@redhat.com> wrote on 11/25/2011 08:51:57 AM:
>>
>> My description is not clear again :(
>> I mean the same vhost thread:
>>
>> vhost thread #0 transmits packets of flow A on processor M
>> ...
>> vhost thread #0 moves to another processor N and starts to transmit
>> packets of flow A
>
> Thanks for clarifying. Yes, binding vhosts to CPU's
> makes the incoming packet go to the same vhost each
> time. BTW, are you doing any binding and/or irqbalance
> when you run your tests? I am not running either at
> this time, but thought both might be useful.
So are we going with this patch or are we saying that vhost binding
is a requirement?
^ permalink raw reply
* Re: Open vSwitch Design
From: Eric Dumazet @ 2011-11-25 6:36 UTC (permalink / raw)
To: David Miller
Cc: dev-yBygre7rU0TnMu66kgdUjQ, chrisw-H+wXaHxf7aLQT0dZR+AlfA,
Florian Westphal, netdev-u79uwXL29TY76Z2rM5mHXA,
hadi-fAAogVwAN2Kw5LPnMra/2Q, jhs-jkUAjuhPggJWk0Htik3J/w,
john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w,
herbert-F6s6mLieUQo7FNHlEwC/lvQIK84fMopw,
shemminger-ZtmgI6mnKB3QT0dZR+AlfA
In-Reply-To: <20111125.012517.2221372383643417980.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Le vendredi 25 novembre 2011 à 01:25 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 25 Nov 2011 07:18:03 +0100
>
> > Le jeudi 24 novembre 2011 à 21:20 -0800, Stephen Hemminger a écrit :
> >
> >> The problem is that there are two flow classifiers, one in OpenVswitch
> >> in the kernel, and the other in the user space flow manager. I think the
> >> issue is that the two have different code.
> >
> > We have kind of same duplication in kernel already :)
> >
> > __skb_get_rxhash() and net/sched/cls_flow.c contain roughly the same
> > logic...
> >
> > Maybe its time to factorize the thing, eventually use it in a third
> > component (Open vSwitch...)
>
> Yes.
A third reason to do that anyway is that net/sched/sch_sfb.c should use
__skb_get_rxhash() while providing the perturbation itself, and not use
the standard (hashrnd) one.
Right now, if two flows share the same rxhash, the two SFB hashes will
also be the same for both flows.
(This point was mentioned by Florian Westphal)
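The collision issue can be illustrated with a toy model. This is a sketch
under loud assumptions, not the sch_sfb.c code: `weak_rxhash` is a
deliberately tiny hash used to force a collision, and `keyed_hash`,
`derived_hashes` and `independent_hashes` are hypothetical helpers.

```python
import hashlib


def keyed_hash(value, key):
    """32-bit keyed hash (stand-in for a hash mixed with a perturbation)."""
    data = repr((key, value)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def weak_rxhash(flow):
    # Deliberately tiny hash space so that two flows collide in this example.
    return sum(flow) % 4


def derived_hashes(flow, perturbations):
    """Both hashes are derived from one shared base hash of the flow."""
    base = weak_rxhash(flow)
    return tuple(keyed_hash(base, p) for p in perturbations)


def independent_hashes(flow, perturbations):
    """Each hash is computed over the flow itself with its own perturbation."""
    return tuple(keyed_hash(flow, p) for p in perturbations)


flow_x = (1, 2, 3)
flow_y = (3, 2, 1)          # same weak_rxhash as flow_x
perturbations = (11, 22)    # assumed per-hash perturbation values

shared = derived_hashes(flow_x, perturbations) == derived_hashes(flow_y, perturbations)
separate = independent_hashes(flow_x, perturbations) == independent_hashes(flow_y, perturbations)
print(shared, separate)
```

When both hashes are derived from the shared base hash, a collision in the
base hash carries over to both; hashing the flow with each perturbation
keeps the two results independent.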
^ permalink raw reply
* Re: [PATCH] natsemi: make cable length magic configurable
From: David Miller @ 2011-11-25 6:37 UTC (permalink / raw)
To: jdelvare; +Cc: netdev, thockin, okir
In-Reply-To: <201111241443.59191.jdelvare@suse.de>
From: Jean Delvare <jdelvare@suse.de>
Date: Thu, 24 Nov 2011 14:43:59 +0100
> We had a customer report concerning problems with a Natsemi DP83815-D
> and long cables. With 100m cables, the network would be essentially dead,
> not a single packet would get through either way. We had to apply the
> patch below to make it work.
Please do not add new private device driver module knobs.
Instead, add generic flags to the ethtool interface or similar to
control device behavior.
^ permalink raw reply