* [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
@ 2011-12-19 21:22 rshriram
From: rshriram @ 2011-12-19 21:22 UTC (permalink / raw)
To: hadi; +Cc: netdev, Brendan Cully, Shriram Rajagopalan
This qdisc can be used to implement output buffering, an essential
functionality required for consistent recovery in checkpoint based
fault tolerance systems. The qdisc supports two operations - plug and
unplug. When the qdisc receives a plug command via netlink request,
packets arriving henceforth are buffered until a corresponding unplug
command is received.
Its intention is to support speculative execution by allowing generated
network traffic to be rolled back. It is used to provide network
protection for domUs in the Remus high availability project, available as
part of Xen. This module is generic enough to be used by any other
system that wishes to add speculative execution and output buffering to
its applications.
This module was originally available in the linux 2.6.32 PV-OPS tree,
used as dom0 for Xen.
For more information, please refer to http://nss.cs.ubc.ca/remus/
and http://wiki.xensource.com/xenwiki/Remus
Signed-off-by: Brendan Cully <brendan@cs.ubc.ca>
Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
[shriram - ported the code from older 2.6.32 to current tree]
---
net/sched/Kconfig | 19 ++++++
net/sched/Makefile | 1 +
net/sched/sch_plug.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 179 insertions(+), 0 deletions(-)
create mode 100644 net/sched/sch_plug.c
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2590e91..d0ccefa 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -260,6 +260,25 @@ config NET_SCH_INGRESS
To compile this code as a module, choose M here: the
module will be called sch_ingress.
+config NET_SCH_PLUG
+ tristate "Plug network traffic until release (PLUG)"
+ ---help---
+ Say Y here if you are using this kernel for Xen dom0 and
+ want to protect Xen guests with Remus.
+
+ This queueing discipline is controlled by netlink. When it receives an
+ enqueue command it inserts a plug into the outbound queue that causes
+ following packets to enqueue until a dequeue command arrives over
+ netlink, releasing packets up to the plug for delivery.
+
+ This module provides "output buffering" functionality in the Remus HA
+ project. It enables speculative execution of virtual machines by allowing
+ the generated network output to be rolled back if needed. For more
+ information, please refer to http://wiki.xensource.com/xenwiki/Remus
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_plug.
+
comment "Classification"
config NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index dc5889c..8cdf4e2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ) += sch_multiq.o
obj-$(CONFIG_NET_SCH_ATM) += sch_atm.o
obj-$(CONFIG_NET_SCH_NETEM) += sch_netem.o
obj-$(CONFIG_NET_SCH_DRR) += sch_drr.o
+obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o
obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
obj-$(CONFIG_NET_SCH_CHOKE) += sch_choke.o
obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o
diff --git a/net/sched/sch_plug.c b/net/sched/sch_plug.c
new file mode 100644
index 0000000..7436498
--- /dev/null
+++ b/net/sched/sch_plug.c
@@ -0,0 +1,159 @@
+/*
+ * sch_plug.c Queue traffic until an explicit release command
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * The operation of the buffer is as follows:
+ * When a checkpoint begins, a plug is inserted into the
+ * network queue by a netlink request (it operates by storing
+ * a pointer to the next packet which arrives and blocking dequeue
+ * when that packet is at the head of the queue).
+ * When a checkpoint completes (the backup acknowledges receipt),
+ * currently-queued packets are released.
+ * So it supports two operations, plug and unplug.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+
+#define FIFO_BUF (10*1024*1024)
+
+#define TCQ_PLUG 0
+#define TCQ_UNPLUG 1
+
+struct plug_sched_data {
+ /*
+ * stop points to the first packet which should not be
+ * delivered. If it is NULL, plug_enqueue will set it to the
+ * next packet it sees.
+ *
+ * release is the last packet in the fifo that can be
+ * released.
+ */
+ struct sk_buff *stop, *release;
+};
+
+struct tc_plug_qopt {
+ /* 0: reset stop packet pointer
+ * 1: dequeue to release pointer */
+ int action;
+};
+
+static int skb_remove_foreign_references(struct sk_buff *skb)
+{
+ return !skb_linearize(skb);
+}
+
+static int plug_enqueue(struct sk_buff *skb, struct Qdisc* sch)
+{
+ struct plug_sched_data *q = qdisc_priv(sch);
+
+ if (likely(sch->qstats.backlog + skb->len <= FIFO_BUF)) {
+ if (!q->stop)
+ q->stop = skb;
+
+ if (!skb_remove_foreign_references(skb)) {
+ printk(KERN_DEBUG "error removing foreign ref\n");
+ return qdisc_reshape_fail(skb, sch);
+ }
+
+ return qdisc_enqueue_tail(skb, sch);
+ }
+ printk(KERN_WARNING "queue reported full: %d,%d\n",
+ sch->qstats.backlog, skb->len);
+
+ return qdisc_reshape_fail(skb, sch);
+}
+
+/* dequeue doesn't actually dequeue until the release command is
+ * received. */
+static struct sk_buff *plug_dequeue(struct Qdisc* sch)
+{
+ struct plug_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *peek;
+
+ if (qdisc_is_throttled(sch))
+ return NULL;
+
+ peek = (struct sk_buff *)((sch->q).next);
+
+ /* this pointer comparison may be shady */
+ if (peek == q->release) {
+ /*
+ * This is the tail of the last round. Release it and
+ * block the queue
+ */
+ qdisc_throttled(sch);
+ return NULL;
+ }
+
+ return qdisc_dequeue_head(sch);
+}
+
+static int plug_init(struct Qdisc *sch, struct nlattr *opt)
+{
+ qdisc_throttled(sch);
+ return 0;
+}
+
+/*
+ * receives two messages:
+ * 0: checkpoint queue (set stop to next packet)
+ * 1: dequeue until stop
+ */
+static int plug_change(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct plug_sched_data *q = qdisc_priv(sch);
+ struct tc_plug_qopt *msg;
+
+ if (!opt || nla_len(opt) < sizeof(*msg))
+ return -EINVAL;
+
+ msg = nla_data(opt);
+
+ if (msg->action == TCQ_PLUG) {
+ /* reset stop */
+ q->stop = NULL;
+ } else if (msg->action == TCQ_UNPLUG) {
+ /* dequeue */
+ q->release = q->stop;
+ qdisc_unthrottled(sch);
+ netif_schedule_queue(sch->dev_queue);
+ } else {
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+struct Qdisc_ops plug_qdisc_ops = {
+ .id = "plug",
+ .priv_size = sizeof(struct plug_sched_data),
+ .enqueue = plug_enqueue,
+ .dequeue = plug_dequeue,
+ .peek = qdisc_peek_head,
+ .init = plug_init,
+ .change = plug_change,
+ .owner = THIS_MODULE,
+};
+
+static int __init plug_module_init(void)
+{
+ return register_qdisc(&plug_qdisc_ops);
+}
+
+static void __exit plug_module_exit(void)
+{
+ unregister_qdisc(&plug_qdisc_ops);
+}
+module_init(plug_module_init)
+module_exit(plug_module_exit)
+MODULE_LICENSE("GPL");
--
1.7.0.4
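
As an illustration of the control path described in the changelog, a plug or
unplug request could be issued from user space roughly as sketched below. This
is an untested sketch, not part of the patch: it assumes the qdisc is already
attached as the root qdisc of ifb0 (the setup described later in the thread;
stock tc has no plug support, so attaching it would need a patched tc or a
separate netlink request), and plug_msg is a local mirror of struct
tc_plug_qopt from sch_plug.c, which the patch does not expose through a uapi
header.

/*
 * Rough user-space sketch (untested) of sending the TCQ_PLUG/TCQ_UNPLUG
 * actions to plug_change() over rtnetlink. Assumes the plug qdisc is
 * already installed as the root qdisc of the given device.
 */
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/pkt_sched.h>

#define TCQ_PLUG   0
#define TCQ_UNPLUG 1

struct plug_msg { int action; };      /* same layout as tc_plug_qopt */

static void plug_ctrl(const char *dev, int action)
{
	struct {
		struct nlmsghdr n;
		struct tcmsg    t;
		char            buf[256];
	} req;
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
	struct plug_msg qopt = { .action = action };
	struct rtattr *rta;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0) {
		perror("socket");
		return;
	}

	memset(&req, 0, sizeof(req));
	req.n.nlmsg_len   = NLMSG_LENGTH(sizeof(struct tcmsg));
	req.n.nlmsg_type  = RTM_NEWQDISC;        /* "change" an existing qdisc */
	req.n.nlmsg_flags = NLM_F_REQUEST;
	req.t.tcm_family  = AF_UNSPEC;
	req.t.tcm_ifindex = if_nametoindex(dev);
	req.t.tcm_parent  = TC_H_ROOT;           /* root qdisc of the device */

	/* TCA_KIND = "plug" */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.n.nlmsg_len));
	rta->rta_type = TCA_KIND;
	rta->rta_len  = RTA_LENGTH(sizeof("plug"));
	memcpy(RTA_DATA(rta), "plug", sizeof("plug"));
	req.n.nlmsg_len = NLMSG_ALIGN(req.n.nlmsg_len) + RTA_ALIGN(rta->rta_len);

	/* TCA_OPTIONS carries the action word read by plug_change() */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.n.nlmsg_len));
	rta->rta_type = TCA_OPTIONS;
	rta->rta_len  = RTA_LENGTH(sizeof(qopt));
	memcpy(RTA_DATA(rta), &qopt, sizeof(qopt));
	req.n.nlmsg_len = NLMSG_ALIGN(req.n.nlmsg_len) + RTA_ALIGN(rta->rta_len);

	if (sendto(fd, &req, req.n.nlmsg_len, 0,
		   (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
		perror("sendto");
	close(fd);
}

int main(void)
{
	plug_ctrl("ifb0", TCQ_PLUG);     /* checkpoint starts: buffer new packets */
	/* ... checkpoint commits on the backup host ... */
	plug_ctrl("ifb0", TCQ_UNPLUG);   /* release everything up to the plug */
	return 0;
}

The kernel side only inspects the action word inside TCA_OPTIONS, so this is
all that plug_change() in the patch above would need to see.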
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: Shriram Rajagopalan @ 2011-12-19 21:27 UTC (permalink / raw)
To: netdev
Apologies for the duplicate mail (if any). Also, I am not
able to cc the Maintainer (Jamal Hadi Salim hadi@cyberus.ca).
stat=Deferred: 450 4.7.1 Client host rejected: cannot find your
hostname, [198.162.52.240]
shriram
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: Stephen Hemminger @ 2011-12-19 21:32 UTC (permalink / raw)
To: rshriram; +Cc: hadi, netdev, Brendan Cully
On Mon, 19 Dec 2011 13:22:32 -0800
rshriram@cs.ubc.ca wrote:
> +
> +static int skb_remove_foreign_references(struct sk_buff *skb)
> +{
> + return !skb_linearize(skb);
> +}
> +
This is silly. Just make qdisc work with fragmented skb's.
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: David Miller @ 2011-12-19 21:35 UTC (permalink / raw)
To: rshriram; +Cc: hadi, netdev, brendan
From: rshriram@cs.ubc.ca
Date: Mon, 19 Dec 2011 13:22:32 -0800
> +#define FIFO_BUF (10*1024*1024)
This limit is, at best, extremely arbitrary.
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: David Miller @ 2011-12-19 21:53 UTC (permalink / raw)
To: rshriram; +Cc: netdev
From: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Date: Mon, 19 Dec 2011 15:27:10 -0600
> Apologies for the duplicate mail (if any). Also, I am not
> able to cc the Maintainer (Jamal Hadi Salim hadi@cyberus.ca).
>
> stat=Deferred: 450 4.7.1 Client host rejected: cannot find your
> hostname, [198.162.52.240]
His current email address is: jhs@mojatatu.com
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: Jamal Hadi Salim @ 2011-12-20 14:38 UTC (permalink / raw)
To: rshriram; +Cc: netdev, Brendan Cully
Sorry - I didnt see your earlier CC. Cyberus.ca is probably the
worst service provider in Canada (maybe the world; i am sure there
are better ISPs in the middle of an ocean somewhere, deep underwater
probably).
On Mon, 2011-12-19 at 13:22 -0800, rshriram@cs.ubc.ca wrote:
> This qdisc can be used to implement output buffering, an essential
> functionality required for consistent recovery in checkpoint based
> fault tolerance systems.
I am trying to figure where this qdisc runs - is it in the hypervisor?
> The qdisc supports two operations - plug and
> unplug. When the qdisc receives a plug command via netlink request,
> packets arriving henceforth are buffered until a corresponding unplug
> command is received.
Ok, so plug indicates "start of checkpoint" and unplug the end.
Seems all you want is at a certain point to throttle the qdisc and
later on unplug/unthrottle, correct?
Sounds to me like a generic problem that applies to all qdiscs?
> Its intention is to support speculative execution by allowing generated
> network traffic to be rolled back. It is used to provide network
> protection for domUs in the Remus high availability project, available as
> part of Xen. This module is generic enough to be used by any other
> system that wishes to add speculative execution and output buffering to
> its applications.
Should get a nice demo effect of showing a simple ping working with
zero drops, but: what is the effect of not even having this qdisc?
If you just switch the qdisc to a sleeping state from user space, all
packets arriving at that qdisc will be dropped during the checkpoint
phase (and the kernel code will be tiny or none).
If you do nothing some packets will be buffered and a watchdog
will recover them when conditions become right.
So does this qdisc add anything different?
[Note: In your case when arriving packets find the queue filled up
you will drop lotsa packets - depending on traffic patterns; so
not much different than above]
> +
> +#define FIFO_BUF (10*1024*1024)
Aha.
Technically - use tc to do this. Conceptually:
This is probably what makes you look good in a demo if you have one;
huge freaking buffer. If you are doing a simple ping (or a simple
interactive session like ssh) and you can failover in 5 minutes you
still wont be able to fill that buffer!
Look at the other feedback given to you (Stephen and Dave responded).
If a qdisc is needed, should it not be a classful qdisc?
cheers,
jamal
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: Shriram Rajagopalan @ 2011-12-20 17:05 UTC (permalink / raw)
To: Jamal Hadi Salim; +Cc: netdev, Brendan Cully
On Tue, Dec 20, 2011 at 8:38 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> Sorry - I didnt see your earlier CC. Cyberus.ca is probably the
> worst service provider in Canada (maybe the world; i am sure there
> are better ISPs in the middle of an ocean somewhere, deep underwater
> probably).
>
> On Mon, 2011-12-19 at 13:22 -0800, rshriram@cs.ubc.ca wrote:
>> This qdisc can be used to implement output buffering, an essential
>> functionality required for consistent recovery in checkpoint based
>> fault tolerance systems.
>
> I am trying to figure where this qdisc runs - is it in the hypervisor?
>
In dom0. Basically the setup is like this:
Guest (veth0)
dom0 (vif0.0 --> eth0)
packets coming out of veth0 appear as incoming packets in vif0.0
Once Remus (or any other output commit style system is started)
the following commands could be executed in dom0, to activate this qdisc
ip link set ifb0 up
tc qdisc add dev vif0.0 ingress
tc filter add dev vif0.0 parent ffff: proto ip pref 10 u32 match u32
0 0 action mirred egress redirect dev eth0
>> The qdisc supports two operations - plug and
>> unplug. When the qdisc receives a plug command via netlink request,
>> packets arriving henceforth are buffered until a corresponding unplug
>> command is received.
>
> Ok, so plug indicates "start of checkpoint" and unplug the end.
> Seems all you want is at a certain point to throttle the qdisc and
> later on unplug/unthrottle, correct?
> Sounds to me like a generic problem that applies to all qdiscs?
>
Oh yes. But throttle functionality doesnt seem to be implemented
in the qdisc code base. there is a throttle flag but looking at the
qdisc scheduler, I could see no references to this flag or related actions.
That is why in the dequeue function, I return NULL until the release pointer
is set.
And we need a couple of netlink api calls (PLUG/UNPLUG) to manipulate
the qdisc from userspace
>> Its intention is to support speculative execution by allowing generated
>> network traffic to be rolled back. It is used to provide network
>> protection for domUs in the Remus high availability project, available as
>> part of Xen. This module is generic enough to be used by any other
>> system that wishes to add speculative execution and output buffering to
>> its applications.
>
> Should get a nice demo effect of showing a simple ping working with
> zero drops,
I have done better. When Remus is activated for a Guest VM, I pull the
plug from the primary physical host and the ssh connection to the domU,
with top command running (and/or xeyes) continues to run seamlessly. ;)
>but: what is the effect of not even having this qdisc?
When the guest VM recovers on another physical host, it is restored to the
most recent checkpoint. With a checkpoint frequency of 25ms, in the worst case
on failover, one checkpoint worth of execution could be lost. With the loss of
the physical machine and its output buffer, the packets (tcp,udp, etc)
are also lost.
But that does not affect the "consistency" of the state of Guest and
client connections,
as the client (say TCP clients) only think that there was some packet
loss and "resend"
the packets. The resuming Guest VM on the backup host would pickup the
connection
from where it left off.
OTOH, if the packets were released before the checkpoint was done,
then we have the
classic orphaned messages problem. If the client and the Guest VM
exchange a bunch
of packets, the TCP window moves. When the Guest VM resumes on the backup host,
it is basically rolled back in time (by 25ms or so), i.e. it does not
know about the shift
in the tcp window. Hence, the client and Guest's tcp sequence numbers
would be out of
sync and the connection would hang. For other protocols like UDP etc,
the "unmodified"
client may be capable of handling packet losses but not the server
application forgetting
about stuff it had acknowledged.
> If you just switch the qdisc to a sleeping state from user space, all
> packets arriving at that qdisc will be dropped during the checkpoint
> phase (and the kernel code will be tiny or none).
Which we dont want. I want to buffer the packets and then release them
once the checkpoint is committed at the backup host.
> If you do nothing some packets will be buffered and a watchdog
> will recover them when conditions become right.
Doing nothing is sort of what this qdisc does, so that the packets
get buffered. The watchdog in this case is the user space process
that sends the netlink command UNPLUG to release the packets.
(aka conditions become right).
One could do this with any qdisc but the catch here is that the qdisc
stops releasing packets only when it hits the stop pointer (ie start of
the next checkpoint buffer). Simply putting a qdisc to sleep would
prevent it from releasing packets from the current buffer (whose checkpoint
has been acked).
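To make the difference concrete, below is a stand-alone toy model (purely
illustrative user-space C, not kernel code; the fixed-size array and integer
packet ids are simplifications) of the stop/release bookkeeping, showing that
an unplug drains only the packets queued before the most recent plug while
newer packets stay buffered:

#include <stdio.h>

#define QLEN 64

static int queue[QLEN];       /* packet ids in FIFO order                */
static int head, tail;        /* dequeue at head, enqueue at tail        */
static int stop = -1;         /* first packet that must NOT go out yet   */
static int release = -1;      /* drain up to (not including) this index  */
static int throttled = 1;     /* starts blocked, as in plug_init()       */

static void enqueue(int pkt)
{
	if (stop < 0)
		stop = tail;  /* first packet after a plug marks the boundary */
	if (tail < QLEN)
		queue[tail++] = pkt;
}

static void plug(void)   { stop = -1; }                       /* TCQ_PLUG   */
static void unplug(void) { release = stop; throttled = 0; }   /* TCQ_UNPLUG */

static void drain(void)   /* what dequeue does when the device pulls packets */
{
	while (!throttled && head < tail) {
		if (head == release) {  /* reached the plug: block again */
			throttled = 1;
			break;
		}
		printf("sent packet %d\n", queue[head++]);
	}
}

int main(void)
{
	plug();                  /* epoch 1 begins                           */
	enqueue(1); enqueue(2);
	plug();                  /* epoch 2 begins                           */
	enqueue(3);
	unplug();                /* epoch 1's checkpoint committed           */
	drain();                 /* prints 1 and 2; packet 3 stays buffered  */
	return 0;
}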
> So does this qdisc add anything different?
>
> [Note: In your case when arriving packets find the queue filled up
> you will drop lotsa packets - depending on traffic patterns; so
> not much different than above]
>
>> +
>> +#define FIFO_BUF (10*1024*1024)
>
> Aha.
> Technically - use tc to do this. Conceptually:
> This is probably what makes you look good in a demo if you have one;
> huge freaking buffer. If you are doing a simple ping (or a simple
> interactive session like ssh) and you can failover in 5 minutes you
> still wont be able to fill that buffer!
>
I agree. Its more sensible to configure it via the tc command. I would
probably have to end up issuing patches for the tc code base too.
> Look at the other feedback given to you (Stephen and Dave responded).
>
> If a qdisc is needed, should it not be a classful qdisc?
>
I dont understand. why ?
shriram
> cheers,
> jamal
>
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: Jamal Hadi Salim @ 2011-12-21 14:54 UTC (permalink / raw)
To: rshriram; +Cc: netdev, Brendan Cully
On Tue, 2011-12-20 at 11:05 -0600, Shriram Rajagopalan wrote:
> In dom0. Basically the setup is like this:
> Guest (veth0)
> dom0 (vif0.0 --> eth0)
>
> packets coming out of veth0 appear as incoming packets in vif0.0
> Once Remus (or any other output commit style system is started)
> the following commands could be executed in dom0, to activate this qdisc
> ip link set ifb0 up
> tc qdisc add dev vif0.0 ingress
> tc filter add dev vif0.0 parent ffff: proto ip pref 10 u32 match u32
> 0 0 action mirred egress redirect dev eth0
Ok. To fill in the blank there, I believe the qdisc will be attached to
ifb0? Nobody cares about latency? You have > 20K packets being
accumulated here ...
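(For scale: assuming the queue is bounded only by the 10 MB FIFO_BUF limit,
10*1024*1024 bytes holds about 7,000 full-size 1500-byte frames, or roughly
20,000 packets at a 500-byte average.)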
Also assuming that this is a setup thing that happens for every
guest that needs checkpointing
>
> Oh yes. But throttle functionality doesnt seem to be implemented
> in the qdisc code base. there is a throttle flag but looking at the
> qdisc scheduler, I could see no references to this flag or related actions.
> That is why in the dequeue function, I return NULL until the release pointer
> is set.
Thats along the lines i was thinking of. If you could set the throttle
flag (more work involved than i am suggesting), then you solve the
problem with any qdisc.
Your qdisc is essentially a bfifo with throttling.
> And we need a couple of netlink api calls (PLUG/UNPLUG) to manipulate
> the qdisc from userspace
Right - needed to "set the throttle flag" part.
> I have done better. When Remus is activated for a Guest VM, I pull the
> plug from the primary physical host and the ssh connection to the domU,
> with top command running (and/or xeyes) continues to run seamlessly. ;)
I dont think that kind of traffic will fill up your humongous queue.
You need some bulk transfers going on filling the link bandwidth to see
the futility of the queue.
> When the guest VM recovers on another physical host, it is restored to the
> most recent checkpoint. With a checkpoint frequency of 25ms, in the worst case
> on failover, one checkpoint worth of execution could be lost.
Out of curiosity: how much traffic do you end up generating for just
checkpointing? Is this using a separate link?
I am also a little lost: Are you going to plug/unplug every time you
checkpoint? i.e we are going to have 2 netlink messages every 25ms
for above?
> With the loss of
> the physical machine and its output buffer, the packets (tcp,udp, etc)
> are also lost.
>
> But that does not affect the "consistency" of the state of Guest and
> client connections,
> as the client (say TCP clients) only think that there was some packet
> loss and "resend"
> the packets. The resuming Guest VM on the backup host would pickup the
> connection
> from where it left off.
Yes, this is what i was trying to get to. If you dont buffer,
the end hosts will recover anyways. The value being no code
changes needed. Not just that, I am pointing that buffering
in itself is not very useful when the link is being used to its
full capacity.
[Sorry, I feel I am treading on questioning the utility of what
you are doing but i cant help myself, it is just my nature to
question;-> In a past life i may have been a relative of some ancient
Greek philosopher.]
> OTOH, if the packets were released before the checkpoint was done,
> then we have the
> classic orphaned messages problem. If the client and the Guest VM
> exchange a bunch of packets, the TCP window moves.
> When the Guest VM resumes on the backup host,
> it is basically rolled back in time (by 25ms or so), i.e. it does not
> know about the shift
> in the tcp window. Hence, the client and Guest's tcp sequence numbers
> would be out of
> sync and the connection would hang.
So this part is interesting - but i wonder if the issue is not so
much the window moved but some other bug or misfeature or i am missing
something. Shouldnt the sender just get an ACK with signifying the
correct sequence number? Also, if you have access to the receivers(new
guest) sequence number could you not "adjust it" based on the checkpoint
messages?
> Which we dont want. I want to buffer the packets and then release them
> once the checkpoint is committed at the backup host.
And whose value is still unclear to me.
> One could do this with any qdisc but the catch here is that the qdisc
> stops releasing packets only when it hits the stop pointer (ie start of
> the next checkpoint buffer). Simply putting a qdisc to sleep would
> prevent it from releasing packets from the current buffer (whose checkpoint
> has been acked).
I meant putting it to sleep when you plug and waking it when you
unplug. By sleep i meant blackholing for the checkpointing +
recovery period. I understand that dropping is not what you want to
achieve because you see value in buffering.
> I agree. Its more sensible to configure it via the tc command. I would
> probably have to end up issuing patches for the tc code base too.
I think you MUST do this. We dont believe in hardcoding anything. You
are playing in policy management territory. Let the control side worry
about policies. Ex: I think if you knew the bandwidth of the link and
the checkpointing frequency, you could come up with a reasonable buffer
size.
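As a rough worked example of that sizing rule (the numbers are assumed for
illustration, not taken from the thread): a link saturated at 1 Gbit/s
produces about 125 MB/s * 0.05 s, i.e. roughly 6.25 MB per 50 ms epoch, and a
100 Mbit/s link only about 312 KB per 25 ms epoch, so a limit derived from
link rate times epoch length would normally sit far below the hardcoded 10 MB.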
> > Look at the other feedback given to you (Stephen and Dave responded).
> >
> > If a qdisc is needed, should it not be a classful qdisc?
> >
>
> I dont understand. why ?
I was coming from the same reasoning i used earlier, i.e
this sounds like a generic problem.
You are treading into policy management and deciding what is best
for the guest user. We have an infrastructure that allows the admin
to setup policies based on traffic characteristics. You are limiting
this feature to be only used by folks who have no choice but to use
your qdisc. I cant isolate latency sensitive traffic like an slogin
sitting behind 20K scp packets etc. If you make it classful, that
isolation can be added etc. Not sure if i made sense.
Alternatively, this seems to me like a bfifo qdisc that needs to have
throttle support that can be controlled from user space.
cheers,
jamal
* Re: [PATCH] net/sched: sch_plug - Queue traffic until an explicit release command
From: Shriram Rajagopalan @ 2012-01-05 18:15 UTC (permalink / raw)
To: netdev, Jamal Hadi Salim
I apologize if this is a duplicate. Previous one got bumped due to
HTML formatting :(
On 2011-12-21, at 6:54 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Tue, 2011-12-20 at 11:05 -0600, Shriram Rajagopalan wrote:
>
>> In dom0. Basically the setup is like this:
>> Guest (veth0)
>> dom0 (vif0.0 --> eth0)
>>
>> packets coming out of veth0 appear as incoming packets in vif0.0
>> Once Remus (or any other output commit style system is started)
>> the following commands could be executed in dom0, to activate this qdisc
>> ip link set ifb0 up
>> tc qdisc add dev vif0.0 ingress
>> tc filter add dev vif0.0 parent ffff: proto ip pref 10 u32 match u32
>> 0 0 action mirred egress redirect dev eth0
>
> Ok. To fill in the blank there, I believe the qdisc will be attached to
> ifb0? Nobody cares about latency? You have > 20K packets being
> accumulated here ...
>
Yes, the qdisc is attached to ifb0.
There is a latency overhead (release packets only after checkpoint is committed)
but that is an artifact of the checkpoint/output buffering principle itself.
Queuing delays at the qdisc during buffer release are generally negligible.
The checkpoint intervals are very small (typically 30-50ms). Within
such a short execution
period, I don't expect the guest to generate 20K packets. It's an
extremely pathological case.
Also there is at most one outstanding buffer to be released (not a
chain of buffers).
So there is no possibility of a large packet accumulation across
multiple output buffers.
> Also assuming that this is a setup thing that happens for every
> guest that needs checkpointing
>
Yep. Every guest has its own output buffer (and hence a dedicated ifb
and a qdisc).
>>
>> Oh yes. But throttle functionality doesnt seem to be implemented
>> in the qdisc code base. there is a throttle flag but looking at the
>> qdisc scheduler, I could see no references to this flag or related actions.
>> That is why in the dequeue function, I return NULL until the release pointer
>> is set.
>
> Thats along the lines i was thinking of. If you could set the throttle
> flag (more work involved than i am suggesting), then you solve the
> problem with any qdisc.
> Your qdisc is essentially a bfifo with throttling.
Not as simple as that. The throttle functionality you are talking
about is "throttle immediately".
The functionality I am talking about is "throttle at next incoming
packet" (i.e. continue dequeuing existing packets in the queue).
There is a subtle difference between the two. I could implement the
(missing) throttle functionality
in the second version, but that would be unintuitive to users, don't you think?
>
>> And we need a couple of netlink api calls (PLUG/UNPLUG) to manipulate
>> the qdisc from userspace
>
> Right - needed to "set the throttle flag" part.
>
>> I have done better. When Remus is activated for a Guest VM, I pull the
>> plug from the primary physical host and the ssh connection to the domU,
>> with top command running (and/or xeyes) continues to run seamlessly. ;)
>
> I dont think that kind of traffic will fill up your humongous queue.
> You need some bulk transfers going on filling the link bandwidth to see
> the futility of the queue.
Yes I know. I was just citing an example :).
>
>> When the guest VM recovers on another physical host, it is restored to the
>> most recent checkpoint. With a checkpoint frequency of 25ms, in the worst case
>> on failover, one checkpoint worth of execution could be lost.
>
> Out of curiosity: how much traffic do you end up generating for just
> checkpointing? Is this using a separate link?
>
Yes, the checkpoint traffic is on a separate link. The amount of traffic
is purely application-workload dependent. I have seen traffic ranging
from 3 Mbit/s to 80+ Mbit/s.
> I am also a little lost: Are you going to plug/unplug every time you
> checkpoint? i.e we are going to have 2 netlink messages every 25ms
> for above?
>
Yes. That's the idea.
>> With the loss of
>> the physical machine and its output buffer, the packets (tcp,udp, etc)
>> are also lost.
>>
>> But that does not affect the "consistency" of the state of Guest and
>> client connections,
>> as the client (say TCP clients) only think that there was some packet
>> loss and "resend"
>> the packets. The resuming Guest VM on the backup host would pickup the
>> connection
>> from where it left off.
>
> Yes, this is what i was trying to get to. If you dont buffer,
> the end hosts will recover anyways. The value being no code
> changes needed. Not just that, I am pointing that buffering
> in itself is not very useful when the link is being used to its
> full capacity.
>
> [Sorry, I feel I am treading on questioning the utility of what
> you are doing but i cant help myself, it is just my nature to
> question;-> In a past life i may have been a relative of some ancient
> Greek philosopher.]
>
>> OTOH, if the packets were released before the checkpoint was done,
>> then we have the
>> classic orphaned messages problem. If the client and the Guest VM
>> exchange a bunch of packets, the TCP window moves.
>> When the Guest VM resumes on the backup host,
>> it is basically rolled back in time (by 25ms or so), i.e. it does not
>> know about the shift
>> in the tcp window. Hence, the client and Guest's tcp sequence numbers
>> would be out of
>> sync and the connection would hang.
>
> So this part is interesting - but i wonder if the issue is not so
> much the window moved but some other bug or misfeature or i am missing
> something. Shouldn't the sender just get an ACK signifying the
> correct sequence number?
It does. But that seq number points to a value greater than the
sender's seq number.
> Also, if you have access to the receivers(new
> guest) sequence number could you not "adjust it" based on the checkpoint
> messages?
Think about a client connection with a database server. Without
buffering, the server
acknowledges a transaction commit and then the client performs more
operations. Now the
server fails and when it recovers on "another" machine, it has no
knowledge of the first transaction.
We could play tricks with the seq numbers at the server side but
what's the use? The client has
already seen a non-existent state.
The sequence number fixing might work for some stateless services but
when you do something
completely transparent like Remus that considers the entire Guest as
one big opaque state
machine, these techniques will fail.
>
>> Which we dont want. I want to buffer the packets and then release them
>> once the checkpoint is committed at the backup host.
>
> And whose value is still unclear to me.
The backup host always knows about every output the primary has sent or
intends to send.
Consider the database example above. With this qdisc, the server's
transaction commit
acknowledgment message will still be in the primary host's output
buffer. If the primary host crashes,
the backup host will resume VM execution from the last checkpoint
(consistent with respect to disk,
memory, network state).
If this last checkpoint does not even have a "commit txn" req from the
client, the client would resend
the req (TCP retransmit), the server commits & acks. The disk state is "consistent".
If the last checkpoint does indicate that the server has sent the commit
txn ack, then when the client resends the "commit txn", the TCP stack
recognizes the packet as a duplicate and resends the packet containing
the "commit ack".
In both cases, server and client are in consistent state.
>
>> One could do this with any qdisc but the catch here is that the qdisc
>> stops releasing packets only when it hits the stop pointer (ie start of
>> the next checkpoint buffer). Simply putting a qdisc to sleep would
>> prevent it from releasing packets from the current buffer (whose checkpoint
>> has been acked).
>
> I meant putting it to sleep when you plug and waking it when you
> unplug. By sleep i meant blackholing for the checkpointing +
> recovery period. I understand that dropping is not what you want to
> achieve because you see value in buffering.
>
>> I agree. Its more sensible to configure it via the tc command. I would
>> probably have to end up issuing patches for the tc code base too.
>
> I think you MUST do this. We dont believe in hardcoding anything. You
> are playing in policy management territory. Let the control side worry
> about policies. Ex: I think if you knew the bandwidth of the link and
> the checkpointing frequency, you could come up with a reasonable buffer
> size.
>
>>> Look at the other feedback given to you (Stephen and Dave responded).
>>>
>>> If a qdisc is needed, should it not be a classful qdisc?
>>>
>>
>> I dont understand. why ?
>
> I was coming from the same reasoning i used earlier, i.e
> this sounds like a generic problem.
> You are treading into policy management and deciding what is best
> for the guest user. We have an infrastructure that allows the admin
> to setup policies based on traffic characteristics. You are limiting
> this feature to be only used by folks who have no choice but to use
> your qdisc. I cant isolate latency sensitive traffic like an slogin
> sitting behind 20K scp packets etc. If you make it classful, that
> isolation can be added etc. Not sure if i made sense.
>
> Alternatively, this seems to me like a bfifo qdisc that needs to have
> throttle support that can be controlled from user space.
>
> cheers,
> jamal
>
>
>
cheers
Shriram