* [1/2] CARP implementation. HA master's failover.
[not found] <1089898303.6114.859.camel@uganda>
@ 2004-07-15 13:36 ` Evgeniy Polyakov
2004-07-15 14:44 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 13:36 UTC (permalink / raw)
To: netdev; +Cc: netfilter-failover
[-- Attachment #1.1: Type: text/plain, Size: 2096 bytes --]
On Thu, 2004-07-15 at 17:31, Evgeniy Polyakov wrote:
> Hello, network developers.
>
> I'm glad to introduce a CARP failover mechanism implementation.
> It is based on OpenBSD's CARP protocol but is not compatible with it,
> since OpenBSD's implementation does not contain protection against
> replayed messages.
>
> The main goal of the project is to implement CARP + firewall sync, but
> the second part is already implemented by Harald Welte <laforge@gnumonks.org> and
> KOVACS Krisztian <hidden@sch.bme.hu> in ct_sync.
>
> By design each node has its own advertisement base and skew, and the node
> with the least timeval constructed from them becomes the master.
> It keeps advertising its base and skew until it shuts down or until another
> node advertises a lower base+skew pair.
> CARP currently uses only IPv4 multicast, but can easily be changed to
> use IPv6.
> Each CARP packet contains a unique 64-bit counter together with its SHA1
> HMAC digest computed with a 20-byte secret key. By design this counter is
> incremented on the master before sending and on the backup while receiving.
> If the master's and backup's counters do not match on reception, the
> backup node drops the packet, thus preventing replay attacks.
> When the master has not sent any packet within the predefined interval,
> or its base+skew is bigger than that of the remote node, that node
> becomes the master and begins to advertise.
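>
> As a minimal sketch of the election described above, using the
> advbase/advskew encoding from the attached carp.c (simplified: the module
> also special-cases advbase < 240 for the local node; become_master() and
> become_backup() are stand-ins for carp_set_state()):
>
> 	struct timeval local, remote;
>
> 	local.tv_sec   = local_advbase;
> 	local.tv_usec  = local_advskew * 1000000 / 256;
> 	remote.tv_sec  = remote_advbase;
> 	remote.tv_usec = remote_advskew * 1000000 / 256;
>
> 	if (timeval_before(&remote, &local))
> 		become_backup();	/* the remote pair is lower, defer to it */
> 	else
> 		become_master();	/* our pair wins, keep advertising */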
>
> CARP has two work queues, for "became_master" and "became_backup" events.
> Handlers for these events may easily be registered at runtime by external
> modules; a sketch follows below. One such handler may, for example, send a
> netlink message to ct_sync and/or a userspace daemon, which will flush
> iptables rules, bring interfaces up or down, and so on...
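>
> A sketch of what such a registration might look like from an external
> module; the helper name below is an assumption, since carp_queue.h is not
> part of this posting:
>
> 	/* Hypothetical registration against the became_master work queue. */
> 	static void my_master_handler(void *data)
> 	{
> 		/* e.g. notify a userspace daemon to reload iptables rules */
> 		printk(KERN_INFO "carp: this node became master\n");
> 	}
>
> 	carp_register_handler(MASTER_QUEUE, my_master_handler, NULL);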
>
> Please review and comment.
>
> The code against 2.6 is attached
> in the next two e-mails, since netfilter-failover@lists.netfilter.org doesn't accept
> e-mails larger than 40kb.
>
> The code is also available at
> http://www.2ka.mipt.ru/~johnpol/carp_latest.tar.gz
--
Evgeniy Polyakov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
[-- Attachment #1.2: carp.c --]
[-- Type: text/x-csrc, Size: 19948 bytes --]
/*
* carp.c
*
* 2004 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
* All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
*/
#include <linux/config.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/sched.h>
#include <linux/kernel.h>
#include <asm/uaccess.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/in.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include <linux/if_arp.h>
#include <linux/mroute.h>
#include <linux/init.h>
#include <linux/in6.h>
#include <linux/inetdevice.h>
#include <linux/igmp.h>
#include <linux/netfilter_ipv4.h>
#include <linux/crypto.h>
#include <linux/random.h>
#include <net/sock.h>
#include <net/ip.h>
#include <net/icmp.h>
#include <net/protocol.h>
#include <net/arp.h>
#include <net/checksum.h>
#include <net/inet_ecn.h>
#include <asm/scatterlist.h>
#ifdef CONFIG_IPV6
#include <net/ipv6.h>
#include <net/ip6_fib.h>
#include <net/ip6_route.h>
#endif
#include "carp.h"
#include "carp_log.h"
#include "carp_queue.h"
#include "carp_ioctl.h"
#define timeval_before(before, after) \
(((before)->tv_sec == (after)->tv_sec) ? ((before)->tv_usec < (after)->tv_usec) : ((before)->tv_sec < (after)->tv_sec))
static int carp_init(struct net_device *);
static void carp_uninit(struct net_device *);
static void carp_setup(struct net_device *);
static int carp_close(struct net_device *);
static int carp_open(struct net_device *);
static int carp_ioctl (struct net_device *, struct ifreq *, int);
static int carp_check_params(struct carp_ioctl_params);
static void carp_err(struct sk_buff *, u32);
static int carp_rcv(struct sk_buff *);
static int carp_xmit(struct sk_buff *, struct net_device *);
static struct net_device_stats *carp_get_stats(struct net_device *);
static void carp_hmac_sign(struct carp_priv *, struct carp_header *);
static int carp_hmac_verify(struct carp_priv *, struct carp_header *);
static u32 inline addr2val(u8, u8, u8, u8);
static void carp_set_state(struct carp_priv *, enum carp_state);
static void carp_master_down(unsigned long);
static void carp_advertise(unsigned long);
static int __init device_carp_init(void);
void __exit device_carp_fini(void);
static struct net_device *carp_dev;
static void carp_uninit(struct net_device *dev)
{
struct carp_priv *cp = dev->priv;
if (timer_pending(&cp->md_timer))
del_timer_sync(&cp->md_timer);
if (timer_pending(&cp->adv_timer))
del_timer_sync(&cp->adv_timer);
log("%s\n", __func__);
dev_put(cp->odev);
dev_put(dev);
}
static void carp_err(struct sk_buff *skb, u32 info)
{
log("%s\n", __func__);
kfree_skb(skb);
}
static void carp_hmac_sign(struct carp_priv *cp, struct carp_header *ch)
{
unsigned int keylen = sizeof(cp->carp_key);
struct scatterlist sg;
sg.page = virt_to_page(ch->carp_counter);
sg.offset = ((unsigned long)(ch->carp_counter)) % PAGE_SIZE;
sg.length = sizeof(ch->carp_counter);
crypto_hmac(cp->tfm, cp->carp_key, &keylen, &sg, 1, ch->carp_md);
}
static int carp_hmac_verify(struct carp_priv *cp, struct carp_header *ch)
{
u8 tmp_md[CARP_SIG_LEN];
unsigned int keylen = sizeof(cp->carp_key);
struct scatterlist sg;
sg.page = virt_to_page(ch->carp_counter);
sg.offset = ((unsigned long)(ch->carp_counter)) % PAGE_SIZE;
sg.length = sizeof(ch->carp_counter);
crypto_hmac(cp->tfm, cp->carp_key, &keylen, &sg, 1, tmp_md);
#if 0
{
int i;
printk("calculated: ");
for (i=0; i<CARP_SIG_LEN; ++i)
printk("%02x ", tmp_md[i]);
printk("\n");
printk("from header: ");
for (i=0; i<CARP_SIG_LEN; ++i)
printk("%02x ", ch->carp_md[i]);
printk("\n");
}
#endif
return memcmp(tmp_md, ch->carp_md, CARP_SIG_LEN);
}
static int carp_check_params(struct carp_ioctl_params p)
{
if (p.state != INIT && p.state != BACKUP && p.state != MASTER)
{
log("Wrong state %d.\n", p.state);
return -1;
}
if (!__dev_get_by_name(p.devname))
{
log("No such device %s.\n", p.devname);
return -2;
}
if (p.md_timeout > MAX_MD_TIMEOUT || p.adv_timeout > MAX_ADV_TIMEOUT ||
!p.md_timeout || !p.adv_timeout)
return -3;
return 0;
}
static void carp_set_state(struct carp_priv *cp, enum carp_state state)
{
log("%s: Setting CARP state from %d to %d.\n", __func__, cp->state, state);
cp->state = state;
switch (state)
{
case MASTER:
carp_call_queue(MASTER_QUEUE);
if (!timer_pending(&cp->adv_timer))
mod_timer(&cp->adv_timer, jiffies + cp->adv_timeout*HZ);
break;
case BACKUP:
carp_call_queue(BACKUP_QUEUE);
if (!timer_pending(&cp->md_timer))
mod_timer(&cp->md_timer, jiffies + cp->md_timeout*HZ);
break;
case INIT:
if (!timer_pending(&cp->md_timer))
mod_timer(&cp->md_timer, jiffies + cp->md_timeout*HZ);
break;
}
}
static void carp_master_down(unsigned long data)
{
struct carp_priv *cp = (struct carp_priv *)data;
//log("%s: state=%d.\n", __func__, cp->state);
if (cp->state != MASTER)
{
if (test_bit(CARP_DATA_AVAIL, (long *)&cp->flags))
{
if (!timer_pending(&cp->md_timer))
mod_timer(&cp->md_timer, jiffies + cp->md_timeout*HZ);
}
else
carp_set_state(cp, MASTER);
}
clear_bit(CARP_DATA_AVAIL, (long *)&cp->flags);
}
static int carp_rcv(struct sk_buff *skb)
{
struct iphdr *iph;
struct carp_priv *cp = carp_dev->priv;
struct carp_header *ch;
int err = 0;
u64 tmp_counter;
struct timeval cptv, chtv;
//log("%s: state=%d\n", __func__, cp->state);
spin_lock(&cp->lock);
iph = skb->nh.iph;
ch = (struct carp_header *)skb->data;
//dump_carp_header(ch);
if (ch->carp_version != cp->hdr.carp_version)
{
log("CARP version mismatch: remote=%d, local=%d.\n",
ch->carp_version, cp->hdr.carp_version);
cp->cstat.ver_errors++;
goto err_out_skb_drop;
}
if (ch->carp_vhid != cp->hdr.carp_vhid)
{
log("CARP virtual host id mismatch: remote=%d, local=%d.\n",
ch->carp_vhid, cp->hdr.carp_vhid);
cp->cstat.vhid_errors++;
goto err_out_skb_drop;
}
if (carp_hmac_verify(cp, ch))
{
log("HMAC mismatch.\n");
cp->cstat.hmac_errors++;
goto err_out_skb_drop;
}
tmp_counter = ntohl(ch->carp_counter[0]);
tmp_counter = tmp_counter<<32;
tmp_counter += ntohl(ch->carp_counter[1]);
if (cp->state == BACKUP && ++cp->carp_adv_counter != tmp_counter)
{
log("Counter mismatch: remote=%llu, local=%llu.\n", tmp_counter, cp->carp_adv_counter);
cp->cstat.counter_errors++;
goto err_out_skb_drop;
}
cptv.tv_sec = cp->hdr.carp_advbase;
if (cp->hdr.carp_advbase < 240)
cptv.tv_usec = 240 * 1000000 / 256;
else
cptv.tv_usec = cp->hdr.carp_advskew * 1000000 / 256;
chtv.tv_sec = ch->carp_advbase;
chtv.tv_usec = ch->carp_advskew * 1000000 / 256;
/*log("local=%lu.%lu, remote=%lu.%lu, lcounter=%llu, remcounter=%llu, state=%d\n",
cptv.tv_sec, cptv.tv_usec,
chtv.tv_sec, chtv.tv_usec,
cp->carp_adv_counter, tmp_counter,
cp->state);
*/
set_bit(CARP_DATA_AVAIL, (long *)&cp->flags);
switch (cp->state)
{
case INIT:
if (timeval_before(&chtv, &cptv))
{
cp->carp_adv_counter = tmp_counter;
carp_set_state(cp, BACKUP);
}
else
{
carp_set_state(cp, MASTER);
}
break;
case MASTER:
if (timeval_before(&chtv, &cptv))
{
cp->carp_adv_counter = tmp_counter;
carp_set_state(cp, BACKUP);
}
break;
case BACKUP:
if (timeval_before(&cptv, &chtv))
{
carp_set_state(cp, MASTER);
}
break;
}
err_out_skb_drop:
kfree_skb(skb);
spin_unlock(&cp->lock);
return err;
}
static int carp_xmit(struct sk_buff *skb, struct net_device *dev)
{
#if 0
struct carp_priv *cp = (struct carp_priv *)dev->priv;
struct net_device_stats *stats = &cp->stat;
struct iphdr *iph = skb->nh.iph;
u8 tos;
u16 df;
struct rtable *rt;
struct net_device *tdev;
u32 dst;
int mtu;
int err;
int pkt_len = skb->len;
log("%s\n", __func__);
skb->ip_summed = CHECKSUM_NONE;
skb->protocol = htons(ETH_P_IP);
ip_select_ident(iph, &rt->u.dst, NULL);
ip_send_check(iph);
err = NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev, dst_output);
if (err == NET_XMIT_SUCCESS || err == NET_XMIT_CN) {
stats->tx_bytes += pkt_len;
stats->tx_packets++;
} else {
stats->tx_errors++;
stats->tx_aborted_errors++;
}
#endif
return 0;
}
static int carp_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
{
int err = 0;
struct carp_priv *cp = (struct carp_priv *)dev->priv;
struct net_device *tdev = NULL;
struct carp_ioctl_params p;
log("%s\n", __func__);
memset(&p, 0, sizeof(p));
err = -EPERM;
if (!capable(CAP_NET_ADMIN))
goto err_out;
switch (cmd)
{
#if 0
case SIOC_SETIPHDR:
log("Setting new header.\n");
err = -EFAULT;
if (copy_from_user(&iph, ifr->ifr_ifru.ifru_data, sizeof(iph)))
goto err_out;
err = -EINVAL;
if (iph.version != 4 || iph.protocol != IPPROTO_CARP || iph.ihl != 5 || !MULTICAST(iph.daddr))
goto err_out;
spin_lock(&cp->lock);
carp_close(cp->dev);
memcpy(&cp->iph, &iph, sizeof(iph));
carp_open(cp->dev);
spin_unlock(&cp->lock);
break;
#endif
case SIOC_SETCARPPARAMS:
err = -EFAULT;
if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
goto err_out;
err = -EINVAL;
if (carp_check_params(p))
goto err_out;
log("Setting new CARP parameters.\n");
if (memcmp(p.devname, cp->odev->name, IFNAMSIZ) && (tdev = dev_get_by_name(p.devname)) != NULL)
carp_close(cp->dev);
spin_lock(&cp->lock);
if (tdev)
{
cp->odev->flags = cp->oflags;
dev_put(cp->odev);
cp->odev = tdev;
cp->link = cp->odev->ifindex;
cp->oflags = cp->odev->flags;
cp->odev->flags |= IFF_BROADCAST | IFF_ALLMULTI;
}
cp->md_timeout = p.md_timeout;
cp->adv_timeout = p.adv_timeout;
carp_set_state(cp, p.state);
memcpy(cp->carp_pad, p.carp_pad, sizeof(cp->carp_pad));
memcpy(cp->carp_key, p.carp_key, sizeof(cp->carp_key));
cp->hdr.carp_vhid = p.carp_vhid;
cp->hdr.carp_advbase = p.carp_advbase;
cp->hdr.carp_advskew = p.carp_advskew;
spin_unlock(&cp->lock);
if (tdev)
carp_open(cp->dev);
break;
case SIOC_GETCARPPARAMS:
log("Dumping CARP parameters.\n");
spin_lock(&cp->lock);
p.state = cp->state;
memcpy(p.carp_pad, cp->carp_pad, sizeof(cp->carp_pad));
memcpy(p.carp_key, cp->carp_key, sizeof(cp->carp_key));
p.carp_vhid = cp->hdr.carp_vhid;
p.carp_advbase = cp->hdr.carp_advbase;
p.carp_advskew = cp->hdr.carp_advskew;
p.md_timeout = cp->md_timeout;
p.adv_timeout = cp->adv_timeout;
memcpy(p.devname, cp->odev->name, sizeof(p.devname));
p.devname[sizeof(p.devname) - 1] = '\0';
spin_unlock(&cp->lock);
err = -EFAULT;
if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
goto err_out;
break;
default:
err = -EINVAL;
break;
}
err = 0;
err_out:
return err;
}
static struct net_device_stats *carp_get_stats(struct net_device *dev)
{
struct carp_priv *cp = (struct carp_priv *)dev->priv;
struct carp_stat *cs = &cp->cstat;
log("%s: crc=%8d, ver=%8d, mem=%8d, xmit=%8d | bytes_sent=%8d\n",
__func__,
cs->crc_errors, cs->ver_errors, cs->mem_errors, cs->xmit_errors,
cs->bytes_sent);
return &(((struct carp_priv *)dev->priv)->stat);
}
static int carp_change_mtu(struct net_device *dev, int new_mtu)
{
log("%s\n", __func__);
dev->mtu = new_mtu;
return 0;
}
static void carp_setup(struct net_device *dev)
{
log("%s\n", __func__);
SET_MODULE_OWNER(dev);
dev->uninit = carp_uninit;
dev->destructor = free_netdev;
dev->hard_start_xmit = carp_xmit;
dev->get_stats = carp_get_stats;
dev->do_ioctl = carp_ioctl;
dev->change_mtu = carp_change_mtu;
dev->type = ARPHRD_ETHER;
dev->hard_header_len = LL_MAX_HEADER;
dev->mtu = 1500;
dev->flags = IFF_NOARP;
dev->iflink = 0;
dev->addr_len = 4;
}
static int carp_open(struct net_device *dev)
{
struct carp_priv *cp = (struct carp_priv *)dev->priv;
struct flowi fl = { .oif = cp->link,
.nl_u = { .ip4_u =
{ .daddr = cp->iph.daddr,
.saddr = cp->iph.saddr,
.tos = RT_TOS(cp->iph.tos) } },
.proto = IPPROTO_CARP };
struct rtable *rt;
if (ip_route_output_key(&rt, &fl))
return -EADDRNOTAVAIL;
dev = rt->u.dst.dev;
ip_rt_put(rt);
if (__in_dev_get(dev) == NULL)
return -EADDRNOTAVAIL;
cp->mlink = dev->ifindex;
ip_mc_inc_group(__in_dev_get(dev), cp->iph.daddr);
return 0;
}
static int carp_close(struct net_device *dev)
{
struct carp_priv *cp = (struct carp_priv *)dev->priv;
struct in_device *in_dev = inetdev_by_index(cp->mlink);
if (in_dev) {
ip_mc_dec_group(in_dev, cp->iph.daddr);
in_dev_put(in_dev);
}
return 0;
}
static int carp_init(struct net_device *dev)
{
struct net_device *tdev = NULL;
struct carp_priv *cp;
struct iphdr *iph;
int hlen = LL_MAX_HEADER;
int mtu = 1500;
log("%s - %s\n", __func__, dev->name);
cp = (struct carp_priv *)dev->priv;
iph = &cp->iph;
if (!iph->daddr || !MULTICAST(iph->daddr) || !iph->saddr)
return -EINVAL;
dev_hold(dev);
cp->dev = dev;
strncpy(cp->name, dev->name, IFNAMSIZ);
ip_eth_mc_map(cp->iph.daddr, dev->dev_addr);
memcpy(dev->broadcast, &iph->daddr, 4);
{
struct flowi fl = { .oif = cp->link,
.nl_u = { .ip4_u =
{ .daddr = iph->daddr,
.saddr = iph->saddr,
.tos = RT_TOS(iph->tos) } },
.proto = IPPROTO_CARP };
struct rtable *rt;
if (!ip_route_output_key(&rt, &fl)) {
tdev = rt->u.dst.dev;
ip_rt_put(rt);
}
}
cp->oflags = cp->odev->flags;
dev->flags |= IFF_BROADCAST | IFF_ALLMULTI;
cp->odev->flags |= IFF_BROADCAST | IFF_ALLMULTI;
dev->open = carp_open;
dev->stop = carp_close;
if (!tdev && cp->link)
tdev = __dev_get_by_index(cp->link);
if (tdev) {
hlen = tdev->hard_header_len;
mtu = tdev->mtu;
}
dev->iflink = cp->link;
dev->hard_header_len = hlen;
dev->mtu = mtu;
return 0;
}
static struct inet_protocol carp_protocol = {
.handler = carp_rcv,
.err_handler = carp_err,
};
static u32 inline addr2val(u8 a1, u8 a2, u8 a3, u8 a4)
{
u32 ret;
ret = ((a1 << 24) | (a2 << 16) | (a3 << 8) | (a4 << 0));
return htonl(ret);
}
static void carp_advertise(unsigned long data)
{
struct carp_priv *cp = (struct carp_priv *)data;
struct carp_header *ch = &cp->hdr;
struct carp_stat *cs = &cp->cstat;
struct sk_buff *skb;
int len;
struct ethhdr *eth;
struct iphdr *ip;
struct carp_header *c;
if (cp->state == BACKUP || !cp->odev)
return;
len = sizeof(struct iphdr) + sizeof(struct carp_header) + sizeof(struct ethhdr);
skb = alloc_skb(len + 2, GFP_ATOMIC);
if (!skb)
{
log("Failed to allocate new carp frame.\n");
cs->mem_errors++;
goto out;
}
skb_reserve(skb, 16);
eth = (struct ethhdr *) skb_push(skb, 14);
ip = (struct iphdr *)skb_put(skb, sizeof(struct iphdr));
c = (struct carp_header *)skb_put(skb, sizeof(struct carp_header));
memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
ip_eth_mc_map(cp->iph.daddr, eth->h_dest);
memcpy(eth->h_source, cp->odev->dev_addr, ETH_ALEN);
eth->h_proto = htons(ETH_P_IP);
ip->ihl = 5;
ip->version = 4;
ip->tos = 0;
ip->tot_len = htons(len - sizeof(struct ethhdr));
ip->frag_off = 0;
ip->ttl = CARP_TTL;
ip->protocol = IPPROTO_CARP;
ip->check = 0;
ip->saddr = cp->iph.saddr;
ip->daddr = cp->iph.daddr;
get_random_bytes(&ip->id, 2);
ip_send_check(ip);
memcpy(c, ch, sizeof(struct carp_header));
spin_lock(&cp->lock);
cp->carp_adv_counter++;
spin_unlock(&cp->lock);
ch->carp_counter[1] = htonl(cp->carp_adv_counter & 0xffffffff);
ch->carp_counter[0] = htonl((cp->carp_adv_counter >> 32) & 0xffffffff);
carp_hmac_sign(cp, ch);
skb->protocol = __constant_htons(ETH_P_IP);
skb->mac.raw = ((u8 *)ip) - 14;
skb->dev = cp->odev;
skb->pkt_type = PACKET_MULTICAST;
spin_lock_bh(&cp->odev->xmit_lock);
if (!netif_queue_stopped(cp->odev))
{
atomic_inc(&skb->users);
if (cp->odev->hard_start_xmit(skb, cp->odev))
{
atomic_dec(&skb->users);
cs->xmit_errors++;
log("Hard xmit error.\n");
}
cs->bytes_sent += len;
}
spin_unlock_bh(&cp->odev->xmit_lock);
mod_timer(&cp->adv_timer, jiffies + cp->adv_timeout*HZ);
kfree_skb(skb);
out:
return;
}
static int __init device_carp_init(void)
{
int err;
struct carp_priv *cp;
printk(KERN_INFO "CARP driver.\n");
carp_dev = alloc_netdev(sizeof(struct carp_priv), "carp0", carp_setup);
if (!carp_dev)
{
printk(KERN_ERR "Failed to allocate CARP network device structure.\n");
return -ENOMEM;
}
if (inet_add_protocol(&carp_protocol, IPPROTO_CARP) < 0)
{
printk(KERN_INFO "Failed to add CARP protocol.\n");
err = -EAGAIN;
goto err_out_mem_free;
}
cp = carp_dev->priv;
cp->iph.saddr = addr2val(10, 0, 0, 3);
cp->iph.daddr = addr2val(224, 0, 1, 10);
cp->iph.tos = 0;
cp->md_timeout = 3;
cp->adv_timeout = 1;
cp->state = INIT;
spin_lock_init(&cp->lock);
cp->odev = dev_get_by_name("eth0");
if (cp->odev)
{
cp->link = cp->odev->ifindex;
cp->iph.saddr = (__in_dev_get(cp->odev))->ifa_list[0].ifa_address;
}
memset(cp->carp_key, 1, sizeof(cp->carp_key));
get_random_bytes(&cp->carp_adv_counter, 8);
get_random_bytes(&cp->hdr.carp_advskew, 1);
get_random_bytes(&cp->hdr.carp_advbase, 1);
dump_addr_info(cp);
init_timer(&cp->md_timer);
cp->md_timer.expires = jiffies + cp->md_timeout*HZ;
cp->md_timer.data = (unsigned long)cp;
cp->md_timer.function = carp_master_down;
init_timer(&cp->adv_timer);
cp->adv_timer.expires = jiffies + cp->adv_timeout*HZ;
cp->adv_timer.data = (unsigned long)cp;
cp->adv_timer.function = carp_advertise;
carp_dev->init = carp_init;
cp->tfm = crypto_alloc_tfm("sha1", 0);
if (!cp->tfm)
{
printk(KERN_ERR "Failed to allocate SHA1 tfm.\n");
err = -EINVAL;
goto err_out_del_protocol;
}
dump_hmac_params(cp);
err = carp_init_queues();
if (err)
goto err_out_crypto_free;
if ((err = register_netdev(carp_dev)))
goto err_out_fini_carp_queues;
add_timer(&cp->md_timer);
return err;
err_out_fini_carp_queues:
carp_fini_queues();
err_out_crypto_free:
crypto_free_tfm(cp->tfm);
err_out_del_protocol:
inet_del_protocol(&carp_protocol, IPPROTO_CARP);
err_out_mem_free:
free_netdev(carp_dev);
return err;
}
void device_carp_fini(void)
{
struct carp_priv *cp = carp_dev->priv;
carp_fini_queues();
if (inet_del_protocol(&carp_protocol, IPPROTO_CARP) < 0)
printk(KERN_INFO "Failed to remove CARP protocol handler.\n");
crypto_free_tfm(cp->tfm);
unregister_netdev(carp_dev);
}
module_init(device_carp_init);
module_exit(device_carp_fini);
MODULE_LICENSE("GPL");
[-- Attachment #1.3: carp.h --]
[-- Type: text/x-chdr, Size: 2283 bytes --]
/*
* carp.h
*
* 2004 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
* All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
*/
#ifndef __CARP_H
#define __CARP_H
#include <linux/netdevice.h>
#include <linux/if.h>
#include <linux/ip.h>
#include "carp_ioctl.h"
#define IPPROTO_CARP 112
#define CARP_VERSION 2
#define CARP_TTL 255
#define CARP_SIG_LEN 20
/*
* carp_priv->flags definitions.
*/
#define CARP_DATA_AVAIL (1<<0)
struct carp_header
{
#if defined(__LITTLE_ENDIAN_BITFIELD)
u8 carp_type:4,
carp_version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
u8 carp_version:4,
carp_type:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
u8 carp_vhid;
u8 carp_advskew;
u8 carp_authlen;
u8 carp_pad1;
u8 carp_advbase;
u16 carp_cksum;
u32 carp_counter[2];
u8 carp_md[CARP_SIG_LEN];
};
struct carp_stat
{
u32 crc_errors;
u32 ver_errors;
u32 vhid_errors;
u32 hmac_errors;
u32 counter_errors;
u32 mem_errors;
u32 xmit_errors;
u32 bytes_sent;
};
struct carp_priv
{
struct net_device_stats stat;
struct net_device *dev, *odev;
char name[IFNAMSIZ];
int link, mlink;
struct iphdr iph;
u32 md_timeout, adv_timeout;
struct timer_list md_timer, adv_timer;
enum carp_state state;
struct carp_header hdr;
struct carp_stat cstat;
u8 carp_key[CARP_KEY_LEN];
u8 carp_pad[CARP_HMAC_PAD_LEN];
struct crypto_tfm *tfm;
u64 carp_adv_counter;
spinlock_t lock;
u32 flags;
unsigned short oflags;
};
#endif /* __CARP_H */
[-- Attachment #1.4: carp_ioctl.h --]
[-- Type: text/x-chdr, Size: 1484 bytes --]
/*
* carp_ioctl.h
*
* 2004 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
* All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
*/
#ifndef __CARP_IOCTL_H
#define __CARP_IOCTL_H
#include <linux/sockios.h>
#define CARP_KEY_LEN 20
#define CARP_HMAC_PAD_LEN 64
#define MAX_MD_TIMEOUT 5
#define MAX_ADV_TIMEOUT 5
enum carp_state {INIT = 0, MASTER, BACKUP};
enum carp_ioctls
{
SIOC_SETCARPPARAMS = SIOCDEVPRIVATE,
SIOC_GETCARPPARAMS,
};
struct carp_ioctl_params
{
__u8 carp_advskew;
__u8 carp_advbase;
__u8 carp_vhid;
__u8 carp_key[CARP_KEY_LEN];
__u8 carp_pad[CARP_HMAC_PAD_LEN];
enum carp_state state;
char devname[IFNAMSIZ];
__u32 md_timeout;
__u32 adv_timeout;
};
#endif /* __CARP_IOCTL_H */
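A minimal userspace sketch (not part of the posted code) of how this ioctl
interface could be driven, assuming the carp0 device created by the module;
error handling and real key material are omitted:

	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <linux/types.h>	/* __u8/__u32 used by carp_ioctl.h */
	#include "carp_ioctl.h"

	int carp_configure(const char *real_dev)
	{
		struct carp_ioctl_params p;
		struct ifreq ifr;
		int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

		memset(&p, 0, sizeof(p));
		p.state        = BACKUP;	/* let the election promote us */
		p.carp_vhid    = 1;
		p.carp_advbase = 1;		/* lower base+skew wins the election */
		p.carp_advskew = 0;
		p.md_timeout   = 3;		/* seconds before declaring the master dead */
		p.adv_timeout  = 1;		/* advertisement interval, seconds */
		memset(p.carp_key, 1, sizeof(p.carp_key));
		strncpy(p.devname, real_dev, IFNAMSIZ - 1);

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, "carp0", IFNAMSIZ - 1);
		ifr.ifr_data = (char *)&p;

		ret = ioctl(fd, SIOC_SETCARPPARAMS, &ifr);
		close(fd);
		return ret;
	}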
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 13:36 ` [1/2] CARP implementation. HA master's failover Evgeniy Polyakov
@ 2004-07-15 14:44 ` jamal
2004-07-15 15:27 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-15 14:44 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
Evgeniy,
Why do you need to put this stuff in the kernel?
This should be implemented just the same way as VRRP was - in user
space.
BTW, is there a spec for this protocol or is it one of those things where
you have to follow Yoda's advice?
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 14:44 ` jamal
@ 2004-07-15 15:27 ` Evgeniy Polyakov
2004-07-15 15:55 ` Evgeniy Polyakov
2004-07-15 16:07 ` jamal
0 siblings, 2 replies; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 15:27 UTC (permalink / raw)
To: jamal; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 1375 bytes --]
On Thu, 2004-07-15 at 18:44, jamal wrote:
> Evgeniy,
>
> Why do you need to put this stuff in the kernel?
> This should be implemented just the same way as VRRP was - in user
> space.
Hmm...
Just because I think it works better implemented in the kernel? :)
I don't think that is a good answer, though.
It is faster, it is more flexible, it has access to kernel space...
> BTW, is there a spec for this protocol or its one of those things where
> you have to follow Yodas advice?
Exactly :)
Here are all links I found:
http://www.countersiege.com/doc/pfsync-carp/
http://www.openbsd.org/cgi-bin/man.cgi?query=carp&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html#SEE+ALSO
http://www.openbsd.org/lyrics.html
VRRP2 spec.
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/netinet/ip_carp.c
I do want this to be in the mainline kernel, but actually I don't even
think anyone will apply it.
It is too specialized for a generic kernel, it uses the reserved VRRP
protocol number 112, and so on...
So if developers decide not to include, or even not to discuss, this cruft,
I will not beat myself up over it. :)
It just works as expected, it is reliable and simple.
And it does its job, so HA people would like it.
> cheers,
> jamal
--
Evgeniy Polyakov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 15:27 ` Evgeniy Polyakov
@ 2004-07-15 15:55 ` Evgeniy Polyakov
2004-07-15 16:28 ` jamal
2004-07-15 16:07 ` jamal
1 sibling, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 15:55 UTC (permalink / raw)
To: jamal; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 2200 bytes --]
On Thu, 2004-07-15 at 19:27, Evgeniy Polyakov wrote:
> On Thu, 2004-07-15 at 18:44, jamal wrote:
> > Evgeniy,
> >
> > Why do you need to put this stuff in the kernel?
> > This should be implemented just the same way as VRRP was - in user
> > space.
>
> Hmm...
> Just because i think it works better being implemented in the kernel? :)
> I don't think it is a good answer thought.
>
> It is faster, it is more flexible, it has access to kernel space...
Just an addition [from a private e-mail]:
> would it be possible to do load balancing at the network level with a
> userland only implementation?
>
> OpenBSD's CARP does load balancing through Source Hashing (SH), which
> UCARP lacks support for.
Userspace can't, in principle.
The current kernel implementation can't either, but it could. In principle.
A better implementation should use both CARP and ct_sync plus some load
balancing code that links ct_sync and CARP.
OpenBSD has one disadvantage in this regard: it is not modular, so their
CARP hooks live in if_ether.c.
In Linux we just need to use connection tracking.
ct_sync doesn't do exactly that, but it is close to the idea.
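A toy illustration of the source-hashing idea (not OpenBSD's actual hash and
not something the attached module implements): every node runs the same
function over a packet's source address, and only the node whose index
matches accepts the packet.

	#include <stdint.h>

	/* node_count and my_index would come from the shared CARP/ct_sync state. */
	static int should_accept(uint32_t saddr, uint32_t node_count, uint32_t my_index)
	{
		uint32_t h = saddr;

		h ^= h >> 16;		/* simple mixing, purely illustrative */
		h *= 0x45d9f3bU;
		h ^= h >> 16;

		return (h % node_count) == my_index;
	}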
> > BTW, is there a spec for this protocol or its one of those things where
> > you have to follow Yodas advice?
>
> Exactly :)
> Here are all links I found:
> http://www.countersiege.com/doc/pfsync-carp/
> http://www.openbsd.org/cgi-bin/man.cgi?query=carp&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html#SEE+ALSO
> http://www.openbsd.org/lyrics.html
> VRRP2 spec.
> http://www.openbsd.org/cgi-bin/cvsweb/src/sys/netinet/ip_carp.c
>
>
> I do want this to be in the mainline kernel, but actually I even don't
> think anyone will apply it.
> It is too special stuff for generic kernel, it has reserved 112 vrrp
> protocol number and so on...
> So if developers decide not to include or even not to discuss this cruft
> I will not beat myself by my heels. :)
>
> It just works as expected, it is reliable and simple.
> And it does it's work, so HA people would like it.
>
> > cheers,
> > jamal
--
Evgeniy Polyakov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 15:27 ` Evgeniy Polyakov
2004-07-15 15:55 ` Evgeniy Polyakov
@ 2004-07-15 16:07 ` jamal
2004-07-15 16:59 ` Evgeniy Polyakov
1 sibling, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-15 16:07 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Thu, 2004-07-15 at 11:27, Evgeniy Polyakov wrote:
> On Thu, 2004-07-15 at 18:44, jamal wrote:
> > Evgeniy,
> >
> > Why do you need to put this stuff in the kernel?
> > This should be implemented just the same way as VRRP was - in user
> > space.
>
> Hmm...
> Just because i think it works better being implemented in the kernel? :)
> I don't think it is a good answer thought.
>
> It is faster, it is more flexible, it has access to kernel space...
Yeah, I know ;-> and probably that's what the OpenBSD people did.
I still think it should live in user space. This should apply to
anything that's control related, because such things tend to be
continuously enriched with features. ARP unfortunately is in there; one
of my pet perpetual projects is to totally rip it out. There are already
hooks to deliver to user space today and Alexey has a daemon for it; I'm not
sure how widely used it is.
> > BTW, is there a spec for this protocol or its one of those things where
> > you have to follow Yodas advice?
>
> Exactly :)
> Here are all links I found:
Thank you.
I think a better idea would be to implement a sync message
within CARP instead of that pfsync app doing its own thing. Unless I
misread, pfsync seems to be a separate app.
That way more than one app could use it via the CARP daemon
in user space to sync state of their choice (with whatever pfsync does
being one of many).
This is an example of a rich application and further justification for
it to live in user space.
> I do want this to be in the mainline kernel, but actually I even don't
> think anyone will apply it.
>
> It is too special stuff for generic kernel, it has reserved 112 vrrp
> protocol number and so on...
> So if developers decide not to include or even not to discuss this cruft
> I will not beat myself by my heels. :)
>
> It just works as expected, it is reliable and simple.
> And it does it's work, so HA people would like it.
It is valuable, it just doesn't belong in the kernel.
BTW, I saw some claim that this is patent-free as opposed to VRRP?
I do hope it takes off. What exactly is the patent issue that was at
stake? I couldn't tell from the song lyrics ;->
One valuable thing that could be done, while still avoiding any patent
issues, is to make it interoperate with VRRP.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 15:55 ` Evgeniy Polyakov
@ 2004-07-15 16:28 ` jamal
2004-07-15 16:59 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-15 16:28 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Thu, 2004-07-15 at 11:55, Evgeniy Polyakov wrote:
> > OpenBSD's CARP does load balancing through Source Hashing (SH), which
> UCARP
> > lacks support for.
>
> Userspace can't in principle.
> Current kernel implementation can't too, but it can. In principle.
Easy with the current traffic control extensions. We need an action written
for this. A user space daemon controls it.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 16:28 ` jamal
@ 2004-07-15 16:59 ` Evgeniy Polyakov
2004-07-15 17:30 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 16:59 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 1425 bytes --]
On Thu, 2004-07-15 at 20:28, jamal wrote:
> On Thu, 2004-07-15 at 11:55, Evgeniy Polyakov wrote:
>
>
> > > OpenBSD's CARP does load balancing through Source Hashing (SH), which
> > UCARP
> > > lacks support for.
> >
> > Userspace can't in principle.
> > Current kernel implementation can't too, but it can. In principle.
>
> Easy with current traffic control extensions. We need an action written
> for this. User space dameon controls it.
Load balancing between different computers?
How will the nodes know about each other using only a tc extension?
The kernel traps a packet and sends info about it to userspace, which decides
whether to drop it or not... Not a very fast path.
Or you may hardcode the parameters for packets to be sent through the current
machine in its rules, and userspace will decide only _when_ to apply all
those rules. But if we want to change things we have the following chain:
driver <-0-> stack <-1-> tc <-2-> userspace carp <-3-> stack <-4-> other
machine.
With a kernel implementation we may avoid 2 and 3.
And the biggest advantage of CARP is that it may touch kernel bits.
For any situation that may occur in the HA world and requires touching
kernel space, we always need some in-kernel agent and some state
machine/protocol to connect it to userspace...
CARP can already do this.
>
> cheers,
> jamal
--
Evgeniy Polyakov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 16:07 ` jamal
@ 2004-07-15 16:59 ` Evgeniy Polyakov
2004-07-15 17:24 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 16:59 UTC (permalink / raw)
To: jamal; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 3373 bytes --]
On Thu, 2004-07-15 at 20:07, jamal wrote:
> > > Why do you need to put this stuff in the kernel?
> > > This should be implemented just the same way as VRRP was - in user
> > > space.
> >
> > Hmm...
> > Just because i think it works better being implemented in the kernel? :)
> > I don't think it is a good answer thought.
> >
> > It is faster, it is more flexible, it has access to kernel space...
>
> Yeah, I know ;-> and probably thats what the opnebsd people did.
>
> I still think it should live in user space. This should apply to
> anything thats control related because such things tend to be
> continoulsy enrichned with features. ARP unfortunately is in there; one
> of my pet perpetual projects is to totaly rip it off. Theres already
> hooks to deliver to user space today and Alexey has a daemon for it, not
> sure how widely used it is.
Userspace is too slow.
It can only initiate the master's failover; load balancing is a good example
here - userspace _itself_ cannot control real-time traffic.
> > > BTW, is there a spec for this protocol or its one of those things where
> > > you have to follow Yodas advice?
> >
> > Exactly :)
> > Here are all links I found:
>
> Thank you.
> I think a better idea would be to implement a sync message
> within CARP instead of that pfsync app doing its own thing. Unless i
> misread, pfsync seems to be a separate app.
> This way more than one app can use it via the CARP daemon
> in user space to sync state of their choice (with whatever pfsync does
> being one of many).
The ct_sync module does this.
It uses connection tracking and sends firewall state across the slaves.
CARP is separate by design - anyone may "attach" to the master/slave
failover.
> This is an example of a rich application and further justification for
> it to live in user space.
If it lives in userspace, it just cannot control realtime traffic
or even provide some mechanism to achieve this.
> > I do want this to be in the mainline kernel, but actually I even don't
> > think anyone will apply it.
> >
> > It is too special stuff for generic kernel, it has reserved 112 vrrp
> > protocol number and so on...
> > So if developers decide not to include or even not to discuss this cruft
> > I will not beat myself by my heels. :)
> >
> > It just works as expected, it is reliable and simple.
> > And it does it's work, so HA people would like it.
>
> It is valuable, just doesnt belong to the kernel.
> BTW, i saw some claim that this is patent-free as opposed to VRRP?
> I do hope it takes off. What exactly is the patent issue that was at
> stake? I couldnt tell from the song lyrics ;->
:) Cisco + hsrp == vrrp, but the former is patented.
Here is a quote from Ryan McBride, one of the authors of CARP:
* P.S. If anyone has concerns about the Cisco's patent #5,473,599 and
how their claim that it applies to VRRP has forced us to design our own
incompatible protocol, don't talk to us. Instead, call Cisco's lawyer at
408-525-9706, or email him: rbarr@cisco.com *
> One valuable thing that could be done is while still avoiding any patent
> issues make it interop with VRRP.
VRRP is not secure, it is protocol dependent, it is not free...
> cheers,
> jamal
--
Evgeniy Polyakov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 16:59 ` Evgeniy Polyakov
@ 2004-07-15 17:24 ` jamal
2004-07-15 19:53 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-15 17:24 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Thu, 2004-07-15 at 12:59, Evgeniy Polyakov wrote:
> Userspace is too slow.
> It can only initiate master's failover, load balancing is a good example
> here - userspace _itself_ can not control real time traffic.
What is it that CARP does that couldn't be achieved by VRRP?
VRRP is implemented in user space. As a user myself I can assure you that
even with heartbeats at 100ms granularity I have seen no issues which
can be blamed on the fact that it runs in user space. And I have used it in
fairly large setups.
> ct_sync module does this.
> It uses connection tracking and sends firewall state across slaves.
> CARP is separate by design - anyone may "attach" to master/slave
> failover.
Can you explain a little more what you mean by "attaching" to
master/slave?
I hope you are not saying that this ct_sync thing sends
every piece of connection tracking state across. That would be a
colossal waste.
> > This is an example of a rich application and further justification for
> > it to live in user space.
>
> If it will live in userspace, it just can not control realtime traffic
> and even provide some mechanism to achive this.
What do you mean by realtime traffic?
Is it something that can be achieved by QoS prioritization?
> > It is valuable, just doesnt belong to the kernel.
> > BTW, i saw some claim that this is patent-free as opposed to VRRP?
> > I do hope it takes off. What exactly is the patent issue that was at
> > stake? I couldnt tell from the song lyrics ;->
>
> :) Cisco + hsrp == vrrp, but the former is patented.
> Here is quote from Ryan McBride, an author of the CARP:
>
> * P.S. If anyone has concerns about the Cisco's patent #5,473,599 and
> how their claim that it applies to VRRP has forced us to design our own
> incompatible protocol, don't talk to us. Instead, call Cisco's lawyer at
> 408-525-9706, or email him: rbarr@cisco.com *
>
In the end Cisco is going to be the loser in this, because
something like CARP will take off and it can't talk to them. At the
moment, though, they do have the market, so interoperating with them is
valuable.
> > One valuable thing that could be done is while still avoiding any patent
> > issues make it interop with VRRP.
>
> VRRP is not secure, it is protocol dependent, it is not free...
I was talking more from the deployment side rather than technology.
The gentleman who now owns the VRRP daemon in Linux, Alexandre Cassen, I
believe had a chat with this Cisco lawyer, and if my understanding is
correct the main contention is that Cisco is worried some idiot will sue them
by writing a similar patent, i.e. the patent was not meant as something to
impose on other people but rather as a protection. I could be wrong.
BTW, I am still not sure what the differences are that make CARP
patent-free. In other words, I wouldn't bet at this point that if someone
wanted to go after you for HSRP patent infringement it would be
impossible. In any case I fully support the effort.
BTW, I thought you could make VRRP secure.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 16:59 ` Evgeniy Polyakov
@ 2004-07-15 17:30 ` jamal
2004-07-15 19:20 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-15 17:30 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Thu, 2004-07-15 at 12:59, Evgeniy Polyakov wrote:
> On Thu, 2004-07-15 at 20:28, jamal wrote:
> > Easy with current traffic control extensions. We need an action written
> > for this. User space dameon controls it.
>
> Load balancing between different computers?
> How nodes will know about each other using only tc extension?
Why do they need to know about each other? Maybe explain a little
how said load balancing is achieved.
> Kernel traps packet, send info about it to userspace, it decides drop it
> or not... Not very fast path.
I am hoping CARP knows how to deal with dropped packets.
> Or you may hardcode parameters for packets to be sent through current
> machine in it's rules, and userspace will decide only _when_ apply all
> those rules. But if we want to change things we have following chain:
> driver <-0-> stack <-1-> tc <-2-> userspace carp <-3-> stack <-4-> other
> machine.
> With kernel implementation we may avoid 2 and 3.
Use a socket to send to user space. When you want to install a load
balancing rule, use netlink from user space to the kernel.
Load balancing resides in the kernel as a tc action.
> And the bigggest advantage of the CARP is that it may touch kernel bits.
> For any situation that may occure in HA world and will require touching
> kernel space we always need some inkernel agent and some state
> machine/protocol to connect it to userspace...
> CARP already may this.
A weak argument, I am afraid. I am looking at the way the BSD people did
it - which is what you are emulating - and it is wrong. No need for this
stuff to be done in the kernel at all.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 17:30 ` jamal
@ 2004-07-15 19:20 ` Evgeniy Polyakov
2004-07-16 12:34 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 19:20 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
On 15 Jul 2004 13:30:59 -0400
jamal <hadi@cyberus.ca> wrote:
> On Thu, 2004-07-15 at 12:59, Evgeniy Polyakov wrote:
> > On Thu, 2004-07-15 at 20:28, jamal wrote:
>
> > > Easy with current traffic control extensions. We need an action
> > > written for this. User space dameon controls it.
> >
> > Load balancing between different computers?
> > How nodes will know about each other using only tc extension?
>
> Why do they need to know about each other. Maybe explain a little
> how said load balancing is achieved.
A scenario of this kind (a rough sketch in code follows below):
If I am the master, then I get half of the bandwidth, but if the slave
count is less than the threshold then I get more.
If I am a slave and the slave count is more than the threshold, then I get
0.5/slave_count of the bandwidth and reserve some, else I get...
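Very roughly, as arithmetic (the threshold, the master's extra share and the
final branch are illustrative assumptions; the description above leaves them
open):

	/* Share of the total bandwidth this node should take, in [0.0, 1.0]. */
	static double bandwidth_share(int is_master, int slave_count, int threshold)
	{
		if (is_master)
			return slave_count < threshold ? 0.75 : 0.5;
		if (slave_count > threshold)
			return 0.5 / slave_count;	/* plus some reserve, as above */
		return 0.0;				/* left open in the description */
	}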
> > Kernel traps packet, send info about it to userspace, it decides
> > drop it or not... Not very fast path.
>
> I am hoping CARP knows how to deal with dropped packets.
Tssss, OpenBSD's one just silently drops.
The Linux one will (if it ever does) use some clever mechanism to be sure
that someone else got this packet and _I_ may drop it.
> > Or you may hardcode parameters for packets to be sent through
> > current machine in it's rules, and userspace will decide only _when_
> > apply all those rules. But if we want to change things we have
> > following chain: driver <-0-> stack <-1-> tc <-2-> userspace carp
> > <-3-> stack <-4-> other machine.
> > With kernel implementation we may avoid 2 and 3.
>
> use socket to send to user space. When you want to install a load
> balancing rule, use netlink from user space to the kernel.
> Loadbalancing resides in the kernel as a tc action.
What about the case when you do need kernel-space access based on CARP
state?
You would need to install a new kernel agent for it each time.
With CARP implemented in kernel space you need just one - CARP
itself.
> > And the bigggest advantage of the CARP is that it may touch kernel
> > bits. For any situation that may occure in HA world and will require
> > touching kernel space we always need some inkernel agent and some
> > state machine/protocol to connect it to userspace...
> > CARP already may this.
>
> weak arguement, I am afraid. I am looking at the way the BSD people
> did it - which is what you are emulating and it is wrong. No need for
> this stuff to be done in kernel at all.
No. The BSD people have a kernel implementation: OpenBSD has it, FreeBSD has
a port, NetBSD has a port but has not included it into mainline due to the
patent issue.
It is a case of abstraction: for some reasons (and for most of them) you do
not need a kernel-space implementation.
But reasons do exist to use it in kernel space, and if it becomes an
issue some day, you will have to create a kernel agent anyway. If you need
kernel access in an HA system, do not create new agents; just use CARP as
the kernel agent and arbiter.
>
> cheers,
> jamal
Evgeniy Polyakov ( s0mbre )
Only failure makes us experts. -- Theo de Raadt
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 17:24 ` jamal
@ 2004-07-15 19:53 ` Evgeniy Polyakov
2004-07-16 13:04 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-15 19:53 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
On 15 Jul 2004 13:24:45 -0400
jamal <hadi@cyberus.ca> wrote:
> On Thu, 2004-07-15 at 12:59, Evgeniy Polyakov wrote:
>
> > Userspace is too slow.
> > It can only initiate master's failover, load balancing is a good
> > example here - userspace _itself_ can not control real time traffic.
>
> What is it that CARP does that couldnt be achieved by VRRP?
I will answer a question with questions, sorry.
Does VRRP have some authentication mechanism?
Can it be used over IPv6? (CARP also can't, but it is _very_ easy to
add; I just don't have an IPv6 network setup to test with.)
May someone use VRRP in a private commercial environment without fear of
being sued?
> VRRP is implemented in user space. As a user myself i can assure you
> even with heartbeats at 100ms granularity i have seen no issues which
> can be blamed on the fact it runs on user space. And i have used it in
> fairly large setups.
>
> > ct_sync module does this.
> > It uses connection tracking and sends firewall state across slaves.
> > CARP is separate by design - anyone may "attach" to master/slave
> > failover.
>
>
> Can you explain a little more what you mean by "attching" to
> master/slave?
Consider using some abstraction layer which makes decisions based
on knowledge of the current HA state.
> I hope you are not saying that that this ct_sync thing sends
> every piece of connection tracking state across. That would be a
> collosal waste.
It looks like we do not understand each other :)
Here is the explanation of ct_sync:
http://cvs.netfilter.org/netfilter-ha/README?rev=1.2&content-type=text/vnd.viewcvs-markup
Harald Welte will have a talk about ct_sync at OLS.
> > > This is an example of a rich application and further justification
> > > for it to live in user space.
> >
> > If it will live in userspace, it just can not control realtime
> > traffic and even provide some mechanism to achive this.
>
> What do you mean realtime traffic?
> Is it something that can be achieved by qos prioritization?
Yes, it can. But who will control the prioritization mechanism?
Maybe userspace.
But with such an approach, each time we need kernel access we have to create
a kernel agent with its own kernel<->user protocol and its own connection
to the master/slave arbiter...
With CARP, just create one function in kernel space and register it in
CARP using the provided mechanism.
> > > It is valuable, just doesnt belong to the kernel.
> > > BTW, i saw some claim that this is patent-free as opposed to VRRP?
> > > I do hope it takes off. What exactly is the patent issue that was
> > > at stake? I couldnt tell from the song lyrics ;->
> >
> > :) Cisco + hsrp == vrrp, but the former is patented.
> > Here is quote from Ryan McBride, an author of the CARP:
> >
> > * P.S. If anyone has concerns about the Cisco's patent #5,473,599
> > and how their claim that it applies to VRRP has forced us to design
> > our own incompatible protocol, don't talk to us. Instead, call
> > Cisco's lawyer at 408-525-9706, or email him: rbarr@cisco.com *
> >
>
> In the end CISCO is going to be the loser in this of because
> something like CARP will take off and it cant talk to them. At the
> moment though they do have the market so interoping with them is
> valuable.
It is just marketing...
The better the software, the more market it can eat. Theoretically...
> > > One valuable thing that could be done is while still avoiding any
> > > patent issues make it interop with VRRP.
> >
> > VRRP is not secure, it is protocol dependent, it is not free...
>
> I was talking more from a deployment side rather than technology.
> The gentleman who now owns the VRRP daemon in Linux, Alexander Cassen,
> i believe had a chat with this Cisco lawyer and if my understanding is
> correct the main contention is CISCO is worried some idiot will sue
> them by writting a similar patent i.e the patent was not to have
> something to impose on other people rather a protection. I could be
In theory, practice and theory are the same, but in practice they are
different. (c) Larry McVoy.
Why use software that is not good and carries even a theoretical possibility
of being sued, when we have a free successor ( :) Did I say it? Nah... ).
> wrong. BTW, I am still not sure what the differences are that make
> CARP patent-free. In other words, I wouldnt bet at this point that if
> someone wanted to go after you for HSRP patent infrigement that it
> would be impossible. In any case i fully support the effort.
I have great confidence that Theo de Raadt will not include non
patent-free code in OpenBSD.
> BTW, I thought you could make VRRP secure.
And protocol-independent, and absolutely, even theoretically, free, just
better and different. It was already done by Ryan McBride.
But he also changed the name :)
I believe it was Moore's law that said that people always have time to
rewrite a project from scratch, but never have time to properly
coordinate efforts and create a good thing the first time.
Or something like this...
> cheers,
> jamal
Evgeniy Polyakov ( s0mbre )
Only failure makes us experts. -- Theo de Raadt
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 19:20 ` Evgeniy Polyakov
@ 2004-07-16 12:34 ` jamal
2004-07-16 15:06 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-16 12:34 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Thu, 2004-07-15 at 15:20, Evgeniy Polyakov wrote:
> > > Load balancing between different computers?
> > > How nodes will know about each other using only tc extension?
> >
> > Why do they need to know about each other. Maybe explain a little
> > how said load balancing is achieved.
>
> Kind of scuch scenario:
> If I am a masater, than get half of bandwidth, but if slave count is
> less than threshold than get more.
> If I am a slave and slave count is more than threshold than get
> 0.5/slave_count of the bandwidth and reserve some else get...
Ok, so some controller is in charge - seems like that's something that
could easily be done in user space based on mastership transitions.
> > > Kernel traps packet, send info about it to userspace, it decides
> > > drop it or not... Not very fast path.
> >
> > I am hoping CARP knows how to deal with dropped packets.
>
> Tssss, OpenBSD's one just silently drops.
Ok, I won't tell ;->
> Linux one will (if will) use some clever mechanism to be sure that
> someone got this packet and _I_ may drop it.
In VRRP, for example, it's the number of heartbeats missed that makes the
difference. Also the dead interval is valuable before a split-brain hits
the fan. So maybe one or two dropped packets are ok.
> > use socket to send to user space. When you want to install a load
> > balancing rule, use netlink from user space to the kernel.
> > Loadbalancing resides in the kernel as a tc action.
>
> What about case when you do need kernel space access based on CARP
> state?
What kind of access? To configure something? What kind of thing?
> It is case of abstraction: for some reason(and for most of all) you do
> not need kernel space implmentation.
> But reasons do exist to use it in kernel space, and if it will become an
> issue some day, you will anyway create a kernel agent. If you need
> kernel access in HA system, do not create new agents, just use CARP as
> kernel agent and arbiter.
I am not buying it, Evgeniy, sorry ;->
BTW, I like that ARP balancing feature that CARP has. Pretty neat.
Note that it could easily be done via a tc action with user space
control.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-15 19:53 ` Evgeniy Polyakov
@ 2004-07-16 13:04 ` jamal
2004-07-16 15:06 ` Evgeniy Polyakov
2004-07-19 7:16 ` [nf-failover] " KOVACS Krisztian
0 siblings, 2 replies; 28+ messages in thread
From: jamal @ 2004-07-16 13:04 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 3250 bytes --]
On Thu, 2004-07-15 at 15:53, Evgeniy Polyakov wrote:
> On 15 Jul 2004 13:24:45 -0400
> jamal <hadi@cyberus.ca> wrote:
> > What is it that CARP does that couldnt be achieved by VRRP?
>
> I will answer a question by question, sorry.
;->
> Has vrrp some authentification mechanism?
They (at least used to) claim to be able to do so.
> Can it be used over IPv6? (CARP also can't but it is _very_ easy to
> add, I just don't have IPv6 network setup to test).
There's an effort to have it do v6.
http://www.ietf.org/internet-drafts/draft-ietf-vrrp-ipv6-spec-06.txt
I agree it seems lame to have it as an afterthought.
> May someone use vrrp in private commercial enviroment without fear of
> being convicted?
That I don't know.
> >
> > Can you explain a little more what you mean by "attching" to
> > master/slave?
>
> Consider using some abstraction layer which makes some decisions based
> on knowledge of current HA state.
Sure; make it an API/callback/event to/from the CARP daemon to other
applications.
> It looks like we do not understand each other :)
> Here is the explanation of the ct_sync:
> http://cvs.netfilter.org/netfilter-ha/README?rev=1.2&content-type=text/vnd.viewcvs-markup
>
> Harald Welte will have a talk about ct_sync at OLS.
Ok, good. Maybe if you come to OLS too, we can settle this ;->
Looking at what Harald has, the infrastructure seems to be the correct
flavor. It seems something gets sent to user space via netlink and gets
delivered via keepalived.
I think the CARP load balancing feature is an improvement over what is
being suggested by Harald.
I have to say as well I am shocked that state is just being transferred
blindly - but I will deal with Harald when he shows up in Ottawa ;->
> > What do you mean realtime traffic?
> > Is it something that can be achieved by qos prioritization?
>
> Yes it can. But who will control the prioritization mechanism?
> Maybe userspace.
> But with such an approach, every time we need kernel access we have to create
> a kernel agent with its own kernel<->user protocol, its own connection
> to the master/slave arbiter...
> With CARP, just create one function in kernelspace and register it in
> CARP using the provided mechanism.
bah.
Ok, now you are forcing me to draw diagrams.
I attached one to avoid it being mangled.
> > In the end CISCO is going to be the loser in this, because
> > something like CARP will take off and it can't talk to them. At the
> > moment though they do have the market, so interoperating with them is
> > valuable.
>
> It is just marketing...
> The better software the more market it can eat. Theoretically...
I am afraid that even if that sounds logical, it doesn't work like that.
Too many stupid people. If it worked like that, MS would have been dead and
buried a few years ago.
> In theory practice and theory are the same, but in practice they are
> different. (c) Larry McVoy.
Agreed.
> Why use inferior software, with even a theoretical possibility of being
> sued, when we have a free successor( :) I said it? Nah... ).
Ok, keep spreading fear ;-> You are getting me worried now ;->
> I have great confidence that Theo de Raadt will not include non
> patent-free code in OpenBSD.
I hope he is a lawyer or has some good lawyer advising him;->
cheers,
jamal
[-- Attachment #2: e1 --]
[-- Type: text/plain, Size: 1281 bytes --]
User space
+-------+
| CARPd | <-----> Other apps
+---+---+
|
| - netlink to control ARP LB action
| - netlink to talk to ct_sync
| - netlink to control QoS rules
| - PF_PACKET or raw socket to send/rcv CARP pkts
|
-----------------------------------
Kernel
network I/O etc
Apps interface to CARPd to send the CARP-encapsulated packets I was talking
about earlier. With that thought let's redraw the diagram:
User space
+-------+ +-------+ +---------+
| App#X | <-----> | CARPd | <-----> | ctsyncd |
+---+---+ +---+---+ +---+-----+
| |
| |
------------------------------------------
Kernel
network I/O etc
So now ct_sync control is not owned by carpd. Rather, ctsyncd listens
to the netlink msgs from the kernel, builds an app msg, and passes it to
carpd, which will send it if the state is right. Note that such a message
usable by apps doesn't exist today; it would need to be defined.
Apps could also register callbacks for events
such as mastership transitions.
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-16 12:34 ` jamal
@ 2004-07-16 15:06 ` Evgeniy Polyakov
2004-07-17 11:52 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-16 15:06 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 3695 bytes --]
On Fri, 2004-07-16 at 16:34, jamal wrote:
> > If I am a master, than get half of bandwidth, but if slave count is
> > less than threshold than get more.
> > If I am a slave and slave count is more than threshold than get
> > 0.5/slave_count of the bandwidth and reserve some else get...
>
> Ok, so some controller is in charge - seems like thats something that
> could be easily done in user space based on mastership transitions.
Yes, but here is a tricky but true example:
Some time ago the e1000 driver from Intel had the ability to do hardware
bonding (I absolutely don't remember what it was called, but the idea was the
same as in bonding).
Consider the following scenario: if a node is a master, it enables this
bonding mode using e1000 internal registers. ethtool doesn't support
that mode. Yes, it can also be enabled by patching userspace, but
with kernel CARP it is not needed.
Or consider the TGE example (...wireless HA... strange sentence, but...):
If I am a master, then enable a higher priority in the driver.
The current tc design can't be mapped onto the driver's internal structures :>
But the main killer is the following:
consider a firewall with thousands of iptables rules; if the node becomes a
master it needs to add or remove some rules from the table.
Copying such amounts to/from userspace/kernelspace memory will take
_minutes_... Even using iptables chains.
But a kernel implementation may just add the one rule.
Yet another variant: you need to access CPU internal registers based on
HA state, kind of turning on or off an additional hotplug CPU and/or
memory, enabling/disabling NUMA access. Can you enable/disable the bus
arbiter from userspace?
For example, I'm using the on-chip SDRAM in a PPC440 as an L2 cache or as a jitter
buffer for OPB access, and the decision to use each mode is based on some
hardware loads. Userspace does not have access to such a mechanism.
These are deep kernel internals, and I do not see any good reason to export
them to userspace.
Actually the last example can't be used as an argument in our discussion, but
it illustrates that sometimes we need to touch kernel-_only_ parts, and
this decision is dictated from outside the touched part.
> > What about case when you do need kernel space access based on CARP
> > state?
>
> What kind of access? To configure something? what kind of thing?
Some of the scenarios above?
> > It is a case of abstraction: for some reason (and for most of them) you do
> > not need a kernel space implementation.
> > But reasons do exist to use it in kernel space, and if it becomes an
> > issue some day, you will create a kernel agent anyway. If you need
> > kernel access in an HA system, do not create new agents, just use CARP as
> > the kernel agent and arbiter.
>
> I am not buying it, Evgeniy, sorry ;->
I see :)
> BTW, I like that ARP balancing feature that CARP has. Pretty neat.
> Note that it could be easily done via a tc action with user space
> control.
Anything may be done in userspace.
For example, routing decisions.
Yes, it _may_ be done in userspace. But it is slow.
SCSI over IP may be done as a network block device.
Or even copying a packet to userspace through a raw device and then sending it
using a socket.
QNX and Mach are even designed in this way.
This is not about current possibilities, it is a kind of design question :)
Yes, probably our _current_ needs may be satisfied using existing
userspace tools.
But I am absolutely sure that we will need in-kernel support.
I'm reading your second e-mail with the pretty diagrams and already see where
in-kernel CARP will live there :)
> cheers,
> jamal
--
Evgeniy Polaykov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-16 13:04 ` jamal
@ 2004-07-16 15:06 ` Evgeniy Polyakov
2004-07-17 12:47 ` jamal
2004-07-19 7:16 ` [nf-failover] " KOVACS Krisztian
1 sibling, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-16 15:06 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
[-- Attachment #1.1: Type: text/plain, Size: 3543 bytes --]
On Fri, 2004-07-16 at 17:04, jamal wrote:
> > Has vrrp some authentification mechanism?
>
> They (at least used to) claim to be able to do so.
Hm... Quote from draft-ietf-vrrp-spec-v2-08.txt
5.3.6.1 Authentication Type 0 - 8-bit virtual host ID.
5.3.6.2 Authentication Type 1 - plain password.
5.3.6.3 Authentication Type 2 - HMAC.
I think even Type 2 is not good.
They do a strong signed digest, but it does not have any kind of counter,
so I do not see any replay attack prevention.
> > Can it be used over IPv6? (CARP also can't but it is _very_ easy to
> > add, I just don't have IPv6 network setup to test).
>
> Theres effort to have it do v6.
> http://www.ietf.org/internet-drafts/draft-ietf-vrrp-ipv6-spec-06.txt
> I agree its lame to have it as an after thought it seems
* VRRP for IPv6 does not currently include any type of authentication. *
> > > Can you explain a little more what you mean by "attching" to
> > > master/slave?
> >
> > Consider using some abstraction layer which makes some decisions based
> > on knowledge of current HA state.
>
> sure; make it an API/callback/event to/from the carp daemon to other
> applications.
>
> > It looks like we do not understand each other :)
> > Here is the explanation of the ct_sync:
> > http://cvs.netfilter.org/netfilter-ha/README?rev=1.2&content-type=text/vnd.viewcvs-markup
> >
> > Harald Welte will have a talk about ct_sync at OLS.
>
>
> Ok, good. Maybe if you too come to OLS we can settle this ;->
:) Unfortunately no...
> Looking at what HArald has, the infrastructure seems to be the correct
> flavor. Seems something gets sent to user space via netlink and gets
> delivered via keepalived.
> I think the CARP loadbalancing feature is an improvement over what is
> being suggested by Harald.
> I have to say as well i am shocked that state is just being transfered
> blindly - but i will deal with Harald when he shows up in Ottawa ;->
Harald, sorry :)
> > > What do you mean realtime traffic?
> > > Is it something that can be achieved by qos prioritization?
> >
> > Yes it can. But who will control prioritization mechanism?
> > Maybe userspace.
> > But with such approach we need to create each time we need kernel access
> > a kernel agent with it's own kernel<->user protocol, it's own connect
> > to master/slave arbiter...
> > With CARP just create one function in kernelspace and register it in
> > CARP using provided mechanism.
>
> bah.
> Ok, now you are forcing me to draw diagrams.
>
> I attached one to avoid it being mangled.
I will draw one too.
> > > In the end CISCO is going to be the loser in this of because
> > > something like CARP will take off and it cant talk to them. At the
> > > moment though they do have the market so interoping with them is
> > > valuable.
> >
> > It is just marketing...
> > The better software the more market it can eat. Theoretically...
>
> I am afraid even if that sounds logical it doesnt work like that.
> Too many stupid people. If it worked like that MS would be dead and
> buried a few years ago.
For those who care, they are already done for.
> > I have great confidence that Theo de Raadt will not include non
> > patent-free code in OpenBSD.
>
> I hope he is a lawyer or has some good lawyer advising him;->
He is the OpenBSD creator, so he is just a bit more paranoid than
others :)
> cheers,
> jamal
--
Evgeniy Polaykov ( s0mbre )
Crash is better than data corruption. -- Art Grabowski
[-- Attachment #1.2: carp_diagram.1 --]
[-- Type: text/plain, Size: 2745 bytes --]
No, your diagram should look like this:
User space
+----------------------------------------------------------+
| |
| +-------------------------------------+ |
| | | |
+-------+ +-------+ +---------+ +-------+ +-------+
| App#X | <-----> | CARPd | <-----> | ctsyncd | <-----> | App#X | <-----> | App#X | <----->
+---+---+ +---+---+ +---+-----+ +---+---+ +---+---+
| | | |
| | | |
------------------------------------------------------------------------------------------
Kernel
network I/O etc
Or only have one BUS (and it is actually implemented using netlink).
You need to connect each application daemon to carpd, even using broadcast netlink.
And for any in-kernel access you will need to create a new App and a new kernel part.
If we extrapolate this we can create the following:
userspace carp determines that it is a master, suspends all kernel memory or dumps /proc/kmem
and begins to advertise it. The remote node receives it and ends up with pretty much the same firewall settings,
flow controls and any other in-kernel state.
Never mind that it takes a long time.
It makes sense if App#X needs userspace access only.
But here is other diagram:
userspace
|
-----------------+-------------------------------
CARP kernelspace
|
|
+----------+-----+-----+---------+-------
| | | |
ct_sync iSCSI e1000 CPU
My main idea for in-kernel CARP was to implement an invisible HA mechanism suitable for in-kernel use.
You do not need to create a netlink protocol parser, you do not need to create extra userspace overhead,
and you do not need to create userspace-facing control hooks in the kernel infrastructure.
Just register a callback.
But even with such a simple approach you have the opportunity to collaborate with userspace, if you need to.
Why create all the userspace cruft if/when you only need the kernel one?
Summary:
With your approach any data flow MUST go through userspace arbiters, with all the overhead and complexity.
With my approach any data flow _MAY_ go through userspace arbiters, but if you only need/only have
in-kernel access then using in-kernel CARP is the only solution.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-16 15:06 ` Evgeniy Polyakov
@ 2004-07-17 11:52 ` jamal
2004-07-17 12:59 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-17 11:52 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Fri, 2004-07-16 at 11:06, Evgeniy Polyakov wrote:
> On Fri, 2004-07-16 at 16:34, jamal wrote:
> >
> > Ok, so some controller is in charge - seems like thats something that
> > could be easily done in user space based on mastership transitions.
>
> Yes, but here is tricky but true example:
> Some time ago e1000 driver from Intel had possibility to do hardware
> bonding(i absolutely don't remember how it was called, but idea was the
> same as in bonding).
I remember that cruft. Actually (sadly) people like MontaVista ship that
thing in their distros (under the guise of carrier grade linux ;->).
I think the current folks out of Intel working on Linux drivers and
bonding are a lot more knowledgeable - I do hope they have thrown that
"thing" out of the window.
> Consider the following scenario: if a node is a master, it enables this
> bonding mode using e1000 internal registers. ethtool doesn't support
> that mode. Yes, it can also be enabled by patching userspace, but
> with kernel CARP it is not needed.
The way bonding works is the right way to do it.
Forget about that other crap.
There was a thread on netdev a while back to empower bonding to be
controlled from user space; when that happens you are set.
But even without that the link carrier netlink event messages
should be a good start.
> Or consider the TGE example (...wireless HA... strange sentence, but...):
> If I am a master, then enable a higher priority in the driver.
> The current tc design can't be mapped onto the driver's internal structures :>
Now you want to wake up Mr. Vladimir ;-> Actually I think I just saw mails
from him.
Yes, these are some of the issues we want to go work on. Still waiting for
the brief review of the RSVP-like control being used in TGE. And when done,
all that should be done in user space.
> But the main killer is following:
> consider firewall with thousands iptables rules, and if node becomes a
> master it needs to add or remove some rules from table.
> Copying such amounts to/from userspace/kernelspace memory will take
> _minutes_... Even using iptables chains.
> But kernel implementation may just add one rule.
That's a deficiency in iptables. Iptables should be fixed.
I think there may actually be plans in place to fix it.
> Yet another variant: you need to access CPU internal registers based on
> HA state, kind of turning on or off additional hotplug CPU and or
> memory, enabling/disabling NUMA access. Can you enable/disable bus
> arbiter from userspace?
I think you should be able to write an interface to access such
functionality. Isn't there something along the lines of /sbin/hotplug used
for such things?
> For example I'm using on-chip SDRAM in PPC440 as L2 cache or as jitter
> buffer for OPB access, decision to use each mode is based on some
> hardware loads. Userspace do not have access to such mechanism.
> It is deep kernel internals, and I do not see any good reason to export
> it to userspace.
> Actually last example can't be used as argument in our discussion, but
> it illustrates that sometimes we need to touch kernel-_only_ parts, and
> this decision is dictated from the outside of the touchable part.
>
I can tell you one thing: I am totally against this thing being part
of the kernel; not just because it adds noise but because it makes it
harder to keep adding more and more functionality or integrating its
capability into other apps.
BTW, there's a very nice paper being presented at OLS by someone from .au
who is in fact trying to move drivers to user space ;->
I don't mind adding some needed datapath mechanism in the kernel to
enable it to do interesting things; control of such a mechanism and policy
decisions should be very clearly separated and sit in userspace.
> > BTW, I like that ARP balancing feature that CARP has. Pretty neat.
> > Note that it could be easily done via a tc action with user space
> > control.
>
> Anything may be done in userspace.
> For example routing decision.
> Yes, it _may_ be done in userspace. But it is slow.
Big difference though with CARP. CARP shouldn't need to process 100Kpps;
but even if it did, CARP packets contain control information that is
valuable in policy settings. Control protocols tend to be "rich" and
evolve over much shorter periods of time.
A better comparison for what you are saying is to move OSPF into the kernel.
> SCSI over IP may be done as network block device.
> Or even copying packet to userspace through raw device and then send it
> using socket.
Again all that is datapath. CARP is control.
> QNX and Mach are even designed in this way.
We just have a better architecture, that's all ;->
[BTW, a lot of people with experience in things like VxWorks (one big
flat memory space) always want to move things into the kernel. Typically,
after some fight they move certain things to user space with
"you will hear from me" threats. I never hear back from them because
it works fine. This after they wanted to shoot me because linux "wasn't
realtime".]
> It is not talk about current possibilities, it is kind of design :)
> Yes, probably our _current_ needs may be satisfied using existing
> userspace tools.
> But I absolutely sure that we will need in-kernel support.
> I'm reading you second e-mail with pretty diagrams and already see where
> in-kernel CARP will live there :)
Ok ;-> I am looking forward to seeing your view on it.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-16 15:06 ` Evgeniy Polyakov
@ 2004-07-17 12:47 ` jamal
2004-07-17 14:00 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-17 12:47 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
[-- Attachment #1: Type: text/plain, Size: 1291 bytes --]
On Fri, 2004-07-16 at 11:06, Evgeniy Polyakov wrote:
> On Fri, 2004-07-16 at 17:04, jamal wrote:
>
[..]
> They do strong signed digest, but it does not have any kind of counter
> so i do not see replay attack prevention.
Ok, you are right. I do think that there are people who have run this
over IPSEC though. I could swear that the current Linux-based one does.
I wish we could get Alexander to comment on this discussion.
>
> > > Can it be used over IPv6? (CARP also can't but it is _very_ easy to
> > > add, I just don't have IPv6 network setup to test).
> >
> > Theres effort to have it do v6.
> > http://www.ietf.org/internet-drafts/draft-ietf-vrrp-ipv6-spec-06.txt
> > I agree its lame to have it as an after thought it seems
>
> * VRRP for IPv6 does not currently include any type of authentication. *
Fine.
> I will draw one too.
Ok, my response is attached.
> For those who care, they are already done for.
I was done with them 10 years ago. But there's a lot of fools around.
MS targets the fools.
> > > I have great confidence that Theo de Raadt will not include non
> > > patent-free code in OpenBSD.
> >
> > I hope he is a lawyer or has some good lawyer advising him;->
>
> He is the OpenBSD creator, so he is just a bit more paranoid than
> others :)
I see ;->
cheers,
jamal
[-- Attachment #2: carp.dig2 --]
[-- Type: text/plain, Size: 3849 bytes --]
No, your diagram should look like this:
User space
+----------------------------------------------------------+
| |
| +-------------------------------------+ |
| | | |
+-------+ +-------+ +---------+ +-------+ +-------+
| App#A | <-----> | CARPd | <-----> | ctsyncd | <-----> | App#B | <-----> | App#C | <----->
+---+---+ +---+---+ +---+-----+ +---+---+ +---+---+
| | | |
| | | |
------------------------------------------------------------------------------------------
Kernel
network I/O etc
Or only have one BUS (and it is actually implemented using netlink).
jamal> I relabeled the Apps. I suppose you see some apps using ctsyncd for something?
You need to connect each application daemon to carpd, even using broadcast netlink.
And for any in-kernel access you will need to create a new App and a new kernel part.
jamal> App2app doesn't have to go across the kernel unless it turns out it is the
jamal> best way.
jamal> Alternatives include: Unix or localhost sockets, IPCs such as pipes, or
jamal> just shared libraries.
jamal>
If we extrapolate this we can create the following:
userspace carp determines that it is a master, suspends all kernel memory or dumps /proc/kmem
and begins to advertise it. The remote node receives it and ends up with pretty much the same firewall settings,
flow controls and any other in-kernel state.
jamal> I haven't studied what Harald proposes in detail. I think that the slave would
jamal> continuously be getting master updates.
jamal> The interesting thing about CARP is the ARP balancing feature in which X nodes
jamal> may be masters of different IP flows, all within the same subnet.
jamal> VRRP load balances by subnet. I am not sure what challenge this will present
jamal> to ctsyncd.
Never mind that it takes a long time.
It makes sense if App#X needs userspace access only.
But here is other diagram:
userspace
|
-----------------+-------------------------------
CARP kernelspace
|
|
+----------+-----+-----+---------+-------
| | | |
ct_sync iSCSI e1000 CPU
My main idea for in-kernel CARP was to implement an invisible HA mechanism suitable for in-kernel use.
You do not need to create a netlink protocol parser, you do not need to create extra userspace overhead,
and you do not need to create userspace-facing control hooks in the kernel infrastructure.
Just register a callback.
But even with such a simple approach you have the opportunity to collaborate with userspace, if you need to.
Why create all the userspace cruft if/when you only need the kernel one?
jamal>
jamal> so we now move appA, B, C to the kernel too?
jamal> There is absolutely no need to put this in kernel space.
jamal> If you do this, your next step should be to put zebra in the kernel
jamal>
Summary:
With your approach any data flow MUST go through userspace arbiters, with all the overhead and complexity.
With my approach any data flow _MAY_ go through userspace arbiters, but if you only need/only have
in-kernel access then using in-kernel CARP is the only solution.
jamal>
jamal> Yes, there is a cost. How much? Read the paper on user space drivers; it actually does
jamal> some cost analysis.
jamal> If you believe it is too expensive to put it in user space, then prove it and let's
jamal> have a re-discussion.
jamal>
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 11:52 ` jamal
@ 2004-07-17 12:59 ` Evgeniy Polyakov
2004-07-17 15:47 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-17 12:59 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
On 17 Jul 2004 07:52:09 -0400
jamal <hadi@cyberus.ca> wrote:
<arguments and counterarguments are skipped>
> > Actually last example can't be used as argument in our discussion,
> > but it illustrates that sometimes we need to touch kernel-_only_
> > parts, and this decision is dictated from the outside of the
> > touchable part.
> >
>
> I can tell you one thing: I am totaly against this thing being part
> of the kernel; not just because it adds noise but because it makes it
> harder to keep adding more and more functionality or integrating its
> capability into other apps.
> BTW, theres a very nice paper being presented at OLS by someone from
> .au who is trying infact to move drivers to user space ;->
I saw it...
May I not comment on it? I do not want to look like a rude freak... :)
> I dont mind adding some needed datapath mechanism in the kernel to
> enable it to do interesting things; control of such mechanism and
> policy decisions should be very clearly separated and sit in
> userspace.
>
> > > BTW, I like that ARP balancing feature that CARP has. Pretty neat.
> > > Note that it could be easily done via a tc action with user space
> > > control.
> >
> > Anything may be done in userspace.
> > For example routing decision.
> > Yes, it _may_ be done in userspace. But it is slow.
>
> Big difference though with CARP. CARP shouldnt need to process
> 100Kpps; but even if it did, CARP packet contain control information
> that is valuable in policy settings. Control protocols tend to be
> "rich" and evolve over much shorter periods of time.
> A better comparison what you are saying is to move OSPF to the kernel.
Only for now, since we can only imagine some examples now.
When the number of agents controlled by/connected to CARP becomes
significant, broadcasting and the userspace arbiter's overhead may not
satisfy HA needs.
> > SCSI over IP may be done as network block device.
> > Or even copying packet to userspace through raw device and then send
> > it using socket.
>
> Again all that is datapath. CARP is control.
Control, but it must have the ability to control any dataflow element.
If using all_flows_one_arbiter, then we must have a near-standing
controller like in-kernel CARP.
If using one_flow_one_arbiter (like tc), then we may use a far-standing
control mechanism and a near-standing arbiter.
Like qdisc + tc + ucarp.
The question is: "Do we need to create near-standing arbiters and a far-standing
controller for scenario A, when we may have a near-standing
controller?".
I do believe that for some situations we just need an in-kernel controller
without any overhead and a simple in-kernel interface.
> > QNX and Mach are even designed in this way.
>
> We just have better architecture thats all ;->
If we want to put as much as possible outside the kernel even when it works
better in the kernel, then we slowly move toward a microkernel, with userspace
threads for the fs and for the network, controlled by broadcast
messages to simplify the control protocol.
But those are too-distant plans, so it is just "blah-blah" lyrics for now... :)
> [BTW, A lot of people with experience in things like vxworks (one big
> flat memory space) always want to move things into the kernel.
> Typically after some fight they move certain things to user space with
> "you will hear from me" threats. I never hear back from them because
> it works fine. This after they wanted to shoot me because linux "wasnt
> realtime"]
BTW, I just reread OpenBSD's load balancing code...
It is different from the one which may be created with tc and its
extensions.
They look into each packet, and if its "signature" is controlled by a node
and this node is master, then they process this packet.
They have one arbiter for every dataflow, while Linux has many arbiters,
each of which may be controlled from userspace; that is the difference.
So their scheme may not be implemented in userspace CARP, while in
Linux it may be implemented using tc extensions with userspace CARP.
I will rewrite my summary:
With your approach any data flow MUST go through userspace arbiters, with
all the overhead and complexity. With my approach any data flow _MAY_ go
through userspace arbiters, but if you only need/only have in-kernel access
then using in-kernel CARP is the only solution.
My main idea for in-kernel CARP was to implement an invisible HA mechanism
suitable for in-kernel use. You do not need to create a netlink protocol
parser, you do not need to create extra userspace overhead, and you do not
need to create userspace-facing control hooks in the kernel
infrastructure. Just register a callback.
But even with such a simple approach you have the opportunity to collaborate
with userspace, if you need to.
Why create all the userspace cruft if/when you only need the kernel one?
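As a rough illustration of what "just register a callback" could look like
for an external module (a sketch only - the carp_* names and event constants
here are assumptions, not necessarily the API of the posted patch):

	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/module.h>
	#include <linux/notifier.h>

	#define CARP_BECAME_MASTER	1	/* assumed event codes */
	#define CARP_BECAME_BACKUP	2

	/* assumed to be exported by the in-kernel CARP core */
	extern int carp_register_notifier(struct notifier_block *nb);
	extern int carp_unregister_notifier(struct notifier_block *nb);

	static int my_ha_event(struct notifier_block *nb,
			       unsigned long event, void *data)
	{
		switch (event) {
		case CARP_BECAME_MASTER:
			/* e.g. flip the one firewall rule, reprogram the NIC, ... */
			break;
		case CARP_BECAME_BACKUP:
			/* undo it */
			break;
		}
		return NOTIFY_DONE;
	}

	static struct notifier_block my_ha_nb = {
		.notifier_call	= my_ha_event,
	};

	static int __init my_ha_init(void)
	{
		return carp_register_notifier(&my_ha_nb);
	}

	static void __exit my_ha_exit(void)
	{
		carp_unregister_notifier(&my_ha_nb);
	}

	module_init(my_ha_init);
	module_exit(my_ha_exit);
	MODULE_LICENSE("GPL");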
> > It is not talk about current possibilities, it is kind of design :)
> > Yes, probably our _current_ needs may be satisfied using existing
> > userspace tools.
> > But I absolutely sure that we will need in-kernel support.
> > I'm reading you second e-mail with pretty diagrams and already see
> > where in-kernel CARP will live there :)
>
> Ok;-> I am looking forward to see your view on it.
>
> cheers,
> jamal
Evgeniy Polyakov ( s0mbre )
Only failure makes us experts. -- Theo de Raadt
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 12:47 ` jamal
@ 2004-07-17 14:00 ` Evgeniy Polyakov
2004-07-17 16:29 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-17 14:00 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
On 17 Jul 2004 08:47:34 -0400
jamal <hadi@cyberus.ca> wrote:
jamal> I relabeled the Apps. I suppose you see some apps using ctsyncd
jamal> for something?
>>You need to connect each application daemon to carpd, even using
>>broadcast netlink. And for any in-kernel access you will need to
>>create a new App and a new kernel part.
jamal> App2app doesn't have to go across the kernel unless it turns out it is
jamal> the best way.
jamal> Alternatives include: Unix or localhost sockets, IPCs such as
jamal> pipes, or just shared libraries.
MICROKERNEL, I see it :)
Non-broadcast/multicast will _strongly_ complicate the protocol.
Broadcast will waste application/kernel "bandwidth".
>>If we extrapolate this we can create the following:
>>userspace carp determines that it is a master, it will suspend all
>>kernel memory or dump /proc/kmem and begin to advertise it. The remote
>>node receives it and ends up with pretty much the same firewall settings,
>>flow controls and any other in-kernel state.
jamal> I haven't studied what Harald proposes in detail. I think that
jamal> the slave would continuously be getting master updates.
So it is.
jamal> The interesting thing about CARP is the ARP balancing feature in
jamal> which X nodes may be masters of different IP flows, all within the
jamal> same subnet.
jamal> VRRP load balances by subnet. I am not sure what challenge this
jamal> will present to ctsyncd.
CARP may do it, but it requires an in-kernel hook in the arp code.
Actually OpenBSD's one has its entry in if_ether.c, so their CARP always
has access to any network dataflow.
BTW, with your approach the hook in the arp code needs to send a message to
the userspace carp to ask whether it is a "good or bad" packet.
Or you need to create tc for arp.
Or to communicate with in-kernel CARP. :)
>>No matter that it takes a long time.
>>It make sence if App#X needs userspace access only.
>>But here is other diagram:
userspace
|
-----------------+-------------------------------
CARP kernelspace
|
|
+----------+-----+-----+---------+-------
| | | |
ct_sync iSCSI e1000 CPU
>>My main idea for in-kernel CARP was to implement invisible HA
>>mechanism
>>suitable for in-kernel use. You do not need to create netlink protocol
>>parser, you do not need to create extra userspace overhead, you do not
>>need to create suitable for userspace control hooks in kernel
>>infrastructure. Just register callback.
>>But even with such simple approach you have opportunity to collaborate
>>with userspace. If you need.
>>Why creating all userspace cruft if/when you need only kernel one?
jamal>
jamal> so we now move appA, B, C to the kernel too?
jamal> There is absolutely no need to put this in kernel space.
jamal> If you do this, your next step should be to put zebra in the
jamal> kernel
No.
And this is the beauty of in-kernel CARP.
You _already_ have in-kernel parts which may need master/slave failover.
You just need to connect them to the arbiter.
With userspace you _need_ to create all those Apps connected to
userspace carp; with in-kernel CARP you just need to register a callback.
One function call.
BTW, someone created tux, khttpd, knfsd :)
But I think zebra must live in userspace, since it does not need to
control any kernel parameters.
CARP _may_ control kernel parameters.
If you do not need in-kernel functionality, just use UCARP.
>>Resume:
>>With your approach any data flow MUST go through userspace arbiters
>>with
>>all overhead and complexity. With my approach any data flow _MAY_ go
>>through userspace arbiters, but if you do_need/only_has in-kernel
>>access
>>than using in-kernel CARP is the only solution.
jamal> Yes, there is a cost. How much? Read the paper on user space
jamal> drivers it actually does
jamal> some cost analysis.
jamal> If you prove that it is too expensive to put it in user space
jamal> then prove it and lets
jamal> have a re-discussion
Hey-ho, easily :)
Consider embedded processors.
Numbers: ppc405gp, 200 MHz, 32 MB SDRAM.
Application - 4-8 DSP processors controlled by the PPC.
Each DSP processor generates a 6-8 byte frame at 8 kHz in each
channel (from 1 to 2).
The driver reads data from each DSP and does some postprocessing (mainly
splitting it into B/D channels). The driver has a clever mapping so the
userspace<->kernelspace dataflow may be zero-copied.
Kernelspace processing takes up to 133 MHz of the 200.
Consider a userspace application that
a. makes PCM stereo from the different B/D logical channels (zero-copied from
kernelspace).
b. sends it into the network (using TCP for bad historical/compatibility
reasons).
Situation: if we have one userspace process (or even thread) per DSP,
then context switching takes too long and we see data corruption.
No network parameter (100 Mb network) can improve the situation.
Only one process per 4 DSPs may send data into the network stack without any
data loss.
P.S. It is a 2.4.25 kernel.
I do believe that Peter Chubb (peterc@gelato.unsw.edu.au) will talk
about big machines where big tasks _may_ have big latencies.
May Oracle have small latencies? It may. But it also _may_ have big
latencies. Why not?
DSP and sound/video capturing _may_not_ have big latencies.
Although I do think that the talk about userspace drivers is not an issue in
our discussion :)
>
> cheers,
> jamal
>
Evgeniy Polyakov ( s0mbre )
Only failure makes us experts. -- Theo de Raadt
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 12:59 ` Evgeniy Polyakov
@ 2004-07-17 15:47 ` jamal
2004-07-17 20:04 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-17 15:47 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Sat, 2004-07-17 at 08:59, Evgeniy Polyakov wrote:
> On 17 Jul 2004 07:52:09 -0400
> jamal <hadi@cyberus.ca> wrote:
> > BTW, theres a very nice paper being presented at OLS by someone from
> > .au who is trying infact to move drivers to user space ;->
>
> I saw it...
> May I not comment it? I do not want to look like rude freak... :)
rude freaks are not frowned upon here;-> They are loved (ok, maybe some
uptight people may have a problem with it;->).
But maybe we should keep that thread separate from this.
> > Big difference though with CARP. CARP shouldnt need to process
> > 100Kpps; but even if it did, CARP packet contain control information
> > that is valuable in policy settings. Control protocols tend to be
> > "rich" and evolve over much shorter periods of time.
> > A better comparison what you are saying is to move OSPF to the kernel.
>
> Only for now, since we can imagine only some examples now.
> When number of agents controlled/connected to CARP will became
> significant broadcasting and userspace arbiter's overhead may not
> satisfy HA needs.
So let me put your fears to rest and share my experiences:
I have some experience in using VRRP in a variety of very large,
critical and at times very weird, senseless setups. This is with some of
the most anal telco types you can come across. They protect the network
uptime just as if it were a part of their body. I hate to use dirty
cliches like "carrier grade" - but it would probably be the closest
qualifier; telcos don't let you mess around with their setups and create any
holes which will bring down anything for a few seconds. Note, this is
with VRRP running in user space. In _every_ case I have been in, I have
always been challenged as to why it's not in the kernel or running as a
realtime process, and in all cases running in user space didn't prove to
be the problem. The biggest challenge was fixing broadcast storms
because someone created a bcast loop, in which case the machine is under
DoS attack. The other valuable thing to do is to make sure that
VRRP packets (like any other control packets) get higher priority in the
network.
BTW, if you are thinking of instantiating carpd for every agent, then
you have got to rethink that plan. Hint: you need to handle all of the CARP
protocol within one daemon. Maybe that's what you are saying, but only to
do it in the kernel.
Broadcasts: I wasn't sure what you meant.
> Control, but it must have possibility to control any dataflow element.
> If using all_flows_one_arbiter, then we must have near standing
> controller like in-kernel CARP.
> If using one_flow_one_arbiter(like tc) then we may use far outstanding
> control mechanism and near standing arbiter.
> Like qdisk + tc + ucarp.
>
> The question is: "Do we need to create near standing ariters and far
> standing controller for scenatio A, while we may have near standing
> controller?".
> I do believe that for some situations we just need in-kernel controller
> without any overhead and simple in-kernel interface.
I think the example of ARP contradicts your view and may apply well
here. For small setups, you can use in-kernel ARP.
To scale it you move things to arpd in user space. Unfortunately, ARP
has always been in the kernel for Linux; that may be the reason Alexey
never ripped it out totally.
> > We just have better architecture thats all ;->
>
> If we want to put as many as possible outside the kernel while it works
> better in kernel then we slowly go to meet microkernel and userspace
> thread for fs, for network, which will be controlled by broadcast
> messages for simplifying control protocol.
>
> But it is too far planes, so it is just "blah-blah" lyrics now... :)
Hehe. Tell the guy who wrote the OpenBSD song to make sure he doesn't quit
his day job ;->
There are a lot of people talking about moving pieces of the net stack
to user space. I am not of that religion yet because I haven't really
seen the value.
> BTW, I just reread OpenBSD's load balancing code...
> IT is different from that one which may be created with tc and it's
> extensions.
>
> They look into each packet and if it's "signature" is controlled by node
> and this node is master than process this packet.
> They have one arbiter for any dataflow, while Linux has many arbiters
> each of which may be controlled from userspace, that is the difference.
I think that's the wrong way to go about it.
What you need is to enter a simple rule set like:
filter: if you see ARP asking for our IPs
action: accept
filter (installed by carpd): if you see ARP for IP X
action: accept
.... repeat a few similar ARP rules by carpd for different IPs ...
default action: drop all ARPs.
This is installed in the datapath before the ARP code gets hit.
The additional accepts are entered by carpd when it receives CARP
packets which describe how to load balance.
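A conceptual sketch of the decision those rules encode (not tc syntax; the
names are made up for illustration): carpd keeps the set of IPs this node
currently answers for up to date, and ARP requests for anything else are
dropped before the ARP code sees them.

	#include <stdbool.h>
	#include <stdint.h>

	#define MAX_OWNED_IPS 16

	static uint32_t owned_ips[MAX_OWNED_IPS];	/* installed by carpd via netlink */
	static int n_owned_ips;

	/* accept an ARP request only if its target IP is one of "ours"
	 * under the current load-balancing split; default action: drop */
	static bool carp_arp_accept(uint32_t target_ip)
	{
		int i;

		for (i = 0; i < n_owned_ips; i++)
			if (owned_ips[i] == target_ip)
				return true;
		return false;
	}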
> So their schema may not be implemented in userspace CARP, while in
> Linux it may be implemented using tc extensions with userspace CARP.
You could do the above in the kernel. It means every time I want to make
changes I now have to change the kernel.
> I will rewrite my resume:
> With your approach any data flow MUST go through userspace arbiters with
> all overhead and complexity. With my approach any data flow _MAY_ go
> through userspace arbiters, but if you do_need/only_has in-kernel access
> than using in-kernel CARP is the only solution.
Evgeniy, this is the most valuable argument you have for in-kernel. I
suggest dropping all the other ones because they are red herrings, and let's
focus on this one.
> My main idea for in-kernel CARP was to implement invisible HA mechanism
> suitable for in-kernel use. You do not need to create netlink protocol
> parser, you do not need to create extra userspace overhead, you do not
> need to create suitable for userspace control hooks in kernel
> infrastructure. Just register callback.
> But even with such simple approach you have opportunity to collaborate
> with userspace. If you need.
>
> Why creating all userspace cruft if/when you need only kernel one?
Because of all the reasons I have mentioned so far ;->
Again, I am not against kernel helpers. I am against putting CARP in the
kernel.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 14:00 ` Evgeniy Polyakov
@ 2004-07-17 16:29 ` jamal
2004-07-17 20:03 ` Evgeniy Polyakov
0 siblings, 1 reply; 28+ messages in thread
From: jamal @ 2004-07-17 16:29 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Sat, 2004-07-17 at 10:00, Evgeniy Polyakov wrote:
> On 17 Jul 2004 08:47:34 -0400
> jamal <hadi@cyberus.ca> wrote:
> jamal> App2app doesnt have to go across kernel unless it turns out it is
> jamal> the best way.
> jamal> Alternatives include: unix or local host sockets, IPCs such as
> jamal> pipes or
> jamal> just shared libraries.
>
> MICROKERNEL, I see it :)
Maybe subconsciously, but not intentionally ;->
> Non broacast/multicast will _strongly_ complicate protocol.
> Broadcast will waste apprication/kernel "bandwidth".
You could run multicast UDP over localhost; but that will be valuable if
you have a one-to-many relationship. I guess there's such a relationship
between CARPd and other apps.
>
> jamal> The interesting thing about CARP is the ARP balancing feature in
> jamal> which X nodes
> jamal> maybe masters of different IP flows all within the
> jamal> same subnet.
> jamal> VRRP load balances by subnet. I am not sure how
> jamal> challenge this will present to
> jamal> to ctsyncd.
>
> CARP may do it, but it requires in-kernel hack into arp code.
> Actually OpenBSD's one has it's entry in if_ether.c so their CARP always
> has access to any network dataflow.
Look at my comment in the other email. Pick up 2.6.8-rc1 and you could do
that in a heartbeat.
> BTW, with your approach hack from arp code needs to send a message to
> userspace carp to ask if it "good or bad" packet.
> Or you need to create tc for arp.
carpd gets a policy to tell it what rules to install.
It installs them via netlink or tc.
Unwanted ARP packets get dropped before the ARP code sees them.
> Or to communicate with in-kernel CARP. :)
> userspace
> |
> -----------------+-------------------------------
> CARP kernelspace
> |
> |
> +----------+-----+-----+---------+-------
> | | | |
> ct_sync iSCSI e1000 CPU
>
>
> >>My main idea for in-kernel CARP was to implement invisible HA
> >>mechanism
> >>suitable for in-kernel use. You do not need to create netlink protocol
> >>parser, you do not need to create extra userspace overhead, you do not
> >>need to create suitable for userspace control hooks in kernel
> >>infrastructure. Just register callback.
> >>But even with such simple approach you have opportunity to collaborate
> >>with userspace. If you need.
>
> >>Why creating all userspace cruft if/when you need only kernel one?
>
> jamal>
> jamal> so we now move appA, B, C to the kernel too?
> jamal> There is absolutely no need to put this in kernel space.
> jamal> If you do this, your next step should be to put zebra in the
> jamal> kernel
>
> No.
> And this is the beauty of the in-kernel CARP.
> You _already_ has in-kernel parts which may need master/slave failover.
>
> You just need to connect it to arbiter.
Sure - such an arbiter could reside in user space too.
And apps could connect to it as well.
An app wishing to listen to mastership changes joins a UDP mcast group on
localhost. CARPd announces such changes on the localhost mcast channel.
To make it more interesting, allow apps to query mastership and
other state.
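A minimal userspace sketch of such a listener; the group address 239.255.0.1,
the port, and the plain-text payload are assumptions for illustration only:

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		struct sockaddr_in addr;
		struct ip_mreq mreq;
		char buf[256];
		ssize_t n;

		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		addr.sin_addr.s_addr = htonl(INADDR_ANY);
		addr.sin_port = htons(4444);		/* assumed port */
		bind(fd, (struct sockaddr *)&addr, sizeof(addr));

		/* join the (assumed) CARPd announcement group on loopback */
		mreq.imr_multiaddr.s_addr = inet_addr("239.255.0.1");
		mreq.imr_interface.s_addr = inet_addr("127.0.0.1");
		setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

		while ((n = recv(fd, buf, sizeof(buf) - 1, 0)) > 0) {
			buf[n] = '\0';
			printf("mastership event: %s\n", buf);	/* e.g. "master"/"backup" */
		}
		close(fd);
		return 0;
	}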
> With userspace you _need_ to create all those Apps connected to
> userspace carp, with in-kernel CARP you need to just register callback.
> One function call.
Maybe I didn't explain it well. Only apps interested in CARP activities
connect to it; such an app would be ctsyncd. If you use shared
libraries, then you register a callback. Or you could use the localhost
mcast example I gave above.
> BTW, someone created tux, khtpd, knfsd :)
I thought there were people who can beat tux from userspace these days
by virtue of the numbers. But note again that things like these are
datapath-level apps, unlike CARP.
> But i think zebra must live in userspace, since it do not need to
> control any kernel parameters.
>
> CARP _may_ control kernel parameters.
> If you do not need in-kernel functionality just use UCARP.
I am not sure I follow. You are proposing to do something like arp/arpd
now? Look at that code.
> jamal> If you prove that it is too expensive to put it in user space
> jamal> then prove it and lets
> jamal> have a re-discussion
>
> Hey-ho, easily :)
>
> Consider embedded processors.
> Numbers: ppc405gp, 200mhz, 32mb sdram.
> Application - 4-8 DSP processors controlled by ppc.
> Each dsp processor generates 6-8 bytes frame with 8khz frequency in
> each channel(from 1 to 2).
> Driver reads data from each DSP and doing some postprocessing(mainly
> split it into B/D channels). Driver has clever mapping so
> userspace<->kernelspace dataflow may be zerocopied.
Sure. Maybe mmap would suffice.
> Kernelspace processing takes up to 133mghz of 200.
How did you measure this?
> Consider userspace application that
> a. makes PCM stereo from different B/D logical channels (zerocopied from
> kernelspace).
> b. send it into network (using tcp by bad historical/compatibility
> reasons).
>
> Situation: if we have one userspace process(or even thread) per DSP,
> than context switching takes too long time and we see data corruption.
> None network parameter(100 mb network) can improve situation.
> Only one process per 4 DSP may send data into network stack without any
> data loss.
I am surprised about the threads being problematic in context switching.
> P.S. It is 2.4.25 kernel.
I still don't like what you have described above ;-> It needs to be
quantitative instead of qualitative, i.e. "here are some numbers when X was
done and here are the numbers when Y was done".
> I do believe that Peter Chubb (peterc@gelato.unsw.edu.au) will talk
> about big machines where big tasks _may_ have big time latencies.
>
> May Oracle have little latencies? May. But it also _may_ have big
> latencies. Why not?
>
> DSP and sound/video capturing _may_not_ have big latencies.
>
> Although I do think that talk about userspace drivers is not an issue in
> our discussion :)
I agree. Let me summarize what I think is the most valuable thing you
have said so far - you could disagree, but this is my opinion of the
most valuable thing you said:
in the model where all things have to cross the userspace-kernel boundary,
there is some cost associated. This is plausible when such crossings get
to be _very_ frequent. "_very_ frequent" needs to be quantified.
I claim from my experiences (running on a small 824x PPC) that the cost is
highly exaggerated.
How about this: look at the way arp does things and emulate it.
The way arp does it is still insufficient, because it maintains a
threshold and only when that is exceeded do control packets
get sent to user space.
You should have a sysctl so that your code ships things to user space
every time the sysctl is set.
This is easy to do if you write the whole thing as a tc action instead
of a device driver.
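A hedged sketch of such a sysctl against the 2.6-era API; the
carp_notify_user name, the ctl_name values and the carp_netlink_broadcast()
helper are assumptions, not part of the posted patch:

	#include <linux/sysctl.h>

	static int carp_notify_user;	/* 0: kernel callbacks only, 1: also ship events up */

	extern void carp_netlink_broadcast(int new_state);	/* assumed helper */

	static struct ctl_table carp_table[] = {
		{
			.ctl_name	= 1,	/* assumed private id */
			.procname	= "carp_notify_user",
			.data		= &carp_notify_user,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= &proc_dointvec,
		},
		{ .ctl_name = 0 }
	};

	static struct ctl_table carp_dir[] = {
		{ .ctl_name = 200, .procname = "carp", .mode = 0555, .child = carp_table, },
		{ .ctl_name = 0 }
	};

	static struct ctl_table carp_root[] = {
		{ .ctl_name = CTL_NET, .procname = "net", .mode = 0555, .child = carp_dir, },
		{ .ctl_name = 0 }
	};

	static struct ctl_table_header *carp_sysctl_header;

	static void carp_sysctl_register(void)
	{
		/* second argument is "insert at head" in 2.6-era kernels */
		carp_sysctl_header = register_sysctl_table(carp_root, 0);
	}

	/* then, on a state transition in the protocol code: */
	static void carp_state_change(int new_state)
	{
		if (carp_notify_user)
			carp_netlink_broadcast(new_state);
		/* ... and run the in-kernel callbacks as before ... */
	}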
> Evgeniy Polyakov ( s0mbre )
>
> Only failure makes us experts. -- Theo de Raadt
To support mr de Raadt above:
"repeating failures makes you a sinner"
In other words, learn from the failures.
cheers,
jamal
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 16:29 ` jamal
@ 2004-07-17 20:03 ` Evgeniy Polyakov
2004-07-17 20:32 ` jamal
0 siblings, 1 reply; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-17 20:03 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
On 17 Jul 2004 12:29:41 -0400
jamal <hadi@cyberus.ca> wrote:
> > jamal> The interesting thing about CARP is the ARP balancing feature
> > in jamal> which X nodes
> > jamal> maybe masters of different IP flows all within the
> > jamal> same subnet.
> > jamal> VRRP load balances by subnet. I am not sure how
> > jamal> challenge this will present to
> > jamal> to ctsyncd.
> >
> > CARP may do it, but it requires in-kernel hack into arp code.
> > Actually OpenBSD's one has it's entry in if_ether.c so their CARP
> > always has access to any network dataflow.
>
> Look at my comment in other email. Pick 2.6.8-rc1 and you could do
> that in a hearbeat.
Sure.
It is an example where a kernel helper already exists.
And it is similar to the other network arbiters - iptables, tc...
> > BTW, with your approach hack from arp code needs to send a message
> > to userspace carp to ask if it "good or bad" packet.
> > Or you need to create tc for arp.
>
> carpd gets a policy to tell it what rules to install.
> It installs them via netlink or tc.
> Unwanted arp packets get dropped before they ARP code sees them.
It is easy only because they already exist.
I do not argue against it.
Some situations may be easily controlled from userspace, but not
all.
And for those we need an in-kernel solution - it may be a kernel helper plus a
userspace arbiter, or it may be just a kernel callback.
> > jamal> so we now move appA, B, C to the kernel too?
> > jamal> There is absolutely no need to put this in kernel space.
> > jamal> If you do this, your next step should be to put zebra in the
> > jamal> kernel
> >
> > No.
> > And this is the beauty of the in-kernel CARP.
> > You _already_ has in-kernel parts which may need master/slave
> > failover.
> >
> > You just need to connect it to arbiter.
>
> Sure - such arbitrer could reside in user space too.
> And apps could connect to it as well.
> App wishing to listen to mastership changes joins a UDP mcast on
> localhost. CARPd announces such changes on localhost mcast channel.
> To make it more interesting, allow apps to query mastership and
> other state.
But why do you want to create this extra application when you already
have the ability to control it?
Why does someone need to create a userspace application and a kernel helper
for it, when they only require in-kernel access?
> > With userspace you _need_ to create all those Apps connected to
> > userspace carp, with in-kernel CARP you need to just register
> > callback. One function call.
>
> Maybe i didnt explain well. Only apps interested in carp activities
> connect to it; such an app would be ctsyncd. If you use shared
> libraries, then you register a callback. Or you could use localhost
> mcast example i gave above.
I absolutely agree with you.
All your arguments are just right.
But your whole approach is not good for _every_ situation.
> > BTW, someone created tux, khtpd, knfsd :)
>
> I thoughth there were people who can beat tux from userspace these
> days by virtue of numbers. But note again that things like these are
> datapath level apps unlike CARP.
Sure, it is just an example that if something is good at something, then
there is no reason to move it around.
Userspace is good, but not for everything.
> > But i think zebra must live in userspace, since it do not need to
> > control any kernel parameters.
> >
> > CARP _may_ control kernel parameters.
> > If you do not need in-kernel functionality just use UCARP.
>
> I am not sure i follow. You are proposing to do something like
> arp/arpd now? Look at that code.
It is good for arp that we can control it from userspace.
But I do not see any good reason to control everything from userspace.
> > jamal> If you prove that it is too expensive to put it in user space
> > jamal> then prove it and lets
> > jamal> have a re-discussion
> >
> > Hey-ho, easily :)
> >
> > Consider embedded processors.
> > Numbers: ppc405gp, 200mhz, 32mb sdram.
> > Application - 4-8 DSP processors controlled by ppc.
> > Each dsp processor generates 6-8 bytes frame with 8khz frequency in
> > each channel(from 1 to 2).
> > Driver reads data from each DSP and doing some postprocessing(mainly
> > split it into B/D channels). Driver has clever mapping so
> > userspace<->kernelspace dataflow may be zerocopied.
>
> Sure. Maybe mmap would suffice.
>
> > Kernelspace processing takes up to 133mghz of 200.
>
> How did you measure this?
get_cycles() is my friend.
About 70 MHz for DSP reading and the same for postprocessing.
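Roughly like this, as a sketch - process_dsp_frame() is just a stand-in for
whatever section is being timed:

	#include <asm/timex.h>
	#include <linux/kernel.h>

	extern void process_dsp_frame(void *frame);	/* stand-in for the measured path */

	static void measure_one_frame(void *frame)
	{
		cycles_t t0, t1;

		t0 = get_cycles();
		process_dsp_frame(frame);
		t1 = get_cycles();

		/* accumulate or log per-frame cycle counts */
		printk(KERN_DEBUG "frame processing took %lu cycles\n",
		       (unsigned long)(t1 - t0));
	}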
> > Consider userspace application that
> > a. makes PCM stereo from different B/D logical channels (zerocopied
> > from kernelspace).
> > b. send it into network (using tcp by bad historical/compatibility
> > reasons).
> >
> > Situation: if we have one userspace process(or even thread) per DSP,
> > than context switching takes too long time and we see data
> > corruption. None network parameter(100 mb network) can improve
> > situation. Only one process per 4 DSP may send data into network
> > stack without any data loss.
>
> I am suprised abou the threads being problematic in context switch.
Me too; probably they would survive with fewer threads, not tested.
The maximum configuration has 16 digital channels with 8 bytes at 8 kHz
each.
16 threads cannot handle this.
BTW, I lie, it already has 2 processes plus threads or additional
processes.
> > P.S. It is 2.4.25 kernel.
>
> I still dont like what you have described above ;-> It needs to be
> qunatitative instead of qualitative. i.e "heres some numbers when X
> was done and heres the numbers when Y was done".
Hmmm...
It works if we have a small number of context switches and does not work
otherwise in the above configuration.
Almost what you asked for :)
> > I do believe that Peter Chubb (peterc@gelato.unsw.edu.au) will talk
> > about big machines where big tasks _may_ have big time latencies.
> >
> > May Oracle have little latencies? May. But it also _may_ have big
> > latencies. Why not?
> >
> > DSP and sound/video capturing _may_not_ have big latencies.
> >
> > Although I do think that talk about userspace drivers is not an
> > issue in our discussion :)
>
>
> I agree. Let me summarize what i think is the most valuable thing you
> have said so far - you could disagree, but this is my opinion of what
> i think the most valuable thing you said :
>
> in the model where all things have to cross userspace-kernel boundary,
> there is some cost associated. This is plausible when such crossings
> get to be _very_ frequent. _very frequent needs to be quantified.
> I claim from my experiences (running on small 824x ppc) that the cost
> is highly exagerated.
> How about this: Look at the way arp does things and emulate it.
> The way arp does it is still insufficient because it maintains a
> threshold first that when exceeded is the only time control packets
> get sent to user space.
> You should have a sysctl where your code ships things to user space
> every time when the systcl is set.
> This is easy to do if you wrote the whole thing as a tc action instead
> of a device driver.
Sure.
I totally agree.
And I agree with your solution.
It is right for almost all situations.
But if you do not need a userspace arbiter and a kernel helper, you do not
need to create them. Just use the in-kernel solution.
>
> > Evgeniy Polyakov ( s0mbre )
> >
> > Only failure makes us experts. -- Theo de Raadt
>
> To support mr de Raadt above:
>
> "repeating failures makes you a sinner"
> In other words, learn from the failures.
Or the sinner just cannot learn :)
Thank you for the interesting discussion. I think we each see the other's
position, we see its advantages and disadvantages, but we just have slightly
different views :)
With best regards.
>
> cheers,
> jamal
>
Evgeniy Polyakov ( s0mbre )
Only failure makes us experts. -- Theo de Raadt
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 15:47 ` jamal
@ 2004-07-17 20:04 ` Evgeniy Polyakov
0 siblings, 0 replies; 28+ messages in thread
From: Evgeniy Polyakov @ 2004-07-17 20:04 UTC (permalink / raw)
To: hadi; +Cc: netdev, netfilter-failover
On 17 Jul 2004 11:47:43 -0400
jamal <hadi@cyberus.ca> wrote:
> On Sat, 2004-07-17 at 08:59, Evgeniy Polyakov wrote:
> > On 17 Jul 2004 07:52:09 -0400
> > jamal <hadi@cyberus.ca> wrote:
>
> > > BTW, theres a very nice paper being presented at OLS by someone
> > > from.au who is trying infact to move drivers to user space ;->
> >
> > I saw it...
> > May I not comment on it? I do not want to look like a rude freak... :)
>
> rude freaks are not frowned upon here;-> They are loved (ok, maybe
> some uptight people may have a problem with it;->).
> But maybe we should keep that thread separate from this.
:) Sure.
> > > Big difference though with CARP. CARP shouldnt need to process
> > > 100Kpps; but even if it did, CARP packets contain control
> > > information that is valuable in policy settings. Control protocols
> > > tend to be "rich" and evolve over much shorter periods of time.
> > > A better comparison to what you are saying is moving OSPF to the
> > > kernel.
> >
> > Only for now, since we can imagine only some examples now.
> > When the number of agents controlled by/connected to CARP becomes
> > significant, the broadcasting and userspace arbiter's overhead may not
> > satisfy HA needs.
>
> So let me put your fears to rest and share my experiences:
> I have some experience in using VRRP in a variety of very large,
> critical and at times very weird senseless setups. This is with some
> of the most anal telco types you can come across. They protect the
> network uptime just as if it was a part of their body. I hate to use
> dirty cliches like "carrier grade" - but it would probably be the
> closest qualifier; Telcos dont let you mess around their setups and
> create any holes which will bring down anything for a few seconds.
> Note, this is with VRRP running in user space. In _every_ case i have
> been in, i have always been challenged as to why its not in the kernel
> or running as realtime process. and in all cases, running in user
> space didnt prove to be the problem. The biggest challenge was fixing
> broadcast storms because someone created a bcast loop in which case
the machine is under DoS attack. The other valuable thing to do is to
> make sure that VRRP packets (as any other control packets) get higher
> priority in the network.
>
> BTW, If you are thinking of instantiating carpd for every agent, then
> you got to rethink that plan. Hint: You need to handle all carp
> protocol within one daemon. Maybe thats what you are saying but only
> to do it in the kernel.
>
> Broadcasts: I wasnt sure what you meant.
Netlink broadcast...
I absolutely agree with you that a userspace heartbeat application may
satisfy most of the current needs.
But if you need kernel access you need to create not only the kernel part,
but also some userspace controlling application. So we will have carpd,
which controls a userspace application, which in turn controls kernel parts.
Maybe that is a solution, but sometimes it is better just to use the existing
kernel part with the provided in-kernel CARP.
Your position is that any control arbiter must live in userspace, but I
offer an in-kernel solution for those who need only in-kernel access.
> > Control, but it must have the possibility to control any dataflow
> > element. If using all_flows_one_arbiter, then we must have a
> > near-standing controller like in-kernel CARP.
> > If using one_flow_one_arbiter (like tc) then we may use a far-standing
> > control mechanism and a near-standing arbiter.
> > Like qdisc + tc + ucarp.
> >
> > The question is: "Do we need to create near-standing arbiters and a
> > far-standing controller for scenario A, while we may have a
> > near-standing controller?".
> > I do believe that for some situations we just need an in-kernel
> > controller without any overhead and a simple in-kernel interface.
>
> I think the example of ARP contradicts your view and may apply well
> here. For small setups, you can use in-kernel ARP.
> To scale it you move things to arpd in user space. Unfortunately, ARP
> has always been in the kernel for Linux; so that may be the reason
> Alexey never ripped it out totally.
Yep, it is a good example.
But this medal has another side: linux handles all network traffic only
on one processor at a time. Consider a userspace process that binds to a
receiving interface or some address or even an skb; such processes may
be processed in parallel, may have prioritisation and so forth...
A ridiculous example, i know, and absolutely exaggerated, but moving from
kernel to userspace is not always a good idea.
Even if it is not dataflow and just rare control.
> > BTW, I just reread OpenBSD's load balancing code...
> > It is different from the one which may be created with tc and its
> > extensions.
> >
> > They look into each packet, and if its "signature" is controlled by a
> > node and this node is master, then they process this packet.
> > They have one arbiter for any dataflow, while Linux has many
> > arbiters, each of which may be controlled from userspace; that is the
> > difference.
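Roughly, the per-packet check being described could look like the sketch
below. This is only a guess at the idea, not the actual OpenBSD code; the
modulo is a trivial stand-in for the real hash:

/* Illustration only: process a packet on the node that "owns" its
 * signature.  Here the signature is just the source IP hashed over the
 * number of cluster nodes; the real code keeps its own hash and state. */
static int carp_lb_accept(u32 saddr, int nr_nodes, int my_index, int i_am_master)
{
	unsigned int slot = saddr % nr_nodes;	/* trivial stand-in hash */

	return i_am_master && slot == (unsigned int)my_index;
}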
>
> I think thats the wrong way to go about it.
> What you need is to enter a simple rule like:
>
> filter: If you see ARP asking for our IPs
> action: Accept
> filter(installed by carpd): if you see ARP for IP X
> action: accept
> .... repeat a few similar ARP rules by carpd for different IPs ..
> default action: drop all ARPs.
>
> This is installed in the datapath before ARP code gets hit.
> The additional accepts are entered by carpd when it receives CARP
> packets which describe how to load balance.
Sure.
It is indeed the correct way of doing this.
But we are just lucky guys and can control this dataflow in such a way.
Not everything may be fitted into such a scenario.
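For concreteness, the default-drop ARP filter sketched above might reduce to
a datapath check like this; the table and helper names are hypothetical:

/* Illustration of the rule set described above: accept ARP for our own
 * IPs and for IPs that carpd has installed, drop everything else.
 * The names and the fixed-size table are made up for this sketch. */
#define CARP_MAX_VIPS 16

static u32 carp_own_ip;				/* our primary address      */
static u32 carp_vips[CARP_MAX_VIPS];		/* entries added by carpd   */
static int carp_nr_vips;

static int carp_arp_accept(u32 target_ip)
{
	int i;

	if (target_ip == carp_own_ip)		/* rule: ARP asking for us  */
		return 1;

	for (i = 0; i < carp_nr_vips; i++)	/* rules installed by carpd */
		if (carp_vips[i] == target_ip)
			return 1;

	return 0;				/* default action: drop     */
}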
> > So their scheme may not be implemented in userspace CARP, while in
> > Linux it may be implemented using tc extensions with userspace CARP.
>
> You could do the above in the kernel. It means every time i want to
> make changes i now have to change the kernel.
>
> > I will restate my summary:
> > With your approach any data flow MUST go through userspace arbiters
> > with all the overhead and complexity. With my approach any data flow
> > _MAY_ go through userspace arbiters, but if you do_need/only_have
> > in-kernel access then using in-kernel CARP is the only solution.
>
> Evgeniy, this is the most valuable argument you have for in-kernel. I
> suggest dropping all the other ones because they are red herrings and lets
> focus on this one.
>
> > My main idea for in-kernel CARP was to implement an invisible HA
> > mechanism suitable for in-kernel use. You do not need to create a
> > netlink protocol parser, you do not need to create extra userspace
> > overhead, you do not need to create hooks in the kernel infrastructure
> > suitable for userspace control. Just register a callback.
> > But even with such a simple approach you have the opportunity to
> > collaborate with userspace. If you need to.
> >
> > Why creating all userspace cruft if/when you need only kernel one?
>
> Because of all the reasons i have mentioned so far ;->
> Again, i am not against kernel helpers. I am against putting CARP in
> the kernel.
Okay.
I will summarize our views:
on one side we have an in-kernel solution which may control any process
there, which may be used for in-kernel access.
On the other side we have a userspace solution which requires kernel
helpers.
One may use either; the question is, does the concrete situation require
userspace control? If not, then using the in-kernel solution is preferable.
You think that any in-kernel access must have a userspace arbiter, but i
do think that in some situations we do not need it. For such situations
you do not need to create a kernel helper and userspace controller, only
in-kernel CARP.
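As an illustration of the "just register a callback" usage for a purely
in-kernel consumer; the event type and carp_register_notifier() are assumed
names for this sketch, not the interface from the posted patch:

/* Hypothetical in-kernel consumer: no netlink, no daemon, just a
 * callback invoked on master/backup transitions. */
enum carp_event { CARP_BECAME_MASTER, CARP_BECAME_BACKUP };

static void my_carp_event(enum carp_event ev)
{
	if (ev == CARP_BECAME_MASTER) {
		/* e.g. bring up services, load firewall rules */
	} else {
		/* e.g. go passive, stop answering for the shared IP */
	}
}

static int __init my_ha_init(void)
{
	return carp_register_notifier(my_carp_event);
}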
> cheers,
> jamal
>
Evgeniy Polyakov ( s0mbre )
Only failure makes us experts. -- Theo de Raadt
* Re: [1/2] CARP implementation. HA master's failover.
2004-07-17 20:03 ` Evgeniy Polyakov
@ 2004-07-17 20:32 ` jamal
0 siblings, 0 replies; 28+ messages in thread
From: jamal @ 2004-07-17 20:32 UTC (permalink / raw)
To: johnpol; +Cc: netdev, netfilter-failover
On Sat, 2004-07-17 at 16:03, Evgeniy Polyakov wrote:
> Thank you for the interesting discussion, I think we see each other's
> position, we see its advantages and disadvantages, but we have just a
> bit different views :)
Indeed. I would say we have agreed to disagree ;->
[Maybe at some point when i get excited about CARP you will see
alternative (user space based) code. I am hoping someone else (like
Alexandre Cassen) would beat me to it.]
cheers,
jamal
* Re: [nf-failover] Re: [1/2] CARP implementation. HA master's failover.
2004-07-16 13:04 ` jamal
2004-07-16 15:06 ` Evgeniy Polyakov
@ 2004-07-19 7:16 ` KOVACS Krisztian
2004-07-20 2:38 ` Harald Welte
2004-07-20 14:24 ` jamal
1 sibling, 2 replies; 28+ messages in thread
From: KOVACS Krisztian @ 2004-07-19 7:16 UTC (permalink / raw)
To: hadi; +Cc: johnpol, netdev, Netfilter-failover list
Hi,
On Fri, 2004-07-16 at 15:04, jamal wrote:
> Looking at what Harald has, the infrastructure seems to be the correct
> flavor. Seems something gets sent to user space via netlink and gets
> delivered via keepalived.
Unfortunately this is not the case, as Evgeniy already mentioned.
ct_sync is currently a completely in-kernel solution, with all the pros
and cons of that. (Yes, it could be done in userspace with some minimal
kernel code, and yes, it would have a few advantages over the current solution.
However, the kernel-side "agent" code would still be quite heavy-weight.
Unfortunately Netfilter's conntrack subsystem is more complicated than
that of OpenBSD's pf. And the current code is not designed that way, so
I think it would be better to first try to finish the current project,
and then think about what should be done in a completely different way.)
> I think the CARP loadbalancing feature is an improvement over what is
> being suggested by Harald.
What do you mean by that? Of course, it is a serious weakness of the
current code that it is not capable of load balancing, only failover
with passive slaves. However, load balancing would probably make things
a lot more complicated. For example, see NAT-related problems described
by Lennert Buytenhek here:
http://lists.netfilter.org/pipermail/netfilter-failover/2001-September/000043.html
> I have to say as well i am shocked that state is just being transferred
> blindly - but i will deal with Harald when he shows up in Ottawa ;->
Would it be possible to summarize your ideas here? Yes, I know it is
easier and faster to talk about those things in person, but
unfortunately I won't be there in Ottawa, but am of course seriously
interested in all ideas related to ct_sync...
--
Regards,
Krisztian KOVACS
* Re: [nf-failover] Re: [1/2] CARP implementation. HA master's failover.
2004-07-19 7:16 ` [nf-failover] " KOVACS Krisztian
@ 2004-07-20 2:38 ` Harald Welte
2004-07-20 14:24 ` jamal
1 sibling, 0 replies; 28+ messages in thread
From: Harald Welte @ 2004-07-20 2:38 UTC (permalink / raw)
To: KOVACS Krisztian; +Cc: hadi, johnpol, netdev, Netfilter-failover list
[-- Attachment #1: Type: text/plain, Size: 2508 bytes --]
On Mon, Jul 19, 2004 at 09:16:07AM +0200, KOVACS Krisztian wrote:
> Unfortunately this is not the case, as Evgeniy already mentioned.
> ct_sync is currently an completely in-kernel solution, with all the pros
> and cons of that. (Yes, it could be done in userspace with some minimal
> kernel code, and yes, it had a few advantages over the current solution.
I strongly disagree with this. Or, let's say - while it is doable in
userspace, it would be at a very high cost... additional latency, a lot
more per-packet data copying, and the possibility of being starved
by some other random userspace application.
> > I think the CARP loadbalancing feature is an improvement over what is
> > being suggested by Harald.
Yes, I would also be interested in what Jamal was referring to. I
cannot really remember having suggested anything as a 'cluster manager'.
> > I have to say as well i am shocked that state is just being transferred
> > blindly - but i will deal with Harald when he shows up in Ottawa ;->
>
> Would it be possible to summarize your ideas here? Yes, I know it is
> easier and faster to talk about those things in person, but
> unfortunately I won't be there in Ottawa, but am of course seriously
> interested in all ideas related to ct_sync...
I was just talking to Jamal today. His basic proposal was something
like:
more than 90% of all connections don't last longer than 10 seconds, so we
only replicate connections that exist for more time. I argued that this was
not the level of synchronization that we want to achieve, and that in
such a case he could just skip any kind of sync code and live with the
existing 'connection pickup' code of ip_conntrack (an ACK from both sides
being able to establish a connection even if no initial SYN handshake was
seen).
Jamal apparently wasn't aware of this.
Anyway, there will be some more discussion after my presentation at OLS
later this week. I'll keep you posted.
> Regards,
> Krisztian KOVACS
btw: I think we should drop netdev from further discussions on ct_sync,
since there is a more specific mailing list.
--
- Harald Welte <laforge@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: [nf-failover] Re: [1/2] CARP implementation. HA master's failover.
2004-07-19 7:16 ` [nf-failover] " KOVACS Krisztian
2004-07-20 2:38 ` Harald Welte
@ 2004-07-20 14:24 ` jamal
1 sibling, 0 replies; 28+ messages in thread
From: jamal @ 2004-07-20 14:24 UTC (permalink / raw)
To: KOVACS Krisztian; +Cc: johnpol, netdev, Netfilter-failover list
On Mon, 2004-07-19 at 03:16, KOVACS Krisztian wrote:
> Hi,
>
> On Fri, 2004-07-16 at 15:04, jamal wrote:
> > Looking at what Harald has, the infrastructure seems to be the correct
> > flavor. Seems something gets sent to user space via netlink and gets
> > delivered via keepalived.
>
> Unfortunately this is not the case, as Evgeniy already mentioned.
> ct_sync is currently a completely in-kernel solution, with all the pros
> and cons of that. (Yes, it could be done in userspace with some minimal
> kernel code, and yes, it would have a few advantages over the current solution.
> However, the kernel-side "agent" code would still be quite heavy-weight.
> Unfortunately Netfilter's conntrack subsystem is more complicated than
> that of OpenBSD's pf. And the current code is not designed that way, so
> I think it would be better to first try to finish the current project,
> and then think about what should be done in a completely different way.)
Thats fine. So you may have to use Evgeniy's in-kernel implementation for
now until things get better. How do you interact with keepalived?
> > I think the CARP loadbalancing feature is an improvement over what is
> > being suggested by Harald.
>
> What do you mean by that? Of course, it is a serious weakness of the
> current code that it is not capable of load balancing, only failover
> with passive slaves. However, load balancing would probably make things
> a lot more complicated. For example, see NAT-related problems described
> by Lennert Buytenhek here:
>
> http://lists.netfilter.org/pipermail/netfilter-failover/2001-September/000043.html
i couldnt access that for some reason. Seems the wifi firewall is
blocking any web access.
What i meant was CARP with that feature in fact has an active-active
setup - ct_sync seems to be purely active-backup.
> > I have to say as well i am shocked that state is just being transferred
> > blindly - but i will deal with Harald when he shows up in Ottawa ;->
>
> Would it be possible to summarize your ideas here? Yes, I know it is
> easier and faster to talk about those things in person, but
> unfortunately I won't be there in Ottawa, but am of course seriously
> interested in all ideas related to ct_sync...
I talked briefly to Harald in the hallway and will attend his talk. I
may understand a little more about ct_sync - and hopefully a lot more
after his talk.
My comments were based on the fact that most flows are so
shortlived that there is no point in backing them up. Human nature
on a web page that is taking too long to load (for example because the
firewall failed) is to hit the reload button - in which case the connection
tracking will be established from the beginning on the failed-over-to
node. I will try to post a paper or two that have results on lifetimes
of flows etc when i have a better network connection. What i remember
from one is that the majority of flows dont last longer than 10 secs
to begin with.
So back to what i was saying earlier, and my .ca $0.02:
If the majority of the flows are only lasting that long, is there
any value in backing them up? One argument i can see made is that you want
to speed up the lookup etc when 100K flows migrate to the backed-up-to
node.
my opinion is the following:
1) it would be valuable to back up the rules if any exist and sync
things across.
2) dont blindly migrate connection states until they are established and
have probably lasted more than 10-15 seconds.
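A minimal sketch of that sync policy as a per-connection check; the struct
fields, helper name and 15-second threshold are assumptions for
illustration, not ct_sync internals:

/* Illustration of the "only replicate mature connections" policy:
 * replicate a connection only once it is established and has lived
 * past a threshold. */
#define CT_SYNC_MIN_AGE_SECS 15

struct my_conn {
	int  established;	/* TCP handshake completed       */
	long created_secs;	/* creation time, in seconds     */
};

static int ct_should_sync(const struct my_conn *ct, long now_secs)
{
	if (!ct->established)
		return 0;	/* skip half-open flows          */
	if (now_secs - ct->created_secs < CT_SYNC_MIN_AGE_SECS)
		return 0;	/* skip short-lived flows        */
	return 1;		/* worth replicating to the peer */
}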
cheers,
jamal
end of thread
Thread overview: 28+ messages
[not found] <1089898303.6114.859.camel@uganda>
2004-07-15 13:36 ` [1/2] CARP implementation. HA master's failover Evgeniy Polyakov
2004-07-15 14:44 ` jamal
2004-07-15 15:27 ` Evgeniy Polyakov
2004-07-15 15:55 ` Evgeniy Polyakov
2004-07-15 16:28 ` jamal
2004-07-15 16:59 ` Evgeniy Polyakov
2004-07-15 17:30 ` jamal
2004-07-15 19:20 ` Evgeniy Polyakov
2004-07-16 12:34 ` jamal
2004-07-16 15:06 ` Evgeniy Polyakov
2004-07-17 11:52 ` jamal
2004-07-17 12:59 ` Evgeniy Polyakov
2004-07-17 15:47 ` jamal
2004-07-17 20:04 ` Evgeniy Polyakov
2004-07-15 16:07 ` jamal
2004-07-15 16:59 ` Evgeniy Polyakov
2004-07-15 17:24 ` jamal
2004-07-15 19:53 ` Evgeniy Polyakov
2004-07-16 13:04 ` jamal
2004-07-16 15:06 ` Evgeniy Polyakov
2004-07-17 12:47 ` jamal
2004-07-17 14:00 ` Evgeniy Polyakov
2004-07-17 16:29 ` jamal
2004-07-17 20:03 ` Evgeniy Polyakov
2004-07-17 20:32 ` jamal
2004-07-19 7:16 ` [nf-failover] " KOVACS Krisztian
2004-07-20 2:38 ` Harald Welte
2004-07-20 14:24 ` jamal