Netdev List
* RE: [PATCH 21/29] ioat2,3: dynamically resize descriptor ring
From: Sosnowski, Maciej @ 2009-09-14 15:00 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023216.32667.55942.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> Increment the allocation order of the descriptor ring every time we run
> out of descriptors up to a maximum of allocation order specified by the
> module parameter 'ioat_max_alloc_order'.  After each idle period
> decrement the allocation order to a minimum order of
> 'ioat_ring_alloc_order' (i.e. the default ring size, tunable as a module
> parameter).
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

Just one thing:

> +static int ioat_ring_max_alloc_order = IOAT_MAX_ORDER;
> +module_param(ioat_ring_max_alloc_order, int, 0644);
> +MODULE_PARM_DESC(ioat_ring_max_alloc_order,
> +		 "ioat2+: upper limit for dynamic ring resizing (default: n=16)");
[...]
> --- a/drivers/dma/ioat/dma_v2.h
> +++ b/drivers/dma/ioat/dma_v2.h
> @@ -37,6 +37,8 @@ extern int ioat_pending_level;
>  #define IOAT_MAX_ORDER 16
>  #define ioat_get_alloc_order() \
>  	(min(ioat_ring_alloc_order, IOAT_MAX_ORDER))
> +#define ioat_get_max_alloc_order() \
> +	(min(ioat_ring_max_alloc_order, IOAT_MAX_ORDER))

Making max_alloc_order a module parameter gives the impression
that a user can modify it, including making it larger than the default.
However, the default is also its maximum value, which may be confusing.
Why not use the parameter only as the upper limit?
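
To illustrate the point, here is a minimal user-space sketch of the clamping
done by the quoted ioat_get_max_alloc_order() macro. The function name
effective_max_alloc_order() is hypothetical and exists only for this
illustration; the constant is taken from the quoted header.

```c
#define IOAT_MAX_ORDER 16

/*
 * Hypothetical user-space stand-in for the quoted macro
 * ioat_get_max_alloc_order(): the requested value is capped at
 * IOAT_MAX_ORDER, so it can only ever lower the limit.
 */
int effective_max_alloc_order(int ioat_ring_max_alloc_order)
{
	return ioat_ring_max_alloc_order < IOAT_MAX_ORDER ?
	       ioat_ring_max_alloc_order : IOAT_MAX_ORDER;
}
```

With this model, a request for order 20 yields 16, while a request for 12
is honoured as 12.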

Thanks,
Maciej


* RE: [PATCH 20/29] ioat: switch watchdog and reset handler from workqueue to timer
From: Sosnowski, Maciej @ 2009-09-14 14:59 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023211.32667.37259.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> In order to support dynamic resizing of the descriptor ring or polling
> for a descriptor in the presence of a hung channel the reset handler
> needs to make progress while in a non-preemptible context.  The current
> workqueue implementation precludes polling channel reset completion
> under spin_lock().
> 
> This conversion also allows us to return to opportunistic cleanup in the
> ioat2 case as the timer implementation guarantees at least one cleanup
> after every descriptor is submitted.  This means the worst case
> completion latency becomes the timer frequency (for exceptional
> circumstances), but with the benefit of avoiding busy waiting when the
> lock is contended.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

[...]
> --- a/drivers/dma/ioat/dma_v2.c
> +++ b/drivers/dma/ioat/dma_v2.c
> @@ -49,7 +49,7 @@ static void __ioat2_issue_pending(struct ioat2_dma_chan *ioat)
>  	void * __iomem reg_base = ioat->base.reg_base;
> 
>  	ioat->pending = 0;
> -	ioat->dmacount += ioat2_ring_pending(ioat);
> +	ioat->dmacount += ioat2_ring_pending(ioat);;
double semicolon

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>


* RE: [PATCH 19/29] ioat1: trim ioat_dma_desc_sw
From: Sosnowski, Maciej @ 2009-09-14 14:55 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023206.32667.35974.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> Save 4 bytes per software descriptor by transmitting tx_cnt in an unused
> portion of the hardware descriptor.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>


* Re: [PATCH 4/4] bonding: add sysfs files to display tlb and alb hash table contents
From: Andy Gospodarek @ 2009-09-14 14:45 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Andy Gospodarek, netdev, bonding-devel
In-Reply-To: <26430.1252705697@death.nxdomain.ibm.com>

On Fri, Sep 11, 2009 at 02:48:17PM -0700, Jay Vosburgh wrote:
> Andy Gospodarek <andy@greyhouse.net> wrote:
> 
> >bonding: add sysfs files to display tlb and alb hash table contents
> 
> 	Isn't it considered bad form to have sysfs files that kick out
> large amounts of data like this?  Not that I think this is a bad
> facility to have, just checking on the mechanism.
> 

I'm not aware of such a restriction -- though I'm sure at least one
person out there doesn't like it.

If that's the case, there are certainly a few files that should be
cleaned up:

# find -type f -exec wc -l {} 2> /dev/null \; | sort -r -n | head -10
1657 ./firmware/acpi/tables/SSDT
132 ./firmware/acpi/tables/dynamic/SSDT2
128 ./devices/pci0000:00/0000:00:1c.5/0000:3f:00.0/vpd
27 ./devices/system/node/node0/meminfo
24 ./devices/pnp0/00:08/options
24 ./devices/pnp0/00:07/options
12 ./devices/pci0000:00/0000:00:1e.0/resource
12 ./devices/pci0000:00/0000:00:1c.5/resource
12 ./devices/pci0000:00/0000:00:1c.4/resource
12 ./devices/pci0000:00/0000:00:1c.0/resource
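
One practical bound on the output size discussed here: a sysfs show()
callback is handed a single page-sized buffer, so a formatter that appends
row after row should stop at the end of that buffer. The following is a
user-space sketch of bounded appending, similar in spirit to the kernel's
scnprintf(). The helper name append_bounded() is an assumption made for this
example and is not part of the patch.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: append formatted text at buf + used without
 * writing past buf[size - 1]. Returns the new number of bytes used
 * (excluding the terminating NUL), never more than size - 1.
 */
size_t append_bounded(char *buf, size_t size, size_t used,
		      const char *fmt, ...)
{
	va_list ap;
	int n;

	if (used >= size)
		return used;		/* no room left, nothing written */

	va_start(ap, fmt);
	n = vsnprintf(buf + used, size - used, fmt, ap);
	va_end(ap);

	if (n < 0)
		return used;		/* formatting error, keep old length */
	if ((size_t)n >= size - used)
		return size - 1;	/* output was truncated, buffer is full */
	return used + (size_t)n;
}
```

A caller would thread the running count through each call, for example
count = append_bounded(buf, size, count, "%-7s %d\n", name, load), and can
stop iterating once count reaches size - 1.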


> >While debugging some problems with alb (mode 6) bonding I realized that
> >being able to output the contents of both hash tables would be helpful.
> >This is what the output looks like for the two files:
> >
> >device  load
> >eth1    491
> >eth2    491
> >hash device   last device   tx bytes       load        next previous
> >2    eth1     eth1          2254           491         0    0
> >3    eth2     eth2          2744           491         0    0
> >6             eth2          0              488         0    0
> >8             eth2          0              461698      0    0
> >1b            eth2          0              249         0    0
> >eb            eth2          0              21          0    0
> >ff            eth2          0              22          0    0
> >
> >hash ip_src          ip_dst          mac_dst           slave assign ntt
> >2    10.0.3.2        10.0.3.11       00:e0:81:71:ee:a9 eth1  1      0
> >3    10.0.3.2        10.0.3.10       00:e0:81:71:ee:a9 eth2  1      0
> >8    10.0.3.2        10.0.3.1        00:e0:81:71:ee:a9 eth2  1      0
> >
> >These were a great help debugging the fixes I have just posted and they
> >might be helpful for others, so I decided to include them in my
> >patchset.
> >
> >Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
> >
> >---
> > drivers/net/bonding/bond_alb.c   |   61 ++++++++++++++++++++++++++++++++++++++
> > drivers/net/bonding/bond_alb.h   |    2 +
> > drivers/net/bonding/bond_sysfs.c |   40 +++++++++++++++++++++++++
> > 3 files changed, 103 insertions(+), 0 deletions(-)
> >
> >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> >index 7db8835..4e930e3 100644
> >--- a/drivers/net/bonding/bond_alb.c
> >+++ b/drivers/net/bonding/bond_alb.c
> >@@ -778,6 +778,67 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
> > 	return tx_slave;
> > }
> >
> >+int rlb_print_rx_hashtbl(struct bonding *bond, char *buf)
> >+{
> >+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> >+	struct rlb_client_info *client_info;
> >+	u32 hash_index;
> >+	u32 count = 0;
> >+	
> >+	_lock_rx_hashtbl(bond);
> >+
> >+	count = sprintf(buf, "hash ip_src          ip_dst          mac_dst           slave assign ntt\n");
> >+	hash_index = bond_info->rx_hashtbl_head;
> >+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
> >+		client_info = &(bond_info->rx_hashtbl[hash_index]);
> >+		count += sprintf(buf + count,"%-4x %-15pi4 %-15pi4 %pM %-5s %-6d %d\n",
> >+				 hash_index,
> >+				 &client_info->ip_src,
> >+				 &client_info->ip_dst,
> >+				 client_info->mac_dst,
> >+				 client_info->slave->dev->name,
> >+				 client_info->assigned,
> >+				 client_info->ntt);
> >+	}
> >+
> >+	_unlock_rx_hashtbl(bond);
> >+	return count;
> >+}
> >+
> >+int tlb_print_tx_hashtbl(struct bonding *bond, char *buf)
> >+{
> >+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> >+	u32 hash_index;
> >+	u32 count = 0;
> >+	struct slave *slave;
> >+	int i;
> >+	
> >+	_lock_tx_hashtbl(bond);
> >+
> >+	count += sprintf(buf, "device  load\n");
> >+	bond_for_each_slave(bond, slave, i) {
> >+		struct tlb_slave_info *slave_info = &(SLAVE_TLB_INFO(slave));
> >+		count += sprintf(buf + count,"%-7s %d\n",slave->dev->name,slave_info->load);
> >+	}
> >+	count += sprintf(buf + count, "hash device   last device   tx bytes       load        next previous\n");
> >+	for (hash_index = 0; hash_index < TLB_HASH_TABLE_SIZE; hash_index++) {
> >+		struct tlb_client_info *client_info = &(bond_info->tx_hashtbl[hash_index]);
> >+		if (client_info->tx_slave || client_info->last_slave) {
> >+			count += sprintf(buf + count,"%-4x %-8s %-13s %-14d %-11d %-4x %d\n",
> >+					 hash_index,
> >+					 (client_info->tx_slave) ? client_info->tx_slave->dev->name : "",
> >+					 (client_info->last_slave) ? client_info->last_slave->dev->name : "",
> >+					 client_info->tx_bytes,
> >+					 client_info->load_history,
> >+					 (client_info->next != TLB_NULL_INDEX) ? client_info->next : 0,
> >+					 (client_info->prev != TLB_NULL_INDEX) ? client_info->prev : 0);
> >+		}
> >+	}
> >+
> >+	_unlock_tx_hashtbl(bond);
> >+	return count;
> >+}
> >+
> > /* Caller must hold rx_hashtbl lock */
> > static void rlb_init_table_entry(struct rlb_client_info *entry)
> > {
> >diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
> >index b65fd29..8543447 100644
> >--- a/drivers/net/bonding/bond_alb.h
> >+++ b/drivers/net/bonding/bond_alb.h
> >@@ -132,5 +132,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev);
> > void bond_alb_monitor(struct work_struct *);
> > int bond_alb_set_mac_address(struct net_device *bond_dev, void *addr);
> > void bond_alb_clear_vlan(struct bonding *bond, unsigned short vlan_id);
> >+int rlb_print_rx_hashtbl(struct bonding *bond, char *buf);
> >+int tlb_print_tx_hashtbl(struct bonding *bond, char *buf);
> > #endif /* __BOND_ALB_H__ */
> >
> >diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
> >index 55bf34f..1123e1f 100644
> >--- a/drivers/net/bonding/bond_sysfs.c
> >+++ b/drivers/net/bonding/bond_sysfs.c
> >@@ -1480,6 +1480,44 @@ static ssize_t bonding_show_ad_partner_mac(struct device *d,
> > static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL);
> >
> >
> >+/*
> >+ * Show current tlb/alb tx hash table.
> >+ */
> >+static ssize_t bonding_show_tlb_tx_hash(struct device *d,
> >+					   struct device_attribute *attr,
> >+					   char *buf)
> >+{
> >+	int count = 0;
> >+	struct bonding *bond = to_bond(d);
> >+
> >+	if (bond->params.mode == BOND_MODE_ALB ||
> >+	    bond->params.mode == BOND_MODE_TLB) {
> >+		count = tlb_print_tx_hashtbl(bond, buf);
> >+	}
> >+
> >+	return count;
> >+}
> >+static DEVICE_ATTR(tlb_tx_hash, S_IRUGO, bonding_show_tlb_tx_hash, NULL);
> 
> 	Should the mode here be S_IRUSR (0400, instead of 0444)?
> Otherwise, a nefarious user could "while 1 cat /sys/.../tlb_tx_hash" and
> keep the hash table lock fairly busy.  Since the lock is acquired for
> every packet on tx, that's probably a bad thing.
> 
> >+
> >+/*
> >+ * Show current alb rx hash table.
> >+ */
> >+static ssize_t bonding_show_alb_rx_hash(struct device *d,
> >+					   struct device_attribute *attr,
> >+					   char *buf)
> >+{
> >+	int count = 0;
> >+	struct bonding *bond = to_bond(d);
> >+
> >+	if (bond->params.mode == BOND_MODE_ALB) {
> >+		count = rlb_print_rx_hashtbl(bond, buf);
> >+	}
> >+
> >+	return count;
> >+}
> >+static DEVICE_ATTR(alb_rx_hash, S_IRUGO, bonding_show_alb_rx_hash, NULL);
> 
> 	Same comment as for the mode of the tlb_tx_hash, although the rx
> hash table lock is much more lightly used, so it might not be a real
> problem.
> 
> >
> > static struct attribute *per_bond_attrs[] = {
> > 	&dev_attr_slaves.attr,
> >@@ -1505,6 +1543,8 @@ static struct attribute *per_bond_attrs[] = {
> > 	&dev_attr_ad_actor_key.attr,
> > 	&dev_attr_ad_partner_key.attr,
> > 	&dev_attr_ad_partner_mac.attr,
> >+	&dev_attr_alb_rx_hash.attr,
> >+	&dev_attr_tlb_tx_hash.attr,
> > 	NULL,
> > };
> >
> >-- 
> >1.5.5.6
> >
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* [PATCH] tcp: fix ssthresh u16 leftover
From: Ilpo Järvinen @ 2009-09-14 14:09 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev


Once upon a time snd_ssthresh was a 16-bit quantity.
That has not been true for a long time. I ran across
some ancient comparisons which still seem to trust that legacy.
Put all that magic into a single place; hopefully I found all
of them.

Compile tested, though linking allyesconfig seems to be ridiculous
nowadays.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

---
  include/net/tcp.h        |    7 +++++++
  net/ipv4/tcp.c           |    2 +-
  net/ipv4/tcp_input.c     |    2 +-
  net/ipv4/tcp_ipv4.c      |    4 ++--
  net/ipv4/tcp_minisocks.c |    2 +-
  net/ipv6/tcp_ipv6.c      |    5 +++--
  6 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b71a446..56b7602 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -793,6 +793,13 @@ static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
  	return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
  }

+#define TCP_INFINITE_SSTHRESH	0x7fffffff
+
+static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
+{
+	return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
+}
+
  /* If cwnd > ssthresh, we may raise ssthresh to be half-way to cwnd.
   * The exception is rate halving phase, when cwnd is decreasing towards
   * ssthresh.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index edeea06..19a0612 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2012,7 +2012,7 @@ int tcp_disconnect(struct sock *sk, int flags)
  	tp->snd_cwnd = 2;
  	icsk->icsk_probes_out = 0;
  	tp->packets_out = 0;
-	tp->snd_ssthresh = 0x7fffffff;
+	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
  	tp->snd_cwnd_cnt = 0;
  	tp->bytes_acked = 0;
  	tcp_set_ca_state(sk, TCP_CA_Open);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af6d6fa..d86784b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -761,7 +761,7 @@ void tcp_update_metrics(struct sock *sk)
  			set_dst_metric_rtt(dst, RTAX_RTTVAR, var);
  		}

-		if (tp->snd_ssthresh >= 0xFFFF) {
+		if (tcp_in_initial_slowstart(tp)) {
  			/* Slow start still did not finish. */
  			if (dst_metric(dst, RTAX_SSTHRESH) &&
  			    !dst_metric_locked(dst, RTAX_SSTHRESH) &&
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0543561..7cda24b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1808,7 +1808,7 @@ static int tcp_v4_init_sock(struct sock *sk)
  	/* See draft-stevens-tcpca-spec-01 for discussion of the
  	 * initialization of these values.
  	 */
-	tp->snd_ssthresh = 0x7fffffff;	/* Infinity */
+	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
  	tp->snd_cwnd_clamp = ~0;
  	tp->mss_cache = 536;

@@ -2284,7 +2284,7 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
  		jiffies_to_clock_t(icsk->icsk_ack.ato),
  		(icsk->icsk_ack.quick << 1) | icsk->icsk_ack.pingpong,
  		tp->snd_cwnd,
-		tp->snd_ssthresh >= 0xFFFF ? -1 : tp->snd_ssthresh,
+		tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh,
  		len);
  }

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e48c37d..045bcfd 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -410,7 +410,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
  		newtp->retrans_out = 0;
  		newtp->sacked_out = 0;
  		newtp->fackets_out = 0;
-		newtp->snd_ssthresh = 0x7fffffff;
+		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;

  		/* So many TCP implementations out there (incorrectly) count the
  		 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3aae0f2..6e3f0dc 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1846,7 +1846,7 @@ static int tcp_v6_init_sock(struct sock *sk)
  	/* See draft-stevens-tcpca-spec-01 for discussion of the
  	 * initialization of these values.
  	 */
-	tp->snd_ssthresh = 0x7fffffff;
+	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
  	tp->snd_cwnd_clamp = ~0;
  	tp->mss_cache = 536;

@@ -1969,7 +1969,8 @@ static void get_tcp6_sock(struct seq_file *seq, struct sock *sp, int i)
  		   jiffies_to_clock_t(icsk->icsk_rto),
  		   jiffies_to_clock_t(icsk->icsk_ack.ato),
  		   (icsk->icsk_ack.quick << 1 ) | icsk->icsk_ack.pingpong,
-		   tp->snd_cwnd, tp->snd_ssthresh>=0xFFFF?-1:tp->snd_ssthresh
+		   tp->snd_cwnd,
+		   tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh
  		   );
  }

-- 
tg: (13af7a6..) fix/ssthresh (depends on: origin/master)
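
The effect of the patch above can be sketched in user space. The two
functions below are stand-ins written for this illustration (their names are
assumptions, and they take the raw value instead of a struct tcp_sock); they
contrast the old 16-bit comparison with the new helper's test.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_INFINITE_SSTHRESH	0x7fffffff

/* Old test: anything at or above 0xFFFF was treated as "infinite". */
bool ssthresh_is_infinite_old(uint32_t snd_ssthresh)
{
	return snd_ssthresh >= 0xFFFF;
}

/* New test, mirroring tcp_in_initial_slowstart() from the patch. */
bool ssthresh_is_infinite_new(uint32_t snd_ssthresh)
{
	return snd_ssthresh >= TCP_INFINITE_SSTHRESH;
}
```

A legitimate ssthresh of, say, 70000 passes the old test as if slow start had
never finished, but not the new one. Only the initial value 0x7fffffff is
reported as infinite by the new test.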


* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Evgeniy Polyakov @ 2009-09-14 14:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Paris, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <20090914001759.GB30621@shareable.org>

On Mon, Sep 14, 2009 at 01:17:59AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
> > When queue is full or you do not have enough RAM. Both are reported at
> > 'sending' time.
> 
> Can you ->poll() and wait reliably until the queue will accept an skb?
> (A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm).

It is not that simple, and for a memory allocation error you just can't.

There is no direct access to the remote peer sockets, i.e. the userspace
ones, so one would have to lock the netlink table, run over the listeners,
check whether they can accept the message, and wait/poll for the queue size
to become big enough. The netlink table and its locking are not exported,
so effectively there is no simple way to do this.

-- 
	Evgeniy Polyakov


* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14 14:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, netdev, herbert
In-Reply-To: <20090914101012.GA14176@redhat.com>

Michael S. Tsirkin wrote:
>> how  would the use case with vhost will look like?
> - Configure bridge and tun using existing scripts
> - pass tun fd to vhost via an ioctl
> - vhost calls tun_get_socket
> - from this point, guest networking just goes faster

Let me see if I am with you:

1. vhost gets either a packet socket fd or a tun fd from user space
through an ioctl, but never both.

2. For a packet socket fd:
VM.TX is translated by vhost to sendmsg, which goes through the NIC.
NIC RX makes poll on the fd signal, then recvmsg is called on the
fd, and vhost places the packet in a virtq.

3. For a tun fd:
VM.TX is translated by vhost to sendmsg, which tun translates to
netif_rx, which is then handled by the bridge.
NIC RX goes to the bridge, which transmits the packet on a tun interface.
What then makes tun provide this packet to vhost, and how is it done?


> A lot of people have asked for tun support in vhost, because qemu currently uses tun.  With this scheme existing code and scripts can be used to configure both tun and bridge.  You also can utilize virtualization-specific features in tun.
Tun has code to support some virtualization-specific features. However,
I think it also has some inherent problems. For example, you don't know
over which NIC a packet will eventually be sent, so advertising features
such as TSO to the guest (virtio-net) NIC is problematic. With vhost,
since you are directly attached to a NIC, and assuming it's a PF or VF
NIC and not something like macvlan/veth, you can actually know which
features this NIC supports.

Or.





* Re: ipv4 regression in 2.6.31 ?
From: Eric Dumazet @ 2009-09-14 13:57 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, davem, Linux Netdev List
In-Reply-To: <20090914150935.cc895a3c.skraw@ithnet.com>

Stephan von Krawczynski a écrit :
> Hello all,
> 
> today we experienced some sort of regression in 2.6.31 ipv4 implementation, or
> at least some incompatibility with former 2.6.30.X kernels.
> 
> We have the following situation:
> 
>                                        ---------- vlan1@eth0 192.168.2.1/24
>                                       /
> host A 192.168.1.1/24 eth0  -------<router>            host B
>                                       \
>                                        ---------- eth1 192.168.3.1/24
> 
> 
> Now, if you route 192.168.1.0/24 via interface vlan1@eth0 on host B and let
> host A ping 192.168.2.1 everything works. But if you route 192.168.1.0/24 via
> interface eth1 on host B and let host A ping 192.168.2.1 you get no reply.
> With tcpdump we see the icmp packets arrive at vlan1@eth0, but no icmp echo
> reply being generated neither on vlan1 nor eth1.
> Kernels 2.6.30.X and below do not show this behaviour.
> Is this intended? Do we need to reconfigure something to restore the old
> behaviour?
> 

Asymmetric routing?

Check your rp_filter settings:

grep . `find /proc/sys/net -name rp_filter`

rp_filter - INTEGER
        0 - No source validation.
        1 - Strict mode as defined in RFC3704 Strict Reverse Path
            Each incoming packet is tested against the FIB and if the interface
            is not the best reverse path the packet check will fail.
            By default failed packets are discarded.
        2 - Loose mode as defined in RFC3704 Loose Reverse Path
            Each incoming packet's source address is also tested against the FIB
            and if the source address is not reachable via any interface
            the packet check will fail.

        Current recommended practice in RFC3704 is to enable strict mode
        to prevent IP spoofing from DDos attacks. If using asymmetric routing
        or other complicated routing, then loose mode is recommended.

        conf/all/rp_filter must also be set to non-zero to do source validation
        on the interface

        Default value is 0. Note that some distributions enable it
        in startup scripts.
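
The three modes quoted above can be modelled with a toy routing table. This
is only a sketch under stated assumptions, not the kernel implementation:
struct route, fib_lookup() and rp_filter_pass() are names invented for this
example, and the lookup is a plain longest-prefix match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical FIB entry: network prefix, netmask, outgoing interface. */
struct route {
	uint32_t prefix;
	uint32_t mask;
	const char *iface;
};

/* Longest-prefix match over a small table; NULL if nothing matches. */
const struct route *fib_lookup(const struct route *fib, size_t n,
			       uint32_t addr)
{
	const struct route *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((addr & fib[i].mask) != fib[i].prefix)
			continue;
		if (!best || fib[i].mask > best->mask)
			best = &fib[i];
	}
	return best;
}

/* mode 0: no validation, 1: strict, 2: loose (as in the text above). */
bool rp_filter_pass(const struct route *fib, size_t n, int mode,
		    uint32_t src, const char *in_iface)
{
	const struct route *r;

	if (mode == 0)
		return true;
	r = fib_lookup(fib, n, src);
	if (!r)
		return false;		/* source not reachable at all */
	if (mode == 2)
		return true;		/* loose: any interface will do */
	return strcmp(r->iface, in_iface) == 0;	/* strict: same interface */
}
```

Applied to the reported setup, where host B routes 192.168.1.0/24 via eth1
but the echo request arrives on vlan1, this model fails the packet in strict
mode and passes it in loose mode or with validation off.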




* Re: [iproute2] tc action mirred    question
From: Xiaofei Wu @ 2009-09-14 13:44 UTC (permalink / raw)
  To: hadi; +Cc: linux netdev
In-Reply-To: <1252704524.25158.42.camel@dogo.mojatatu.com>

>> 

>> How to do this. Could you show me the example commands?   Thank you.
>> 
>Add the rule to mirror on lo
>Add the rule to pedit for mirrored packet on eth0


I did two experiments. One works. The result of the other is not what I expected, and I don't know why.

(1)
 A
| |
 C

A: eth0  192.168.1.242/24
   wlan1 192.168.4.5/24

C: wlan1 192.168.4.202/24
   eth0  192.168.1.215/24
On node A, I mirrored packets to wlan1 (eth0 -> wlan1) and modified the dst and src MAC addresses (to transmit to wlan1 of node C).
When I run 'ping 192.168.1.215' on node A, each request gets two replies. That is the expected result.

(2)
 A
/ |
B |
\ |
 C

A: eth0  192.168.1.242/24
   wlan1 192.168.2.5/24

B: wlan1 192.168.2.11/24
   wlan2 192.168.4.11/24

C: wlan1 192.168.4.202/24
   eth0  192.168.1.215/24

On node A, I ran the following to mirror and pedit packets:
---
#tc qdisc add dev eth0 handle 1: root prio
#tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.1.0/24 flowid 1:16 \
action mirred egress mirror dev wlan1

#tc qdisc add dev wlan1 handle 1: root prio
#tc filter add dev wlan1 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.1.0/24 flowid 1:16 \
action pedit munge offset -14 u16 set 0x0023 \
munge offset -12 u32 set 0xcdafecda \
munge offset -8 u32 set 0x0023cdaf \
munge offset -4 u32 set 0xd0740800
---
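
For reference, the four pedit munge values above can be read as a 14-byte
Ethernet header. The sketch below only models the intended byte layout,
assuming each value is stored big-endian at the stated offset relative to
the start of the IP header. It does not model how tc pedit applies the keys
internally, and the names munge() and build_header() are invented for this
example.

```c
#include <stdint.h>

/*
 * Store the low `len` bytes of val, most significant first, at byte
 * offset `off` relative to the IP header. hdr[] holds the 14 bytes
 * that precede the IP header, so offset -14 maps to hdr[0].
 */
void munge(uint8_t hdr[14], int off, uint32_t val, int len)
{
	int i;

	for (i = 0; i < len; i++)
		hdr[14 + off + i] = (uint8_t)(val >> (8 * (len - 1 - i)));
}

/* Apply the four munge operations from the tc command above. */
void build_header(uint8_t hdr[14])
{
	munge(hdr, -14, 0x0023, 2);	/* dst MAC bytes 0-1 */
	munge(hdr, -12, 0xcdafecda, 4);	/* dst MAC bytes 2-5 */
	munge(hdr, -8, 0x0023cdaf, 4);	/* src MAC bytes 0-3 */
	munge(hdr, -4, 0xd0740800, 4);	/* src MAC bytes 4-5, EtherType */
}
```

Under that assumption the header reads: destination MAC 00:23:cd:af:ec:da,
source MAC 00:23:cd:af:d0:74, EtherType 0x0800 (IPv4).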

The routing table of node B:
---
#route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 wlan2
192.168.2.0     0.0.0.0         255.255.255.0   U     0      0        0 wlan1
0.0.0.0         192.168.4.202     0.0.0.0       UG    0      0        0 wlan2

#cat /proc/sys/net/ipv4/ip_forward
1
---

When I run 'ping 192.168.1.215' (the IP address of node C's eth0) on node A, each request gets only one reply. That is strange.
On node B:
window1:  'tcpdump -i wlan1 -n -e', I can see the mirrored packets.
window2:  'tcpdump -i wlan2 -n -e', I see nothing.
It seems that node B didn't forward the mirrored packets, so I did another experiment to check it.
I am sure node B can forward packets. But it didn't forward the mirrored packets. Why? (Is something wrong with the mirrored packets?)



regards,
wu


      



* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: jamal @ 2009-09-14 13:15 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Paris, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch, balbir
In-Reply-To: <20090914000303.GA30621@shareable.org>

On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote:

> If you have enough memory to remember _what_ to retransmit, then you
> have enough memory to buffer a fixed-size message.  It just depends on
> how you do the buffering.  To say netlink drops the message and you
> can retry is just saying that the buffering is happening one step
> earlier, before netlink.  

It is the receiver that drops the message because of overruns,
e.g. when the receiver doesn't keep up.

> That's what I mean by netlink being a
> pointless complication for this, because you can just as easily write
> code which gets to the message to userspace without going through
> netlink and with no chance of it being dropped.
> 

Sure you can do that with netlink too. Whether it is overcomplicated
needs to be weighed out.

> Yes.  It uses positive acknowledge and flow control, because these
> match naturally with what fanotify does at the next higher level.
> 
> The process generating the multicast (e.g. trying to write a file) is
> blocked until the receiver gets the message, handles it and
> acknowledges with a "yes you can" or "no you can't" response.
> 
> That's part of fanotify's design.  The pattern conveniently has no
> issues with using unbounded memory for message, because the sending
> process is blocked.
> 

Ok, I understand better i think;-> So it is a synchronous type of
operation whereas in netlink type multicast, the optimization is to make
the operation async.

> True you only need one skb.  But netlink doesn't handle waiting for
> positive acknowledge responses from every receiver, and combining
> their value, does it?  

It is not netlink per se. It is how you use netlink for your app.
Classical one-to-many operations have the sender (kernel mostly)
do async sends to the listeners. It is up to the listener to catch
up if there are any holes. But this does not seem to be what you want
for fanotify.

> You can't really take advantage of netlink's
> built in multicast, because to known when it has all the responses,
> the fanotify layer has to track the subscriber list itself anyway.

True, given my understanding so far fanotify has to track the subscriber
list. i.e something along the lines of:
- send a single multicast message to a set of listeners
- wait for response from all subscribers
- if not all subscribers respond within a given timeout, retransmit up to
max retransmit times
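
The three steps above can be sketched as a small user-space simulation. All
names here (struct subscriber, multicast(), notify_all()) are hypothetical.
A real implementation would wait on a timeout for asynchronous responses,
which this sketch replaces with a per-subscriber counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulated subscriber: responds once it has seen the event
 * acks_after times (1 means it responds to the first send). */
struct subscriber {
	int acks_after;
	int seen;
	bool acked;
};

/* One multicast send: every subscriber that has not responded yet
 * sees the event again and may respond. */
void multicast(struct subscriber *subs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (subs[i].acked)
			continue;
		if (++subs[i].seen >= subs[i].acks_after)
			subs[i].acked = true;
	}
}

bool all_acked(const struct subscriber *subs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!subs[i].acked)
			return false;
	return true;
}

/* Send once, then retransmit up to max_retransmit times. Returns the
 * number of sends used, or -1 if some subscriber never responded. */
int notify_all(struct subscriber *subs, size_t n, int max_retransmit)
{
	int attempt;

	for (attempt = 0; attempt <= max_retransmit; attempt++) {
		multicast(subs, n);
		if (all_acked(subs, n))
			return attempt + 1;
	}
	return -1;
}
```

The tracking state per subscriber is exactly what the sender has to keep in
this scheme, whichever transport carries the message.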

The chance of losing a message in such a case is zero if the socket
buffer on each listener/receiver is larger than one fanotify event
message. You still have to alloc the message - and that may fail.

> What I'm saying is perhaps skbs are useful for fanotify, but I don't
> know that netlink's multicasting is useful.  But storing the messages
> in skbs for transmission, and using parts of netlink to manage them,
> and to provide some of the API, that might be useful.

The multicast-a-single-skb part is useful. Of course, its usefulness
diminishes as the number of listeners per subtree goes down (because it
reduces to one skb per listener). So all this depends on how fanotify is
going to be used.
The one thing I am not sure of is how you map a multicast group to a
subtree. In netlink, the groups to which multiple listeners subscribe
are 32-bit identifiers. I suppose one approach could be to register
for the event of interest, get an ID, then use the ID to listen to a
multicast group of that ID. This way whoever is issuing the ID can also
factor in permissions and subtree overlap of the listener (and whether
the events are already being listened to in a known ID).
Alternatively, to your statement above, if fanotify is keeping track of
all subscribers then it can replicast a single event instead and just
bump the refcount on the skb for each sent-to-user (and still use one
skb).
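
The replicast idea can be illustrated with a plain reference count. This is
a user-space sketch only: struct event_buf and its helpers are hypothetical
stand-ins and do not use the kernel's skb API.

```c
/* Hypothetical shared event buffer; freed is set when the last
 * reference is dropped (a real buffer would be released here). */
struct event_buf {
	int refcount;
	int freed;
};

void event_get(struct event_buf *e)
{
	e->refcount++;
}

void event_put(struct event_buf *e)
{
	if (--e->refcount == 0)
		e->freed = 1;
}

/* Hand the same buffer to every listener, taking one reference per
 * listener instead of copying the event. Returns the listener count. */
int replicast(struct event_buf *e, int listeners)
{
	int i;

	for (i = 0; i < listeners; i++)
		event_get(e);
	return listeners;
}
```

The sender starts with one reference, each listener drops its own reference
when done, and the buffer goes away only after the last put.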

> You do get nothing unless you register interest.  The problem is
> there's no way to register interest on just a subtree, so the fanotify
> approach is let you register for events on the whole filesystem, and
> let the userspace daemon filter paths.  At least it's decisions can be
> cached, although I'm not sure how that works when multiple processes
> want to monitor overlapping parts of the filesystem.

I guess if the non-optimal part happens only once and subsequent cached
filters happen faster, then one could look at that as cost of setup.
I think, given that you are capable of creating such a cache, seems that
it would be cheaper to make such decision at registration time.

> It doesn't sound scalable to me, either, and that's why I don't like
> this part, and described a solution to monitoring subtrees - which
> would also solve the problem for inotify.  (Both use fsnotify under
> the hood, and that's where subtree notification would go).
>
> Eric's mentioned interest in a way to monitor subtrees, but that
> hasn't gone anywhere as far as I know.  He doesn't seem convinced by
> my solution - or even that scalability will be an issue.  I think
> there's a bit of vision lacking here, and I'll admit I'm more
> interested in the inotify uses of fsnotify (being able to detect
> changes) than the fanotify uses (being able to _block_ or _modify_
> changes).  I think both inotify and fanotify ought to benefit from the
> same improvements to file monitoring.
> 

The subtree overlap problem seems to invoke some well known computer
science algorithms, no? i.e tell me oracle given the event on nodeX of
this tree, which subscriber needs to be notified?


> I believe it would cause 10000 events, yes, even if they are files
> that userspace policy is not interested in.  Eric, is that right?
> 
> However I believe after the first grep, subsequent greps' decisions
> would be cached by marking the inodes.  I'm not sure what happens if
> two fanotify monitors both try marking the inodes.
>
> Arguably if a fanotify monitor is running before those files are in
> page cache anyway, then I/O may dominate, and when the files are
> cached, fanotify has already cached it's decisions in the kernel.
> However fanotify is synchronous: each new file access involves a round
> trip to the fanotify userspace and back before it can proceed, so
> there's quite a lot of IPC and scheduling too.  Without testing, it's
> hard to guess how it'll really perform.
> 

So if you can mark inodes, why not do it at register time?

> > > While skbs and netlink aren't that slow, I suspect they're an order of
> > > magnitude or two slower than, say, epoll or inotify at passing events
> > > around.
> > 
> > not familiar with inotify.
> 
> inotify is like dnotify, and like a signal or epoll: a message that
> something happened.  You register interest in individual files or
> directories only, and inotify does not (yet) provide a way to monitor
> the whole filesystem or a subtree.
> 
> fanotify is different: it provides access control, and can _refuse_
> attempts to read file X, or even modify the file before permitting the
> file to be read.
> 

Ok, I think I understand more about fanotify now. It is more of an
access control mechanism than a mass notification scheme (which is what I
thought of earlier).
Hrm, it does sound like something closer to selinux if it is simple
enough to only require answers to simple questions like "should this
operation continue?"

> > Theres a difference between events which are abbreviated in the form
> > "hey some read happened on fd you are listening on" vs "hey a read
> > of file X for 16 bytes at offset 200 by process Y just occured while
> > at the same time process Z was writting at offset 2000". The later
> > (which netlink will give you) includes a lot more attribute details
> > which could be filtered or can be extended to include a lot
> > more. The former(what epoll will give you) is merely a signal.
> 
> Firstly, it's really hard to retain the ordering of userspace events
> like that in a useful way, given the non-determinstic parallelism
> going on with multiple processes doing I/O do the same file :-)
> 

Bad example ;->
That was not meant to be anything clever - rather to demonstrate that
netlink allows you to send many attributes with events and that you can
add as many as you want over a period of time (instead of hardcoding it
at design/coding time).

On a tangent: I would love to get more than simple events
(read/write/exception) on a file. Probably more on the writes
than on reads; example "offset X, length Y has been deleted" etc.
I would still love the option to exercise my rights to simple
events like read/write/exception.

cheers,
jamal


^ permalink raw reply

* [PATCH] Phonet: Netlink event for autoconfigured addresses
From: Rémi Denis-Courmont @ 2009-09-14 13:10 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 net/phonet/pn_dev.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/phonet/pn_dev.c b/net/phonet/pn_dev.c
index 2f65dca..5f42f30 100644
--- a/net/phonet/pn_dev.c
+++ b/net/phonet/pn_dev.c
@@ -209,7 +209,14 @@ static int phonet_device_autoconf(struct net_device *dev)
 						SIOCPNGAUTOCONF);
 	if (ret < 0)
 		return ret;
-	return phonet_address_add(dev, req.ifr_phonet_autoconf.device);
+
+	ASSERT_RTNL();
+	ret = phonet_address_add(dev, req.ifr_phonet_autoconf.device);
+	if (ret)
+		return ret;
+	phonet_address_notify(RTM_NEWADDR, dev,
+				req.ifr_phonet_autoconf.device);
+	return 0;
 }
 
 /* notify Phonet of device events */
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] cdc-phonet: remove noisy debug statement
From: Rémi Denis-Courmont @ 2009-09-14 13:10 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont
In-Reply-To: <1252933829-12442-1-git-send-email-remi@remlab.net>

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 drivers/net/usb/cdc-phonet.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/cdc-phonet.c b/drivers/net/usb/cdc-phonet.c
index 97e54d9..33d5c57 100644
--- a/drivers/net/usb/cdc-phonet.c
+++ b/drivers/net/usb/cdc-phonet.c
@@ -264,7 +264,6 @@ static int usbpn_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	switch (cmd) {
 	case SIOCPNGAUTOCONF:
 		req->ifr_phonet_autoconf.device = PN_DEV_PC;
-		printk(KERN_CRIT"device is PN_DEV_PC\n");
 		return 0;
 	}
 	return -ENOIOCTLCMD;
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] pkt_sched: Fix tx queue selection in tc_modify_qdisc
From: Jarek Poplawski @ 2009-09-14 12:22 UTC (permalink / raw)
  To: David Miller; +Cc: Patrick McHardy, netdev

After the recent mq change there is the new select_queue qdisc class
method used in tc_modify_qdisc, but it works OK only for direct child
qdiscs of mq qdisc. Grandchildren always get the first tx queue, which
would give wrong qdisc_root etc. results (e.g. for sch_htb as child of
sch_prio). This patch fixes it by using parent's dev_queue for such
grandchildren qdiscs. The select_queue method is replaced BTW with the
static qdisc_select_tx_queue function (it's used only in one place).

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 include/net/sch_generic.h |    1 -
 net/sched/sch_api.c       |   29 +++++++++++++++++++++--------
 net/sched/sch_mq.c        |   10 ----------
 3 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 88eb9de..865120c 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -81,7 +81,6 @@ struct Qdisc
 struct Qdisc_class_ops
 {
 	/* Child qdisc manipulation */
-	unsigned int		(*select_queue)(struct Qdisc *, struct tcmsg *);
 	int			(*graft)(struct Qdisc *, unsigned long cl,
 					struct Qdisc *, struct Qdisc **);
 	struct Qdisc *		(*leaf)(struct Qdisc *, unsigned long cl);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 3af1061..223a6bc 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -990,6 +990,24 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 	return 0;
 }
 
+static struct netdev_queue *qdisc_select_tx_queue(struct net_device *dev,
+						  struct Qdisc *p, u32 clid)
+{
+	unsigned long ntx;
+
+	if (!p)
+		return netdev_get_tx_queue(dev, 0);
+
+	if (!(p->flags & TCQ_F_MQROOT))
+		return p->dev_queue;
+
+	ntx = TC_H_MIN(clid) - 1;
+	if (ntx >= dev->num_tx_queues)
+		ntx = 0;
+
+	return netdev_get_tx_queue(dev, ntx);
+}
+
 /*
    Create/change qdisc.
  */
@@ -1110,16 +1128,11 @@ create_n_graft:
 		q = qdisc_create(dev, &dev->rx_queue, p,
 				 tcm->tcm_parent, tcm->tcm_parent,
 				 tca, &err);
-	else {
-		unsigned int ntx = 0;
-
-		if (p && p->ops->cl_ops && p->ops->cl_ops->select_queue)
-			ntx = p->ops->cl_ops->select_queue(p, tcm);
-
-		q = qdisc_create(dev, netdev_get_tx_queue(dev, ntx), p,
+	else
+		q = qdisc_create(dev, qdisc_select_tx_queue(dev, p, clid), p,
 				 tcm->tcm_parent, tcm->tcm_handle,
 				 tca, &err);
-	}
+
 	if (q == NULL) {
 		if (err == -EAGAIN)
 			goto replay;
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index dd5ee02..4ad949b 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -125,15 +125,6 @@ static struct netdev_queue *mq_queue_get(struct Qdisc *sch, unsigned long cl)
 	return netdev_get_tx_queue(dev, ntx);
 }
 
-static unsigned int mq_select_queue(struct Qdisc *sch, struct tcmsg *tcm)
-{
-	unsigned int ntx = TC_H_MIN(tcm->tcm_parent);
-
-	if (!mq_queue_get(sch, ntx))
-		return 0;
-	return ntx - 1;
-}
-
 static int mq_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
 		    struct Qdisc **old)
 {
@@ -213,7 +204,6 @@ static void mq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 }
 
 static const struct Qdisc_class_ops mq_class_ops = {
-	.select_queue	= mq_select_queue,
 	.graft		= mq_graft,
 	.leaf		= mq_leaf,
 	.get		= mq_get,

^ permalink raw reply related

* Re: more troubles with bridge in netns
From: Daniel Lezcano @ 2009-09-14 11:28 UTC (permalink / raw)
  To: Atis Elsts; +Cc: netdev
In-Reply-To: <200909141419.12330.atis@mikrotik.com>

Atis Elsts wrote:
> On Tuesday 08 September 2009 11:40:44 Daniel Lezcano wrote:
>   
>> Atis Elsts wrote:
>>     
>>> Trying to add bridge interface from userspace program, after moving the
>>> program to a new network namespace, causes kernel to crash. I am using
>>> latest kernel version from git (2.6.31-rc9).
>>> The bug is easy to reproduce - just compile and run the attached C
>>> program.
>>>
>>> I see that bridge interface has NETIF_F_NETNS_LOCAL flag, but as I
>>> understand, this flag simply means that a device cannot be *moved* across
>>> network namespaces, not that it cannot be *created* in other namespaces.
>>>       
>> Yep, very easy to reproduce :/
>> The sysfs has not been disabled for the bridge. I will try to fix it as
>> soon as I can.
>>
>> Thanks
>>   -- Daniel
>>     
>
> Hello,
>
> please let me know when the sysfs patch for bridge is available. At the moment 
> I managed to get it to work by just commenting out all sysfs stuff for bridge 
> module. However, a new problem appears now. After running C program 
> (attached) that creates a bridge in network namespace and attaches an 
> interface to it, I got this message repeatedly:
>  kernel:[  466.758908] unregister_netdevice: waiting for lo to become free. 
> Usage count = 2
>
> It seems pretty unlikely that my kernel changes could have caused this?
>
> The unregister_netdevice message does not appear, however, if I uncomment this 
> line in child.c:
>     system("brctl setfd sim_br0 0");
>   

I was about to send a patch to disable the bridge per namespace, as it
seems it was never tested.
Can you send me your kernel patch?

Thanks.
  -- Daniel

^ permalink raw reply

* more troubles with bridge in netns
From: Atis Elsts @ 2009-09-14 11:19 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: netdev
In-Reply-To: <4AA6188C.1070806@free.fr>

[-- Attachment #1: Type: text/plain, Size: 1409 bytes --]

On Tuesday 08 September 2009 11:40:44 Daniel Lezcano wrote:
> Atis Elsts wrote:
> > Trying to add bridge interface from userspace program, after moving the
> > program to a new network namespace, causes kernel to crash. I am using
> > latest kernel version from git (2.6.31-rc9).
> > The bug is easy to reproduce - just compile and run the attached C
> > program.
> >
> > I see that bridge interface has NETIF_F_NETNS_LOCAL flag, but as I
> > understand, this flag simply means that a device cannot be *moved* across
> > network namespaces, not that it cannot be *created* in other namespaces.
>
> Yep, very easy to reproduce :/
> The sysfs has not been disabled for the bridge. I will try to fix it as
> soon as I can.
>
> Thanks
>   -- Daniel

Hello,

please let me know when the sysfs patch for bridge is available. At the moment 
I managed to get it to work by just commenting out all sysfs stuff for bridge 
module. However, a new problem appears now. After running the C program 
(attached) that creates a bridge in network namespace and attaches an 
interface to it, I got this message repeatedly:
 kernel:[  466.758908] unregister_netdevice: waiting for lo to become free. 
Usage count = 2

It seems pretty unlikely that my kernel changes could have caused this?

The unregister_netdevice message does not appear, however, if I uncomment this 
line in child.c:
    system("brctl setfd sim_br0 0");

--Atis

[-- Attachment #2: brtest.tgz --]
[-- Type: application/x-tgz, Size: 10767 bytes --]

^ permalink raw reply

* [PATCH 4/4] RxRPC: Parse security index 5 keys (Kerberos 5)
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells
In-Reply-To: <20090914111730.10233.44007.stgit@warthog.procyon.org.uk>

Parse RxRPC security index 5 type keys (Kerberos 5 tokens).

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/keys/rxrpc-type.h |   52 ++++
 net/rxrpc/ar-key.c        |  577 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 589 insertions(+), 40 deletions(-)


diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h
index c0d9121..5eb2357 100644
--- a/include/keys/rxrpc-type.h
+++ b/include/keys/rxrpc-type.h
@@ -36,6 +36,54 @@ struct rxkad_key {
 };
 
 /*
+ * Kerberos 5 principal
+ *	name/name/name@realm
+ */
+struct krb5_principal {
+	u8	n_name_parts;		/* N of parts of the name part of the principal */
+	char	**name_parts;		/* parts of the name part of the principal */
+	char	*realm;			/* parts of the realm part of the principal */
+};
+
+/*
+ * Kerberos 5 tagged data
+ */
+struct krb5_tagged_data {
+	/* for tag value, see /usr/include/krb5/krb5.h
+	 * - KRB5_AUTHDATA_* for auth data
+	 * - 
+	 */
+	int32_t		tag;
+	uint32_t	data_len;
+	u8		*data;
+};
+
+/*
+ * RxRPC key for Kerberos V (type-5 security)
+ */
+struct rxk5_key {
+	uint64_t		authtime;	/* time at which auth token generated */
+	uint64_t		starttime;	/* time at which auth token starts */
+	uint64_t		endtime;	/* time at which auth token expired */
+	uint64_t		renew_till;	/* time to which auth token can be renewed */
+	int32_t			is_skey;	/* T if ticket is encrypted in another ticket's
+						 * skey */
+	int32_t			flags;		/* mask of TKT_FLG_* bits (krb5/krb5.h) */
+	struct krb5_principal	client;		/* client principal name */
+	struct krb5_principal	server;		/* server principal name */
+	uint16_t		ticket_len;	/* length of ticket */
+	uint16_t		ticket2_len;	/* length of second ticket */
+	u8			n_authdata;	/* number of authorisation data elements */
+	u8			n_addresses;	/* number of addresses */
+	struct krb5_tagged_data	session;	/* session data; tag is enctype */
+	struct krb5_tagged_data *addresses;	/* addresses */
+	u8			*ticket;	/* krb5 ticket */
+	u8			*ticket2;	/* second krb5 ticket, if related to ticket (via
+						 * DUPLICATE-SKEY or ENC-TKT-IN-SKEY) */
+	struct krb5_tagged_data *authdata;	/* authorisation data */
+};
+
+/*
  * list of tokens attached to an rxrpc key
  */
 struct rxrpc_key_token {
@@ -43,6 +91,7 @@ struct rxrpc_key_token {
 	struct rxrpc_key_token *next;	/* the next token in the list */
 	union {
 		struct rxkad_key *kad;
+		struct rxk5_key *k5;
 	};
 };
 
@@ -64,8 +113,11 @@ struct rxrpc_key_data_v1 {
  * - based on openafs-1.4.10/src/auth/afs_token.xg
  */
 #define AFSTOKEN_LENGTH_MAX		16384	/* max payload size */
+#define AFSTOKEN_STRING_MAX		256	/* max small string length */
+#define AFSTOKEN_DATA_MAX		64	/* max small data length */
 #define AFSTOKEN_CELL_MAX		64	/* max cellname length */
 #define AFSTOKEN_MAX			8	/* max tokens per payload */
+#define AFSTOKEN_BDATALN_MAX		16384	/* max big data length */
 #define AFSTOKEN_RK_TIX_MAX		12000	/* max RxKAD ticket size */
 #define AFSTOKEN_GK_KEY_MAX		64	/* max GSSAPI key size */
 #define AFSTOKEN_GK_TOKEN_MAX		16384	/* max GSSAPI token size */
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index bf4d623..44836f6 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -64,7 +64,7 @@ struct key_type key_type_rxrpc_s = {
 static int rxrpc_instantiate_xdr_rxkad(struct key *key, const __be32 *xdr,
 				       unsigned toklen)
 {
-	struct rxrpc_key_token *token;
+	struct rxrpc_key_token *token, **pptoken;
 	size_t plen;
 	u32 tktlen;
 	int ret;
@@ -129,13 +129,398 @@ static int rxrpc_instantiate_xdr_rxkad(struct key *key, const __be32 *xdr,
 	key->type_data.x[0]++;
 
 	/* attach the data */
-	token->next = key->payload.data;
-	key->payload.data = token;
+	for (pptoken = (struct rxrpc_key_token **)&key->payload.data;
+	     *pptoken;
+	     pptoken = &(*pptoken)->next)
+		continue;
+	*pptoken = token;
+	if (token->kad->expiry < key->expiry)
+		key->expiry = token->kad->expiry;
+
+	_leave(" = 0");
+	return 0;
+}
+
+static void rxrpc_free_krb5_principal(struct krb5_principal *princ)
+{
+	int loop;
+
+	if (princ->name_parts) {
+		for (loop = princ->n_name_parts - 1; loop >= 0; loop--)
+			kfree(princ->name_parts[loop]);
+		kfree(princ->name_parts);
+	}
+	kfree(princ->realm);
+}
+
+static void rxrpc_free_krb5_tagged(struct krb5_tagged_data *td)
+{
+	kfree(td->data);
+}
+
+/*
+ * free up an RxK5 token
+ */
+static void rxrpc_rxk5_free(struct rxk5_key *rxk5)
+{
+	int loop;
+
+	rxrpc_free_krb5_principal(&rxk5->client);
+	rxrpc_free_krb5_principal(&rxk5->server);
+	rxrpc_free_krb5_tagged(&rxk5->session);
+
+	if (rxk5->addresses) {
+		for (loop = rxk5->n_addresses - 1; loop >= 0; loop--)
+			rxrpc_free_krb5_tagged(&rxk5->addresses[loop]);
+		kfree(rxk5->addresses);
+	}
+	if (rxk5->authdata) {
+		for (loop = rxk5->n_authdata - 1; loop >= 0; loop--)
+			rxrpc_free_krb5_tagged(&rxk5->authdata[loop]);
+		kfree(rxk5->authdata);
+	}
+
+	kfree(rxk5->ticket);
+	kfree(rxk5->ticket2);
+	kfree(rxk5);
+}
+
+/*
+ * extract a krb5 principal
+ */
+static int rxrpc_krb5_decode_principal(struct krb5_principal *princ,
+				       const __be32 **_xdr,
+				       unsigned *_toklen)
+{
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, n_parts, loop, tmp;
+
+	/* there must be at least one name, and at least #names+1 length
+	 * words */
+	if (toklen <= 12)
+		return -EINVAL;
+
+	_enter(",{%x,%x,%x},%u",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), toklen);
+
+	n_parts = ntohl(*xdr++);
+	toklen -= 4;
+	if (n_parts <= 0 || n_parts > AFSTOKEN_K5_COMPONENTS_MAX)
+		return -EINVAL;
+	princ->n_name_parts = n_parts;
+
+	if (toklen <= (n_parts + 1) * 4)
+		return -EINVAL;
+
+	princ->name_parts = kcalloc(sizeof(char *), n_parts, GFP_KERNEL);
+	if (!princ->name_parts)
+		return -ENOMEM;
+
+	for (loop = 0; loop < n_parts; loop++) {
+		if (toklen < 4)
+			return -EINVAL;
+		tmp = ntohl(*xdr++);
+		toklen -= 4;
+		if (tmp <= 0 || tmp > AFSTOKEN_STRING_MAX)
+			return -EINVAL;
+		if (tmp > toklen)
+			return -EINVAL;
+		princ->name_parts[loop] = kmalloc(tmp + 1, GFP_KERNEL);
+		if (!princ->name_parts[loop])
+			return -ENOMEM;
+		memcpy(princ->name_parts[loop], xdr, tmp);
+		princ->name_parts[loop][tmp] = 0;
+		tmp = (tmp + 3) & ~3;
+		toklen -= tmp;
+		xdr += tmp >> 2;
+	}
+
+	if (toklen < 4)
+		return -EINVAL;
+	tmp = ntohl(*xdr++);
+	toklen -= 4;
+	if (tmp <= 0 || tmp > AFSTOKEN_K5_REALM_MAX)
+		return -EINVAL;
+	if (tmp > toklen)
+		return -EINVAL;
+	princ->realm = kmalloc(tmp + 1, GFP_KERNEL);
+	if (!princ->realm)
+		return -ENOMEM;
+	memcpy(princ->realm, xdr, tmp);
+	princ->realm[tmp] = 0;
+	tmp = (tmp + 3) & ~3;
+	toklen -= tmp;
+	xdr += tmp >> 2;
+
+	_debug("%s/...@%s", princ->name_parts[0], princ->realm);
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * extract a piece of krb5 tagged data
+ */
+static int rxrpc_krb5_decode_tagged_data(struct krb5_tagged_data *td,
+					 size_t max_data_size,
+					 const __be32 **_xdr,
+					 unsigned *_toklen)
+{
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, len;
+
+	/* there must be at least one tag and one length word */
+	if (toklen <= 8)
+		return -EINVAL;
+
+	_enter(",%zu,{%x,%x},%u",
+	       max_data_size, ntohl(xdr[0]), ntohl(xdr[1]), toklen);
+
+	td->tag = ntohl(*xdr++);
+	len = ntohl(*xdr++);
+	toklen -= 8;
+	if (len > max_data_size)
+		return -EINVAL;
+	td->data_len = len;
+
+	if (len > 0) {
+		td->data = kmalloc(len, GFP_KERNEL);
+		if (!td->data)
+			return -ENOMEM;
+		memcpy(td->data, xdr, len);
+		len = (len + 3) & ~3;
+		toklen -= len;
+		xdr += len >> 2;
+	}
+
+	_debug("tag %x len %x", td->tag, td->data_len);
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * extract an array of tagged data
+ */
+static int rxrpc_krb5_decode_tagged_array(struct krb5_tagged_data **_td,
+					  u8 *_n_elem,
+					  u8 max_n_elem,
+					  size_t max_elem_size,
+					  const __be32 **_xdr,
+					  unsigned *_toklen)
+{
+	struct krb5_tagged_data *td;
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, n_elem, loop;
+	int ret;
+
+	/* there must be at least one count */
+	if (toklen < 4)
+		return -EINVAL;
+
+	_enter(",,%u,%zu,{%x},%u",
+	       max_n_elem, max_elem_size, ntohl(xdr[0]), toklen);
+
+	n_elem = ntohl(*xdr++);
+	toklen -= 4;
+	if (n_elem < 0 || n_elem > max_n_elem)
+		return -EINVAL;
+	*_n_elem = n_elem;
+	if (n_elem > 0) {
+		if (toklen <= (n_elem + 1) * 4)
+			return -EINVAL;
+
+		_debug("n_elem %d", n_elem);
+
+		td = kcalloc(sizeof(struct krb5_tagged_data), n_elem,
+			     GFP_KERNEL);
+		if (!td)
+			return -ENOMEM;
+		*_td = td;
+
+		for (loop = 0; loop < n_elem; loop++) {
+			ret = rxrpc_krb5_decode_tagged_data(&td[loop],
+							    max_elem_size,
+							    &xdr, &toklen);
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * extract a krb5 ticket
+ */
+static int rxrpc_krb5_decode_ticket(u8 **_ticket, uint16_t *_tktlen,
+				    const __be32 **_xdr, unsigned *_toklen)
+{
+	const __be32 *xdr = *_xdr;
+	unsigned toklen = *_toklen, len;
+
+	/* there must be at least one length word */
+	if (toklen <= 4)
+		return -EINVAL;
+
+	_enter(",{%x},%u", ntohl(xdr[0]), toklen);
+
+	len = ntohl(*xdr++);
+	toklen -= 4;
+	if (len > AFSTOKEN_K5_TIX_MAX)
+		return -EINVAL;
+	*_tktlen = len;
+
+	_debug("ticket len %u", len);
+
+	if (len > 0) {
+		*_ticket = kmalloc(len, GFP_KERNEL);
+		if (!*_ticket)
+			return -ENOMEM;
+		memcpy(*_ticket, xdr, len);
+		len = (len + 3) & ~3;
+		toklen -= len;
+		xdr += len >> 2;
+	}
+
+	*_xdr = xdr;
+	*_toklen = toklen;
+	_leave(" = 0 [toklen=%u]", toklen);
+	return 0;
+}
+
+/*
+ * parse an RxK5 type XDR format token
+ * - the caller guarantees we have at least 4 words
+ */
+static int rxrpc_instantiate_xdr_rxk5(struct key *key, const __be32 *xdr,
+				      unsigned toklen)
+{
+	struct rxrpc_key_token *token, **pptoken;
+	struct rxk5_key *rxk5;
+	const __be32 *end_xdr = xdr + (toklen >> 2);
+	int ret;
+
+	_enter(",{%x,%x,%x,%x},%u",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), ntohl(xdr[3]),
+	       toklen);
+
+	/* reserve some payload space for this subkey - the length of the token
+	 * is a reasonable approximation */
+	ret = key_payload_reserve(key, key->datalen + toklen);
+	if (ret < 0)
+		return ret;
+
+	token = kzalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
+		return -ENOMEM;
+
+	rxk5 = kzalloc(sizeof(*rxk5), GFP_KERNEL);
+	if (!rxk5) {
+		kfree(token);
+		return -ENOMEM;
+	}
+
+	token->security_index = RXRPC_SECURITY_RXK5;
+	token->k5 = rxk5;
+
+	/* extract the principals */
+	ret = rxrpc_krb5_decode_principal(&rxk5->client, &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+	ret = rxrpc_krb5_decode_principal(&rxk5->server, &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	/* extract the session key and the encoding type (the tag field ->
+	 * ENCTYPE_xxx) */
+	ret = rxrpc_krb5_decode_tagged_data(&rxk5->session, AFSTOKEN_DATA_MAX,
+					    &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	if (toklen < 4 * 8 + 2 * 4)
+		goto inval;
+	rxk5->authtime	= be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->starttime	= be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->endtime	= be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->renew_till = be64_to_cpup((const __be64 *) xdr);
+	xdr += 2;
+	rxk5->is_skey = ntohl(*xdr++);
+	rxk5->flags = ntohl(*xdr++);
+	toklen -= 4 * 8 + 2 * 4;
+
+	_debug("times: a=%llx s=%llx e=%llx rt=%llx",
+	       rxk5->authtime, rxk5->starttime, rxk5->endtime,
+	       rxk5->renew_till);
+	_debug("is_skey=%x flags=%x", rxk5->is_skey, rxk5->flags);
+
+	/* extract the permitted client addresses */
+	ret = rxrpc_krb5_decode_tagged_array(&rxk5->addresses,
+					     &rxk5->n_addresses,
+					     AFSTOKEN_K5_ADDRESSES_MAX,
+					     AFSTOKEN_DATA_MAX,
+					     &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	ASSERTCMP((end_xdr - xdr) << 2, ==, toklen);
+
+	/* extract the tickets */
+	ret = rxrpc_krb5_decode_ticket(&rxk5->ticket, &rxk5->ticket_len,
+				       &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+	ret = rxrpc_krb5_decode_ticket(&rxk5->ticket2, &rxk5->ticket2_len,
+				       &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	ASSERTCMP((end_xdr - xdr) << 2, ==, toklen);
+
+	/* extract the typed auth data */
+	ret = rxrpc_krb5_decode_tagged_array(&rxk5->authdata,
+					     &rxk5->n_authdata,
+					     AFSTOKEN_K5_AUTHDATA_MAX,
+					     AFSTOKEN_BDATALN_MAX,
+					     &xdr, &toklen);
+	if (ret < 0)
+		goto error;
+
+	ASSERTCMP((end_xdr - xdr) << 2, ==, toklen);
+
+	if (toklen != 0)
+		goto inval;
+
+	/* attach the payload to the key */
+	for (pptoken = (struct rxrpc_key_token **)&key->payload.data;
+	     *pptoken;
+	     pptoken = &(*pptoken)->next)
+		continue;
+	*pptoken = token;
 	if (token->kad->expiry < key->expiry)
 		key->expiry = token->kad->expiry;
 
 	_leave(" = 0");
 	return 0;
+
+inval:
+	ret = -EINVAL;
+error:
+	rxrpc_rxk5_free(rxk5);
+	kfree(token);
+	_leave(" = %d", ret);
+	return ret;
 }
 
 /*
@@ -228,6 +613,8 @@ static int rxrpc_instantiate_xdr(struct key *key, const void *data, size_t datal
 		sec_ix = ntohl(*xdr++);
 		toklen -= 4;
 
+		_debug("TOKEN type=%u [%p-%p]", sec_ix, xdr, token);
+
 		switch (sec_ix) {
 		case RXRPC_SECURITY_RXKAD:
 			ret = rxrpc_instantiate_xdr_rxkad(key, xdr, toklen);
@@ -235,6 +622,12 @@ static int rxrpc_instantiate_xdr(struct key *key, const void *data, size_t datal
 				goto error;
 			break;
 
+		case RXRPC_SECURITY_RXK5:
+			ret = rxrpc_instantiate_xdr_rxk5(key, xdr, toklen);
+			if (ret != 0)
+				goto error;
+			break;
+
 		default:
 			ret = -EPROTONOSUPPORT;
 			goto error;
@@ -412,6 +805,10 @@ static void rxrpc_destroy(struct key *key)
 		case RXRPC_SECURITY_RXKAD:
 			kfree(token->kad);
 			break;
+		case RXRPC_SECURITY_RXK5:
+			if (token->k5)
+				rxrpc_rxk5_free(token->k5);
+			break;
 		default:
 			printk(KERN_ERR "Unknown token type %x on rxrpc key\n",
 			       token->security_index);
@@ -602,10 +999,13 @@ EXPORT_SYMBOL(rxrpc_get_null_key);
 static long rxrpc_read(const struct key *key,
 		       char __user *buffer, size_t buflen)
 {
-	struct rxrpc_key_token *token;
-	size_t size, toksize;
-	__be32 __user *xdr;
-	u32 cnlen, tktlen, ntoks, zero;
+	const struct rxrpc_key_token *token;
+	const struct krb5_principal *princ;
+	size_t size;
+	__be32 __user *xdr, *oldxdr;
+	u32 cnlen, toksize, ntoks, tok, zero;
+	u16 toksizes[AFSTOKEN_MAX];
+	int loop;
 
 	_enter("");
 
@@ -614,28 +1014,68 @@ static long rxrpc_read(const struct key *key,
 		return -EOPNOTSUPP;
 	cnlen = strlen(key->description + 4);
 
+#define RND(X) (((X) + 3) & ~3)
+
 	/* AFS keys we return in XDR form, so we need to work out the size of
 	 * the XDR */
 	size = 2 * 4;	/* flags, cellname len */
-	size += (cnlen + 3) & ~3;	/* cellname */
+	size += RND(cnlen);	/* cellname */
 	size += 1 * 4;	/* token count */
 
 	ntoks = 0;
 	for (token = key->payload.data; token; token = token->next) {
+		toksize = 4;	/* sec index */
+
 		switch (token->security_index) {
 		case RXRPC_SECURITY_RXKAD:
-			size += 2 * 4;	/* length, security index (switch ID) */
-			size += 8 * 4;	/* viceid, kvno, key*2, begin, end,
-					 * primary, tktlen */
-			size += (token->kad->ticket_len + 3) & ~3; /* ticket */
-			ntoks++;
+			toksize += 8 * 4;	/* viceid, kvno, key*2, begin,
+						 * end, primary, tktlen */
+			toksize += RND(token->kad->ticket_len);
 			break;
 
-		default: /* can't encode */
+		case RXRPC_SECURITY_RXK5:
+			princ = &token->k5->client;
+			toksize += 4 + princ->n_name_parts * 4;
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				toksize += RND(strlen(princ->name_parts[loop]));
+			toksize += 4 + RND(strlen(princ->realm));
+
+			princ = &token->k5->server;
+			toksize += 4 + princ->n_name_parts * 4;
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				toksize += RND(strlen(princ->name_parts[loop]));
+			toksize += 4 + RND(strlen(princ->realm));
+
+			toksize += 8 + RND(token->k5->session.data_len);
+
+			toksize += 4 * 8 + 2 * 4;
+
+			toksize += 4 + token->k5->n_addresses * 8;
+			for (loop = 0; loop < token->k5->n_addresses; loop++)
+				toksize += RND(token->k5->addresses[loop].data_len);
+
+			toksize += 4 + RND(token->k5->ticket_len);
+			toksize += 4 + RND(token->k5->ticket2_len);
+
+			toksize += 4 + token->k5->n_authdata * 8;
+			for (loop = 0; loop < token->k5->n_authdata; loop++)
+				toksize += RND(token->k5->authdata[loop].data_len);
 			break;
+
+		default: /* we have a ticket we can't encode */
+			BUG();
+			continue;
 		}
+
+		_debug("token[%u]: toksize=%u", ntoks, toksize);
+		ASSERTCMP(toksize, <=, AFSTOKEN_LENGTH_MAX);
+
+		toksizes[ntoks++] = toksize;
+		size += toksize + 4; /* each token has a length word */
 	}
 
+#undef RND
+
 	if (!buffer || buflen < size)
 		return size;
 
@@ -647,52 +1087,109 @@ static long rxrpc_read(const struct key *key,
 		if (put_user(y, xdr++) < 0)	\
 			goto fault;		\
 	} while(0)
+#define ENCODE_DATA(l, s)						\
+	do {								\
+		u32 _l = (l);						\
+		ENCODE(l);						\
+		if (copy_to_user(xdr, (s), _l) != 0)			\
+			goto fault;					\
+		if (_l & 3 &&						\
+		    copy_to_user((u8 *)xdr + _l, &zero, 4 - (_l & 3)) != 0) \
+			goto fault;					\
+		xdr += (_l + 3) >> 2;					\
+	} while(0)
+#define ENCODE64(x)					\
+	do {						\
+		__be64 y = cpu_to_be64(x);		\
+		if (copy_to_user(xdr, &y, 8) != 0)	\
+			goto fault;			\
+		xdr += 8 >> 2;				\
+	} while(0)
+#define ENCODE_STR(s)				\
+	do {					\
+		const char *_s = (s);		\
+		ENCODE_DATA(strlen(_s), _s);	\
+	} while(0)
 
-	ENCODE(0);	/* flags */
-	ENCODE(cnlen);	/* cellname length */
-	if (copy_to_user(xdr, key->description + 4, cnlen) != 0)
-		goto fault;
-	if (cnlen & 3 &&
-	    copy_to_user((u8 *)xdr + cnlen, &zero, 4 - (cnlen & 3)) != 0)
-		goto fault;
-	xdr += (cnlen + 3) >> 2;
-	ENCODE(ntoks);	/* token count */
+	ENCODE(0);					/* flags */
+	ENCODE_DATA(cnlen, key->description + 4);	/* cellname */
+	ENCODE(ntoks);
 
+	tok = 0;
 	for (token = key->payload.data; token; token = token->next) {
-		toksize = 1 * 4;	/* sec index */
+		toksize = toksizes[tok++];
+		ENCODE(toksize);
+		oldxdr = xdr;
+		ENCODE(token->security_index);
 
 		switch (token->security_index) {
 		case RXRPC_SECURITY_RXKAD:
-			toksize += 8 * 4;
-			toksize += (token->kad->ticket_len + 3) & ~3;
-			ENCODE(toksize);
-			ENCODE(token->security_index);
 			ENCODE(token->kad->vice_id);
 			ENCODE(token->kad->kvno);
-			if (copy_to_user(xdr, token->kad->session_key, 8) != 0)
-				goto fault;
-			xdr += 8 >> 2;
+			ENCODE_DATA(8, token->kad->session_key);
 			ENCODE(token->kad->start);
 			ENCODE(token->kad->expiry);
 			ENCODE(token->kad->primary_flag);
-			tktlen = token->kad->ticket_len;
-			ENCODE(tktlen);
-			if (copy_to_user(xdr, token->kad->ticket, tktlen) != 0)
-				goto fault;
-			if (tktlen & 3 &&
-			    copy_to_user((u8 *)xdr + tktlen, &zero,
-					 4 - (tktlen & 3)) != 0)
-				goto fault;
-			xdr += (tktlen + 3) >> 2;
+			ENCODE_DATA(token->kad->ticket_len, token->kad->ticket);
+			break;
+
+		case RXRPC_SECURITY_RXK5:
+			princ = &token->k5->client;
+			ENCODE(princ->n_name_parts);
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				ENCODE_STR(princ->name_parts[loop]);
+			ENCODE_STR(princ->realm);
+
+			princ = &token->k5->server;
+			ENCODE(princ->n_name_parts);
+			for (loop = 0; loop < princ->n_name_parts; loop++)
+				ENCODE_STR(princ->name_parts[loop]);
+			ENCODE_STR(princ->realm);
+
+			ENCODE(token->k5->session.tag);
+			ENCODE_DATA(token->k5->session.data_len,
+				    token->k5->session.data);
+
+			ENCODE64(token->k5->authtime);
+			ENCODE64(token->k5->starttime);
+			ENCODE64(token->k5->endtime);
+			ENCODE64(token->k5->renew_till);
+			ENCODE(token->k5->is_skey);
+			ENCODE(token->k5->flags);
+
+			ENCODE(token->k5->n_addresses);
+			for (loop = 0; loop < token->k5->n_addresses; loop++) {
+				ENCODE(token->k5->addresses[loop].tag);
+				ENCODE_DATA(token->k5->addresses[loop].data_len,
+					    token->k5->addresses[loop].data);
+			}
+
+			ENCODE_DATA(token->k5->ticket_len, token->k5->ticket);
+			ENCODE_DATA(token->k5->ticket2_len, token->k5->ticket2);
+
+			ENCODE(token->k5->n_authdata);
+			for (loop = 0; loop < token->k5->n_authdata; loop++) {
+				ENCODE(token->k5->authdata[loop].tag);
+				ENCODE_DATA(token->k5->authdata[loop].data_len,
+					    token->k5->authdata[loop].data);
+			}
 			break;
 
 		default:
+			BUG();
 			break;
 		}
+
+		ASSERTCMP((unsigned long)xdr - (unsigned long)oldxdr, ==,
+			  toksize);
 	}
 
+#undef ENCODE_STR
+#undef ENCODE_DATA
+#undef ENCODE64
 #undef ENCODE
 
+	ASSERTCMP(tok, ==, ntoks);
 	ASSERTCMP((char __user *) xdr - buffer, ==, size);
 	_leave(" = %zu", size);
 	return size;


^ permalink raw reply related

* [PATCH 3/4] RxRPC: Allow RxRPC keys to be read
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells
In-Reply-To: <20090914111730.10233.44007.stgit@warthog.procyon.org.uk>

Allow RxRPC keys to be read.  This is to allow pioctl() to be implemented in
userspace.  RxRPC keys are read out in XDR format.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-key.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 109 insertions(+), 0 deletions(-)


diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index a3a7acb..bf4d623 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -29,6 +29,7 @@ static int rxrpc_instantiate_s(struct key *, const void *, size_t);
 static void rxrpc_destroy(struct key *);
 static void rxrpc_destroy_s(struct key *);
 static void rxrpc_describe(const struct key *, struct seq_file *);
+static long rxrpc_read(const struct key *, char __user *, size_t);
 
 /*
  * rxrpc defined keys take an arbitrary string as the description and an
@@ -40,6 +41,7 @@ struct key_type key_type_rxrpc = {
 	.match		= user_match,
 	.destroy	= rxrpc_destroy,
 	.describe	= rxrpc_describe,
+	.read		= rxrpc_read,
 };
 EXPORT_SYMBOL(key_type_rxrpc);
 
@@ -592,3 +594,110 @@ struct key *rxrpc_get_null_key(const char *keyname)
 	return key;
 }
 EXPORT_SYMBOL(rxrpc_get_null_key);
+
+/*
+ * read the contents of an rxrpc key
+ * - this returns the result in XDR form
+ */
+static long rxrpc_read(const struct key *key,
+		       char __user *buffer, size_t buflen)
+{
+	struct rxrpc_key_token *token;
+	size_t size, toksize;
+	__be32 __user *xdr;
+	u32 cnlen, tktlen, ntoks, zero;
+
+	_enter("");
+
+	/* we don't know what form we should return non-AFS keys in */
+	if (memcmp(key->description, "afs@", 4) != 0)
+		return -EOPNOTSUPP;
+	cnlen = strlen(key->description + 4);
+
+	/* AFS keys we return in XDR form, so we need to work out the size of
+	 * the XDR */
+	size = 2 * 4;	/* flags, cellname len */
+	size += (cnlen + 3) & ~3;	/* cellname */
+	size += 1 * 4;	/* token count */
+
+	ntoks = 0;
+	for (token = key->payload.data; token; token = token->next) {
+		switch (token->security_index) {
+		case RXRPC_SECURITY_RXKAD:
+			size += 2 * 4;	/* length, security index (switch ID) */
+			size += 8 * 4;	/* viceid, kvno, key*2, begin, end,
+					 * primary, tktlen */
+			size += (token->kad->ticket_len + 3) & ~3; /* ticket */
+			ntoks++;
+			break;
+
+		default: /* can't encode */
+			break;
+		}
+	}
+
+	if (!buffer || buflen < size)
+		return size;
+
+	xdr = (__be32 __user *) buffer;
+	zero = 0;
+#define ENCODE(x)				\
+	do {					\
+		__be32 y = htonl(x);		\
+		if (put_user(y, xdr++) < 0)	\
+			goto fault;		\
+	} while(0)
+
+	ENCODE(0);	/* flags */
+	ENCODE(cnlen);	/* cellname length */
+	if (copy_to_user(xdr, key->description + 4, cnlen) != 0)
+		goto fault;
+	if (cnlen & 3 &&
+	    copy_to_user((u8 *)xdr + cnlen, &zero, 4 - (cnlen & 3)) != 0)
+		goto fault;
+	xdr += (cnlen + 3) >> 2;
+	ENCODE(ntoks);	/* token count */
+
+	for (token = key->payload.data; token; token = token->next) {
+		toksize = 1 * 4;	/* sec index */
+
+		switch (token->security_index) {
+		case RXRPC_SECURITY_RXKAD:
+			toksize += 8 * 4;
+			toksize += (token->kad->ticket_len + 3) & ~3;
+			ENCODE(toksize);
+			ENCODE(token->security_index);
+			ENCODE(token->kad->vice_id);
+			ENCODE(token->kad->kvno);
+			if (copy_to_user(xdr, token->kad->session_key, 8) != 0)
+				goto fault;
+			xdr += 8 >> 2;
+			ENCODE(token->kad->start);
+			ENCODE(token->kad->expiry);
+			ENCODE(token->kad->primary_flag);
+			tktlen = token->kad->ticket_len;
+			ENCODE(tktlen);
+			if (copy_to_user(xdr, token->kad->ticket, tktlen) != 0)
+				goto fault;
+			if (tktlen & 3 &&
+			    copy_to_user((u8 *)xdr + tktlen, &zero,
+					 4 - (tktlen & 3)) != 0)
+				goto fault;
+			xdr += (tktlen + 3) >> 2;
+			break;
+
+		default:
+			break;
+		}
+	}
+
+#undef ENCODE
+
+	ASSERTCMP((char __user *) xdr - buffer, ==, size);
+	_leave(" = %zu", size);
+	return size;
+
+fault:
+	_leave(" = -EFAULT");
+	return -EFAULT;
+}
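
The size arithmetic and padding in rxrpc_read() above can be sketched as
a standalone userspace fragment.  This is only an illustration under
stated assumptions, not part of the patch: xdr_pad4(), rxkad_xdr_size()
and xdr_put_opaque() are hypothetical helper names, and the sketch
covers a single rxkad token only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl() */

/* Round a byte count up to the next multiple of 4, as XDR requires.
 * Hypothetical helper; the patch open-codes this as (len + 3) & ~3. */
static size_t xdr_pad4(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Size of the XDR blob for one cell name plus one rxkad token,
 * following the first pass of rxrpc_read(). */
static size_t rxkad_xdr_size(size_t cnlen, size_t tktlen)
{
	size_t size = 2 * 4;		/* flags, cellname len */

	size += xdr_pad4(cnlen);	/* cellname */
	size += 1 * 4;			/* token count */
	size += 2 * 4;			/* token length, security index */
	size += 8 * 4;			/* viceid, kvno, key*2, begin, end,
					 * primary, tktlen */
	size += xdr_pad4(tktlen);	/* ticket */
	return size;
}

/* Append a variable-length opaque: big-endian length word, the data,
 * then zero padding up to a 4-byte boundary.  Returns the advanced
 * write pointer.  The caller must provide enough room. */
static uint8_t *xdr_put_opaque(uint8_t *p, const void *data, uint32_t len)
{
	uint32_t be = htonl(len);

	assert(p != NULL);
	memcpy(p, &be, 4);
	p += 4;
	memcpy(p, data, len);
	memset(p + len, 0, xdr_pad4(len) - len);
	return p + xdr_pad4(len);
}
```

For example, an 11-character cell name and a 5-byte ticket give
8 + 12 + 4 + 8 + 32 + 8 = 72 bytes, which is the value a caller would
get back from a read with a NULL or too-small buffer.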


^ permalink raw reply related

* [PATCH 2/4] RxRPC: Allow key payloads to be passed in XDR form
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells
In-Reply-To: <20090914111730.10233.44007.stgit@warthog.procyon.org.uk>

Allow add_key() and KEYCTL_INSTANTIATE to accept key payloads in XDR form as
described by openafs-1.4.10/src/auth/afs_token.xg.  This provides a way of
passing kaserver, Kerberos 4, Kerberos 5 and GSSAPI keys from userspace, and
allows for future expansion.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/keys/rxrpc-type.h |   55 ++++++++
 net/rxrpc/ar-internal.h   |   16 --
 net/rxrpc/ar-key.c        |  308 +++++++++++++++++++++++++++++++++++++++------
 net/rxrpc/ar-security.c   |    8 +
 net/rxrpc/rxkad.c         |   41 +++---
 5 files changed, 353 insertions(+), 75 deletions(-)


diff --git a/include/keys/rxrpc-type.h b/include/keys/rxrpc-type.h
index 7609365..c0d9121 100644
--- a/include/keys/rxrpc-type.h
+++ b/include/keys/rxrpc-type.h
@@ -21,4 +21,59 @@ extern struct key_type key_type_rxrpc;
 
 extern struct key *rxrpc_get_null_key(const char *);
 
+/*
+ * RxRPC key for Kerberos IV (type-2 security)
+ */
+struct rxkad_key {
+	u32	vice_id;
+	u32	start;			/* time at which ticket starts */
+	u32	expiry;			/* time at which ticket expires */
+	u32	kvno;			/* key version number */
+	u8	primary_flag;		/* T if key for primary cell for this user */
+	u16	ticket_len;		/* length of ticket[] */
+	u8	session_key[8];		/* DES session key */
+	u8	ticket[0];		/* the encrypted ticket */
+};
+
+/*
+ * list of tokens attached to an rxrpc key
+ */
+struct rxrpc_key_token {
+	u16	security_index;		/* RxRPC header security index */
+	struct rxrpc_key_token *next;	/* the next token in the list */
+	union {
+		struct rxkad_key *kad;
+	};
+};
+
+/*
+ * structure of raw payloads passed to add_key() or instantiate key
+ */
+struct rxrpc_key_data_v1 {
+	u32		kif_version;		/* 1 */
+	u16		security_index;
+	u16		ticket_length;
+	u32		expiry;			/* time_t */
+	u32		kvno;
+	u8		session_key[8];
+	u8		ticket[0];
+};
+
+/*
+ * AF_RXRPC key payload derived from XDR format
+ * - based on openafs-1.4.10/src/auth/afs_token.xg
+ */
+#define AFSTOKEN_LENGTH_MAX		16384	/* max payload size */
+#define AFSTOKEN_CELL_MAX		64	/* max cellname length */
+#define AFSTOKEN_MAX			8	/* max tokens per payload */
+#define AFSTOKEN_RK_TIX_MAX		12000	/* max RxKAD ticket size */
+#define AFSTOKEN_GK_KEY_MAX		64	/* max GSSAPI key size */
+#define AFSTOKEN_GK_TOKEN_MAX		16384	/* max GSSAPI token size */
+#define AFSTOKEN_K5_COMPONENTS_MAX	16	/* max K5 components */
+#define AFSTOKEN_K5_NAME_MAX		128	/* max K5 name length */
+#define AFSTOKEN_K5_REALM_MAX		64	/* max K5 realm name length */
+#define AFSTOKEN_K5_TIX_MAX		16384	/* max K5 ticket size */
+#define AFSTOKEN_K5_ADDRESSES_MAX	16	/* max K5 addresses */
+#define AFSTOKEN_K5_AUTHDATA_MAX	16	/* max K5 pieces of auth data */
+
 #endif /* _KEYS_RXRPC_TYPE_H */
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 3e7318c..46c6d88 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -402,22 +402,6 @@ struct rxrpc_call {
 };
 
 /*
- * RxRPC key for Kerberos (type-2 security)
- */
-struct rxkad_key {
-	u16	security_index;		/* RxRPC header security index */
-	u16	ticket_len;		/* length of ticket[] */
-	u32	expiry;			/* time at which expires */
-	u32	kvno;			/* key version number */
-	u8	session_key[8];		/* DES session key */
-	u8	ticket[0];		/* the encrypted ticket */
-};
-
-struct rxrpc_key_payload {
-	struct rxkad_key k;
-};
-
-/*
  * locally abort an RxRPC call
  */
 static inline void rxrpc_abort_call(struct rxrpc_call *call, u32 abort_code)
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index b3d10e7..a3a7acb 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <linux/key-type.h>
 #include <linux/crypto.h>
+#include <linux/ctype.h>
 #include <net/sock.h>
 #include <net/af_rxrpc.h>
 #include <keys/rxrpc-type.h>
@@ -55,6 +56,202 @@ struct key_type key_type_rxrpc_s = {
 };
 
 /*
+ * parse an RxKAD type XDR format token
+ * - the caller guarantees we have at least 4 words
+ */
+static int rxrpc_instantiate_xdr_rxkad(struct key *key, const __be32 *xdr,
+				       unsigned toklen)
+{
+	struct rxrpc_key_token *token;
+	size_t plen;
+	u32 tktlen;
+	int ret;
+
+	_enter(",{%x,%x,%x,%x},%u",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), ntohl(xdr[3]),
+	       toklen);
+
+	if (toklen <= 8 * 4)
+		return -EKEYREJECTED;
+	tktlen = ntohl(xdr[7]);
+	_debug("tktlen: %x", tktlen);
+	if (tktlen > AFSTOKEN_RK_TIX_MAX)
+		return -EKEYREJECTED;
+	if (8 * 4 + tktlen != toklen)
+		return -EKEYREJECTED;
+
+	plen = sizeof(*token) + sizeof(*token->kad) + tktlen;
+	ret = key_payload_reserve(key, key->datalen + plen);
+	if (ret < 0)
+		return ret;
+
+	plen -= sizeof(*token);
+	token = kmalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
+		return -ENOMEM;
+
+	token->kad = kmalloc(plen, GFP_KERNEL);
+	if (!token->kad) {
+		kfree(token);
+		return -ENOMEM;
+	}
+
+	token->security_index	= RXRPC_SECURITY_RXKAD;
+	token->kad->ticket_len	= tktlen;
+	token->kad->vice_id	= ntohl(xdr[0]);
+	token->kad->kvno	= ntohl(xdr[1]);
+	token->kad->start	= ntohl(xdr[4]);
+	token->kad->expiry	= ntohl(xdr[5]);
+	token->kad->primary_flag = ntohl(xdr[6]);
+	memcpy(&token->kad->session_key, &xdr[2], 8);
+	memcpy(&token->kad->ticket, &xdr[8], tktlen);
+
+	_debug("SCIX: %u", token->security_index);
+	_debug("TLEN: %u", token->kad->ticket_len);
+	_debug("EXPY: %x", token->kad->expiry);
+	_debug("KVNO: %u", token->kad->kvno);
+	_debug("PRIM: %u", token->kad->primary_flag);
+	_debug("SKEY: %02x%02x%02x%02x%02x%02x%02x%02x",
+	       token->kad->session_key[0], token->kad->session_key[1],
+	       token->kad->session_key[2], token->kad->session_key[3],
+	       token->kad->session_key[4], token->kad->session_key[5],
+	       token->kad->session_key[6], token->kad->session_key[7]);
+	if (token->kad->ticket_len >= 8)
+		_debug("TCKT: %02x%02x%02x%02x%02x%02x%02x%02x",
+		       token->kad->ticket[0], token->kad->ticket[1],
+		       token->kad->ticket[2], token->kad->ticket[3],
+		       token->kad->ticket[4], token->kad->ticket[5],
+		       token->kad->ticket[6], token->kad->ticket[7]);
+
+	/* count the number of tokens attached */
+	key->type_data.x[0]++;
+
+	/* attach the data */
+	token->next = key->payload.data;
+	key->payload.data = token;
+	if (token->kad->expiry < key->expiry)
+		key->expiry = token->kad->expiry;
+
+	_leave(" = 0");
+	return 0;
+}
+
+/*
+ * attempt to parse the data as the XDR format
+ * - the caller guarantees we have more than 7 words
+ */
+static int rxrpc_instantiate_xdr(struct key *key, const void *data, size_t datalen)
+{
+	const __be32 *xdr = data, *token;
+	const char *cp;
+	unsigned len, tmp, loop, ntoken, toklen, sec_ix;
+	int ret;
+
+	_enter(",{%x,%x,%x,%x},%zu",
+	       ntohl(xdr[0]), ntohl(xdr[1]), ntohl(xdr[2]), ntohl(xdr[3]),
+	       datalen);
+
+	if (datalen > AFSTOKEN_LENGTH_MAX)
+		goto not_xdr;
+
+	/* XDR is an array of __be32's */
+	if (datalen & 3)
+		goto not_xdr;
+
+	/* the flags should be 0 (the setpag bit must be handled by
+	 * userspace) */
+	if (ntohl(*xdr++) != 0)
+		goto not_xdr;
+	datalen -= 4;
+
+	/* check the cell name */
+	len = ntohl(*xdr++);
+	if (len < 1 || len > AFSTOKEN_CELL_MAX)
+		goto not_xdr;
+	datalen -= 4;
+	tmp = (len + 3) & ~3;
+	if (tmp > datalen)
+		goto not_xdr;
+
+	cp = (const char *) xdr;
+	for (loop = 0; loop < len; loop++)
+		if (!isprint(cp[loop]))
+			goto not_xdr;
+	if (len < tmp)
+		for (; loop < tmp; loop++)
+			if (cp[loop])
+				goto not_xdr;
+	_debug("cellname: [%u/%u] '%*.*s'",
+	       len, tmp, len, len, (const char *) xdr);
+	datalen -= tmp;
+	xdr += tmp >> 2;
+
+	/* get the token count */
+	if (datalen < 12)
+		goto not_xdr;
+	ntoken = ntohl(*xdr++);
+	datalen -= 4;
+	_debug("ntoken: %x", ntoken);
+	if (ntoken < 1 || ntoken > AFSTOKEN_MAX)
+		goto not_xdr;
+
+	/* check each token wrapper */
+	token = xdr;
+	loop = ntoken;
+	do {
+		if (datalen < 8)
+			goto not_xdr;
+		toklen = ntohl(*xdr++);
+		sec_ix = ntohl(*xdr);
+		datalen -= 4;
+		_debug("token: [%x/%zx] %x", toklen, datalen, sec_ix);
+		if (toklen < 20 || toklen > datalen)
+			goto not_xdr;
+		datalen -= (toklen + 3) & ~3;
+		xdr += (toklen + 3) >> 2;
+
+	} while (--loop > 0);
+
+	_debug("remainder: %zu", datalen);
+	if (datalen != 0)
+		goto not_xdr;
+
+	/* okay: we're going to assume it's valid XDR format
+	 * - we ignore the cellname, relying on the key to be correctly named
+	 */
+	do {
+		xdr = token;
+		toklen = ntohl(*xdr++);
+		token = xdr + ((toklen + 3) >> 2);
+		sec_ix = ntohl(*xdr++);
+		toklen -= 4;
+
+		switch (sec_ix) {
+		case RXRPC_SECURITY_RXKAD:
+			ret = rxrpc_instantiate_xdr_rxkad(key, xdr, toklen);
+			if (ret != 0)
+				goto error;
+			break;
+
+		default:
+			ret = -EPROTONOSUPPORT;
+			goto error;
+		}
+
+	} while (--ntoken > 0);
+
+	_leave(" = 0");
+	return 0;
+
+not_xdr:
+	_leave(" = -EPROTO");
+	return -EPROTO;
+error:
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
  * instantiate an rxrpc defined key
  * data should be of the form:
  *	OFFSET	LEN	CONTENT
@@ -70,8 +267,8 @@ struct key_type key_type_rxrpc_s = {
  */
 static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 {
-	const struct rxkad_key *tsec;
-	struct rxrpc_key_payload *upayload;
+	const struct rxrpc_key_data_v1 *v1;
+	struct rxrpc_key_token *token, **pp;
 	size_t plen;
 	u32 kver;
 	int ret;
@@ -82,6 +279,13 @@ static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 	if (!data && datalen == 0)
 		return 0;
 
+	/* determine if the XDR payload format is being used */
+	if (datalen > 7 * 4) {
+		ret = rxrpc_instantiate_xdr(key, data, datalen);
+		if (ret != -EPROTO)
+			return ret;
+	}
+
 	/* get the key interface version number */
 	ret = -EINVAL;
 	if (datalen <= 4 || !data)
@@ -98,53 +302,67 @@ static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 
 	/* deal with a version 1 key */
 	ret = -EINVAL;
-	if (datalen < sizeof(*tsec))
+	if (datalen < sizeof(*v1))
 		goto error;
 
-	tsec = data;
-	if (datalen != sizeof(*tsec) + tsec->ticket_len)
+	v1 = data;
+	if (datalen != sizeof(*v1) + v1->ticket_length)
 		goto error;
 
-	_debug("SCIX: %u", tsec->security_index);
-	_debug("TLEN: %u", tsec->ticket_len);
-	_debug("EXPY: %x", tsec->expiry);
-	_debug("KVNO: %u", tsec->kvno);
+	_debug("SCIX: %u", v1->security_index);
+	_debug("TLEN: %u", v1->ticket_length);
+	_debug("EXPY: %x", v1->expiry);
+	_debug("KVNO: %u", v1->kvno);
 	_debug("SKEY: %02x%02x%02x%02x%02x%02x%02x%02x",
-	       tsec->session_key[0], tsec->session_key[1],
-	       tsec->session_key[2], tsec->session_key[3],
-	       tsec->session_key[4], tsec->session_key[5],
-	       tsec->session_key[6], tsec->session_key[7]);
-	if (tsec->ticket_len >= 8)
+	       v1->session_key[0], v1->session_key[1],
+	       v1->session_key[2], v1->session_key[3],
+	       v1->session_key[4], v1->session_key[5],
+	       v1->session_key[6], v1->session_key[7]);
+	if (v1->ticket_length >= 8)
 		_debug("TCKT: %02x%02x%02x%02x%02x%02x%02x%02x",
-		       tsec->ticket[0], tsec->ticket[1],
-		       tsec->ticket[2], tsec->ticket[3],
-		       tsec->ticket[4], tsec->ticket[5],
-		       tsec->ticket[6], tsec->ticket[7]);
+		       v1->ticket[0], v1->ticket[1],
+		       v1->ticket[2], v1->ticket[3],
+		       v1->ticket[4], v1->ticket[5],
+		       v1->ticket[6], v1->ticket[7]);
 
 	ret = -EPROTONOSUPPORT;
-	if (tsec->security_index != RXRPC_SECURITY_RXKAD)
+	if (v1->security_index != RXRPC_SECURITY_RXKAD)
 		goto error;
 
-	key->type_data.x[0] = tsec->security_index;
-
-	plen = sizeof(*upayload) + tsec->ticket_len;
-	ret = key_payload_reserve(key, plen);
+	plen = sizeof(*token->kad) + v1->ticket_length;
+	ret = key_payload_reserve(key, plen + sizeof(*token));
 	if (ret < 0)
 		goto error;
 
 	ret = -ENOMEM;
-	upayload = kmalloc(plen, GFP_KERNEL);
-	if (!upayload)
+	token = kmalloc(sizeof(*token), GFP_KERNEL);
+	if (!token)
 		goto error;
+	token->kad = kmalloc(plen, GFP_KERNEL);
+	if (!token->kad)
+		goto error_free;
+
+	token->security_index		= RXRPC_SECURITY_RXKAD;
+	token->kad->ticket_len		= v1->ticket_length;
+	token->kad->expiry		= v1->expiry;
+	token->kad->kvno		= v1->kvno;
+	memcpy(&token->kad->session_key, &v1->session_key, 8);
+	memcpy(&token->kad->ticket, v1->ticket, v1->ticket_length);
 
 	/* attach the data */
-	memcpy(&upayload->k, tsec, sizeof(*tsec));
-	memcpy(&upayload->k.ticket, (void *)tsec + sizeof(*tsec),
-	       tsec->ticket_len);
-	key->payload.data = upayload;
-	key->expiry = tsec->expiry;
+	key->type_data.x[0]++;
+
+	pp = (struct rxrpc_key_token **)&key->payload.data;
+	while (*pp)
+		pp = &(*pp)->next;
+	*pp = token;
+	if (token->kad->expiry < key->expiry)
+		key->expiry = token->kad->expiry;
+	token = NULL;
 	ret = 0;
 
+error_free:
+	kfree(token);
 error:
 	return ret;
 }
@@ -184,7 +402,22 @@ static int rxrpc_instantiate_s(struct key *key, const void *data,
  */
 static void rxrpc_destroy(struct key *key)
 {
-	kfree(key->payload.data);
+	struct rxrpc_key_token *token;
+
+	while ((token = key->payload.data)) {
+		key->payload.data = token->next;
+		switch (token->security_index) {
+		case RXRPC_SECURITY_RXKAD:
+			kfree(token->kad);
+			break;
+		default:
+			printk(KERN_ERR "Unknown token type %x on rxrpc key\n",
+			       token->security_index);
+			BUG();
+		}
+
+		kfree(token);
+	}
 }
 
 /*
@@ -293,7 +526,7 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 
 	struct {
 		u32 kver;
-		struct rxkad_key tsec;
+		struct rxrpc_key_data_v1 v1;
 	} data;
 
 	_enter("");
@@ -308,13 +541,12 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 	_debug("key %d", key_serial(key));
 
 	data.kver = 1;
-	data.tsec.security_index = RXRPC_SECURITY_RXKAD;
-	data.tsec.ticket_len = 0;
-	data.tsec.expiry = expiry;
-	data.tsec.kvno = 0;
+	data.v1.security_index = RXRPC_SECURITY_RXKAD;
+	data.v1.ticket_length = 0;
+	data.v1.expiry = expiry;
+	data.v1.kvno = 0;
 
-	memcpy(&data.tsec.session_key, session_key,
-	       sizeof(data.tsec.session_key));
+	memcpy(&data.v1.session_key, session_key, sizeof(data.v1.session_key));
 
 	ret = key_instantiate_and_link(key, &data, sizeof(data), NULL, NULL);
 	if (ret < 0)
diff --git a/net/rxrpc/ar-security.c b/net/rxrpc/ar-security.c
index dc62920..49b3cc3 100644
--- a/net/rxrpc/ar-security.c
+++ b/net/rxrpc/ar-security.c
@@ -16,6 +16,7 @@
 #include <linux/crypto.h>
 #include <net/sock.h>
 #include <net/af_rxrpc.h>
+#include <keys/rxrpc-type.h>
 #include "ar-internal.h"
 
 static LIST_HEAD(rxrpc_security_methods);
@@ -122,6 +123,7 @@ EXPORT_SYMBOL_GPL(rxrpc_unregister_security);
  */
 int rxrpc_init_client_conn_security(struct rxrpc_connection *conn)
 {
+	struct rxrpc_key_token *token;
 	struct rxrpc_security *sec;
 	struct key *key = conn->key;
 	int ret;
@@ -135,7 +137,11 @@ int rxrpc_init_client_conn_security(struct rxrpc_connection *conn)
 	if (ret < 0)
 		return ret;
 
-	sec = rxrpc_security_lookup(key->type_data.x[0]);
+	if (!key->payload.data)
+		return -EKEYREJECTED;
+	token = key->payload.data;
+
+	sec = rxrpc_security_lookup(token->security_index);
 	if (!sec)
 		return -EKEYREJECTED;
 	conn->security = sec;
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index acec762..713ac59 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -18,6 +18,7 @@
 #include <linux/ctype.h>
 #include <net/sock.h>
 #include <net/af_rxrpc.h>
+#include <keys/rxrpc-type.h>
 #define rxrpc_debug rxkad_debug
 #include "ar-internal.h"
 
@@ -59,14 +60,14 @@ static DEFINE_MUTEX(rxkad_ci_mutex);
  */
 static int rxkad_init_connection_security(struct rxrpc_connection *conn)
 {
-	struct rxrpc_key_payload *payload;
 	struct crypto_blkcipher *ci;
+	struct rxrpc_key_token *token;
 	int ret;
 
 	_enter("{%d},{%x}", conn->debug_id, key_serial(conn->key));
 
-	payload = conn->key->payload.data;
-	conn->security_ix = payload->k.security_index;
+	token = conn->key->payload.data;
+	conn->security_ix = token->security_index;
 
 	ci = crypto_alloc_blkcipher("pcbc(fcrypt)", 0, CRYPTO_ALG_ASYNC);
 	if (IS_ERR(ci)) {
@@ -75,8 +76,8 @@ static int rxkad_init_connection_security(struct rxrpc_connection *conn)
 		goto error;
 	}
 
-	if (crypto_blkcipher_setkey(ci, payload->k.session_key,
-				    sizeof(payload->k.session_key)) < 0)
+	if (crypto_blkcipher_setkey(ci, token->kad->session_key,
+				    sizeof(token->kad->session_key)) < 0)
 		BUG();
 
 	switch (conn->security_level) {
@@ -110,7 +111,7 @@ error:
  */
 static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 {
-	struct rxrpc_key_payload *payload;
+	struct rxrpc_key_token *token;
 	struct blkcipher_desc desc;
 	struct scatterlist sg[2];
 	struct rxrpc_crypt iv;
@@ -123,8 +124,8 @@ static void rxkad_prime_packet_security(struct rxrpc_connection *conn)
 	if (!conn->key)
 		return;
 
-	payload = conn->key->payload.data;
-	memcpy(&iv, payload->k.session_key, sizeof(iv));
+	token = conn->key->payload.data;
+	memcpy(&iv, token->kad->session_key, sizeof(iv));
 
 	desc.tfm = conn->cipher;
 	desc.info = iv.x;
@@ -197,7 +198,7 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 					u32 data_size,
 					void *sechdr)
 {
-	const struct rxrpc_key_payload *payload;
+	const struct rxrpc_key_token *token;
 	struct rxkad_level2_hdr rxkhdr
 		__attribute__((aligned(8))); /* must be all on one page */
 	struct rxrpc_skb_priv *sp;
@@ -219,8 +220,8 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	rxkhdr.checksum = 0;
 
 	/* encrypt from the session key */
-	payload = call->conn->key->payload.data;
-	memcpy(&iv, payload->k.session_key, sizeof(iv));
+	token = call->conn->key->payload.data;
+	memcpy(&iv, token->kad->session_key, sizeof(iv));
 	desc.tfm = call->conn->cipher;
 	desc.info = iv.x;
 	desc.flags = 0;
@@ -400,7 +401,7 @@ static int rxkad_verify_packet_encrypt(const struct rxrpc_call *call,
 				       struct sk_buff *skb,
 				       u32 *_abort_code)
 {
-	const struct rxrpc_key_payload *payload;
+	const struct rxrpc_key_token *token;
 	struct rxkad_level2_hdr sechdr;
 	struct rxrpc_skb_priv *sp;
 	struct blkcipher_desc desc;
@@ -431,8 +432,8 @@ static int rxkad_verify_packet_encrypt(const struct rxrpc_call *call,
 	skb_to_sgvec(skb, sg, 0, skb->len);
 
 	/* decrypt from the session key */
-	payload = call->conn->key->payload.data;
-	memcpy(&iv, payload->k.session_key, sizeof(iv));
+	token = call->conn->key->payload.data;
+	memcpy(&iv, token->kad->session_key, sizeof(iv));
 	desc.tfm = call->conn->cipher;
 	desc.info = iv.x;
 	desc.flags = 0;
@@ -737,7 +738,7 @@ static int rxkad_respond_to_challenge(struct rxrpc_connection *conn,
 				      struct sk_buff *skb,
 				      u32 *_abort_code)
 {
-	const struct rxrpc_key_payload *payload;
+	const struct rxrpc_key_token *token;
 	struct rxkad_challenge challenge;
 	struct rxkad_response resp
 		__attribute__((aligned(8))); /* must be aligned for crypto */
@@ -778,7 +779,7 @@ static int rxkad_respond_to_challenge(struct rxrpc_connection *conn,
 	if (conn->security_level < min_level)
 		goto protocol_error;
 
-	payload = conn->key->payload.data;
+	token = conn->key->payload.data;
 
 	/* build the response packet */
 	memset(&resp, 0, sizeof(resp));
@@ -797,13 +798,13 @@ static int rxkad_respond_to_challenge(struct rxrpc_connection *conn,
 		(conn->channels[3] ? conn->channels[3]->call_id : 0);
 	resp.encrypted.inc_nonce = htonl(nonce + 1);
 	resp.encrypted.level = htonl(conn->security_level);
-	resp.kvno = htonl(payload->k.kvno);
-	resp.ticket_len = htonl(payload->k.ticket_len);
+	resp.kvno = htonl(token->kad->kvno);
+	resp.ticket_len = htonl(token->kad->ticket_len);
 
 	/* calculate the response checksum and then do the encryption */
 	rxkad_calc_response_checksum(&resp);
-	rxkad_encrypt_response(conn, &resp, &payload->k);
-	return rxkad_send_response(conn, &sp->hdr, &resp, &payload->k);
+	rxkad_encrypt_response(conn, &resp, token->kad);
+	return rxkad_send_response(conn, &sp->hdr, &resp, token->kad);
 
 protocol_error:
 	*_abort_code = abort_code;
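
For reference, a userspace sketch of building a payload in the layout
that rxrpc_instantiate_xdr() and rxrpc_instantiate_xdr_rxkad() parse
above.  This is an assumption-laden illustration, not part of the patch:
build_rxkad_xdr(), put_u32() and put_padded() are hypothetical names,
the SKETCH_* constants copy values from this series, and only a single
rxkad token is emitted.  Note that the token length word is written
unpadded, because the parser compares 8 * 4 + tktlen against it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htonl() */

#define SKETCH_SECURITY_RXKAD	2	/* RXRPC_SECURITY_RXKAD in patch 1/4 */
#define SKETCH_CELL_MAX		64	/* AFSTOKEN_CELL_MAX */
#define SKETCH_RK_TIX_MAX	12000	/* AFSTOKEN_RK_TIX_MAX */

/* Store one big-endian 32-bit word and advance. */
static uint8_t *put_u32(uint8_t *p, uint32_t v)
{
	uint32_t be = htonl(v);

	memcpy(p, &be, 4);
	return p + 4;
}

/* Store raw bytes followed by zero padding to a 4-byte boundary. */
static uint8_t *put_padded(uint8_t *p, const void *data, size_t len)
{
	size_t padded = (len + 3) & ~(size_t)3;

	memcpy(p, data, len);
	memset(p + len, 0, padded - len);
	return p + padded;
}

/* Build an XDR payload holding one rxkad token.
 * Returns the number of bytes written, or 0 if the arguments are out of
 * range or the buffer is too small. */
static size_t build_rxkad_xdr(uint8_t *buf, size_t buflen, const char *cell,
			      uint32_t vice_id, uint32_t kvno,
			      const uint8_t session_key[8],
			      uint32_t start, uint32_t expiry,
			      uint32_t primary,
			      const uint8_t *ticket, uint32_t tktlen)
{
	size_t cnlen = strlen(cell);
	size_t need;
	uint8_t *p = buf;

	if (cnlen < 1 || cnlen > SKETCH_CELL_MAX || tktlen > SKETCH_RK_TIX_MAX)
		return 0;

	need = 2 * 4 + ((cnlen + 3) & ~(size_t)3)	/* flags, len, cell */
	     + 1 * 4					/* token count */
	     + 2 * 4					/* toklen, sec index */
	     + 8 * 4					/* fixed rxkad fields */
	     + (((size_t)tktlen + 3) & ~(size_t)3);	/* ticket */
	if (buflen < need)
		return 0;

	p = put_u32(p, 0);			/* flags must be 0 */
	p = put_u32(p, (uint32_t)cnlen);
	p = put_padded(p, cell, cnlen);
	p = put_u32(p, 1);			/* token count */
	p = put_u32(p, 4 + 8 * 4 + tktlen);	/* token length, unpadded */
	p = put_u32(p, SKETCH_SECURITY_RXKAD);
	p = put_u32(p, vice_id);
	p = put_u32(p, kvno);
	memcpy(p, session_key, 8);
	p += 8;
	p = put_u32(p, start);
	p = put_u32(p, expiry);
	p = put_u32(p, primary);
	p = put_u32(p, tktlen);
	p = put_padded(p, ticket, tktlen);

	assert((size_t)(p - buf) == need);
	return need;
}
```

The resulting buffer would presumably be handed to add_key() with the
rxrpc key type and a description of the form "afs@<cellname>"; that
calling convention is an assumption based on the checks in
rxrpc_read(), not something this patch spells out.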


^ permalink raw reply related

* [PATCH 1/4] RxRPC: Declare the security index constants symbolically
From: David Howells @ 2009-09-14 11:17 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-afs, netdev, David Howells

Declare the security index constants symbolically rather than just referring
to them numerically.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/rxrpc.h |    7 +++++++
 net/rxrpc/ar-key.c    |    4 ++--
 net/rxrpc/rxkad.c     |    6 +++---
 3 files changed, 12 insertions(+), 5 deletions(-)


diff --git a/include/linux/rxrpc.h b/include/linux/rxrpc.h
index f7b826b..a53915c 100644
--- a/include/linux/rxrpc.h
+++ b/include/linux/rxrpc.h
@@ -58,5 +58,12 @@ struct sockaddr_rxrpc {
 #define RXRPC_SECURITY_AUTH	1	/* authenticated packets */
 #define RXRPC_SECURITY_ENCRYPT	2	/* encrypted packets */
 
+/*
+ * RxRPC security indices
+ */
+#define RXRPC_SECURITY_NONE	0	/* no security protocol */
+#define RXRPC_SECURITY_RXKAD	2	/* kaserver or kerberos 4 */
+#define RXRPC_SECURITY_RXGK	4	/* gssapi-based */
+#define RXRPC_SECURITY_RXK5	5	/* kerberos 5 */
 
 #endif /* _LINUX_RXRPC_H */
diff --git a/net/rxrpc/ar-key.c b/net/rxrpc/ar-key.c
index ad8c7a7..b3d10e7 100644
--- a/net/rxrpc/ar-key.c
+++ b/net/rxrpc/ar-key.c
@@ -122,7 +122,7 @@ static int rxrpc_instantiate(struct key *key, const void *data, size_t datalen)
 		       tsec->ticket[6], tsec->ticket[7]);
 
 	ret = -EPROTONOSUPPORT;
-	if (tsec->security_index != 2)
+	if (tsec->security_index != RXRPC_SECURITY_RXKAD)
 		goto error;
 
 	key->type_data.x[0] = tsec->security_index;
@@ -308,7 +308,7 @@ int rxrpc_get_server_data_key(struct rxrpc_connection *conn,
 	_debug("key %d", key_serial(key));
 
 	data.kver = 1;
-	data.tsec.security_index = 2;
+	data.tsec.security_index = RXRPC_SECURITY_RXKAD;
 	data.tsec.ticket_len = 0;
 	data.tsec.expiry = expiry;
 	data.tsec.kvno = 0;
diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index ef8f910..acec762 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -42,7 +42,7 @@ struct rxkad_level2_hdr {
 	__be32	checksum;	/* decrypted data checksum */
 };
 
-MODULE_DESCRIPTION("RxRPC network protocol type-2 security (Kerberos)");
+MODULE_DESCRIPTION("RxRPC network protocol type-2 security (Kerberos 4)");
 MODULE_AUTHOR("Red Hat, Inc.");
 MODULE_LICENSE("GPL");
 
@@ -506,7 +506,7 @@ static int rxkad_verify_packet(const struct rxrpc_call *call,
 	if (!call->conn->cipher)
 		return 0;
 
-	if (sp->hdr.securityIndex != 2) {
+	if (sp->hdr.securityIndex != RXRPC_SECURITY_RXKAD) {
 		*_abort_code = RXKADINCONSISTENCY;
 		_leave(" = -EPROTO [not rxkad]");
 		return -EPROTO;
@@ -1122,7 +1122,7 @@ static void rxkad_clear(struct rxrpc_connection *conn)
 static struct rxrpc_security rxkad = {
 	.owner				= THIS_MODULE,
 	.name				= "rxkad",
-	.security_index			= RXKAD_VERSION,
+	.security_index			= RXRPC_SECURITY_RXKAD,
 	.init_connection_security	= rxkad_init_connection_security,
 	.prime_packet_security		= rxkad_prime_packet_security,
 	.secure_packet			= rxkad_secure_packet,


^ permalink raw reply related

* Re: [PATCH 13/19] uwb: convert to netdev_tx_t
From: David Vrabel @ 2009-09-14 10:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20090901055129.729527950-ZtmgI6mnKB3QT0dZR+AlfA@public.gmane.org>

Acked-by: David Vrabel <david.vrabel-kQvG35nSl+M@public.gmane.org>

David
-- 
David Vrabel, Senior Software Engineer, Drivers
CSR, Churchill House, Cambridge Business Park,  Tel: +44 (0)1223 692562
Cowley Road, Cambridge, CB4 0WZ                 http://www.csr.com/



^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-14 10:10 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David Miller, netdev, herbert
In-Reply-To: <4AAE1026.4090702@voltaire.com>

On Mon, Sep 14, 2009 at 12:43:02PM +0300, Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>> That's already possible. However virtualization users are familiar
>> with configuring the tun device, and tun has grown
>> virtualization-specific extensions, so I don't see a reason not to
>> accommodate these uses
> Today packets are written/read from/to QEMU to/from the tun device; how
> would the use case with vhost look?

- Configure bridge and tun using existing scripts
- pass tun fd to vhost via an ioctl
- vhost calls tun_get_socket
- from this point, guest networking just goes faster

> Is this the user setting an uplink NIC + bridge + per VM tun device but  
> the packets will go from/to virtio-net in the guest kernel to/from vhost  
> in the host kernel and then from/to vhost to/from tun? so eventually no  
> packets will be seen by the qemu process? I don't see what this scheme
> buys people; I got very confused.
>
> Or.

A lot of people have asked for tun support in vhost, because qemu
currently uses tun.  With this scheme existing code and scripts can be
used to configure both tun and bridge.  You also can utilize
virtualization-specific features in tun.

-- 
MST

^ permalink raw reply

* Re: L2 switching in igb
From: Or Gerlitz @ 2009-09-14 10:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Kirsher, Jeffrey T, Fischer, Anna,
	netdev@vger.kernel.org, David Miller, Stephen Hemminger
In-Reply-To: <4AA937A1.9070504@intel.com>

Alexander Duyck wrote:
> You are correct, the vSwitch can basically do VEPA by disabling local 
> loopback enable bit in the DTXSWC register.  This would force all 
> traffic from the PF/VFs out the lan physical port and from the lan 
> physical port to the appropriate PF/VFs without doing any switching in 
> between PF/VFs.
To have VEPA support, another bit has to be programmed: it's the one
that prevents the PF from forwarding a packet to a VF whose source MAC
matches the one in the packet (e.g. a multicast sender).

> add an rtnl_link_ops interface to handle vSwitch configuration that 
> could then be applied to the igb netdevs that support VEPA/vSwitch 
> technologies.  A subset of that interface could then be dedicated to 
> VF configuration to handle things such as spawning VFs, setting the 
> default mac addresses, security controls, etc.
Yes, let's do that. I'd like to suggest that a "VF programmable from user
space" context will contain a <mac, vlan-id, priority-bits, rate>
tuple, such that in the absence of vlan tag, the VF driver will "sign" 
the packet (skb) with vlan-id and priority-bits assigned by the admin 
and the PF NIC will mandate that the VF originated traffic will not 
exceed the rate.

Or.



^ permalink raw reply

* Re: bisect results of MSI-X related panic (help!)
From: Tejun Heo @ 2009-09-14  9:43 UTC (permalink / raw)
  To: Frans Pop; +Cc: Jesse Brandeburg, linux-kernel, netdev, Ingo Molnar
In-Reply-To: <4AAE0F7B.5050203@kernel.org>

Tejun Heo wrote:
> Frans Pop wrote:
>> Jesse Brandeburg wrote:
>>> I've bisected, here is my bisect log, problem is that the commit
>>> identified is a merge commit, and *I don't know what to revert to test*.
>>> It appears the parent of the merge:
>>> 6e15cf04860074ad032e88c306bea656bbdd0f22 is marked good, but looks to be
>>> in a possibly related area to the panic.
>> That merge does contain quite a few merge fixups, so it's quite possible 
>> one of them is the cause of the failure.
>> Maybe the simplest way to verify that is to compile both parents of the 
>> merge to doublecheck that they work OK. Then, if a compile of the merge 
>> itself is bad, the problem really is in the merge commit itself.
>>
>> That commit is the "percpu" merge, so I've added Tejun (author of most of 
>> that branch) and Ingo (merger) in CC.
> 
> Sorry, the oops doesn't ring a bell, well, not yet at least.  It would
> be great if the bisection can be narrowed down more.

Also, building w/ debug option on, capturing more oops traces and
pasting gdb output of l *<oops address> might shed some more light.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14  9:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, netdev, herbert
In-Reply-To: <20090914091151.GE14030@redhat.com>

Michael S. Tsirkin wrote:
> That's already possible. However virtualization users are familiar with configuring the tun device, and tun has grown virtualization-specific extensions, so I don't see a reason not to accommodate these uses.
Today Qemu writes packets to and reads them from the tun device; what 
would the use case with vhost look like?

Is this the user setting up an uplink NIC + bridge + per-VM tun device, 
with the packets going between virtio-net in the guest kernel and vhost 
in the host kernel, and then between vhost and tun, so that eventually no 
packets are seen by the qemu process? I don't see what this scheme 
buys people; I'm quite confused.

Or.


^ permalink raw reply

* Re: bisect results of MSI-X related panic (help!)
From: Tejun Heo @ 2009-09-14  9:40 UTC (permalink / raw)
  To: Frans Pop; +Cc: Jesse Brandeburg, linux-kernel, netdev, Ingo Molnar
In-Reply-To: <200909120623.49764.elendil@planet.nl>

Frans Pop wrote:
> Jesse Brandeburg wrote:
>> I've bisected, here is my bisect log, problem is that the commit
>> identified is a merge commit, and *I don't know what to revert to test*.
>> It appears the parent of the merge:
>> 6e15cf04860074ad032e88c306bea656bbdd0f22 is marked good, but looks to be
>> in a possibly related area to the panic.
> 
> That merge does contain quite a few merge fixups, so it's quite possible 
> one of them is the cause of the failure.
> Maybe the simplest way to verify that is to compile both parents of the 
> merge to doublecheck that they work OK. Then, if a compile of the merge 
> itself is bad, the problem really is in the merge commit itself.
> 
> That commit is the "percpu" merge, so I've added Tejun (author of most of 
> that branch) and Ingo (merger) in CC.

Sorry, the oops doesn't ring a bell, well, not yet at least.  It would
be great if the bisection can be narrowed down more.

-- 
tejun

^ permalink raw reply

