Netdev List
* Re: InfiniBand/RDMA merge plans for 2.6.24
From: Shirley Ma @ 2007-09-14 18:36 UTC
  To: Roland Dreier; +Cc: general, linux-kernel, netdev, netdev-owner
In-Reply-To: <adazlzpjkk6.fsf@cisco.com>

> IPoIB CM handles this properly by gathering together single pages in
> skbs' fragment lists.
> 
>  - R.

Then can we reuse IPoIB CM code here?

Thanks
Shirley 


* [ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
From: Randy Dunlap @ 2007-09-14 18:37 UTC
  To: Krishna Kumar
  Cc: johnpol, jagana, herbert, gaagaan, Robert.Olsson, kumarkr,
	rdreier, peter.p.waskiewicz.jr, hadi, kaber, jeff, general,
	netdev, tgraf, mcarlson, sri, shemminger, davem, mchan
In-Reply-To: <20070914090118.17589.43799.sendpatchset@K50wks273871wss.in.ibm.com>

On Fri, 14 Sep 2007 14:31:18 +0530 Krishna Kumar wrote:

> Add Documentation describing batching skb xmit capability.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  batching_skb_xmit.txt |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+)
> 
> diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt
> --- org/Documentation/networking/batching_skb_xmit.txt	1970-01-01 05:30:00.000000000 +0530
> +++ new/Documentation/networking/batching_skb_xmit.txt	2007-09-14 10:25:36.000000000 +0530
> @@ -0,0 +1,107 @@
> +
> +Section 4: Nitty gritty details for driver writers
> +--------------------------------------------------
> +
> +	Batching is enabled from core networking stack only from softirq
> +	context (NET_TX_SOFTIRQ), and dev_queue_xmit() doesn't use batching.
> +
> +	This leads to the following situation:
> +		A skb was not sent out as either driver lock was contested or
> +		the device was blocked. When the softirq handler runs, it
> +		moves all skbs from the device queue to the batch list, but
> +		then it too could fail to send due to lock contention. The
> +		next xmit (of a single skb) called from dev_queue_xmit() will
> +		not use batching and try to xmit skb, while previous skbs are
> +		still present in the batch list. This results in the receiver
> +		getting out-of-order packets, and in case of TCP the sender
> +		would have unnecessary retransmissions.
> +
> +	To fix this problem, error cases where driver xmit gets called with a
> +	skb must code as follows:
> +		1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED
> +		   as usual. This allows qdisc to requeue the skb.
> +		2. If driver xmit got the lock but failed to send the skb, it
> +		   should return NETDEV_TX_BUSY but before that it should have
> +		   queue'd the skb to the batch list. In this case, the qdisc

                   queued

> +		   does not requeue the skb.

and then
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>

Thanks,
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
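
For readers who want to see the two rules from the quoted Section 4 in action, here is a minimal userspace sketch in C. It is not driver code: toy_dev, toy_xmit() and the TOY_TX_* values are hypothetical stand-ins for a real device, its xmit routine and the NETDEV_TX_* return codes. Draining the batch list ahead of a new packet in the success path is an assumption of this sketch, not something the quoted text spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for NETDEV_TX_OK / NETDEV_TX_BUSY / NETDEV_TX_LOCKED.
 * Names and values are illustrative only. */
enum toy_tx_status { TOY_TX_OK, TOY_TX_BUSY, TOY_TX_LOCKED };

#define TOY_BATCH_MAX 8
#define TOY_SENT_MAX  32

/* Hypothetical device state. */
struct toy_dev {
	bool tx_locked;            /* someone else holds the tx lock */
	bool ring_full;            /* hardware cannot take more work */
	int batch[TOY_BATCH_MAX];  /* pending packet ids, oldest first */
	size_t batch_len;
	int sent[TOY_SENT_MAX];    /* order in which packets hit the wire */
	size_t sent_len;
};

/* Record one packet as transmitted. */
static void toy_wire(struct toy_dev *dev, int pkt)
{
	assert(dev->sent_len < TOY_SENT_MAX);
	dev->sent[dev->sent_len++] = pkt;
}

/* Single-packet xmit path following the two rules in the quoted text. */
static enum toy_tx_status toy_xmit(struct toy_dev *dev, int pkt)
{
	size_t i;

	/* Rule 1: lock not available, so the caller (qdisc) requeues. */
	if (dev->tx_locked)
		return TOY_TX_LOCKED;

	/* Rule 2: lock held but the packet cannot be sent.  Park it
	 * behind the packets already batched and tell the caller not
	 * to requeue it. */
	if (dev->ring_full) {
		assert(dev->batch_len < TOY_BATCH_MAX);
		dev->batch[dev->batch_len++] = pkt;
		return TOY_TX_BUSY;
	}

	/* Success path (assumption): older batched packets go first. */
	for (i = 0; i < dev->batch_len; i++)
		toy_wire(dev, dev->batch[i]);
	dev->batch_len = 0;
	toy_wire(dev, pkt);
	return TOY_TX_OK;
}
```

With the lock busy the packet is handed back for requeueing. With the lock held but the ring full the packet joins the batch list, so a later successful xmit still puts packets on the wire in their original order.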


* Re: e1000 driver and samba
From: L F @ 2007-09-14 18:40 UTC
  To: Kok, Auke; +Cc: netdev
In-Reply-To: <46EAC25B.2060404@intel.com>

On 9/14/07, Kok, Auke <auke-jan.h.kok@intel.com> wrote:
> this slowness might have been masking the issue
That is possible. However, it worked for upwards of twelve months
without an error.

> I have not yet seen other reports of this issue, and it would be interesting to
> see if the stack or driver is seeing errors. Please post `ethtool -S eth0` after
> the samba connection resets or fails.
If you look for it on the Realtek cards, there had been sporadic
issues up to late 2005. The solution posted universally was 'change
card'.

I include the content of ethtool -S as requested:
beehive:~# ethtool -S eth4
NIC statistics:
     rx_packets: 43538709
     tx_packets: 68726231
     rx_bytes: 34124849453
     tx_bytes: 74817483835
     rx_broadcast: 20891
     tx_broadcast: 8941
     rx_multicast: 459
     tx_multicast: 0
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 459
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 486
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 488
     rx_flow_control_xoff: 488
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 34124849453
     rx_csum_offload_good: 43449333
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0

I am no expert, but I do not see anything that obviously points to an
issue there.
Now, something I did not mention before, though it was clearly evident
from context, is that the errors ONLY occur on samba WRITE. I can read
hundreds of GBs of data without error.
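
For anyone who wants to repeat this check on a number of machines, a small C sketch that scans `ethtool -S` style text for non-zero error counters might look like the following. The list of name fragments treated as "error-like" is purely an assumption of this sketch, not something defined by e1000 or ethtool, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Heuristic (an assumption of this sketch): a counter is treated as
 * an error counter if its name contains one of these fragments. */
static int name_looks_like_error(const char *name)
{
	static const char *const hints[] = {
		"error", "dropped", "failed", "missed", "collisions",
		"timeout", "no_buffer",
	};
	size_t i;

	for (i = 0; i < sizeof(hints) / sizeof(hints[0]); i++)
		if (strstr(name, hints[i]))
			return 1;
	return 0;
}

/* Count lines of the form "  name: value" whose name looks like an
 * error counter and whose value is non-zero.  Lines that do not
 * parse, such as the "NIC statistics:" header, are skipped. */
static int count_nonzero_error_counters(const char *text)
{
	int hits = 0;

	while (*text) {
		char line[128], name[64];
		unsigned long long value;
		size_t len = strcspn(text, "\n");
		size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;

		/* Work on one line at a time so a blank line cannot
		 * make sscanf() read into the following line. */
		memcpy(line, text, n);
		line[n] = '\0';

		if (sscanf(line, " %63[^: ]: %llu", name, &value) == 2 &&
		    name_looks_like_error(name) && value != 0)
			hits++;

		text += len;
		if (*text == '\n')
			text++;
	}
	return hits;
}
```

Counters such as tx_deferred_ok and rx_flow_control_xon/xoff are non-zero in the output above but are not matched by this heuristic. Whether they matter is a judgment the sketch does not attempt.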

> Just as a precaution, try a different ethernet cable. Even the switch in between
> the target and you might have issues.
I will try that and report back. I would not suspect the switch
because transferring between other machines - WinXP machines -
operates correctly, as far as I can tell.

> I know our lab folks do plenty of samba testing but I will see if they can run a
> stress test against a smb target in the way that you describe.
Thank you, I would appreciate that. My concern is more generalised
than this single machine: I will have to check a significant number of
other production machines to see if such errors are common.

Rgds,
Luigi Fabio


* Re: e1000 driver and samba
From: L F @ 2007-09-14 18:41 UTC
  To: Francois Romieu; +Cc: netdev
In-Reply-To: <20070914182645.GA22975@electric-eye.fr.zoreil.com>

On 9/14/07, Francois Romieu <romieu@fr.zoreil.com> wrote:
> For the 8169 or the 8110, try 2.6.23-rc6 +
>
> http://www.fr.zoreil.com/people/francois/misc/20070903-2.6.23-rc5-r8169-test.patch
Thank you, I will give that a whirl also, because there are some
machine builds which will not have Intel boards in them and they need
to work, no questions asked. I will report back.

Rgds,
LF


* [PATCH 4/8] SCTP: Implement SCTP-AUTH parameter processing
From: Vlad Yasevich @ 2007-09-14 18:44 UTC
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

Implement processing for the CHUNKS, RANDOM, and HMAC parameters and
deal with how these parameters are affected by association restarts.
In particular, during unexpected INIT processing, we need to reply with
parameters from the original INIT chunk.  Also, after restart, we need
to update the old association with new peer parameters and change the
association shared keys.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
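The restart handling this patch adds to sctp_assoc_update() comes down to moving ownership of three heap buffers from the new association to the old one. A minimal userspace sketch of that idiom follows. It assumes hypothetical toy_* names and uses malloc()/free() in place of the kernel allocators, so it illustrates the pattern and is not the kernel code itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the peer part of an association. */
struct toy_peer {
	unsigned char *peer_random;
	unsigned char *peer_chunks;
	unsigned char *peer_hmacs;
};

/* Move one heap buffer from *src to *dst: release what dst held,
 * take over src's pointer, and clear src so that freeing the source
 * structure later cannot release the same buffer a second time. */
static void toy_take(unsigned char **dst, unsigned char **src)
{
	free(*dst);
	*dst = *src;
	*src = NULL;
}

/* Update the surviving association with the restarted peer's data. */
static void toy_peer_update(struct toy_peer *old, struct toy_peer *fresh)
{
	toy_take(&old->peer_random, &fresh->peer_random);
	toy_take(&old->peer_chunks, &fresh->peer_chunks);
	toy_take(&old->peer_hmacs, &fresh->peer_hmacs);
}

/* Release everything a peer structure still owns. */
static void toy_peer_free(struct toy_peer *p)
{
	free(p->peer_random);
	free(p->peer_chunks);
	free(p->peer_hmacs);
	p->peer_random = p->peer_chunks = p->peer_hmacs = NULL;
}

/* Helper for the example: heap copy of a string, or NULL on failure. */
static unsigned char *toy_dup(const char *s)
{
	size_t n = strlen(s) + 1;
	unsigned char *p = malloc(n);

	if (p)
		memcpy(p, s, n);
	return p;
}
```

Clearing the source pointer is what makes it safe to free the temporary association afterwards. free(NULL), like kfree(NULL), is a no-op.
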
 include/net/sctp/command.h |    1 +
 net/sctp/associola.c       |   21 ++++++-
 net/sctp/sm_make_chunk.c   |  162 +++++++++++++++++++++++++++++++++++++++++++-
 net/sctp/sm_sideeffect.c   |    5 ++
 net/sctp/sm_statefuns.c    |   35 ++++++++++
 5 files changed, 220 insertions(+), 4 deletions(-)

diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
index f56c8d6..b873336 100644
--- a/include/net/sctp/command.h
+++ b/include/net/sctp/command.h
@@ -102,6 +102,7 @@ typedef enum {
 	SCTP_CMD_SET_SK_ERR,	 /* Set sk_err */
 	SCTP_CMD_ASSOC_CHANGE,	 /* generate and send assoc_change event */
 	SCTP_CMD_ADAPTATION_IND, /* generate and send adaptation event */
+	SCTP_CMD_ASSOC_SHKEY,    /* generate the association shared keys */
 	SCTP_CMD_LAST
 } sctp_verb_t;
 
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index b96c132..09e592b 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -416,6 +416,9 @@ void sctp_association_free(struct sctp_association *asoc)
 
 	/* Free peer's cached cookie. */
 	kfree(asoc->peer.cookie);
+	kfree(asoc->peer.peer_random);
+	kfree(asoc->peer.peer_chunks);
+	kfree(asoc->peer.peer_hmacs);
 
 	/* Release the transport structures. */
 	list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
@@ -1149,7 +1152,23 @@ void sctp_assoc_update(struct sctp_association *asoc,
 		}
 	}
 
-	/* SCTP-AUTH: XXX something needs to be done here*/
+	/* SCTP-AUTH: Save the peer parameters from the new association
+	 * and also move the association shared keys over
+	 */
+	kfree(asoc->peer.peer_random);
+	asoc->peer.peer_random = new->peer.peer_random;
+	new->peer.peer_random = NULL;
+
+	kfree(asoc->peer.peer_chunks);
+	asoc->peer.peer_chunks = new->peer.peer_chunks;
+	new->peer.peer_chunks = NULL;
+
+	kfree(asoc->peer.peer_hmacs);
+	asoc->peer.peer_hmacs = new->peer.peer_hmacs;
+	new->peer.peer_hmacs = NULL;
+
+	sctp_auth_key_put(asoc->asoc_shared_key);
+	sctp_auth_asoc_init_active_key(asoc, GFP_ATOMIC);
 }
 
 /* Update the retran path for sending a retransmitted packet.
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index db70448..cd4eb21 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -182,6 +182,8 @@ struct sctp_chunk *sctp_make_init(const struct sctp_association *asoc,
 	sctp_supported_ext_param_t ext_param;
 	int num_ext = 0;
 	__u8 extensions[3];
+	sctp_paramhdr_t *auth_chunks = NULL,
+			*auth_hmacs = NULL;
 
 	/* RFC 2960 3.3.2 Initiation (INIT) (1)
 	 *
@@ -214,8 +216,6 @@ struct sctp_chunk *sctp_make_init(const struct sctp_association *asoc,
 	 *  An implementation supporting this extension [ADDIP] MUST list
 	 *  the ASCONF,the ASCONF-ACK, and the AUTH  chunks in its INIT and
 	 *  INIT-ACK parameters.
-	 *  XXX: We don't support AUTH just yet, so don't list it.  AUTH
-	 *  support should add it.
 	 */
 	if (sctp_addip_enable) {
 		extensions[num_ext] = SCTP_CID_ASCONF;
@@ -226,6 +226,29 @@ struct sctp_chunk *sctp_make_init(const struct sctp_association *asoc,
 	chunksize += sizeof(aiparam);
 	chunksize += vparam_len;
 
+	/* Account for AUTH related parameters */
+	if (sctp_auth_enable) {
+		/* Add random parameter length*/
+		chunksize += sizeof(asoc->c.auth_random);
+
+		/* Add HMACS parameter length if any were defined */
+		auth_hmacs = (sctp_paramhdr_t *)asoc->c.auth_hmacs;
+		if (auth_hmacs->length)
+			chunksize += ntohs(auth_hmacs->length);
+		else
+			auth_hmacs = NULL;
+
+		/* Add CHUNKS parameter length */
+		auth_chunks = (sctp_paramhdr_t *)asoc->c.auth_chunks;
+		if (auth_chunks->length)
+			chunksize += ntohs(auth_chunks->length);
+		else
+			auth_chunks = NULL;
+
+		extensions[num_ext] = SCTP_CID_AUTH;
+		num_ext += 1;
+	}
+
 	/* If we have any extensions to report, account for that */
 	if (num_ext)
 		chunksize += sizeof(sctp_supported_ext_param_t) + num_ext;
@@ -285,6 +308,17 @@ struct sctp_chunk *sctp_make_init(const struct sctp_association *asoc,
 	aiparam.adaptation_ind = htonl(sp->adaptation_ind);
 	sctp_addto_chunk(retval, sizeof(aiparam), &aiparam);
 
+	/* Add SCTP-AUTH chunks to the parameter list */
+	if (sctp_auth_enable) {
+		sctp_addto_chunk(retval, sizeof(asoc->c.auth_random),
+				 asoc->c.auth_random);
+		if (auth_hmacs)
+			sctp_addto_chunk(retval, ntohs(auth_hmacs->length),
+					auth_hmacs);
+		if (auth_chunks)
+			sctp_addto_chunk(retval, ntohs(auth_chunks->length),
+					auth_chunks);
+	}
 nodata:
 	kfree(addrs.v);
 	return retval;
@@ -305,6 +339,9 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
 	sctp_supported_ext_param_t ext_param;
 	int num_ext = 0;
 	__u8 extensions[3];
+	sctp_paramhdr_t *auth_chunks = NULL,
+			*auth_hmacs = NULL,
+			*auth_random = NULL;
 
 	retval = NULL;
 
@@ -350,6 +387,26 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
 	chunksize += sizeof(ext_param) + num_ext;
 	chunksize += sizeof(aiparam);
 
+	if (asoc->peer.auth_capable) {
+		auth_random = (sctp_paramhdr_t *)asoc->c.auth_random;
+		chunksize += ntohs(auth_random->length);
+
+		auth_hmacs = (sctp_paramhdr_t *)asoc->c.auth_hmacs;
+		if (auth_hmacs->length)
+			chunksize += ntohs(auth_hmacs->length);
+		else
+			auth_hmacs = NULL;
+
+		auth_chunks = (sctp_paramhdr_t *)asoc->c.auth_chunks;
+		if (auth_chunks->length)
+			chunksize += ntohs(auth_chunks->length);
+		else
+			auth_chunks = NULL;
+
+		extensions[num_ext] = SCTP_CID_AUTH;
+		num_ext += 1;
+	}
+
 	/* Now allocate and fill out the chunk.  */
 	retval = sctp_make_chunk(asoc, SCTP_CID_INIT_ACK, 0, chunksize);
 	if (!retval)
@@ -381,6 +438,17 @@ struct sctp_chunk *sctp_make_init_ack(const struct sctp_association *asoc,
 	aiparam.adaptation_ind = htonl(sctp_sk(asoc->base.sk)->adaptation_ind);
 	sctp_addto_chunk(retval, sizeof(aiparam), &aiparam);
 
+	if (asoc->peer.auth_capable) {
+		sctp_addto_chunk(retval, ntohs(auth_random->length),
+				 auth_random);
+		if (auth_hmacs)
+			sctp_addto_chunk(retval, ntohs(auth_hmacs->length),
+					auth_hmacs);
+		if (auth_chunks)
+			sctp_addto_chunk(retval, ntohs(auth_chunks->length),
+					auth_chunks);
+	}
+
 	/* We need to remove the const qualifier at this point.  */
 	retval->asoc = (struct sctp_association *) asoc;
 
@@ -1736,6 +1804,12 @@ static void sctp_process_ext_param(struct sctp_association *asoc,
 				!asoc->peer.prsctp_capable)
 				    asoc->peer.prsctp_capable = 1;
 			    break;
+		    case SCTP_CID_AUTH:
+			    /* if the peer reports AUTH, assume that he
+			     * supports AUTH.
+			     */
+			    asoc->peer.auth_capable = 1;
+			    break;
 		    case SCTP_CID_ASCONF:
 		    case SCTP_CID_ASCONF_ACK:
 			    /* don't need to do anything for ASCONF */
@@ -1871,7 +1945,42 @@ static int sctp_verify_param(const struct sctp_association *asoc,
 	case SCTP_PARAM_FWD_TSN_SUPPORT:
 		if (sctp_prsctp_enable)
 			break;
+		goto fallthrough;
+
+	case SCTP_PARAM_RANDOM:
+		if (!sctp_auth_enable)
+			goto fallthrough;
+
+		/* SCTP-AUTH: Section 6.1
+		 * If the random number is not 32 byte long the association
+		 * MUST be aborted.  The ABORT chunk SHOULD contain the error
+		 * cause 'Protocol Violation'.
+		 */
+		if (SCTP_AUTH_RANDOM_LENGTH !=
+			ntohs(param.p->length) - sizeof(sctp_paramhdr_t))
+			return sctp_process_inv_paramlength(asoc, param.p,
+							chunk, err_chunk);
+		break;
+
+	case SCTP_PARAM_CHUNKS:
+		if (!sctp_auth_enable)
+			goto fallthrough;
+
+		/* SCTP-AUTH: Section 3.2
+		 * The CHUNKS parameter MUST be included once in the INIT or
+		 *  INIT-ACK chunk if the sender wants to receive authenticated
+		 *  chunks.  Its maximum length is 260 bytes.
+		 */
+		if (260 < ntohs(param.p->length))
+			return sctp_process_inv_paramlength(asoc, param.p,
+							chunk, err_chunk);
+		break;
+
+	case SCTP_PARAM_HMAC_ALGO:
+		if (!sctp_auth_enable)
+			break;
 		/* Fall Through */
+fallthrough:
 	default:
 		SCTP_DEBUG_PRINTK("Unrecognized param: %d for chunk %d.\n",
 				ntohs(param.p->type), cid);
@@ -1976,13 +2085,19 @@ int sctp_process_init(struct sctp_association *asoc, sctp_cid_t cid,
 	}
 
 	/* Process the initialization parameters.  */
-
 	sctp_walk_params(param, peer_init, init_hdr.params) {
 
 		if (!sctp_process_param(asoc, param, peer_addr, gfp))
 			goto clean_up;
 	}
 
+	/* AUTH: After processing the parameters, make sure that we
+	 * have all the required info to potentially do authentications.
+	 */
+	if (asoc->peer.auth_capable && (!asoc->peer.peer_random ||
+					!asoc->peer.peer_hmacs))
+		asoc->peer.auth_capable = 0;
+
 	/* Walk list of transports, removing transports in the UNKNOWN state. */
 	list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
 		transport = list_entry(pos, struct sctp_transport, transports);
@@ -2222,6 +2337,47 @@ static int sctp_process_param(struct sctp_association *asoc,
 			break;
 		}
 		/* Fall Through */
+		goto fall_through;
+
+	case SCTP_PARAM_RANDOM:
+		if (!sctp_auth_enable)
+			goto fall_through;
+
+		/* Save peer's random parameter */
+		asoc->peer.peer_random = kmemdup(param.p,
+					    ntohs(param.p->length), gfp);
+		if (!asoc->peer.peer_random) {
+			retval = 0;
+			break;
+		}
+		break;
+
+	case SCTP_PARAM_HMAC_ALGO:
+		if (!sctp_auth_enable)
+			goto fall_through;
+
+		/* Save peer's HMAC list */
+		asoc->peer.peer_hmacs = kmemdup(param.p,
+					    ntohs(param.p->length), gfp);
+		if (!asoc->peer.peer_hmacs) {
+			retval = 0;
+			break;
+		}
+
+		/* Set the default HMAC the peer requested*/
+		sctp_auth_asoc_set_default_hmac(asoc, param.hmac_algo);
+		break;
+
+	case SCTP_PARAM_CHUNKS:
+		if (!sctp_auth_enable)
+			goto fall_through;
+
+		asoc->peer.peer_chunks = kmemdup(param.p,
+					    ntohs(param.p->length), gfp);
+		if (!asoc->peer.peer_chunks)
+			retval = 0;
+		break;
+fall_through:
 	default:
 		/* Any unrecognized parameters should have been caught
 		 * and handled by sctp_verify_param() which should be
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 8d78900..bbdc938 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1524,6 +1524,11 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
 			sctp_cmd_adaptation_ind(commands, asoc);
 			break;
 
+		case SCTP_CMD_ASSOC_SHKEY:
+			error = sctp_auth_asoc_init_active_key(asoc,
+						GFP_ATOMIC);
+			break;
+
 		default:
 			printk(KERN_WARNING "Impossible command: %u, %p\n",
 			       cmd->verb, cmd->obj.ptr);
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index faa1381..7d8e92f 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -524,6 +524,11 @@ sctp_disposition_t sctp_sf_do_5_1C_ack(const struct sctp_endpoint *ep,
 	sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
 			SCTP_STATE(SCTP_STATE_COOKIE_ECHOED));
 
+	/* SCTP-AUTH: generate the association shared keys so that
+	 * we can potentially sign the COOKIE-ECHO.
+	 */
+	sctp_add_cmd_sf(commands, SCTP_CMD_ASSOC_SHKEY, SCTP_NULL());
+
 	/* 5.1 C) "A" shall then send the State Cookie received in the
 	 * INIT ACK chunk in a COOKIE ECHO chunk, ...
 	 */
@@ -661,6 +666,14 @@ sctp_disposition_t sctp_sf_do_5_1D_ce(const struct sctp_endpoint *ep,
 			       peer_init, GFP_ATOMIC))
 		goto nomem_init;
 
+	/* SCTP-AUTH:  Now that we've populated required fields in
+	 * sctp_process_init, set up the association shared keys as
+	 * necessary so that we can potentially authenticate the ACK
+	 */
+	error = sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC);
+	if (error)
+		goto nomem_init;
+
 	repl = sctp_make_cookie_ack(new_asoc, chunk);
 	if (!repl)
 		goto nomem_init;
@@ -1222,6 +1235,26 @@ static void sctp_tietags_populate(struct sctp_association *new_asoc,
 	new_asoc->c.initial_tsn         = asoc->c.initial_tsn;
 }
 
+static void sctp_auth_params_populate(struct sctp_association *new_asoc,
+				    const struct sctp_association *asoc)
+{
+	/* Only perform this if AUTH extension is enabled */
+	if (!sctp_auth_enable)
+		return;
+
+	/* We need to provide the same parameter information as
+	 * was in the original INIT.  This means that we need to copy
+	 * the HMACS, CHUNKS, and RANDOM parameter from the original
+	 * association.
+	 */
+	memcpy(new_asoc->c.auth_random, asoc->c.auth_random,
+		sizeof(asoc->c.auth_random));
+	memcpy(new_asoc->c.auth_hmacs, asoc->c.auth_hmacs,
+		sizeof(asoc->c.auth_hmacs));
+	memcpy(new_asoc->c.auth_chunks, asoc->c.auth_chunks,
+		sizeof(asoc->c.auth_chunks));
+}
+
 /*
  * Compare vtag/tietag values to determine unexpected COOKIE-ECHO
  * handling action.
@@ -1379,6 +1412,8 @@ static sctp_disposition_t sctp_sf_do_unexpected_init(
 
 	sctp_tietags_populate(new_asoc, asoc);
 
+	sctp_auth_params_populate(new_asoc, asoc);
+
 	/* B) "Z" shall respond immediately with an INIT ACK chunk.  */
 
 	/* If there are errors need to be reported for unknown parameters,
-- 
1.5.2.4
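
The two length checks added to sctp_verify_param() above can be tried in isolation with a small userspace sketch. It reads the 16-bit length straight from the parameter header bytes instead of using the kernel types. The toy_* names are hypothetical, and the explicit guard against a length shorter than the header is an addition of this sketch. The values 32 and 260 are the ones quoted in the patch comments.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PARAMHDR_LEN   4    /* 2 bytes type + 2 bytes length */
#define TOY_RANDOM_LEN     32   /* RANDOM payload size, per the patch comment */
#define TOY_CHUNKS_MAX_LEN 260  /* CHUNKS parameter maximum, per the patch comment */

/* Read the big-endian length field of a parameter header.  The
 * length covers the header itself plus the payload. */
static unsigned int toy_param_len(const uint8_t *param)
{
	return ((unsigned int)param[2] << 8) | param[3];
}

/* RANDOM: the payload must be exactly TOY_RANDOM_LEN bytes long. */
static int toy_random_len_ok(const uint8_t *param)
{
	unsigned int len = toy_param_len(param);

	if (len < TOY_PARAMHDR_LEN)	/* guard added by this sketch */
		return 0;
	return len - TOY_PARAMHDR_LEN == TOY_RANDOM_LEN;
}

/* CHUNKS: the whole parameter must not exceed TOY_CHUNKS_MAX_LEN. */
static int toy_chunks_len_ok(const uint8_t *param)
{
	unsigned int len = toy_param_len(param);

	return len >= TOY_PARAMHDR_LEN && len <= TOY_CHUNKS_MAX_LEN;
}
```

A RANDOM header carrying length 36 (4 bytes of header plus 32 of payload) passes. Anything else would take the sctp_process_inv_paramlength() path in the patch.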



* [PATCH 7/8] SCTP: API updates to support SCTP-AUTH extensions.
From: Vlad Yasevich @ 2007-09-14 18:44 UTC
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

Add SCTP-AUTH API.  The API implemented here was
agreed to between implementors at the 9th SCTP Interop.
It will be documented in the next revision of the
SCTP socket API spec.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
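Several of the new option structures end in a flexible array, so a caller has to size the option buffer by hand. A minimal sketch of how a userspace program might lay out the SCTP_AUTH_KEY argument is shown here. The struct is a local mirror of struct sctp_authkey from this patch, toy_assoc_t is an assumed stand-in for sctp_assoc_t, and toy_authkey_build() is a hypothetical helper that is not part of the API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed stand-in for sctp_assoc_t. */
typedef int32_t toy_assoc_t;

/* Local mirror of the struct sctp_authkey layout added by the patch. */
struct toy_authkey {
	toy_assoc_t sca_assoc_id;
	uint16_t sca_keynumber;
	uint16_t sca_keylen;
	uint8_t sca_key[];	/* sca_keylen bytes of key material */
};

/* Allocate and fill an option buffer.  On success *optlen receives
 * the size a caller would pass as the option length.  Returns NULL
 * if the allocation fails; the caller frees the result. */
static struct toy_authkey *toy_authkey_build(toy_assoc_t id,
					     uint16_t keynumber,
					     const void *key,
					     uint16_t keylen,
					     size_t *optlen)
{
	size_t total = sizeof(struct toy_authkey) + keylen;
	struct toy_authkey *ak = calloc(1, total);

	if (!ak)
		return NULL;

	ak->sca_assoc_id = id;
	ak->sca_keynumber = keynumber;
	ak->sca_keylen = keylen;
	if (keylen)
		memcpy(ak->sca_key, key, keylen);
	*optlen = total;
	return ak;
}
```

The buffer and the computed length would then be handed to setsockopt() with the SCTP_AUTH_KEY option name once headers carrying this API are installed.
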
 include/net/sctp/auth.h     |   16 +++
 include/net/sctp/ulpevent.h |    4 +
 include/net/sctp/user.h     |   90 +++++++++++++
 net/sctp/auth.c             |  193 +++++++++++++++++++++++++++
 net/sctp/sm_statefuns.c     |   13 ++
 net/sctp/socket.c           |  304 +++++++++++++++++++++++++++++++++++++++++++
 net/sctp/ulpevent.c         |   37 +++++
 7 files changed, 657 insertions(+), 0 deletions(-)

diff --git a/include/net/sctp/auth.h b/include/net/sctp/auth.h
index 10c8010..4945954 100644
--- a/include/net/sctp/auth.h
+++ b/include/net/sctp/auth.h
@@ -43,6 +43,7 @@
 struct sctp_endpoint;
 struct sctp_association;
 struct sctp_authkey;
+struct sctp_hmacalgo;
 
 /*
  * Define a generic struct that will hold all the info
@@ -109,4 +110,19 @@ int sctp_auth_recv_cid(sctp_cid_t chunk, const struct sctp_association *asoc);
 void sctp_auth_calculate_hmac(const struct sctp_association *asoc,
 			    struct sk_buff *skb,
 			    struct sctp_auth_chunk *auth, gfp_t gfp);
+
+/* API Helpers */
+int sctp_auth_ep_add_chunkid(struct sctp_endpoint *ep, __u8 chunk_id);
+int sctp_auth_ep_set_hmacs(struct sctp_endpoint *ep,
+			    struct sctp_hmacalgo *hmacs);
+int sctp_auth_set_key(struct sctp_endpoint *ep,
+		      struct sctp_association *asoc,
+		      struct sctp_authkey *auth_key);
+int sctp_auth_set_active_key(struct sctp_endpoint *ep,
+		      struct sctp_association *asoc,
+		      __u16 key_id);
+int sctp_auth_del_key_id(struct sctp_endpoint *ep,
+		      struct sctp_association *asoc,
+		      __u16 key_id);
+
 #endif
diff --git a/include/net/sctp/ulpevent.h b/include/net/sctp/ulpevent.h
index de88ed5..922a151 100644
--- a/include/net/sctp/ulpevent.h
+++ b/include/net/sctp/ulpevent.h
@@ -128,6 +128,10 @@ struct sctp_ulpevent *sctp_ulpevent_make_rcvmsg(struct sctp_association *asoc,
 	struct sctp_chunk *chunk,
 	gfp_t gfp);
 
+struct sctp_ulpevent *sctp_ulpevent_make_authkey(
+	const struct sctp_association *asoc, __u16 key_id,
+	__u32 indication, gfp_t gfp);
+
 void sctp_ulpevent_read_sndrcvinfo(const struct sctp_ulpevent *event,
 	struct msghdr *);
 __u16 sctp_ulpevent_get_notification_type(const struct sctp_ulpevent *event);
diff --git a/include/net/sctp/user.h b/include/net/sctp/user.h
index 6d2b577..00848b6 100644
--- a/include/net/sctp/user.h
+++ b/include/net/sctp/user.h
@@ -103,6 +103,21 @@ enum sctp_optname {
 #define SCTP_PARTIAL_DELIVERY_POINT SCTP_PARTIAL_DELIVERY_POINT
 	SCTP_MAX_BURST,		/* Set/Get max burst */
 #define SCTP_MAX_BURST SCTP_MAX_BURST
+	SCTP_AUTH_CHUNK,	/* Set only: add a chunk type to authenticate */
+#define SCTP_AUTH_CHUNK SCTP_AUTH_CHUNK
+	SCTP_HMAC_IDENT,
+#define SCTP_HMAC_IDENT SCTP_HMAC_IDENT
+	SCTP_AUTH_KEY,
+#define SCTP_AUTH_KEY SCTP_AUTH_KEY
+	SCTP_AUTH_ACTIVE_KEY,
+#define SCTP_AUTH_ACTIVE_KEY SCTP_AUTH_ACTIVE_KEY
+	SCTP_AUTH_DELETE_KEY,
+#define SCTP_AUTH_DELETE_KEY SCTP_AUTH_DELETE_KEY
+	SCTP_PEER_AUTH_CHUNKS,		/* Read only */
+#define SCTP_PEER_AUTH_CHUNKS SCTP_PEER_AUTH_CHUNKS
+	SCTP_LOCAL_AUTH_CHUNKS,		/* Read only */
+#define SCTP_LOCAL_AUTH_CHUNKS SCTP_LOCAL_AUTH_CHUNKS
+
 
 	/* Internal Socket Options. Some of the sctp library functions are 
 	 * implemented using these socket options.
@@ -370,6 +385,19 @@ struct sctp_pdapi_event {
 
 enum { SCTP_PARTIAL_DELIVERY_ABORTED=0, };
 
+struct sctp_authkey_event {
+	__u16 auth_type;
+	__u16 auth_flags;
+	__u32 auth_length;
+	__u16 auth_keynumber;
+	__u16 auth_altkeynumber;
+	__u32 auth_indication;
+	sctp_assoc_t auth_assoc_id;
+};
+
+enum { SCTP_AUTH_NEWKEY = 0, };
+
+
 /*
  * Described in Section 7.3
  *   Ancillary Data and Notification Interest Options
@@ -405,6 +433,7 @@ union sctp_notification {
 	struct sctp_shutdown_event sn_shutdown_event;
 	struct sctp_adaptation_event sn_adaptation_event;
 	struct sctp_pdapi_event sn_pdapi_event;
+	struct sctp_authkey_event sn_authkey_event;
 };
 
 /* Section 5.3.1
@@ -421,6 +450,7 @@ enum sctp_sn_type {
 	SCTP_SHUTDOWN_EVENT,
 	SCTP_PARTIAL_DELIVERY_EVENT,
 	SCTP_ADAPTATION_INDICATION,
+	SCTP_AUTHENTICATION_EVENT,
 };
 
 /* Notification error codes used to fill up the error fields in some
@@ -539,6 +569,54 @@ struct sctp_paddrparams {
 	__u32			spp_flags;
 } __attribute__((packed, aligned(4)));
 
+/*
+ * 7.1.18.  Add a chunk that must be authenticated (SCTP_AUTH_CHUNK)
+ *
+ * This set option adds a chunk type that the user is requesting to be
+ * received only in an authenticated way.  Changes to the list of chunks
+ * will only affect future associations on the socket.
+ */
+struct sctp_authchunk {
+	__u8		sauth_chunk;
+};
+
+/*
+ * 7.1.19.  Get or set the list of supported HMAC Identifiers (SCTP_HMAC_IDENT)
+ *
+ * This option gets or sets the list of HMAC algorithms that the local
+ * endpoint requires the peer to use.
+ */
+struct sctp_hmacalgo {
+	__u16		shmac_num_idents;
+	__u16		shmac_idents[];
+};
+
+/*
+ * 7.1.20.  Set a shared key (SCTP_AUTH_KEY)
+ *
+ * This option will set a shared secret key which is used to build an
+ * association shared key.
+ */
+struct sctp_authkey {
+	sctp_assoc_t	sca_assoc_id;
+	__u16		sca_keynumber;
+	__u16		sca_keylen;
+	__u8		sca_key[];
+};
+
+/*
+ * 7.1.21.  Get or set the active shared key (SCTP_AUTH_ACTIVE_KEY)
+ *
+ * This option will get or set the active shared key to be used to build
+ * the association shared key.
+ */
+
+struct sctp_authkeyid {
+	sctp_assoc_t	scact_assoc_id;
+	__u16		scact_keynumber;
+};
+
+
 /* 7.1.23. Delayed Ack Timer (SCTP_DELAYED_ACK_TIME)
  *
  *   This options will get or set the delayed ack timer.  The time is set
@@ -608,6 +686,18 @@ struct sctp_status {
 };
 
 /*
+ * 7.2.3.  Get the list of chunks the peer requires to be authenticated
+ *         (SCTP_PEER_AUTH_CHUNKS)
+ *
+ * This option gets a list of chunks for a specified association that
+ * the peer requires to be received authenticated only.
+ */
+struct sctp_authchunks {
+	sctp_assoc_t            gauth_assoc_id;
+	uint8_t                 gauth_chunks[];
+};
+
+/*
  * 8.3, 8.5 get all peer/local addresses in an association.
  * This parameter struct is used by SCTP_GET_PEER_ADDRS and 
  * SCTP_GET_LOCAL_ADDRS socket options used internally to implement
diff --git a/net/sctp/auth.c b/net/sctp/auth.c
index 1fee43e..c2b3999 100644
--- a/net/sctp/auth.c
+++ b/net/sctp/auth.c
@@ -742,3 +742,196 @@ free:
 	if (free_key)
 		sctp_auth_key_put(asoc_key);
 }
+
+/* API Helpers */
+
+/* Add a chunk to the endpoint authenticated chunk list */
+int sctp_auth_ep_add_chunkid(struct sctp_endpoint *ep, __u8 chunk_id)
+{
+	struct sctp_chunks_param *p = ep->auth_chunk_list;
+	__u16 nchunks;
+	__u16 param_len;
+
+	/* If this chunk is already specified, we are done */
+	if (__sctp_auth_cid(chunk_id, p))
+		return 0;
+
+	/* Check if we can add this chunk to the array */
+	param_len = ntohs(p->param_hdr.length);
+	nchunks = param_len - sizeof(sctp_paramhdr_t);
+	if (nchunks == SCTP_NUM_CHUNK_TYPES)
+		return -EINVAL;
+
+	p->chunks[nchunks] = chunk_id;
+	p->param_hdr.length = htons(param_len + 1);
+	return 0;
+}
+
+/* Add hmac identifiers to the endpoint list of supported hmac ids */
+int sctp_auth_ep_set_hmacs(struct sctp_endpoint *ep,
+			   struct sctp_hmacalgo *hmacs)
+{
+	int has_sha1 = 0;
+	__u16 id;
+	int i;
+
+	/* Scan the list looking for unsupported id.  Also make sure that
+	 * SHA1 is specified.
+	 */
+	for (i = 0; i < hmacs->shmac_num_idents; i++) {
+		id = hmacs->shmac_idents[i];
+
+		if (SCTP_AUTH_HMAC_ID_SHA1 == id)
+			has_sha1 = 1;
+
+		if (!sctp_hmac_list[id].hmac_name)
+			return -EOPNOTSUPP;
+	}
+
+	if (!has_sha1)
+		return -EINVAL;
+
+	memcpy(ep->auth_hmacs_list->hmac_ids, &hmacs->shmac_idents[0],
+		hmacs->shmac_num_idents * sizeof(__u16));
+	ep->auth_hmacs_list->param_hdr.length = htons(sizeof(sctp_paramhdr_t) +
+				hmacs->shmac_num_idents * sizeof(__u16));
+	return 0;
+}
+
+/* Set a new shared key on either endpoint or association.  If a
+ * key with the same ID already exists, replace the key (remove the
+ * old key and add a new one).
+ */
+int sctp_auth_set_key(struct sctp_endpoint *ep,
+		      struct sctp_association *asoc,
+		      struct sctp_authkey *auth_key)
+{
+	struct sctp_shared_key *cur_key = NULL;
+	struct sctp_auth_bytes *key;
+	struct list_head *sh_keys;
+	int replace = 0;
+
+	/* Try to find the given key id to see if
+	 * we are doing a replace, or adding a new key
+	 */
+	if (asoc)
+		sh_keys = &asoc->endpoint_shared_keys;
+	else
+		sh_keys = &ep->endpoint_shared_keys;
+
+	key_for_each(cur_key, sh_keys) {
+		if (cur_key->key_id == auth_key->sca_keynumber) {
+			replace = 1;
+			break;
+		}
+	}
+
+	/* If we are not replacing a key id, we need to allocate
+	 * a shared key.
+	 */
+	if (!replace) {
+		cur_key = sctp_auth_shkey_create(auth_key->sca_keynumber,
+						 GFP_KERNEL);
+		if (!cur_key)
+			return -ENOMEM;
+	}
+
+	/* Create a new key data based on the info passed in */
+	key = sctp_auth_create_key(auth_key->sca_keylen, GFP_KERNEL);
+	if (!key)
+		goto nomem;
+
+	memcpy(key->data, &auth_key->sca_key[0], auth_key->sca_keylen);
+
+	/* If we are replacing, remove the old key's data from the
+	 * key id.  If we are adding new key id, add it to the
+	 * list.
+	 */
+	if (replace)
+		sctp_auth_key_put(cur_key->key);
+	else
+		list_add(&cur_key->key_list, sh_keys);
+
+	cur_key->key = key;
+	sctp_auth_key_hold(key);
+
+	return 0;
+nomem:
+	if (!replace)
+		sctp_auth_shkey_free(cur_key);
+
+	return -ENOMEM;
+}
+
+int sctp_auth_set_active_key(struct sctp_endpoint *ep,
+			     struct sctp_association *asoc,
+			     __u16  key_id)
+{
+	struct sctp_shared_key *key;
+	struct list_head *sh_keys;
+	int found = 0;
+
+	/* The key identifier MUST correspond to an existing key */
+	if (asoc)
+		sh_keys = &asoc->endpoint_shared_keys;
+	else
+		sh_keys = &ep->endpoint_shared_keys;
+
+	key_for_each(key, sh_keys) {
+		if (key->key_id == key_id) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return -EINVAL;
+
+	if (asoc) {
+		asoc->active_key_id = key_id;
+		sctp_auth_asoc_init_active_key(asoc, GFP_KERNEL);
+	} else
+		ep->active_key_id = key_id;
+
+	return 0;
+}
+
+int sctp_auth_del_key_id(struct sctp_endpoint *ep,
+			 struct sctp_association *asoc,
+			 __u16  key_id)
+{
+	struct sctp_shared_key *key;
+	struct list_head *sh_keys;
+	int found = 0;
+
+	/* The key identifier MUST NOT be the current active key
+	 * The key identifier MUST correspond to an existing key
+	 */
+	if (asoc) {
+		if (asoc->active_key_id == key_id)
+			return -EINVAL;
+
+		sh_keys = &asoc->endpoint_shared_keys;
+	} else {
+		if (ep->active_key_id == key_id)
+			return -EINVAL;
+
+		sh_keys = &ep->endpoint_shared_keys;
+	}
+
+	key_for_each(key, sh_keys) {
+		if (key->key_id == key_id) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return -EINVAL;
+
+	/* Delete the shared key */
+	list_del_init(&key->key_list);
+	sctp_auth_shkey_free(key);
+
+	return 0;
+}
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 21b0f48..a9b46d2 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -3776,6 +3776,19 @@ sctp_disposition_t sctp_sf_eat_auth(const struct sctp_endpoint *ep,
 			break;
 	}
 
+	if (asoc->active_key_id != ntohs(auth_hdr->shkey_id)) {
+		struct sctp_ulpevent *ev;
+
+		ev = sctp_ulpevent_make_authkey(asoc, ntohs(auth_hdr->shkey_id),
+				    SCTP_AUTH_NEWKEY, GFP_ATOMIC);
+
+		if (!ev)
+			return -ENOMEM;
+
+		sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP,
+				SCTP_ULPEVENT(ev));
+	}
+
 	return SCTP_DISPOSITION_CONSUME;
 }
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index f53545a..8e48fe4 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -2968,6 +2968,164 @@ static int sctp_setsockopt_maxburst(struct sock *sk,
 	return 0;
 }
 
+/*
+ * 7.1.18.  Add a chunk that must be authenticated (SCTP_AUTH_CHUNK)
+ *
+ * This set option adds a chunk type that the user is requesting to be
+ * received only in an authenticated way.  Changes to the list of chunks
+ * will only affect future associations on the socket.
+ */
+static int sctp_setsockopt_auth_chunk(struct sock *sk,
+				    char __user *optval,
+				    int optlen)
+{
+	struct sctp_authchunk val;
+
+	if (optlen != sizeof(struct sctp_authchunk))
+		return -EINVAL;
+	if (copy_from_user(&val, optval, optlen))
+		return -EFAULT;
+
+	switch (val.sauth_chunk) {
+		case SCTP_CID_INIT:
+		case SCTP_CID_INIT_ACK:
+		case SCTP_CID_SHUTDOWN_COMPLETE:
+		case SCTP_CID_AUTH:
+			return -EINVAL;
+	}
+
+	/* add this chunk id to the endpoint */
+	return sctp_auth_ep_add_chunkid(sctp_sk(sk)->ep, val.sauth_chunk);
+}
+
+/*
+ * 7.1.19.  Get or set the list of supported HMAC Identifiers (SCTP_HMAC_IDENT)
+ *
+ * This option gets or sets the list of HMAC algorithms that the local
+ * endpoint requires the peer to use.
+ */
+static int sctp_setsockopt_hmac_ident(struct sock *sk,
+				    char __user *optval,
+				    int optlen)
+{
+	struct sctp_hmacalgo *hmacs;
+	int err;
+
+	if (optlen < sizeof(struct sctp_hmacalgo))
+		return -EINVAL;
+
+	hmacs = kmalloc(optlen, GFP_KERNEL);
+	if (!hmacs)
+		return -ENOMEM;
+
+	if (copy_from_user(hmacs, optval, optlen)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	if (hmacs->shmac_num_idents == 0 ||
+	    hmacs->shmac_num_idents > SCTP_AUTH_NUM_HMACS) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = sctp_auth_ep_set_hmacs(sctp_sk(sk)->ep, hmacs);
+out:
+	kfree(hmacs);
+	return err;
+}
+
+/*
+ * 7.1.20.  Set a shared key (SCTP_AUTH_KEY)
+ *
+ * This option will set a shared secret key which is used to build an
+ * association shared key.
+ */
+static int sctp_setsockopt_auth_key(struct sock *sk,
+				    char __user *optval,
+				    int optlen)
+{
+	struct sctp_authkey *authkey;
+	struct sctp_association *asoc;
+	int ret;
+
+	if (optlen <= sizeof(struct sctp_authkey))
+		return -EINVAL;
+
+	authkey = kmalloc(optlen, GFP_KERNEL);
+	if (!authkey)
+		return -ENOMEM;
+
+	if (copy_from_user(authkey, optval, optlen)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	asoc = sctp_id2assoc(sk, authkey->sca_assoc_id);
+	if (!asoc && authkey->sca_assoc_id && sctp_style(sk, UDP)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = sctp_auth_set_key(sctp_sk(sk)->ep, asoc, authkey);
+out:
+	kfree(authkey);
+	return ret;
+}
+
+/*
+ * 7.1.21.  Get or set the active shared key (SCTP_AUTH_ACTIVE_KEY)
+ *
+ * This option will get or set the active shared key to be used to build
+ * the association shared key.
+ */
+static int sctp_setsockopt_active_key(struct sock *sk,
+					char __user *optval,
+					int optlen)
+{
+	struct sctp_authkeyid val;
+	struct sctp_association *asoc;
+
+	if (optlen != sizeof(struct sctp_authkeyid))
+		return -EINVAL;
+	if (copy_from_user(&val, optval, optlen))
+		return -EFAULT;
+
+	asoc = sctp_id2assoc(sk, val.scact_assoc_id);
+	if (!asoc && val.scact_assoc_id && sctp_style(sk, UDP))
+		return -EINVAL;
+
+	return sctp_auth_set_active_key(sctp_sk(sk)->ep, asoc,
+					val.scact_keynumber);
+}
+
+/*
+ * 7.1.22.  Delete a shared key (SCTP_AUTH_DELETE_KEY)
+ *
+ * This set option will delete a shared secret key from use.
+ */
+static int sctp_setsockopt_del_key(struct sock *sk,
+					char __user *optval,
+					int optlen)
+{
+	struct sctp_authkeyid val;
+	struct sctp_association *asoc;
+
+	if (optlen != sizeof(struct sctp_authkeyid))
+		return -EINVAL;
+	if (copy_from_user(&val, optval, optlen))
+		return -EFAULT;
+
+	asoc = sctp_id2assoc(sk, val.scact_assoc_id);
+	if (!asoc && val.scact_assoc_id && sctp_style(sk, UDP))
+		return -EINVAL;
+
+	return sctp_auth_del_key_id(sctp_sk(sk)->ep, asoc,
+				    val.scact_keynumber);
+
+}
+
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -3091,6 +3249,21 @@ SCTP_STATIC int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_MAX_BURST:
 		retval = sctp_setsockopt_maxburst(sk, optval, optlen);
 		break;
+	case SCTP_AUTH_CHUNK:
+		retval = sctp_setsockopt_auth_chunk(sk, optval, optlen);
+		break;
+	case SCTP_HMAC_IDENT:
+		retval = sctp_setsockopt_hmac_ident(sk, optval, optlen);
+		break;
+	case SCTP_AUTH_KEY:
+		retval = sctp_setsockopt_auth_key(sk, optval, optlen);
+		break;
+	case SCTP_AUTH_ACTIVE_KEY:
+		retval = sctp_setsockopt_active_key(sk, optval, optlen);
+		break;
+	case SCTP_AUTH_DELETE_KEY:
+		retval = sctp_setsockopt_del_key(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -4870,6 +5043,118 @@ static int sctp_getsockopt_maxburst(struct sock *sk, int len,
 	return -ENOTSUPP;
 }
 
+static int sctp_getsockopt_hmac_ident(struct sock *sk, int len,
+				    char __user *optval, int __user *optlen)
+{
+	struct sctp_hmac_algo_param *hmacs;
+	__u16 param_len;
+
+	hmacs = sctp_sk(sk)->ep->auth_hmacs_list;
+	param_len = ntohs(hmacs->param_hdr.length);
+
+	if (len < param_len)
+		return -EINVAL;
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, hmacs->hmac_ids, len))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int sctp_getsockopt_active_key(struct sock *sk, int len,
+				    char __user *optval, int __user *optlen)
+{
+	struct sctp_authkeyid val;
+	struct sctp_association *asoc;
+
+	if (len < sizeof(struct sctp_authkeyid))
+		return -EINVAL;
+	if (copy_from_user(&val, optval, sizeof(struct sctp_authkeyid)))
+		return -EFAULT;
+
+	asoc = sctp_id2assoc(sk, val.scact_assoc_id);
+	if (!asoc && val.scact_assoc_id && sctp_style(sk, UDP))
+		return -EINVAL;
+
+	if (asoc)
+		val.scact_keynumber = asoc->active_key_id;
+	else
+		val.scact_keynumber = sctp_sk(sk)->ep->active_key_id;
+
+	return 0;
+}
+
+static int sctp_getsockopt_peer_auth_chunks(struct sock *sk, int len,
+				    char __user *optval, int __user *optlen)
+{
+	struct sctp_authchunks val;
+	struct sctp_association *asoc;
+	struct sctp_chunks_param *ch;
+	char __user *to;
+
+	if (len <= sizeof(struct sctp_authchunks))
+		return -EINVAL;
+
+	if (copy_from_user(&val, optval, sizeof(struct sctp_authchunks)))
+		return -EFAULT;
+
+	to = val.gauth_chunks;
+	asoc = sctp_id2assoc(sk, val.gauth_assoc_id);
+	if (!asoc)
+		return -EINVAL;
+
+	ch = asoc->peer.peer_chunks;
+
+	/* See if the user provided enough room for all the data */
+	if (len < ntohs(ch->param_hdr.length))
+		return -EINVAL;
+
+	len = ntohs(ch->param_hdr.length);
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(to, ch->chunks, len))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int sctp_getsockopt_local_auth_chunks(struct sock *sk, int len,
+				    char __user *optval, int __user *optlen)
+{
+	struct sctp_authchunks val;
+	struct sctp_association *asoc;
+	struct sctp_chunks_param *ch;
+	char __user *to;
+
+	if (len <= sizeof(struct sctp_authchunks))
+		return -EINVAL;
+
+	if (copy_from_user(&val, optval, sizeof(struct sctp_authchunks)))
+		return -EFAULT;
+
+	to = val.gauth_chunks;
+	asoc = sctp_id2assoc(sk, val.gauth_assoc_id);
+	if (!asoc && val.gauth_assoc_id && sctp_style(sk, UDP))
+		return -EINVAL;
+
+	if (asoc)
+		ch = (struct sctp_chunks_param*)asoc->c.auth_chunks;
+	else
+		ch = sctp_sk(sk)->ep->auth_chunk_list;
+
+	if (len < ntohs(ch->param_hdr.length))
+		return -EINVAL;
+
+	len = ntohs(ch->param_hdr.length);
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(to, ch->chunks, len))
+		return -EFAULT;
+
+	return 0;
+}
+
 SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 				char __user *optval, int __user *optlen)
 {
@@ -4993,6 +5278,25 @@ SCTP_STATIC int sctp_getsockopt(struct sock *sk, int level, int optname,
 	case SCTP_MAX_BURST:
 		retval = sctp_getsockopt_maxburst(sk, len, optval, optlen);
 		break;
+	case SCTP_AUTH_KEY:
+	case SCTP_AUTH_CHUNK:
+	case SCTP_AUTH_DELETE_KEY:
+		retval = -EOPNOTSUPP;
+		break;
+	case SCTP_HMAC_IDENT:
+		retval = sctp_getsockopt_hmac_ident(sk, len, optval, optlen);
+		break;
+	case SCTP_AUTH_ACTIVE_KEY:
+		retval = sctp_getsockopt_active_key(sk, len, optval, optlen);
+		break;
+	case SCTP_PEER_AUTH_CHUNKS:
+		retval = sctp_getsockopt_peer_auth_chunks(sk, len, optval,
+							optlen);
+		break;
+	case SCTP_LOCAL_AUTH_CHUNKS:
+		retval = sctp_getsockopt_local_auth_chunks(sk, len, optval,
+							optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index 5dc094b..2c17c7e 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -813,6 +813,43 @@ fail:
 	return NULL;
 }
 
+struct sctp_ulpevent *sctp_ulpevent_make_authkey(
+	const struct sctp_association *asoc, __u16 key_id,
+	__u32 indication, gfp_t gfp)
+{
+	struct sctp_ulpevent *event;
+	struct sctp_authkey_event *ak;
+	struct sk_buff *skb;
+
+	event = sctp_ulpevent_new(sizeof(struct sctp_authkey_event),
+				  MSG_NOTIFICATION, gfp);
+	if (!event)
+		goto fail;
+
+	skb = sctp_event2skb(event);
+	ak = (struct sctp_authkey_event *)
+		skb_put(skb, sizeof(struct sctp_authkey_event));
+
+	ak->auth_type = SCTP_AUTHENTICATION_EVENT;
+	ak->auth_flags = 0;
+	ak->auth_length = sizeof(struct sctp_authkey_event);
+
+	ak->auth_keynumber = key_id;
+	ak->auth_altkeynumber = 0;
+	ak->auth_indication = indication;
+
+	/*
+	 * The association id field holds the identifier for the association.
+	 */
+	sctp_ulpevent_set_owner(event, asoc);
+	ak->auth_assoc_id = sctp_assoc2id(asoc);
+
+	return event;
+fail:
+	return NULL;
+}
+
+
 /* Return the notification type, assuming this is a notification
  * event.
  */
-- 
1.5.2.4



* [PATCH 2/8] SCTP: Implement SCTP-AUTH internals
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

This patch implements the internal operations of SCTP-AUTH, such as
key computation and storage.  It also adds necessary variables to
the SCTP data structures.
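
As a rough illustration of the key computation, here is a user-space
sketch of the SCTP-AUTH Section 6.1 ordering rule that the comments in
net/sctp/auth.c quote: the numerically smaller key vector follows the
endpoint shared key, then the larger one; vectors equal as numbers are
ordered shorter first.  It is not the kernel code from this patch.  The
names struct byte_vec, vec_compare() and make_secret() are hypothetical
stand-ins, and the comparison is written independently of
sctp_auth_compare_vectors().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the patch's struct sctp_auth_bytes. */
struct byte_vec {
	size_t len;
	const unsigned char *data;
};

/* Compare two byte vectors as big-endian numbers.
 * Returns <0, 0 or >0.  Vectors that are equal as numbers but differ
 * in length are ordered by length, the shorter one being smaller.
 */
static int vec_compare(const struct byte_vec *a, const struct byte_vec *b)
{
	size_t ia = 0, ib = 0, na, nb;
	int c;

	/* Skip leading zero bytes; they do not change the value. */
	while (ia < a->len && a->data[ia] == 0)
		ia++;
	while (ib < b->len && b->data[ib] == 0)
		ib++;

	na = a->len - ia;
	nb = b->len - ib;
	if (na != nb)		/* more significant bytes: larger number */
		return na > nb ? 1 : -1;

	c = memcmp(a->data + ia, b->data + ib, na);
	if (c)
		return c > 0 ? 1 : -1;

	if (a->len != b->len)	/* same value: shorter vector is smaller */
		return a->len > b->len ? 1 : -1;
	return 0;
}

/* Build: shared key | smaller vector | larger vector.
 * key may be NULL (no endpoint shared key).  Caller frees the result.
 */
static unsigned char *make_secret(const struct byte_vec *key,
				  const struct byte_vec *local,
				  const struct byte_vec *peer,
				  size_t *out_len)
{
	const struct byte_vec *first = local, *last = peer;
	size_t klen = key ? key->len : 0;
	size_t len;
	unsigned char *buf;

	assert(out_len != NULL);

	if (vec_compare(local, peer) > 0) {
		first = peer;
		last = local;
	}

	len = klen + first->len + last->len;
	buf = malloc(len ? len : 1);
	if (!buf)
		return NULL;

	if (klen)
		memcpy(buf, key->data, klen);
	memcpy(buf + klen, first->data, first->len);
	memcpy(buf + klen + first->len, last->data, last->len);

	*out_len = len;
	return buf;
}
```

Because both sides sort the two vectors the same way, each endpoint
should arrive at the same byte string regardless of which vector is
"local" and which is "peer".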

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/auth.h      |  112 +++++++
 include/net/sctp/constants.h |   49 +++-
 include/net/sctp/sctp.h      |    1 +
 include/net/sctp/structs.h   |   71 ++++-
 net/sctp/Makefile            |    3 +-
 net/sctp/auth.c              |  744 ++++++++++++++++++++++++++++++++++++++++++
 net/sctp/objcnt.c            |    2 +
 7 files changed, 975 insertions(+), 7 deletions(-)
 create mode 100644 include/net/sctp/auth.h
 create mode 100644 net/sctp/auth.c

diff --git a/include/net/sctp/auth.h b/include/net/sctp/auth.h
new file mode 100644
index 0000000..10c8010
--- /dev/null
+++ b/include/net/sctp/auth.h
@@ -0,0 +1,112 @@
+/* SCTP kernel reference Implementation
+ * (C) Copyright 2007 Hewlett-Packard Development Company, L.P.
+ *
+ * This file is part of the SCTP kernel reference Implementation
+ *
+ * The SCTP reference implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * The SCTP reference implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email address(es):
+ *    lksctp developers <lksctp-developers@lists.sourceforge.net>
+ *
+ * Or submit a bug report through the following website:
+ *    http://www.sf.net/projects/lksctp
+ *
+ * Written or modified by:
+ *   Vlad Yasevich     <vladislav.yasevich@hp.com>
+ *
+ * Any bugs reported given to us we will try to fix... any fixes shared will
+ * be incorporated into the next SCTP release.
+ */
+
+#ifndef __sctp_auth_h__
+#define __sctp_auth_h__
+
+#include <linux/list.h>
+#include <linux/crypto.h>
+
+struct sctp_endpoint;
+struct sctp_association;
+struct sctp_authkey;
+
+/*
+ * Define a generic struct that will hold all the info
+ * necessary for an HMAC transform
+ */
+struct sctp_hmac {
+	__u16 hmac_id;		/* one of the above ids */
+	char *hmac_name;	/* name for loading */
+	__u16 hmac_len;		/* length of the signature */
+};
+
+/* This is a generic structure that contains authentication bytes used
+ * as keying material.  It is what is referred to as a byte-vector all
+ * over SCTP-AUTH
+ */
+struct sctp_auth_bytes {
+	atomic_t refcnt;
+	__u32 len;
+	__u8  data[];
+};
+
+/* Definition for a shared key, whether endpoint or association */
+struct sctp_shared_key {
+	struct list_head key_list;
+	__u16 key_id;
+	struct sctp_auth_bytes *key;
+};
+
+#define key_for_each(__key, __list_head) \
+	list_for_each_entry(__key, __list_head, key_list)
+
+#define key_for_each_safe(__key, __tmp, __list_head) \
+	list_for_each_entry_safe(__key, __tmp, __list_head, key_list)
+
+static inline void sctp_auth_key_hold(struct sctp_auth_bytes *key)
+{
+	if (!key)
+		return;
+
+	atomic_inc(&key->refcnt);
+}
+
+void sctp_auth_key_put(struct sctp_auth_bytes *key);
+struct sctp_shared_key *sctp_auth_shkey_create(__u16 key_id, gfp_t gfp);
+void sctp_auth_shkey_free(struct sctp_shared_key *sh_key);
+void sctp_auth_destroy_keys(struct list_head *keys);
+int sctp_auth_asoc_init_active_key(struct sctp_association *asoc, gfp_t gfp);
+struct sctp_shared_key *sctp_auth_get_shkey(
+				const struct sctp_association *asoc,
+				__u16 key_id);
+int sctp_auth_asoc_copy_shkeys(const struct sctp_endpoint *ep,
+				struct sctp_association *asoc,
+				gfp_t gfp);
+int sctp_auth_init_hmacs(struct sctp_endpoint *ep, gfp_t gfp);
+void sctp_auth_destroy_hmacs(struct crypto_hash *auth_hmacs[]);
+struct sctp_hmac *sctp_auth_get_hmac(__u16 hmac_id);
+struct sctp_hmac *sctp_auth_asoc_get_hmac(const struct sctp_association *asoc);
+void sctp_auth_asoc_set_default_hmac(struct sctp_association *asoc,
+				     struct sctp_hmac_algo_param *hmacs);
+int sctp_auth_asoc_verify_hmac_id(const struct sctp_association *asoc,
+				    __u16 hmac_id);
+int sctp_auth_send_cid(sctp_cid_t chunk, const struct sctp_association *asoc);
+int sctp_auth_recv_cid(sctp_cid_t chunk, const struct sctp_association *asoc);
+void sctp_auth_calculate_hmac(const struct sctp_association *asoc,
+			    struct sk_buff *skb,
+			    struct sctp_auth_chunk *auth, gfp_t gfp);
+#endif
diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index bb37724..777118f 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -64,12 +64,18 @@ enum { SCTP_DEFAULT_INSTREAMS = SCTP_MAX_STREAM };
 #define SCTP_CID_MAX			SCTP_CID_ASCONF_ACK
 
 #define SCTP_NUM_BASE_CHUNK_TYPES	(SCTP_CID_BASE_MAX + 1)
-#define SCTP_NUM_CHUNK_TYPES		(SCTP_NUM_BASE_CHUNKTYPES + 2)
 
 #define SCTP_NUM_ADDIP_CHUNK_TYPES	2
 
 #define SCTP_NUM_PRSCTP_CHUNK_TYPES	1
 
+#define SCTP_NUM_AUTH_CHUNK_TYPES	1
+
+#define SCTP_NUM_CHUNK_TYPES		(SCTP_NUM_BASE_CHUNK_TYPES + \
+					 SCTP_NUM_ADDIP_CHUNK_TYPES +\
+					 SCTP_NUM_PRSCTP_CHUNK_TYPES +\
+					 SCTP_NUM_AUTH_CHUNK_TYPES)
+
 /* These are the different flavours of event.  */
 typedef enum {
 
@@ -409,4 +415,45 @@ typedef enum {
 	SCTP_LOWER_CWND_INACTIVE,
 } sctp_lower_cwnd_t;
 
+
+/* SCTP-AUTH Necessary constants */
+
+/* SCTP-AUTH, Section 3.3
+ *
+ *  The following Table 2 shows the currently defined values for HMAC
+ *  identifiers.
+ *
+ *  +-----------------+--------------------------+
+ *  | HMAC Identifier | Message Digest Algorithm |
+ *  +-----------------+--------------------------+
+ *  | 0               | Reserved                 |
+ *  | 1               | SHA-1 defined in [8]     |
+ *  | 2               | Reserved                 |
+ *  | 3               | SHA-256 defined in [8]   |
+ *  +-----------------+--------------------------+
+ */
+enum {
+	SCTP_AUTH_HMAC_ID_RESERVED_0,
+	SCTP_AUTH_HMAC_ID_SHA1,
+	SCTP_AUTH_HMAC_ID_RESERVED_2,
+	SCTP_AUTH_HMAC_ID_SHA256
+};
+
+#define SCTP_AUTH_HMAC_ID_MAX	SCTP_AUTH_HMAC_ID_SHA256
+#define SCTP_AUTH_NUM_HMACS (SCTP_AUTH_HMAC_ID_SHA256 + 1)
+#define SCTP_SHA1_SIG_SIZE 20
+#define SCTP_SHA256_SIG_SIZE 32
+
+/*  SCTP-AUTH, Section 3.2
+ *     The chunk types for INIT, INIT-ACK, SHUTDOWN-COMPLETE and AUTH chunks
+ *     MUST NOT be listed in the CHUNKS parameter
+ */
+#define SCTP_NUM_NOAUTH_CHUNKS	4
+#define SCTP_AUTH_MAX_CHUNKS	(SCTP_NUM_CHUNK_TYPES - SCTP_NUM_NOAUTH_CHUNKS)
+
+/* SCTP-AUTH Section 6.1
+ * The RANDOM parameter MUST contain a 32 byte random number.
+ */
+#define SCTP_AUTH_RANDOM_LENGTH 32
+
 #endif /* __sctp_constants_h__ */
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 46d7d09..b542e96 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -340,6 +340,7 @@ extern atomic_t sctp_dbg_objcnt_bind_bucket;
 extern atomic_t sctp_dbg_objcnt_addr;
 extern atomic_t sctp_dbg_objcnt_ssnmap;
 extern atomic_t sctp_dbg_objcnt_datamsg;
+extern atomic_t sctp_dbg_objcnt_keys;
 
 /* Macros to atomically increment/decrement objcnt counters.  */
 #define SCTP_DBG_OBJCNT_INC(name) \
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 8505ecc..a668825 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -64,6 +64,7 @@
 #include <linux/skbuff.h>	/* We need sk_buff_head. */
 #include <linux/workqueue.h>	/* We need tq_struct.	 */
 #include <linux/sctp.h>		/* We need sctp* header structs.  */
+#include <net/sctp/auth.h>	/* We need auth specific structs */
 
 /* A convenience structure for handling sockaddr structures.
  * We should wean ourselves off this.
@@ -213,6 +214,9 @@ extern struct sctp_globals {
 
 	/* Flag to indicate if PR-SCTP is enabled. */
 	int prsctp_enable;
+
+	/* Flag to indicate if SCTP-AUTH is enabled */
+	int auth_enable;
 } sctp_globals;
 
 #define sctp_rto_initial		(sctp_globals.rto_initial)
@@ -244,6 +248,7 @@ extern struct sctp_globals {
 #define sctp_local_addr_list		(sctp_globals.local_addr_list)
 #define sctp_addip_enable		(sctp_globals.addip_enable)
 #define sctp_prsctp_enable		(sctp_globals.prsctp_enable)
+#define sctp_auth_enable		(sctp_globals.auth_enable)
 
 /* SCTP Socket type: UDP or TCP style. */
 typedef enum {
@@ -393,6 +398,9 @@ struct sctp_cookie {
 
 	__u32 adaptation_ind;
 
+	__u8 auth_random[sizeof(sctp_paramhdr_t) + SCTP_AUTH_RANDOM_LENGTH];
+	__u8 auth_hmacs[SCTP_AUTH_NUM_HMACS + 2];
+	__u8 auth_chunks[sizeof(sctp_paramhdr_t) + SCTP_AUTH_MAX_CHUNKS];
 
 	/* This is a shim for my peer's INIT packet, followed by
 	 * a copy of the raw address list of the association.
@@ -436,6 +444,9 @@ union sctp_params {
 	union sctp_addr_param *addr;
 	struct sctp_adaptation_ind_param *aind;
 	struct sctp_supported_ext_param *ext;
+	struct sctp_random_param *random;
+	struct sctp_chunks_param *chunks;
+	struct sctp_hmac_algo_param *hmac_algo;
 };
 
 /* RFC 2960.  Section 3.3.5 Heartbeat.
@@ -674,6 +685,7 @@ struct sctp_chunk {
 		struct sctp_errhdr *err_hdr;
 		struct sctp_addiphdr *addip_hdr;
 		struct sctp_fwdtsn_hdr *fwdtsn_hdr;
+		struct sctp_authhdr *auth_hdr;
 	} subh;
 
 	__u8 *chunk_end;
@@ -719,6 +731,7 @@ struct sctp_chunk {
 	__s8 fast_retransmit;	 /* Is this chunk fast retransmitted? */
 	__u8 tsn_missing_report; /* Data chunk missing counter. */
 	__u8 data_accepted; 	/* At least 1 chunk in this packet accepted */
+	__u8 auth;		/* IN: was auth'ed | OUT: needs auth */
 };
 
 void sctp_chunk_hold(struct sctp_chunk *);
@@ -766,16 +779,22 @@ struct sctp_packet {
 	 */
 	struct sctp_transport *transport;
 
+	/* pointer to the auth chunk for this packet */
+	struct sctp_chunk *auth;
+
 	/* This packet contains a COOKIE-ECHO chunk. */
-	char has_cookie_echo;
+	__u8 has_cookie_echo;
+
+	/* This packet contains a SACK chunk. */
+	__u8 has_sack;
 
-	/* This packet containsa SACK chunk. */
-	char has_sack;
+	/* This packet contains an AUTH chunk */
+	__u8 has_auth;
 
 	/* SCTP cannot fragment this packet. So let ip fragment it. */
-	char ipfragok;
+	__u8 ipfragok;
 
-	int malloced;
+	__u8 malloced;
 };
 
 struct sctp_packet *sctp_packet_init(struct sctp_packet *,
@@ -1285,6 +1304,21 @@ struct sctp_endpoint {
 
 	/* rcvbuf acct. policy.	*/
 	__u32 rcvbuf_policy;
+
+	/* SCTP AUTH: array of the HMACs that will be allocated
+	 * we need this per association so that we don't serialize
+	 */
+	struct crypto_hash **auth_hmacs;
+
+	/* SCTP-AUTH: hmacs for the endpoint encoded into parameter */
+	 struct sctp_hmac_algo_param *auth_hmacs_list;
+
+	/* SCTP-AUTH: chunks to authenticate encoded into parameter */
+	struct sctp_chunks_param *auth_chunk_list;
+
+	/* SCTP-AUTH: endpoint shared keys */
+	struct list_head endpoint_shared_keys;
+	__u16 active_key_id;
 };
 
 /* Recover the outter endpoint structure. */
@@ -1491,6 +1525,7 @@ struct sctp_association {
 		__u8	hostname_address;/* Peer understands DNS addresses? */
 		__u8    asconf_capable;  /* Does peer support ADDIP? */
 		__u8    prsctp_capable;  /* Can peer do PR-SCTP? */
+		__u8	auth_capable;	 /* Is peer doing SCTP-AUTH? */
 
 		__u32   adaptation_ind;	 /* Adaptation Code point. */
 
@@ -1508,6 +1543,14 @@ struct sctp_association {
 		 * Initial TSN Value minus 1
 		 */
 		__u32 addip_serial;
+
+		/* SCTP-AUTH: We need to know peer's random number, hmac list
+		 * and authenticated chunk list.  All that is part of the
+		 * cookie and these are just pointers to those locations
+		 */
+		sctp_random_param_t *peer_random;
+		sctp_chunks_param_t *peer_chunks;
+		sctp_hmac_algo_param_t *peer_hmacs;
 	} peer;
 
 	/* State       : A state variable indicating what state the
@@ -1791,6 +1834,24 @@ struct sctp_association {
 	 */
 	__u32 addip_serial;
 
+	/* SCTP AUTH: list of the endpoint shared keys.  These
+	 * keys are provided out of band by the user application
+	 * and can't change during the lifetime of the association
+	 */
+	struct list_head endpoint_shared_keys;
+
+	/* SCTP AUTH:
+	 * The current generated association shared key (secret)
+	 */
+	struct sctp_auth_bytes *asoc_shared_key;
+
+	/* SCTP AUTH: hmac id of the first peer requested algorithm
+	 * that we support.
+	 */
+	__u16 default_hmac_id;
+
+	__u16 active_key_id;
+
 	/* Need to send an ECNE Chunk? */
 	char need_ecne;
 
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 70c828b..1da7204 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -9,7 +9,8 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  transport.o chunk.o sm_make_chunk.o ulpevent.o \
 	  inqueue.o outqueue.o ulpqueue.o command.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
-	  output.o input.o debug.o ssnmap.o proc.o crc32c.o
+	  output.o input.o debug.o ssnmap.o proc.o crc32c.o \
+	  auth.o
 
 sctp-$(CONFIG_SCTP_DBG_OBJCNT) += objcnt.o
 sctp-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/net/sctp/auth.c b/net/sctp/auth.c
new file mode 100644
index 0000000..1fee43e
--- /dev/null
+++ b/net/sctp/auth.c
@@ -0,0 +1,744 @@
+/* SCTP kernel reference Implementation
+ * (C) Copyright 2007 Hewlett-Packard Development Company, L.P.
+ *
+ * This file is part of the SCTP kernel reference Implementation
+ *
+ * The SCTP reference implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * The SCTP reference implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, write to
+ * the Free Software Foundation, 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email address(es):
+ *    lksctp developers <lksctp-developers@lists.sourceforge.net>
+ *
+ * Or submit a bug report through the following website:
+ *    http://www.sf.net/projects/lksctp
+ *
+ * Written or modified by:
+ *   Vlad Yasevich     <vladislav.yasevich@hp.com>
+ *
+ * Any bugs reported given to us we will try to fix... any fixes shared will
+ * be incorporated into the next SCTP release.
+ */
+
+#include <linux/types.h>
+#include <linux/crypto.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/auth.h>
+
+static struct sctp_hmac sctp_hmac_list[SCTP_AUTH_NUM_HMACS] = {
+	{
+		/* id 0 is reserved, so all other fields stay 0 */
+		.hmac_id = SCTP_AUTH_HMAC_ID_RESERVED_0,
+	},
+	{
+		.hmac_id = SCTP_AUTH_HMAC_ID_SHA1,
+		.hmac_name="hmac(sha1)",
+		.hmac_len = SCTP_SHA1_SIG_SIZE,
+	},
+	{
+		/* id 2 is reserved as well */
+		.hmac_id = SCTP_AUTH_HMAC_ID_RESERVED_2,
+	},
+	{
+		.hmac_id = SCTP_AUTH_HMAC_ID_SHA256,
+		.hmac_name="hmac(sha256)",
+		.hmac_len = SCTP_SHA256_SIG_SIZE,
+	}
+};
+
+
+void sctp_auth_key_put(struct sctp_auth_bytes *key)
+{
+	if (!key)
+		return;
+
+	if (atomic_dec_and_test(&key->refcnt)) {
+		kfree(key);
+		SCTP_DBG_OBJCNT_DEC(keys);
+	}
+}
+
+/* Create a new key structure of a given length */
+static struct sctp_auth_bytes *sctp_auth_create_key(__u32 key_len, gfp_t gfp)
+{
+	struct sctp_auth_bytes *key;
+
+	/* Allocate the shared key */
+	key = kmalloc(sizeof(struct sctp_auth_bytes) + key_len, gfp);
+	if (!key)
+		return NULL;
+
+	key->len = key_len;
+	atomic_set(&key->refcnt, 1);
+	SCTP_DBG_OBJCNT_INC(keys);
+
+	return key;
+}
+
+/* Create a new shared key container with a given key id */
+struct sctp_shared_key *sctp_auth_shkey_create(__u16 key_id, gfp_t gfp)
+{
+	struct sctp_shared_key *new;
+
+	/* Allocate the shared key container */
+	new = kzalloc(sizeof(struct sctp_shared_key), gfp);
+	if (!new)
+		return NULL;
+
+	INIT_LIST_HEAD(&new->key_list);
+	new->key_id = key_id;
+
+	return new;
+}
+
+/* Free the shared key structure */
+void sctp_auth_shkey_free(struct sctp_shared_key *sh_key)
+{
+	BUG_ON(!list_empty(&sh_key->key_list));
+	sctp_auth_key_put(sh_key->key);
+	sh_key->key = NULL;
+	kfree(sh_key);
+}
+
+/* Destroy the entire key list.  This is done during the
+ * association and endpoint free process.
+ */
+void sctp_auth_destroy_keys(struct list_head *keys)
+{
+	struct sctp_shared_key *ep_key;
+	struct sctp_shared_key *tmp;
+
+	if (list_empty(keys))
+		return;
+
+	key_for_each_safe(ep_key, tmp, keys) {
+		list_del_init(&ep_key->key_list);
+		sctp_auth_shkey_free(ep_key);
+	}
+}
+
+/* Compare two byte vectors as numbers.  Return values
+ * are:
+ * 	  0 - vectors are equal
+ * 	< 0 - vector 1 is smaller than vector2
+ * 	> 0 - vector 1 is greater than vector2
+ *
+ * Algorithm is:
+ * 	This is performed by selecting the numerically smaller key vector...
+ *	If the key vectors are equal as numbers but differ in length ...
+ *	the shorter vector is considered smaller
+ *
+ * Examples (with small values):
+ * 	000123456789 > 123456789 (first number is longer)
+ * 	000123456789 < 234567891 (second number is larger numerically)
+ * 	123456789 > 2345678 	 (first number is both larger & longer)
+ */
+static int sctp_auth_compare_vectors(struct sctp_auth_bytes *vector1,
+			      struct sctp_auth_bytes *vector2)
+{
+	int diff;
+	int i;
+	const __u8 *longer;
+
+	diff = vector1->len - vector2->len;
+	if (diff) {
+		longer = (diff > 0) ? vector1->data : vector2->data;
+
+		/* Check to see if the longer number is
+		 * lead-zero padded.  If it is not, it
+		 * is automatically larger numerically.
+		 */
+		for (i = 0; i < abs(diff); i++ ) {
+			if (longer[i] != 0)
+				return diff;
+		}
+	}
+
+	/* lengths are the same, compare numbers */
+	return memcmp(vector1->data, vector2->data, vector1->len);
+}
+
+/*
+ * Create a key vector as described in SCTP-AUTH, Section 6.1
+ *    The RANDOM parameter, the CHUNKS parameter and the HMAC-ALGO
+ *    parameter sent by each endpoint are concatenated as byte vectors.
+ *    These parameters include the parameter type, parameter length, and
+ *    the parameter value, but padding is omitted; all padding MUST be
+ *    removed from this concatenation before proceeding with further
+ *    computation of keys.  Parameters which were not sent are simply
+ *    omitted from the concatenation process.  The resulting two vectors
+ *    are called the two key vectors.
+ */
+static struct sctp_auth_bytes *sctp_auth_make_key_vector(
+			sctp_random_param_t *random,
+			sctp_chunks_param_t *chunks,
+			sctp_hmac_algo_param_t *hmacs,
+			gfp_t gfp)
+{
+	struct sctp_auth_bytes *new;
+	__u32	len;
+	__u32	offset = 0;
+
+	len = ntohs(random->param_hdr.length) + ntohs(hmacs->param_hdr.length);
+        if (chunks)
+		len += ntohs(chunks->param_hdr.length);
+
+	new = kmalloc(sizeof(struct sctp_auth_bytes) + len, gfp);
+	if (!new)
+		return NULL;
+
+	new->len = len;
+
+	memcpy(new->data, random, ntohs(random->param_hdr.length));
+	offset += ntohs(random->param_hdr.length);
+
+	if (chunks) {
+		memcpy(new->data + offset, chunks,
+			ntohs(chunks->param_hdr.length));
+		offset += ntohs(chunks->param_hdr.length);
+	}
+
+	memcpy(new->data + offset, hmacs, ntohs(hmacs->param_hdr.length));
+
+	return new;
+}
+
+
+/* Make a key vector based on our local parameters */
+struct sctp_auth_bytes *sctp_auth_make_local_vector(
+				    const struct sctp_association *asoc,
+				    gfp_t gfp)
+{
+	return sctp_auth_make_key_vector(
+				    (sctp_random_param_t*)asoc->c.auth_random,
+				    (sctp_chunks_param_t*)asoc->c.auth_chunks,
+				    (sctp_hmac_algo_param_t*)asoc->c.auth_hmacs,
+				    gfp);
+}
+
+/* Make a key vector based on peer's parameters */
+struct sctp_auth_bytes *sctp_auth_make_peer_vector(
+				    const struct sctp_association *asoc,
+				    gfp_t gfp)
+{
+	return sctp_auth_make_key_vector(asoc->peer.peer_random,
+					 asoc->peer.peer_chunks,
+					 asoc->peer.peer_hmacs,
+					 gfp);
+}
+
+
+/* Set the value of the association shared key based on the parameters
+ * given.  The algorithm is:
+ *    From the endpoint pair shared keys and the key vectors the
+ *    association shared keys are computed.  This is performed by selecting
+ *    the numerically smaller key vector and concatenating it to the
+ *    endpoint pair shared key, and then concatenating the numerically
+ *    larger key vector to that.  The result of the concatenation is the
+ *    association shared key.
+ */
+static struct sctp_auth_bytes *sctp_auth_asoc_set_secret(
+			struct sctp_shared_key *ep_key,
+			struct sctp_auth_bytes *first_vector,
+			struct sctp_auth_bytes *last_vector,
+			gfp_t gfp)
+{
+	struct sctp_auth_bytes *secret;
+	__u32 offset = 0;
+	__u32 auth_len;
+
+	auth_len = first_vector->len + last_vector->len;
+	if (ep_key->key)
+		auth_len += ep_key->key->len;
+
+	secret = sctp_auth_create_key(auth_len, gfp);
+	if (!secret)
+		return NULL;
+
+	if (ep_key->key) {
+		memcpy(secret->data, ep_key->key->data, ep_key->key->len);
+		offset += ep_key->key->len;
+	}
+
+	memcpy(secret->data + offset, first_vector->data, first_vector->len);
+	offset += first_vector->len;
+
+	memcpy(secret->data + offset, last_vector->data, last_vector->len);
+
+	return secret;
+}
+
+/* Create an association shared key.  Follow the algorithm
+ * described in SCTP-AUTH, Section 6.1
+ */
+static struct sctp_auth_bytes *sctp_auth_asoc_create_secret(
+				 const struct sctp_association *asoc,
+				 struct sctp_shared_key *ep_key,
+				 gfp_t gfp)
+{
+	struct sctp_auth_bytes *local_key_vector;
+	struct sctp_auth_bytes *peer_key_vector;
+	struct sctp_auth_bytes	*first_vector,
+				*last_vector;
+	struct sctp_auth_bytes	*secret = NULL;
+	int	cmp;
+
+
+	/* Now we need to build the key vectors
+	 * SCTP-AUTH, Section 6.1
+	 *    The RANDOM parameter, the CHUNKS parameter and the HMAC-ALGO
+	 *    parameter sent by each endpoint are concatenated as byte vectors.
+	 *    These parameters include the parameter type, parameter length, and
+	 *    the parameter value, but padding is omitted; all padding MUST be
+	 *    removed from this concatenation before proceeding with further
+	 *    computation of keys.  Parameters which were not sent are simply
+	 *    omitted from the concatenation process.  The resulting two vectors
+	 *    are called the two key vectors.
+	 */
+	
+	local_key_vector = sctp_auth_make_local_vector(asoc, gfp);
+	peer_key_vector = sctp_auth_make_peer_vector(asoc, gfp);
+
+	if (!peer_key_vector || !local_key_vector)
+		goto out;
+
+	/* Figure out the order in which the key_vectors will be
+	 * added to the endpoint shared key.
+	 * SCTP-AUTH, Section 6.1:
+	 *   This is performed by selecting the numerically smaller key
+	 *   vector and concatenating it to the endpoint pair shared
+	 *   key, and then concatenating the numerically larger key
+	 *   vector to that.  If the key vectors are equal as numbers
+	 *   but differ in length, then the concatenation order is the
+	 *   endpoint shared key, followed by the shorter key vector,
+	 *   followed by the longer key vector.  Otherwise, the key
+	 *   vectors are identical, and may be concatenated to the
+	 *   endpoint pair key in any order.
+	 */
+	cmp = sctp_auth_compare_vectors(local_key_vector,
+					peer_key_vector);
+	if (cmp < 0) {
+		first_vector = local_key_vector;
+		last_vector = peer_key_vector;
+	} else {
+		first_vector = peer_key_vector;
+		last_vector = local_key_vector;
+	}
+
+	secret = sctp_auth_asoc_set_secret(ep_key, first_vector, last_vector,
+					    gfp);
+out:
+	kfree(local_key_vector);
+	kfree(peer_key_vector);
+
+	return secret; 
+}
+
+/*
+ * Populate the association overlay list with the list
+ * from the endpoint.
+ */
+int sctp_auth_asoc_copy_shkeys(const struct sctp_endpoint *ep,
+				struct sctp_association *asoc,
+				gfp_t gfp)
+{
+	struct sctp_shared_key *sh_key;
+	struct sctp_shared_key *new;
+
+	BUG_ON(!list_empty(&asoc->endpoint_shared_keys));
+
+	key_for_each(sh_key, &ep->endpoint_shared_keys) {
+		new = sctp_auth_shkey_create(sh_key->key_id, gfp);
+		if (!new)
+			goto nomem;
+
+		new->key = sh_key->key;
+		sctp_auth_key_hold(new->key);
+		list_add(&new->key_list, &asoc->endpoint_shared_keys);
+	}
+
+	return 0;
+
+nomem:
+	sctp_auth_destroy_keys(&asoc->endpoint_shared_keys);
+	return -ENOMEM;
+}
+
+
+/* Public interface to create the association shared key.
+ * See code above for the algorithm.
+ */
+int sctp_auth_asoc_init_active_key(struct sctp_association *asoc, gfp_t gfp)
+{
+	struct sctp_auth_bytes	*secret;
+	struct sctp_shared_key *ep_key;
+
+	/* If we don't support AUTH, or peer is not capable
+	 * we don't need to do anything.
+	 */
+	if (!sctp_auth_enable || !asoc->peer.auth_capable)
+		return 0;
+
+	/* If the key_id is non-zero and we couldn't find an
+	 * endpoint pair shared key, we can't compute the
+	 * secret.
+	 * For key_id 0, endpoint pair shared key is a NULL key.
+	 */
+	ep_key = sctp_auth_get_shkey(asoc, asoc->active_key_id);
+	BUG_ON(!ep_key);
+
+	secret = sctp_auth_asoc_create_secret(asoc, ep_key, gfp);
+	if (!secret)
+		return -ENOMEM;
+
+	sctp_auth_key_put(asoc->asoc_shared_key);
+	asoc->asoc_shared_key = secret;
+
+	return 0;
+}
+
+
+/* Find the endpoint pair shared key based on the key_id */
+struct sctp_shared_key *sctp_auth_get_shkey(
+				const struct sctp_association *asoc,
+				__u16 key_id)
+{
+	struct sctp_shared_key *key;
+
+	/* First search association's set of endpoint pair shared keys */
+	key_for_each(key, &asoc->endpoint_shared_keys) {
+		if (key->key_id == key_id)
+			return key;
+	}
+
+	return NULL;
+}
+
+/* 
+ * Initialize all the possible digest transforms that we can use.  Right
+ * now, the supported digests are SHA1 and SHA256.  We do this here once
+ * because of the restriction that transforms may only be allocated in
+ * user context.  This forces us to pre-allocate all possible transforms
+ * at the endpoint init time.
+ */
+int sctp_auth_init_hmacs(struct sctp_endpoint *ep, gfp_t gfp)
+{
+	struct crypto_hash *tfm = NULL;
+	__u16   id;
+
+	/* If AUTH is disabled, there are no transforms to allocate */
+	if (!sctp_auth_enable) {
+		ep->auth_hmacs = NULL;
+		return 0;
+	}
+
+	if (ep->auth_hmacs)
+		return 0;
+
+	/* Allocate the array of pointers to transforms */
+	ep->auth_hmacs = kzalloc(
+			    sizeof(struct crypto_hash *) * SCTP_AUTH_NUM_HMACS,
+			    gfp);
+	if (!ep->auth_hmacs)
+		return -ENOMEM;
+
+	for (id = 0; id < SCTP_AUTH_NUM_HMACS; id++) {
+
+		/* See if we support the id.  Supported IDs have name and
+		 * length fields set, so that we can allocate and use
+		 * them.  We can safely just check for name, for without the
+		 * name, we can't allocate the TFM.
+		 */
+		if (!sctp_hmac_list[id].hmac_name)
+			continue;
+
+		/* If this TFM has been allocated, we are all set */
+		if (ep->auth_hmacs[id])
+			continue;
+
+		/* Allocate the ID */
+		tfm = crypto_alloc_hash(sctp_hmac_list[id].hmac_name, 0,
+					CRYPTO_ALG_ASYNC);
+		if (IS_ERR(tfm))
+			goto out_err;
+
+		ep->auth_hmacs[id] = tfm;
+	}
+
+	return 0;
+
+out_err:
+	/* Clean up any successful allocations */
+	sctp_auth_destroy_hmacs(ep->auth_hmacs);
+	return -ENOMEM;
+}
+
+/* Destroy the hmac tfm array */
+void sctp_auth_destroy_hmacs(struct crypto_hash *auth_hmacs[])
+{
+	int i;
+
+	if (!auth_hmacs)
+		return;
+
+	for (i = 0; i < SCTP_AUTH_NUM_HMACS; i++)
+	{
+		if (auth_hmacs[i])
+			crypto_free_hash(auth_hmacs[i]);
+	}
+	kfree(auth_hmacs);
+}
+
+
+struct sctp_hmac *sctp_auth_get_hmac(__u16 hmac_id)
+{
+	return &sctp_hmac_list[hmac_id];
+}
+
+/* Get an hmac description information that we can use to build
+ * the AUTH chunk
+ */
+struct sctp_hmac *sctp_auth_asoc_get_hmac(const struct sctp_association *asoc)
+{
+	struct sctp_hmac_algo_param *hmacs;
+	__u16 n_elt;
+	__u16 id = 0;
+	int i;
+
+	/* If we have a default entry, use it */
+	if (asoc->default_hmac_id)
+		return &sctp_hmac_list[asoc->default_hmac_id];
+
+	/* Since we do not have a default entry, find the first entry
+	 * we support and return that.  Do not cache that id.
+	 */
+	hmacs = asoc->peer.peer_hmacs;
+	if (!hmacs)
+		return NULL;
+
+	n_elt = (ntohs(hmacs->param_hdr.length) - sizeof(sctp_paramhdr_t)) >> 1;
+	for (i = 0; i < n_elt; i++) {
+		id = ntohs(hmacs->hmac_ids[i]);
+
+		/* Check the id is in the supported range */
+		if (id > SCTP_AUTH_HMAC_ID_MAX)
+			continue;
+
+		/* See if we support the id.  Supported IDs have name and
+		 * length fields set, so that we can allocate and use
+		 * them.  We can safely just check for name, for without the
+		 * name, we can't allocate the TFM.
+		 */
+		if (!sctp_hmac_list[id].hmac_name)
+			continue;
+
+		break;
+	}
+
+	if (id == 0)
+		return NULL;
+
+	return &sctp_hmac_list[id];
+}
+
+static int __sctp_auth_find_hmacid(__u16 *hmacs, int n_elts, __u16 hmac_id)
+{
+	int  found = 0;
+	int  i;
+
+	for (i = 0; i < n_elts; i++) {
+		if (hmac_id == hmacs[i]) {
+			found = 1;
+			break;
+		}
+	}
+
+	return found;
+}
+
+/* See if the HMAC_ID is one that we claim as supported */
+int sctp_auth_asoc_verify_hmac_id(const struct sctp_association *asoc,
+				    __u16 hmac_id)
+{
+	struct sctp_hmac_algo_param *hmacs;
+	__u16 n_elt;
+
+	if (!asoc)
+		return 0;
+
+	hmacs = (struct sctp_hmac_algo_param *)asoc->c.auth_hmacs;
+	n_elt = (ntohs(hmacs->param_hdr.length) - sizeof(sctp_paramhdr_t)) >> 1;
+
+	return __sctp_auth_find_hmacid(hmacs->hmac_ids, n_elt, hmac_id);
+}
+
+
+/* Cache the default HMAC id.  This is to follow this text from SCTP-AUTH:
+ * Section 6.1:
+ *   The receiver of a HMAC-ALGO parameter SHOULD use the first listed
+ *   algorithm it supports.
+ */
+void sctp_auth_asoc_set_default_hmac(struct sctp_association *asoc,
+				     struct sctp_hmac_algo_param *hmacs)
+{
+	struct sctp_endpoint *ep;
+	__u16   id;
+	int	i;
+	int	n_params;
+
+	/* if the default id is already set, use it */
+	if (asoc->default_hmac_id)
+		return;
+
+	n_params = (ntohs(hmacs->param_hdr.length)
+				- sizeof(sctp_paramhdr_t)) >> 1;
+	ep = asoc->ep;
+	for (i = 0; i < n_params; i++) {
+		id = ntohs(hmacs->hmac_ids[i]);
+
+		/* Check the id is in the supported range */
+		if (id > SCTP_AUTH_HMAC_ID_MAX)
+			continue;
+
+		/* If this TFM has been allocated, use this id */
+		if (ep->auth_hmacs[id]) {
+			asoc->default_hmac_id = id;
+			break;
+		}
+	}
+}
+
+
+/* Check to see if the given chunk is supposed to be authenticated */
+static int __sctp_auth_cid(sctp_cid_t chunk, struct sctp_chunks_param *param)
+{
+	unsigned short len;
+	int found = 0;
+	int i;
+
+	if (!param)
+		return 0;
+
+	len = ntohs(param->param_hdr.length) - sizeof(sctp_paramhdr_t);
+
+	/* SCTP-AUTH, Section 3.2
+	 *    The chunk types for INIT, INIT-ACK, SHUTDOWN-COMPLETE and AUTH
+	 *    chunks MUST NOT be listed in the CHUNKS parameter.  However, if
+	 *    a CHUNKS parameter is received then the types for INIT, INIT-ACK,
+	 *    SHUTDOWN-COMPLETE and AUTH chunks MUST be ignored.
+	 */
+	for (i = 0; !found && i < len; i++) {
+		switch (param->chunks[i]) {
+		    case SCTP_CID_INIT:
+		    case SCTP_CID_INIT_ACK:
+		    case SCTP_CID_SHUTDOWN_COMPLETE:
+		    case SCTP_CID_AUTH:
+			break; 
+
+		    default:
+			if (param->chunks[i] == chunk)
+			    found = 1;
+			break;
+		}
+	}
+
+	return found;
+}
+
+/* Check if peer requested that this chunk is authenticated */
+int sctp_auth_send_cid(sctp_cid_t chunk, const struct sctp_association *asoc)
+{
+	if (!sctp_auth_enable || !asoc || !asoc->peer.auth_capable)
+		return 0;
+
+	return __sctp_auth_cid(chunk, asoc->peer.peer_chunks);
+}
+
+/* Check if we requested that peer authenticate this chunk. */
+int sctp_auth_recv_cid(sctp_cid_t chunk, const struct sctp_association *asoc)
+{
+	if (!sctp_auth_enable || !asoc)
+		return 0;
+
+	return __sctp_auth_cid(chunk,
+			      (struct sctp_chunks_param *)asoc->c.auth_chunks);
+}
+
+/* SCTP-AUTH: Section 6.2:
+ *    The sender MUST calculate the MAC as described in RFC2104 [2] using
+ *    the hash function H as described by the MAC Identifier and the shared
+ *    association key K based on the endpoint pair shared key described by
+ *    the shared key identifier.  The 'data' used for the computation of
+ *    the AUTH-chunk is given by the AUTH chunk with its HMAC field set to
+ *    zero (as shown in Figure 6) followed by all chunks that are placed
+ *    after the AUTH chunk in the SCTP packet.
+ */
+void sctp_auth_calculate_hmac(const struct sctp_association *asoc,
+			      struct sk_buff *skb,
+			      struct sctp_auth_chunk *auth,
+			      gfp_t gfp)
+{
+	struct scatterlist sg;
+	struct hash_desc desc;
+	struct sctp_auth_bytes *asoc_key;
+	__u16 key_id, hmac_id;
+	__u8 *digest;
+	unsigned char *end;
+	int free_key = 0;
+
+	/* Extract the info we need:
+	 * - hmac id
+	 * - key id
+	 */
+	key_id = ntohs(auth->auth_hdr.shkey_id);
+	hmac_id = ntohs(auth->auth_hdr.hmac_id);
+
+	if (key_id == asoc->active_key_id)
+		asoc_key = asoc->asoc_shared_key;
+	else {
+		struct sctp_shared_key *ep_key;
+		
+		ep_key = sctp_auth_get_shkey(asoc, key_id);
+		if (!ep_key)
+			return;
+
+		asoc_key = sctp_auth_asoc_create_secret(asoc, ep_key, gfp);
+		if (!asoc_key)
+			return;
+
+		free_key = 1;
+	}
+	    
+	/* set up scatter list */
+	end = skb_tail_pointer(skb);
+	sg.page = virt_to_page(auth);
+	sg.offset = (unsigned long)(auth) % PAGE_SIZE;
+	sg.length = end - (unsigned char *)auth;
+
+	desc.tfm = asoc->ep->auth_hmacs[hmac_id];
+	desc.flags = 0;
+
+	digest = auth->auth_hdr.hmac;
+	if (crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len))
+		goto free;
+
+	crypto_hash_digest(&desc, &sg, sg.length, digest);
+
+free:
+	if (free_key)
+		sctp_auth_key_put(asoc_key);
+}
diff --git a/net/sctp/objcnt.c b/net/sctp/objcnt.c
index fcfb9d8..2cf6ad6 100644
--- a/net/sctp/objcnt.c
+++ b/net/sctp/objcnt.c
@@ -58,6 +58,7 @@ SCTP_DBG_OBJCNT(chunk);
 SCTP_DBG_OBJCNT(addr);
 SCTP_DBG_OBJCNT(ssnmap);
 SCTP_DBG_OBJCNT(datamsg);
+SCTP_DBG_OBJCNT(keys);
 
 /* An array to make it easy to pretty print the debug information
  * to the proc fs.
@@ -73,6 +74,7 @@ static sctp_dbg_objcnt_entry_t sctp_dbg_objcnt[] = {
 	SCTP_DBG_OBJCNT_ENTRY(addr),
 	SCTP_DBG_OBJCNT_ENTRY(ssnmap),
 	SCTP_DBG_OBJCNT_ENTRY(datamsg),
+	SCTP_DBG_OBJCNT_ENTRY(keys),
 };
 
 /* Callback from procfs to read out objcount information.
-- 
1.5.2.4


^ permalink raw reply related

* [PATCH 5/8] SCTP: Enable the sending of the AUTH chunk.
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

SCTP-AUTH, Section 6.2:

   Endpoints MUST send all requested chunks authenticated where this has
   been requested by the peer.  The other chunks MAY be sent
   authenticated or not.  If endpoint pair shared keys are used, one of
   them MUST be selected for authentication.

   To send chunks in an authenticated way, the sender MUST include these
   chunks after an AUTH chunk.  This means that a sender MUST bundle
   chunks in order to authenticate them.

   If the endpoint has no endpoint pair shared key for the peer, it MUST
   use Shared Key Identifier 0 with an empty endpoint pair shared key.
   If there are multiple endpoint shared keys the sender selects one and
   uses the corresponding Shared Key Identifier.
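
The ordering rule quoted above can be illustrated with a small standalone
sketch.  This is not code from the patch: packet_order_ok() and
peer_wants_auth() are hypothetical helpers written only for illustration,
and the packet is modelled as a plain array of chunk type bytes.  The
value 0x0F for AUTH comes from patch 1/8; DATA = 0 and SACK = 3 are the
RFC 2960 chunk types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Chunk types: AUTH is 0x0F per SCTP-AUTH, DATA and SACK per RFC 2960. */
enum { CID_DATA = 0, CID_SACK = 3, CID_AUTH = 0x0F };

/* Hypothetical helper: true if `type` is in the peer's CHUNKS list. */
static bool peer_wants_auth(const uint8_t *requested, size_t n_requested,
			    uint8_t type)
{
	for (size_t i = 0; i < n_requested; i++)
		if (requested[i] == type)
			return true;
	return false;
}

/*
 * Hypothetical check: every chunk the peer asked to have authenticated
 * must come after an AUTH chunk in the same packet.  Chunks the peer
 * did not list may appear anywhere.
 */
static bool packet_order_ok(const uint8_t *chunks, size_t n_chunks,
			    const uint8_t *requested, size_t n_requested)
{
	bool seen_auth = false;

	for (size_t i = 0; i < n_chunks; i++) {
		if (chunks[i] == CID_AUTH) {
			seen_auth = true;
			continue;
		}
		if (!seen_auth &&
		    peer_wants_auth(requested, n_requested, chunks[i]))
			return false;
	}
	return true;
}
```

With a CHUNKS list of { DATA }, the sequence AUTH, DATA passes and DATA,
AUTH fails, which is why the patch bundles AUTH before appending the chunk
that needs it.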

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/sm.h      |    1 +
 include/net/sctp/structs.h |    3 +
 net/sctp/chunk.c           |   12 ++++
 net/sctp/output.c          |  131 +++++++++++++++++++++++++++++++++++---------
 net/sctp/sm_make_chunk.c   |   39 +++++++++++++
 5 files changed, 159 insertions(+), 27 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 991c85b..ee62897 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -254,6 +254,7 @@ int sctp_process_asconf_ack(struct sctp_association *asoc,
 struct sctp_chunk *sctp_make_fwdtsn(const struct sctp_association *asoc,
 				    __u32 new_cum_tsn, size_t nstreams,
 				    struct sctp_fwdtsn_skip *skiplist);
+struct sctp_chunk *sctp_make_auth(const struct sctp_association *asoc);
 
 void sctp_chunk_assign_tsn(struct sctp_chunk *);
 void sctp_chunk_assign_ssn(struct sctp_chunk *);
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index a668825..8cb7bc4 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -791,6 +791,9 @@ struct sctp_packet {
 	/* This packet contains an AUTH chunk */
 	__u8 has_auth;
 
+	/* This packet contains at least 1 DATA chunk */
+	__u8 has_data;
+
 	/* SCTP cannot fragment this packet. So let ip fragment it. */
 	__u8 ipfragok;
 
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 77fb7b0..619d0f2 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -194,6 +194,18 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
 
 	max = asoc->frag_point;
 
+	/* If the peer requested that we authenticate DATA chunks
+	 * we need to account for bundling of the AUTH chunks along with
+	 * DATA.
+	 */
+	if (sctp_auth_send_cid(SCTP_CID_DATA, asoc)) {
+		struct sctp_hmac *hmac_desc = sctp_auth_asoc_get_hmac(asoc);
+
+		if (hmac_desc)
+			max -= WORD_ROUND(sizeof(sctp_auth_chunk_t) +
+					    hmac_desc->hmac_len);
+	}
+
 	whole = 0;
 	first_len = max;
 
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 49b9f5f..847639d 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -80,6 +80,7 @@ struct sctp_packet *sctp_packet_config(struct sctp_packet *packet,
 	packet->has_cookie_echo = 0;
 	packet->has_sack = 0;
 	packet->has_auth = 0;
+	packet->has_data = 0;
 	packet->ipfragok = 0;
 	packet->auth = NULL;
 
@@ -124,6 +125,7 @@ struct sctp_packet *sctp_packet_init(struct sctp_packet *packet,
 	packet->has_cookie_echo = 0;
 	packet->has_sack = 0;
 	packet->has_auth = 0;
+	packet->has_data = 0;
 	packet->ipfragok = 0;
 	packet->malloced = 0;
 	packet->auth = NULL;
@@ -185,6 +187,39 @@ sctp_xmit_t sctp_packet_transmit_chunk(struct sctp_packet *packet,
 	return retval;
 }
 
+/* Try to bundle an auth chunk into the packet. */
+static sctp_xmit_t sctp_packet_bundle_auth(struct sctp_packet *pkt,
+					   struct sctp_chunk *chunk)
+{
+	struct sctp_association *asoc = pkt->transport->asoc;
+	struct sctp_chunk *auth;
+	sctp_xmit_t retval = SCTP_XMIT_OK;
+
+	/* if we don't have an association, we can't do authentication */
+	if (!asoc)
+		return retval;
+
+	/* See if this is an auth chunk we are bundling or if
+	 * auth is already bundled.
+	 */
+	if (chunk->chunk_hdr->type == SCTP_CID_AUTH || pkt->auth)
+		return retval;
+
+	/* if the peer did not request this chunk to be authenticated,
+	 * don't do it
+	 */
+	if (!chunk->auth)
+		return retval;
+
+	auth = sctp_make_auth(asoc);
+	if (!auth)
+		return retval;
+
+	retval = sctp_packet_append_chunk(pkt, auth);
+
+	return retval;
+}
+
 /* Try to bundle a SACK with the packet. */
 static sctp_xmit_t sctp_packet_bundle_sack(struct sctp_packet *pkt,
 					   struct sctp_chunk *chunk)
@@ -231,12 +266,17 @@ sctp_xmit_t sctp_packet_append_chunk(struct sctp_packet *packet,
 	SCTP_DEBUG_PRINTK("%s: packet:%p chunk:%p\n", __FUNCTION__, packet,
 			  chunk);
 
-	retval = sctp_packet_bundle_sack(packet, chunk);
-	psize = packet->size;
+	/* Try to bundle AUTH chunk */
+	retval = sctp_packet_bundle_auth(packet, chunk);
+	if (retval != SCTP_XMIT_OK)
+		goto finish;
 
+	/* Try to bundle SACK chunk */
+	retval = sctp_packet_bundle_sack(packet, chunk);
 	if (retval != SCTP_XMIT_OK)
 		goto finish;
 
+	psize = packet->size;
 	pmtu  = ((packet->transport->asoc) ?
 		 (packet->transport->asoc->pathmtu) :
 		 (packet->transport->pathmtu));
@@ -245,10 +285,16 @@ sctp_xmit_t sctp_packet_append_chunk(struct sctp_packet *packet,
 
 	/* Decide if we need to fragment or resubmit later. */
 	if (too_big) {
-		/* Both control chunks and data chunks with TSNs are
-		 * non-fragmentable.
+		/* It's OK to fragment at IP level if any one of the following
+		 * is true:
+		 * 	1. The packet is empty (meaning this chunk is greater
+		 * 	   than the MTU)
+		 * 	2. The chunk we are adding is a control chunk
+		 * 	3. The packet doesn't have any data in it yet and data
+		 * 	requires authentication.
 		 */
-		if (sctp_packet_empty(packet) || !sctp_chunk_is_data(chunk)) {
+		if (sctp_packet_empty(packet) || !sctp_chunk_is_data(chunk) ||
+		    (!packet->has_data && chunk->auth)) {
 			/* We no longer do re-fragmentation.
 			 * Just fragment at the IP layer, if we
 			 * actually hit this condition
@@ -270,16 +316,31 @@ append:
 	/* DATA is a special case since we must examine both rwnd and cwnd
 	 * before we send DATA.
 	 */
-	if (sctp_chunk_is_data(chunk)) {
+	switch (chunk->chunk_hdr->type) {
+	    case SCTP_CID_DATA:
 		retval = sctp_packet_append_data(packet, chunk);
 		/* Disallow SACK bundling after DATA. */
 		packet->has_sack = 1;
+		/* Disallow AUTH bundling after DATA */
+		packet->has_auth = 1;
+		/* Let it be known that packet has DATA in it */
+		packet->has_data = 1;
 		if (SCTP_XMIT_OK != retval)
 			goto finish;
-	} else if (SCTP_CID_COOKIE_ECHO == chunk->chunk_hdr->type)
+		break;
+	    case SCTP_CID_COOKIE_ECHO:
 		packet->has_cookie_echo = 1;
-	else if (SCTP_CID_SACK == chunk->chunk_hdr->type)
+		break;
+
+	    case SCTP_CID_SACK:
 		packet->has_sack = 1;
+		break;
+
+	    case SCTP_CID_AUTH:
+		packet->has_auth = 1;
+		packet->auth = chunk;
+		break;
+	}
 
 	/* It is OK to send this chunk.  */
 	list_add_tail(&chunk->list, &packet->chunk_list);
@@ -307,6 +368,8 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	int padding;		/* How much padding do we need?  */
 	__u8 has_data = 0;
 	struct dst_entry *dst = tp->dst;
+	unsigned char *auth = NULL;	/* pointer to auth in skb data */
+	__u32 cksum_buf_len = sizeof(struct sctphdr);
 
 	SCTP_DEBUG_PRINTK("%s: packet:%p\n", __FUNCTION__, packet);
 
@@ -360,16 +423,6 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 	sh->vtag     = htonl(packet->vtag);
 	sh->checksum = 0;
 
-	/* 2) Calculate the Adler-32 checksum of the whole packet,
-	 *    including the SCTP common header and all the
-	 *    chunks.
-	 *
-	 * Note: Adler-32 is no longer applicable, as has been replaced
-	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
-	 */
-	if (!(dst->dev->features & NETIF_F_NO_CSUM))
-		crc32 = sctp_start_cksum((__u8 *)sh, sizeof(struct sctphdr));
-
 	/**
 	 * 6.10 Bundling
 	 *
@@ -420,14 +473,16 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 		if (padding)
 			memset(skb_put(chunk->skb, padding), 0, padding);
 
-		if (dst->dev->features & NETIF_F_NO_CSUM)
-			memcpy(skb_put(nskb, chunk->skb->len),
+		/* if this is the auth chunk that we are adding,
+		 * store pointer where it will be added and put
+		 * the auth into the packet.
+		 */
+		if (chunk == packet->auth)
+			auth = skb_tail_pointer(nskb);
+
+		cksum_buf_len += chunk->skb->len;
+		memcpy(skb_put(nskb, chunk->skb->len),
 			       chunk->skb->data, chunk->skb->len);
-		else
-			crc32 = sctp_update_copy_cksum(skb_put(nskb,
-							chunk->skb->len),
-						chunk->skb->data,
-						chunk->skb->len, crc32);
 
 		SCTP_DEBUG_PRINTK("%s %p[%s] %s 0x%x, %s %d, %s %d, %s %d\n",
 				  "*** Chunk", chunk,
@@ -449,9 +504,31 @@ int sctp_packet_transmit(struct sctp_packet *packet)
 			sctp_chunk_free(chunk);
 	}
 
-	/* Perform final transformation on checksum. */
-	if (!(dst->dev->features & NETIF_F_NO_CSUM))
+	/* SCTP-AUTH, Section 6.2
+	 *    The sender MUST calculate the MAC as described in RFC2104 [2]
+	 *    using the hash function H as described by the MAC Identifier and
+	 *    the shared association key K based on the endpoint pair shared key
+	 *    described by the shared key identifier.  The 'data' used for the
+	 *    computation of the AUTH-chunk is given by the AUTH chunk with its
+	 *    HMAC field set to zero (as shown in Figure 6) followed by all
+	 *    chunks that are placed after the AUTH chunk in the SCTP packet.
+	 */
+	if (auth)
+		sctp_auth_calculate_hmac(asoc, nskb,
+					(struct sctp_auth_chunk *)auth,
+					GFP_ATOMIC);
+
+	/* 2) Calculate the Adler-32 checksum of the whole packet,
+	 *    including the SCTP common header and all the
+	 *    chunks.
+	 *
+	 * Note: Adler-32 is no longer applicable, as has been replaced
+	 * by CRC32-C as described in <draft-ietf-tsvwg-sctpcsum-02.txt>.
+	 */
+	if (!(dst->dev->features & NETIF_F_NO_CSUM)) {
+		crc32 = sctp_start_cksum((__u8 *)sh, cksum_buf_len);
 		crc32 = sctp_end_cksum(crc32);
+	}
 
 	/* 3) Put the resultant value into the checksum field in the
 	 *    common header, and leave the rest of the bits unchanged.
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index cd4eb21..7cd8241 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1111,6 +1111,41 @@ nodata:
 	return retval;
 }
 
+struct sctp_chunk *sctp_make_auth(const struct sctp_association *asoc)
+{
+	struct sctp_chunk *retval;
+	struct sctp_hmac *hmac_desc;
+	struct sctp_authhdr auth_hdr;
+	__u8 *hmac;
+
+	/* Get the first hmac that the peer told us to use */
+	hmac_desc = sctp_auth_asoc_get_hmac(asoc);
+	if (unlikely(!hmac_desc))
+		return NULL;
+
+	retval = sctp_make_chunk(asoc, SCTP_CID_AUTH, 0,
+			hmac_desc->hmac_len + sizeof(sctp_authhdr_t));
+	if (!retval)
+		return NULL;
+
+	auth_hdr.hmac_id = htons(hmac_desc->hmac_id);
+	auth_hdr.shkey_id = htons(asoc->active_key_id);
+
+	retval->subh.auth_hdr = sctp_addto_chunk(retval, sizeof(sctp_authhdr_t),
+						&auth_hdr);
+
+	hmac = skb_put(retval->skb, hmac_desc->hmac_len);
+	memset(hmac, 0, hmac_desc->hmac_len);
+
+	/* Adjust the chunk header to include the empty MAC */
+	retval->chunk_hdr->length =
+		htons(ntohs(retval->chunk_hdr->length) + hmac_desc->hmac_len);
+	retval->chunk_end = skb_tail_pointer(retval->skb);
+
+	return retval;
+}
+
+
 /********************************************************************
  * 2nd Level Abstractions
  ********************************************************************/
@@ -1225,6 +1260,10 @@ struct sctp_chunk *sctp_make_chunk(const struct sctp_association *asoc,
 	retval->chunk_hdr = chunk_hdr;
 	retval->chunk_end = ((__u8 *)chunk_hdr) + sizeof(struct sctp_chunkhdr);
 
+	/* Determine if the chunk needs to be authenticated */
+	if (sctp_auth_send_cid(type, asoc))
+		retval->auth = 1;
+
 	/* Set the skb to the belonging sock for accounting.  */
 	skb->sk = sk;
 
-- 
1.5.2.4


^ permalink raw reply related

* [RFC PATCH 0/8] Implement SCTP-AUTH specification.
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev

Hi All

The following series of 8 patches is the implementation of SCTP-AUTH
spec (RFC 4895) that was tested at the 9th SCTP Interop.  The code is
based on the implementation of the Supported Extensions parameter that
was sent out separately for inclusion into 2.6.24.

I'd really appreciate people reviewing this as I hope for this
to make the 2.6.24 code base.

Thanks
-vlad

^ permalink raw reply

* [PATCH 1/8] SCTP: protocol definitions for SCTP-AUTH implementation
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/linux/sctp.h |  100 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 99 insertions(+), 1 deletions(-)

diff --git a/include/linux/sctp.h b/include/linux/sctp.h
index f4d717b..5eb38cc 100644
--- a/include/linux/sctp.h
+++ b/include/linux/sctp.h
@@ -102,6 +102,9 @@ typedef enum {
         SCTP_CID_ECN_CWR		= 13,
         SCTP_CID_SHUTDOWN_COMPLETE	= 14,
 
+	/* AUTH Extension Section 4.1 */
+	SCTP_CID_AUTH			= 0x0F,
+
 	/* PR-SCTP Sec 3.2 */
 	SCTP_CID_FWD_TSN		= 0xC0,
 
@@ -180,6 +183,11 @@ typedef enum {
 	SCTP_PARAM_SUPPORTED_ADDRESS_TYPES	= __constant_htons(12),
 	SCTP_PARAM_ECN_CAPABLE			= __constant_htons(0x8000),
 
+	/* AUTH Extension Section 3 */
+	SCTP_PARAM_RANDOM			= __constant_htons(0x8002),
+	SCTP_PARAM_CHUNKS			= __constant_htons(0x8003),
+	SCTP_PARAM_HMAC_ALGO			= __constant_htons(0x8004),
+
 	/* Add-IP: Supported Extensions, Section 4.2 */
 	SCTP_PARAM_SUPPORTED_EXT	= __constant_htons(0x8008),
 
@@ -305,6 +313,24 @@ typedef struct sctp_supported_ext_param {
 	__u8 chunks[0];
 } __attribute__((packed)) sctp_supported_ext_param_t;
 
+/* AUTH Section 3.1 Random */
+typedef struct sctp_random_param {
+	sctp_paramhdr_t param_hdr;
+	__u8 random_val[0];
+} __attribute__((packed)) sctp_random_param_t;
+
+/* AUTH Section 3.2 Chunk List */
+typedef struct sctp_chunks_param {
+	sctp_paramhdr_t param_hdr;
+	__u8 chunks[0];
+} __attribute__((packed)) sctp_chunks_param_t;
+
+/* AUTH Section 3.3 HMAC Algorithm */
+typedef struct sctp_hmac_algo_param {
+	sctp_paramhdr_t param_hdr;
+	__be16 hmac_ids[0];
+} __attribute__((packed)) sctp_hmac_algo_param_t;
+
 /* RFC 2960.  Section 3.3.3 Initiation Acknowledgement (INIT ACK) (2):
  *   The INIT ACK chunk is used to acknowledge the initiation of an SCTP
  *   association.
@@ -471,7 +497,19 @@ typedef enum {
 	SCTP_ERROR_RSRC_LOW	= __constant_htons(0x0101),
 	SCTP_ERROR_DEL_SRC_IP	= __constant_htons(0x0102),
 	SCTP_ERROR_ASCONF_ACK   = __constant_htons(0x0103),
-	SCTP_ERROR_REQ_REFUSED	= __constant_htons(0x0104)
+	SCTP_ERROR_REQ_REFUSED	= __constant_htons(0x0104),
+
+	/* AUTH Section 4.  New Error Cause
+	 *
+	 * This section defines a new error cause that will be sent if an AUTH
+	 * chunk is received with an unsupported HMAC identifier.  The table
+	 * below illustrates the new error cause.
+	 *
+	 * Cause Code      Error Cause Name
+	 * --------------------------------------------------------------
+	 * 0x0105          Unsupported HMAC Identifier
+	 */
+	SCTP_ERROR_UNSUP_HMAC	= __constant_htons(0x0105)
 } sctp_error_t;
 
 
@@ -609,4 +647,64 @@ typedef struct sctp_addip_chunk {
 	sctp_addiphdr_t addip_hdr;
 } __attribute__((packed)) sctp_addip_chunk_t;
 
+/* AUTH
+ * Section 4.1  Authentication Chunk (AUTH)
+ *
+ *   This chunk is used to hold the result of the HMAC calculation.
+ *
+ *    0                   1                   2                   3
+ *    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ *   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *   | Type = 0x0F   |   Flags=0     |             Length            |
+ *   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *   |     Shared Key Identifier     |   HMAC Identifier             |
+ *   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *   |                                                               |
+ *   \                             HMAC                              /
+ *   /                                                               \
+ *   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ *   Type: 1 byte (unsigned integer)
+ *   	This value MUST be set to 0x0F for  all AUTH-chunks.
+ *
+ *   Flags: 1 byte (unsigned integer)
+ *	Set to zero on transmit and ignored on receipt.
+ *
+ *   Length: 2 bytes (unsigned integer)
+ *   	This value holds the length of the HMAC in bytes plus 8.
+ *
+ *  Shared Key Identifier: 2 bytes (unsigned integer)
+ *	This value describes which endpoint pair shared key is used.
+ *
+ *   HMAC Identifier: 2 bytes (unsigned integer)
+ *   	This value describes which message digest is being used.  Table 2
+ *	shows the currently defined values.
+ *
+ *    The following Table 2 shows the currently defined values for HMAC
+ *       identifiers.
+ *
+ *	 +-----------------+--------------------------+
+ *	 | HMAC Identifier | Message Digest Algorithm |
+ *	 +-----------------+--------------------------+
+ *	 | 0               | Reserved                 |
+ *	 | 1               | SHA-1 defined in [8]     |
+ *	 | 2               | Reserved                 |
+ *	 | 3               | SHA-256 defined in [8]   |
+ *	 +-----------------+--------------------------+
+ *
+ *
+ *   HMAC: n bytes (unsigned integer) This holds the result of the HMAC
+ *      calculation.
+ */
+typedef struct sctp_authhdr {
+	__be16 shkey_id;
+	__be16 hmac_id;
+	__u8   hmac[0];
+} __attribute__((packed)) sctp_authhdr_t;
+
+typedef struct sctp_auth_chunk {
+	sctp_chunkhdr_t chunk_hdr;
+	sctp_authhdr_t auth_hdr;
+} __attribute__((packed)) sctp_auth_chunk_t;
+
 #endif /* __LINUX_SCTP_H__ */
-- 
1.5.2.4


^ permalink raw reply related

* [PATCH 6/8] SCTP: Implement the receive and verification of AUTH chunk
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

This patch implements the receive path needed to process authenticated
chunks.  Add ability to process the AUTH chunk and handle edge cases
for authenticated COOKIE-ECHO as well.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/constants.h |    4 +-
 include/net/sctp/sm.h        |    1 +
 include/net/sctp/structs.h   |    8 ++
 net/sctp/associola.c         |   10 ++
 net/sctp/endpointola.c       |   29 ++++++
 net/sctp/input.c             |   65 +++++++++++--
 net/sctp/inqueue.c           |   19 ++++
 net/sctp/sm_statefuns.c      |  218 +++++++++++++++++++++++++++++++++++++++++-
 net/sctp/sm_statetable.c     |   33 +++++++
 9 files changed, 374 insertions(+), 13 deletions(-)

diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
index 777118f..da8354e 100644
--- a/include/net/sctp/constants.h
+++ b/include/net/sctp/constants.h
@@ -183,7 +183,9 @@ typedef enum {
 	SCTP_IERROR_NO_DATA,
 	SCTP_IERROR_BAD_STREAM,
 	SCTP_IERROR_BAD_PORTS,
-
+	SCTP_IERROR_AUTH_BAD_HMAC,
+	SCTP_IERROR_AUTH_BAD_KEYID,
+	SCTP_IERROR_PROTO_VIOLATION,
 } sctp_ierror_t;
 
 
diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index ee62897..4357307 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -144,6 +144,7 @@ sctp_state_fn_t sctp_sf_do_asconf_ack;
 sctp_state_fn_t sctp_sf_do_9_2_reshutack;
 sctp_state_fn_t sctp_sf_eat_fwd_tsn;
 sctp_state_fn_t sctp_sf_eat_fwd_tsn_fast;
+sctp_state_fn_t sctp_sf_eat_auth;
 
 /* Prototypes for primitive event state functions.  */
 sctp_state_fn_t sctp_sf_do_prm_asoc;
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 8cb7bc4..3215da4 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -719,6 +719,13 @@ struct sctp_chunk {
 	 */
 	struct sctp_transport *transport;
 
+	/* SCTP-AUTH:  For the special case inbound processing of COOKIE-ECHO
+	 * we need to save a pointer to the AUTH chunk, since the SCTP-AUTH
+	 * spec violates the principal premise that all chunks are processed
+	 * in order.
+	 */
+	struct sk_buff *auth_chunk;
+
 	__u8 rtt_in_progress;	/* Is this chunk used for RTT calculation? */
 	__u8 resent;		/* Has this chunk ever been retransmitted. */
 	__u8 has_tsn;		/* Does this chunk have a TSN yet? */
@@ -1060,6 +1067,7 @@ void sctp_inq_init(struct sctp_inq *);
 void sctp_inq_free(struct sctp_inq *);
 void sctp_inq_push(struct sctp_inq *, struct sctp_chunk *packet);
 struct sctp_chunk *sctp_inq_pop(struct sctp_inq *);
+struct sctp_chunkhdr *sctp_inq_peek(struct sctp_inq *);
 void sctp_inq_set_th_handler(struct sctp_inq *, work_func_t);
 
 /* This is the structure we use to hold outbound chunks.  You push
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 09e592b..5bf2009 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1015,6 +1015,16 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
 		state = asoc->state;
 		subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
 
+		/* SCTP-AUTH, Section 6.3:
+		 *    The receiver has a list of chunk types which it expects
+		 *    to be received only after an AUTH-chunk.  This list has
+		 *    been sent to the peer during the association setup.  It
+		 *    MUST silently discard these chunks if they are not placed
+		 *    after an AUTH chunk in the packet.
+		 */
+		if (sctp_auth_recv_cid(subtype.chunk, asoc) && !chunk->auth)
+			continue;
+
 		/* Remember where the last DATA chunk came from so we
 		 * know where to send the SACK.
 		 */
diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
index 84843ed..71aa266 100644
--- a/net/sctp/endpointola.c
+++ b/net/sctp/endpointola.c
@@ -413,6 +413,7 @@ static void sctp_endpoint_bh_rcv(struct work_struct *work)
 	sctp_subtype_t subtype;
 	sctp_state_t state;
 	int error = 0;
+	int first_time = 1;	/* is this the first time through the loop */
 
 	if (ep->base.dead)
 		return;
@@ -424,6 +425,29 @@ static void sctp_endpoint_bh_rcv(struct work_struct *work)
 	while (NULL != (chunk = sctp_inq_pop(inqueue))) {
 		subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
 
+		/* If the first chunk in the packet is AUTH, do special
+		 * processing specified in Section 6.3 of SCTP-AUTH spec
+		 */
+		if (first_time && (subtype.chunk == SCTP_CID_AUTH)) {
+			struct sctp_chunkhdr *next_hdr;
+
+			next_hdr = sctp_inq_peek(inqueue);
+			if (!next_hdr)
+				goto normal;
+
+			/* If the next chunk is COOKIE-ECHO, skip the AUTH
+			 * chunk while saving a pointer to it so we can do
+			 * Authentication later (during cookie-echo
+			 * processing).
+			 */
+			if (next_hdr->type == SCTP_CID_COOKIE_ECHO) {
+				chunk->auth_chunk = skb_clone(chunk->skb,
+								GFP_ATOMIC);
+				chunk->auth = 1;
+				continue;
+			}
+		}
+normal:
 		/* We might have grown an association since last we
 		 * looked, so try again.
 		 *
@@ -439,6 +463,8 @@ static void sctp_endpoint_bh_rcv(struct work_struct *work)
 		}
 
 		state = asoc ? asoc->state : SCTP_STATE_CLOSED;
+		if (sctp_auth_recv_cid(subtype.chunk, asoc) && !chunk->auth)
+			continue;
 
 		/* Remember where the last DATA chunk came from so we
 		 * know where to send the SACK.
@@ -462,5 +488,8 @@ static void sctp_endpoint_bh_rcv(struct work_struct *work)
 		 */
 		if (!sctp_sk(sk)->ep)
 			break;
+
+		if (first_time)
+			first_time = 0;
 	}
 }
diff --git a/net/sctp/input.c b/net/sctp/input.c
index 47e5601..5a97716 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -903,15 +903,6 @@ static struct sctp_association *__sctp_rcv_init_lookup(struct sk_buff *skb,
 
 	ch = (sctp_chunkhdr_t *) skb->data;
 
-	/* If this is INIT/INIT-ACK look inside the chunk too. */
-	switch (ch->type) {
-	case SCTP_CID_INIT:
-	case SCTP_CID_INIT_ACK:
-		break;
-	default:
-		return NULL;
-	}
-
 	/* The code below will attempt to walk the chunk and extract
 	 * parameter information.  Before we do that, we need to verify
 	 * that the chunk length doesn't cause overflow.  Otherwise, we'll
@@ -956,6 +947,60 @@ static struct sctp_association *__sctp_rcv_init_lookup(struct sk_buff *skb,
 	return NULL;
 }
 
+/* SCTP-AUTH, Section 6.3:
+*    If the receiver does not find a STCB for a packet containing an AUTH
+*    chunk as the first chunk and not a COOKIE-ECHO chunk as the second
+*    chunk, it MUST use the chunks after the AUTH chunk to look up an existing
+*    association.
+*
+* This means that any chunks that can help us identify the association need
+* to be looked at to find this association.
+*
+* TODO: The only chunk currently defined that can do that is ASCONF, but we
+* don't support that functionality yet.
+*/
+static struct sctp_association *__sctp_rcv_auth_lookup(struct sk_buff *skb,
+				      const union sctp_addr *paddr,
+				      const union sctp_addr *laddr,
+				      struct sctp_transport **transportp)
+{
+	/* XXX - walk through the chunks looking for something that can
+	 * help us find the association.  INIT, and INIT-ACK are not permitted.
+	 * That leaves ASCONF, but we don't support that yet.
+	 */
+	return NULL;
+}
+
+/*
+ * There are circumstances when we need to look inside the SCTP packet
+ * for information to help us find the association.   Examples
+ * include looking inside of INIT/INIT-ACK chunks or after the AUTH
+ * chunks.
+ */
+static struct sctp_association *__sctp_rcv_lookup_harder(struct sk_buff *skb,
+				      const union sctp_addr *paddr,
+				      const union sctp_addr *laddr,
+				      struct sctp_transport **transportp)
+{
+	sctp_chunkhdr_t *ch;
+
+	ch = (sctp_chunkhdr_t *) skb->data;
+
+	/* If this is INIT/INIT-ACK look inside the chunk too. */
+	switch (ch->type) {
+	case SCTP_CID_INIT:
+	case SCTP_CID_INIT_ACK:
+		return __sctp_rcv_init_lookup(skb, laddr, transportp);
+		break;
+
+	case SCTP_CID_AUTH:
+		return __sctp_rcv_auth_lookup(skb, paddr, laddr, transportp);
+		break;
+	}
+
+	return NULL;
+}
+
 /* Lookup an association for an inbound skb. */
 static struct sctp_association *__sctp_rcv_lookup(struct sk_buff *skb,
 				      const union sctp_addr *paddr,
@@ -971,7 +1016,7 @@ static struct sctp_association *__sctp_rcv_lookup(struct sk_buff *skb,
 	 * parameters within the INIT or INIT-ACK.
 	 */
 	if (!asoc)
-		asoc = __sctp_rcv_init_lookup(skb, laddr, transportp);
+		asoc = __sctp_rcv_lookup_harder(skb, paddr, laddr, transportp);
 
 	return asoc;
 }
diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index 88aa224..2e48613 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -100,6 +100,25 @@ void sctp_inq_push(struct sctp_inq *q, struct sctp_chunk *chunk)
 	q->immediate.func(&q->immediate);
 }
 
+/* Peek at the next chunk on the inqueue. */
+struct sctp_chunkhdr *sctp_inq_peek(struct sctp_inq *queue)
+{
+	struct sctp_chunk *chunk;
+	sctp_chunkhdr_t *ch = NULL;
+
+	chunk = queue->in_progress;
+	/* If there are no more chunks in this packet, say so */
+	if (chunk->singleton ||
+	    chunk->end_of_packet ||
+	    chunk->pdiscard)
+		    return NULL;
+
+	ch = (sctp_chunkhdr_t *)chunk->chunk_end;
+
+	return ch;
+}
+
+
 /* Extract a chunk from an SCTP inqueue.
  *
  * WARNING:  If you need to put the chunk on another queue, you need to
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 7d8e92f..21b0f48 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -118,6 +118,11 @@ static sctp_disposition_t sctp_sf_violation_ctsn(
 				     void *arg,
 				     sctp_cmd_seq_t *commands);
 
+static sctp_ierror_t sctp_sf_authenticate(const struct sctp_endpoint *ep,
+				    const struct sctp_association *asoc,
+				    const sctp_subtype_t type,
+				    struct sctp_chunk *chunk);
+
 /* Small helper function that checks if the chunk length
  * is of the appropriate length.  The 'required_length' argument
  * is set to be the size of a specific chunk we are testing.
@@ -470,8 +475,6 @@ sctp_disposition_t sctp_sf_do_5_1C_ack(const struct sctp_endpoint *ep,
 			      (sctp_init_chunk_t *)chunk->chunk_hdr, chunk,
 			      &err_chunk)) {
 
-		SCTP_INC_STATS(SCTP_MIB_ABORTEDS);
-
 		/* This chunk contains fatal error. It is to be discarded.
 		 * Send an ABORT, with causes if there is any.
 		 */
@@ -496,6 +499,22 @@ sctp_disposition_t sctp_sf_do_5_1C_ack(const struct sctp_endpoint *ep,
 			sctp_sf_tabort_8_4_8(ep, asoc, type, arg, commands);
 			error = SCTP_ERROR_INV_PARAM;
 		}
+
+		/* SCTP-AUTH, Section 6.3:
+		 *    It should be noted that if the receiver wants to tear
+		 *    down an association in an authenticated way only, the
+		 *    handling of malformed packets should not result in
+		 *    tearing down the association.
+		 *
+		 * This means that if we only want to abort associations
+		 * in an authenticated way (i.e. AUTH+ABORT), then we
+		 * can't destroy this association just because the packet
+		 * was malformed.
+		 */
+		if (sctp_auth_recv_cid(SCTP_CID_ABORT, asoc))
+			return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
+
+		SCTP_INC_STATS(SCTP_MIB_ABORTEDS);
 		return sctp_stop_t1_and_abort(commands, error, ECONNREFUSED,
 						asoc, chunk->transport);
 	}
@@ -674,6 +693,36 @@ sctp_disposition_t sctp_sf_do_5_1D_ce(const struct sctp_endpoint *ep,
 	if (error)
 		goto nomem_init;
 
+	/* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
+	 * is supposed to be authenticated and we have to do delayed
+	 * authentication.  We've just recreated the association using
+	 * the information in the cookie and now it's much easier to
+	 * do the authentication.
+	 */
+	if (chunk->auth_chunk) {
+		struct sctp_chunk auth;
+		sctp_ierror_t ret;
+
+		/* set-up our fake chunk so that we can process it */
+		auth.skb = chunk->auth_chunk;
+		auth.asoc = chunk->asoc;
+		auth.sctp_hdr = chunk->sctp_hdr;
+		auth.chunk_hdr = (sctp_chunkhdr_t *)skb_push(chunk->auth_chunk,
+					    sizeof(sctp_chunkhdr_t));
+		skb_pull(chunk->auth_chunk, sizeof(sctp_chunkhdr_t));
+		auth.transport = chunk->transport;
+
+		ret = sctp_sf_authenticate(ep, new_asoc, type, &auth);
+
+		/* We can now safely free the auth_chunk clone */
+		kfree_skb(chunk->auth_chunk);
+
+		if (ret != SCTP_IERROR_NO_ERROR) {
+			sctp_association_free(new_asoc);
+			return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
+		}
+	}
+
 	repl = sctp_make_cookie_ack(new_asoc, chunk);
 	if (!repl)
 		goto nomem_init;
@@ -3581,6 +3630,156 @@ gen_shutdown:
 }
 
 /*
+ * SCTP-AUTH Section 6.3 Receiving authenticated chunks
+ *
+ *    The receiver MUST use the HMAC algorithm indicated in the HMAC
+ *    Identifier field.  If this algorithm was not specified by the
+ *    receiver in the HMAC-ALGO parameter in the INIT or INIT-ACK chunk
+ *    during association setup, the AUTH chunk and all chunks after it MUST
+ *    be discarded and an ERROR chunk SHOULD be sent with the error cause
+ *    defined in Section 4.1.
+ *
+ *    If an endpoint with no shared key receives a Shared Key Identifier
+ *    other than 0, it MUST silently discard all authenticated chunks.  If
+ *    the endpoint has at least one endpoint pair shared key for the peer,
+ *    it MUST use the key specified by the Shared Key Identifier if a
+ *    key has been configured for that Shared Key Identifier.  If no
+ *    endpoint pair shared key has been configured for that Shared Key
+ *    Identifier, all authenticated chunks MUST be silently discarded.
+ *
+ * Verification Tag:  8.5 Verification Tag [Normal verification]
+ *
+ * The return value is an sctp_ierror_t code describing the result.
+ */
+static sctp_ierror_t sctp_sf_authenticate(const struct sctp_endpoint *ep,
+				    const struct sctp_association *asoc,
+				    const sctp_subtype_t type,
+				    struct sctp_chunk *chunk)
+{
+	struct sctp_authhdr *auth_hdr;
+	struct sctp_hmac *hmac;
+	unsigned int sig_len;
+	__u16 key_id;
+	__u8 *save_digest;
+	__u8 *digest;
+
+	/* Pull in the auth header, so we can do some more verification */
+	auth_hdr = (struct sctp_authhdr *)chunk->skb->data;
+	chunk->subh.auth_hdr = auth_hdr;
+	skb_pull(chunk->skb, sizeof(struct sctp_authhdr));
+
+	/* Make sure that we support the HMAC algorithm from the auth
+	 * chunk.
+	 */
+	if (!sctp_auth_asoc_verify_hmac_id(asoc, auth_hdr->hmac_id))
+		return SCTP_IERROR_AUTH_BAD_HMAC;
+
+	/* Make sure that the provided shared key identifier has been
+	 * configured
+	 */
+	key_id = ntohs(auth_hdr->shkey_id);
+	if (key_id != asoc->active_key_id && !sctp_auth_get_shkey(asoc, key_id))
+		return SCTP_IERROR_AUTH_BAD_KEYID;
+
+
+	/* Make sure that the length of the signature matches what
+	 * we expect.
+	 */
+	sig_len = ntohs(chunk->chunk_hdr->length) - sizeof(sctp_auth_chunk_t);
+	hmac = sctp_auth_get_hmac(ntohs(auth_hdr->hmac_id));
+	if (sig_len != hmac->hmac_len)
+		return SCTP_IERROR_PROTO_VIOLATION;
+
+	/* Now that we've done validation checks, we can compute and
+	 * verify the hmac.  The steps involved are:
+	 *  1. Save the digest from the chunk.
+	 *  2. Zero out the digest in the chunk.
+	 *  3. Compute the new digest
+	 *  4. Compare saved and new digests.
+	 */
+	digest = auth_hdr->hmac;
+	skb_pull(chunk->skb, sig_len);
+
+	save_digest = kmemdup(digest, sig_len, GFP_ATOMIC);
+	if (!save_digest)
+		goto nomem;
+
+	memset(digest, 0, sig_len);
+
+	sctp_auth_calculate_hmac(asoc, chunk->skb,
+				(struct sctp_auth_chunk *)chunk->chunk_hdr,
+				GFP_ATOMIC);
+
+	/* Discard the packet if the digests do not match */
+	if (memcmp(save_digest, digest, sig_len)) {
+		kfree(save_digest);
+		return SCTP_IERROR_BAD_SIG;
+	}
+
+	kfree(save_digest);
+	chunk->auth = 1;
+
+	return SCTP_IERROR_NO_ERROR;
+nomem:
+	return SCTP_IERROR_NOMEM;
+}
+
+sctp_disposition_t sctp_sf_eat_auth(const struct sctp_endpoint *ep,
+				    const struct sctp_association *asoc,
+				    const sctp_subtype_t type,
+				    void *arg,
+				    sctp_cmd_seq_t *commands)
+{
+	struct sctp_authhdr *auth_hdr;
+	struct sctp_chunk *chunk = arg;
+	struct sctp_chunk *err_chunk;
+	sctp_ierror_t error;
+
+	if (!sctp_vtag_verify(chunk, asoc)) {
+		sctp_add_cmd_sf(commands, SCTP_CMD_REPORT_BAD_TAG,
+				SCTP_NULL());
+		return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
+	}
+
+	/* Make sure that the AUTH chunk has valid length.  */
+	if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_auth_chunk)))
+		return sctp_sf_violation_chunklen(ep, asoc, type, arg,
+						  commands);
+
+	auth_hdr = (struct sctp_authhdr *)chunk->skb->data;
+	error = sctp_sf_authenticate(ep, asoc, type, chunk);
+	switch (error) {
+		case SCTP_IERROR_AUTH_BAD_HMAC:
+			/* Generate the ERROR chunk and discard the rest
+			 * of the packet
+			 */
+			err_chunk = sctp_make_op_error(asoc, chunk,
+							SCTP_ERROR_UNSUP_HMAC,
+							&auth_hdr->hmac_id,
+							sizeof(__u16));
+			if (err_chunk) {
+				sctp_add_cmd_sf(commands, SCTP_CMD_REPLY,
+						SCTP_CHUNK(err_chunk));
+			}
+			/* Fall Through */
+		case SCTP_IERROR_AUTH_BAD_KEYID:
+		case SCTP_IERROR_BAD_SIG:
+			return sctp_sf_pdiscard(ep, asoc, type, arg, commands);
+			break;
+		case SCTP_IERROR_PROTO_VIOLATION:
+			return sctp_sf_violation_chunklen(ep, asoc, type, arg,
+							  commands);
+			break;
+		case SCTP_IERROR_NOMEM:
+			return SCTP_DISPOSITION_NOMEM;
+		default:
+			break;
+	}
+
+	return SCTP_DISPOSITION_CONSUME;
+}
+
+/*
  * Process an unknown chunk.
  *
  * Section: 3.2. Also, 2.1 in the implementor's guide.
@@ -3766,6 +3965,20 @@ static sctp_disposition_t sctp_sf_abort_violation(
 	if (!abort)
 		goto nomem;
 
+	/* SCTP-AUTH, Section 6.3:
+	 *    It should be noted that if the receiver wants to tear
+	 *    down an association in an authenticated way only, the
+	 *    handling of malformed packets should not result in
+	 *    tearing down the association.
+	 *
+	 * This means that if we only want to abort associations
+	 * in an authenticated way (i.e. AUTH+ABORT), then we
+	 * can't destroy this association just because the packet
+	 * was malformed.
+	 */
+	if (sctp_auth_recv_cid(SCTP_CID_ABORT, asoc))
+		goto discard;
+
 	sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(abort));
 	SCTP_INC_STATS(SCTP_MIB_OUTCTRLCHUNKS);
 
@@ -3784,6 +3997,7 @@ static sctp_disposition_t sctp_sf_abort_violation(
 		SCTP_DEC_STATS(SCTP_MIB_CURRESTAB);
 	}
 
+discard:
 	sctp_add_cmd_sf(commands, SCTP_CMD_DISCARD_PACKET, SCTP_NULL());
 
 	SCTP_INC_STATS(SCTP_MIB_ABORTEDS);
diff --git a/net/sctp/sm_statetable.c b/net/sctp/sm_statetable.c
index 70a91ec..e7efccf 100644
--- a/net/sctp/sm_statetable.c
+++ b/net/sctp/sm_statetable.c
@@ -523,6 +523,34 @@ static const sctp_sm_table_entry_t prsctp_chunk_event_table[SCTP_NUM_PRSCTP_CHUN
 	TYPE_SCTP_FWD_TSN,
 }; /*state_fn_t prsctp_chunk_event_table[][] */
 
+#define TYPE_SCTP_AUTH { \
+	/* SCTP_STATE_EMPTY */ \
+	TYPE_SCTP_FUNC(sctp_sf_ootb), \
+	/* SCTP_STATE_CLOSED */ \
+	TYPE_SCTP_FUNC(sctp_sf_tabort_8_4_8), \
+	/* SCTP_STATE_COOKIE_WAIT */ \
+	TYPE_SCTP_FUNC(sctp_sf_discard_chunk), \
+	/* SCTP_STATE_COOKIE_ECHOED */ \
+	TYPE_SCTP_FUNC(sctp_sf_eat_auth), \
+	/* SCTP_STATE_ESTABLISHED */ \
+	TYPE_SCTP_FUNC(sctp_sf_eat_auth), \
+	/* SCTP_STATE_SHUTDOWN_PENDING */ \
+	TYPE_SCTP_FUNC(sctp_sf_eat_auth), \
+	/* SCTP_STATE_SHUTDOWN_SENT */ \
+	TYPE_SCTP_FUNC(sctp_sf_eat_auth), \
+	/* SCTP_STATE_SHUTDOWN_RECEIVED */ \
+	TYPE_SCTP_FUNC(sctp_sf_eat_auth), \
+	/* SCTP_STATE_SHUTDOWN_ACK_SENT */ \
+	TYPE_SCTP_FUNC(sctp_sf_eat_auth), \
+} /* TYPE_SCTP_AUTH */
+
+/* The primary index for this table is the chunk type.
+ * The secondary index for this table is the state.
+ */
+static const sctp_sm_table_entry_t auth_chunk_event_table[SCTP_NUM_AUTH_CHUNK_TYPES][SCTP_STATE_NUM_STATES] = {
+	TYPE_SCTP_AUTH,
+}; /*state_fn_t auth_chunk_event_table[][] */
+
 static const sctp_sm_table_entry_t
 chunk_event_table_unknown[SCTP_STATE_NUM_STATES] = {
 	/* SCTP_STATE_EMPTY */
@@ -976,5 +1004,10 @@ static const sctp_sm_table_entry_t *sctp_chunk_event_lookup(sctp_cid_t cid,
 			return &addip_chunk_event_table[1][state];
 	}
 
+	if (sctp_auth_enable) {
+		if (cid == SCTP_CID_AUTH)
+			return &auth_chunk_event_table[0][state];
+	}
+
 	return &chunk_event_table_unknown[state];
 }
-- 
1.5.2.4
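
The four verification steps listed in sctp_sf_authenticate() above (save the
digest, zero it in the chunk, recompute, compare) can be sketched in
user-space C. This is only a minimal sketch under stated assumptions:
toy_mac() is a hypothetical stand-in for a real keyed HMAC and is not
cryptographically secure, and sign_packet(), verify_digest() and the flat
buffer layout are invented for illustration. Unlike the patch, the sketch
computes into a separate buffer and restores the received digest afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAC_LEN 4	/* hypothetical digest size, not SHA-1's 20 bytes */

/* Hypothetical stand-in for a keyed HMAC (FNV-1a over key, then message).
 * NOT cryptographically secure; it only makes the sketch self-contained.
 */
static void toy_mac(const uint8_t *key, size_t key_len,
		    const uint8_t *msg, size_t msg_len,
		    uint8_t out[TOY_MAC_LEN])
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < key_len; i++)
		h = (h ^ key[i]) * 16777619u;
	for (i = 0; i < msg_len; i++)
		h = (h ^ msg[i]) * 16777619u;
	for (i = 0; i < TOY_MAC_LEN; i++)
		out[i] = (uint8_t)(h >> (8 * i));
}

/* Sender side: the digest field is zero while the MAC is computed. */
static void sign_packet(uint8_t *pkt, size_t pkt_len, size_t off,
			const uint8_t *key, size_t key_len)
{
	assert(off + TOY_MAC_LEN <= pkt_len);
	memset(pkt + off, 0, TOY_MAC_LEN);
	toy_mac(key, key_len, pkt, pkt_len, pkt + off);
}

/* Receiver side: returns 1 if the embedded digest matches, 0 otherwise. */
static int verify_digest(uint8_t *pkt, size_t pkt_len, size_t off,
			 const uint8_t *key, size_t key_len)
{
	uint8_t saved[TOY_MAC_LEN];
	uint8_t computed[TOY_MAC_LEN];
	int ok;

	assert(off + TOY_MAC_LEN <= pkt_len);

	memcpy(saved, pkt + off, TOY_MAC_LEN);		/* 1. save digest */
	memset(pkt + off, 0, TOY_MAC_LEN);		/* 2. zero it out */
	toy_mac(key, key_len, pkt, pkt_len, computed);	/* 3. recompute */
	ok = memcmp(saved, computed, TOY_MAC_LEN) == 0;	/* 4. compare */

	memcpy(pkt + off, saved, TOY_MAC_LEN);	/* restore received bytes */
	return ok;
}
```

The digest field has to be zeroed before recomputing because the sender
computed the MAC over the chunk with that field set to zero; comparing
without that step would never match.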


^ permalink raw reply related

* [PATCH 8/8] SCTP: Tie ADD-IP and AUTH functionality as required by spec.
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

ADD-IP spec requires AUTH. It is, in fact, dangerous without AUTH.
So, disable ADD-IP functionality if the peer claims to support
ADD-IP, but not AUTH.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/structs.h |    1 +
 net/sctp/sm_make_chunk.c   |   13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 3215da4..8e772a9 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1537,6 +1537,7 @@ struct sctp_association {
 		__u8    asconf_capable;  /* Does peer support ADDIP? */
 		__u8    prsctp_capable;  /* Can peer do PR-SCTP? */
 		__u8	auth_capable;	 /* Is peer doing SCTP-AUTH? */
+		__u8	addip_capable;	 /* Can peer do ADD-IP? */
 
 		__u32   adaptation_ind;	 /* Adaptation Code point. */
 
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 7cd8241..80ca98e 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1851,7 +1851,8 @@ static void sctp_process_ext_param(struct sctp_association *asoc,
 			    break;
 		    case SCTP_CID_ASCONF:
 		    case SCTP_CID_ASCONF_ACK:
-			    /* don't need to do anything for ASCONF */
+			    asoc->peer.addip_capable = 1;
+			    break;
 		    default:
 			    break;
 		}
@@ -2137,6 +2138,16 @@ int sctp_process_init(struct sctp_association *asoc, sctp_cid_t cid,
 					!asoc->peer.peer_hmacs))
 		asoc->peer.auth_capable = 0;
 
+
+	/* If the peer claims support for ADD-IP without support
+	 * for AUTH, disable support for ADD-IP.
+	 */
+	if (asoc->peer.addip_capable && !asoc->peer.auth_capable) {
+		asoc->peer.addip_disable_mask |= (SCTP_PARAM_ADD_IP |
+						  SCTP_PARAM_DEL_IP |
+						  SCTP_PARAM_SET_PRIMARY);
+	}
+
 	/* Walk list of transports, removing transports in the UNKNOWN state. */
 	list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
 		transport = list_entry(pos, struct sctp_transport, transports);
-- 
1.5.2.4
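
The rule this patch enforces (a peer that advertises ADD-IP without AUTH gets
its ADD-IP parameters disabled) can be sketched outside the kernel as follows.
This is a sketch under assumptions: struct peer_caps, the PARAM_* bit values,
tie_addip_to_auth() and addip_param_allowed() are hypothetical names for
illustration and do not correspond to kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for the kernel's parameter mask. */
enum {
	PARAM_ADD_IP      = 1u << 0,
	PARAM_DEL_IP      = 1u << 1,
	PARAM_SET_PRIMARY = 1u << 2,
};

/* Hypothetical subset of the per-peer capability state. */
struct peer_caps {
	uint8_t  addip_capable;		/* peer listed ASCONF/ASCONF-ACK */
	uint8_t  auth_capable;		/* peer negotiated SCTP-AUTH */
	uint32_t addip_disable_mask;	/* ADD-IP parameters we refuse */
};

/* Disable every ADD-IP parameter if the peer cannot authenticate them. */
static void tie_addip_to_auth(struct peer_caps *peer)
{
	assert(peer);
	if (peer->addip_capable && !peer->auth_capable)
		peer->addip_disable_mask |= PARAM_ADD_IP |
					    PARAM_DEL_IP |
					    PARAM_SET_PRIMARY;
}

/* Returns 1 if the given parameter bit is still allowed for this peer. */
static int addip_param_allowed(const struct peer_caps *peer, uint32_t bit)
{
	return !(peer->addip_disable_mask & bit);
}
```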


^ permalink raw reply related

* [PATCH 3/8] SCTP: Implement SCTP-AUTH initializations.
From: Vlad Yasevich @ 2007-09-14 18:44 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1189795499444-git-send-email-vladislav.yasevich@hp.com>

The patch initializes AUTH related members of the generic SCTP
structures and provides a way to enable/disable auth extension.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 net/sctp/associola.c   |   34 +++++++++++++++++++
 net/sctp/endpointola.c |   83 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/sctp/output.c      |    4 ++
 net/sctp/protocol.c    |    3 ++
 net/sctp/sysctl.c      |    9 +++++
 5 files changed, 133 insertions(+), 0 deletions(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 2ad1caf..b96c132 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -74,6 +74,8 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
 {
 	struct sctp_sock *sp;
 	int i;
+	sctp_paramhdr_t *p;
+	int err;
 
 	/* Retrieve the SCTP per socket area.  */
 	sp = sctp_sk((struct sock *)sk);
@@ -299,6 +301,30 @@ static struct sctp_association *sctp_association_init(struct sctp_association *a
 	asoc->default_timetolive = sp->default_timetolive;
 	asoc->default_rcv_context = sp->default_rcv_context;
 
+	/* AUTH related initializations */
+	INIT_LIST_HEAD(&asoc->endpoint_shared_keys);
+	err = sctp_auth_asoc_copy_shkeys(ep, asoc, gfp);
+	if (err)
+		goto fail_init;
+
+	asoc->active_key_id = ep->active_key_id;
+	asoc->asoc_shared_key = NULL;
+
+	asoc->default_hmac_id = 0;
+	/* Save the hmacs and chunks list into this association */
+	if (ep->auth_hmacs_list)
+		memcpy(asoc->c.auth_hmacs, ep->auth_hmacs_list,
+			ntohs(ep->auth_hmacs_list->param_hdr.length));
+	if (ep->auth_chunk_list)
+		memcpy(asoc->c.auth_chunks, ep->auth_chunk_list,
+			ntohs(ep->auth_chunk_list->param_hdr.length));
+
+	/* Get the AUTH random number for this association */
+	p = (sctp_paramhdr_t *)asoc->c.auth_random;
+	p->type = SCTP_PARAM_RANDOM;
+	p->length = htons(sizeof(sctp_paramhdr_t) + SCTP_AUTH_RANDOM_LENGTH);
+	get_random_bytes(p+1, SCTP_AUTH_RANDOM_LENGTH);
+
 	return asoc;
 
 fail_init:
@@ -408,6 +434,12 @@ void sctp_association_free(struct sctp_association *asoc)
 	if (asoc->addip_last_asconf)
 		sctp_chunk_free(asoc->addip_last_asconf);
 
+	/* AUTH - Free the endpoint shared keys */
+	sctp_auth_destroy_keys(&asoc->endpoint_shared_keys);
+
+	/* AUTH - Free the association shared key */
+	sctp_auth_key_put(asoc->asoc_shared_key);
+
 	sctp_association_put(asoc);
 }
 
@@ -1116,6 +1148,8 @@ void sctp_assoc_update(struct sctp_association *asoc,
 			sctp_assoc_set_id(asoc, GFP_ATOMIC);
 		}
 	}
+
+	/* SCTP-AUTH: XXX something needs to be done here */
 }
 
 /* Update the retran path for sending a retransmitted packet.
diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
index aba9258..84843ed 100644
--- a/net/sctp/endpointola.c
+++ b/net/sctp/endpointola.c
@@ -69,12 +69,56 @@ static struct sctp_endpoint *sctp_endpoint_init(struct sctp_endpoint *ep,
 						struct sock *sk,
 						gfp_t gfp)
 {
+	struct sctp_hmac_algo_param *auth_hmacs = NULL;
+	struct sctp_chunks_param *auth_chunks = NULL;
+	struct sctp_shared_key *null_key;
+	int err;
+
 	memset(ep, 0, sizeof(struct sctp_endpoint));
 
 	ep->digest = kzalloc(SCTP_SIGNATURE_SIZE, gfp);
 	if (!ep->digest)
 		return NULL;
 
+	if (sctp_auth_enable) {
+		/* Allocate space for HMACS and CHUNKS authentication
+		 * variables.  These are arrays that we encode directly
+		 * into parameters to make the rest of the operations easier.
+		 */
+		auth_hmacs = kzalloc(sizeof(sctp_hmac_algo_param_t) +
+				sizeof(__u16) * SCTP_AUTH_NUM_HMACS, gfp);
+		if (!auth_hmacs)
+			goto nomem;
+
+		auth_chunks = kzalloc(sizeof(sctp_chunks_param_t) +
+					SCTP_NUM_CHUNK_TYPES, gfp);
+		if (!auth_chunks)
+			goto nomem;
+
+		/* Initialize the HMACS parameter.
+		 * SCTP-AUTH: Section 3.3
+		 *    Every endpoint supporting SCTP chunk authentication MUST
+		 *    support the HMAC based on the SHA-1 algorithm.
+		 */
+		auth_hmacs->param_hdr.type = SCTP_PARAM_HMAC_ALGO;
+		auth_hmacs->param_hdr.length =
+					htons(sizeof(sctp_paramhdr_t) + 2);
+		auth_hmacs->hmac_ids[0] = htons(SCTP_AUTH_HMAC_ID_SHA1);
+
+		/* Initialize the CHUNKS parameter */
+		auth_chunks->param_hdr.type = SCTP_PARAM_CHUNKS;
+
+		/* If the Add-IP functionality is enabled, we must
+		 * authenticate ASCONF and ASCONF-ACK chunks
+		 */
+		if (sctp_addip_enable) {
+			auth_chunks->chunks[0] = SCTP_CID_ASCONF;
+			auth_chunks->chunks[1] = SCTP_CID_ASCONF_ACK;
+			auth_chunks->param_hdr.length =
+					htons(sizeof(sctp_paramhdr_t) + 2);
+		}
+	}
+
 	/* Initialize the base structure. */
 	/* What type of endpoint are we?  */
 	ep->base.type = SCTP_EP_TYPE_SOCKET;
@@ -115,7 +159,36 @@ static struct sctp_endpoint *sctp_endpoint_init(struct sctp_endpoint *ep,
 	ep->last_key = ep->current_key = 0;
 	ep->key_changed_at = jiffies;
 
+	/* SCTP-AUTH extensions */
+	INIT_LIST_HEAD(&ep->endpoint_shared_keys);
+	null_key = sctp_auth_shkey_create(0, GFP_KERNEL);
+	if (!null_key)
+		goto nomem;
+
+	list_add(&null_key->key_list, &ep->endpoint_shared_keys);
+
+	/* Allocate and initialize transform arrays for supported HMACs. */
+	err = sctp_auth_init_hmacs(ep, gfp);
+	if (err)
+		goto nomem_hmacs;
+
+	/* Add the null key to the endpoint shared keys list and
+	 * set the hmacs and chunks pointers.
+	 */
+	ep->auth_hmacs_list = auth_hmacs;
+	ep->auth_chunk_list = auth_chunks;
+
 	return ep;
+
+nomem_hmacs:
+	sctp_auth_destroy_keys(&ep->endpoint_shared_keys);
+nomem:
+	/* Free all allocations */
+	kfree(auth_hmacs);
+	kfree(auth_chunks);
+	kfree(ep->digest);
+	return NULL;
+
 }
 
 /* Create a sctp_endpoint with all that boring stuff initialized.
@@ -188,6 +261,16 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
 	/* Free the digest buffer */
 	kfree(ep->digest);
 
+	/* SCTP-AUTH: Free up AUTH related data such as shared keys
+	 * chunks and hmacs arrays that were allocated
+	 */
+	sctp_auth_destroy_keys(&ep->endpoint_shared_keys);
+	kfree(ep->auth_hmacs_list);
+	kfree(ep->auth_chunk_list);
+
+	/* AUTH - Free any allocated HMAC transform containers */
+	sctp_auth_destroy_hmacs(ep->auth_hmacs);
+
 	/* Cleanup. */
 	sctp_inq_free(&ep->base.inqueue);
 	sctp_bind_addr_free(&ep->base.bind_addr);
diff --git a/net/sctp/output.c b/net/sctp/output.c
index d85543d..49b9f5f 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -79,7 +79,9 @@ struct sctp_packet *sctp_packet_config(struct sctp_packet *packet,
 	packet->vtag = vtag;
 	packet->has_cookie_echo = 0;
 	packet->has_sack = 0;
+	packet->has_auth = 0;
 	packet->ipfragok = 0;
+	packet->auth = NULL;
 
 	if (ecn_capable && sctp_packet_empty(packet)) {
 		chunk = sctp_get_ecne_prepend(packet->transport->asoc);
@@ -121,8 +123,10 @@ struct sctp_packet *sctp_packet_init(struct sctp_packet *packet,
 	packet->vtag = 0;
 	packet->has_cookie_echo = 0;
 	packet->has_sack = 0;
+	packet->has_auth = 0;
 	packet->ipfragok = 0;
 	packet->malloced = 0;
+	packet->auth = NULL;
 	return packet;
 }
 
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index c49eb99..6919846 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1167,6 +1167,9 @@ SCTP_STATIC __init int sctp_init(void)
 	/* Enable PR-SCTP by default. */
 	sctp_prsctp_enable = 1;
 
+	/* Disable AUTH by default. */
+	sctp_auth_enable = 0;
+
 	sctp_sysctl_register();
 
 	INIT_LIST_HEAD(&sctp_address_families);
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index 39b10ee..0669778 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -254,6 +254,15 @@ static ctl_table sctp_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "auth_enable",
+		.data		= &sctp_auth_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec
+	},
 	{ .ctl_name = 0 }
 };
 
-- 
1.5.2.4
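
The HMAC-ALGO parameter that sctp_endpoint_init() prepares above (a parameter
header followed by one 16-bit identifier for SHA-1, so a length of header
size + 2) can be sketched as a byte-level encoder. This is a sketch under
assumptions: encode_hmac_algo() and put_be16() are hypothetical helpers, and
the parameter type value 0x8004 is taken from the SCTP-AUTH specification
(RFC 4895) rather than from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PARAM_HMAC_ALGO	0x8004	/* HMAC-ALGO parameter type (RFC 4895) */
#define HMAC_ID_SHA1	1	/* mandatory algorithm */
#define PARAM_HDR_LEN	4	/* 16-bit type + 16-bit length */

/* Store a 16-bit value in network byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/* Encode an HMAC-ALGO parameter carrying n identifiers.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t encode_hmac_algo(uint8_t *buf, size_t buf_len,
			       const uint16_t *ids, size_t n)
{
	size_t len = PARAM_HDR_LEN + 2 * n;
	size_t i;

	assert(buf && ids);
	if (len > buf_len || len > 0xffff)
		return 0;

	put_be16(buf, PARAM_HMAC_ALGO);
	put_be16(buf + 2, (uint16_t)len);	/* header + identifiers */
	for (i = 0; i < n; i++)
		put_be16(buf + PARAM_HDR_LEN + 2 * i, ids[i]);

	return len;
}
```

With a single SHA-1 identifier the encoded length is 6, which matches the
htons(sizeof(sctp_paramhdr_t) + 2) in the patch.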


^ permalink raw reply related

* Distributed storage. Move away from char device ioctls.
From: Evgeniy Polyakov @ 2007-09-14 18:54 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, linux-fsdevel

Hi.

I'm pleased to announce the fourth release of the distributed storage
subsystem, which allows forming a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes a new configuration interface (kernel connector over
netlink socket) and a number of fixes for various bugs found during the move
to it (in the error path).

Further TODO list includes:
* implement optional saving of mirroring/linear information on the remote
	nodes (simple)
* new redundancy algorithm (complex)
* some thoughts about a distributed filesystem tightly connected to DST
	(far-off plans so far)

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 0000000..bfc6984
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+an allowed number of operations. Nodes, each of which has its own start and
+size, are placed into storage by an appropriate algorithm, which remaps
+a logical sector number into the real node's sector. One can create
+custom algorithms, since DST has a pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+This algorithm uses the simple approach of concatenating storages into
+a single device of increased size. Essentially, the new device has
+a size equal to the sum of the sizes of the underlying nodes, and the
+nodes are placed one after another.
+
+  /----- Node 1 ---\                         /------ Node 3 ----\
+start              end                     start               end
+ |==================|========================|==================|
+ |                start                     end                 |
+ |                  \------- Node 2 ---------/                  |
+ |                                                              |
+start                                                          end
+ \-------------------------- DST storage ----------------------/
+
+			        /\
+			        ||
+			        ||
+
+			   IO operations
+
+			    Figure 1. 
+     3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when an
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node: the algorithm keeps track of the previous operation
+and knows where it stopped, so that the subsequent seek to the
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+                  IO operations
+                       ||
+                       ||
+                       \/
+
+|---------------- DST storage -------------------|
+|      prev position                             |
+|-------|------------ Node 1 --------------------|
+|                              prev pos          |
+|-------------------- Node 2 -----|--------------|
+|prev pos                                        |
+|---|---------------- Node 3 --------------------|
+
+		Figure 2.
+   3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement a number of callbacks,
+which must be registered at initialization time.
+
+struct dst_alg_ops
+{
+	int			(*add_node)(struct dst_node *n);
+	void			(*del_node)(struct dst_node *n);
+	int 			(*remap)(struct dst_request *req);
+	int			(*error)(struct kst_state *state, int err);
+	struct module 		*owner;
+};
+
+@add_node.
+This callback is invoked when a new node is being added to the storage,
+before the node is actually placed there and becomes accessible through
+it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage the algorithm can add the node into its private map.
+It must return zero on success or a negative value otherwise.
+
+@del_node.
+This callback is invoked when a node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It returns no value (its return type is void).
+
+@remap.
+This callback is invoked each time a new bio hits the storage.
+The request structure contains the BIO itself, a pointer to the node which
+originally stores the whole region under the given IO request, and various
+parameters used by the storage core to process this block request.
+It must return zero on success or a negative value otherwise. It is up to
+this method to perform all cleanup if remapping failed; for example, it must
+call kst_bio_endio() for the given request in case of error, which in turn
+will call bio_endio(). Note that the dst_request structure provided to this
+callback is allocated on the stack, so if there is a need to use it outside
+of the given function, it must be cloned (this will happen automatically
+in the state's push callback, but that copy will not be shared by any other
+user).
+
+@error.
+This callback is invoked for each error which happened while processing
+requests for remote nodes or when talking to the remote side
+of the local export node (state contains data related to data
+transfers over the network).
+If this function has fixed the given error, it must return 0; otherwise
+it must return a negative error value.
+
+@owner.
+This is the owning module, reference-counted automatically by DST core.
+
+An algorithm must provide its name and the above structure to the
+dst_alloc_alg() function, which will return a reference to the newly
+created algorithm.
+To remove it, one needs to call dst_remove_alg() with the given algorithm
+pointer.
diff --git a/Documentation/dst/dst.txt b/Documentation/dst/dst.txt
new file mode 100644
index 0000000..3b326aa
--- /dev/null
+++ b/Documentation/dst/dst.txt
@@ -0,0 +1,66 @@
+Distributed storage. Design and implementation.
+http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
+
+	     Evgeniy Polyakov
+
+This document is intended to briefly describe design and
+implementation details of the distributed storage project,
+which aims to make it possible to group physically and/or logically
+distributed storages into a single device.
+
+The main operational unit in the storage is a node. A node can represent
+either remote storage connected to the local machine, or a local
+device, or storage exported to the outside of the system.
+Here is a short explanation of the basic terms.
+
+Local node.
+This node is just a logical link between a block device (with given
+major and minor numbers) and a structure in the DST hierarchy,
+which represents the number of sectors in the area corresponding to the
+given block device. It can be a disk, a device mapper node or a block
+device stacked on top of other underlying DST nodes.
+
+Local export node.
+Essentially the same as a local node, but its data can also be accessed
+over the network. Remote clients can connect to a given local
+export node and read or write blocks according to its size.
+Blocks are then forwarded to the underlying local node and processed
+there according to the nature of the local node.
+
+Remote node.
+This type of node contains remotely accessible devices. One can think
+of remote nodes as remote disks, which can be connected to the
+local system and combined into a single storage. Remote nodes
+are presented as a number of sectors accessed over the network
+by the local machine, where the distributed storage is being formed.
+
+
+Each node or set of them can be formed into single array, which
+in turn becomes a local node, which can be exported further by stacking
+a local export node on top of it.
+
+Each storage by itself is just a set of contiguous logical blocks, with
+a number of allowed operations. Nodes, each of which has its own start
+and size, are placed into the storage by the appropriate algorithm, which
+remaps a logical sector number into the real node's sector. One can create
+one's own algorithms, since DST has a pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+More details can be found in the Documentation/dst/algorithms.txt file.
+
+The main goal of the distributed storage is to combine remote nodes into
+a single device, so each block IO request is sent over the network
+(in contrast, requests for local nodes are handled by the generic block
+layer features). Each network connection has a number of variables which
+describe it (socket, list of requests, error handling and so on),
+which form the kst_state structure. This network state is added into a
+per-socket polling state machine, and can be processed by a dedicated thread
+when it becomes ready. This system forms asynchronous IO for given block
+requests. If a block request can be processed without blocking, then
+no new structures are allocated and the async part of the state is not used.
+
+When the connection to the remote peer breaks, DST core tries to reconnect
+to the failed node and no requests are marked as erroneous; instead
+they live in the queue until the connection is re-established.
+
+Userspace code, setup documentation and examples can be found on project's
+homepage above.
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	This driver provides Support for ATA over Ethernet block
 	devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 0000000..5bb9de8
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,20 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	select CONNECTOR
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the nodes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 0000000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 0000000..584f99e
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,99 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node 	= dst_linear_add_node,
+	.del_node 	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 0000000..190d130
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,768 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+#define DST_MIRROR_MAX_CHUNKS		4096
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static ssize_t dst_mirror_chunk_mask_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+	struct dst_mirror_priv *priv = n->priv;
+	unsigned int i;
+	int rest = PAGE_SIZE;
+
+	for (i = 0; i < priv->chunk_num/BITS_PER_LONG; ++i) {
+		int bit, j;
+
+		for (j = 0; j < BITS_PER_LONG; ++j) {
+			bit = (priv->chunk[i] >> j) & 1;
+			sprintf(buf, "%c", (bit)?'+':'-');
+			buf++;
+		}
+
+		rest -= BITS_PER_LONG;
+
+		if (rest < BITS_PER_LONG)
+			break;
+	}
+
+	return PAGE_SIZE - rest;
+}
+
+static DEVICE_ATTR(chunks, 0444, dst_mirror_chunk_mask_show, NULL);
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_mirror_del_node(struct dst_node *n)
+{
+	struct dst_mirror_priv *priv = n->priv;
+
+	if (priv) {
+		vfree(priv->chunk);
+		kfree(priv);
+		n->priv = NULL;
+	}
+
+	if (n->device.parent == &n->st->device)
+		device_remove_file(&n->device, &dev_attr_chunks);
+}
+
+static void dst_mirror_handle_priv(struct dst_node *n)
+{
+	if (n->priv) {
+		int err;
+		err = device_create_file(&n->device, &dev_attr_chunks);
+	}
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_mirror_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+	struct dst_mirror_priv *priv;
+
+	if (st->disk_size)
+		st->disk_size = min(n->size, st->disk_size);
+	else
+		st->disk_size = n->size;
+
+	priv = kzalloc(sizeof(struct dst_mirror_priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->chunk_num = st->disk_size;
+
+	priv->chunk = vmalloc(priv->chunk_num/BITS_PER_LONG * sizeof(long));
+	if (!priv->chunk)
+		goto err_out_free;
+
+	memset(priv->chunk, 0, priv->chunk_num/BITS_PER_LONG * sizeof(long));
+
+	spin_lock_init(&priv->backlog_lock);
+	INIT_LIST_HEAD(&priv->backlog_list);
+
+	dprintk("%s: %llu:%llu, chunk_num: %u, disk_size: %llu.\n",
+			__func__, n->start, n->size,
+			priv->chunk_num, st->disk_size);
+
+	n->priv_callback = &dst_mirror_handle_priv;
+	n->priv = priv;
+
+	return 0;
+
+err_out_free:
+	kfree(priv);
+	return -ENOMEM;
+}
+
+static void dst_mirror_sync_destructor(struct bio *bio)
+{
+	struct bio_vec *bv;
+	int i;
+
+	bio_for_each_segment(bv, bio, i)
+		__free_page(bv->bv_page);
+	bio_free(bio, dst_mirror_bio_set);
+}
+
+static void dst_mirror_sync_requeue(struct dst_node *n)
+{
+	struct dst_mirror_priv *p = n->priv;
+	struct dst_request *req;
+	unsigned int num, idx, i;
+	u64 start;
+	unsigned long flags;
+	int err;
+
+	while (!list_empty(&p->backlog_list)) {
+		req = NULL;
+		spin_lock_irqsave(&p->backlog_lock, flags);
+		if (!list_empty(&p->backlog_list)) {
+			req = list_entry(p->backlog_list.next,
+					struct dst_request,
+					request_list_entry);
+			list_del(&req->request_list_entry);
+		}
+		spin_unlock_irqrestore(&p->backlog_lock, flags);
+
+		if (!req)
+			break;
+
+		start = req->start - to_sector(req->orig_size - req->size);
+
+		idx = start;
+		num = to_sector(req->orig_size);
+
+		for (i=0; i<num; ++i)
+			if (test_bit(idx+i, p->chunk))
+				break;
+
+		dprintk("%s: idx: %u, num: %u, i: %u, req: %p, "
+				"start: %llu, size: %llu.\n",
+				__func__, idx, num, i, req, 
+				req->start, req->orig_size);
+
+		err = -1;
+		if (i != num) {
+			err = kst_enqueue_req(n->state, req);
+			if (err) {
+				printk("%s: congestion [%c]: req: %p, "
+						"start: %llu, size: %llu.\n",
+					__func__,
+					(bio_rw(req->bio) == WRITE)?'W':'R',
+					req, req->start, req->size);
+				kst_del_req(req);
+			}
+		}
+		if (err) {
+			req->bio_endio(req, err);
+			dst_free_request(req);
+		}
+	}
+
+	kst_wake(n->state);
+}
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		printk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+	}
+}
+
+static void dst_mirror_mark_notsync(struct dst_node *n)
+{
+	if (!test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		set_bit(DST_NODE_NOTSYNC, &n->flags);
+		printk("%s: not synced node n: %p.\n", __func__, n);
+	}
+}
+
+/*
+ * Without errors it is always called under node's request lock,
+ * so it is safe to requeue them.
+ */
+static void dst_mirror_bio_error(struct dst_request *req, int err)
+{
+	int i;
+	struct dst_mirror_priv *priv = req->node->priv;
+	unsigned int num, idx;
+	void (*process_bit[])(int nr, volatile void *addr) =
+		{&__clear_bit, &__set_bit};
+	u64 start = req->start - to_sector(req->orig_size - req->size);
+
+	if (err)
+		dst_mirror_mark_notsync(req->node);
+	else
+		dst_mirror_sync_requeue(req->node);
+
+	priv->last_start = req->start;
+
+	idx = start;
+	num = to_sector(req->orig_size);
+
+	dprintk("%s: req_priv: %p, chunk %p, %llu:%llu start: %llu, size: %llu, "
+		"chunk_num: %u, idx: %d, num: %d, err: %d.\n",
+		__func__, req->priv, priv->chunk, req->node->start, 
+		req->node->size, start, req->orig_size, priv->chunk_num, 
+		idx, num, err);
+
+	if (unlikely(idx >= priv->chunk_num || idx + num > priv->chunk_num)) {
+		printk("%s: %llu:%llu req: %p, start: %llu, orig_size: %llu, "
+			"req_start: %llu, req_size: %llu, "
+			"chunk_num: %u, idx: %d, num: %d, err: %d.\n",
+			__func__, req->node->start, req->node->size, req,
+			start, req->orig_size, 
+			req->start, req->size,
+			priv->chunk_num, idx, num, err);
+		return;
+	}
+
+	for (i=0; i<num; ++i)
+		process_bit[!!err](idx+i, priv->chunk);
+}
+
+static void dst_mirror_sync_req_endio(struct dst_request *req, int err)
+{
+	unsigned long notsync = 0;
+	struct dst_mirror_priv *priv = req->node->priv;
+	int i;
+
+	dst_mirror_bio_error(req, err);
+
+	printk("%s: freeing bio: %p, bi_size: %u, "
+			"orig_size: %llu, req: %p, node: %p.\n",
+		__func__, req->bio, req->bio->bi_size, req->orig_size, req,
+		req->node);
+
+	bio_put(req->bio);
+
+	for (i = 0; i < priv->chunk_num/BITS_PER_LONG; ++i) {
+		notsync = priv->chunk[i];
+
+		if (notsync)
+			break;
+	}
+
+	if (!notsync)
+		dst_mirror_mark_sync(req->node);
+}
+
+static int dst_mirror_sync_endio(struct bio *bio, unsigned int size, int err)
+{
+	struct dst_request *req = bio->bi_private;
+	struct dst_node *n = req->node;
+	struct dst_mirror_priv *priv = n->priv;
+	unsigned long flags;
+
+	printk("%s: bio: %p, err: %d, size: %u, req: %p.\n",
+			__func__, bio, err, bio->bi_size, req);
+
+	if (bio->bi_size)
+		return 1;
+
+	bio->bi_rw = WRITE;
+	bio->bi_size = req->orig_size;
+	bio->bi_sector = req->start;
+
+	if (!err) {
+		spin_lock_irqsave(&priv->backlog_lock, flags);
+		list_add_tail(&req->request_list_entry, &priv->backlog_list);
+		spin_unlock_irqrestore(&priv->backlog_lock, flags);
+		kst_wake(req->state);
+	} else {
+		req->bio_endio(req, err);
+		dst_free_request(req);
+	}
+	return 0;
+}
+
+static int dst_mirror_sync_block(struct dst_node *n,
+		int bit_start, int bit_num)
+{
+	u64 start = to_bytes(bit_start);
+	struct bio *bio;
+	unsigned int nr_pages = to_bytes(bit_num)/PAGE_SIZE, i;
+	struct page *page;
+	int err = -ENOMEM;
+	struct dst_request *req;
+
+	printk("%s: bit_start: %d, bit_num: %d, start: %llu, nr_pages: %u, "
+			"disk_size: %llu.\n",
+			__func__, bit_start, bit_num, start, nr_pages,
+			n->st->disk_size);
+
+	while (nr_pages) {
+		req = dst_clone_request(NULL, n->w->req_pool);
+		if (!req)
+			return -ENOMEM;
+
+		bio = bio_alloc_bioset(GFP_NOIO, nr_pages, dst_mirror_bio_set);
+		if (!bio)
+			goto err_out_free_req;
+
+		bio->bi_rw = READ;
+		bio->bi_private = req;
+		bio->bi_sector = to_sector(start);
+		bio->bi_bdev = NULL;
+		bio->bi_destructor = dst_mirror_sync_destructor;
+		bio->bi_end_io = dst_mirror_sync_endio;
+
+		for (i = 0; i < nr_pages; ++i) {
+			err = -ENOMEM;
+
+			page = alloc_page(GFP_NOIO);
+			if (!page)
+				break;
+
+			err = bio_add_pc_page(n->st->queue, bio,
+					page, PAGE_SIZE, 0);
+			if (err <= 0)
+				break;
+			err = 0;
+		}
+
+		if (err && !bio->bi_vcnt)
+			goto err_out_put_bio;
+
+		req->node = n;
+		req->state = n->state;
+		req->start = bio->bi_sector;
+		req->size = req->orig_size = bio->bi_size;
+		req->bio = bio;
+		req->idx = bio->bi_idx;
+		req->num = bio->bi_vcnt;
+		req->bio_endio = &dst_mirror_sync_req_endio;
+		req->callback = &kst_data_callback;
+
+		dprintk("%s: start: %llu, size(pages): %u, bio: %p, "
+				"size: %u, cnt: %d, req: %p, size: %llu.\n",
+				__func__, bio->bi_sector, nr_pages, bio,
+				bio->bi_size, bio->bi_vcnt, req, req->size);
+
+		err = n->st->queue->make_request_fn(n->st->queue, bio);
+		if (err)
+			goto err_out_put_bio;
+
+		nr_pages -= bio->bi_vcnt;
+		start += bio->bi_size;
+	}
+
+	return 0;
+
+err_out_put_bio:
+	bio_put(bio);
+err_out_free_req:
+	dst_free_request(req);
+	return err;
+}
+
+/*
+ * Resync logic.
+ *
+ * System allocates and queues requests for a number of regions.
+ * Each request initially reads from one of the nodes.
+ * When it is completed, system checks if the given region was already
+ * written to, and in such case just drops the read request, otherwise
+ * it writes it to the node being updated. Any write clears the not-uptodate
+ * bit, which flags whether the region must still be synchronized.
+ * Reading is never performed from the node under resync.
+ */
+static int dst_mirror_resync(struct dst_node *n)
+{
+	int err = 0, sync = 0;
+	struct dst_mirror_priv *priv = n->priv;
+	unsigned int i;
+
+	printk("%s: node: %p, %llu:%llu synchronization has been started.\n",
+			__func__, n, n->start, n->size);
+
+	for (i = 0; i < priv->chunk_num/BITS_PER_LONG; ++i) {
+		int bit, num, start;
+		unsigned long word = priv->chunk[i];
+
+		if (!word)
+			continue;
+
+		num = 0;
+		start = -1;
+		while (word && num < BITS_PER_LONG) {
+			bit = __ffs(word);
+			if (start == -1)
+				start = bit;
+			num++;
+			word >>= (bit+1);
+		}
+
+		if (start != -1) {
+			err = dst_mirror_sync_block(n, start + i*BITS_PER_LONG,
+					num);
+			if (err)
+				break;
+			sync++;
+		}
+	}
+
+	if (!sync && !err)
+		dst_mirror_mark_sync(n);
+
+	return err;
+}
+
+static void dst_mirror_destructor(struct bio *bio)
+{
+	dprintk("%s: bio: %p.\n", __func__, bio);
+	bio_free(bio, dst_mirror_bio_set);
+}
+
+static int dst_mirror_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct dst_request *req = bio->bi_private;
+
+	if (bio->bi_size)
+		return 0;
+
+	dprintk("%s: req: %p, bio: %p, req->bio: %p, err: %d.\n",
+			__func__, req, bio, req->bio, err);
+	req->bio_endio(req, err);
+	bio_put(bio);
+	return 0;
+}
+
+static void dst_mirror_read_endio(struct dst_request *req, int err)
+{
+	dst_mirror_bio_error(req, err);
+
+	if (!err)
+		kst_bio_endio(req, 0);
+}
+
+static void dst_mirror_write_endio(struct dst_request *req, int err)
+{
+	dst_mirror_bio_error(req, err);
+
+	req = req->priv;
+
+	dprintk("%s: req: %p, priv: %p err: %d, bio: %p, "
+			"cnt: %d, orig_size: %llu.\n",
+		__func__, req, req->priv, err, req->bio,
+		atomic_read(&req->refcnt), req->orig_size);
+
+	if (atomic_dec_and_test(&req->refcnt)) {
+		dprintk("%s: freeing bio %p.\n", __func__, req->bio);
+		bio_endio(req->bio, req->orig_size, 0);
+		dst_free_request(req);
+	}
+}
+
+static int dst_mirror_process_request(struct dst_request *req,
+		struct dst_node *n)
+{
+	int err = 0;
+
+	/*
+	 * The block layer requires the bio to be cloned.
+	 */
+	if (n->bdev) {
+		struct bio *clone = bio_alloc_bioset(GFP_NOIO,
+			req->bio->bi_max_vecs, dst_mirror_bio_set);
+
+		__bio_clone(clone, req->bio);
+
+		clone->bi_bdev = n->bdev;
+		clone->bi_destructor = dst_mirror_destructor;
+		clone->bi_private = req;
+		clone->bi_end_io = &dst_mirror_end_io;
+
+		dprintk("%s: clone: %p, bio: %p, req: %p.\n",
+				__func__, clone, req->bio, req);
+
+		generic_make_request(clone);
+	} else {
+		struct dst_request nr;
+		/*
+		 * Network state processing engine will clone the request
+		 * by itself if needed. We can not use the same structure
+		 * here, since a number of its fields will be modified.
+		 */
+		memcpy(&nr, req, sizeof(struct dst_request));
+
+		nr.node = n;
+		nr.state = n->state;
+		nr.priv = req;
+
+		err = kst_check_permissions(n->state, req->bio);
+		if (!err)
+			err = n->state->ops->push(&nr);
+	}
+
+	dprintk("%s: req: %p, n: %p, bdev: %p, err: %d.\n",
+			__func__, req, n, n->bdev, err);
+	return err;
+}
+
+static int dst_mirror_write(struct dst_request *oreq)
+{
+	struct dst_node *n, *node = oreq->node;
+	struct dst_request *req;
+	int num, err = 0, err_num = 0, orig_num;
+
+	req = dst_clone_request(oreq, oreq->node->w->req_pool);
+	if (!req) {
+		kst_bio_endio(oreq, -ENOMEM);
+		return -ENOMEM;
+	}
+
+	req->priv = req;
+
+	/*
+	 * This logic is pretty simple - req->bio_endio will not
+	 * call bio_endio() until all mirror devices completed
+	 * processing of the request (no matter with or without error).
+	 * Mirror's req->bio_endio callback will take care of that.
+	 */
+	orig_num = num = atomic_read(&req->node->shared_num) + 1;
+	atomic_set(&req->refcnt, num);
+
+	req->bio_endio = &dst_mirror_write_endio;
+
+	dprintk("\n%s: req: %p, mirror to %d nodes.\n",
+			__func__, req, num);
+
+	err = dst_mirror_process_request(req, node);
+	if (err)
+		err_num++;
+
+	if (--num) {
+		list_for_each_entry(n, &node->shared, shared) {
+			dprintk("\n%s: req: %p, start: %llu, size: %llu, "
+					"num: %d, n: %p, state: %p.\n",
+				__func__, req, req->start, 
+				req->size, num, n, n->state);
+
+			err = dst_mirror_process_request(req, n);
+			if (err)
+				err_num++;
+
+			if (--num <= 0)
+				break;
+		}
+	}
+
+	if (err_num == orig_num) {
+		dprintk("%s: req: %p, num: %d, err: %d.\n",
+				__func__, req, num, err);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static int dst_mirror_read(struct dst_request *req)
+{
+	struct dst_node *node = req->node, *n, *min_dist_node;
+	struct dst_mirror_priv *priv = node->priv;
+	u64 dist, d;
+	int err;
+
+	req->bio_endio = &dst_mirror_read_endio;
+
+	do {
+		err = -ENODEV;
+		min_dist_node = NULL;
+		dist = -1ULL;
+ 
+		/*
+		 * Reading is never performed from the node under resync.
+		 * If this causes any trouble (e.g. all nodes must be
+		 * resynced between each other), this check can be removed
+		 * and the per-chunk dirty bit can be tested instead.
+		 */
+
+		if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) {
+			priv = node->priv;
+			if (req->start > priv->last_start)
+				dist = req->start - priv->last_start;
+			else
+				dist = priv->last_start - req->start;
+			min_dist_node = req->node;
+		}
+
+		list_for_each_entry(n, &node->shared, shared) {
+			if (test_bit(DST_NODE_NOTSYNC, &n->flags))
+				continue;
+
+			priv = n->priv;
+			if (req->start > priv->last_start)
+				d = req->start - priv->last_start;
+			else
+				d = priv->last_start - req->start;
+			if (d < dist) {
+				dist = d;
+				min_dist_node = n;
+			}
+		}
+
+		if (!min_dist_node)
+			break;
+
+		req->node = min_dist_node;
+		req->state = req->node->state;
+
+		if (req->node->bdev) {
+			req->bio->bi_bdev = req->node->bdev;
+			generic_make_request(req->bio);
+			err = 0;
+			break;
+		}
+
+		err = req->state->ops->push(req);
+		if (err) {
+			printk("%s: 1 req: %p, bio: %p, node: %p, err: %d.\n",
+				__func__, req, req->bio, min_dist_node, err);
+			dst_mirror_mark_notsync(req->node);
+		}
+	} while (err && min_dist_node);
+
+	if (err) {
+		printk("%s: req: %p, bio: %p, node: %p, err: %d.\n",
+			__func__, req, req->bio, min_dist_node, err);
+		kst_bio_endio(req, err);
+	}
+	return err;
+}
+
+/*
+ * This callback is invoked from block layer request processing function,
+ * its task is to remap block request to different nodes.
+ */
+static int dst_mirror_remap(struct dst_request *req)
+{
+	int (*remap[])(struct dst_request *) = 
+		{&dst_mirror_read, &dst_mirror_write};
+
+	return remap[bio_rw(req->bio) == WRITE](req);
+}
+
+static int dst_mirror_error(struct kst_state *st, int err)
+{
+	struct dst_request *req, *tmp;
+	unsigned int revents = st->socket->ops->poll(NULL, st->socket, NULL);
+
+	if (err == -EEXIST)
+		return err;
+
+	if (!(revents & (POLLERR | POLLHUP))) {
+		if (test_bit(DST_NODE_NOTSYNC, &st->node->flags)) {
+			return dst_mirror_resync(st->node);
+		}
+		return 0;
+	}
+
+	dst_mirror_mark_notsync(st->node);
+
+	mutex_lock(&st->request_lock);
+	list_for_each_entry_safe(req, tmp, &st->request_list,
+					request_list_entry) {
+		kst_del_req(req);
+		dprintk("%s: requeue [%c], start: %llu, idx: %d,"
+				" num: %d, size: %llu, offset: %u, err: %d.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size,
+			req->offset, err);
+
+		if (bio_rw(req->bio) == READ) {
+			req->start -= to_sector(req->orig_size - req->size);
+			req->size = req->orig_size;
+			req->flags &= ~DST_REQ_HEADER_SENT;
+			req->idx = 0;
+			if (dst_mirror_read(req))
+				kst_complete_req(req, err);
+			else
+				dst_free_request(req);
+		} else {
+			kst_complete_req(req, err);
+		}
+	}
+	mutex_unlock(&st->request_lock);
+	return err;
+}
+
+static struct dst_alg_ops alg_mirror_ops = {
+	.remap		= dst_mirror_remap,
+	.add_node	= dst_mirror_add_node,
+	.del_node	= dst_mirror_del_node,
+	.error		= dst_mirror_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_mirror_init(void)
+{
+	int err = -ENOMEM;
+
+	dst_mirror_bio_set = bioset_create(256, 256);
+	if (!dst_mirror_bio_set)
+		return -ENOMEM;
+
+	alg_mirror = dst_alloc_alg("alg_mirror", &alg_mirror_ops);
+	if (!alg_mirror)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	bioset_free(dst_mirror_bio_set);
+	return err;
+}
+
+static void __devexit alg_mirror_exit(void)
+{
+	dst_remove_alg(alg_mirror);
+	bioset_free(dst_mirror_bio_set);
+}
+
+module_init(alg_mirror_init);
+module_exit(alg_mirror_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_DESCRIPTION("Mirror distributed algorithm.");
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 0000000..d6374ff
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1527 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/connector.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+	return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name 		= "dst",
+	.match 		= &dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus 		= &dst_dev_bus_type,
+	.release 	= &dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+	.release 	= &dst_node_release
+};
+
+static struct bio_set *dst_bio_set;
+
+static void dst_destructor(struct bio *bio)
+{
+	bio_free(bio, dst_bio_set);
+}
+
+/*
+ * Internal callback for local requests (i.e. for local disk),
+ * which are split between nodes (part with local node destination
+ * ends up with this ->bi_end_io() callback).
+ */
+static int dst_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct bio *orig_bio = bio->bi_private;
+
+	if (bio->bi_size)
+		return 0;
+
+	dprintk("%s: bio: %p, orig_bio: %p, size: %u, orig_size: %u.\n",
+		__func__, bio, orig_bio, size, orig_bio->bi_size);
+
+	bio_endio(orig_bio, size, err);
+	bio_put(bio);
+	return 0;
+}
+
+/*
+ * This function sends processing request down to block layer (for local node)
+ * or to network state machine (for remote node).
+ */
+static int dst_node_push(struct dst_request *req)
+{
+	int err = 0;
+	struct dst_node *n = req->node;
+
+	if (n->bdev) {
+		struct bio *bio = req->bio;
+
+		dprintk("%s: start: %llu, num: %d, idx: %d, offset: %u, "
+				"size: %llu, bi_idx: %d, bi_vcnt: %d.\n",
+			__func__, req->start, req->num, req->idx,
+			req->offset, req->size, bio->bi_idx, bio->bi_vcnt);
+
+		if (likely(bio->bi_idx == req->idx &&
+					bio->bi_vcnt == req->num)) {
+			bio->bi_bdev = n->bdev;
+			bio->bi_sector = req->start;
+		} else {
+			struct bio *clone = bio_alloc_bioset(GFP_NOIO,
+					bio->bi_max_vecs, dst_bio_set);
+			struct bio_vec *bv;
+
+			err = -ENOMEM;
+			if (!clone)
+				goto out_put;
+
+			__bio_clone(clone, bio);
+
+			bv = bio_iovec_idx(clone, req->idx);
+			bv->bv_offset += req->offset;
+			clone->bi_idx = req->idx;
+			clone->bi_vcnt = req->num;
+			clone->bi_bdev = n->bdev;
+			clone->bi_sector = req->start;
+			clone->bi_destructor = dst_destructor;
+			clone->bi_private = bio;
+			clone->bi_size = req->orig_size;
+			clone->bi_end_io = &dst_end_io;
+			req->bio = clone;
+
+			dprintk("%s: start: %llu, num: %d, idx: %d, "
+				"offset: %u, size: %llu, "
+				"bi_idx: %d, bi_vcnt: %d, req: %p, bio: %p.\n",
+				__func__, req->start, req->num, req->idx,
+				req->offset, req->size,
+				clone->bi_idx, clone->bi_vcnt, req, req->bio);
+
+		}
+	}
+
+	err = n->st->alg->ops->remap(req);
+
+out_put:
+	dst_node_put(n);
+	return err;
+}
+
+/*
+ * This function is invoked from the block layer request processing function.
+ * Its task is to remap a block request to different nodes.
+ */
+static int dst_remap(struct dst_storage *st, struct bio *bio)
+{
+	struct dst_node *n;
+	int err = -EINVAL, i, cnt;
+	unsigned int bio_sectors = bio->bi_size>>9;
+	struct bio_vec *bv;
+	struct dst_request req;
+	u64 rest_in_node, start, total_size;
+
+	mutex_lock(&st->tree_lock);
+	n = dst_storage_tree_search(st, bio->bi_sector);
+	mutex_unlock(&st->tree_lock);
+
+	if (!n) {
+		dprintk("%s: failed to find a node for bio: %p, "
+				"sector: %llu.\n",
+				__func__, bio, bio->bi_sector);
+		return -ENODEV;
+	}
+
+	dprintk("%s: bio: %llu-%llu, dev: %llu-%llu, in sectors.\n",
+			__func__, bio->bi_sector, bio->bi_sector+bio_sectors,
+			n->start, n->start+n->size);
+
+	memset(&req, 0, sizeof(struct dst_request));
+
+	start = bio->bi_sector;
+	total_size = bio->bi_size;
+
+	req.flags = (test_bit(DST_NODE_FROZEN, &n->flags))?
+				DST_REQ_ALWAYS_QUEUE:0;
+	req.start = start - n->start;
+	req.offset = 0;
+	req.state = n->state;
+	req.node = n;
+	req.bio = bio;
+
+	req.size = bio->bi_size;
+	req.orig_size = bio->bi_size;
+	req.idx = bio->bi_idx;
+	req.num = bio->bi_vcnt;
+
+	req.bio_endio = &kst_bio_endio;
+
+	/*
+	 * Common fast path - block request does not cross
+	 * boundaries between nodes.
+	 */
+	if (likely(bio->bi_sector + bio_sectors <= n->start + n->size))
+		return dst_node_push(&req);
+
+	req.size = 0;
+	req.idx = 0;
+	req.num = 1;
+
+	cnt = bio->bi_vcnt;
+
+	rest_in_node = to_bytes(n->size - req.start);
+
+	for (i = 0; i < cnt; ++i) {
+		bv = bio_iovec_idx(bio, i);
+
+		if (req.size + bv->bv_len >= rest_in_node) {
+			unsigned int diff = req.size + bv->bv_len -
+				rest_in_node;
+
+			req.size += bv->bv_len - diff;
+			req.start = start - n->start;
+			req.orig_size = req.size;
+			req.bio = bio;
+			req.bio_endio = &kst_bio_endio;
+
+			dprintk("%s: split: start: %llu/%llu, size: %llu, "
+					"total_size: %llu, diff: %u, idx: %d, "
+					"num: %d, bv_len: %u, bv_offset: %u.\n",
+					__func__, start, req.start, req.size,
+					total_size, diff, req.idx, req.num,
+					bv->bv_len, bv->bv_offset);
+
+			err = dst_node_push(&req);
+			if (err)
+				break;
+
+			total_size -= req.orig_size;
+
+			if (!total_size)
+				break;
+
+			start += to_sector(req.orig_size);
+
+			req.flags = (test_bit(DST_NODE_FROZEN, &n->flags))?
+				DST_REQ_ALWAYS_QUEUE:0;
+			req.orig_size = req.size = diff;
+
+			if (diff) {
+				req.offset = bv->bv_len - diff;
+				req.idx = req.num - 1;
+			} else {
+				req.idx = req.num;
+				req.offset = 0;
+			}
+
+			dprintk("%s: next: start: %llu, size: %llu, "
+				"total_size: %llu, diff: %u, idx: %d, "
+				"num: %d, offset: %u, bv_len: %u, "
+				"bv_offset: %u.\n",
+				__func__, start, req.size, total_size, diff,
+				req.idx, req.num, req.offset,
+				bv->bv_len, bv->bv_offset);
+
+			mutex_lock(&st->tree_lock);
+			n = dst_storage_tree_search(st, start);
+			mutex_unlock(&st->tree_lock);
+
+			if (!n) {
+				err = -ENODEV;
+				dprintk("%s: failed to find a split node for "
+				  "bio: %p, sector: %llu, start: %llu.\n",
+						__func__, bio, bio->bi_sector,
+						req.start);
+				break;
+			}
+
+			req.state = n->state;
+			req.node = n;
+			req.start = start - n->start;
+			rest_in_node = to_bytes(n->size - req.start);
+
+			dprintk("%s: req.start: %llu, start: %llu, "
+					"dev_start: %llu, dev_size: %llu, "
+					"rest_in_node: %llu.\n",
+				__func__, req.start, start, n->start,
+				n->size, rest_in_node);
+		} else {
+			req.size += bv->bv_len;
+			req.num++;
+		}
+	}
+
+	dprintk("%s: last request: start: %llu, size: %llu, "
+			"total_size: %llu.\n", __func__,
+			req.start, req.size, total_size);
+	if (total_size) {
+		req.orig_size = req.size;
+		req.bio = bio;
+		req.bio_endio = &kst_bio_endio;
+
+		dprintk("%s: last: start: %llu/%llu, size: %llu, "
+				"total_size: %llu, idx: %d, num: %d.\n",
+			__func__, start, req.start, req.size,
+			total_size, req.idx, req.num);
+
+		err = dst_node_push(&req);
+		if (!err) {
+			total_size -= req.orig_size;
+
+			BUG_ON(total_size != 0);
+		}
+	}
+
+	dprintk("%s: end bio: %p, err: %d.\n", __func__, bio, err);
+	return err;
+}
+
+
+/*
+ * Distributed storage request processing function.
+ * It calls algorithm specific remapping code only.
+ */
+static int dst_request(request_queue_t *q, struct bio *bio)
+{
+	struct dst_storage *st = q->queuedata;
+	int err;
+
+	dprintk("\n%s: start: st: %p, bio: %p, cnt: %u.\n",
+			__func__, st, bio, bio->bi_vcnt);
+
+	err = dst_remap(st, bio);
+
+	dprintk("%s: end: st: %p, bio: %p, err: %d.\n",
+			__func__, st, bio, err);
+	return 0;
+}
+
+static void dst_unplug(request_queue_t *q)
+{
+}
+
+static int dst_flush(request_queue_t *q, struct gendisk *disk, sector_t *sec)
+{
+	return 0;
+}
+
+static struct block_device_operations dst_blk_ops = {
+	.owner =	THIS_MODULE,
+};
+
+/*
+ * Block layer binding - disk is created when array is fully configured
+ * by userspace request.
+ */
+static int dst_create_disk(struct dst_storage *st)
+{
+	int err = -ENOMEM;
+
+	st->queue = blk_alloc_queue(GFP_KERNEL);
+	if (!st->queue)
+		goto err_out_exit;
+
+	st->queue->queuedata = st;
+	blk_queue_make_request(st->queue, dst_request);
+	blk_queue_bounce_limit(st->queue, BLK_BOUNCE_ANY);
+	st->queue->unplug_fn = dst_unplug;
+	st->queue->issue_flush_fn = dst_flush;
+
+	err = -EINVAL;
+	st->disk = alloc_disk(1);
+	if (!st->disk)
+		goto err_out_free_queue;
+
+	st->disk->major = dst_major;
+	st->disk->first_minor = (((unsigned long)st->disk) ^
+		(((unsigned long)st->disk) >> 31)) & 0xff;
+	st->disk->fops = &dst_blk_ops;
+	st->disk->queue = st->queue;
+	st->disk->private_data = st;
+	snprintf(st->disk->disk_name, sizeof(st->disk->disk_name),
+			"dst-%s-%d", st->name, st->disk->first_minor);
+
+	return 0;
+
+err_out_free_queue:
+	blk_cleanup_queue(st->queue);
+err_out_exit:
+	return err;
+}
+
+static void dst_remove_disk(struct dst_storage *st)
+{
+	del_gendisk(st->disk);
+	put_disk(st->disk);
+	blk_cleanup_queue(st->queue);
+}
+
+/*
+ * Shows storage name in sysfs.
+ */
+static ssize_t dst_name_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+
+	return sprintf(buf, "%s\n", st->name);
+}
+
+static void dst_remove_all_nodes(struct dst_storage *st)
+{
+	struct dst_node *n, *node, *tmp;
+	struct rb_node *rb_node;
+
+	mutex_lock(&st->tree_lock);
+	while ((rb_node = rb_first(&st->tree_root)) != NULL) {
+		n = rb_entry(rb_node, struct dst_node, tree_node);
+		dprintk("%s: n: %p, start: %llu, size: %llu.\n",
+				__func__, n, n->start, n->size);
+		rb_erase(&n->tree_node, &st->tree_root);
+		if (!n->shared_head && atomic_read(&n->shared_num)) {
+			list_for_each_entry_safe(node, tmp, &n->shared, shared) {
+				list_del(&node->shared);
+				atomic_dec(&node->shared_head->refcnt);
+				node->shared_head = NULL;
+				dst_node_put(node);
+			}
+		}
+		dst_node_put(n);
+	}
+	mutex_unlock(&st->tree_lock);
+}
+
+/*
+ * Shows node layout in sysfs.
+ */
+static ssize_t dst_nodes_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+	int size = PAGE_CACHE_SIZE, sz;
+	struct dst_node *n;
+	struct rb_node *rb_node;
+
+	sz = sprintf(buf, "sectors (start [size]): ");
+	size -= sz;
+	buf += sz;
+
+	mutex_lock(&st->tree_lock);
+	for (rb_node = rb_first(&st->tree_root); rb_node;
+			rb_node = rb_next(rb_node)) {
+		n = rb_entry(rb_node, struct dst_node, tree_node);
+		if (size < 32)
+			break;
+		sz = sprintf(buf, "%llu [%llu]", n->start, n->size);
+		buf += sz;
+		size -= sz;
+
+		if (!rb_next(rb_node))
+			break;
+
+		sz = sprintf(buf, " | ");
+		buf += sz;
+		size -= sz;
+	}
+	mutex_unlock(&st->tree_lock);
+	size -= sprintf(buf, "\n");
+	return PAGE_CACHE_SIZE - size;
+}
+
+/*
+ * Algorithm currently being used by given storage.
+ */
+static ssize_t dst_alg_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+	return sprintf(buf, "%s\n", st->alg->name);
+}
+
+/*
+ * Writing to this sysfs file removes all nodes
+ * and the storage itself automatically.
+ */
+static ssize_t dst_remove_nodes(struct device *dev,
+		struct device_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+	dst_remove_all_nodes(st);
+	return count;
+}
+
+static DEVICE_ATTR(name, 0444, dst_name_show, NULL);
+static DEVICE_ATTR(nodes, 0444, dst_nodes_show, NULL);
+static DEVICE_ATTR(alg, 0444, dst_alg_show, NULL);
+static DEVICE_ATTR(remove_all_nodes, 0644, NULL, dst_remove_nodes);
+
+static int dst_create_storage_attributes(struct dst_storage *st)
+{
+	int err;
+
+	err = device_create_file(&st->device, &dev_attr_name);
+	err = device_create_file(&st->device, &dev_attr_nodes);
+	err = device_create_file(&st->device, &dev_attr_alg);
+	err = device_create_file(&st->device, &dev_attr_remove_all_nodes);
+	return 0;
+}
+
+static void dst_remove_storage_attributes(struct dst_storage *st)
+{
+	device_remove_file(&st->device, &dev_attr_name);
+	device_remove_file(&st->device, &dev_attr_nodes);
+	device_remove_file(&st->device, &dev_attr_alg);
+	device_remove_file(&st->device, &dev_attr_remove_all_nodes);
+}
+
+static void dst_storage_sysfs_exit(struct dst_storage *st)
+{
+	dst_remove_storage_attributes(st);
+	device_unregister(&st->device);
+}
+
+static int dst_storage_sysfs_init(struct dst_storage *st)
+{
+	int err;
+
+	memcpy(&st->device, &dst_dev, sizeof(struct device));
+	snprintf(st->device.bus_id, sizeof(st->device.bus_id), "%s", st->name);
+
+	err = device_register(&st->device);
+	if (err) {
+		dprintk(KERN_ERR "Failed to register dst device %s, err: %d.\n",
+			st->name, err);
+		goto err_out_exit;
+	}
+
+	dst_create_storage_attributes(st);
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+/*
+ * These functions show the start and size of the given node.
+ * Both are in sectors.
+ */
+static ssize_t dst_show_start(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+
+	return sprintf(buf, "%llu\n", n->start);
+}
+
+static ssize_t dst_show_size(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+
+	return sprintf(buf, "%llu\n", n->size);
+}
+
+/*
+ * Shows type of the node - device major/minor number
+ * for local nodes and address (af_inet ipv4/ipv6 only) for remote nodes.
+ */
+static ssize_t dst_show_type(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+	struct sockaddr_storage addr;
+	struct socket *sock;
+	int addrlen;
+
+	if (!n->state && !n->bdev)
+		return 0;
+
+	if (n->bdev)
+		return sprintf(buf, "L: %d:%d\n",
+				MAJOR(n->bdev->bd_dev), MINOR(n->bdev->bd_dev));
+
+	sock = n->state->socket;
+	if (sock->ops->getname(sock, (struct sockaddr *)&addr, &addrlen, 2))
+		return 0;
+
+	if (sock->ops->family == AF_INET) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+		return sprintf(buf, "R: %u.%u.%u.%u:%d\n",
+			NIPQUAD(sin->sin_addr.s_addr), ntohs(sin->sin_port));
+	} else if (sock->ops->family == AF_INET6) {
+		struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+		return sprintf(buf,
+			"R: %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d\n",
+			NIP6(sin->sin6_addr), ntohs(sin->sin6_port));
+	}
+	return 0;
+}
+
+static DEVICE_ATTR(start, 0444, dst_show_start, NULL);
+static DEVICE_ATTR(size, 0444, dst_show_size, NULL);
+static DEVICE_ATTR(type, 0444, dst_show_type, NULL);
+
+static int dst_create_node_attributes(struct dst_node *n)
+{
+	int err;
+
+	err = device_create_file(&n->device, &dev_attr_start);
+	err = device_create_file(&n->device, &dev_attr_size);
+	err = device_create_file(&n->device, &dev_attr_type);
+	return 0;
+}
+
+static void dst_remove_node_attributes(struct dst_node *n)
+{
+	device_remove_file(&n->device, &dev_attr_start);
+	device_remove_file(&n->device, &dev_attr_size);
+	device_remove_file(&n->device, &dev_attr_type);
+}
+
+static void dst_node_sysfs_exit(struct dst_node *n)
+{
+	if (n->device.parent == &n->st->device) {
+		dst_remove_node_attributes(n);
+		device_unregister(&n->device);
+		n->device.parent = NULL;
+	}
+}
+
+static int dst_node_sysfs_init(struct dst_node *n)
+{
+	int err;
+
+	memcpy(&n->device, &dst_node_dev, sizeof(struct device));
+
+	n->device.parent = &n->st->device;
+
+	snprintf(n->device.bus_id, sizeof(n->device.bus_id),
+			"n-%llu-%p", n->start, n);
+	err = device_register(&n->device);
+	if (err) {
+		dprintk(KERN_ERR "Failed to register node, err: %d.\n", err);
+		goto err_out_exit;
+	}
+
+	dst_create_node_attributes(n);
+
+	return 0;
+
+err_out_exit:
+	n->device.parent = NULL;
+	return err;
+}
+
+/*
+ * Gets a reference to the storage with the given name and algorithm.
+ * If no such storage exists and 'alloc' is set, a new one
+ * is created.
+ */
+static struct dst_storage *dst_get_storage(char *name, char *aname, int alloc)
+{
+	struct dst_storage *st, *rst = NULL;
+	int err;
+	struct dst_alg *alg;
+
+	mutex_lock(&dst_storage_lock);
+	list_for_each_entry(st, &dst_storage_list, entry) {
+		if (!strcmp(name, st->name) && !strcmp(st->alg->name, aname)) {
+			rst = st;
+			atomic_inc(&st->refcnt);
+			break;
+		}
+	}
+	mutex_unlock(&dst_storage_lock);
+
+	if (rst || !alloc)
+		return rst;
+
+	st = kzalloc(sizeof(struct dst_storage), GFP_KERNEL);
+	if (!st)
+		return NULL;
+
+	mutex_init(&st->tree_lock);
+	/*
+	 * One for storage itself,
+	 * another one for attached node below.
+	 */
+	atomic_set(&st->refcnt, 2);
+	snprintf(st->name, DST_NAMELEN, "%s", name);
+	st->tree_root.rb_node = NULL;
+
+	err = dst_storage_sysfs_init(st);
+	if (err)
+		goto err_out_free;
+
+	err = dst_create_disk(st);
+	if (err)
+		goto err_out_sysfs_exit;
+
+	mutex_lock(&dst_alg_lock);
+	list_for_each_entry(alg, &dst_alg_list, entry) {
+		if (!strcmp(alg->name, aname)) {
+			atomic_inc(&alg->refcnt);
+			try_module_get(alg->ops->owner);
+			st->alg = alg;
+			break;
+		}
+	}
+	mutex_unlock(&dst_alg_lock);
+
+	if (!st->alg)
+		goto err_out_disk_remove;
+
+	mutex_lock(&dst_storage_lock);
+	list_add_tail(&st->entry, &dst_storage_list);
+	mutex_unlock(&dst_storage_lock);
+
+	return st;
+
+err_out_disk_remove:
+	dst_remove_disk(st);
+err_out_sysfs_exit:
+	dst_storage_sysfs_exit(st);
+err_out_free:
+	kfree(st);
+	return NULL;
+}
+
+/*
+ * Allows external modules to allocate and register a new algorithm.
+ */
+struct dst_alg *dst_alloc_alg(char *name, struct dst_alg_ops *ops)
+{
+	struct dst_alg *alg;
+
+	alg = kzalloc(sizeof(struct dst_alg), GFP_KERNEL);
+	if (!alg)
+		return NULL;
+	snprintf(alg->name, DST_NAMELEN, "%s", name);
+	atomic_set(&alg->refcnt, 1);
+	alg->ops = ops;
+
+	mutex_lock(&dst_alg_lock);
+	list_add_tail(&alg->entry, &dst_alg_list);
+	mutex_unlock(&dst_alg_lock);
+
+	return alg;
+}
+EXPORT_SYMBOL_GPL(dst_alloc_alg);
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+	dprintk("%s: alg: %p.\n", __func__, alg);
+	kfree(alg);
+}
+
+/*
+ * Algorithm is never freed directly,
+ * since its module reference counter is increased
+ * by storage when it is created - just like network protocols.
+ */
+static inline void dst_put_alg(struct dst_alg *alg)
+{
+	dprintk("%s: alg: %p, refcnt: %d.\n",
+			__func__, alg, atomic_read(&alg->refcnt));
+	module_put(alg->ops->owner);
+	if (atomic_dec_and_test(&alg->refcnt))
+		dst_free_alg(alg);
+}
+
+/*
+ * Removing algorithm from main list of supported algorithms.
+ */
+void dst_remove_alg(struct dst_alg *alg)
+{
+	mutex_lock(&dst_alg_lock);
+	list_del_init(&alg->entry);
+	mutex_unlock(&dst_alg_lock);
+
+	dst_put_alg(alg);
+}
+EXPORT_SYMBOL_GPL(dst_remove_alg);
+
+static void dst_cleanup_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: node: %p.\n", __func__, n);
+
+	n->st->alg->ops->del_node(n);
+
+	if (n->shared_head) {
+		mutex_lock(&st->tree_lock);
+		list_del(&n->shared);
+		mutex_unlock(&st->tree_lock);
+
+		atomic_dec(&n->shared_head->refcnt);
+		dst_node_put(n->shared_head);
+		n->shared_head = NULL;
+	}
+
+	if (n->cleanup)
+		n->cleanup(n);
+	dst_node_sysfs_exit(n);
+	kfree(n);
+}
+
+static void dst_free_storage(struct dst_storage *st)
+{
+	dprintk("%s: st: %p.\n", __func__, st);
+
+	BUG_ON(rb_first(&st->tree_root) != NULL);
+
+	dst_put_alg(st->alg);
+	kfree(st);
+}
+
+static inline void dst_put_storage(struct dst_storage *st)
+{
+	dprintk("%s: st: %p, refcnt: %d.\n",
+			__func__, st, atomic_read(&st->refcnt));
+	if (atomic_dec_and_test(&st->refcnt))
+		dst_free_storage(st);
+}
+
+void dst_node_put(struct dst_node *n)
+{
+	dprintk("%s: node: %p, start: %llu, size: %llu, refcnt: %d.\n",
+			__func__, n, n->start, n->size,
+			atomic_read(&n->refcnt));
+
+	if (atomic_dec_and_test(&n->refcnt)) {
+		struct dst_storage *st = n->st;
+
+		dprintk("%s: freeing node: %p, start: %llu, size: %llu, "
+				"refcnt: %d.\n",
+				__func__, n, n->start, n->size,
+				atomic_read(&n->refcnt));
+
+		dst_cleanup_node(n);
+		dst_put_storage(st);
+	}
+}
+EXPORT_SYMBOL_GPL(dst_node_put);
+
+static inline int dst_compare_id(struct dst_node *old, u64 new)
+{
+	if (old->start + old->size <= new)
+		return 1;
+	if (old->start > new)
+		return -1;
+	return 0;
+}
+
+/*
+ * Tree of the nodes, which form the storage.
+ * Tree is indexed via start of the node and its size.
+ * Comparison function above.
+ */
+struct dst_node *dst_storage_tree_search(struct dst_storage *st, u64 start)
+{
+	struct rb_node *n = st->tree_root.rb_node;
+	struct dst_node *dn;
+	int cmp;
+
+	while (n) {
+		dn = rb_entry(n, struct dst_node, tree_node);
+
+		cmp = dst_compare_id(dn, start);
+		dprintk("%s: tree: %llu-%llu, new: %llu.\n",
+			__func__, dn->start, dn->start+dn->size, start);
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else {
+			return dst_node_get(dn);
+		}
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(dst_storage_tree_search);
+
+/*
+ * This function removes the node with the given start address
+ * from the storage.
+ */
+static struct dst_node *dst_storage_tree_del(struct dst_storage *st, u64 start)
+{
+	struct dst_node *n = dst_storage_tree_search(st, start);
+
+	if (!n)
+		return NULL;
+
+	rb_erase(&n->tree_node, &st->tree_root);
+	dst_node_put(n);
+	return n;
+}
+
+/*
+ * This function adds the given node to the storage.
+ * Returns NULL on success, or the existing node if the same area is
+ * already covered; callers must check this for redundancy algorithms.
+ */
+static struct dst_node *dst_storage_tree_add(struct dst_node *new,
+		struct dst_storage *st)
+{
+	struct rb_node **n = &st->tree_root.rb_node, *parent = NULL;
+	struct dst_node *dn;
+	int cmp;
+
+	while (*n) {
+		parent = *n;
+		dn = rb_entry(parent, struct dst_node, tree_node);
+
+		cmp = dst_compare_id(dn, new->start);
+		dprintk("%s: tree: %llu-%llu, new: %llu.\n",
+				__func__, dn->start, dn->start+dn->size,
+				new->start);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			return dn;
+		}
+	}
+
+	rb_link_node(&new->tree_node, parent, n);
+	rb_insert_color(&new->tree_node, &st->tree_root);
+
+	return NULL;
+}
+
+/*
+ * This function finds the device's major/minor numbers for given pathname.
+ */
+static int dst_lookup_device(const char *path, dev_t *dev)
+{
+	int err;
+	struct nameidata nd;
+	struct inode *inode;
+
+	err = path_lookup(path, LOOKUP_FOLLOW, &nd);
+	if (err)
+		return err;
+
+	inode = nd.dentry->d_inode;
+	if (!inode) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	if (!S_ISBLK(inode->i_mode)) {
+		err = -ENOTBLK;
+		goto out;
+	}
+
+	*dev = inode->i_rdev;
+
+out:
+	path_release(&nd);
+	return err;
+}
+
+/*
+ * Cleanup routines for local, local exporting and remote nodes.
+ */
+static void dst_cleanup_remote(struct dst_node *n)
+{
+	if (n->state) {
+		kst_state_exit(n->state);
+		n->state = NULL;
+	}
+}
+
+static void dst_cleanup_local(struct dst_node *n)
+{
+	if (n->bdev) {
+		sync_blockdev(n->bdev);
+		blkdev_put(n->bdev);
+		n->bdev = NULL;
+	}
+}
+
+static void dst_cleanup_local_export(struct dst_node *n)
+{
+	dst_cleanup_local(n);
+	dst_cleanup_remote(n);
+}
+
+/*
+ * Setup routines for local, local exporting and remote nodes.
+ */
+static int dst_setup_local(struct dst_node *n, struct dst_ctl *ctl,
+		struct dst_local_ctl *l)
+{
+	dev_t dev;
+	int err;
+
+	err = dst_lookup_device(l->name, &dev);
+	if (err)
+		return err;
+
+	n->bdev = open_by_devnum(dev, FMODE_READ|FMODE_WRITE);
+	if (!n->bdev)
+		return -ENODEV;
+
+	if (!n->size)
+		n->size = get_capacity(n->bdev->bd_disk);
+
+	return 0;
+}
+
+static int dst_setup_local_export(struct dst_node *n, struct dst_ctl *ctl,
+		struct dst_le_template *tmp)
+{
+	int err;
+
+	err = dst_setup_local(n, ctl, &tmp->le->lctl);
+	if (err)
+		goto err_out_exit;
+
+	n->state = kst_listener_state_init(n, tmp);
+	if (IS_ERR(n->state)) {
+		err = PTR_ERR(n->state);
+		goto err_out_cleanup;
+	}
+
+	return 0;
+
+err_out_cleanup:
+	dst_cleanup_local(n);
+err_out_exit:
+	return err;
+}
+
+static int dst_request_remote_config(struct dst_node *n, struct socket *sock)
+{
+	struct dst_remote_request cfg;
+	struct msghdr msg;
+	struct kvec iov;
+	int err;
+
+	memset(&cfg, 0, sizeof(struct dst_remote_request));
+	cfg.cmd = cpu_to_be32(DST_REMOTE_CFG);
+
+	iov.iov_base = &cfg;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL;
+
+	err = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		return err;
+	}
+
+	iov.iov_base = &cfg;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL;
+
+	err = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len, msg.msg_flags);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		return err;
+	}
+
+	if (be32_to_cpu(cfg.cmd) != DST_REMOTE_CFG)
+		return -EINVAL;
+
+	n->size = be64_to_cpu(cfg.sector);
+
+	return 0;
+}
+
+static int dst_setup_remote(struct dst_node *n, struct dst_ctl *ctl,
+		struct dst_remote_ctl *r)
+{
+	int err;
+	struct socket *sock;
+
+	err = sock_create(r->addr.sa_family, r->type, r->proto, &sock);
+	if (err < 0)
+		goto err_out_exit;
+
+	sock->sk->sk_sndtimeo = sock->sk->sk_rcvtimeo =
+		msecs_to_jiffies(DST_DEFAULT_TIMEO);
+
+	err = sock->ops->connect(sock, (struct sockaddr *)&r->addr,
+			r->addr.sa_data_len, 0);
+	if (err)
+		goto err_out_destroy;
+
+	if (!n->size) {
+		err = dst_request_remote_config(n, sock);
+		if (err)
+			goto err_out_destroy;
+	}
+
+	n->state = kst_data_state_init(n, sock);
+	if (IS_ERR(n->state)) {
+		err = PTR_ERR(n->state);
+		goto err_out_destroy;
+	}
+
+	return 0;
+
+err_out_destroy:
+	sock_release(sock);
+err_out_exit:
+	return err;
+}
+
+/*
+ * This function inserts node into storage.
+ */
+static int dst_insert_node(struct dst_node *n)
+{
+	int err;
+	struct dst_storage *st = n->st;
+	struct dst_node *dn;
+
+	err = st->alg->ops->add_node(n);
+	if (err)
+		return err;
+
+	err = dst_node_sysfs_init(n);
+	if (err)
+		goto err_out_remove_node;
+
+	mutex_lock(&st->tree_lock);
+	dn = dst_storage_tree_add(n, st);
+	if (dn) {
+		err = -EINVAL;
+		dn->size = st->disk_size;
+		if (dn->start == n->start) {
+			err = 0;
+			n->shared_head = dst_node_get(dn);
+			atomic_inc(&dn->shared_num);
+			list_add_tail(&n->shared, &dn->shared);
+		}
+	}
+	mutex_unlock(&st->tree_lock);
+	if (err)
+		goto err_out_sysfs_exit;
+
+	if (n->priv_callback)
+		n->priv_callback(n);
+
+	return 0;
+
+err_out_sysfs_exit:
+	dst_node_sysfs_exit(n);
+err_out_remove_node:
+	st->alg->ops->del_node(n);
+	return err;
+}
+
+static struct dst_node *dst_alloc_node(struct dst_ctl *ctl,
+		void (*cleanup)(struct dst_node *))
+{
+	struct dst_storage *st;
+	struct dst_node *n;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 1);
+	if (!st)
+		goto err_out_exit;
+
+	n = kzalloc(sizeof(struct dst_node), GFP_KERNEL);
+	if (!n)
+		goto err_out_put_storage;
+
+	n->w = kst_main_worker;
+	n->st = st;
+	n->cleanup = cleanup;
+	n->start = ctl->start;
+	n->size = ctl->size;
+	INIT_LIST_HEAD(&n->shared);
+	n->shared_head = NULL;
+	atomic_set(&n->shared_num, 0);
+	atomic_set(&n->refcnt, 1);
+
+	return n;
+
+err_out_put_storage:
+	mutex_lock(&dst_storage_lock);
+	list_del_init(&st->entry);
+	mutex_unlock(&dst_storage_lock);
+
+	dst_put_storage(st);
+err_out_exit:
+	return NULL;
+}
+
+/*
+ * Control callback for userspace commands to setup
+ * different nodes and start/stop array.
+ */
+static int dst_add_remote(struct dst_ctl *ctl, void *data, unsigned int len)
+{
+	struct dst_node *n;
+	int err;
+	struct dst_remote_ctl *rctl = data;
+
+	if (len != sizeof(struct dst_remote_ctl))
+		return -EINVAL;
+
+	n = dst_alloc_node(ctl, &dst_cleanup_remote);
+	if (!n)
+		return -ENOMEM;
+
+	err = dst_setup_remote(n, ctl, rctl);
+	if (err < 0)
+		goto err_out_free;
+
+	err = dst_insert_node(n);
+	if (err)
+		goto err_out_free;
+
+	return 0;
+
+err_out_free:
+	dst_node_put(n);
+	return err;
+}
+
+static int dst_add_local_export(struct dst_ctl *ctl, void *data, unsigned int len)
+{
+	struct dst_node *n;
+	int err;
+	struct dst_le_template tmp;
+
+	if (len < sizeof(struct dst_local_export_ctl))
+		return -EINVAL;
+
+	tmp.le = data;
+
+	len -= sizeof(struct dst_local_export_ctl);
+	data += sizeof(struct dst_local_export_ctl);
+
+	if (len != tmp.le->secure_attr_num * sizeof(struct dst_secure))
+		return -EINVAL;
+
+	tmp.data = data;
+
+	n = dst_alloc_node(ctl, &dst_cleanup_local_export);
+	if (!n)
+		return -EINVAL;
+
+	err = dst_setup_local_export(n, ctl, &tmp);
+	if (err < 0)
+		goto err_out_free;
+
+	err = dst_insert_node(n);
+	if (err)
+		goto err_out_free;
+
+
+	return 0;
+
+err_out_free:
+	dst_node_put(n);
+	return err;
+}
+
+static int dst_add_local(struct dst_ctl *ctl, void *data, unsigned int len)
+{
+	struct dst_node *n;
+	int err;
+	struct dst_local_ctl *lctl = data;
+
+	if (len != sizeof(struct dst_local_ctl))
+		return -EINVAL;
+
+	n = dst_alloc_node(ctl, &dst_cleanup_local);
+	if (!n)
+		return -EINVAL;
+
+	err = dst_setup_local(n, ctl, lctl);
+	if (err < 0)
+		goto err_out_free;
+
+	err = dst_insert_node(n);
+	if (err)
+		goto err_out_free;
+
+	return 0;
+
+err_out_free:
+	dst_node_put(n);
+	return err;
+}
+
+static int dst_del_node(struct dst_ctl *ctl, void *data, unsigned int len)
+{
+	struct dst_node *n;
+	struct dst_storage *st;
+	int err = -ENODEV;
+
+	if (len)
+		return -EINVAL;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 0);
+	if (!st)
+		goto err_out_exit;
+
+	mutex_lock(&st->tree_lock);
+	n = dst_storage_tree_del(st, ctl->start);
+	mutex_unlock(&st->tree_lock);
+	if (!n)
+		goto err_out_put;
+
+	dst_node_put(n);
+	dst_put_storage(st);
+
+	return 0;
+
+err_out_put:
+	dst_put_storage(st);
+err_out_exit:
+	return err;
+}
+
+static int dst_start_storage(struct dst_ctl *ctl, void *data, unsigned int len)
+{
+	struct dst_storage *st;
+
+	if (len)
+		return -EINVAL;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 0);
+	if (!st)
+		return -ENODEV;
+
+	mutex_lock(&st->tree_lock);
+	if (!(st->flags & DST_ST_STARTED)) {
+		set_capacity(st->disk, st->disk_size);
+		add_disk(st->disk);
+		st->flags |= DST_ST_STARTED;
+		dprintk("%s: STARTED st: %p, disk_size: %llu.\n",
+				__func__, st, st->disk_size);
+	}
+	mutex_unlock(&st->tree_lock);
+
+	dst_put_storage(st);
+
+	return 0;
+}
+
+static int dst_stop_storage(struct dst_ctl *ctl, void *data, unsigned int len)
+{
+	struct dst_storage *st;
+
+	if (len)
+		return -EINVAL;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 0);
+	if (!st)
+		return -ENODEV;
+
+	dprintk("%s: STOPPED storage: %s.\n", __func__, st->name);
+
+	dst_storage_sysfs_exit(st);
+
+	mutex_lock(&dst_storage_lock);
+	list_del_init(&st->entry);
+	mutex_unlock(&dst_storage_lock);
+
+	if (st->flags & DST_ST_STARTED)
+		dst_remove_disk(st);
+
+	dst_remove_all_nodes(st);
+	dst_put_storage(st); /* One reference got above */
+	dst_put_storage(st); /* Another reference set during initialization */
+
+	return 0;
+}
+
+typedef int (*dst_command_func)(struct dst_ctl *ctl, void *data, unsigned int len);
+
+/*
+ * List of userspace commands.
+ */
+static dst_command_func dst_commands[] = {
+	[DST_ADD_REMOTE] = &dst_add_remote,
+	[DST_ADD_LOCAL] = &dst_add_local,
+	[DST_ADD_LOCAL_EXPORT] = &dst_add_local_export,
+	[DST_DEL_NODE] = &dst_del_node,
+	[DST_START_STORAGE] = &dst_start_storage,
+	[DST_STOP_STORAGE] = &dst_stop_storage,
+};
+
+/*
+ * Configuration parser.
+ */
+static void cn_dst_callback(void *data)
+{
+	struct dst_ctl *ctl;
+	struct cn_msg *msg = data;
+
+	if (msg->len < sizeof(struct dst_ctl))
+		return;
+
+	ctl = (struct dst_ctl *)msg->data;
+
+	if (ctl->cmd >= DST_CMD_MAX)
+		return;
+
+	dst_commands[ctl->cmd](ctl, msg->data + sizeof(struct dst_ctl),
+			msg->len - sizeof(struct dst_ctl));
+}
+
+static int dst_sysfs_init(void)
+{
+	return bus_register(&dst_dev_bus_type);
+}
+
+static void dst_sysfs_exit(void)
+{
+	bus_unregister(&dst_dev_bus_type);
+}
+
+static int __init dst_sys_init(void)
+{
+	int err = -ENOMEM;
+
+	dst_request_cache = kmem_cache_create("dst", sizeof(struct dst_request),
+				       0, 0, NULL, NULL);
+	if (!dst_request_cache)
+		return -ENOMEM;
+
+	dst_bio_set = bioset_create(32, 32);
+	if (!dst_bio_set)
+		goto err_out_destroy;
+
+	err = register_blkdev(dst_major, DST_NAME);
+	if (err < 0)
+		goto err_out_destroy_bioset;
+	if (err)
+		dst_major = err;
+
+	err = dst_sysfs_init();
+	if (err)
+		goto err_out_unregister;
+
+	kst_main_worker = kst_worker_init(0);
+	if (IS_ERR(kst_main_worker)) {
+		err = PTR_ERR(kst_main_worker);
+		goto err_out_sysfs_exit;
+	}
+
+	err = cn_add_callback(&cn_dst_id, "DST", cn_dst_callback);
+	if (err)
+		goto err_out_worker_exit;
+
+	return 0;
+
+err_out_worker_exit:
+	kst_worker_exit(kst_main_worker);
+err_out_sysfs_exit:
+	dst_sysfs_exit();
+err_out_unregister:
+	unregister_blkdev(dst_major, DST_NAME);
+err_out_destroy_bioset:
+	bioset_free(dst_bio_set);
+err_out_destroy:
+	kmem_cache_destroy(dst_request_cache);
+	return err;
+}
+
+static void __exit dst_sys_exit(void)
+{
+	cn_del_callback(&cn_dst_id);
+	dst_sysfs_exit();
+	unregister_blkdev(dst_major, DST_NAME);
+	kst_exit_all();
+	bioset_free(dst_bio_set);
+	kmem_cache_destroy(dst_request_cache);
+}
+
+module_init(dst_sys_init);
+module_exit(dst_sys_exit);
+
+MODULE_DESCRIPTION("Distributed storage");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..b0608c9
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1606 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+	if (!err)
+		err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	if (st) {
+		struct kst_worker *w = st->node->w;
+		unsigned long flags;
+
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (list_empty(&st->ready_entry))
+			list_add_tail(&st->ready_entry, &w->ready_list);
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		wake_up(&w->wait);
+	}
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+				 poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	struct kst_state *st = req->state;
+
+	rb_erase(&req->request_entry, &st->request_root);
+	RB_CLEAR_NODE(&req->request_entry);
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+static inline int dst_compare_request_id(struct dst_request *old,
+		struct dst_request *new)
+{
+	int cmd = 0;
+
+	if (old->start + to_sector(old->orig_size) <= new->start)
+		cmd = 1;
+	if (old->start >= new->start + to_sector(new->orig_size))
+		cmd = -1;
+
+	dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+		"new: op: %lu, start: %llu, size: %llu, off: %u, cmp: %d.\n",
+		__func__, bio_rw(old->bio), old->start, old->orig_size,
+		old->offset,
+		bio_rw(new->bio), new->start, new->orig_size,
+		new->offset, cmd);
+
+	return cmd;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	struct rb_node **n = &st->request_root.rb_node, *parent = NULL;
+	struct dst_request *old = NULL;
+	int cmp, err = 0;
+
+	while (*n) {
+		parent = *n;
+		old = rb_entry(parent, struct dst_request, request_entry);
+
+		cmp = dst_compare_request_id(old, req);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			printk("%s: [%c] old_req: %p, start: %llu, "
+					"size: %llu.\n",
+					__func__, 
+					(bio_rw(old->bio) == WRITE)?'W':'R',
+					old, old->start, old->orig_size);
+			err = -EEXIST;
+			break;
+		}
+	}
+
+	if (!err) {
+		rb_link_node(&req->request_entry, parent, n);
+		rb_insert_color(&req->request_entry, &st->request_root);
+	}
+
+	if (req->size != req->orig_size)
+		list_add(&req->request_list_entry, &st->request_list);
+	else
+		list_add_tail(&req->request_list_entry, &st->request_list);
+	return err;
+}
+EXPORT_SYMBOL_GPL(kst_enqueue_req);
+
+/*
+ * BIOs for local exporting node are freed via this function.
+ */
+static void kst_export_put_bio(struct bio *bio)
+{
+	int i;
+	struct bio_vec *bv;
+
+	dprintk("%s: bio: %p, size: %u, idx: %d, num: %d.\n",
+			__func__, bio, bio->bi_size, bio->bi_idx,
+			bio->bi_vcnt);
+
+	bio_for_each_segment(bv, bio, i)
+		__free_page(bv->bv_page);
+	bio_put(bio);
+}
+
+/*
+ * This is a generic request completion function for requests,
+ * queued for async processing.
+ * If it is local export node, state machine is different,
+ * see details below.
+ */
+void kst_complete_req(struct dst_request *req, int err)
+{
+	dprintk("%s: bio: %p, req: %p, size: %llu, orig_size: %llu, "
+			"bi_size: %u, err: %d, flags: %u.\n",
+			__func__, req->bio, req, req->size, req->orig_size,
+			req->bio->bi_size, err, req->flags);
+
+	if (req->flags & DST_REQ_EXPORT) {
+		if (req->flags & DST_REQ_EXPORT_WRITE) {
+			req->bio->bi_rw = WRITE;
+			generic_make_request(req->bio);
+		} else
+			kst_export_put_bio(req->bio);
+	} else {
+		req->bio_endio(req, err);
+	}
+	dst_free_request(req);
+}
+EXPORT_SYMBOL_GPL(kst_complete_req);
+
+static void kst_flush_requests(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	while ((req = kst_dequeue_req(st)) != NULL)
+		kst_complete_req(req, -EIO);
+}
+
+static int kst_poll_init(struct kst_state *st)
+{
+	struct kst_poll_helper ph;
+
+	ph.st = st;
+	init_poll_funcptr(&ph.pt, &kst_queue_func);
+
+	st->socket->ops->poll(NULL, st->socket, &ph.pt);
+	return 0;
+}
+
+/*
+ * Main state creation function.
+ * It creates new state according to given operations
+ * and links it into worker structure and node.
+ */
+static struct kst_state *kst_state_init(struct dst_node *node,
+		unsigned int permissions,
+		struct kst_state_ops *ops, void *data)
+{
+	struct kst_state *st;
+	int err;
+
+	st = kzalloc(sizeof(struct kst_state), GFP_KERNEL);
+	if (!st)
+		return ERR_PTR(-ENOMEM);
+
+	st->permissions = permissions;
+	st->node = node;
+	st->ops = ops;
+	INIT_LIST_HEAD(&st->ready_entry);
+	INIT_LIST_HEAD(&st->entry);
+	st->request_root.rb_node = NULL;
+	INIT_LIST_HEAD(&st->request_list);
+	mutex_init(&st->request_lock);
+
+	err = st->ops->init(st, data);
+	if (err)
+		goto err_out_free;
+	mutex_lock(&node->w->state_mutex);
+	list_add_tail(&st->entry, &node->w->state_list);
+	mutex_unlock(&node->w->state_mutex);
+
+	kst_wake(st);
+
+	return st;
+
+err_out_free:
+	kfree(st);
+	return ERR_PTR(err);
+}
+
+/*
+ * This function is called when node is removed,
+ * or when the state of a client connected to a local
+ * exporting node is destroyed.
+ */
+void kst_state_exit(struct kst_state *st)
+{
+	struct kst_worker *w = st->node->w;
+
+	dprintk("%s: st: %p.\n", __func__, st);
+
+	mutex_lock(&w->state_mutex);
+	list_del_init(&st->entry);
+	mutex_unlock(&w->state_mutex);
+
+	st->ops->exit(st);
+
+	st->node->state = NULL;
+
+	kfree(st);
+}
+
+static int kst_error(struct kst_state *st, int err)
+{
+	if ((err == -ECONNRESET || err == -EPIPE) && st->ops->recovery)
+		err = st->ops->recovery(st, err);
+
+	return st->node->st->alg->ops->error(st, err);
+}
+
+/*
+ * This is main state processing function.
+ * It tries to complete request and invoke appropriate
+ * callbacks in case of errors or successful operation completion.
+ */
+static int kst_thread_process_state(struct kst_state *st)
+{
+	int err, empty;
+	unsigned int revents;
+	struct dst_request *req, *tmp;
+
+	mutex_lock(&st->request_lock);
+	if (st->ops->ready) {
+		err = st->ops->ready(st);
+		if (err) {
+			mutex_unlock(&st->request_lock);
+			if (err < 0)
+				kst_state_exit(st);
+			return err;
+		}
+	}
+
+	err = 0;
+	empty = 1;
+	req = NULL;
+	list_for_each_entry_safe(req, tmp, &st->request_list,
+			request_list_entry) {
+		empty = 0;
+		revents = st->socket->ops->poll(st->socket->file,
+				st->socket, NULL);
+		dprintk("\n%s: st: %p, revents: %x.\n", __func__, st, revents);
+		if (!revents)
+			break;
+		err = req->callback(req, revents);
+		dprintk("%s: callback returned, st: %p, err: %d.\n",
+				__func__, st, err);
+		if (err)
+			break;
+	}
+	mutex_unlock(&st->request_lock);
+
+	dprintk("%s: req: %p, err: %d.\n", __func__, req, err);
+	if (err < 0) {
+		err = kst_error(st, err);
+		if (err && (st != st->node->state)) {
+			dprintk("%s: err: %d, st: %p, node->state: %p.\n",
+					__func__, err, st, st->node->state);
+			/*
+			 * Accepted client has state not related to storage
+			 * node, so it must be freed explicitly.
+			 */
+
+			kst_state_exit(st);
+			return err;
+		}
+
+		kst_wake(st);
+	}
+
+	if (list_empty(&st->request_list) && !empty)
+		kst_wake(st);
+
+	return err;
+}
+
+/*
+ * Main worker thread - one per storage.
+ */
+static int kst_thread_func(void *data)
+{
+	struct kst_worker *w = data;
+	struct kst_state *st;
+	unsigned long flags;
+	int err = 0;
+
+	while (!kthread_should_stop()) {
+		wait_event_interruptible_timeout(w->wait,
+				!list_empty(&w->ready_list) ||
+				kthread_should_stop(),
+				HZ);
+
+		st = NULL;
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (!list_empty(&w->ready_list)) {
+			st = list_entry(w->ready_list.next, struct kst_state,
+					ready_entry);
+			list_del_init(&st->ready_entry);
+		}
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		if (!st)
+			continue;
+
+		err = kst_thread_process_state(st);
+	}
+
+	return err;
+}
+
+/*
+ * Worker initialization - this object will host and process all states,
+ * which in turn host requests for remote targets.
+ */
+struct kst_worker *kst_worker_init(int id)
+{
+	struct kst_worker *w;
+	int err;
+
+	w = kzalloc(sizeof(struct kst_worker), GFP_KERNEL);
+	if (!w)
+		return ERR_PTR(-ENOMEM);
+
+	w->id = id;
+	init_waitqueue_head(&w->wait);
+	spin_lock_init(&w->ready_lock);
+	mutex_init(&w->state_mutex);
+
+	INIT_LIST_HEAD(&w->ready_list);
+	INIT_LIST_HEAD(&w->state_list);
+
+	w->req_pool = mempool_create_slab_pool(256, dst_request_cache);
+	if (!w->req_pool) {
+		err = -ENOMEM;
+		goto err_out_free;
+	}
+
+	w->thread = kthread_run(&kst_thread_func, w, "kst%d", w->id);
+	if (IS_ERR(w->thread)) {
+		err = PTR_ERR(w->thread);
+		goto err_out_destroy;
+	}
+
+	mutex_lock(&kst_worker_mutex);
+	list_add_tail(&w->entry, &kst_worker_list);
+	mutex_unlock(&kst_worker_mutex);
+
+	return w;
+
+err_out_destroy:
+	mempool_destroy(w->req_pool);
+err_out_free:
+	kfree(w);
+	return ERR_PTR(err);
+}
+
+void kst_worker_exit(struct kst_worker *w)
+{
+	struct kst_state *st, *n;
+
+	mutex_lock(&kst_worker_mutex);
+	list_del(&w->entry);
+	mutex_unlock(&kst_worker_mutex);
+
+	kthread_stop(w->thread);
+
+	list_for_each_entry_safe(st, n, &w->state_list, entry) {
+		kst_state_exit(st);
+	}
+
+	mempool_destroy(w->req_pool);
+	kfree(w);
+}
+
+/*
+ * Common state exit callback.
+ * Removes itself from worker's list of states,
+ * releases socket and flushes all requests.
+ */
+static void kst_common_exit(struct kst_state *st)
+{
+	unsigned long flags;
+
+	dprintk("%s: st: %p.\n", __func__, st);
+	kst_poll_exit(st);
+
+	spin_lock_irqsave(&st->node->w->ready_lock, flags);
+	list_del_init(&st->ready_entry);
+	spin_unlock_irqrestore(&st->node->w->ready_lock, flags);
+
+	kst_sock_release(st);
+	kst_flush_requests(st);
+}
+
+/*
+ * Listen socket contains security attributes in request_list,
+ * so it can not be flushed via usual way.
+ */
+static void kst_listen_flush(struct kst_state *st)
+{
+	struct dst_secure *s, *tmp;
+
+	list_for_each_entry_safe(s, tmp, &st->request_list, sec_entry) {
+		list_del(&s->sec_entry);
+		kfree(s);
+	}
+}
+
+static void kst_listen_exit(struct kst_state *st)
+{
+	kst_listen_flush(st);
+	kst_common_exit(st);
+}
+
+/*
+ * Header sending function - may block.
+ */
+static int kst_data_send_header(struct kst_state *st,
+		struct dst_remote_request *r)
+{
+	struct msghdr msg;
+	struct kvec iov;
+
+	iov.iov_base = r;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+	return kernel_sendmsg(st->socket, &msg, &iov, 1, iov.iov_len);
+}
+
+/*
+ * BIO vector receiving function - does not block, but may sleep because
+ * of scheduling policy.
+ */
+static int kst_data_recv_bio_vec(struct kst_state *st, struct bio_vec *bv,
+		unsigned int offset, unsigned int size)
+{
+	struct msghdr msg;
+	struct kvec iov;
+	void *kaddr;
+	int err;
+
+	kaddr = kmap(bv->bv_page);
+
+	iov.iov_base = kaddr + bv->bv_offset + offset;
+	iov.iov_len = size;
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
+
+	err = kernel_recvmsg(st->socket, &msg, &iov, 1, iov.iov_len,
+			msg.msg_flags);
+	kunmap(bv->bv_page);
+
+	return err;
+}
+
+/*
+ * BIO vector sending function - does not block, but may sleep because
+ * of scheduling policy.
+ */
+static int kst_data_send_bio_vec(struct kst_state *st, struct bio_vec *bv,
+		unsigned int offset, unsigned int size)
+{
+	return kernel_sendpage(st->socket, bv->bv_page,
+			bv->bv_offset + offset, size,
+			MSG_DONTWAIT | MSG_NOSIGNAL);
+}
+
+typedef int (*kst_data_process_bio_vec_t)(struct kst_state *st,
+		struct bio_vec *bv, unsigned int offset, unsigned int size);
+
+/*
+ * @req: processing request.
+ * Contains BIO and all related to its processing info.
+ *
+ * This function sends or receives requested number of pages from given BIO.
+ *
+ * In case of errors negative return value is returned and @size,
+ * @index and @off are set to the:
+ * - number of bytes not yet processed (i.e. the rest of the bytes to be
+ *   processed).
+ * - index of the last bio_vec started to be processed (header sent).
+ * - offset of the first byte to be processed in the bio_vec.
+ *
+ * If there are no errors, zero is returned.
+ * -EAGAIN is not an error and is transformed into zero return value,
+ * caller must check if @size is zero, in that case whole BIO is processed
+ * and thus req->bio_endio() can be called, otherwise new request must be
+ * allocated to be processed later.
+ */
+static int kst_data_process_bio(struct dst_request *req)
+{
+	int err = -ENOSPC, partial = (req->size != req->orig_size);
+	struct dst_remote_request r;
+	kst_data_process_bio_vec_t func;
+	unsigned int cur_size;
+
+	r.flags = cpu_to_be32(((unsigned long)req->bio) & 0xffffffff);
+
+	if (bio_rw(req->bio) == WRITE) {
+		r.cmd = cpu_to_be32(DST_WRITE);
+		func = kst_data_send_bio_vec;
+	} else {
+		r.cmd = cpu_to_be32(DST_READ);
+		func = kst_data_recv_bio_vec;
+	}
+
+	dprintk("%s: start: [%c], start: %llu, idx: %d, num: %d, "
+			"size: %llu, offset: %u.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size, req->offset);
+
+	while (req->idx < req->num) {
+		struct bio_vec *bv = bio_iovec_idx(req->bio, req->idx);
+
+		cur_size = min_t(u64, bv->bv_len - req->offset, req->size);
+
+		if (cur_size == 0) {
+			printk("%s: %d/%d: start: %llu, "
+				"bv_offset: %u, bv_len: %u, "
+				"req_offset: %u, req_size: %llu, "
+				"req: %p, bio: %p, err: %d.\n",
+				__func__, req->idx, req->num, req->start, 
+				bv->bv_offset, bv->bv_len,
+				req->offset, req->size,
+				req, req->bio, err);
+			BUG();
+		}
+
+		if (!(req->flags & DST_REQ_HEADER_SENT)) {
+			r.sector = cpu_to_be64(req->start);
+			r.offset = cpu_to_be32(bv->bv_offset + req->offset);
+			r.size = cpu_to_be32(cur_size);
+
+			err = kst_data_send_header(req->state, &r);
+			if (err != sizeof(struct dst_remote_request)) {
+				dprintk("%s: %d/%d: header: start: %llu, "
+					"bv_offset: %u, bv_len: %u, "
+					"a offset: %u, offset: %u, "
+					"cur_size: %u, err: %d.\n",
+					__func__, req->idx, req->num,
+					req->start, bv->bv_offset, bv->bv_len,
+					bv->bv_offset + req->offset,
+					req->offset, cur_size, err);
+				if (err >= 0)
+					err = -EINVAL;
+				break;
+			}
+
+			req->flags |= DST_REQ_HEADER_SENT;
+		}
+
+		err = func(req->state, bv, req->offset, cur_size);
+		if (err <= 0)
+			break;
+
+		req->offset += err;
+		req->size -= err;
+
+		if (req->offset != bv->bv_len) {
+			dprintk("%s: %d/%d: this: start: %llu, bv_offset: %u, "
+				"bv_len: %u, a offset: %u, offset: %u, "
+				"cur_size: %u, err: %d.\n",
+				__func__, req->idx, req->num, req->start,
+				bv->bv_offset, bv->bv_len,
+				bv->bv_offset + req->offset,
+				req->offset, cur_size, err);
+			err = -EAGAIN;
+			break;
+		}
+		req->offset = 0;
+		req->idx++;
+		req->flags &= ~DST_REQ_HEADER_SENT;
+
+		req->start += to_sector(bv->bv_len);
+	}
+
+	if (err <= 0 && err != -EAGAIN) {
+		if (err == 0)
+			err = -ECONNRESET;
+	} else
+		err = 0;
+
+	if (req->size) {
+		req->state->flags |= KST_FLAG_PARTIAL;
+	} else if (partial) {
+		req->state->flags &= ~KST_FLAG_PARTIAL;
+	}
+
+	if (err < 0 || (req->idx == req->num && req->size)) {
+		dprintk("%s: return: idx: %d, num: %d, offset: %u, "
+				"size: %llu, err: %d.\n",
+			__func__, req->idx, req->num, req->offset,
+			req->size, err);
+	}
+	dprintk("%s: end: start: %llu, idx: %d, num: %d, "
+			"size: %llu, offset: %u.\n",
+		__func__, req->start, req->idx, req->num,
+		req->size, req->offset);
+
+	return err;
+}
+
+void kst_bio_endio(struct dst_request *req, int err)
+{
+	if (err)
+		printk("%s: freeing bio: %p, bi_size: %u, "
+			"orig_size: %llu, req: %p.\n",
+		__func__, req->bio, req->bio->bi_size, req->orig_size, req);
+	bio_endio(req->bio, req->orig_size, err);
+}
+EXPORT_SYMBOL_GPL(kst_bio_endio);
+
+/*
+ * This callback is invoked by worker thread to process given request.
+ */
+int kst_data_callback(struct dst_request *req, unsigned int revents)
+{
+	int err;
+
+	dprintk("%s: req: %p, num: %d, idx: %d, bio: %p, "
+			"revents: %x, flags: %x.\n",
+			__func__, req, req->num, req->idx, req->bio,
+			revents, req->flags);
+
+	if (req->flags & DST_REQ_EXPORT_READ)
+		return 1;
+
+	err = kst_data_process_bio(req);
+	if (err < 0)
+		goto err_out;
+
+	if (!req->size) {
+		dprintk("%s: complete: req: %p, bio: %p.\n",
+				__func__, req, req->bio);
+		kst_del_req(req);
+		kst_complete_req(req, 0);
+		return 0;
+	}
+
+	if (revents & (POLLERR | POLLHUP | POLLRDHUP)) {
+		err = -EPIPE;
+		goto err_out;
+	}
+
+	return 1;
+
+err_out:
+	return err;
+}
+EXPORT_SYMBOL_GPL(kst_data_callback);
+
+#define KST_CONG_COMPLETED		(0)
+#define KST_CONG_NOT_FOUND		(1)
+#define KST_CONG_QUEUE			(-1)
+
+/*
+ * kst_congestion - checks for data congestion, i.e. the case, when given
+ * 	block request crosses an area of another block request which
+ * 	is not yet sent to the remote node.
+ *
+ * @req: dst request containing block io related information.
+ *
+ * Return value:
+ * %KST_CONG_COMPLETED  - congestion was found and processed,
+ * 	bio must be ended, request is completed.
+ * %KST_CONG_NOT_FOUND  - no congestion found,
+ * 	request must be processed as usual
+ * %KST_CONG_QUEUE - congestion has been found, but bio is not completed,
+ * 	new request must be allocated and processed.
+ */
+static int kst_congestion(struct dst_request *req)
+{
+	int cmp, i;
+	struct kst_state *st = req->state;
+	struct rb_node *n = st->request_root.rb_node;
+	struct dst_request *old = NULL, *dst_req, *src_req;
+
+	while (n) {
+		src_req = rb_entry(n, struct dst_request, request_entry);
+		cmp = dst_compare_request_id(src_req, req);
+
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else {
+			old = src_req;
+			break;
+		}
+	}
+
+	if (likely(!old))
+		return KST_CONG_NOT_FOUND;
+
+	dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+			"new: op: %lu, start: %llu, size: %llu, off: %u.\n",
+		__func__, bio_rw(old->bio), old->start, old->orig_size,
+		old->offset,
+		bio_rw(req->bio), req->start, req->orig_size, req->offset);
+
+	if ((bio_rw(old->bio) != WRITE) && (bio_rw(req->bio) != WRITE)) {
+		return KST_CONG_QUEUE;
+	}
+
+	if (unlikely(req->offset != old->offset))
+		return KST_CONG_QUEUE;
+
+	src_req = old;
+	dst_req = req;
+	if (bio_rw(req->bio) == WRITE) {
+		dst_req = old;
+		src_req = req;
+	}
+
+	/* Actually we could partially complete new request by copying
+	 * part of the first one, but not now, consider this as a
+	 * (low-priority) todo item.
+	 */
+	if (src_req->start + to_sector(src_req->orig_size) <
+			dst_req->start + to_sector(dst_req->orig_size))
+		return KST_CONG_QUEUE;
+
+	/*
+	 * So, only process if new request is different from old one,
+	 * or subsequent write, i.e.:
+	 * - not completed write and request to read
+	 * - not completed read and request to write
+	 * - not completed write and request to (over)write
+	 */
+	for (i = old->idx; i < old->num; ++i) {
+		struct bio_vec *bv_src, *bv_dst;
+		void *src, *dst;
+		u64 len;
+
+		bv_src = bio_iovec_idx(src_req->bio, i);
+		bv_dst = bio_iovec_idx(dst_req->bio, i);
+
+		if (unlikely(bv_dst->bv_offset != bv_src->bv_offset))
+			return KST_CONG_QUEUE;
+
+		if (unlikely(bv_dst->bv_len != bv_src->bv_len))
+			return KST_CONG_QUEUE;
+
+		src = kmap_atomic(bv_src->bv_page, KM_USER0);
+		dst = kmap_atomic(bv_dst->bv_page, KM_USER1);
+
+		len = min_t(u64, bv_dst->bv_len, dst_req->size);
+
+		memcpy(dst + bv_dst->bv_offset, src + bv_src->bv_offset, len);
+
+		kunmap_atomic(src, KM_USER0);
+		kunmap_atomic(dst, KM_USER1);
+
+		dst_req->idx++;
+		dst_req->size -= len;
+		dst_req->offset = 0;
+		dst_req->start += to_sector(len);
+
+		if (!dst_req->size)
+			break;
+	}
+
+	if (req == dst_req)
+		return KST_CONG_COMPLETED;
+
+	kst_del_req(dst_req);
+	kst_complete_req(dst_req, 0);
+
+	return KST_CONG_NOT_FOUND;
+}
+
+struct dst_request *dst_clone_request(struct dst_request *req, mempool_t *pool)
+{
+	struct dst_request *new_req;
+
+	new_req = mempool_alloc(pool, GFP_NOIO);
+	if (!new_req)
+		return NULL;
+
+	memset(new_req, 0, sizeof(struct dst_request));
+
+	dprintk("%s: req: %p, new_req: %p, bio: %p.\n",
+			__func__, req, new_req, req ? req->bio : NULL);
+
+	RB_CLEAR_NODE(&new_req->request_entry);
+
+	if (req) {
+		new_req->bio = req->bio;
+		new_req->state = req->state;
+		new_req->node = req->node;
+		new_req->idx = req->idx;
+		new_req->num = req->num;
+		new_req->size = req->size;
+		new_req->orig_size = req->orig_size;
+		new_req->offset = req->offset;
+		new_req->start = req->start;
+		new_req->flags = req->flags;
+		new_req->bio_endio = req->bio_endio;
+		new_req->priv = req->priv;
+	}
+
+	return new_req;
+}
+EXPORT_SYMBOL_GPL(dst_clone_request);
+
+void dst_free_request(struct dst_request *req)
+{
+	dprintk("%s: free req: %p, pool: %p, bio: %p, state: %p, node: %p.\n",
+			__func__, req, req->node->w->req_pool,
+			req->bio, req->state, req->node);
+	mempool_free(req, req->node->w->req_pool);
+}
+EXPORT_SYMBOL_GPL(dst_free_request);
+
+/*
+ * This is main data processing function, eventually invoked from block layer.
+ * It tries to complete request, but if it is about to block, it allocates
+ * new request and queues it to main worker to be processed when events allow.
+ */
+static int kst_data_push(struct dst_request *req)
+{
+	struct kst_state *st = req->state;
+	struct dst_request *new_req;
+	unsigned int revents;
+	int err, locked = 0;
+
+	dprintk("%s: start: %llu, size: %llu, bio: %p.\n",
+			__func__, req->start, req->size, req->bio);
+
+	if (mutex_trylock(&st->request_lock)) {
+		locked = 1;
+
+		if (st->flags & (KST_FLAG_PARTIAL | DST_REQ_ALWAYS_QUEUE))
+			goto alloc_new_req;
+
+		err = kst_congestion(req);
+		if (err == KST_CONG_COMPLETED) {
+			err = 0;
+			goto out_bio_endio;
+		}
+
+		if (err == KST_CONG_NOT_FOUND) {
+			revents = st->socket->ops->poll(NULL, st->socket, NULL);
+			dprintk("%s: st: %p, bio: %p, revents: %x.\n",
+					__func__, st, req->bio, revents);
+			if (revents & POLLOUT) {
+				err = kst_data_process_bio(req);
+				if (err < 0)
+					goto out_unlock;
+
+				if (!req->size) {
+					err = 0;
+					goto out_bio_endio;
+				}
+			}
+		}
+	}
+
+alloc_new_req:
+	err = -ENOMEM;
+	new_req = dst_clone_request(req, req->node->w->req_pool);
+	if (!new_req)
+		goto out_unlock;
+
+	new_req->callback = &kst_data_callback;
+
+	if (!locked)
+		mutex_lock(&st->request_lock);
+	locked = 1;
+
+	err = kst_enqueue_req(st, new_req);
+	mutex_unlock(&st->request_lock);
+	locked = 0;
+	if (err) {
+		printk(KERN_NOTICE "%s: congestion [%c], start: %llu, idx: %d,"
+				" num: %d, size: %llu, offset: %u, err: %d.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size,
+			req->offset, err);
+	}
+
+	kst_wake(st);
+
+	return 0;
+
+out_bio_endio:
+	req->bio_endio(req, err);
+out_unlock:
+	if (locked)
+		mutex_unlock(&st->request_lock);
+	locked = 0;
+
+	if (err) {
+		err = kst_error(st, err);
+		if (!err)
+			goto alloc_new_req;
+	}
+
+	if (err) {
+		printk("%s: error [%c], start: %llu, idx: %d, num: %d, "
+				"size: %llu, offset: %u, err: %d.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size,
+			req->offset, err);
+		req->bio_endio(req, err);
+	}
+
+	kst_wake(st);
+	return err;
+}
+
+/*
+ * Remote node initialization callback.
+ */
+static int kst_data_init(struct kst_state *st, void *data)
+{
+	int err;
+
+	st->socket = data;
+	st->socket->sk->sk_allocation = GFP_NOIO;
+	/*
+	 * Why not?
+	 */
+	st->socket->sk->sk_sndbuf = st->socket->sk->sk_rcvbuf = 1024*1024*10;
+
+	err = kst_poll_init(st);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/*
+ * Remote node recovery function - tries to reconnect to given target.
+ */
+static int kst_data_recovery(struct kst_state *st, int err)
+{
+	struct socket *sock;
+	struct sockaddr addr;
+	int addrlen;
+	struct dst_request *req;
+
+	if (err != -ECONNRESET && err != -EPIPE) {
+		dprintk("%s: state %p does not know how "
+				"to recover from error %d.\n",
+				__func__, st, err);
+		return err;
+	}
+
+	err = sock_create(st->socket->ops->family, st->socket->type,
+			st->socket->sk->sk_protocol, &sock);
+	if (err < 0)
+		goto err_out_exit;
+
+	sock->sk->sk_sndtimeo = sock->sk->sk_rcvtimeo =
+		msecs_to_jiffies(DST_DEFAULT_TIMEO);
+
+	err = sock->ops->getname(st->socket, &addr, &addrlen, 2);
+	if (err)
+		goto err_out_destroy;
+
+	err = sock->ops->connect(sock, &addr, addrlen, 0);
+	if (err)
+		goto err_out_destroy;
+
+	kst_poll_exit(st);
+	kst_sock_release(st);
+
+	mutex_lock(&st->request_lock);
+	err = st->ops->init(st, sock);
+	if (!err) {
+		/*
+		 * After reconnection is completed all requests
+		 * must be resent from the point where they previously stopped,
+		 * but with new headers.
+		 */
+		list_for_each_entry(req, &st->request_list, request_list_entry)
+			req->flags &= ~DST_REQ_HEADER_SENT;
+	}
+	mutex_unlock(&st->request_lock);
+	if (err < 0)
+		goto err_out_destroy;
+
+	kst_wake(st);
+	dprintk("%s: recovery completed.\n", __func__);
+
+	return 0;
+
+err_out_destroy:
+	sock_release(sock);
+err_out_exit:
+	dprintk("%s: recovery failed: st: %p, err: %d.\n", __func__, st, err);
+	return err;
+}
+
+static inline void kst_convert_header(struct dst_remote_request *r)
+{
+	r->cmd = be32_to_cpu(r->cmd);
+	r->sector = be64_to_cpu(r->sector);
+	r->offset = be32_to_cpu(r->offset);
+	r->size = be32_to_cpu(r->size);
+	r->flags = be32_to_cpu(r->flags);
+}
+
+/*
+ * Local exporting node end IO callbacks.
+ */
+static int kst_export_write_end_io(struct bio *bio, unsigned int size, int err)
+{
+	dprintk("%s: bio: %p, size: %u, idx: %d, num: %d, err: %d.\n",
+		__func__, bio, bio->bi_size, bio->bi_idx, bio->bi_vcnt, err);
+
+	if (bio->bi_size)
+		return 1;
+
+	kst_export_put_bio(bio);
+	return 0;
+}
+
+static int kst_export_read_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct dst_request *req = bio->bi_private;
+	struct kst_state *st = req->state;
+
+	dprintk("%s: bio: %p, req: %p, size: %u, idx: %d, num: %d, err: %d.\n",
+		__func__, bio, req, bio->bi_size, bio->bi_idx,
+		bio->bi_vcnt, err);
+
+	if (bio->bi_size)
+		return 1;
+
+	bio->bi_size = req->size = req->orig_size;
+	bio->bi_rw = WRITE;
+	req->flags &= ~DST_REQ_EXPORT_READ;
+	kst_wake(st);
+	return 0;
+}
+
+/*
+ * This callback is invoked each time new request from remote
+ * node to given local export node is received.
+ * It allocates new block IO request and queues it for processing.
+ */
+static int kst_export_ready(struct kst_state *st)
+{
+	struct dst_remote_request r;
+	struct msghdr msg;
+	struct kvec iov;
+	struct bio *bio;
+	int err, nr, i;
+	struct dst_request *req;
+	sector_t data_size;
+	unsigned int revents = st->socket->ops->poll(NULL, st->socket, NULL);
+
+	if (revents & (POLLERR | POLLHUP)) {
+		err = -EPIPE;
+		goto err_out_exit;
+	}
+
+	if (!(revents & POLLIN) || !list_empty(&st->request_list))
+		return 0;
+
+	iov.iov_base = &r;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+	err = kernel_recvmsg(st->socket, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (err != sizeof(struct dst_remote_request)) {
+		err = -EINVAL;
+		goto err_out_exit;
+	}
+
+	kst_convert_header(&r);
+
+	dprintk("\n%s: cmd: %u, sector: %llu, size: %u, "
+			"flags: %x, offset: %u.\n",
+			__func__, r.cmd, r.sector, r.size, r.flags, r.offset);
+
+	err = -EINVAL;
+	if (r.cmd != DST_READ && r.cmd != DST_WRITE && r.cmd != DST_REMOTE_CFG)
+		goto err_out_exit;
+
+	data_size = get_capacity(st->node->bdev->bd_disk);
+	if ((signed)(r.sector + to_sector(r.size)) < 0 ||
+			(signed)(r.sector + to_sector(r.size)) > data_size ||
+			(signed)r.sector > data_size)
+		goto err_out_exit;
+
+	if (r.cmd == DST_REMOTE_CFG) {
+		r.sector = data_size;
+		kst_convert_header(&r);
+
+		iov.iov_base = &r;
+		iov.iov_len = sizeof(struct dst_remote_request);
+
+		msg.msg_iov = (struct iovec *)&iov;
+		msg.msg_iovlen = 1;
+		msg.msg_name = NULL;
+		msg.msg_namelen = 0;
+		msg.msg_control = NULL;
+		msg.msg_controllen = 0;
+		msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+		err = kernel_sendmsg(st->socket, &msg, &iov, 1, iov.iov_len);
+		if (err != sizeof(struct dst_remote_request)) {
+			err = -EINVAL;
+			goto err_out_exit;
+		}
+		kst_wake(st);
+		return 0;
+	}
+
+	nr = r.size/PAGE_SIZE + 1;
+
+	while (r.size) {
+		int nr_pages = min(BIO_MAX_PAGES, nr);
+		unsigned int size;
+		struct page *page;
+
+		err = -ENOMEM;
+		req = dst_clone_request(NULL, st->node->w->req_pool);
+		if (!req)
+			goto err_out_exit;
+
+		dprintk("%s: alloc req: %p, pool: %p.\n",
+				__func__, req, st->node->w->req_pool);
+
+		bio = bio_alloc(GFP_NOIO, nr_pages);
+		if (!bio)
+			goto err_out_free_req;
+
+		req->flags = DST_REQ_EXPORT | DST_REQ_HEADER_SENT;
+		req->bio = bio;
+		req->state = st;
+		req->node = st->node;
+		req->callback = &kst_data_callback;
+		req->bio_endio = &kst_bio_endio;
+
+		/*
+		 * Yes, looks a bit weird.
+		 * Logic is simple - for local exporting node all operations
+		 * are reversed compared to usual nodes, since usual nodes
+		 * process remote data and local export node process remote
+		 * requests, so that writing data means sending data to
+		 * remote node and receiving on the local export one.
+		 *
+		 * So, to process writing to the exported node we need first 
+		 * to receive data from the net (i.e. to perform READ 
+		 * operation in terms of usual node), and then put it to the
+		 * storage (WRITE command, so it will be changed before 
+		 * calling generic_make_request()).
+		 *
+		 * To process read request from the exported node we need
+		 * first to read it from storage (READ command for BIO)
+		 * and then send it over the net (perform WRITE operation
+		 * in terms of network).
+		 */
+		if (r.cmd == DST_WRITE) {
+			req->flags |= DST_REQ_EXPORT_WRITE;
+			bio->bi_end_io = kst_export_write_end_io;
+		} else {
+			req->flags |= DST_REQ_EXPORT_READ;
+			bio->bi_end_io = kst_export_read_end_io;
+		}
+		bio->bi_rw = READ;
+		bio->bi_private = req;
+		bio->bi_sector = r.sector;
+		bio->bi_bdev = st->node->bdev;
+
+		for (i = 0; i < nr_pages; ++i) {
+			page = alloc_page(GFP_NOIO);
+			if (!page)
+				break;
+
+			size = min_t(u32, PAGE_SIZE, r.size);
+
+			err = bio_add_page(bio, page, size, r.offset);
+			dprintk("%s: %d/%d: page: %p, size: %u, offset: %u, "
+					"err: %d.\n",
+					__func__, i, nr_pages, page, size,
+					r.offset, err);
+			if (err <= 0)
+				break;
+
+			if (err == size) {
+				r.offset = 0;
+				nr--;
+			} else {
+				r.offset += err;
+			}
+
+			r.size -= err;
+			r.sector += to_sector(err);
+
+			if (!r.size)
+				break;
+		}
+
+		if (!bio->bi_vcnt) {
+			err = -ENOMEM;
+			goto err_out_put;
+		}
+
+		req->size = req->orig_size = bio->bi_size;
+		req->start = bio->bi_sector;
+		req->idx = 0;
+		req->num = bio->bi_vcnt;
+
+		dprintk("%s: submitting: bio: %p, req: %p, start: %llu, "
+			"size: %llu, idx: %d, num: %d, offset: %u, err: %d.\n",
+			__func__, bio, req, req->start, req->size,
+			req->idx, req->num, req->offset, err);
+
+		err = kst_enqueue_req(st, req);
+		if (err)
+			goto err_out_put;
+
+		if (r.cmd == DST_READ) {
+			generic_make_request(bio);
+		}
+	}
+
+	kst_wake(st);
+	return 0;
+
+err_out_put:
+	bio_put(bio);
+err_out_free_req:
+	dst_free_request(req);
+err_out_exit:
+	dprintk("%s: error: %d.\n", __func__, err);
+	return err;
+}
+
+static void kst_export_exit(struct kst_state *st)
+{
+	struct dst_node *n = st->node;
+
+	dprintk("%s: st: %p.\n", __func__, st);
+
+	kst_common_exit(st);
+	dst_node_put(n);
+}
+
+static struct kst_state_ops kst_data_export_ops = {
+	.init = &kst_data_init,
+	.push = &kst_data_push,
+	.exit = &kst_export_exit,
+	.ready = &kst_export_ready,
+};
+
+/*
+ * This callback is invoked each time listening socket for
+ * given local export node becomes ready.
+ * It creates new state for connected client and queues for processing.
+ */
+static int kst_listen_ready(struct kst_state *st)
+{
+	struct socket *newsock;
+	struct saddr addr;
+	struct kst_state *newst;
+	int err;
+	unsigned int revents, permissions = 0;
+	struct dst_secure *s;
+
+	revents = st->socket->ops->poll(NULL, st->socket, NULL);
+	if (!(revents & POLLIN))
+		return 1;
+
+	err = sock_create(st->socket->ops->family, st->socket->type,
+			st->socket->sk->sk_protocol, &newsock);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->accept(st->socket, newsock, 0);
+	if (err)
+		goto err_out_put;
+
+	if (newsock->ops->getname(newsock, (struct sockaddr *)&addr,
+				  (int *)&addr.sa_data_len, 2) < 0) {
+		err = -ECONNABORTED;
+		goto err_out_put;
+	}
+
+	list_for_each_entry(s, &st->request_list, sec_entry) {
+		void *sec_addr, *new_addr;
+
+		sec_addr = ((void *)&s->sec.addr) + s->sec.check_offset;
+		new_addr = ((void *)&addr) + s->sec.check_offset;
+
+		if (!memcmp(sec_addr, new_addr,	
+				addr.sa_data_len - s->sec.check_offset)) {
+			permissions = s->sec.permissions;
+			break;
+		}
+	}
+
+	/*
+	 * So far only reading and writing are supported.
+	 * Block device does not know about anything else,
+	 * but as far as I recall, there was a prognosis,
+	 * that computer will never require more than 640kb of RAM.
+	 */
+	if (permissions == 0) {
+		err = -EPERM;
+		goto err_out_put;
+	}
+
+	if (st->socket->ops->family == AF_INET) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+		printk(KERN_INFO "%s: Client: %u.%u.%u.%u:%d.\n", __func__,
+			NIPQUAD(sin->sin_addr.s_addr), ntohs(sin->sin_port));
+	} else if (st->socket->ops->family == AF_INET6) {
+		struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+		printk(KERN_INFO "%s: Client: "
+			"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d.\n",
+			__func__, 
+			NIP6(sin->sin6_addr), ntohs(sin->sin6_port));
+	}
+
+	dst_node_get(st->node);
+	newst = kst_state_init(st->node, permissions,
+			&kst_data_export_ops, newsock);
+	if (IS_ERR(newst)) {
+		err = PTR_ERR(newst);
+		goto err_out_put;
+	}
+
+	/*
+	 * Negative return value means error, positive - stop this state 
+	 * processing. Zero allows to check state for pending requests.
+	 * Listening socket contains security objects in request list,
+	 * since it does not have any requests.
+	 */
+	return 1;
+
+err_out_put:
+	sock_release(newsock);
+err_out_exit:
+	return 1;
+}
+
+static int kst_listen_init(struct kst_state *st, void *data)
+{
+	int err = -ENOMEM, i;
+	struct dst_le_template *tmp = data;
+	struct dst_secure *s;
+
+	for (i=0; i<tmp->le->secure_attr_num; ++i) {
+		s = kmalloc(sizeof(struct dst_secure), GFP_KERNEL);
+		if (!s)
+			goto err_out_exit;
+
+		memcpy(&s->sec, tmp->data, sizeof(struct dst_secure_user));
+
+		list_add_tail(&s->sec_entry, &st->request_list);
+		tmp->data += sizeof(struct dst_secure_user);
+
+		if (s->sec.addr.sa_family == AF_INET) {
+			struct sockaddr_in *sin = 
+				(struct sockaddr_in *)&s->sec.addr;
+			printk(KERN_INFO "%s: Client: %u.%u.%u.%u:%d, "
+					"permissions: %x.\n", 
+				__func__, NIPQUAD(sin->sin_addr.s_addr), 
+				ntohs(sin->sin_port), s->sec.permissions);
+		} else if (s->sec.addr.sa_family == AF_INET6) {
+			struct sockaddr_in6 *sin = 
+				(struct sockaddr_in6 *)&s->sec.addr;
+			printk(KERN_INFO "%s: Client: "
+				"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d, "
+				"permissions: %x.\n", 
+				__func__, NIP6(sin->sin6_addr), 
+				ntohs(sin->sin6_port), s->sec.permissions);
+		}
+	}
+
+	err = kst_sock_create(st, &tmp->le->rctl.addr, tmp->le->rctl.type,
+			tmp->le->rctl.proto, tmp->le->backlog);
+	if (err)
+		goto err_out_exit;
+
+	err = kst_poll_init(st);
+	if (err)
+		goto err_out_release;
+
+	return 0;
+
+err_out_release:
+	kst_sock_release(st);
+err_out_exit:
+	kst_listen_flush(st);
+	return err;
+}
+
+/*
+ * Operations for different types of states.
+ * There are three:
+ * data state - created for remote node, when distributed storage connects
+ * 	to remote node, which contain data.
+ * listen state - created for local export node, when remote distributed
+ * 	storage's node connects to given node to get/put data.
+ * data export state - created for each client connected to above listen
+ * 	state.
+ */
+static struct kst_state_ops kst_listen_ops = {
+	.init = &kst_listen_init,
+	.exit = &kst_listen_exit,
+	.ready = &kst_listen_ready,
+};
+static struct kst_state_ops kst_data_ops = {
+	.init = &kst_data_init,
+	.push = &kst_data_push,
+	.exit = &kst_common_exit,
+	.recovery = &kst_data_recovery,
+};
+
+struct kst_state *kst_listener_state_init(struct dst_node *node,
+		struct dst_le_template *tmp)
+{
+	return kst_state_init(node, DST_PERM_READ | DST_PERM_WRITE,
+			&kst_listen_ops, tmp);
+}
+
+struct kst_state *kst_data_state_init(struct dst_node *node,
+		struct socket *newsock)
+{
+	return kst_state_init(node, DST_PERM_READ | DST_PERM_WRITE,
+			&kst_data_ops, newsock);
+}
+
+/*
+ * Remove all workers and associated states.
+ */
+void kst_exit_all(void)
+{
+	struct kst_worker *w, *n;
+
+	list_for_each_entry_safe(w, n, &kst_worker_list, entry) {
+		kst_worker_exit(w);
+	}
+}
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 10eb56b..9e67d58 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -36,9 +36,11 @@
 #define CN_VAL_CIFS                     0x1
 #define CN_W1_IDX			0x3	/* w1 communication */
 #define CN_W1_VAL			0x1
+#define CN_DST_IDX			0x4	/* Distributed storage */
+#define CN_DST_VAL			0x1
 
 
-#define CN_NETLINK_USERS		4
+#define CN_NETLINK_USERS		5
 
 /*
  * Maximum connector's message size.
diff --git a/include/linux/dst.h b/include/linux/dst.h
new file mode 100644
index 0000000..3fd41dd
--- /dev/null
+++ b/include/linux/dst.h
@@ -0,0 +1,354 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __DST_H
+#define __DST_H
+
+#include <linux/types.h>
+
+#define DST_NAMELEN		32
+#define DST_NAME		"dst"
+#define DST_IOCTL		0xba
+
+enum {
+	DST_DEL_NODE	= 0,	/* Remove node with given id from storage */
+	DST_ADD_REMOTE,		/* Add remote node with given id to the storage */
+	DST_ADD_LOCAL,		/* Add local node with given id to the storage */
+	DST_ADD_LOCAL_EXPORT,	/* Add local node with given id to the storage to be exported and used by remote peers */
+	DST_START_STORAGE,	/* Array is ready and storage can be started, if there will be new nodes
+				 * added to the storage, they will be checked against existing size and
+				 * probably be dropped (for example in mirror format when new node has smaller
+				 * size than array created) or inserted.
+				 */
+	DST_STOP_STORAGE,	/* Remove array and all nodes. */
+	DST_CMD_MAX
+};
+
+#define DST_CTL_FLAGS_REMOTE	(1<<0)
+#define DST_CTL_FLAGS_EXPORT	(1<<1)
+
+struct dst_ctl
+{
+	char			st[DST_NAMELEN];
+	char			alg[DST_NAMELEN];
+	__u32			flags, cmd;
+	__u64			start, size;
+};
+
+struct dst_local_ctl
+{
+	char			name[DST_NAMELEN];
+};
+
+#define SADDR_MAX_DATA	128
+
+struct saddr {
+	unsigned short		sa_family;			/* address family, AF_xxx	*/
+	char			sa_data[SADDR_MAX_DATA];	/* protocol address, up to SADDR_MAX_DATA bytes */
+	unsigned short		sa_data_len;			/* Number of bytes used in sa_data */
+};
+
+struct dst_remote_ctl
+{
+	__u16			type;
+	__u16			proto;
+	struct saddr		addr;
+};
+
+#define DST_PERM_READ		(1<<0)
+#define DST_PERM_WRITE		(1<<1)
+
+/*
+ * Right now it is simple model, where each remote address
+ * is assigned to set of permissions it is allowed to perform.
+ * In real world block device does not know anything but
+ * reading and writing, so it should be more than enough.
+ */
+struct dst_secure_user
+{
+	unsigned int		permissions;
+	unsigned short		check_offset;
+	struct saddr		addr;
+};
+
+struct dst_local_export_ctl
+{
+	__u32			backlog;
+	int			secure_attr_num;
+	struct dst_local_ctl	lctl;
+	struct dst_remote_ctl	rctl;
+};
+
+enum {
+	DST_REMOTE_CFG		= 1, 		/* Request remote configuration */
+	DST_WRITE,				/* Writing */
+	DST_READ,				/* Reading */
+	DST_NCMD_MAX,
+};
+
+struct dst_remote_request
+{
+	__u32			cmd;
+	__u32			flags;
+	__u64			sector;
+	__u32			offset;
+	__u32			size;
+};
+
+#ifdef __KERNEL__
+
+#include <linux/rbtree.h>
+#include <linux/net.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/mempool.h>
+#include <linux/device.h>
+
+//#define DST_DEBUG
+
+#ifdef DST_DEBUG
+#define dprintk(f, a...) printk(KERN_NOTICE f, ##a)
+#else
+#define dprintk(f, a...) do {} while (0)
+#endif
+
+struct kst_worker
+{
+	struct list_head	entry;
+
+	struct list_head	state_list;
+	struct mutex		state_mutex;
+
+	struct list_head	ready_list;
+	spinlock_t		ready_lock;
+
+	mempool_t		*req_pool;
+
+	struct task_struct	*thread;
+
+	wait_queue_head_t 	wait;
+
+	int			id;
+};
+
+struct kst_state;
+struct dst_node;
+
+#define DST_REQ_HEADER_SENT	(1<<0)
+#define DST_REQ_EXPORT		(1<<1)
+#define DST_REQ_EXPORT_WRITE	(1<<2)
+#define DST_REQ_EXPORT_READ	(1<<3)
+#define DST_REQ_ALWAYS_QUEUE	(1<<4)
+
+struct dst_request
+{
+	struct rb_node		request_entry;
+	struct list_head	request_list_entry;
+	struct bio		*bio;
+	struct kst_state 	*state;
+	struct dst_node 	*node;
+
+	u32			flags;
+
+	int 			(*callback)(struct dst_request *dst,
+						unsigned int revents);
+	void			(*bio_endio)(struct dst_request *dst, 
+						int err);
+
+	void			*priv;
+	atomic_t		refcnt;
+
+	u64			size, orig_size, start;
+	int			idx, num;
+	u32			offset;
+};
+
+struct kst_state_ops
+{
+	int 		(*init)(struct kst_state *, void *);
+	int 		(*push)(struct dst_request *req);
+	int		(*ready)(struct kst_state *);
+	int		(*recovery)(struct kst_state *, int err);
+	void 		(*exit)(struct kst_state *);
+};
+
+#define KST_FLAG_PARTIAL		(1<<0)
+
+struct kst_state
+{
+	struct list_head	entry;
+	struct list_head	ready_entry;
+
+	wait_queue_t 		wait;
+	wait_queue_head_t 	*whead;
+
+	struct dst_node		*node;
+	struct socket		*socket;
+
+	u32			flags, permissions;
+
+	struct rb_root		request_root;
+	struct mutex		request_lock;
+	struct list_head	request_list;
+
+	struct kst_state_ops	*ops;
+};
+
+#define DST_DEFAULT_TIMEO	2000
+
+struct dst_storage;
+
+struct dst_alg_ops
+{
+	int			(*add_node)(struct dst_node *n);
+	void			(*del_node)(struct dst_node *n);
+	int 			(*remap)(struct dst_request *req);
+	int			(*error)(struct kst_state *state, int err);
+	struct module 		*owner;
+};
+
+struct dst_alg
+{
+	struct list_head	entry;
+	char			name[DST_NAMELEN];
+	atomic_t		refcnt;
+	struct dst_alg_ops	*ops;
+};
+
+#define DST_ST_STARTED		(1<<0)
+
+struct dst_storage
+{
+	struct list_head	entry;
+	char			name[DST_NAMELEN];
+	struct dst_alg		*alg;
+	atomic_t		refcnt;
+	struct mutex		tree_lock;
+	struct rb_root		tree_root;
+
+	request_queue_t		*queue;
+	struct gendisk		*disk;
+
+	long			flags;
+	u64			disk_size;
+
+	struct device		device;
+};
+
+#define DST_NODE_FROZEN		0
+#define DST_NODE_NOTSYNC	1
+
+struct dst_node
+{
+	struct rb_node		tree_node;
+
+	struct list_head	shared;
+	struct dst_node		*shared_head;
+
+	struct block_device 	*bdev;
+	struct dst_storage	*st;
+	struct kst_state	*state;
+	struct kst_worker	*w;
+
+	atomic_t		refcnt;
+	atomic_t		shared_num;
+
+	void			(*cleanup)(struct dst_node *);
+
+	long			flags;
+
+	u64			start, size;
+
+	void			(*priv_callback)(struct dst_node *);
+	void			*priv;
+
+	struct device		device;
+};
+
+struct dst_le_template
+{
+	struct dst_local_export_ctl	*le;
+	void 				*data;
+};
+
+struct dst_secure
+{
+	struct list_head	sec_entry;
+	struct dst_secure_user	sec;
+};
+
+void kst_state_exit(struct kst_state *st);
+
+struct kst_worker *kst_worker_init(int id);
+void kst_worker_exit(struct kst_worker *w);
+
+struct kst_state *kst_listener_state_init(struct dst_node *node,
+		struct dst_le_template *tmp);
+struct kst_state *kst_data_state_init(struct dst_node *node,
+		struct socket *newsock);
+
+void kst_wake(struct kst_state *st);
+
+void kst_exit_all(void);
+
+struct dst_alg *dst_alloc_alg(char *name, struct dst_alg_ops *ops);
+void dst_remove_alg(struct dst_alg *alg);
+
+struct dst_node *dst_storage_tree_search(struct dst_storage *st, u64 start);
+
+void dst_node_put(struct dst_node *n);
+
+static inline struct dst_node *dst_node_get(struct dst_node *n)
+{
+	atomic_inc(&n->refcnt);
+	return n;
+}
+
+struct dst_request *dst_clone_request(struct dst_request *req, mempool_t *pool);
+void dst_free_request(struct dst_request *req);
+
+void kst_complete_req(struct dst_request *req, int err);
+void kst_bio_endio(struct dst_request *req, int err);
+void kst_del_req(struct dst_request *req);
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req);
+
+int kst_data_callback(struct dst_request *req, unsigned int revents);
+
+extern struct kmem_cache *dst_request_cache;
+
+static inline sector_t to_sector(unsigned long n)
+{
+	return (n >> 9);
+}
+
+static inline unsigned long to_bytes(sector_t n)
+{
+	return (n << 9);
+}
+
+/*
+ * Checks state's permissions.
+ * Returns -EPERM if check failed.
+ */
+static inline int kst_check_permissions(struct kst_state *st, struct bio *bio)
+{
+	if ((bio_rw(bio) == WRITE) && !(st->permissions & DST_PERM_WRITE))
+		return -EPERM;
+
+	return 0;
+}
+
+#endif /* __KERNEL__ */
+#endif /* __DST_H */
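
A note on the bounds check in kst_export_ready(): the (signed) casts there
narrow the 64-bit sector values to int, so the comparison may not behave as
intended on large devices. Below is a minimal userspace sketch of an
overflow-safe variant, assuming 512-byte sectors as in to_sector(). The name
request_in_range() is hypothetical and not part of the patch.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not from the patch: returns true if a request
 * starting at 'sector' and covering 'bytes' bytes fits inside a device
 * of 'capacity' sectors.
 */
static bool request_in_range(uint64_t sector, uint32_t bytes, uint64_t capacity)
{
	uint64_t nsect = bytes >> 9;	/* same shift as to_sector() */

	if (sector > capacity)
		return false;

	/* Compare against the remaining room, so no addition can wrap. */
	return nsect <= capacity - sector;
}
```

Comparing the request length against (capacity - sector) avoids both the
narrowing cast and any wrap-around in sector + length.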

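For anyone writing a userspace peer: kst_convert_header() implies that
struct dst_remote_request travels in big-endian byte order. Below is a
minimal sketch of decoding the header, assuming the fields are laid out in
declaration order with no padding (cmd, flags, sector, offset, size, 24 bytes
in total). The names wire_request, get_be32(), get_be64() and
decode_request() are illustrative only and do not exist in the patch.

```c
#include <stdint.h>

/* Host-order copy of the on-wire header (illustrative name). */
struct wire_request {
	uint32_t cmd, flags;
	uint64_t sector;
	uint32_t offset, size;
};

/* Read a big-endian 32-bit value from an unaligned byte buffer. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value as two 32-bit halves. */
static uint64_t get_be64(const unsigned char *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Assumed layout: cmd@0, flags@4, sector@8, offset@16, size@20. */
static void decode_request(const unsigned char buf[24], struct wire_request *r)
{
	r->cmd    = get_be32(buf);
	r->flags  = get_be32(buf + 4);
	r->sector = get_be64(buf + 8);
	r->offset = get_be32(buf + 16);
	r->size   = get_be32(buf + 20);
}
```

Byte-wise decoding sidesteps alignment and host-endianness questions, which
is why it is used here instead of casting the buffer to a struct.
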
-- 
	Evgeniy Polyakov

^ permalink raw reply related

* Re: Distributed storage. Move away from char device ioctls.
From: Jeff Garzik @ 2007-09-14 19:07 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev, linux-kernel, linux-fsdevel
In-Reply-To: <20070914185429.GA9439@2ka.mipt.ru>

Evgeniy Polyakov wrote:
> Hi.
> 
> I'm pleased to announce fourth release of the distributed storage
> subsystem, which allows to form a storage on top of remote and local
> nodes, which in turn can be exported to another storage as a node to
> form tree-like storages.
> 
> This release includes new configuration interface (kernel connector over
> netlink socket) and number of fixes of various bugs found during move 
> to it (in error path).
> 
> Further TODO list includes:
> * implement optional saving of mirroring/linear information on the remote
> 	nodes (simple)
> * new redundancy algorithm (complex)
> * some thoughts about distributed filesystem tightly connected to DST
> 	(far-far planes so far)
> 
> Homepage:
> http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
> 
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

My thoughts.  But first a disclaimer:   Perhaps you will recall me as 
one of the people who really reads all your patches, and examines your 
code and proposals closely.  So, with that in mind...

I question the value of distributed block services (DBS), whether it's
your version or the others out there.  DBS are not very useful, because
they still rely on a useful filesystem sitting on top of the DBS.  It
devolves into one of two cases:  (1) multi-path much like today's SCSI,
with distributed filesystem arbitration to ensure coherency, or (2) the
filesystem running on top of the DBS is on a single host, and thus, a
single point of failure (SPOF).

It is quite logical to extend the concepts of RAID across the network, 
but ultimately you are still bound by the inflexibility and simplicity 
of the block device.

In contrast, a distributed filesystem offers far more scalability, 
eliminates single points of failure, and offers more room for 
optimization and redundancy across the cluster.

A distributed filesystem is also much more complex, which is why 
distributed block devices are so appealing :)

With a redundant, distributed filesystem, you simply do not need any 
complexity at all at the block device level.  You don't even need RAID.

It is my hope that you will put your skills towards a distributed 
filesystem :)  Of the current solutions, GFS (currently in kernel) 
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.

I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.

	Jeff

^ permalink raw reply

* [v2 PATCH 8/8] SCTP: Tie ADD-IP and AUTH functionality as required by spec.
From: Vlad Yasevich @ 2007-09-14 19:14 UTC (permalink / raw)
  To: lksctp-developers; +Cc: netdev, Vlad Yasevich
In-Reply-To: <11897955003232-git-send-email-vladislav.yasevich@hp.com>

[.. forgot to refresh the patch, the other version has compile problems ..]

ADD-IP spec requires AUTH. It is, in fact, dangerous without AUTH.
So, disable ADD-IP functionality if the peer claims to support
ADD-IP, but not AUTH.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
---
 include/net/sctp/structs.h |    1 +
 net/sctp/sm_make_chunk.c   |   13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 3215da4..a29c59a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1537,6 +1537,7 @@ struct sctp_association {
 		__u8    asconf_capable;  /* Does peer support ADDIP? */
 		__u8    prsctp_capable;  /* Can peer do PR-SCTP? */
 		__u8	auth_capable;	 /* Is peer doing SCTP-AUTH? */
+		__u8	addip_capable;	 /* Can peer do ADD-IP */
 
 		__u32   adaptation_ind;	 /* Adaptation Code point. */
 
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 7cd8241..5521841 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1851,7 +1851,8 @@ static void sctp_process_ext_param(struct sctp_association *asoc,
 			    break;
 		    case SCTP_CID_ASCONF:
 		    case SCTP_CID_ASCONF_ACK:
-			    /* don't need to do anything for ASCONF */
+			    asoc->peer.addip_capable = 1;
+			    break;
 		    default:
 			    break;
 		}
@@ -2137,6 +2138,16 @@ int sctp_process_init(struct sctp_association *asoc, sctp_cid_t cid,
 					!asoc->peer.peer_hmacs))
 		asoc->peer.auth_capable = 0;
 
+
+	/* If the peer claims support for ADD-IP without support
+	 * for AUTH, disable support for ADD-IP.
+	 */
+	if (asoc->peer.addip_capable && !asoc->peer.auth_capable) {
+		asoc->peer.addip_disabled_mask |= (SCTP_PARAM_ADD_IP |
+						  SCTP_PARAM_DEL_IP |
+						  SCTP_PARAM_SET_PRIMARY);
+	}
+
 	/* Walk list of transports, removing transports in the UNKNOWN state. */
 	list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
 		transport = list_entry(pos, struct sctp_transport, transports);
-- 
1.5.2.4


^ permalink raw reply related

* Re: [git patches] net driver fixes
From: Jay Vosburgh @ 2007-09-14 19:19 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jeff Garzik, Andrew Morton, Linus Torvalds, netdev, LKML
In-Reply-To: <1189794677.2508.19.camel@xo-3E-67-34.localdomain>

Dan Williams <dcbw@redhat.com> wrote:
[...]
>I admit that I probably don't understand the system architecture of
>where ehea would be used, but would this
>cause /sys/class/net/ethX/carrier to be TRUE even if the device has no
>carrier?  That seems quite wrong IMHO.  When does ehea not have a
>carrier?  And in that case, does sysfs say 1 or 0 for the carrier?

	I don't work on ehea, but I'm generally familiar with it, and
particularly with this patch.

	The usual environment for ehea devices is on large systems
subdivided into multiple logical partitions.  One ehea device serves
many partitions.  By having ehea always report "link up" to the logical
ports (the ports seen by the partitions), the partitions can communicate
amongst themselves even if the external ports (the ports that go to the
switch or whatever) have no link.  

	The ehea device, more or less, acts as a switch connecting the
partitions together.  This switch type of functionality is not dependent
upon the link state of the external ports (any more than the
functionality of any switch is dependent upon whether or not it is
connected to a gateway).

	This, if I'm not mistaken, is the way ehea has always operated
until this particular patch was added.

	This patch (to optionally pass carrier state to the logical
ports) was added largely for bonding, so that the bonding driver can
detect link failures on the external ports (when so desired).  The
default behavior remains the original behavior, i.e., do not pass
external port link state to the logical ports.

	Anyway, to answer your question, the carrier state reported for
the ehea interface on the partition will always be true.  Think of it as
reporting the link state from the logical interface to the "switch" that
connects the partitions; that link exists only within the ehea device
itself, and really can't fail unless the ehea device itself fails.

	With the new option enabled, ehea is more or less mimicking
a trunk failover type of function, passing the carrier state of the
"external switch port" to the internal port.
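
	Put as a rough sketch (a plain C model, not ehea driver code; the
struct, the field names and the option flag below are made up for
illustration and are not the driver's actual parameter):

```c
#include <stdbool.h>

struct logical_port_model {
	bool propagate_ext_link;	/* hypothetical stand-in for the new option */
	bool ext_port_link_up;		/* link state of the external (physical) port */
};

/* Carrier as the partition would see it on its logical port. */
static bool logical_carrier(const struct logical_port_model *p)
{
	if (!p->propagate_ext_link)
		return true;		/* default: internal "switch" link is always up */

	return p->ext_port_link_up;	/* option on: mirror the external port */
}
```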

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH net-2.6.24] introduce MAC_FMT/MAC_ARG
From: Joe Perches @ 2007-09-14 19:41 UTC (permalink / raw)
  To: David Miller; +Cc: johannes, netdev, Andrew Morton, Jeff Garzik
In-Reply-To: <1188598563.6062.191.camel@localhost>

David?  Did you ever get a chance to look at this?
Do you want me to rebase it against your newer net-2.6.24?

http://repo.or.cz/w/linux-2.6/trivial-mods.git
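
For reference, the helpers declared in the if_ether.h hunk quoted below are
meant to be used roughly as follows. This is a userspace approximation,
assuming print_mac() formats the address with MAC_FMT into an 18-byte buffer
and returns that buffer. The function body here is a sketch, not the code
from the tree, and __maybe_unused is dropped so it builds outside the kernel.

```c
#include <stdio.h>

typedef unsigned char u8;

#define MAC_FMT "%02x:%02x:%02x:%02x:%02x:%02x"
#define DECLARE_MAC_BUF(var) char var[18]

/* Format a 6-byte address into buf (17 characters plus NUL) and return buf. */
static char *print_mac(char *buf, const u8 *addr)
{
	snprintf(buf, 18, MAC_FMT,
		 addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
	return buf;
}
```

A caller would then declare a buffer with DECLARE_MAC_BUF(mac) and pass
print_mac(mac, dev->dev_addr) as a "%s" argument, instead of spelling out
MAC_FMT plus six MAC_ARG bytes at every call site.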

> I've inlined the include changes, but the entire patch
> is quite large. (300KB)
> 
> MAC address format changes:
> 
> UPPER->lower case changes in MAC addresses in printks.
> presentation from "%x %x..." to "%02x:%02x...". 
> seq_printf uses of mac addresses
> 
> Perhaps the seq_printf changes might cause problems
> with usermode programs.
> 
> please pull from:
> git pull git://repo.or.cz/linux-2.6/trivial-mods.git net-2.6.24-print_mac
> 
> Signed-off-by: Joe Perches <joe@perches.com>
> 
> --
> 
>  include/linux/if_ether.h |    7 +++++++
>  include/net/ieee80211.h  |    5 -----
>  include/net/mac80211.h   |    4 ----
>  3 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/if_ether.h b/include/linux/if_ether.h
> index 3213f6f..bb3eb51 100644
> --- a/include/linux/if_ether.h
> +++ b/include/linux/if_ether.h
> @@ -122,4 +122,11 @@ extern struct ctl_table ether_table[];
>  #endif
>  #endif
>  
> +/*
> + *      Display a 6 byte device address (MAC) in a readable format.
> + */
> +#define MAC_FMT "%02x:%02x:%02x:%02x:%02x:%02x"
> +extern char *print_mac(char* buf, const u8 *addr);
> +#define DECLARE_MAC_BUF(var) char var[18] __maybe_unused
> +
>  #endif	/* _LINUX_IF_ETHER_H */
> diff --git a/include/net/ieee80211.h b/include/net/ieee80211.h
> index bbd85cd..164d132 100644
> --- a/include/net/ieee80211.h
> +++ b/include/net/ieee80211.h
> @@ -119,11 +119,6 @@ do { if (ieee80211_debug_level & (level)) \
>  #define IEEE80211_DEBUG(level, fmt, args...) do {} while (0)
>  #endif				/* CONFIG_IEEE80211_DEBUG */
>  
> -/* debug macros not dependent on CONFIG_IEEE80211_DEBUG */
> -
> -#define MAC_FMT "%02x:%02x:%02x:%02x:%02x:%02x"
> -#define MAC_ARG(x) ((u8*)(x))[0],((u8*)(x))[1],((u8*)(x))[2],((u8*)(x))[3],((u8*)(x))[4],((u8*)(x))[5]
> -
>  /* escape_essid() is intended to be used in debug (and possibly error)
>   * messages. It should never be used for passing essid to user space. */
>  const char *escape_essid(const char *essid, u8 essid_len);
> diff --git a/include/net/mac80211.h b/include/net/mac80211.h
> index ec8c739..6de3ceb 100644
> --- a/include/net/mac80211.h
> +++ b/include/net/mac80211.h
> @@ -1089,8 +1089,4 @@ static inline int ieee80211_get_morefrag(struct ieee80211_hdr *hdr)
>  		IEEE80211_FCTL_MOREFRAGS) != 0;
>  }
>  
> -#define MAC_FMT "%02x:%02x:%02x:%02x:%02x:%02x"
> -#define MAC_ARG(x) ((u8*)(x))[0], ((u8*)(x))[1], ((u8*)(x))[2], \
> -		   ((u8*)(x))[3], ((u8*)(x))[4], ((u8*)(x))[5]
> -
>  #endif /* MAC80211_H */
> 
> --
> 
>  drivers/net/3c503.c                           |    4 +-
>  drivers/net/3c505.c                           |   10 +-
>  drivers/net/3c507.c                           |    6 +-
>  drivers/net/3c509.c                           |    6 +-
>  drivers/net/3c515.c                           |    4 +-
>  drivers/net/3c523.c                           |   20 +--
>  drivers/net/3c527.c                           |    7 +-
>  drivers/net/3c59x.c                           |    7 +-
>  drivers/net/8139cp.c                          |    8 +-
>  drivers/net/8139too.c                         |    8 +-
>  drivers/net/82596.c                           |   18 +--
>  drivers/net/a2065.c                           |    6 +-
>  drivers/net/ac3200.c                          |    8 +-
>  drivers/net/acenic.c                          |    7 +-
>  drivers/net/amd8111e.c                        |   12 +-
>  drivers/net/apne.c                            |    9 +-
>  drivers/net/ariadne.c                         |   44 +++---
>  drivers/net/arm/am79c961a.c                   |    8 +-
>  drivers/net/arm/at91_ether.c                  |   18 +-
>  drivers/net/arm/ether1.c                      |    8 +-
>  drivers/net/arm/ether3.c                      |    8 +-
>  drivers/net/arm/etherh.c                      |    8 +-
>  drivers/net/at1700.c                          |    4 +-
>  drivers/net/atarilance.c                      |   40 +++---
>  drivers/net/atp.c                             |    8 +-
>  drivers/net/b44.c                             |    9 +-
>  drivers/net/bmac.c                            |    6 +-
>  drivers/net/bnx2.c                            |   12 +-
>  drivers/net/bonding/bond_main.c               |   34 ++---
>  drivers/net/bonding/bond_sysfs.c              |   11 +-
>  drivers/net/cassini.c                         |   11 +-
>  drivers/net/cris/eth_v10.c                    |    8 +-
>  drivers/net/cs89x0.c                          |   15 +--
>  drivers/net/de600.c                           |    6 +-
>  drivers/net/de620.c                           |    8 +-
>  drivers/net/declance.c                        |   14 +-
>  drivers/net/depca.c                           |   13 +-
>  drivers/net/dgrs.c                            |   18 +--
>  drivers/net/dl2k.c                            |    7 +-
>  drivers/net/dm9000.c                          |    9 +-
>  drivers/net/e100.c                            |    9 +-
>  drivers/net/e1000/e1000_main.c                |    5 +-
>  drivers/net/eepro.c                           |    5 +-
>  drivers/net/eepro100.c                        |    9 +-
>  drivers/net/epic100.c                         |    9 +-
>  drivers/net/es3210.c                          |   22 ++--
>  drivers/net/ewrk3.c                           |   16 +--
>  drivers/net/fealnx.c                          |    9 +-
>  drivers/net/fec.c                             |    7 +-
>  drivers/net/forcedeth.c                       |   12 +-
>  drivers/net/gianfar.c                         |    7 +-
>  drivers/net/hamachi.c                         |    8 +-
>  drivers/net/hamradio/bpqether.c               |   23 +--
>  drivers/net/hp-plus.c                         |    6 +-
>  drivers/net/hp.c                              |    5 +-
>  drivers/net/hp100.c                           |    6 +-
>  drivers/net/hydra.c                           |    7 +-
>  drivers/net/ibm_emac/ibm_emac_core.c          |   14 +-
>  drivers/net/ibmlana.c                         |    6 +-
>  drivers/net/ibmveth.c                         |    9 +-
>  drivers/net/ioc3-eth.c                        |   12 +-
>  drivers/net/isa-skeleton.c                    |    5 +-
>  drivers/net/jazzsonic.c                       |   10 +-
>  drivers/net/lance.c                           |    6 +-
>  drivers/net/lguest_net.c                      |    4 +-
>  drivers/net/lib82596.c                        |   18 +--
>  drivers/net/lne390.c                          |    9 +-
>  drivers/net/mac89x0.c                         |   11 +-
>  drivers/net/macb.c                            |    6 +-
>  drivers/net/mace.c                            |    9 +-
>  drivers/net/macmace.c                         |    6 +-
>  drivers/net/macsonic.c                        |   21 +--
>  drivers/net/meth.c                            |    6 +-
>  drivers/net/mv643xx_eth.c                     |    5 +-
>  drivers/net/mvme147.c                         |   11 +-
>  drivers/net/myri10ge/myri10ge.c               |   11 +-
>  drivers/net/myri_sbus.c                       |   29 ++---
>  drivers/net/natsemi.c                         |   11 +-
>  drivers/net/ne-h8300.c                        |    8 +-
>  drivers/net/ne.c                              |    5 +-
>  drivers/net/ne2.c                             |   17 +--
>  drivers/net/ne2k-pci.c                        |   11 +-
>  drivers/net/ne3210.c                          |   11 +-
>  drivers/net/netconsole.c                      |   14 +-
>  drivers/net/netxen/netxen_nic_main.c          |   13 +-
>  drivers/net/netxen/netxen_nic_niu.c           |   14 +-
>  drivers/net/ni5010.c                          |    4 +-
>  drivers/net/ns83820.c                         |    7 +-
>  drivers/net/pasemi_mac.c                      |    6 +-
>  drivers/net/pci-skeleton.c                    |    9 +-
>  drivers/net/pcmcia/3c574_cs.c                 |    9 +-
>  drivers/net/pcmcia/3c589_cs.c                 |   10 +-
>  drivers/net/pcmcia/axnet_cs.c                 |    9 +-
>  drivers/net/pcmcia/fmvj18x_cs.c               |    8 +-
>  drivers/net/pcmcia/nmclan_cs.c                |    9 +-
>  drivers/net/pcmcia/pcnet_cs.c                 |    7 +-
>  drivers/net/pcmcia/smc91c92_cs.c              |    8 +-
>  drivers/net/pcmcia/xirc2ps_cs.c               |    9 +-
>  drivers/net/pppoe.c                           |    8 +-
>  drivers/net/ps3_gelic_net.c                   |    7 +-
>  drivers/net/qla3xxx.c                         |    7 +-
>  drivers/net/rionet.c                          |    6 +-
>  drivers/net/rrunner.c                         |    8 +-
>  drivers/net/s2io.c                            |   11 +-
>  drivers/net/sb1250-mac.c                      |    7 +-
>  drivers/net/seeq8005.c                        |    4 +-
>  drivers/net/sgiseeq.c                         |    6 +-
>  drivers/net/sis190.c                          |   10 +-
>  drivers/net/sis900.c                          |    9 +-
>  drivers/net/skge.c                            |    7 +-
>  drivers/net/sky2.c                            |    7 +-
>  drivers/net/smc-mca.c                         |    8 +-
>  drivers/net/smc-ultra.c                       |    8 +-
>  drivers/net/smc-ultra32.c                     |    8 +-
>  drivers/net/smc9194.c                         |    7 +-
>  drivers/net/smc91x.c                          |    9 +-
>  drivers/net/starfire.c                        |   26 ++--
>  drivers/net/sun3lance.c                       |   36 ++---
>  drivers/net/sunbmac.c                         |    8 +-
>  drivers/net/sundance.c                        |   10 +-
>  drivers/net/sungem.c                          |   12 +-
>  drivers/net/sunhme.c                          |   12 +-
>  drivers/net/sunlance.c                        |    9 +-
>  drivers/net/tokenring/abyss.c                 |   12 +-
>  drivers/net/tokenring/ibmtr.c                 |   26 ++--
>  drivers/net/tokenring/lanstreamer.c           |   64 ++++-----
>  drivers/net/tokenring/madgemc.c               |   19 +--
>  drivers/net/tokenring/olympic.c               |  138 +++++++---------
>  drivers/net/tokenring/proteon.c               |    8 +-
>  drivers/net/tokenring/skisa.c                 |    8 +-
>  drivers/net/tokenring/tmspci.c                |   10 +-
>  drivers/net/tsi108_eth.c                      |    7 +-
>  drivers/net/tulip/de2104x.c                   |    9 +-
>  drivers/net/tulip/de4x5.c                     |   33 +---
>  drivers/net/tulip/dmfe.c                      |   15 +-
>  drivers/net/tulip/tulip_core.c                |   15 +-
>  drivers/net/tulip/uli526x.c                   |    9 +-
>  drivers/net/tulip/winbond-840.c               |   29 ++--
>  drivers/net/tulip/xircom_cb.c                 |    7 +-
>  drivers/net/tun.c                             |   33 ++---
>  drivers/net/typhoon.c                         |   10 +-
>  drivers/net/usb/pegasus.c                     |   11 +-
>  drivers/net/usb/usbnet.c                      |    8 +-
>  drivers/net/via-rhine.c                       |   13 +-
>  drivers/net/wd.c                              |    7 +-
>  drivers/net/wireless/airo.c                   |   32 ++---
>  drivers/net/wireless/arlan-main.c             |   23 ++--
>  drivers/net/wireless/atmel.c                  |    7 +-
>  drivers/net/wireless/bcm43xx/bcm43xx.h        |    6 -
>  drivers/net/wireless/hostap/hostap_80211_rx.c |   49 ++++---
>  drivers/net/wireless/hostap/hostap_80211_tx.c |   13 +-
>  drivers/net/wireless/hostap/hostap_ap.c       |  198 ++++++++++++++----------
>  drivers/net/wireless/hostap/hostap_common.h   |    3 -
>  drivers/net/wireless/hostap/hostap_hw.c       |   11 +-
>  drivers/net/wireless/hostap/hostap_info.c     |   17 ++-
>  drivers/net/wireless/hostap/hostap_ioctl.c    |   15 +-
>  drivers/net/wireless/hostap/hostap_main.c     |   30 ++--
>  drivers/net/wireless/hostap/hostap_proc.c     |   15 +-
>  drivers/net/wireless/ipw2100.c                |   48 +++---
>  drivers/net/wireless/ipw2200.c                |  207 ++++++++++++++-----------
>  drivers/net/wireless/libertas/assoc.c         |   19 ++-
>  drivers/net/wireless/libertas/cmdresp.c       |    7 +-
>  drivers/net/wireless/libertas/debugfs.c       |    5 +-
>  drivers/net/wireless/libertas/join.c          |   15 +-
>  drivers/net/wireless/libertas/main.c          |   12 +-
>  drivers/net/wireless/libertas/scan.c          |   14 +-
>  drivers/net/wireless/libertas/wext.c          |    5 +-
>  drivers/net/wireless/netwave_cs.c             |   14 +-
>  drivers/net/wireless/orinoco.c                |    7 +-
>  drivers/net/wireless/prism54/isl_ioctl.c      |   50 ++----
>  drivers/net/wireless/ray_cs.c                 |   15 +-
>  drivers/net/wireless/rtl8187_dev.c            |    7 +-
>  drivers/net/wireless/wavelan.c                |   53 +++----
>  drivers/net/wireless/wavelan_cs.c             |   54 +++----
>  drivers/net/wireless/wl3501_cs.c              |   22 ++--
>  drivers/net/wireless/zd1211rw/zd_chip.c       |    3 +-
>  drivers/net/wireless/zd1211rw/zd_mac.c        |    8 +-
>  drivers/net/yellowfin.c                       |   19 +--
>  drivers/net/znet.c                            |   11 +-
>  drivers/net/zorro8390.c                       |   15 +-
>  180 files changed, 1305 insertions(+), 1504 deletions(-)
> 
> --
> 
>  net/802/tr.c                                   |   28 ++--
>  net/appletalk/aarp.c                           |    9 +-
>  net/atm/br2684.c                               |   16 +--
>  net/atm/lec.c                                  |   33 ++---
>  net/core/dev.c                                 |   13 ++
>  net/core/netpoll.c                             |   12 +--
>  net/core/pktgen.c                              |   17 +--
>  net/ethernet/eth.c                             |    8 +
>  net/ieee80211/ieee80211_crypt_ccmp.c           |   30 +++--
>  net/ieee80211/ieee80211_crypt_tkip.c           |   31 +++--
>  net/ieee80211/ieee80211_rx.c                   |   59 +++++----
>  net/ieee80211/ieee80211_wx.c                   |    5 +-
>  net/ieee80211/softmac/ieee80211softmac_assoc.c |    4 +-
>  net/ieee80211/softmac/ieee80211softmac_auth.c  |   35 +++--
>  net/ieee80211/softmac/ieee80211softmac_wx.c    |    5 +-
>  net/irda/irlan/irlan_client.c                  |    6 +-
>  net/llc/llc_proc.c                             |   12 +-
>  net/mac80211/debugfs_key.c                     |    3 +-
>  net/mac80211/debugfs_netdev.c                  |    3 +-
>  net/mac80211/debugfs_sta.c                     |    6 +-
>  net/mac80211/event.c                           |    5 +-
>  net/mac80211/ieee80211.c                       |    5 +-
>  net/mac80211/ieee80211_ioctl.c                 |    5 +-
>  net/mac80211/ieee80211_sta.c                   |  180 +++++++++++++-----------
>  net/mac80211/key.c                             |   10 +-
>  net/mac80211/rc80211_simple.c                  |    5 +-
>  net/mac80211/rx.c                              |  118 +++++++++------
>  net/mac80211/sta_info.c                        |   13 +-
>  net/mac80211/tkip.c                            |   10 +-
>  net/mac80211/tx.c                              |   32 +++--
>  net/mac80211/wpa.c                             |   19 ++-
>  net/tipc/eth_media.c                           |    4 +-
>  32 files changed, 414 insertions(+), 327 deletions(-)
> 


^ permalink raw reply

* Re: [PATCH net-2.6.24] introduce MAC_FMT/MAC_ARG
From: David Miller @ 2007-09-14 19:48 UTC (permalink / raw)
  To: joe; +Cc: johannes, netdev, akpm, jgarzik
In-Reply-To: <1189798908.19708.170.camel@localhost>

From: Joe Perches <joe@perches.com>
Date: Fri, 14 Sep 2007 12:41:48 -0700

> David?  Did you ever get a chance to look at this?
> Do you want me to rebase it against your newer net-2.6.24?
> 
> http://repo.or.cz/w/linux-2.6/trivial-mods.git

I just got back from 2 weeks of travelling, sit tight :-)

^ permalink raw reply

* Re: [PATCH][NETNS] Use list_for_each_entry_continue_reverse in setup_net
From: Stephen Hemminger @ 2007-09-14 20:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Linux Netdev List, Linux Containers, devel,
	Daniel Lezcano
In-Reply-To: <m1lkb92tto.fsf@ebiederm.dsl.xmission.com>

On Fri, 14 Sep 2007 08:41:07 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Stephen Hemminger <shemminger@linux-foundation.org> writes:
> 
> > On Fri, 14 Sep 2007 11:39:32 +0400
> > Pavel Emelyanov <xemul@openvz.org> wrote:
> >
> >> I proposed introducing a list_for_each_entry_continue_reverse
> >> macro to be used in setup_net() when unrolling the failed
> >> ->init callback.
> >> 
> >> Here is the macro and some more cleanup in the setup_net() itself
> >> to remove one variable from the stack :) Minor, but the code
> >> looks nicer.
> >> 
> >> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> >
> > Maybe it is time to just eliminate the init hook from the API.
> > It has very few users, and there is no reason the setup needed
> > could not be done before or after registering in most cases.
> 
> I guess having only 5 out of the 29 users I have in my full patchset
> is few.  But that is to be expected because so far only the core
> has been converted.
> 
> I looked again at the initialization to see if you had a point about
> the initialization but in every instance I looked at the function
> was performing work that needed to happen during the creation of
> each network namespace.  So the work very much needs to be done there.
> 
> Ok looking some more I can see why this isn't obvious yet.  copy_net_ns
> hasn't been merged yet, and that is where we create new network namespaces.
> And call setup_net on each new network namespace.
> 
> I will take a look at that patch and see if I can come up with a
> safe version of it to merge to allow for a little more transparency.

Could we just make it so dev->init is not allowed to fail? Then it
can be a void function and the nasty unwind code can go?
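
For reference, the unwind under discussion can be sketched in user space
as below. This is only a model under stated assumptions: struct ops_model,
setup_model() and the init_*/exit_* callbacks are hypothetical names, and a
plain array stands in for the kernel's linked list. The point is the shape
of the loop: when one init fails, the exits of the entries before it run
in reverse order, which is what list_for_each_entry_continue_reverse
expresses for a list.

```c
#include <stddef.h>

/* Hypothetical stand-in for one registered set of callbacks. */
struct ops_model {
	int (*init)(void);	/* returns 0 on success, negative on failure */
	void (*exit)(void);	/* undoes a successful init; may be NULL */
};

/* Small trace buffer so the call order can be inspected. */
static char trace[32];
static size_t trace_len;

static void note(char c)
{
	if (trace_len < sizeof(trace) - 1) {
		trace[trace_len++] = c;
		trace[trace_len] = '\0';
	}
}

/* Illustrative callbacks: upper case marks init, lower case marks exit. */
static int init_a(void) { note('A'); return 0; }
static void exit_a(void) { note('a'); }
static int init_b(void) { note('B'); return 0; }
static void exit_b(void) { note('b'); }
static int init_fail(void) { note('X'); return -1; }

/*
 * Run init for each entry in order.  If entry i fails, call exit for
 * entries i-1 down to 0 and return the error.  The failed entry itself
 * is not exited, since its init did not complete.
 */
static int setup_model(const struct ops_model *ops, size_t n)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		if (ops[i].init) {
			err = ops[i].init();
			if (err < 0)
				goto undo;
		}
	}
	return 0;

undo:
	while (i-- > 0) {
		if (ops[i].exit)
			ops[i].exit();
	}
	return err;
}
```

If init could never fail, as suggested above, the undo path would
disappear and setup_model() would reduce to the forward loop.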

^ permalink raw reply

* Re: SO_BINDTODEVICE mismatch with man page & comments.
From: David Miller @ 2007-09-14 20:12 UTC (permalink / raw)
  To: greearb; +Cc: netdev, kaber
In-Reply-To: <46DDDFFA.6090705@candelatech.com>

From: Ben Greear <greearb@candelatech.com>
Date: Tue, 04 Sep 2007 15:45:14 -0700

> According to the comment in the net/core/sock.c code (in 2.6.20), I should be able to pass a zero
> optlen to the setsockopt method for SO_BINDTODEVICE:
 ...
> However, earlier in that method it returns -EINVAL if optlen is < sizeof(int).
> 
> The man page has comments similar to that in the code above.
> 
> Also, even when I get the un-bind call working with code similar to:
> 
> int z = 0;
> setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, &z, sizeof(z));
> 
> The app I'm working on (Xorp) does not appear to work.  Perhaps because
> the kernel does not clean up the cached route when you un-bind
> as it does in the (re)bind logic?
> 
>                                  /* Remove any cached route for this socket. */
>                                  sk_dst_reset(sk);
> 

Ok, the patch below is how I'm dealing with this.

Let me know if things work better now, and also I would appreciate
it if you could contact the man page maintainers to remove the
optlen==0 language.

Thanks.

From 136f55cf4ad0a3b0185bfc97c68f9e4d74ddcfe7 Mon Sep 17 00:00:00 2001
From: David S. Miller <davem@sunset.davemloft.net>
Date: Fri, 14 Sep 2007 13:10:17 -0700
Subject: [PATCH] [NET]: Fix two issues wrt. SO_BINDTODEVICE.

1) Comments suggest that setting optlen to zero will unbind
   the socket from whatever device it might be attached to.  This
   hasn't been the case since at least 2.2.x because the first thing
   this function does is return -EINVAL if 'optlen' is less than
   sizeof(int).

   Furthermore, there are no "optlen == 0" tests in the
   SO_BINDTODEVICE code either.

   This also means we can toss the "!valbool" code block because if
   that is true we'll also see the first byte of the passed in name
   buffer as '\0' and this will also unbind the socket.

2) We should reset the cached route of the socket after we have made
   the device binding changes, not before.

Reported by Ben Greear.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/sock.c |   39 +++++++++++++++++----------------------
 1 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index cfed7d4..96be0ed 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -591,36 +591,31 @@ set_rcvbuf:
 
 		/* Bind this socket to a particular device like "eth0",
 		 * as specified in the passed interface name. If the
-		 * name is "" or the option length is zero the socket
-		 * is not bound.
+		 * name is "", the socket is not bound.
 		 */
+		if (optlen > IFNAMSIZ - 1)
+			optlen = IFNAMSIZ - 1;
+		memset(devname, 0, sizeof(devname));
+		if (copy_from_user(devname, optval, optlen)) {
+			ret = -EFAULT;
+			break;
+		}
 
-		if (!valbool) {
+		if (devname[0] == '\0') {
 			sk->sk_bound_dev_if = 0;
 		} else {
-			if (optlen > IFNAMSIZ - 1)
-				optlen = IFNAMSIZ - 1;
-			memset(devname, 0, sizeof(devname));
-			if (copy_from_user(devname, optval, optlen)) {
-				ret = -EFAULT;
+			struct net_device *dev = dev_get_by_name(devname);
+			if (!dev) {
+				ret = -ENODEV;
 				break;
 			}
+			sk->sk_bound_dev_if = dev->ifindex;
+			dev_put(dev);
+		}
 
-			/* Remove any cached route for this socket. */
-			sk_dst_reset(sk);
+		/* Remove any cached route for this socket. */
+		sk_dst_reset(sk);
 
-			if (devname[0] == '\0') {
-				sk->sk_bound_dev_if = 0;
-			} else {
-				struct net_device *dev = dev_get_by_name(devname);
-				if (!dev) {
-					ret = -ENODEV;
-					break;
-				}
-				sk->sk_bound_dev_if = dev->ifindex;
-				dev_put(dev);
-			}
-		}
 		break;
 	}
 #endif
-- 
1.5.2.4
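
A rough user-space model of the control flow after this patch, for anyone
who wants to reason about it without building a kernel. It is a sketch
under stated assumptions, not kernel code: struct sock_model, struct
dev_model, model_devs[] and bindtodevice_model() are hypothetical names,
memcpy() stands in for copy_from_user(), and a fixed table stands in for
dev_get_by_name(). It shows the two behaviours described above: an empty
name unbinds the socket, and the cached route is dropped only after the
binding has been changed.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_IFNAMSIZ 16	/* same value as the kernel's IFNAMSIZ */

/* Hypothetical device table standing in for dev_get_by_name(). */
struct dev_model {
	const char *name;
	int ifindex;
};

static const struct dev_model model_devs[] = {
	{ "lo", 1 },
	{ "eth0", 2 },
};

/* Hypothetical subset of socket state touched by SO_BINDTODEVICE. */
struct sock_model {
	int bound_dev_if;	/* 0 means "not bound" */
	int dst_cached;		/* nonzero while a cached route is held */
};

static int bindtodevice_model(struct sock_model *sk,
			      const char *optval, size_t optlen)
{
	char devname[MODEL_IFNAMSIZ];
	size_t i;

	/* Early check described in the commit message. */
	if (optlen < sizeof(int))
		return -EINVAL;

	if (optlen > MODEL_IFNAMSIZ - 1)
		optlen = MODEL_IFNAMSIZ - 1;
	memset(devname, 0, sizeof(devname));
	memcpy(devname, optval, optlen);	/* copy_from_user() in the kernel */

	if (devname[0] == '\0') {
		/* Empty name: unbind. */
		sk->bound_dev_if = 0;
	} else {
		int ifindex = 0;

		for (i = 0; i < sizeof(model_devs) / sizeof(model_devs[0]); i++) {
			if (strcmp(model_devs[i].name, devname) == 0) {
				ifindex = model_devs[i].ifindex;
				break;
			}
		}
		if (!ifindex)
			return -ENODEV;	/* binding and cached route left as-is */
		sk->bound_dev_if = ifindex;
	}

	/* Drop the cached route after the binding change, not before. */
	sk->dst_cached = 0;
	return 0;
}
```

With this model, passing a zero-filled buffer of MODEL_IFNAMSIZ bytes
unbinds, while a length of 0 is rejected with -EINVAL, which matches the
removal of the "option length is zero" wording from the comment.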


^ permalink raw reply related

* Re: [git patches] net driver fixes
From: Dan Williams @ 2007-09-14 20:11 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Jeff Garzik, Andrew Morton, Linus Torvalds, netdev, LKML
In-Reply-To: <32620.1189797572@death>

On Fri, 2007-09-14 at 12:19 -0700, Jay Vosburgh wrote:
> Dan Williams <dcbw@redhat.com> wrote:
> [...]
> >I admit that I probably don't understand the system architecture of
> >where ehea would be used, but would this
> >cause /sys/class/net/ethX/carrier to be TRUE even if the device has no
> >carrier?  That seems quite wrong IMHO.  When does ehea not have a
> >carrier?  And in that case, does sysfs say 1 or 0 for the carrier?
> 
> 	I don't work on ehea, but I'm generally familiar with it, and
> particularly with this patch.
> 
> 	The usual environment for ehea devices is on large systems
> subdivided into multiple logical partitions.  One ehea device serves
> many partitions.  By having ehea always report "link up" to the logical
> ports (the ports seen by the partitions), the partitions can communicate
> amongst themselves even if the external ports (the ports that go to the
> switch or whatever) have no link.  

(forgive my ignorance of course)

So essentially the ehea device has 1(+) external ports that may/may
not be connected, but all lpars share the physical hardware itself,
which is quite happy to let all the lpars talk to each other essentially
via loopback even if there is no actual carrier detected on the external
port(s)?  How does addressing work here, is it just L2 addresses?  Feel
free to point me to some docs and tell me to shut up :)

At least these days module parameters can be changed at runtime through
sysfs.  Stuff that can only be set at module load doesn't provide
userspace the flexibility it needs to configure stuff on the fly.

Dan

> 	The ehea device, more or less, acts as a switch connecting the
> partitions together.  This switch type of functionality is not dependent
> upon the link state of the external ports (any more than the
> functionality of any switch is dependent upon whether or not it is
> connected to a gateway).
> 
> 	This, if I'm not mistaken, is the way ehea has always operated
> until this particular patch was added.
> 
> 	This patch (to optionally pass carrier state to the logical
> ports) was added largely for bonding, so that the bonding driver can
> detect link failures on the external ports (when so desired).  The
> default behavior remains the original behavior, i.e., do not pass
> external port link state to the logical ports.
> 
> 	Anyway, to answer your question, the carrier state reported for
> the ehea interface on the partition will always be true.  Think of it as
> reporting the link state from the logical interface to the "switch" that
> connects the partitions; that link exists only within the ehea device
> itself, and really can't fail unless the ehea device itself fails.
> 
> 	With the new option enabled, then ehea is more or less mimicking
> a trunk failover type of function, and passing the carrier state of the
> "external switch port" to the internal port.
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


^ permalink raw reply

* Re: [PATCH v3] Make the pr_*() family of macros in kernel.h complete
From: Andrew Morton @ 2007-09-14 20:18 UTC (permalink / raw)
  To: Medve Emilian-EMMEDVE1; +Cc: linux-kernel, netdev, i2c, linux-omap-open-source
In-Reply-To: <598D5675D34BE349929AF5EDE9B03E270155578C@az33exm24.fsl.freescale.net>

On Fri, 14 Sep 2007 06:27:04 -0700 "Medve Emilian-EMMEDVE1" <Emilian.Medve@freescale.com> wrote:

> I realize this e-mail might be a nuisance and a time waster for you but I'm
> in need of advice. I apologize in advance for any commonsense cultural
> conventions I'm breaking.
> 
> I sent the below patch to four e-mail lists and it led to orthogonal
> conversations about how the entire kernel logging system/mechanisms need
> to be re-written and thus such incremental improvements as these get out
> of focus...
> 
> In this case I started needing pr_err() and discovered that is defined
> already four times but not with global visibility as some other pr_*()
> from kernel.h (a subset of the entire family). I chose not to define it
> yet the fifth time but clean up the existing definitions and complete
> the family. For some reason it didn't go through even though I had some
> positive feedback. Now it seems I'm encouraged to really define the
> pr_err() for the fifth time...  Not quite sure what to do...

I normally troll the lkml list for patches like this and will sweep them
up.  But I'm presently 1400 messages in arrears so there is some latency.

I'll go grab this patch.

^ permalink raw reply

* Re: [git patches] net driver fixes
From: Jay Vosburgh @ 2007-09-14 20:39 UTC (permalink / raw)
  To: Dan Williams; +Cc: Jeff Garzik, Andrew Morton, Linus Torvalds, netdev, LKML
In-Reply-To: <1189800717.20386.13.camel@xo-3E-67-34.localdomain>

Dan Williams <dcbw@redhat.com> wrote:
[...]
>So essentially the ehea device has 1(+) external ports that may/may
>not be connected, but all lpars share the physical hardware itself,
>which is quite happy to let all the lpars talk to each other essentially
>via loopback even if there is no actual carrier detected on the external
>port(s)? [...]

	Yes.

>[...]  How does addressing work here, is it just L2 addresses?  

	Yes.  The logical ports all have unique MAC addresses.

> [...] Feel
>free to point me to some docs and tell me to shut up :)

http://www.redbooks.ibm.com/redpieces/abstracts/redp4340.html

	I found this via google; I haven't read it in detail, but it
seems to cover the HEA architecture at a high level.  It talks about the
whole "IVE" (integrated virtual ethernet: the adapter, hypervisor, etc)
system, but HEA is part of that, so it's probably got the answers you're
looking for.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH net-2.6.23-rc5] ipsec interfamily route handling fix
From: David Miller @ 2007-09-14 20:42 UTC (permalink / raw)
  To: joakim.koskela; +Cc: netdev
In-Reply-To: <200709061900.10508.joakim.koskela@hiit.fi>

From: Joakim Koskela <joakim.koskela@hiit.fi>
Date: Thu, 6 Sep 2007 19:00:10 +0300

> This patch addresses a couple of issues related to interfamily ipsec
> modes. The problem is that the structure of the routing info changes
> with the family during the __xfrmX_bundle_create, which hasn't been
> taken properly into account. Seems that by coincidence it hasn't
> caused problems on 32bit platforms, but crashes for example on x86_64
> in 6-4 around line 209 of xfrm6_policy.c as rt doesn't point to a
> rt6_info anymore, but actually a struct rtable. With 64bit pointers,
> the rt->rt6i_node pointer seems to hit something usually not null in
> the rtable that rt now points to, making it go for the path_cookie
> assignment and subsequently crashing.
> 
> Tested on both 32/64bit with all four (44/46/64/66) combinations of
> transformation. I'm still a bit worried about how for example nested
> transformations work with all of this and would appreciate if someone
> more familiar with the details of these structs could comment.
> 
> Signed-off-by: Joakim Koskela <jookos@gmail.com>

Since nobody else found time to review this, I did :-)

It's line wrapped so doesn't apply cleanly, but it has technical
issues too.

It sets encap_type in the inner loop, but what if we find multiple
entries some ipv4 and some ipv6?  This logic can't be right.

Instead, we need to treat these objects on an individual basis, I
think, and that requires a bit more changes.

These tunnel handling code blocks are getting messy, perhaps it's
time for a little bit of indirection based upon AF type?

^ permalink raw reply

