public inbox for hail-devel@vger.kernel.org
* iSCSI front-end for Hail
@ 2010-05-01 22:28 Jeff Garzik
  2010-05-02  2:56 ` Pete Zaitcev
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Jeff Garzik @ 2010-05-01 22:28 UTC (permalink / raw)
  To: Project Hail

Hail devs,

Project Hail was, in part, conceived as an umbrella of libraries and 
services for mating a well-known, Internet-standard API with a 
distributed-storage back-end.  tabled is an example of this: it 
provides an application front-end compatible with the S3 API, using 
the Hail back-end services chunkd and CLD.

nfs4d[1] is a second, work-in-progress example.  nfs4d is a fully 
working NFSv4 front-end, waiting to be mated to the Hail back-end services.

A third example is something I poked at long ago: iSCSI.  The vinzvault 
announcement[2] got me thinking about the iSCSI target[3] daemon that I 
had worked on a while ago.  vinzvault, sheepdog, DST, drbd, nbd and 
iSCSI all attempt to provide remote network-attached storage, usually 
for storage on ephemeral virtual machines, similar to Amazon's Elastic 
Block Storage (EBS) on their EC2 grid.

I dusted off my "itd" (iSCSI target daemon) project, fixed a bunch of 
bugs, and got it working[4] in the hopes that it might be useful to 
Hail, vinzvault, or the like.

itd is a remote iSCSI service exporting one or more slices of storage as 
a standard SCSI device on your system.  It is based on 'netbsd-iscsi' 
in Fedora, which is in turn based on an old, open source Intel 
codebase.  netbsd-iscsi seemed a more pliable codebase than the very 
nice SCSI TGT project[5].

The web browsable itd tree (with git:// URL for cloning) can be found at 
http://git.kernel.org/?p=daemon/distsrv/itd.git

As I write this email, I am borrowing a lot of networking code from 
tabled, converting from GNet to the more flexible TCP server codebase 
found in tabled -- notably its asynchronous background TCP writing 
code.  I hope to finish and commit this by the end of the weekend.

At that point, itd should be a fully compliant SCSI target, capable of 
reading/writing -- to a pre-allocated RAM space.  Once that milestone is 
reached, the RAM storage may be replaced with Hail components, or other 
gadgets like MongoDB[6], to provide scalable, distributed storage.

	Jeff


[1] https://hail.wiki.kernel.org/index.php/Nfs4d
[2] http://www.mail-archive.com/linux-cluster@redhat.com/msg08555.html
[3] a SCSI "target" is a remote network server, in SCSI parlance.  It is 
mated with an "initiator", which is SCSI's term for client.
[4] Well, only small WRITEs work at the moment, but READ is fully 
working at high speeds.
[5] http://stgt.sourceforge.net/
[6] http://www.mongodb.org/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: iSCSI front-end for Hail
  2010-05-01 22:28 iSCSI front-end for Hail Jeff Garzik
@ 2010-05-02  2:56 ` Pete Zaitcev
  2010-05-02  6:32   ` Jeff Garzik
  2010-05-06  3:02 ` Jeff Garzik
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Pete Zaitcev @ 2010-05-02  2:56 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Project Hail

On Sat, 01 May 2010 18:28:42 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> As I write this email, I am borrowing a lot of networking code from 
> tabled, to convert from GNet over to the more-flexible TCP server 
> codebase found in tabled -- notably the asynchronous background TCP 
> writing code in tabled.  Hopefully will finish and commit this by the 
> end of the weekend.

This seems crying for a common repository or something like libhail,
not sure what. Remember the timer case. Eventually we'll make changes
to tabled that itd will need to copy. But I don't know what course
is best.

-- Pete


* Re: iSCSI front-end for Hail
  2010-05-02  2:56 ` Pete Zaitcev
@ 2010-05-02  6:32   ` Jeff Garzik
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Garzik @ 2010-05-02  6:32 UTC (permalink / raw)
  To: Pete Zaitcev; +Cc: Project Hail

[-- Attachment #1: Type: text/plain, Size: 1109 bytes --]

On 05/01/2010 10:56 PM, Pete Zaitcev wrote:
> On Sat, 01 May 2010 18:28:42 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>
>> As I write this email, I am borrowing a lot of networking code from
>> tabled, to convert from GNet over to the more-flexible TCP server
>> codebase found in tabled -- notably the asynchronous background TCP
>> writing code in tabled.  Hopefully will finish and commit this by the
>> end of the weekend.
>
> This seems crying for a common repository or something like libhail,
> not sure what. Remember the timer case. Eventually we'll make changes
> to tabled that itd will need to copy. But I don't know what course
> is best.

I was definitely thinking along those lines when I abstracted and 
modularized the code a bit.  See attached...  I put all the TCP 
write-related code into two structures, tcp_write_state and tcp_write. 
The code received s/cli_wr/tcp_wr/g and other obvious, cosmetic changes.

libhail definitely seems like the direction to go.  It would be easiest 
from a packaging perspective to put it into CLD.  But maybe it deserves 
its own repo?

	Jeff




[-- Attachment #2: util.c.txt --]
[-- Type: text/plain, Size: 5770 bytes --]


====================================SNIP CUT HERE SNIP=========================

enum {
	TCP_MAX_WR_IOV		= 512,	/* arbitrary, pick better one */
	TCP_MAX_WR_CNT		= 10000,/* arbitrary, pick better one */
};

struct tcp_write_state {
	int 			fd;
	struct list_head	write_q;
	struct list_head	write_compl_q;
	size_t			write_cnt;	/* water level */
	size_t			write_cnt_max;
	bool			writing;
	struct event		write_ev;

	void			*priv;		/* useable by any app */

	/* stats */
	unsigned long		opt_write;
};

struct tcp_write {
	const void		*buf;		/* write buffer pointer */
	int			togo;		/* write buffer remainder */

	int			length;		/* length for accounting */

						/* callback */
	bool			(*cb)(struct tcp_write_state *, void *, bool);
	void			*cb_data;	/* data passed to cb */

	struct list_head	node;
};

extern int tcp_writeq(struct tcp_write_state *st, const void *buf, unsigned int buflen,
	       bool (*cb)(struct tcp_write_state *, void *, bool),
	       void *cb_data);
extern bool tcp_wr_cb_free(struct tcp_write_state *st, void *cb_data, bool done);
extern bool tcp_write_run_compl(struct tcp_write_state *st);
extern size_t tcp_wqueued(struct tcp_write_state *st);
extern void tcp_write_init(struct tcp_write_state *st, int fd);
extern void tcp_write_exit(struct tcp_write_state *st);
extern bool tcp_write_start(struct tcp_write_state *st);



====================================SNIP CUT HERE SNIP=========================



static void tcp_write_complete(struct tcp_write_state *st, struct tcp_write *tmp)
{
	list_del(&tmp->node);
	list_add_tail(&tmp->node, &st->write_compl_q);
}

bool tcp_wr_cb_free(struct tcp_write_state *st, void *cb_data, bool done)
{
	free(cb_data);
	return false;
}

static bool tcp_write_free(struct tcp_write_state *st, struct tcp_write *tmp,
			   bool done)
{
	bool rcb = false;

	st->write_cnt -= tmp->length;
	list_del(&tmp->node);
	if (tmp->cb)
		rcb = tmp->cb(st, tmp->cb_data, done);
	free(tmp);

	return rcb;
}

static void tcp_write_free_all(struct tcp_write_state *st)
{
	struct tcp_write *wr, *tmp;

	list_for_each_entry_safe(wr, tmp, &st->write_compl_q, node) {
		tcp_write_free(st, wr, true);
	}
	list_for_each_entry_safe(wr, tmp, &st->write_q, node) {
		tcp_write_free(st, wr, false);
	}
}

bool tcp_write_run_compl(struct tcp_write_state *st)
{
	struct tcp_write *wr;
	bool do_loop;

	do_loop = false;
	while (!list_empty(&st->write_compl_q)) {
		wr = list_entry(st->write_compl_q.next, struct tcp_write,
				node);
		do_loop |= tcp_write_free(st, wr, true);
	}
	return do_loop;
}

static bool tcp_writable(struct tcp_write_state *st)
{
	int n_iov;
	struct tcp_write *tmp;
	ssize_t rc;
	struct iovec iov[TCP_MAX_WR_IOV];

	/* accumulate pending writes into iovec */
	n_iov = 0;
	list_for_each_entry(tmp, &st->write_q, node) {
		if (n_iov == TCP_MAX_WR_IOV)
			break;
		/* bleh, struct iovec should declare iov_base const */
		iov[n_iov].iov_base = (void *) tmp->buf;
		iov[n_iov].iov_len = tmp->togo;
		n_iov++;
	}

	/* execute non-blocking write */
do_write:
	rc = writev(st->fd, iov, n_iov);
	if (rc < 0) {
		if (errno == EINTR)
			goto do_write;
		if (errno != EAGAIN)
			goto err_out;
		return true;
	}

	/* iterate through write queue, issuing completions based on
	 * amount of data written
	 */
	while (rc > 0) {
		int sz;

		/* get pointer to first record on list */
		tmp = list_entry(st->write_q.next, struct tcp_write, node);

		/* mark data consumed by decreasing tmp->len */
		sz = (tmp->togo < rc) ? tmp->togo : rc;
		tmp->togo -= sz;
		tmp->buf += sz;
		rc -= sz;

		/* if tmp->len reaches zero, write is complete,
		 * so schedule it for clean up (cannot call callback
		 * right away or an endless recursion will result)
		 */
		if (tmp->togo == 0)
			tcp_write_complete(st, tmp);
	}

	/* if we emptied the queue, clear write notification */
	if (list_empty(&st->write_q)) {
		st->writing = false;
		if (event_del(&st->write_ev) < 0)
			goto err_out;
	}

	return true;

err_out:
	tcp_write_free_all(st);
	return false;
}

bool tcp_write_start(struct tcp_write_state *st)
{
	if (list_empty(&st->write_q))
		return true;		/* loop, not poll */

	/* if write-poll already active, nothing further to do */
	if (st->writing)
		return false;		/* poll wait */

	/* attempt optimistic write, in hopes of avoiding poll,
	 * or at least refill the write buffers so as to not
	 * get -immediately- called again by the kernel
	 */
	tcp_writable(st);
	if (list_empty(&st->write_q)) {
		st->opt_write++;
		return true;		/* loop, not poll */
	}

	if (event_add(&st->write_ev, NULL) < 0)
		return true;		/* loop, not poll */

	st->writing = true;

	return false;			/* poll wait */
}

int tcp_writeq(struct tcp_write_state *st, const void *buf, unsigned int buflen,
	       bool (*cb)(struct tcp_write_state *, void *, bool),
	       void *cb_data)
{
	struct tcp_write *wr;

	if (!buf || !buflen)
		return -EINVAL;

	wr = calloc(1, sizeof(struct tcp_write));
	if (!wr)
		return -ENOMEM;

	wr->buf = buf;
	wr->togo = buflen;
	wr->length = buflen;
	wr->cb = cb;
	wr->cb_data = cb_data;
	list_add_tail(&wr->node, &st->write_q);
	st->write_cnt += buflen;
	if (st->write_cnt > st->write_cnt_max)
		st->write_cnt_max = st->write_cnt;

	return 0;
}

size_t tcp_wqueued(struct tcp_write_state *st)
{
	return st->write_cnt;
}

static void tcp_wr_evt(int fd, short events, void *userdata)
{
	struct tcp_write_state *st = userdata;

	tcp_writable(st);
}

void tcp_write_init(struct tcp_write_state *st, int fd)
{
	memset(st, 0, sizeof(*st));

	st->fd = fd;

	INIT_LIST_HEAD(&st->write_q);
	INIT_LIST_HEAD(&st->write_compl_q);

	st->write_cnt_max = TCP_MAX_WR_CNT;

	event_set(&st->write_ev, fd, EV_WRITE | EV_PERSIST,
		  tcp_wr_evt, st);
}

void tcp_write_exit(struct tcp_write_state *st)
{
	if (st->writing)
		event_del(&st->write_ev);

	tcp_write_free_all(st);
}

====================================SNIP CUT HERE SNIP=========================


* Re: iSCSI front-end for Hail
  2010-05-01 22:28 iSCSI front-end for Hail Jeff Garzik
  2010-05-02  2:56 ` Pete Zaitcev
@ 2010-05-06  3:02 ` Jeff Garzik
  2010-05-07  3:15 ` Jeff Garzik
  2010-05-08 22:16 ` iSCSI back-end design Jeff Garzik
  3 siblings, 0 replies; 6+ messages in thread
From: Jeff Garzik @ 2010-05-06  3:02 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Project Hail

As of commit 23a5795e3ca555a6454b199e071482bb50655508, itd is passing 
integrity and stress tests from two test suites: iscsi-harness, found 
in the netbsd-iscsi package, and basic block-device integrity tests 
using dd(1).
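For reference, the dd(1)-based integrity check is roughly the following
sketch.  The file names, sizes, and the DISK default are placeholders
of mine, not part of the itd test setup; against a real itd target you
would point DISK at the initiator-created SCSI disk (e.g. /dev/sdX) and
add iflag=direct/oflag=direct to bypass the page cache:

```shell
# Write a known random pattern through the disk, read it back, compare.
# DISK defaults to a plain file here so the sketch runs anywhere.
DISK=${DISK:-./target0.disk}

dd if=/dev/urandom of=ref.img bs=1M count=8 2>/dev/null
dd if=ref.img of="$DISK" bs=1M 2>/dev/null
dd if="$DISK" of=out.img bs=1M count=8 2>/dev/null

cmp -s ref.img out.img && echo "integrity OK"
```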

There is a whopping big memory leak that needs fixing, but the basics 
appear to be working.

	Jeff



* Re: iSCSI front-end for Hail
  2010-05-01 22:28 iSCSI front-end for Hail Jeff Garzik
  2010-05-02  2:56 ` Pete Zaitcev
  2010-05-06  3:02 ` Jeff Garzik
@ 2010-05-07  3:15 ` Jeff Garzik
  2010-05-08 22:16 ` iSCSI back-end design Jeff Garzik
  3 siblings, 0 replies; 6+ messages in thread
From: Jeff Garzik @ 2010-05-07  3:15 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Project Hail


As of itd commit 196e8f317fc7202460d7adde93dac939caf23f5d, the iSCSI 
target daemon appears to survive stress tests, and does not leak 
memory.  I call that a good first milestone.

	Jeff





* iSCSI back-end design
  2010-05-01 22:28 iSCSI front-end for Hail Jeff Garzik
                   ` (2 preceding siblings ...)
  2010-05-07  3:15 ` Jeff Garzik
@ 2010-05-08 22:16 ` Jeff Garzik
  3 siblings, 0 replies; 6+ messages in thread
From: Jeff Garzik @ 2010-05-08 22:16 UTC (permalink / raw)
  To: Project Hail


The iSCSI target daemon now serves data persistently stored in an mmap'd 
file, as an alternative to RAM storage that exists for the lifetime of 
the daemon.

It has been verified to work with Linux and Microsoft iSCSI initiators 
(== iSCSI clients), appearing as a regular SCSI disk on your system. 
It answers all the standard SCSI commands issued by the kernel's SCSI 
subsystem or by userspace SCSI utilities like sg3_utils.

Usage is quite trivial:

	$ dd if=/dev/zero of=/tmp/target0.data bs=1M count=2000
	$ ./itd -f /tmp/target0.data

Then direct your iSCSI initiator (via iscsiadm on Linux) to the portal 
(== IPv4 or IPv6 address) where itd is listening.
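On Linux that boils down to something like the following (the portal
address and target IQN are placeholders of mine; the actual IQN is
whatever itd advertises during discovery):

```shell
# Discover targets advertised at the portal, then log in to one.
iscsiadm -m discovery -t sendtargets -p 10.0.0.1
iscsiadm -m node -T iqn.1994-04.org.netbsd.iscsi-target:target0 \
         -p 10.0.0.1 --login
```

After login, the kernel attaches a new /dev/sdX block device backed by itd.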

Now, on to thinking about a Hail back-end for itd:

One of the recent additions to SCSI (http://www.t10.org/) is "thin 
provisioning", which enables easy management of SCSI devices such as 
flash storage or large arrays, where LBAs (hardware sectors) exist in 
one of two states:

	- mapped: present, containing valid user data
	- unmapped: containing no valid user data

Previously, all LBAs on all SCSI disks were assumed to be mapped.

The ability to unmap sectors, and to know which sectors are unmapped, 
has clear advantages for flash-based storage hardware, which may use 
this information to erase or re-use unmapped flash cells.

Similarly, large SCSI arrays -- or Project Hail software -- may make use 
of unmapped sectors by not allocating storage for them.  As it applies 
to itd, I was thinking that our SCSI device may be implemented 
internally as an array of 256kB pages, some of which are filled with 
data, and some of which are not:

[0] NULL
[1] NULL
[2] chunkd object 1
[3] chunkd object 12398
[4] NULL
[5] chunkd object 22
[6] ...

The array itself is replicated across multiple chunkd nodes.  The 
superblock, containing information on where to find the array, is stored 
in CLD, updated once every 30-60 seconds, or upon a SYNCHRONIZE CACHE 
command.
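The page array could be sketched roughly as below.  To be clear, this is
my illustration, not itd code: the names (pm_*), the encoding of
"unmapped" as object id 0, and the 512-byte LBA size are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PM_PG_SHIFT	18			/* 256 KiB pages */
#define PM_PG_SIZE	(1ULL << PM_PG_SHIFT)
#define PM_SECT_SZ	512ULL			/* bytes per LBA */

struct page_map {
	uint64_t	*objs;		/* chunkd object id per page; 0 == unmapped */
	size_t		npages;
};

/* page index covering a given LBA */
size_t pm_lba_to_page(uint64_t lba)
{
	return (size_t)((lba * PM_SECT_SZ) >> PM_PG_SHIFT);
}

int pm_init(struct page_map *pm, uint64_t capacity_bytes)
{
	pm->npages = (size_t)((capacity_bytes + PM_PG_SIZE - 1) / PM_PG_SIZE);
	pm->objs = calloc(pm->npages, sizeof(uint64_t));
	return pm->objs ? 0 : -1;
}

/* WRITE path: record which chunkd object now backs the page */
void pm_map(struct page_map *pm, uint64_t lba, uint64_t obj_id)
{
	pm->objs[pm_lba_to_page(lba)] = obj_id;
}

/* UNMAP (thin provisioning): drop the backing-object reference */
void pm_unmap(struct page_map *pm, uint64_t lba)
{
	pm->objs[pm_lba_to_page(lba)] = 0;
}

/* READ path: an unmapped page has no backing object; serve zeroes */
bool pm_mapped(const struct page_map *pm, uint64_t lba)
{
	return pm->objs[pm_lba_to_page(lba)] != 0;
}
```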

The downside to large hardware sectors, of sizes such as 256k, is that 
itd would need to perform a read-modify-write cycle for writes smaller 
than 256k.  This is mitigated by a large cache, but is nonetheless a 
factor.  We could easily choose a page size of 4k or even 512 bytes, 
but that, in turn, increases the array management overhead.
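The read-modify-write accounting works out like this (again my sketch,
not itd code; rmw_pages is a hypothetical helper): only pages that a
write covers partially need a read from the back-end first.

```c
#include <stdint.h>

/* Number of 256 KiB pages only partially covered by a write of `len`
 * bytes at byte offset `off`.  Each partial page costs one
 * read-modify-write round trip to the back-end; fully covered pages
 * can simply be overwritten. */
unsigned rmw_pages(uint64_t off, uint64_t len)
{
	const uint64_t pg = 1ULL << 18;		/* 256 KiB */
	uint64_t first, last;
	unsigned n = 0;

	if (len == 0)
		return 0;

	first = off / pg;
	last = (off + len - 1) / pg;

	/* head page is partial unless the write starts on its boundary
	 * and extends at least to the end of the page */
	if (!(off % pg == 0 && off + len >= (first + 1) * pg))
		n++;

	/* tail page is partial unless the write ends on its boundary */
	if (last != first && (off + len) % pg != 0)
		n++;

	return n;
}
```

So a 512-byte write costs one RMW, while any 256k-aligned, 256k-multiple
write costs none.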

New SCSI commands related to thin provisioning are already implemented 
in Linux, and are documented here: 
http://www.t10.org/cgi-bin/ac.pl?t=f&f=sbc3r22.pdf

	Jeff




