public inbox for linux-kernel@vger.kernel.org
* [PATCH] blk queue io tracing support
@ 2005-08-23 12:32 Jens Axboe
  2005-08-24  1:03 ` Nathan Scott
  2005-08-24  6:24 ` Nathan Scott
  0 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2005-08-23 12:32 UTC (permalink / raw)
  To: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 3380 bytes --]

Hi,

This is a little something I have played with. It allows you to see
exactly what is going on in the block layer for a given queue. Currently
it can log request queueing and building, dispatches, requeues, and
completions. I've uploaded a little silly app to do dumps here:

http://www.kernel.org/pub/linux/kernel/people/axboe/tools/blktrace.c

Sample output looks like this:

wiggum:~ # ./blktrace /dev/sda
relay name: /relay/sda0
           0  3765 Q R 192-200
           5  3765 G R
          13  3765 M R [200-208]
          15  3765 M R [208-216]
          17  3765 M R [216-224]
          18  3765 M R [224-232]
          19  3765 M R [232-240]
          20  3765 M R [240-248]
          21  3765 M R [248-256]
         154  3765 M R [256-264]
         156  3765 M R [264-272]
         157  3765 M R [272-280]
         159  3765 M R [280-288]
         160  3765 M R [288-296]
         161  3765 M R [296-304]
         162  3765 M R [304-312]
         163  3765 M R [312-320]
         164  3765 M R [320-328]
         170  3765 M R [328-336]
         171  3765 M R [336-344]
         172  3765 M R [344-352]
         173  3765 M R [352-360]
         174  3765 M R [360-368]
         175  3765 M R [368-376]
         177  3765 M R [376-384]
         178  3765 M R [384-392]
         179  3765 Q R 392-400
         180  3765 G R
         181  3765 M R [400-408]
         182  3765 M R [408-416]
         183  3765 M R [416-424]
         184  3765 M R [424-432]
         185  3765 M R [432-440]
         186  3765 M R [440-448]
         187  3765 M R [448-456]
         189  3765 M R [456-464]
         190  3765 M R [464-472]
         191  3765 M R [472-480]
         193  3765 M R [480-488]
         194  3765 M R [488-496]
         196  3765 M R [496-504]
         197  3765 M R [504-512]
         228  3765 D R 192-392
         245  3765 D R 392-512
       14049     0 C R 192-392 [0]
       14067     0 D R 392-512
       14807     0 C R 392-512 [0]
Reads:  Queued:           2,      160KiB
        Completed:        2,      160KiB
        Merges:          38
Writes: Queued:           0,        0KiB
        Completed:        0,        0KiB
        Merges:           0
Events: 47
Missed events: 0

This is a log of a dd if=/dev/sda of=/dev/null bs=64k count=2 run, and
it shows queueing (Q) and allocation (G) of two requests, along with the
merges (M) that happen there. Finally you see dispatch (D) and
completion (C) of them as well. When SIGINT is received, blktrace dumps
stats for the current run.

It will work for SCSI commands as well, so you can see what is going on
when cdrecord is talking to the device (the CDB is dumped, not the
data). The final integer printed in [] after a completion is the error,
0 meaning successful completion.

You can register interest in various events, see blktrace.c (grep for
buts and BLKSTARTTRACE).

Patch is against 2.6.13-rc6-mm2. I'm attaching a relayfs update from Tom
Zanussi as well, which is required to handle sub-buffer wrapping
correctly. You need to apply both patches to play with this - and make
sure to enable CONFIG_BLK_DEV_IO_TRACE in your .config, of course. Also,
blktrace.c relies on relayfs being mounted on /relay, so add a line like

none                 /relay               relayfs    defaults 0 0

to your /etc/fstab to accomplish that (or mount it manually; I'm only
mentioning it for completeness).
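The one-off manual equivalent of that fstab entry should be something
like (assuming the mount point doesn't exist yet):

```shell
mkdir -p /relay
mount -t relayfs none /relay
```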

-- 
Jens Axboe


[-- Attachment #2: blk-trace-2.6.13-rc6-mm2-A0 --]
[-- Type: text/plain, Size: 14283 bytes --]

diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/blktrace.c linux-2.6.13-rc6-mm2/drivers/block/blktrace.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/blktrace.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.13-rc6-mm2/drivers/block/blktrace.c	2005-08-23 13:34:17.000000000 +0200
@@ -0,0 +1,119 @@
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blktrace.h>
+#include <asm/uaccess.h>
+
+void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
+		     int rw, u32 what, int error, int pdu_len, char *pdu_data)
+{
+	struct blk_io_trace t;
+	unsigned long flags;
+
+	if (rw == WRITE)
+		what |= BLK_TC_ACT(BLK_TC_WRITE);
+	else
+		what |= BLK_TC_ACT(BLK_TC_READ);
+		
+	if (((bt->act_mask << BLK_TC_SHIFT) & what) == 0)
+		return;
+
+	t.magic		= BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
+	t.sequence	= atomic_add_return(1, &bt->sequence);
+	t.time		= sched_clock() / 1000;
+	t.sector	= sector;
+	t.bytes		= bytes;
+	t.action	= what;
+	t.pid		= current->pid;
+	t.error		= error;
+	t.pdu_len	= pdu_len;
+
+	local_irq_save(flags);
+	__relay_write(bt->rchan, &t, sizeof(t));
+	if (pdu_len)
+		__relay_write(bt->rchan, pdu_data, pdu_len);
+	local_irq_restore(flags);
+}
+
+int blk_stop_trace(struct block_device *bdev)
+{
+	request_queue_t *q = bdev_get_queue(bdev);
+	struct blk_trace *bt = NULL;
+	int ret = -EINVAL;
+
+	if (!q)
+		return -ENXIO;
+
+	down(&bdev->bd_sem);
+
+	spin_lock_irq(q->queue_lock);
+	if (q->blk_trace) {
+		bt = q->blk_trace;
+		q->blk_trace = NULL;
+		ret = 0;
+	}
+	spin_unlock_irq(q->queue_lock);
+
+	up(&bdev->bd_sem);
+
+	if (bt) {
+		relay_close(bt->rchan);
+		kfree(bt);
+	}
+
+	return ret;
+}
+
+int blk_start_trace(struct block_device *bdev, char __user *arg)
+{
+	request_queue_t *q = bdev_get_queue(bdev);
+	struct blk_user_trace_setup buts;
+	struct blk_trace *bt;
+	char b[BDEVNAME_SIZE];
+	int ret = 0;
+
+	if (!q)
+		return -ENXIO;
+
+	if (copy_from_user(&buts, arg, sizeof(buts)))
+		return -EFAULT;
+
+	if (!buts.buf_size || !buts.buf_nr)
+		return -EINVAL;
+
+	strcpy(buts.name, bdevname(bdev, b));
+
+	if (copy_to_user(arg, &buts, sizeof(buts)))
+		return -EFAULT;
+
+	down(&bdev->bd_sem);
+	ret = -EBUSY;
+	if (q->blk_trace)
+		goto err;
+
+	ret = -ENOMEM;
+	bt = kmalloc(sizeof(*bt), GFP_KERNEL);
+	if (!bt)
+		goto err;
+
+	atomic_set(&bt->sequence, 0);
+
+	bt->rchan = relay_open(bdevname(bdev, b), NULL, buts.buf_size,
+				buts.buf_nr, NULL);
+	ret = -EIO;
+	if (!bt->rchan)
+		goto err;
+
+	bt->act_mask = buts.act_mask;
+	if (!bt->act_mask)
+		bt->act_mask = (u16) -1;
+
+	spin_lock_irq(q->queue_lock);
+	q->blk_trace = bt;
+	spin_unlock_irq(q->queue_lock);
+	ret = 0;
+err:
+	up(&bdev->bd_sem);
+	return ret;
+}
+
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/elevator.c linux-2.6.13-rc6-mm2/drivers/block/elevator.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/elevator.c	2005-08-23 08:23:51.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/elevator.c	2005-08-23 08:24:34.000000000 +0200
@@ -34,6 +34,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/compiler.h>
+#include <linux/blktrace.h>
 
 #include <asm/uaccess.h>
 
@@ -371,6 +372,9 @@ struct request *elv_next_request(request
 	int ret;
 
 	while ((rq = __elv_next_request(q)) != NULL) {
+
+		blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+
 		/*
 		 * just mark as started even if we don't start it, a request
 		 * that has been delayed should not be passed by new incoming
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ioctl.c linux-2.6.13-rc6-mm2/drivers/block/ioctl.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ioctl.c	2005-08-23 08:23:51.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/ioctl.c	2005-08-23 08:33:28.000000000 +0200
@@ -4,6 +4,7 @@
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
 #include <linux/smp_lock.h>
+#include <linux/blktrace.h>
 #include <asm/uaccess.h>
 
 static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user *arg)
@@ -188,6 +189,10 @@ static int blkdev_locked_ioctl(struct fi
 		return put_ulong(arg, bdev->bd_inode->i_size >> 9);
 	case BLKGETSIZE64:
 		return put_u64(arg, bdev->bd_inode->i_size);
+	case BLKSTARTTRACE:
+		return blk_start_trace(bdev, (char __user *) arg);
+	case BLKSTOPTRACE:
+		return blk_stop_trace(bdev);
 	}
 	return -ENOIOCTLCMD;
 }
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Kconfig linux-2.6.13-rc6-mm2/drivers/block/Kconfig
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Kconfig	2005-08-23 08:23:51.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/Kconfig	2005-08-23 08:30:20.000000000 +0200
@@ -419,6 +419,13 @@ config LBD
 	  your machine, or if you want to have a raid or loopback device
 	  bigger than 2TB.  Otherwise say N.
 
+config BLK_DEV_IO_TRACE
+	bool "Support for tracing block io actions"
+	select RELAYFS
+	help
+	  Say Y here if you want to be able to trace the block layer actions
+	  on a given queue.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c	2005-08-23 08:23:51.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c	2005-08-23 08:24:34.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
+#include <linux/blktrace.h>
 
 /*
  * for max sense size
@@ -1625,6 +1626,12 @@ void blk_cleanup_queue(request_queue_t *
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
 
+	if (q->blk_trace) {
+		relay_close(q->blk_trace->rchan);
+		kfree(q->blk_trace);
+		q->blk_trace = NULL;
+	}
+
 	blk_queue_ordered(q, QUEUE_ORDERED_NONE);
 
 	kmem_cache_free(requestq_cachep, q);
@@ -1971,6 +1978,8 @@ rq_starved:
 	
 	rq_init(q, rq);
 	rq->rl = rl;
+
+	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
 out:
 	return rq;
 }
@@ -1999,6 +2008,8 @@ static struct request *get_request_wait(
 		if (!rq) {
 			struct io_context *ioc;
 
+			blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+
 			__generic_unplug_device(q);
 			spin_unlock_irq(q->queue_lock);
 			io_schedule();
@@ -2052,6 +2063,8 @@ EXPORT_SYMBOL(blk_get_request);
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
+	blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
 
@@ -2665,6 +2678,8 @@ static int __make_request(request_queue_
 			if (!q->back_merge_fn(q, req, bio))
 				break;
 
+			blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+
 			req->biotail->bi_next = bio;
 			req->biotail = bio;
 			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
@@ -2680,6 +2695,8 @@ static int __make_request(request_queue_
 			if (!q->front_merge_fn(q, req, bio))
 				break;
 
+			blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+
 			bio->bi_next = req->bio;
 			req->bio = bio;
 
@@ -2705,6 +2722,8 @@ static int __make_request(request_queue_
 	}
 
 get_rq:
+	blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+
 	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
@@ -2981,6 +3000,10 @@ end_io:
 		blk_partition_remap(bio);
 
 		ret = q->make_request_fn(q, bio);
+
+		if (ret)
+			blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+
 	} while (ret);
 }
 
@@ -3099,6 +3122,8 @@ static int __end_that_request_first(stru
 	int total_bytes, bio_nbytes, error, next_idx = 0;
 	struct bio *bio;
 
+	blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+
 	/*
 	 * extend uptodate bool to allow < 0 value to be direct io error
 	 */
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Makefile linux-2.6.13-rc6-mm2/drivers/block/Makefile
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Makefile	2005-08-23 08:23:51.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/Makefile	2005-08-23 08:51:55.000000000 +0200
@@ -45,3 +45,5 @@ obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
+obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
+
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blkdev.h linux-2.6.13-rc6-mm2/include/linux/blkdev.h
--- /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blkdev.h	2005-08-23 08:24:10.000000000 +0200
+++ linux-2.6.13-rc6-mm2/include/linux/blkdev.h	2005-08-23 08:24:34.000000000 +0200
@@ -22,6 +22,7 @@ typedef struct request_queue request_que
 struct elevator_queue;
 typedef struct elevator_queue elevator_t;
 struct request_pm_state;
+struct blk_trace;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -412,6 +413,8 @@ struct request_queue
 	 */
 	struct request		*flush_rq;
 	unsigned char		ordered;
+
+	struct blk_trace	*blk_trace;
 };
 
 enum {
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blktrace.h linux-2.6.13-rc6-mm2/include/linux/blktrace.h
--- /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blktrace.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.13-rc6-mm2/include/linux/blktrace.h	2005-08-23 13:35:52.000000000 +0200
@@ -0,0 +1,142 @@
+#ifndef BLKTRACE_H
+#define BLKTRACE_H
+
+#include <linux/config.h>
+#include <linux/blkdev.h>
+#include <linux/relayfs_fs.h>
+
+/*
+ * Trace categories
+ */
+enum {
+	BLK_TC_READ	= 1 << 0,	/* reads */
+	BLK_TC_WRITE	= 1 << 1,	/* writes */
+	BLK_TC_QUEUE	= 1 << 2,	/* queueing/merging */
+	BLK_TC_ISSUE	= 1 << 3,	/* issue */
+	BLK_TC_COMPLETE	= 1 << 4,	/* completions */
+	BLK_TC_FS	= 1 << 5,	/* fs requests */
+	BLK_TC_PC	= 1 << 6,	/* pc requests */
+
+	BLK_TC_END	= 1 << 15,	/* only 16-bits, reminder */
+};
+
+#define BLK_TC_SHIFT		(16)
+#define BLK_TC_ACT(act)		((act) << BLK_TC_SHIFT)
+
+/*
+ * Basic trace actions
+ */
+enum {
+	__BLK_TA_QUEUE = 1,		/* queued */
+	__BLK_TA_BACKMERGE,		/* back merged to existing rq */
+	__BLK_TA_FRONTMERGE,		/* front merge to existing rq */
+	__BLK_TA_GETRQ,			/* allocated new request */
+	__BLK_TA_SLEEPRQ,		/* sleeping on rq allocation */
+	__BLK_TA_REQUEUE,		/* request requeued */
+	__BLK_TA_ISSUE,			/* sent to driver */
+	__BLK_TA_COMPLETE,		/* completed by driver */
+};
+
+/*
+ * Trace actions in full. Additionally, read or write is masked
+ */
+#define BLK_TA_QUEUE		(__BLK_TA_QUEUE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define BLK_TA_BACKMERGE	(__BLK_TA_BACKMERGE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define BLK_TA_FRONTMERGE	(__BLK_TA_FRONTMERGE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define	BLK_TA_GETRQ		(__BLK_TA_GETRQ | BLK_TC_ACT(BLK_TC_QUEUE))
+#define	BLK_TA_SLEEPRQ		(__BLK_TA_SLEEPRQ | BLK_TC_ACT(BLK_TC_QUEUE))
+#define	BLK_TA_REQUEUE		(__BLK_TA_REQUEUE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define BLK_TA_ISSUE		(__BLK_TA_ISSUE | BLK_TC_ACT(BLK_TC_ISSUE))
+#define BLK_TA_COMPLETE		(__BLK_TA_COMPLETE | BLK_TC_ACT(BLK_TC_COMPLETE))
+
+#define BLK_IO_TRACE_MAGIC	0x65617400
+#define BLK_IO_TRACE_VERSION	0x01
+
+/*
+ * The trace itself
+ */
+struct blk_io_trace {
+	u32 magic;		/* MAGIC << 8 | version */
+	u32 sequence;		/* event number */
+	u64 time;		/* in microseconds */
+	u64 sector;		/* disk offset */
+	u32 bytes;		/* transfer length */
+	u32 action;		/* what happened */
+	u16 pid;		/* who did it */
+	u16 error;		/* completion error */
+	u16 pdu_len;		/* length of data after this trace */
+};
+
+struct blk_trace {
+	struct rchan *rchan;
+	atomic_t sequence;
+	u16 act_mask;
+};
+
+/*
+ * User setup structure passed with BLKSTARTTRACE
+ */
+struct blk_user_trace_setup {
+	char name[BDEVNAME_SIZE];	/* output */
+	u16 act_mask;			/* input */
+	u32 buf_size;			/* input */
+	u32 buf_nr;			/* input */
+};
+
+#if defined(CONFIG_BLK_DEV_IO_TRACE)
+extern int blk_start_trace(struct block_device *, char __user *);
+extern int blk_stop_trace(struct block_device *);
+extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, char *);
+
+static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
+				    u32 what)
+{
+	struct blk_trace *bt = q->blk_trace;
+	int rw = rq_data_dir(rq);
+
+	if (likely(!bt))
+		return;
+
+	if (blk_pc_request(rq)) {
+		what |= BLK_TC_ACT(BLK_TC_PC);
+		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+	} else  {
+		what |= BLK_TC_ACT(BLK_TC_FS);
+		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+	}
+}
+
+static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
+				     u32 what)
+{
+	struct blk_trace *bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio_data_dir(bio), what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+static inline void blk_add_trace_generic(struct request_queue *q,
+					 struct bio *bio, int rw, u32 what)
+{
+	struct blk_trace *bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		blk_add_trace_bio(q, bio, what);
+	else
+		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+#else /* !CONFIG_BLK_DEV_IO_TRACE */
+#define blk_start_trace(bdev, arg)		(-EINVAL)
+#define blk_stop_trace(bdev)			(-EINVAL)
+#define blk_add_trace_rq(q, rq, what)		do { } while (0)
+#define blk_add_trace_bio(q, rq, what)		do { } while (0)
+#define blk_add_trace_generic(q, rq, rw, what)	do { } while (0)
+#endif /* CONFIG_BLK_DEV_IO_TRACE */
+
+#endif
diff -urpN -X /home/axboe/cdrom/exclude /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/fs.h linux-2.6.13-rc6-mm2/include/linux/fs.h
--- /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/fs.h	2005-08-23 08:24:10.000000000 +0200
+++ linux-2.6.13-rc6-mm2/include/linux/fs.h	2005-08-23 08:24:34.000000000 +0200
@@ -196,6 +196,8 @@ extern int dir_notify_enable;
 #define BLKBSZGET  _IOR(0x12,112,size_t)
 #define BLKBSZSET  _IOW(0x12,113,size_t)
 #define BLKGETSIZE64 _IOR(0x12,114,size_t)	/* return device size in bytes (u64 *arg) */
+#define BLKSTARTTRACE _IOWR(0x12,115,struct blk_user_trace_setup)
+#define BLKSTOPTRACE _IO(0x12,116)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */

[-- Attachment #3: relayfs-read-update --]
[-- Type: text/plain, Size: 8610 bytes --]

--- linux-2.6.13-rc6-mm2/fs/relayfs/inode.c~	2005-08-23 14:29:11.000000000 +0200
+++ linux-2.6.13-rc6-mm2/fs/relayfs/inode.c	2005-08-23 14:29:54.000000000 +0200
@@ -302,94 +302,73 @@
  *	return the original value.
  */
 static inline size_t relayfs_read_start(size_t read_pos,
-					size_t avail,
-					size_t start_subbuf,
 					struct rchan_buf *buf)
 {
-	size_t read_subbuf, adj_read_subbuf;
-	size_t padding, padding_start, padding_end;
+	size_t read_subbuf, padding, padding_start, padding_end;
 	size_t subbuf_size = buf->chan->subbuf_size;
 	size_t n_subbufs = buf->chan->n_subbufs;
-
+	
 	read_subbuf = read_pos / subbuf_size;
-	adj_read_subbuf = (read_subbuf + start_subbuf) % n_subbufs;
-
-	if ((read_subbuf + 1) * subbuf_size <= avail) {
-		padding = buf->padding[adj_read_subbuf];
-		padding_start = (read_subbuf + 1) * subbuf_size - padding;
-		padding_end = (read_subbuf + 1) * subbuf_size;
-		if (read_pos >= padding_start && read_pos < padding_end) {
-			read_subbuf = (read_subbuf + 1) % n_subbufs;
-			read_pos = read_subbuf * subbuf_size;
-		}
+	padding = buf->padding[read_subbuf];
+	padding_start = (read_subbuf + 1) * subbuf_size - padding;
+	padding_end = (read_subbuf + 1) * subbuf_size;
+	if (read_pos >= padding_start && read_pos < padding_end) {
+		read_subbuf = (read_subbuf + 1) % n_subbufs;
+		read_pos = read_subbuf * subbuf_size;
 	}
 
 	return read_pos;
 }
 
 /**
- *	relayfs_read_end - return the end of available bytes to read
- *
- *	If the read_pos is in the middle of a full sub-buffer, return
- *	the padding-adjusted end of that sub-buffer, otherwise return
- *	the position after the last byte written to the buffer.  At
- *	most, 1 sub-buffer can be read at a time.
+ *	relayfs_read_avail - return total available along with buffer start
  *
+ *	Because buffers are circular, the 'beginning' of the buffer
+ *	depends on where the buffer was last written.  If the writer
+ *	has cycled around the buffer, the beginning is defined to be
+ *	the beginning of the sub-buffer following the last sub-buffer
+ *	written to, otherwise it's the beginning of sub-buffer 0.
+ *	
  */
-static inline size_t relayfs_read_end(size_t read_pos,
-				      size_t avail,
-				      size_t start_subbuf,
-				      struct rchan_buf *buf)
+static inline size_t relayfs_read_avail(size_t read_pos,
+					struct rchan_buf *buf)
 {
-	size_t padding, read_endpos, buf_offset;
-	size_t read_subbuf, adj_read_subbuf;
+	size_t padding, avail = 0;
+	size_t read_subbuf, read_offset, write_subbuf, write_offset;
 	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
 
-	buf_offset = buf->offset > subbuf_size ? subbuf_size : buf->offset;
+	write_subbuf = (buf->data - buf->start) / subbuf_size;
+	write_offset = buf->offset > subbuf_size ? subbuf_size : buf->offset;
 	read_subbuf = read_pos / subbuf_size;
-	adj_read_subbuf = (read_subbuf + start_subbuf) % n_subbufs;
+	read_offset = read_pos % subbuf_size;
+	padding = buf->padding[read_subbuf];
 
-	if ((read_subbuf + 1) * subbuf_size <= avail) {
-		padding = buf->padding[adj_read_subbuf];
-		read_endpos = (read_subbuf + 1) * subbuf_size - padding;
+	if (read_subbuf == write_subbuf) {
+		if (read_offset + padding < write_offset)
+			avail = write_offset - (read_offset + padding);
 	} else
-		read_endpos = read_subbuf * subbuf_size + buf_offset;
+		avail = (subbuf_size - padding) - read_offset;
 
-	return read_endpos;
+	return avail;
 }
 
-/**
- *	relayfs_read_avail - return total available along with buffer start
- *
- *	Because buffers are circular, the 'beginning' of the buffer
- *	depends on where the buffer was last written.  If the writer
- *	has cycled around the buffer, the beginning is defined to be
- *	the beginning of the sub-buffer following the last sub-buffer
- *	written to, otherwise it's the beginning of sub-buffer 0.
- *
- */
-static inline size_t relayfs_read_avail(struct rchan_buf *buf,
-					size_t *start_subbuf)
+static void relayfs_read_consume(struct rchan_buf *buf,
+				 size_t read_pos,
+				 size_t bytes_consumed)
 {
-	size_t avail, complete_subbufs, cur_subbuf, buf_offset;
 	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
+	size_t read_subbuf;
+	size_t tmp;
 
-	buf_offset = buf->offset > subbuf_size ? subbuf_size : buf->offset;
+	buf->bytes_consumed += bytes_consumed;
+	read_subbuf = read_pos / buf->chan->subbuf_size;
 
-	if (buf->subbufs_produced >= n_subbufs) {
-		complete_subbufs = n_subbufs - 1;
-		cur_subbuf = (buf->data - buf->start) / subbuf_size;
-		*start_subbuf = (cur_subbuf + 1) % n_subbufs;
-	} else {
-		complete_subbufs = buf->subbufs_produced;
-		*start_subbuf = 0;
+	if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) {
+		tmp = buf->subbufs_consumed;
+		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
+		if (buf->subbufs_consumed != tmp)
+			buf->bytes_consumed = 0;
 	}
-
-	avail = complete_subbufs * subbuf_size + buf_offset;
-
-	return avail;
 }
 
 /**
@@ -401,7 +380,7 @@
  *
  *	Reads count bytes or the number of bytes available in the
  *	current sub-buffer being read, whichever is smaller.
- *
+ *	
  *	NOTE: The results of reading a relayfs file which is currently
  *	being written to are undefined.  This is because the buffer is
  *	circular and an active writer in the kernel could be
@@ -416,31 +395,40 @@
 {
 	struct inode *inode = filp->f_dentry->d_inode;
 	struct rchan_buf *buf = RELAYFS_I(inode)->buf;
-	size_t read_start, read_end, avail, start_subbuf;
-	size_t buf_size = buf->chan->subbuf_size * buf->chan->n_subbufs;
+	size_t read_start, avail;
 	void *from;
+	long long produced, consumed;
+	size_t subbuf_size = buf->chan->subbuf_size;
+	size_t n_subbufs = buf->chan->n_subbufs;
+	size_t write_offset = buf->offset > subbuf_size ? subbuf_size : buf->offset;
 
-	avail = relayfs_read_avail(buf, &start_subbuf);
-	if (*ppos >= avail)
-		return 0;
+	if (buf->offset > subbuf_size)
+		produced = (buf->subbufs_produced - 1) * subbuf_size + write_offset;
+	else
+		produced = buf->subbufs_produced * subbuf_size + write_offset;
+	consumed = buf->subbufs_consumed * subbuf_size + buf->bytes_consumed;
 
-	read_start = relayfs_read_start(*ppos, avail, start_subbuf, buf);
-	if (read_start == 0 && *ppos)
+	if (produced == consumed)
 		return 0;
 
-	read_end = relayfs_read_end(read_start, avail, start_subbuf, buf);
-	if (read_end == read_start)
-		return 0;
+	relayfs_read_consume(buf, *ppos, 0);
 
-	from = buf->start + start_subbuf * buf->chan->subbuf_size + read_start;
-	if (from >= buf->start + buf_size)
-		from -= buf_size;
+	read_start = relayfs_read_start(*ppos, buf);
 
-	count = min(count, read_end - read_start);
+	avail = relayfs_read_avail(read_start, buf);
+	if (!avail)
+		return 0;
+	
+	from = buf->start + read_start;
+	count = min(count, avail);
 	if (copy_to_user(buffer, from, count))
 		return -EFAULT;
 
 	*ppos = read_start + count;
+	if (*ppos >= subbuf_size * n_subbufs)
+		*ppos = 0;
+
+	relayfs_read_consume(buf, read_start, count);
 
 	return count;
 }
--- linux-2.6.13-rc6-mm2/fs/relayfs/relay.c~	2005-08-23 14:29:08.000000000 +0200
+++ linux-2.6.13-rc6-mm2/fs/relayfs/relay.c	2005-08-23 14:29:48.000000000 +0200
@@ -58,6 +58,9 @@
 					  void *prev_subbuf,
 					  size_t prev_padding)
 {
+	if (relay_buf_full(buf))
+		return 0;
+
 	return 1;
 }
 
@@ -262,6 +270,7 @@
 	for_each_online_cpu(i) {
 		sprintf(tmpname, "%s%d", base_filename, i);
 		chan->buf[i] = relay_open_buf(chan, tmpname, parent);
 		if (!chan->buf[i])
 			goto free_bufs;
+		chan->buf[i]->cpu = i;
 	}
@@ -328,7 +337,7 @@
 	return length;
 
 toobig:
-	printk(KERN_WARNING "relayfs: event too large (%u)\n", length);
+	printk(KERN_WARNING "relayfs: event too large (%lu)\n", length);
 	WARN_ON(1);
 	return 0;
 }
--- linux-2.6.13-rc6-mm2/include/linux/relayfs_fs.h~	2005-08-23 14:29:21.000000000 +0200
+++ linux-2.6.13-rc6-mm2/include/linux/relayfs_fs.h	2005-08-23 14:29:31.000000000 +0200
@@ -22,7 +22,7 @@
 /*
  * Tracks changes to rchan_buf struct
  */
-#define RELAYFS_CHANNEL_VERSION		4
+#define RELAYFS_CHANNEL_VERSION		5
 
 /*
  * Per-cpu relay channel buffer
@@ -44,6 +44,8 @@
 	unsigned int finalized;		/* buffer has been finalized */
 	size_t *padding;		/* padding counts per sub-buffer */
 	size_t prev_padding;		/* temporary variable */
+	size_t bytes_consumed;		/* bytes consumed in cur read subbuf */
+	unsigned int cpu;		/* this buf's cpu */
 } ____cacheline_aligned;
 
 /*

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] blk queue io tracing support
  2005-08-23 12:32 [PATCH] blk queue io tracing support Jens Axboe
@ 2005-08-24  1:03 ` Nathan Scott
  2005-08-24  7:08   ` Jens Axboe
  2005-08-24  6:24 ` Nathan Scott
  1 sibling, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2005-08-24  1:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

On Tue, Aug 23, 2005 at 02:32:36PM +0200, Jens Axboe wrote:
> Hi,
> 
> This is a little something I have played with. It allows you to see
> exactly what is going on in the block layer for a given queue. Currently
> it can log request queueing and building, dispatches, requeues, and
> completions.

Ah, fabulous.  Thanks Jens!

> +	t.magic		= BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
> +	t.sequence	= atomic_add_return(1, &bt->sequence);
> +	t.time		= sched_clock() / 1000;

Wouldn't it be better to pass out the highest precision available here
& then do the conversion in userspace instead?  I guess one might want
that little bit more for a RAM disk or something ... actually, talking
to one of the SGI people here with a lot of experience on IRIX with a
similar facility, the msec resolution there is apparently sometimes an
issue already with fast storage.

cheers.

-- 
Nathan


* Re: [PATCH] blk queue io tracing support
  2005-08-23 12:32 [PATCH] blk queue io tracing support Jens Axboe
  2005-08-24  1:03 ` Nathan Scott
@ 2005-08-24  6:24 ` Nathan Scott
  2005-08-24  7:08   ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2005-08-24  6:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Hi Jens,

On Tue, Aug 23, 2005 at 02:32:36PM +0200, Jens Axboe wrote:
> ...
> +	t.pid		= current->pid;
> ...
> +/*
> + * The trace itself
> + */
> +struct blk_io_trace {
> +	u32 magic;		/* MAGIC << 8 | version */
> +	u32 sequence;		/* event number */
> +	u64 time;		/* in microseconds */
> +	u64 sector;		/* disk offset */
> +	u32 bytes;		/* transfer length */
> +	u32 action;		/* what happened */
> +	u16 pid;		/* who did it */

Also, this field (pid) should probably be a u32.

cheers.

-- 
Nathan


* Re: [PATCH] blk queue io tracing support
  2005-08-24  1:03 ` Nathan Scott
@ 2005-08-24  7:08   ` Jens Axboe
  2005-08-24  7:19     ` Nathan Scott
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2005-08-24  7:08 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Wed, Aug 24 2005, Nathan Scott wrote:
> On Tue, Aug 23, 2005 at 02:32:36PM +0200, Jens Axboe wrote:
> > Hi,
> > 
> > This is a little something I have played with. It allows you to see
> > exactly what is going on in the block layer for a given queue. Currently
> > it can log request queueing and building, dispatches, requeues, and
> > completions.
> 
> Ah, fabulous.  Thanks Jens!
> 
> > +	t.magic		= BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
> > +	t.sequence	= atomic_add_return(1, &bt->sequence);
> > +	t.time		= sched_clock() / 1000;
> 
> Wouldn't it be better to pass out the highest precision available here
> & then do the conversion in userspace instead?  I guess one might want
> that little bit more for a RAM disk or something ... actually, talking
> to one of the SGI people here with a lot of experience on IRIX with a
> similar facility, the msec resolution there is apparently sometimes an
> issue already with fast storage.

This isn't msec precision, it's usec. sched_clock() is in ns! I already
decided that msec is too coarse, but usec _should_ be enough.

-- 
Jens Axboe



* Re: [PATCH] blk queue io tracing support
  2005-08-24  6:24 ` Nathan Scott
@ 2005-08-24  7:08   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2005-08-24  7:08 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Wed, Aug 24 2005, Nathan Scott wrote:
> Hi Jens,
> 
> On Tue, Aug 23, 2005 at 02:32:36PM +0200, Jens Axboe wrote:
> > ...
> > +	t.pid		= current->pid;
> > ...
> > +/*
> > + * The trace itself
> > + */
> > +struct blk_io_trace {
> > +	u32 magic;		/* MAGIC << 8 | version */
> > +	u32 sequence;		/* event number */
> > +	u64 time;		/* in microseconds */
> > +	u64 sector;		/* disk offset */
> > +	u32 bytes;		/* transfer length */
> > +	u32 action;		/* what happened */
> > +	u16 pid;		/* who did it */
> 
> Also, this field (pid) should probably be a u32.

Ah yes, thanks!

-- 
Jens Axboe



* Re: [PATCH] blk queue io tracing support
  2005-08-24  7:08   ` Jens Axboe
@ 2005-08-24  7:19     ` Nathan Scott
  2005-08-24  7:25       ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2005-08-24  7:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

On Wed, Aug 24, 2005 at 09:08:10AM +0200, Jens Axboe wrote:
> ...
> This isn't msec precision, it's usec. sched_clock() is in ns! I already
> decided that msec is too coarse, but usec _should_ be enough.

Right you are (I was thinking m-for-micro, not m-for-milli in my head ;)
- but still, there doesn't seem to be any reason for that divide-by-1000
and reducing the precision in the kernel rather than in userspace, does
there?  Doing it the other way means you won't ever have to worry about
whether it is/isn't sufficient precision for all possible block devices,
and the precision the tool displays will just be a userspace decision.

cheers.

-- 
Nathan


* Re: [PATCH] blk queue io tracing support
  2005-08-24  7:19     ` Nathan Scott
@ 2005-08-24  7:25       ` Jens Axboe
  2005-08-24  9:28         ` Jens Axboe
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2005-08-24  7:25 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Wed, Aug 24 2005, Nathan Scott wrote:
> On Wed, Aug 24, 2005 at 09:08:10AM +0200, Jens Axboe wrote:
> > ...
> > This isn't msec precision, it's usec. sched_clock() is in ns! I already
> > decided that msec is too coarse, but usec _should_ be enough.
> 
> Right you are (I was thinking m-for-micro, not m-for-milli in my head ;)
> - but still, there doesn't seem to be any reason for that divide-by-1000
> and reducing the precision in the kernel rather than in userspace, does
> there?  Doing it the other way means you won't ever have to worry about
> whether it is/isn't sufficient precision for all possible block devices,
> and the precision the tool displays will just be a userspace decision.

I was just worried about wrapping of ->time, but I did make it a 64-bit
unit. So I guess we could go to nsec granularity, there should be plenty
[*] of space. I'll change that too; I'll post an updated version later.

* I probably should not make such a statement :-)

-- 
Jens Axboe



* Re: [PATCH] blk queue io tracing support
  2005-08-24  7:25       ` Jens Axboe
@ 2005-08-24  9:28         ` Jens Axboe
  2005-08-29  4:53           ` Nathan Scott
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jens Axboe @ 2005-08-24  9:28 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

On Wed, Aug 24 2005, Jens Axboe wrote:
> On Wed, Aug 24 2005, Nathan Scott wrote:
> > On Wed, Aug 24, 2005 at 09:08:10AM +0200, Jens Axboe wrote:
> > > ...
> > > This isn't msec precision, it's usec. sched_clock() is in ns! I already
> > > decided that msec is too coarse, but usec _should_ be enough.
> > 
> > Right you are (I was thinking m-for-micro, not m-for-milli in my head ;)
> > - but still, there doesn't seem to be any reason for that divide-by-1000
> > and reducing the precision in the kernel rather than in userspace, does
> > there?  Doing it the other way means you won't ever have to worry about
> > whether it is/isn't sufficient precision for all possible block devices,
> > and the precision the tool displays will just be a userspace decision.
> 
> I was just worried about wrapping of ->time, but I did make it a 64-bit
> unit. So I guess we could go to nsec granularity, there should be plenty
> [*] of space. I'll change that too; I'll post an updated version later.

Ok, updated version. The tool at:

http://www.kernel.org/pub/linux/kernel/people/axboe/tools/blktrace.c

has been updated as well; the protocol version was increased to
accommodate the trace structure changes.

Changes:

- Include full nsec resolution in ->time
- Bump ->pid to full 32-bit
- Include barrier and sync request options
- Move requeue to separate trace category

Patch attached is against 2.6.13-rc6-mm2. Still a good idea to apply the
relayfs read update from the previous mail [*] as well.

[*] http://marc.theaimsgroup.com/?l=linux-kernel&m=112480046405961&w=2

-- 
Jens Axboe


[-- Attachment #2: blk-trace-2.6.13-rc6-mm2-B1 --]
[-- Type: text/plain, Size: 14698 bytes --]

diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Kconfig linux-2.6.13-rc6-mm2/drivers/block/Kconfig
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Kconfig	2005-08-24 13:17:28.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/Kconfig	2005-08-24 11:52:14.000000000 +0200
@@ -419,6 +419,14 @@ config LBD
 	  your machine, or if you want to have a raid or loopback device
 	  bigger than 2TB.  Otherwise say N.
 
+config BLK_DEV_IO_TRACE
+	bool "Support for tracing block io actions"
+	select RELAYFS
+	help
+	  Say Y here, if you want to be able to trace the block layer actions
+	  on a given queue.
+
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Makefile linux-2.6.13-rc6-mm2/drivers/block/Makefile
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/Makefile	2005-08-07 20:18:56.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/Makefile	2005-08-24 11:52:14.000000000 +0200
@@ -45,3 +45,5 @@ obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
+obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
+
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/blktrace.c linux-2.6.13-rc6-mm2/drivers/block/blktrace.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/blktrace.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.13-rc6-mm2/drivers/block/blktrace.c	2005-08-24 13:22:11.000000000 +0200
@@ -0,0 +1,124 @@
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blktrace.h>
+#include <asm/uaccess.h>
+
+void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
+		     int rw, u32 what, int error, int pdu_len, char *pdu_data)
+{
+	struct blk_io_trace t;
+	unsigned long flags;
+
+	if (rw & (1 << BIO_RW_BARRIER))
+		what |= BLK_TC_ACT(BLK_TC_BARRIER);
+	if (rw & (1 << BIO_RW_SYNC))
+		what |= BLK_TC_ACT(BLK_TC_SYNC);
+
+	if (rw & WRITE)
+		what |= BLK_TC_ACT(BLK_TC_WRITE);
+	else
+		what |= BLK_TC_ACT(BLK_TC_READ);
+		
+	if (((bt->act_mask << BLK_TC_SHIFT) & what) == 0)
+		return;
+
+	t.magic		= BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
+	t.sequence	= atomic_add_return(1, &bt->sequence);
+	t.time		= sched_clock();
+	t.sector	= sector;
+	t.bytes		= bytes;
+	t.action	= what;
+	t.pid		= current->pid;
+	t.error		= error;
+	t.pdu_len	= pdu_len;
+
+	local_irq_save(flags);
+	__relay_write(bt->rchan, &t, sizeof(t));
+	if (pdu_len)
+		__relay_write(bt->rchan, pdu_data, pdu_len);
+	local_irq_restore(flags);
+}
+
+int blk_stop_trace(struct block_device *bdev)
+{
+	request_queue_t *q = bdev_get_queue(bdev);
+	struct blk_trace *bt = NULL;
+	int ret = -EINVAL;
+
+	if (!q)
+		return -ENXIO;
+
+	down(&bdev->bd_sem);
+
+	spin_lock_irq(q->queue_lock);
+	if (q->blk_trace) {
+		bt = q->blk_trace;
+		q->blk_trace = NULL;
+		ret = 0;
+	}
+	spin_unlock_irq(q->queue_lock);
+
+	up(&bdev->bd_sem);
+
+	if (bt) {
+		relay_close(bt->rchan);
+		kfree(bt);
+	}
+
+	return ret;
+}
+
+int blk_start_trace(struct block_device *bdev, char __user *arg)
+{
+	request_queue_t *q = bdev_get_queue(bdev);
+	struct blk_user_trace_setup buts;
+	struct blk_trace *bt;
+	char b[BDEVNAME_SIZE];
+	int ret = 0;
+
+	if (!q)
+		return -ENXIO;
+
+	if (copy_from_user(&buts, arg, sizeof(buts)))
+		return -EFAULT;
+
+	if (!buts.buf_size || !buts.buf_nr)
+		return -EINVAL;
+
+	strcpy(buts.name, bdevname(bdev, b));
+
+	if (copy_to_user(arg, &buts, sizeof(buts)))
+		return -EFAULT;
+
+	down(&bdev->bd_sem);
+	ret = -EBUSY;
+	if (q->blk_trace)
+		goto err;
+
+	ret = -ENOMEM;
+	bt = kmalloc(sizeof(*bt), GFP_KERNEL);
+	if (!bt)
+		goto err;
+
+	atomic_set(&bt->sequence, 0);
+
+	bt->rchan = relay_open(bdevname(bdev, b), NULL, buts.buf_size,
+				buts.buf_nr, NULL);
+	ret = -EIO;
+	if (!bt->rchan)
+		goto err;
+
+	bt->act_mask = buts.act_mask;
+	if (!bt->act_mask)
+		bt->act_mask = (u16) -1;
+
+	spin_lock_irq(q->queue_lock);
+	q->blk_trace = bt;
+	spin_unlock_irq(q->queue_lock);
+	ret = 0;
+err:
+	up(&bdev->bd_sem);
+	return ret;
+}
+
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/elevator.c linux-2.6.13-rc6-mm2/drivers/block/elevator.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/elevator.c	2005-08-07 20:18:56.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/elevator.c	2005-08-24 11:52:14.000000000 +0200
@@ -34,6 +34,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/compiler.h>
+#include <linux/blktrace.h>
 
 #include <asm/uaccess.h>
 
@@ -371,6 +372,9 @@ struct request *elv_next_request(request
 	int ret;
 
 	while ((rq = __elv_next_request(q)) != NULL) {
+
+		blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+
 		/*
 		 * just mark as started even if we don't start it, a request
 		 * that has been delayed should not be passed by new incoming
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ioctl.c linux-2.6.13-rc6-mm2/drivers/block/ioctl.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ioctl.c	2005-08-07 20:18:56.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/ioctl.c	2005-08-24 11:52:14.000000000 +0200
@@ -4,6 +4,7 @@
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
 #include <linux/smp_lock.h>
+#include <linux/blktrace.h>
 #include <asm/uaccess.h>
 
 static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user *arg)
@@ -188,6 +189,10 @@ static int blkdev_locked_ioctl(struct fi
 		return put_ulong(arg, bdev->bd_inode->i_size >> 9);
 	case BLKGETSIZE64:
 		return put_u64(arg, bdev->bd_inode->i_size);
+	case BLKSTARTTRACE:
+		return blk_start_trace(bdev, (char __user *) arg);
+	case BLKSTOPTRACE:
+		return blk_stop_trace(bdev);
 	}
 	return -ENOIOCTLCMD;
 }
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c
--- /opt/kernel/linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c	2005-08-24 13:17:28.000000000 +0200
+++ linux-2.6.13-rc6-mm2/drivers/block/ll_rw_blk.c	2005-08-24 11:52:14.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
+#include <linux/blktrace.h>
 
 /*
  * for max sense size
@@ -1625,6 +1626,12 @@ void blk_cleanup_queue(request_queue_t *
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
 
+	if (q->blk_trace) {
+		relay_close(q->blk_trace->rchan);
+		kfree(q->blk_trace);
+		q->blk_trace = NULL;
+	}
+
 	blk_queue_ordered(q, QUEUE_ORDERED_NONE);
 
 	kmem_cache_free(requestq_cachep, q);
@@ -1971,6 +1978,8 @@ rq_starved:
 	
 	rq_init(q, rq);
 	rq->rl = rl;
+
+	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
 out:
 	return rq;
 }
@@ -1999,6 +2008,8 @@ static struct request *get_request_wait(
 		if (!rq) {
 			struct io_context *ioc;
 
+			blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+
 			__generic_unplug_device(q);
 			spin_unlock_irq(q->queue_lock);
 			io_schedule();
@@ -2052,6 +2063,8 @@ EXPORT_SYMBOL(blk_get_request);
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
+	blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
 
@@ -2665,6 +2678,8 @@ static int __make_request(request_queue_
 			if (!q->back_merge_fn(q, req, bio))
 				break;
 
+			blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+
 			req->biotail->bi_next = bio;
 			req->biotail = bio;
 			req->nr_sectors = req->hard_nr_sectors += nr_sectors;
@@ -2680,6 +2695,8 @@ static int __make_request(request_queue_
 			if (!q->front_merge_fn(q, req, bio))
 				break;
 
+			blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+
 			bio->bi_next = req->bio;
 			req->bio = bio;
 
@@ -2705,6 +2722,8 @@ static int __make_request(request_queue_
 	}
 
 get_rq:
+	blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+
 	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
@@ -2981,6 +3000,10 @@ end_io:
 		blk_partition_remap(bio);
 
 		ret = q->make_request_fn(q, bio);
+
+		if (ret)
+			blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+
 	} while (ret);
 }
 
@@ -3099,6 +3122,8 @@ static int __end_that_request_first(stru
 	int total_bytes, bio_nbytes, error, next_idx = 0;
 	struct bio *bio;
 
+	blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+
 	/*
 	 * extend uptodate bool to allow < 0 value to be direct io error
 	 */
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blkdev.h linux-2.6.13-rc6-mm2/include/linux/blkdev.h
--- /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blkdev.h	2005-08-24 13:17:35.000000000 +0200
+++ linux-2.6.13-rc6-mm2/include/linux/blkdev.h	2005-08-24 11:52:14.000000000 +0200
@@ -22,6 +22,7 @@ typedef struct request_queue request_que
 struct elevator_queue;
 typedef struct elevator_queue elevator_t;
 struct request_pm_state;
+struct blk_trace;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -412,6 +413,8 @@ struct request_queue
 	 */
 	struct request		*flush_rq;
 	unsigned char		ordered;
+
+	struct blk_trace	*blk_trace;
 };
 
 enum {
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blktrace.h linux-2.6.13-rc6-mm2/include/linux/blktrace.h
--- /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/blktrace.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.13-rc6-mm2/include/linux/blktrace.h	2005-08-24 13:22:24.000000000 +0200
@@ -0,0 +1,145 @@
+#ifndef BLKTRACE_H
+#define BLKTRACE_H
+
+#include <linux/config.h>
+#include <linux/blkdev.h>
+#include <linux/relayfs_fs.h>
+
+/*
+ * Trace categories
+ */
+enum {
+	BLK_TC_READ	= 1 << 0,	/* reads */
+	BLK_TC_WRITE	= 1 << 1,	/* writes */
+	BLK_TC_BARRIER	= 1 << 2,	/* barrier */
+	BLK_TC_SYNC	= 1 << 3,	/* sync */
+	BLK_TC_QUEUE	= 1 << 4,	/* queueing/merging */
+	BLK_TC_REQUEUE	= 1 << 5,	/* requeueing */
+	BLK_TC_ISSUE	= 1 << 6,	/* issue */
+	BLK_TC_COMPLETE	= 1 << 7,	/* completions */
+	BLK_TC_FS	= 1 << 8,	/* fs requests */
+	BLK_TC_PC	= 1 << 9,	/* pc requests */
+
+	BLK_TC_END	= 1 << 15,	/* only 16-bits, reminder */
+};
+
+#define BLK_TC_SHIFT		(16)
+#define BLK_TC_ACT(act)		((act) << BLK_TC_SHIFT)
+
+/*
+ * Basic trace actions
+ */
+enum {
+	__BLK_TA_QUEUE = 1,		/* queued */
+	__BLK_TA_BACKMERGE,		/* back merged to existing rq */
+	__BLK_TA_FRONTMERGE,		/* front merge to existing rq */
+	__BLK_TA_GETRQ,			/* allocated new request */
+	__BLK_TA_SLEEPRQ,		/* sleeping on rq allocation */
+	__BLK_TA_REQUEUE,		/* request requeued */
+	__BLK_TA_ISSUE,			/* sent to driver */
+	__BLK_TA_COMPLETE,		/* completed by driver */
+};
+
+/*
+ * Trace actions in full. Additionally, read or write is masked
+ */
+#define BLK_TA_QUEUE		(__BLK_TA_QUEUE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define BLK_TA_BACKMERGE	(__BLK_TA_BACKMERGE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define BLK_TA_FRONTMERGE	(__BLK_TA_FRONTMERGE | BLK_TC_ACT(BLK_TC_QUEUE))
+#define	BLK_TA_GETRQ		(__BLK_TA_GETRQ | BLK_TC_ACT(BLK_TC_QUEUE))
+#define	BLK_TA_SLEEPRQ		(__BLK_TA_SLEEPRQ | BLK_TC_ACT(BLK_TC_QUEUE))
+#define	BLK_TA_REQUEUE		(__BLK_TA_REQUEUE | BLK_TC_ACT(BLK_TC_REQUEUE))
+#define BLK_TA_ISSUE		(__BLK_TA_ISSUE | BLK_TC_ACT(BLK_TC_ISSUE))
+#define BLK_TA_COMPLETE		(__BLK_TA_COMPLETE| BLK_TC_ACT(BLK_TC_COMPLETE))
+
+#define BLK_IO_TRACE_MAGIC	0x65617400
+#define BLK_IO_TRACE_VERSION	0x02
+
+/*
+ * The trace itself
+ */
+struct blk_io_trace {
+	u32 magic;		/* MAGIC << 8 | version */
+	u32 sequence;		/* event number */
+	u64 time;		/* in nanoseconds */
+	u64 sector;		/* disk offset */
+	u32 bytes;		/* transfer length */
+	u32 action;		/* what happened */
+	u32 pid;		/* who did it */
+	u16 error;		/* completion error */
+	u16 pdu_len;		/* length of data after this trace */
+};
+
+struct blk_trace {
+	struct rchan *rchan;
+	atomic_t sequence;
+	u16 act_mask;
+};
+
+/*
+ * User setup structure passed with BLKSTARTTRACE
+ */
+struct blk_user_trace_setup {
+	char name[BDEVNAME_SIZE];	/* output */
+	u16 act_mask;			/* input */
+	u32 buf_size;			/* input */
+	u32 buf_nr;			/* input */
+};
+
+#if defined(CONFIG_BLK_DEV_IO_TRACE)
+extern int blk_start_trace(struct block_device *, char __user *);
+extern int blk_stop_trace(struct block_device *);
+extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, char *);
+
+static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
+				    u32 what)
+{
+	struct blk_trace *bt = q->blk_trace;
+	int rw = rq->flags & 0x07;
+
+	if (likely(!bt))
+		return;
+
+	if (blk_pc_request(rq)) {
+		what |= BLK_TC_ACT(BLK_TC_PC);
+		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+	} else  {
+		what |= BLK_TC_ACT(BLK_TC_FS);
+		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+	}
+}
+
+static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
+				     u32 what)
+{
+	struct blk_trace *bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+static inline void blk_add_trace_generic(struct request_queue *q,
+					 struct bio *bio, int rw, u32 what)
+{
+	struct blk_trace *bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		blk_add_trace_bio(q, bio, what);
+	else
+		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+#else /* !CONFIG_BLK_DEV_IO_TRACE */
+#define blk_start_trace(bdev, arg)		(-EINVAL)
+#define blk_stop_trace(bdev)			(-EINVAL)
+#define blk_add_trace_rq(q, rq, what)		do { } while (0)
+#define blk_add_trace_bio(q, rq, what)		do { } while (0)
+#define blk_add_trace_generic(q, rq, rw, what)	do { } while (0)
+#endif /* CONFIG_BLK_DEV_IO_TRACE */
+
+#endif
diff -urpN -X linux-2.6.13-rc6-mm2/Documentation/dontdiff /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/fs.h linux-2.6.13-rc6-mm2/include/linux/fs.h
--- /opt/kernel/linux-2.6.13-rc6-mm2/include/linux/fs.h	2005-08-24 13:17:35.000000000 +0200
+++ linux-2.6.13-rc6-mm2/include/linux/fs.h	2005-08-24 11:52:14.000000000 +0200
@@ -196,6 +196,8 @@ extern int dir_notify_enable;
 #define BLKBSZGET  _IOR(0x12,112,size_t)
 #define BLKBSZSET  _IOW(0x12,113,size_t)
 #define BLKGETSIZE64 _IOR(0x12,114,size_t)	/* return device size in bytes (u64 *arg) */
+#define BLKSTARTTRACE _IOWR(0x12,115,struct blk_user_trace_setup)
+#define BLKSTOPTRACE _IO(0x12,116)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */


* Re: [PATCH] blk queue io tracing support
  2005-08-24  9:28         ` Jens Axboe
@ 2005-08-29  4:53           ` Nathan Scott
  2005-08-29  5:57             ` Jens Axboe
  2005-08-30 23:43           ` Nathan Scott
  2005-08-30 23:48           ` Nathan Scott
  2 siblings, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2005-08-29  4:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> ...
> Patch attached is against 2.6.13-rc6-mm2. Still a good idea to apply the
> relayfs read update from the previous mail [*] as well.

Hi Jens,

There's a minor config botch in there, I get this:

scripts/kconfig/conf -s arch/i386/Kconfig
drivers/block/Kconfig:466:warning: 'select' used by config symbol 'BLK_DEV_IO_TRACE' refer to undefined symbol 'RELAYFS'

The patch below seems to resolve it.

cheers.

-- 
Nathan


Index: relayfs-2.6.x-xfs/drivers/block/Kconfig
===================================================================
--- relayfs-2.6.x-xfs.orig/drivers/block/Kconfig
+++ relayfs-2.6.x-xfs/drivers/block/Kconfig
@@ -463,7 +463,7 @@
 
 config BLK_DEV_IO_TRACE
 	bool "Support for tracing block io actions"
-	select RELAYFS
+	select RELAYFS_FS
 	help
 	  Say Y here, if you want to be able to trace the block layer actions
 	  on a given queue.


* Re: [PATCH] blk queue io tracing support
  2005-08-29  4:53           ` Nathan Scott
@ 2005-08-29  5:57             ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2005-08-29  5:57 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Mon, Aug 29 2005, Nathan Scott wrote:
> On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> > ...
> > Patch attached is against 2.6.13-rc6-mm2. Still a good idea to apply the
> > relayfs read update from the previous mail [*] as well.
> 
> Hi Jens,
> 
> There's a minor config botch in there, I get this:
> 
> scripts/kconfig/conf -s arch/i386/Kconfig
> drivers/block/Kconfig:466:warning: 'select' used by config symbol 'BLK_DEV_IO_TRACE' refer to undefined symbol 'RELAYFS'
> 
> The patch below seems to resolve it.

Thanks, you are right, the name is indeed RELAYFS_FS.

-- 
Jens Axboe



* Re: [PATCH] blk queue io tracing support
  2005-08-24  9:28         ` Jens Axboe
  2005-08-29  4:53           ` Nathan Scott
@ 2005-08-30 23:43           ` Nathan Scott
  2005-08-31  7:31             ` Jens Axboe
  2005-08-30 23:48           ` Nathan Scott
  2 siblings, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2005-08-30 23:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Hi Jens,

On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> Patch attached is against 2.6.13-rc6-mm2. Still a good idea to apply the
> relayfs read update from the previous mail [*] as well.

There's a small memory leak there on one of the start-tracing
error paths (relay_open failure)... this should plug it up.

cheers.

-- 
Nathan


Index: 2.6.x-xfs/drivers/block/blktrace.c
===================================================================
--- 2.6.x-xfs.orig/drivers/block/blktrace.c
+++ 2.6.x-xfs/drivers/block/blktrace.c
@@ -73,9 +73,9 @@ int blk_start_trace(struct block_device 
 {
 	request_queue_t *q = bdev_get_queue(bdev);
 	struct blk_user_trace_setup buts;
-	struct blk_trace *bt;
+	struct blk_trace *bt = NULL;
 	char b[BDEVNAME_SIZE];
-	int ret = 0;
+	int ret;
 
 	if (!q)
 		return -ENXIO;
@@ -116,9 +116,14 @@ int blk_start_trace(struct block_device 
 	spin_lock_irq(q->queue_lock);
 	q->blk_trace = bt;
 	spin_unlock_irq(q->queue_lock);
-	ret = 0;
+
+	up(&bdev->bd_sem);
+	return 0;
+
 err:
 	up(&bdev->bd_sem);
+	if (bt)
+		kfree(bt);
 	return ret;
 }
 


* Re: [PATCH] blk queue io tracing support
  2005-08-24  9:28         ` Jens Axboe
  2005-08-29  4:53           ` Nathan Scott
  2005-08-30 23:43           ` Nathan Scott
@ 2005-08-30 23:48           ` Nathan Scott
  2005-08-30 23:58             ` Nathan Scott
                               ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Nathan Scott @ 2005-08-30 23:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Hi Jens,

On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> Ok, updated version.

One thing I found a bit awkward was the way it's putting all inodes
in the root of the relayfs namespace, with the cpuid tacked on the
end of the bdevname - I was a bit confused at first when a trace of
sdd on my 4P box spontaneously created files for "partitions" sdd0,
sdd1, sdd2, and sdd3 ;).

I suppose if many more users of relayfs spring into existence, this
is going to get quite ugly.  Below is a patch that aligns the names
to the conventions used in sysfs; so, for example, when running two
traces simultaneously on /dev/sdd and /dev/sdb, instead of this:

# find /relay
/relay
/relay/sdd3
/relay/sdd2
/relay/sdd1
/relay/sdd0
/relay/sdb3
/relay/sdb2
/relay/sdb1
/relay/sdb0

it now uses this...

# find /relay
/relay
/relay/block
/relay/block/sdd
/relay/block/sdd/trace3
/relay/block/sdd/trace2
/relay/block/sdd/trace1
/relay/block/sdd/trace0
/relay/block/sdb
/relay/block/sdb/trace3
/relay/block/sdb/trace2
/relay/block/sdb/trace1
/relay/block/sdb/trace0

and does the correct dynamic setup and teardown of the hierarchy
as the userspace tool starts and stops tracing.  I had to modify
the relayfs rmdir code a bit to make this work properly, I'll
send a separate patch for that shortly.

> http://www.kernel.org/pub/linux/kernel/people/axboe/tools/blktrace.c
> 
> has been updated as well; the protocol version was increased to
> accommodate the trace structure changes.

I have the associated userspace change for this, as well as several
other fixes and tweaks for your tool - if you could slap a copyright
and license notice onto that source (pretty please? :) I'll send 'em
right along.

cheers.

-- 
Nathan

Index: 2.6.x-xfs/drivers/block/blktrace.c
===================================================================
--- 2.6.x-xfs.orig/drivers/block/blktrace.c
+++ 2.6.x-xfs/drivers/block/blktrace.c
@@ -40,6 +40,50 @@ void __blk_add_trace(struct blk_trace *b
 	local_irq_restore(flags);
 }
 
+static struct dentry * blk_tree_root;
+static DEFINE_SPINLOCK(blk_tree_lock);
+
+static inline void blk_remove_root(void)
+{
+	if (relayfs_remove_dir(blk_tree_root) != -ENOTEMPTY)
+		blk_tree_root = NULL;
+}
+
+static void blk_remove_tree(struct dentry *dir)
+{
+	spin_lock(&blk_tree_lock);
+	relayfs_remove_dir(dir);
+	blk_remove_root();
+	spin_unlock(&blk_tree_lock);
+}
+
+static struct dentry *blk_create_tree(const char *blk_name)
+{
+	struct dentry *dir;
+
+	spin_lock(&blk_tree_lock);
+	if (!blk_tree_root) {
+		blk_tree_root = relayfs_create_dir("block", NULL);
+		if (!blk_tree_root) {
+			spin_unlock(&blk_tree_lock);
+			return NULL;
+		}
+	}
+	dir = relayfs_create_dir(blk_name, blk_tree_root);
+	if (!dir)
+		blk_remove_root();
+	spin_unlock(&blk_tree_lock);
+
+	return dir;
+}
+
+void blk_cleanup_trace(struct blk_trace *bt)
+{
+	relay_close(bt->rchan);
+	blk_remove_tree(bt->dir);
+	kfree(bt);
+}
+
 int blk_stop_trace(struct block_device *bdev)
 {
 	request_queue_t *q = bdev_get_queue(bdev);
@@ -61,10 +105,8 @@ int blk_stop_trace(struct block_device *
 
 	up(&bdev->bd_sem);
 
-	if (bt) {
-		relay_close(bt->rchan);
-		kfree(bt);
-	}
+	if (bt)
+		blk_cleanup_trace(bt);
 
 	return ret;
 }
@@ -74,6 +116,7 @@ int blk_start_trace(struct block_device 
 	request_queue_t *q = bdev_get_queue(bdev);
 	struct blk_user_trace_setup buts;
 	struct blk_trace *bt = NULL;
+	struct dentry *dir = NULL;
 	char b[BDEVNAME_SIZE];
 	int ret;
 
@@ -101,11 +144,16 @@ int blk_start_trace(struct block_device 
 	if (!bt)
 		goto err;
 
+	ret = -ENOENT;
+	dir = blk_create_tree(bdevname(bdev, b));
+	if (!dir)
+		goto err;
+
+	bt->dir = dir;
 	atomic_set(&bt->sequence, 0);
 
-	bt->rchan = relay_open(bdevname(bdev, b), NULL, buts.buf_size,
-				buts.buf_nr, NULL);
 	ret = -EIO;
+	bt->rchan = relay_open("trace", dir, buts.buf_size, buts.buf_nr, NULL);
 	if (!bt->rchan)
 		goto err;
 
@@ -122,6 +170,8 @@ int blk_start_trace(struct block_device 
 
 err:
 	up(&bdev->bd_sem);
+	if (dir)
+		blk_remove_tree(dir);
 	if (bt)
 		kfree(bt);
 	return ret;
Index: 2.6.x-xfs/include/linux/blktrace.h
===================================================================
--- 2.6.x-xfs.orig/include/linux/blktrace.h
+++ 2.6.x-xfs/include/linux/blktrace.h
@@ -71,6 +71,7 @@ struct blk_io_trace {
 };
 
 struct blk_trace {
+	struct dentry *dir;
 	struct rchan *rchan;
 	atomic_t sequence;
 	u16 act_mask;
@@ -89,6 +90,7 @@ struct blk_user_trace_setup {
 #if defined(CONFIG_BLK_DEV_IO_TRACE)
 extern int blk_start_trace(struct block_device *, char __user *);
 extern int blk_stop_trace(struct block_device *);
+extern void blk_cleanup_trace(struct blk_trace *);
 extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, char *);
 
 static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
@@ -137,6 +139,7 @@ static inline void blk_add_trace_generic
 #else /* !CONFIG_BLK_DEV_IO_TRACE */
 #define blk_start_trace(bdev, arg)		(-EINVAL)
 #define blk_stop_trace(bdev)			(-EINVAL)
+#define blk_cleanup_trace(bt)			do { } while (0)
 #define blk_add_trace_rq(q, rq, what)		do { } while (0)
 #define blk_add_trace_bio(q, rq, what)		do { } while (0)
 #define blk_add_trace_generic(q, rq, rw, what)	do { } while (0)
Index: 2.6.x-xfs/drivers/block/ll_rw_blk.c
===================================================================
--- 2.6.x-xfs.orig/drivers/block/ll_rw_blk.c
+++ 2.6.x-xfs/drivers/block/ll_rw_blk.c
@@ -1625,8 +1625,7 @@ void blk_cleanup_queue(request_queue_t *
 		__blk_queue_free_tags(q);
 
 	if (q->blk_trace) {
-		relay_close(q->blk_trace->rchan);
-		kfree(q->blk_trace);
+		blk_cleanup_trace(q->blk_trace);
 		q->blk_trace = NULL;
 	}
 


* Re: [PATCH] blk queue io tracing support
  2005-08-30 23:48           ` Nathan Scott
@ 2005-08-30 23:58             ` Nathan Scott
  2005-08-31  4:19               ` Tom Zanussi
  2005-08-31  7:33             ` Jens Axboe
  2005-09-02 11:20             ` Jens Axboe
  2 siblings, 1 reply; 20+ messages in thread
From: Nathan Scott @ 2005-08-30 23:58 UTC (permalink / raw)
  To: Jens Axboe, Tom Zanussi; +Cc: Linux Kernel

Hi there,

On Wed, Aug 31, 2005 at 09:48:23AM +1000, Nathan Scott wrote:
> ...
> # find /relay
> /relay
> /relay/block
> /relay/block/sdd
> /relay/block/sdd/trace3
> /relay/block/sdd/trace2
> /relay/block/sdd/trace1
> /relay/block/sdd/trace0
> /relay/block/sdb
> /relay/block/sdb/trace3
> /relay/block/sdb/trace2
> /relay/block/sdb/trace1
> /relay/block/sdb/trace0
> 
> and does the correct dynamic setup and teardown of the hierarchy
> as the userspace tool starts and stops tracing.  I had to modify
> the relayfs rmdir code a bit to make this work properly, I'll
> send a separate patch for that shortly.

Here it is.  The problem was that relayfs was allowing a directory
with children to be removed rather than returning -ENOTEMPTY.  It
looks like this can be resolved by splitting the shared relayfs
unlink code (which is using simple_unlink) into separate file/dir
variants, one using simple_unlink, the other using simple_rmdir.

cheers.

-- 
Nathan


Index: 2.6.x-xfs/fs/relayfs/inode.c
===================================================================
--- 2.6.x-xfs.orig/fs/relayfs/inode.c
+++ 2.6.x-xfs/fs/relayfs/inode.c
@@ -187,8 +187,8 @@ struct dentry *relayfs_create_dir(const 
 }
 
 /**
- *	relayfs_remove - remove a file or directory in the relay filesystem
- *	@dentry: file or directory dentry
+ *	relayfs_remove - remove a file in the relay filesystem
+ *	@dentry: file dentry
  */
 int relayfs_remove(struct dentry *dentry)
 {
@@ -219,10 +219,31 @@ int relayfs_remove(struct dentry *dentry
  */
 int relayfs_remove_dir(struct dentry *dentry)
 {
+	struct dentry *parent;
+	int error = 0;
+
 	if (!dentry)
 		return -EINVAL;
+	parent = dentry->d_parent;
+	if (!parent)
+		return -EINVAL;
+
+	parent = dget(parent);
+	down(&parent->d_inode->i_sem);
+	if (dentry->d_inode) {
+		error = simple_rmdir(parent->d_inode, dentry);
+		if (!error)
+			d_delete(dentry);
+	}
+	if (!error)
+		dput(dentry);
+	up(&parent->d_inode->i_sem);
+	dput(parent);
+
+	if (!error)
+		simple_release_fs(&relayfs_mount, &relayfs_mount_count);
 
-	return relayfs_remove(dentry);
+	return error;
 }
 
 /**


* Re: [PATCH] blk queue io tracing support
  2005-08-30 23:58             ` Nathan Scott
@ 2005-08-31  4:19               ` Tom Zanussi
  2005-08-31  4:33                 ` Nathan Scott
  0 siblings, 1 reply; 20+ messages in thread
From: Tom Zanussi @ 2005-08-31  4:19 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Jens Axboe, Linux Kernel

Nathan Scott writes:
 > Hi there,
 > 
 > On Wed, Aug 31, 2005 at 09:48:23AM +1000, Nathan Scott wrote:
 > > ...
 > > # find /relay
 > > /relay
 > > /relay/block
 > > /relay/block/sdd
 > > /relay/block/sdd/trace3
 > > /relay/block/sdd/trace2
 > > /relay/block/sdd/trace1
 > > /relay/block/sdd/trace0
 > > /relay/block/sdb
 > > /relay/block/sdb/trace3
 > > /relay/block/sdb/trace2
 > > /relay/block/sdb/trace1
 > > /relay/block/sdb/trace0
 > > 
 > > and does the correct dynamic setup and teardown of the hierarchy
 > > as the userspace tool starts and stops tracing.  I had to modify
 > > the relayfs rmdir code a bit to make this work properly, I'll
 > > send a separate patch for that shortly.
 > 
 > Here it is.  The problem was that relayfs is allowing a directory
 > with children to be removed rather than returning -ENOTEMPTY.  It
 > looks like this can be resolved by splitting the shared relayfs
 > unlink code (which is using simple_unlink) into separate file/dir
 > variants, one using simple_unlink, the other using simple_rmdir.
 > 

Hi,

You're right, it should be using simple_rmdir rather than
simple_unlink for removing directories.  Thanks for sending the patch,
which I've modified a bit to avoid splitting the rmdir/unlink cases
into separate functions, since they're almost the same except for what
they end up calling.  relayfs_remove_dir now doesn't do anything but
call relayfs_remove (it didn't do much more than that before anyway),
but it makes sense to me to keep it, as the counterpart to
relayfs_create_dir.  Let me know if you see any problems with it.

Thanks,

Tom

--- inode.c~	2005-08-31 04:08:07.000000000 -0500
+++ inode.c	2005-08-31 03:44:40.000000000 -0500
@@ -189,26 +189,39 @@ struct dentry *relayfs_create_dir(const 
 /**
  *	relayfs_remove - remove a file or directory in the relay filesystem
  *	@dentry: file or directory dentry
+ *
+ *	Returns 0 if successful, negative otherwise.
  */
 int relayfs_remove(struct dentry *dentry)
 {
-	struct dentry *parent = dentry->d_parent;
+	struct dentry *parent;
+	int error = 0;
+
+	if (!dentry)
+		return -EINVAL;
+	parent = dentry->d_parent;
 	if (!parent)
 		return -EINVAL;
 
 	parent = dget(parent);
 	down(&parent->d_inode->i_sem);
 	if (dentry->d_inode) {
-		simple_unlink(parent->d_inode, dentry);
-		d_delete(dentry);
+		if (S_ISDIR(dentry->d_inode->i_mode))
+			error = simple_rmdir(parent->d_inode, dentry);
+		else
+			error = simple_unlink(parent->d_inode, dentry);
+		if (!error)
+			d_delete(dentry);
 	}
-	dput(dentry);
+	if (!error)
+		dput(dentry);
 	up(&parent->d_inode->i_sem);
 	dput(parent);
 
-	simple_release_fs(&relayfs_mount, &relayfs_mount_count);
+	if (!error)
+		simple_release_fs(&relayfs_mount, &relayfs_mount_count);
 
-	return 0;
+	return error;
 }
 
 /**
@@ -219,9 +232,6 @@ int relayfs_remove(struct dentry *dentry
  */
 int relayfs_remove_dir(struct dentry *dentry)
 {
-	if (!dentry)
-		return -EINVAL;
-
 	return relayfs_remove(dentry);
 }
 




* Re: [PATCH] blk queue io tracing support
  2005-08-31  4:19               ` Tom Zanussi
@ 2005-08-31  4:33                 ` Nathan Scott
  2005-08-31  4:53                   ` Nathan Scott
  2005-08-31  4:55                   ` Tom Zanussi
  0 siblings, 2 replies; 20+ messages in thread
From: Nathan Scott @ 2005-08-31  4:33 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: Jens Axboe, Linux Kernel

Hi Tom,

On Tue, Aug 30, 2005 at 11:19:04PM -0500, Tom Zanussi wrote:
> You're right, it should be using simple_rmdir rather than
> simple_unlink for removing directories.  Thanks for sending the patch,

No problem.

> which I've modified a bit to avoid splitting the rmdir/unlink cases
> into separate functions, since they're almost the same except for what
> they end up calling.  relayfs_remove_dir now doesn't do anything but
> call relayfs_remove (it didn't do much more than that before anyway),
> but it makes sense to me to keep it, as the counterpart to
> relayfs_create_dir.  Let me know if you see any problems with it.

Looks OK, I'll give it a spin.

On an unrelated note, are there any known issues with using epoll
on relayfs file descriptors?  I'm having a few troubles, and just
wondering if it's me doing something silly, or if it's known to not
work...?  Symptoms of the problem are epoll continually reaching
its timeout with no modified fds found (when I know the inode has
modified trace buffers attached) ... and the epoll code is a bit
too hairy for me to go find a quick fix - it seems like it should
be able to work, though, since relayfs has a ->poll implementation.

cheers.

-- 
Nathan


* Re: [PATCH] blk queue io tracing support
  2005-08-31  4:33                 ` Nathan Scott
@ 2005-08-31  4:53                   ` Nathan Scott
  2005-08-31  4:55                   ` Tom Zanussi
  1 sibling, 0 replies; 20+ messages in thread
From: Nathan Scott @ 2005-08-31  4:53 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: Jens Axboe, Linux Kernel

On Wed, Aug 31, 2005 at 02:33:10PM +1000, Nathan Scott wrote:
> ...
> On an unrelated note, are there any known issues with using epoll
> on relayfs file descriptors?  I'm having a few troubles, and just
> wondering if its me doing something silly, or if its known to not
> work...?  Symptoms of the problem are epoll continually reaching
> its timeout with no modified fds found (when I know the inode has
> modified trace buffers attached) ...

Actually, poll(2) seems to have the same behaviour with a simpler
test case (i.e. no epoll, and with just one fd being polled) - if I
read(2) from it every few thousand usec (using the blktrace tool)
it sees new data, but if I poll, it never reports the descriptor
as changed (this is a 2.6.13 kernel with the relayfs patches from
-mm patched into it and Jens' blktrace patch generating the data
that I'm attempting to poll).

cheers.

-- 
Nathan


* Re: [PATCH] blk queue io tracing support
  2005-08-31  4:33                 ` Nathan Scott
  2005-08-31  4:53                   ` Nathan Scott
@ 2005-08-31  4:55                   ` Tom Zanussi
  1 sibling, 0 replies; 20+ messages in thread
From: Tom Zanussi @ 2005-08-31  4:55 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Tom Zanussi, Jens Axboe, Linux Kernel

Nathan Scott writes:
 > Hi Tom,
 > 
 > On Tue, Aug 30, 2005 at 11:19:04PM -0500, Tom Zanussi wrote:
 > > You're right, it should be using simple_rmdir rather than
 > > simple_unlink for removing directories.  Thanks for sending the patch,
 > 
 > No problem.
 > 
 > > which I've modified a bit to avoid splitting the rmdir/unlink cases
 > > into separate functions, since they're almost the same except for what
 > > they end up calling.  relayfs_remove_dir now doesn't do anything but
 > > call relayfs_remove (it didn't do much more than that before anyway),
 > > but it makes sense to me to keep it, as the counterpart to
 > > relayfs_create_dir.  Let me know if you see any problems with it.
 > 
 > Looks OK, I'll give it a spin.
 > 
 > On an unrelated note, are there any known issues with using epoll
 > on relayfs file descriptors?  I'm having a few troubles, and just
 > wondering if it's me doing something silly, or if it's known to not
 > work...?  Symptoms of the problem are epoll continually reaching
 > its timeout with no modified fds found (when I know the inode has
 > modified trace buffers attached) ... and the epoll code is a bit
 > too hairy for me to go find a quick fix - seems like it should be
 > able to work though since relayfs has a ->poll implementation.

Well, the relayfs poll implementation is based on completed
sub-buffers, so you can be writing events into a buffer, but until a
buffer switch happens, you won't be notified that anything's changed.
The reason for the sub-buffer granularity is that relayfs was
originally meant for use only with mmap(), but now that there's a
read(), I'll probably have to make some changes to the poll
implementation as well.

Tom




* Re: [PATCH] blk queue io tracing support
  2005-08-30 23:43           ` Nathan Scott
@ 2005-08-31  7:31             ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2005-08-31  7:31 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Wed, Aug 31 2005, Nathan Scott wrote:
> Hi Jens,
> 
> On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> > Patch attached is against 2.6.13-rc6-mm2. Still a good idea to apply the
> > relayfs read update from the previous mail [*] as well.
> 
> There's a small memory leak there on one of the start-tracing
> error paths (relay_open failure)... this should plug it up.
> 
> cheers.
> 
> -- 
> Nathan
> 
> 
> Index: 2.6.x-xfs/drivers/block/blktrace.c
> ===================================================================
> --- 2.6.x-xfs.orig/drivers/block/blktrace.c
> +++ 2.6.x-xfs/drivers/block/blktrace.c
> @@ -73,9 +73,9 @@ int blk_start_trace(struct block_device 
>  {
>  	request_queue_t *q = bdev_get_queue(bdev);
>  	struct blk_user_trace_setup buts;
> -	struct blk_trace *bt;
> +	struct blk_trace *bt = NULL;
>  	char b[BDEVNAME_SIZE];
> -	int ret = 0;
> +	int ret;
>  
>  	if (!q)
>  		return -ENXIO;
> @@ -116,9 +116,14 @@ int blk_start_trace(struct block_device 
>  	spin_lock_irq(q->queue_lock);
>  	q->blk_trace = bt;
>  	spin_unlock_irq(q->queue_lock);
> -	ret = 0;
> +
> +	up(&bdev->bd_sem);
> +	return 0;
> +
>  err:
>  	up(&bdev->bd_sem);
> +	if (bt)
> +		kfree(bt);
>  	return ret;
>  }

Indeed, thanks! I've applied the patch, I'll do a new release against
2.6.14-pre/rc/git as soon as relayfs gets merged. Or 2.6.13-mm1, if that
comes first.

BTW, the trace tools now live in a git repo here:

rsync://rsync.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git

-- 
Jens Axboe



* Re: [PATCH] blk queue io tracing support
  2005-08-30 23:48           ` Nathan Scott
  2005-08-30 23:58             ` Nathan Scott
@ 2005-08-31  7:33             ` Jens Axboe
  2005-09-02 11:20             ` Jens Axboe
  2 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2005-08-31  7:33 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Wed, Aug 31 2005, Nathan Scott wrote:
> Hi Jens,
> 
> On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> > Ok, updated version.
> 
> One thing I found a bit awkward was the way it's putting all inodes
> in the root of the relayfs namespace, with the cpuid tacked on the
> end of the bdevname - I was a bit confused at first when a trace of
> sdd on my 4P box spontaneously created files for "partitions" sdd0,
> sdd1, sdd2, and sdd3 ;).

Yeah I agree, it's not very logical.

> 
> I suppose if many more users of relayfs spring into existence, this
> is going to get quite ugly.  Below is a patch that aligns the names
> to the conventions used in sysfs; so, for example, when running two
> traces simultaneously on /dev/sdd and /dev/sdb, instead of this:
> 
> # find /relay
> /relay
> /relay/sdd3
> /relay/sdd2
> /relay/sdd1
> /relay/sdd0
> /relay/sdb3
> /relay/sdb2
> /relay/sdb1
> /relay/sdb0
> 
> it now uses this...
> 
> # find /relay
> /relay
> /relay/block
> /relay/block/sdd
> /relay/block/sdd/trace3
> /relay/block/sdd/trace2
> /relay/block/sdd/trace1
> /relay/block/sdd/trace0
> /relay/block/sdb
> /relay/block/sdb/trace3
> /relay/block/sdb/trace2
> /relay/block/sdb/trace1
> /relay/block/sdb/trace0
> 
> and does the correct dynamic setup and teardown of the hierarchy
> as the userspace tool starts and stops tracing.  I had to modify
> the relayfs rmdir code a bit to make this work properly, I'll
> send a separate patch for that shortly.

It makes sense to me; please work with the relayfs people to get this
integrated. I really don't want to carry any extra stuff for relayfs
around.

> > http://www.kernel.org/pub/linux/kernel/people/axboe/tools/blktrace.c
> > 
> > has been updated as well, the protocol version was increased to
> > accommodate the trace structure changes.
> 
> I have the associated userspace change for this, as well as several
> other fixes and tweaks for your tool - if you could slap a copyright
> and license notice onto that source (pretty please? :) I'll send 'em
> right along.

You bet, I'll add it right away.

-- 
Jens Axboe



* Re: [PATCH] blk queue io tracing support
  2005-08-30 23:48           ` Nathan Scott
  2005-08-30 23:58             ` Nathan Scott
  2005-08-31  7:33             ` Jens Axboe
@ 2005-09-02 11:20             ` Jens Axboe
  2 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2005-09-02 11:20 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Linux Kernel

On Wed, Aug 31 2005, Nathan Scott wrote:
> Hi Jens,
> 
> On Wed, Aug 24, 2005 at 11:28:39AM +0200, Jens Axboe wrote:
> > Ok, updated version.
> 
> One thing I found a bit awkward was the way it's putting all inodes
> in the root of the relayfs namespace, with the cpuid tacked on the
> end of the bdevname - I was a bit confused at first when a trace of
> sdd on my 4P box spontaneously created files for "partitions" sdd0,
> sdd1, sdd2, and sdd3 ;).
> 
> I suppose if many more users of relayfs spring into existence, this
> is going to get quite ugly.  Below is a patch that aligns the names
> to the conventions used in sysfs; so, for example, when running two
> traces simultaneously on /dev/sdd and /dev/sdb, instead of this:
> 
> # find /relay
> /relay
> /relay/sdd3
> /relay/sdd2
> /relay/sdd1
> /relay/sdd0
> /relay/sdb3
> /relay/sdb2
> /relay/sdb1
> /relay/sdb0
> 
> it now uses this...
> 
> # find /relay
> /relay
> /relay/block
> /relay/block/sdd
> /relay/block/sdd/trace3
> /relay/block/sdd/trace2
> /relay/block/sdd/trace1
> /relay/block/sdd/trace0
> /relay/block/sdb
> /relay/block/sdb/trace3
> /relay/block/sdb/trace2
> /relay/block/sdb/trace1
> /relay/block/sdb/trace0
> 
> and does the correct dynamic setup and teardown of the hierarchy
> as the userspace tool starts and stops tracing.  I had to modify
> the relayfs rmdir code a bit to make this work properly, I'll
> send a separate patch for that shortly.
> 
> > http://www.kernel.org/pub/linux/kernel/people/axboe/tools/blktrace.c
> > 
> > has been updated as well, the protocol version was increased to
> > accommodate the trace structure changes.
> 
> I have the associated userspace change for this, as well as several
> other fixes and tweaks for your tool - if you could slap a copyright
> and license notice onto that source (pretty please? :) I'll send 'em
> right along.

I've committed this patch. However, there's an issue with it:


> +static struct dentry *blk_create_tree(const char *blk_name)
> +{
> +	struct dentry *dir;
> +
> +	spin_lock(&blk_tree_lock);
> +	if (!blk_tree_root) {
> +		blk_tree_root = relayfs_create_dir("block", NULL);
> +		if (!blk_tree_root) {
> +			spin_unlock(&blk_tree_lock);
> +			return NULL;
> +		}
> +	}
> +	dir = relayfs_create_dir(blk_name, blk_tree_root);
> +	if (!dir)
> +		blk_remove_root();
> +	spin_unlock(&blk_tree_lock);

That doesn't look very safe; relayfs_create_dir() could block. I've
changed blk_tree_lock to be a simple mutex instead. The patch is
committed to the git repo.

-- 
Jens Axboe



Thread overview: 20+ messages
2005-08-23 12:32 [PATCH] blk queue io tracing support Jens Axboe
2005-08-24  1:03 ` Nathan Scott
2005-08-24  7:08   ` Jens Axboe
2005-08-24  7:19     ` Nathan Scott
2005-08-24  7:25       ` Jens Axboe
2005-08-24  9:28         ` Jens Axboe
2005-08-29  4:53           ` Nathan Scott
2005-08-29  5:57             ` Jens Axboe
2005-08-30 23:43           ` Nathan Scott
2005-08-31  7:31             ` Jens Axboe
2005-08-30 23:48           ` Nathan Scott
2005-08-30 23:58             ` Nathan Scott
2005-08-31  4:19               ` Tom Zanussi
2005-08-31  4:33                 ` Nathan Scott
2005-08-31  4:53                   ` Nathan Scott
2005-08-31  4:55                   ` Tom Zanussi
2005-08-31  7:33             ` Jens Axboe
2005-09-02 11:20             ` Jens Axboe
2005-08-24  6:24 ` Nathan Scott
2005-08-24  7:08   ` Jens Axboe
