linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH] convert block trace points to TRACE_EVENT()
@ 2009-05-14  6:20 Li Zefan
  2009-05-18  8:35 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Li Zefan @ 2009-05-14  6:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, KOSAKI Motohiro, LKML

TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons and problems:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

  - for large packet commands, only 16 bytes of the command will be output.
    Because TRACE_EVENT doesn't support dynamic-sized arrays, though it
    supports dynamic-sized strings.

  - a packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

  - in blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format. (see the output of getrq and rq_insert)


Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
 block/blk-core.c             |   16 +-
 block/elevator.c             |    8 +-
 drivers/md/dm.c              |    5 +-
 fs/bio.c                     |    3 +-
 include/linux/blktrace_api.h |    8 +
 include/trace/block.h        |   76 -------
 include/trace/events/block.h |  476 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/Makefile        |    5 +-
 kernel/trace/blktrace.c      |   78 +++++++-
 mm/bounce.c                  |    5 +-
 10 files changed, 576 insertions(+), 104 deletions(-)
 delete mode 100644 include/trace/block.h
 create mode 100644 include/trace/events/block.h

diff --git a/block/blk-core.c b/block/blk-core.c
index d028baf..ff50f9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -28,22 +28,14 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
-#include <trace/block.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/block.h>
 
 #include "blk.h"
 
-DEFINE_TRACE(block_plug);
-DEFINE_TRACE(block_unplug_io);
-DEFINE_TRACE(block_unplug_timer);
-DEFINE_TRACE(block_getrq);
-DEFINE_TRACE(block_sleeprq);
-DEFINE_TRACE(block_rq_requeue);
-DEFINE_TRACE(block_bio_backmerge);
-DEFINE_TRACE(block_bio_frontmerge);
-DEFINE_TRACE(block_bio_queue);
-DEFINE_TRACE(block_rq_complete);
-DEFINE_TRACE(block_remap);	/* Also used in drivers/md/dm.c */
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_remap);
+EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
 
 static int __make_request(struct request_queue *q, struct bio *bio);
 
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..e220f0c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -33,17 +33,16 @@
 #include <linux/compiler.h>
 #include <linux/delay.h>
 #include <linux/blktrace_api.h>
-#include <trace/block.h>
 #include <linux/hash.h>
 #include <linux/uaccess.h>
 
+#include <trace/events/block.h>
+
 #include "blk.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
 
-DEFINE_TRACE(block_rq_abort);
-
 /*
  * Merge hash stuff.
  */
@@ -55,9 +54,6 @@ static const int elv_hash_shift = 6;
 #define rq_hash_key(rq)		((rq)->sector + (rq)->nr_sectors)
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
-DEFINE_TRACE(block_rq_insert);
-DEFINE_TRACE(block_rq_issue);
-
 /*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e2ee4a7..3fd8b1e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -20,7 +20,8 @@
 #include <linux/idr.h>
 #include <linux/hdreg.h>
 #include <linux/blktrace_api.h>
-#include <trace/block.h>
+
+#include <trace/events/block.h>
 
 #define DM_MSG_PREFIX "core"
 
@@ -53,8 +54,6 @@ struct dm_target_io {
 	union map_info info;
 };
 
-DEFINE_TRACE(block_bio_complete);
-
 /*
  * For request-based dm.
  * One of these is allocated per request.
diff --git a/fs/bio.c b/fs/bio.c
index 9871164..740699c 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -26,10 +26,9 @@
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
 #include <linux/blktrace_api.h>
-#include <trace/block.h>
 #include <scsi/sg.h>		/* for struct sg_iovec */
 
-DEFINE_TRACE(block_split);
+#include <trace/events/block.h>
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index 82b4636..c14fb5b 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -218,5 +218,13 @@ static inline int blk_trace_init_sysfs(struct device *dev)
 
 #endif /* CONFIG_BLK_DEV_IO_TRACE */
 
+#ifdef CONFIG_EVENT_TRACING
+
+extern void __dump_pdu(char *buf, struct request *rq);
+extern void __fill_rwbs(char *rwbs, u32 rw, int bytes);
+extern void __fill_rwbs_rq(char *rwbs, struct request *rq);
+
+#endif
+
 #endif /* __KERNEL__ */
 #endif
diff --git a/include/trace/block.h b/include/trace/block.h
deleted file mode 100644
index 5b12efa..0000000
--- a/include/trace/block.h
+++ /dev/null
@@ -1,76 +0,0 @@
-#ifndef _TRACE_BLOCK_H
-#define _TRACE_BLOCK_H
-
-#include <linux/blkdev.h>
-#include <linux/tracepoint.h>
-
-DECLARE_TRACE(block_rq_abort,
-	TP_PROTO(struct request_queue *q, struct request *rq),
-	      TP_ARGS(q, rq));
-
-DECLARE_TRACE(block_rq_insert,
-	TP_PROTO(struct request_queue *q, struct request *rq),
-	      TP_ARGS(q, rq));
-
-DECLARE_TRACE(block_rq_issue,
-	TP_PROTO(struct request_queue *q, struct request *rq),
-	      TP_ARGS(q, rq));
-
-DECLARE_TRACE(block_rq_requeue,
-	TP_PROTO(struct request_queue *q, struct request *rq),
-	      TP_ARGS(q, rq));
-
-DECLARE_TRACE(block_rq_complete,
-	TP_PROTO(struct request_queue *q, struct request *rq),
-	      TP_ARGS(q, rq));
-
-DECLARE_TRACE(block_bio_bounce,
-	TP_PROTO(struct request_queue *q, struct bio *bio),
-	      TP_ARGS(q, bio));
-
-DECLARE_TRACE(block_bio_complete,
-	TP_PROTO(struct request_queue *q, struct bio *bio),
-	      TP_ARGS(q, bio));
-
-DECLARE_TRACE(block_bio_backmerge,
-	TP_PROTO(struct request_queue *q, struct bio *bio),
-	      TP_ARGS(q, bio));
-
-DECLARE_TRACE(block_bio_frontmerge,
-	TP_PROTO(struct request_queue *q, struct bio *bio),
-	      TP_ARGS(q, bio));
-
-DECLARE_TRACE(block_bio_queue,
-	TP_PROTO(struct request_queue *q, struct bio *bio),
-	      TP_ARGS(q, bio));
-
-DECLARE_TRACE(block_getrq,
-	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
-	      TP_ARGS(q, bio, rw));
-
-DECLARE_TRACE(block_sleeprq,
-	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
-	      TP_ARGS(q, bio, rw));
-
-DECLARE_TRACE(block_plug,
-	TP_PROTO(struct request_queue *q),
-	      TP_ARGS(q));
-
-DECLARE_TRACE(block_unplug_timer,
-	TP_PROTO(struct request_queue *q),
-	      TP_ARGS(q));
-
-DECLARE_TRACE(block_unplug_io,
-	TP_PROTO(struct request_queue *q),
-	      TP_ARGS(q));
-
-DECLARE_TRACE(block_split,
-	TP_PROTO(struct request_queue *q, struct bio *bio, unsigned int pdu),
-	      TP_ARGS(q, bio, pdu));
-
-DECLARE_TRACE(block_remap,
-	TP_PROTO(struct request_queue *q, struct bio *bio, dev_t dev,
-		 sector_t from),
-	      TP_ARGS(q, bio, dev, from));
-
-#endif
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
new file mode 100644
index 0000000..e9aa2b5
--- /dev/null
+++ b/include/trace/events/block.h
@@ -0,0 +1,476 @@
+#if !defined(_TRACE_BLOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_BLOCK_H
+
+#include <linux/blkdev.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM block
+
+TRACE_EVENT(block_rq_abort,
+
+	TP_PROTO(struct request_queue *q, struct request *rq),
+
+	TP_ARGS(q, rq),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__field( int,		errors			)
+		__array( char,		rwbs,	6		)
+		__array( char,		cmd,	BLK_MAX_CDB * 3	)
+	),
+
+	TP_fast_assign(
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
+		__entry->nr_sector = blk_pc_request(rq) ?
+						0 : rq->hard_nr_sectors;
+		__entry->errors    = rq->errors;
+		__fill_rwbs_rq(__entry->rwbs, rq);
+		__dump_pdu(__entry->cmd, rq);
+	),
+
+	TP_printk("%d,%d %s (%s) %llu + %u [%d]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->rwbs, __entry->cmd,
+		  __entry->sector, __entry->nr_sector, __entry->errors)
+);
+
+TRACE_EVENT(block_rq_insert,
+
+	TP_PROTO(struct request_queue *q, struct request *rq),
+
+	TP_ARGS(q, rq),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__field( unsigned int,	bytes			)
+		__array( char,		rwbs,	6		)
+		__array( char,		cmd,	BLK_MAX_CDB * 3	)
+		__array( char,          comm,   TASK_COMM_LEN   )
+	),
+
+	TP_fast_assign(
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
+		__entry->nr_sector = blk_pc_request(rq) ?
+						0 : rq->hard_nr_sectors;
+		__entry->bytes     = blk_pc_request(rq) ? rq->data_len : 0;
+		__fill_rwbs_rq(__entry->rwbs, rq);
+		__dump_pdu(__entry->cmd, rq);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->rwbs, __entry->bytes, __entry->cmd,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_rq_issue,
+
+	TP_PROTO(struct request_queue *q, struct request *rq),
+
+	TP_ARGS(q, rq),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__field( unsigned int,	bytes			)
+		__array( char,		rwbs,	6		)
+		__array( char,		cmd,	BLK_MAX_CDB * 3	)
+		__array( char,          comm,   TASK_COMM_LEN   )
+	),
+
+	TP_fast_assign(
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
+		__entry->nr_sector = blk_pc_request(rq) ?
+						0 : rq->hard_nr_sectors;
+		__entry->bytes     = blk_pc_request(rq) ? rq->data_len : 0;
+		__fill_rwbs_rq(__entry->rwbs, rq);
+		__dump_pdu(__entry->cmd, rq);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->rwbs, __entry->bytes, __entry->cmd,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_rq_requeue,
+
+	TP_PROTO(struct request_queue *q, struct request *rq),
+
+	TP_ARGS(q, rq),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__field( int,		errors			)
+		__array( char,		rwbs,	6		)
+		__array( char,		cmd,	BLK_MAX_CDB * 3	)
+	),
+
+	TP_fast_assign(
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
+		__entry->nr_sector = blk_pc_request(rq) ?
+						0 : rq->hard_nr_sectors;
+		__entry->errors	   = rq->errors;
+		__dump_pdu(__entry->cmd, rq);
+	),
+
+	TP_printk("%d,%d %s (%s) %llu + %u [%d]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->rwbs, __entry->cmd,
+		  __entry->sector, __entry->nr_sector, __entry->errors)
+);
+
+TRACE_EVENT(block_rq_complete,
+
+	TP_PROTO(struct request_queue *q, struct request *rq),
+
+	TP_ARGS(q, rq),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__field( int,		errors			)
+		__array( char,		rwbs,	6		)
+		__array( char,		cmd,	BLK_MAX_CDB * 3	)
+	),
+
+	TP_fast_assign(
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
+		__entry->nr_sector = blk_pc_request(rq) ?
+						0 : rq->hard_nr_sectors;
+		__entry->errors    = rq->errors;
+		__fill_rwbs_rq(__entry->rwbs, rq);
+		__dump_pdu(__entry->cmd, rq);
+	),
+
+	TP_printk("%d,%d %s (%s) %llu + %u [%d]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->rwbs, __entry->cmd,
+		  __entry->sector, __entry->nr_sector, __entry->errors)
+);
+TRACE_EVENT(block_bio_bounce,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio),
+
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	6		)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_bio_complete,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio),
+
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev		)
+		__field( sector_t,	sector		)
+		__field( unsigned,	nr_sector	)
+		__field( int,		error		)
+		__array( char,		rwbs,	6	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%d]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->error)
+);
+
+TRACE_EVENT(block_bio_backmerge,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio),
+
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	6		)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_bio_frontmerge,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio),
+
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned,	nr_sector		)
+		__array( char,		rwbs,	6		)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_bio_queue,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio),
+
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	6		)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_getrq,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
+
+	TP_ARGS(q, bio, rw),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	6		)
+		__array( char,		comm,	TASK_COMM_LEN	)
+        ),
+
+	TP_fast_assign(
+		__entry->dev		= bio ? bio->bi_bdev->bd_dev : 0;
+		__entry->sector		= bio ? bio->bi_sector : 0;
+		__entry->nr_sector	= bio ? bio->bi_size >> 9 : 0;
+		__fill_rwbs(__entry->rwbs,
+			    bio ? bio->bi_rw : 0, __entry->nr_sector);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+        ),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_sleeprq,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
+
+	TP_ARGS(q, bio, rw),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	6		)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio ? bio->bi_bdev->bd_dev : 0;
+		__entry->sector		= bio ? bio->bi_sector : 0;
+		__entry->nr_sector	= bio ? bio->bi_size >> 9 : 0;
+		__fill_rwbs(__entry->rwbs,
+			    bio ? bio->bi_rw : 0, __entry->nr_sector);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_plug,
+
+	TP_PROTO(struct request_queue *q),
+
+	TP_ARGS(q),
+
+	TP_STRUCT__entry(
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("[%s]", __entry->comm)
+);
+
+TRACE_EVENT(block_unplug_timer,
+
+	TP_PROTO(struct request_queue *q),
+
+	TP_ARGS(q),
+
+	TP_STRUCT__entry(
+		__field( int,		nr_rq			)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
+);
+
+TRACE_EVENT(block_unplug_io,
+
+	TP_PROTO(struct request_queue *q),
+
+	TP_ARGS(q),
+
+	TP_STRUCT__entry(
+		__field( int,		nr_rq			)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
+);
+
+TRACE_EVENT(block_split,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio,
+		 unsigned int new_sector),
+
+	TP_ARGS(q, bio, new_sector),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev				)
+		__field( sector_t,	sector				)
+		__field( sector_t,	new_sector			)
+		__array( char,		rwbs,		6		)
+		__array( char,		comm,		TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->new_sector	= new_sector;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu / %llu [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->new_sector, __entry->comm)
+);
+
+TRACE_EVENT(block_remap,
+
+	TP_PROTO(struct request_queue *q, struct bio *bio, dev_t dev,
+		 sector_t from),
+
+	TP_ARGS(q, bio, dev, from),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev		)
+		__field( sector_t,	sector		)
+		__field( unsigned int,	nr_sector	)
+		__field( dev_t,		old_dev		)
+		__field( sector_t,	old_sector	)
+		__array( char,		rwbs,	6	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		__entry->old_dev	= dev;
+		__entry->old_sector	= from;
+		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+	),
+
+	TP_printk("%d,%d %s %llu + %u <- (%d,%d) %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  __entry->sector, __entry->nr_sector,
+		  MAJOR(__entry->old_dev), MINOR(__entry->old_dev),
+		  __entry->old_sector)
+);
+
+#endif /* _TRACE_BLOCK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 848e5ce..e452519 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -45,7 +45,10 @@ obj-$(CONFIG_HW_BRANCH_TRACER) += trace_hw_branches.o
 obj-$(CONFIG_POWER_TRACER) += trace_power.o
 obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
 obj-$(CONFIG_WORKQUEUE_TRACER) += trace_workqueue.o
-obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
+obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
+ifeq ($(CONFIG_BLOCK),y)
+obj-$(CONFIG_EVENT_TRACING) += blktrace.o
+endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events.o
 obj-$(CONFIG_EVENT_TRACING) += trace_export.o
 obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 05b4747..0181bf4 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -23,10 +23,14 @@
 #include <linux/mutex.h>
 #include <linux/debugfs.h>
 #include <linux/time.h>
-#include <trace/block.h>
 #include <linux/uaccess.h>
+
+#include <trace/events/block.h>
+
 #include "trace_output.h"
 
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+
 static unsigned int blktrace_seq __read_mostly = 1;
 
 static struct trace_array *blk_tr;
@@ -1657,3 +1661,75 @@ int blk_trace_init_sysfs(struct device *dev)
 	return sysfs_create_group(&dev->kobj, &blk_trace_attr_group);
 }
 
+#endif /* CONFIG_BLK_DEV_IO_TRACE */
+
+#ifdef CONFIG_EVENT_TRACING
+
+void __dump_pdu(char *buf, struct request *rq)
+{
+	int i, end;
+	int len = min_t(int, rq->cmd_len, BLK_MAX_CDB);
+	unsigned char *cmd = rq->cmd;
+
+	if (!blk_pc_request(rq)) {
+		buf[0] = '\0';
+		return;
+	}
+
+	for (end = len - 1; end >= 0; end--)
+		if (cmd[end])
+			break;
+	end++;
+
+	for (i = 0; i < len; i++) {
+		buf += sprintf(buf, "%s%02x", i == 0 ? "" : " ", cmd[i]);
+		if (i == end && end != len - 1) {
+			sprintf(buf, " ..");
+			break;
+		}
+	}
+}
+
+void __fill_rwbs(char *rwbs, u32 rw, int bytes)
+{
+	int i = 0;
+
+	if (rw & WRITE)
+		rwbs[i++] = 'W';
+	else if (rw & 1 << BIO_RW_DISCARD)
+		rwbs[i++] = 'D';
+	else if (bytes)
+		rwbs[i++] = 'R';
+	else
+		rwbs[i++] = 'N';
+
+	if (rw & 1 << BIO_RW_AHEAD)
+		rwbs[i++] = 'A';
+	if (rw & 1 << BIO_RW_BARRIER)
+		rwbs[i++] = 'B';
+	if (rw & 1 << BIO_RW_SYNCIO)
+		rwbs[i++] = 'S';
+	if (rw & 1 << BIO_RW_META)
+		rwbs[i++] = 'M';
+
+	rwbs[i] = '\0';
+}
+
+void __fill_rwbs_rq(char *rwbs, struct request *rq)
+{
+	int rw = rq->cmd_flags & 0x03;
+	int bytes;
+
+	if (blk_discard_rq(rq))
+		rw |= (1 << BIO_RW_DISCARD);
+
+	if (blk_pc_request(rq))
+		bytes = rq->data_len;
+	else
+		bytes = rq->hard_nr_sectors << 9;
+
+	__fill_rwbs(rwbs, rw, bytes);
+}
+
+#endif /* CONFIG_EVENT_TRACING */
+
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..65f5e17 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,16 +14,15 @@
 #include <linux/hash.h>
 #include <linux/highmem.h>
 #include <linux/blktrace_api.h>
-#include <trace/block.h>
 #include <asm/tlbflush.h>
 
+#include <trace/events/block.h>
+
 #define POOL_SIZE	64
 #define ISA_POOL_SIZE	16
 
 static mempool_t *page_pool, *isa_page_pool;
 
-DEFINE_TRACE(block_bio_bounce);
-
 #ifdef CONFIG_HIGHMEM
 static __init int init_emergency_pool(void)
 {
-- 
1.5.4.rc3



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-14  6:20 [RFC][PATCH] convert block trace points to TRACE_EVENT() Li Zefan
@ 2009-05-18  8:35 ` Ingo Molnar
  2009-05-19  6:04   ` Li Zefan
  2009-05-18 13:05 ` Frederic Weisbecker
  2009-05-19 12:59 ` Jeff Moyer
  2 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2009-05-18  8:35 UTC (permalink / raw)
  To: Li Zefan
  Cc: Jens Axboe, Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, KOSAKI Motohiro, LKML


* Li Zefan <lizf@cn.fujitsu.com> wrote:

> TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
> these new capabilities to this tracepoint:
> 
>   - zero-copy and per-cpu splice() tracing
>   - binary tracing without printf overhead
>   - structured logging records exposed under /debug/tracing/events
>   - trace events embedded in function tracer output and other plugins
>   - user-defined, per tracepoint filter expressions
>   ...

Nice!

> Cons and problems:
> 
>   - no dev_t info for the output of plug, unplug_timer and unplug_io events.
>     no dev_t info for getrq and sleeprq events if bio == NULL.
>     no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

Cannot we output the numeric major:minor pairs?

>   - for large packet commands, only 16 bytes of the command will be output.
>     Because TRACE_EVENT doesn't support dynamic-sized arrays, though it
>     supports dynamic-sized strings.
>
>   - a packet command is converted to a string in TP_assign, not TP_print.
>     While blktrace do the convertion just before output.

Couldnt we do a memcpy instead of the snprintf() in __dump_pdu()? We 
dont actually interpret the bytes there. We could extend the 
in-kernel printk format with a 'dump raw memory in hex' type of 
format specifier.

OTOH, packet requests are rather rare, right? So going to ASCII 
there results in a simpler interface. In the !blk_pc_request(rq) 
common case we just return early without any snprintf overhead.

>   - in blktrace, an event can have 2 different print formats, but 
>     a TRACE_EVENT has a unique format. (see the output of getrq 
>     and rq_insert)

Is this a problem?

I think a good way forward would be to benchmark the ioctl versus 
the splice based TRACE_EVENT tracing (via some artificially high 
rate event, to push things), and see where we are right now in terms 
of overhead.

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-14  6:20 [RFC][PATCH] convert block trace points to TRACE_EVENT() Li Zefan
  2009-05-18  8:35 ` Ingo Molnar
@ 2009-05-18 13:05 ` Frederic Weisbecker
  2009-05-19  6:11   ` Li Zefan
  2009-05-19 12:59 ` Jeff Moyer
  2 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2009-05-18 13:05 UTC (permalink / raw)
  To: Li Zefan
  Cc: Ingo Molnar, Jens Axboe, Steven Rostedt, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, KOSAKI Motohiro, LKML

On Thu, May 14, 2009 at 02:20:03PM +0800, Li Zefan wrote:
> TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
> these new capabilities to this tracepoint:
> 
>   - zero-copy and per-cpu splice() tracing
>   - binary tracing without printf overhead
>   - structured logging records exposed under /debug/tracing/events
>   - trace events embedded in function tracer output and other plugins
>   - user-defined, per tracepoint filter expressions
>   ...
> 
> Cons and problems:
> 
>   - no dev_t info for the output of plug, unplug_timer and unplug_io events.
>     no dev_t info for getrq and sleeprq events if bio == NULL.
>     no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
> 
>   - for large packet commands, only 16 bytes of the command will be output.
>     Because TRACE_EVENT doesn't support dynamic-sized arrays, though it
>     supports dynamic-sized strings.
> 
>   - a packet command is converted to a string in TP_assign, not TP_print.
>     While blktrace do the convertion just before output.
> 
>   - in blktrace, an event can have 2 different print formats, but a TRACE_EVENT
>     has a unique format. (see the output of getrq and rq_insert)



I'm starting to think it would be nice to choose between several outputs
in a trace event.
Ie: perhaps we need a kind of per event flag, something simple, just to
choose between several TP_printk output. Not sure how much it would
(non) trivial to implement though...

Frederic.


 
> 
> Following are some comparisons between TRACE_EVENT and blktrace:
> 
> plug:
>   kjournald-480   [000]   303.084981: block_plug: [kjournald]
>   kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
> 
> unplug_io:
>   kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
>   kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
> 
> remap:
>   kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
>   kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
> 
> bio_backmerge:
>   kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
>   kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
> 
> getrq:
>   kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
>   kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
> 
>   bash-2066  [001]  1072.953770:   8,0    G   N [bash]
>   bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
> 
> rq_complete:
>   konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
>   konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
> 
> rq_insert:
>   kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
>   kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
> 
>   ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
>   ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
> 
> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> ---
>  block/blk-core.c             |   16 +-
>  block/elevator.c             |    8 +-
>  drivers/md/dm.c              |    5 +-
>  fs/bio.c                     |    3 +-
>  include/linux/blktrace_api.h |    8 +
>  include/trace/block.h        |   76 -------
>  include/trace/events/block.h |  476 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/trace/Makefile        |    5 +-
>  kernel/trace/blktrace.c      |   78 +++++++-
>  mm/bounce.c                  |    5 +-
>  10 files changed, 576 insertions(+), 104 deletions(-)
>  delete mode 100644 include/trace/block.h
>  create mode 100644 include/trace/events/block.h
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d028baf..ff50f9d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -28,22 +28,14 @@
>  #include <linux/task_io_accounting_ops.h>
>  #include <linux/blktrace_api.h>
>  #include <linux/fault-inject.h>
> -#include <trace/block.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/block.h>
>  
>  #include "blk.h"
>  
> -DEFINE_TRACE(block_plug);
> -DEFINE_TRACE(block_unplug_io);
> -DEFINE_TRACE(block_unplug_timer);
> -DEFINE_TRACE(block_getrq);
> -DEFINE_TRACE(block_sleeprq);
> -DEFINE_TRACE(block_rq_requeue);
> -DEFINE_TRACE(block_bio_backmerge);
> -DEFINE_TRACE(block_bio_frontmerge);
> -DEFINE_TRACE(block_bio_queue);
> -DEFINE_TRACE(block_rq_complete);
> -DEFINE_TRACE(block_remap);	/* Also used in drivers/md/dm.c */
>  EXPORT_TRACEPOINT_SYMBOL_GPL(block_remap);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
>  
>  static int __make_request(struct request_queue *q, struct bio *bio);
>  
> diff --git a/block/elevator.c b/block/elevator.c
> index 7073a90..e220f0c 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -33,17 +33,16 @@
>  #include <linux/compiler.h>
>  #include <linux/delay.h>
>  #include <linux/blktrace_api.h>
> -#include <trace/block.h>
>  #include <linux/hash.h>
>  #include <linux/uaccess.h>
>  
> +#include <trace/events/block.h>
> +
>  #include "blk.h"
>  
>  static DEFINE_SPINLOCK(elv_list_lock);
>  static LIST_HEAD(elv_list);
>  
> -DEFINE_TRACE(block_rq_abort);
> -
>  /*
>   * Merge hash stuff.
>   */
> @@ -55,9 +54,6 @@ static const int elv_hash_shift = 6;
>  #define rq_hash_key(rq)		((rq)->sector + (rq)->nr_sectors)
>  #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
>  
> -DEFINE_TRACE(block_rq_insert);
> -DEFINE_TRACE(block_rq_issue);
> -
>  /*
>   * Query io scheduler to see if the current process issuing bio may be
>   * merged with rq.
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index e2ee4a7..3fd8b1e 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -20,7 +20,8 @@
>  #include <linux/idr.h>
>  #include <linux/hdreg.h>
>  #include <linux/blktrace_api.h>
> -#include <trace/block.h>
> +
> +#include <trace/events/block.h>
>  
>  #define DM_MSG_PREFIX "core"
>  
> @@ -53,8 +54,6 @@ struct dm_target_io {
>  	union map_info info;
>  };
>  
> -DEFINE_TRACE(block_bio_complete);
> -
>  /*
>   * For request-based dm.
>   * One of these is allocated per request.
> diff --git a/fs/bio.c b/fs/bio.c
> index 9871164..740699c 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -26,10 +26,9 @@
>  #include <linux/mempool.h>
>  #include <linux/workqueue.h>
>  #include <linux/blktrace_api.h>
> -#include <trace/block.h>
>  #include <scsi/sg.h>		/* for struct sg_iovec */
>  
> -DEFINE_TRACE(block_split);
> +#include <trace/events/block.h>
>  
>  /*
>   * Test patch to inline a certain number of bi_io_vec's inside the bio
> diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
> index 82b4636..c14fb5b 100644
> --- a/include/linux/blktrace_api.h
> +++ b/include/linux/blktrace_api.h
> @@ -218,5 +218,13 @@ static inline int blk_trace_init_sysfs(struct device *dev)
>  
>  #endif /* CONFIG_BLK_DEV_IO_TRACE */
>  
> +#ifdef CONFIG_EVENT_TRACING
> +
> +extern void __dump_pdu(char *buf, struct request *rq);
> +extern void __fill_rwbs(char *rwbs, u32 rw, int bytes);
> +extern void __fill_rwbs_rq(char *rwbs, struct request *rq);
> +
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif
> diff --git a/include/trace/block.h b/include/trace/block.h
> deleted file mode 100644
> index 5b12efa..0000000
> --- a/include/trace/block.h
> +++ /dev/null
> @@ -1,76 +0,0 @@
> -#ifndef _TRACE_BLOCK_H
> -#define _TRACE_BLOCK_H
> -
> -#include <linux/blkdev.h>
> -#include <linux/tracepoint.h>
> -
> -DECLARE_TRACE(block_rq_abort,
> -	TP_PROTO(struct request_queue *q, struct request *rq),
> -	      TP_ARGS(q, rq));
> -
> -DECLARE_TRACE(block_rq_insert,
> -	TP_PROTO(struct request_queue *q, struct request *rq),
> -	      TP_ARGS(q, rq));
> -
> -DECLARE_TRACE(block_rq_issue,
> -	TP_PROTO(struct request_queue *q, struct request *rq),
> -	      TP_ARGS(q, rq));
> -
> -DECLARE_TRACE(block_rq_requeue,
> -	TP_PROTO(struct request_queue *q, struct request *rq),
> -	      TP_ARGS(q, rq));
> -
> -DECLARE_TRACE(block_rq_complete,
> -	TP_PROTO(struct request_queue *q, struct request *rq),
> -	      TP_ARGS(q, rq));
> -
> -DECLARE_TRACE(block_bio_bounce,
> -	TP_PROTO(struct request_queue *q, struct bio *bio),
> -	      TP_ARGS(q, bio));
> -
> -DECLARE_TRACE(block_bio_complete,
> -	TP_PROTO(struct request_queue *q, struct bio *bio),
> -	      TP_ARGS(q, bio));
> -
> -DECLARE_TRACE(block_bio_backmerge,
> -	TP_PROTO(struct request_queue *q, struct bio *bio),
> -	      TP_ARGS(q, bio));
> -
> -DECLARE_TRACE(block_bio_frontmerge,
> -	TP_PROTO(struct request_queue *q, struct bio *bio),
> -	      TP_ARGS(q, bio));
> -
> -DECLARE_TRACE(block_bio_queue,
> -	TP_PROTO(struct request_queue *q, struct bio *bio),
> -	      TP_ARGS(q, bio));
> -
> -DECLARE_TRACE(block_getrq,
> -	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
> -	      TP_ARGS(q, bio, rw));
> -
> -DECLARE_TRACE(block_sleeprq,
> -	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
> -	      TP_ARGS(q, bio, rw));
> -
> -DECLARE_TRACE(block_plug,
> -	TP_PROTO(struct request_queue *q),
> -	      TP_ARGS(q));
> -
> -DECLARE_TRACE(block_unplug_timer,
> -	TP_PROTO(struct request_queue *q),
> -	      TP_ARGS(q));
> -
> -DECLARE_TRACE(block_unplug_io,
> -	TP_PROTO(struct request_queue *q),
> -	      TP_ARGS(q));
> -
> -DECLARE_TRACE(block_split,
> -	TP_PROTO(struct request_queue *q, struct bio *bio, unsigned int pdu),
> -	      TP_ARGS(q, bio, pdu));
> -
> -DECLARE_TRACE(block_remap,
> -	TP_PROTO(struct request_queue *q, struct bio *bio, dev_t dev,
> -		 sector_t from),
> -	      TP_ARGS(q, bio, dev, from));
> -
> -#endif
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> new file mode 100644
> index 0000000..e9aa2b5
> --- /dev/null
> +++ b/include/trace/events/block.h
> @@ -0,0 +1,476 @@
> +#if !defined(_TRACE_BLOCK_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_BLOCK_H
> +
> +#include <linux/blkdev.h>
> +#include <linux/tracepoint.h>
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM block
> +
> +TRACE_EVENT(block_rq_abort,
> +
> +	TP_PROTO(struct request_queue *q, struct request *rq),
> +
> +	TP_ARGS(q, rq),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__field( int,		errors			)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		cmd,	BLK_MAX_CDB * 3	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
> +		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
> +		__entry->nr_sector = blk_pc_request(rq) ?
> +						0 : rq->hard_nr_sectors;
> +		__entry->errors    = rq->errors;
> +		__fill_rwbs_rq(__entry->rwbs, rq);
> +		__dump_pdu(__entry->cmd, rq);
> +	),
> +
> +	TP_printk("%d,%d %s (%s) %llu + %u [%d]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->rwbs, __entry->cmd,
> +		  __entry->sector, __entry->nr_sector, __entry->errors)
> +);
> +
> +TRACE_EVENT(block_rq_insert,
> +
> +	TP_PROTO(struct request_queue *q, struct request *rq),
> +
> +	TP_ARGS(q, rq),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__field( unsigned int,	bytes			)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		cmd,	BLK_MAX_CDB * 3	)
> +		__array( char,          comm,   TASK_COMM_LEN   )
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
> +		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
> +		__entry->nr_sector = blk_pc_request(rq) ?
> +						0 : rq->hard_nr_sectors;
> +		__entry->bytes     = blk_pc_request(rq) ? rq->data_len : 0;
> +		__fill_rwbs_rq(__entry->rwbs, rq);
> +		__dump_pdu(__entry->cmd, rq);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->rwbs, __entry->bytes, __entry->cmd,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_rq_issue,
> +
> +	TP_PROTO(struct request_queue *q, struct request *rq),
> +
> +	TP_ARGS(q, rq),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__field( unsigned int,	bytes			)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		cmd,	BLK_MAX_CDB * 3	)
> +		__array( char,          comm,   TASK_COMM_LEN   )
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
> +		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
> +		__entry->nr_sector = blk_pc_request(rq) ?
> +						0 : rq->hard_nr_sectors;
> +		__entry->bytes     = blk_pc_request(rq) ? rq->data_len : 0;
> +		__fill_rwbs_rq(__entry->rwbs, rq);
> +		__dump_pdu(__entry->cmd, rq);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %u (%s) %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->rwbs, __entry->bytes, __entry->cmd,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_rq_requeue,
> +
> +	TP_PROTO(struct request_queue *q, struct request *rq),
> +
> +	TP_ARGS(q, rq),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__field( int,		errors			)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		cmd,	BLK_MAX_CDB * 3	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
> +		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
> +		__entry->nr_sector = blk_pc_request(rq) ?
> +						0 : rq->hard_nr_sectors;
> +		__entry->errors	   = rq->errors;
> +		__dump_pdu(__entry->cmd, rq);
> +	),
> +
> +	TP_printk("%d,%d %s (%s) %llu + %u [%d]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->rwbs, __entry->cmd,
> +		  __entry->sector, __entry->nr_sector, __entry->errors)
> +);
> +
> +TRACE_EVENT(block_rq_complete,
> +
> +	TP_PROTO(struct request_queue *q, struct request *rq),
> +
> +	TP_ARGS(q, rq),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__field( int,		errors			)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		cmd,	BLK_MAX_CDB * 3	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
> +		__entry->sector    = blk_pc_request(rq) ? 0 : rq->hard_sector;
> +		__entry->nr_sector = blk_pc_request(rq) ?
> +						0 : rq->hard_nr_sectors;
> +		__entry->errors    = rq->errors;
> +		__fill_rwbs_rq(__entry->rwbs, rq);
> +		__dump_pdu(__entry->cmd, rq);
> +	),
> +
> +	TP_printk("%d,%d %s (%s) %llu + %u [%d]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->rwbs, __entry->cmd,
> +		  __entry->sector, __entry->nr_sector, __entry->errors)
> +);
> +TRACE_EVENT(block_bio_bounce,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio),
> +
> +	TP_ARGS(q, bio),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->nr_sector	= bio->bi_size >> 9;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_bio_complete,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio),
> +
> +	TP_ARGS(q, bio),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev		)
> +		__field( sector_t,	sector		)
> +		__field( unsigned,	nr_sector	)
> +		__field( int,		error		)
> +		__array( char,		rwbs,	6	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->nr_sector	= bio->bi_size >> 9;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u [%d]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->error)
> +);
> +
> +TRACE_EVENT(block_bio_backmerge,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio),
> +
> +	TP_ARGS(q, bio),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->nr_sector	= bio->bi_size >> 9;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_bio_frontmerge,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio),
> +
> +	TP_ARGS(q, bio),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned,	nr_sector		)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->nr_sector	= bio->bi_size >> 9;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_bio_queue,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio),
> +
> +	TP_ARGS(q, bio),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->nr_sector	= bio->bi_size >> 9;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_getrq,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
> +
> +	TP_ARGS(q, bio, rw),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +        ),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio ? bio->bi_bdev->bd_dev : 0;
> +		__entry->sector		= bio ? bio->bi_sector : 0;
> +		__entry->nr_sector	= bio ? bio->bi_size >> 9 : 0;
> +		__fill_rwbs(__entry->rwbs,
> +			    bio ? bio->bi_rw : 0, __entry->nr_sector);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +        ),
> +
> +	TP_printk("%d,%d %s %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_sleeprq,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio, int rw),
> +
> +	TP_ARGS(q, bio, rw),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( sector_t,	sector			)
> +		__field( unsigned int,	nr_sector		)
> +		__array( char,		rwbs,	6		)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio ? bio->bi_bdev->bd_dev : 0;
> +		__entry->sector		= bio ? bio->bi_sector : 0;
> +		__entry->nr_sector	= bio ? bio->bi_size >> 9 : 0;
> +		__fill_rwbs(__entry->rwbs,
> +			    bio ? bio->bi_rw : 0, __entry->nr_sector);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_plug,
> +
> +	TP_PROTO(struct request_queue *q),
> +
> +	TP_ARGS(q),
> +
> +	TP_STRUCT__entry(
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("[%s]", __entry->comm)
> +);
> +
> +TRACE_EVENT(block_unplug_timer,
> +
> +	TP_PROTO(struct request_queue *q),
> +
> +	TP_ARGS(q),
> +
> +	TP_STRUCT__entry(
> +		__field( int,		nr_rq			)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
> +);
> +
> +TRACE_EVENT(block_unplug_io,
> +
> +	TP_PROTO(struct request_queue *q),
> +
> +	TP_ARGS(q),
> +
> +	TP_STRUCT__entry(
> +		__field( int,		nr_rq			)
> +		__array( char,		comm,	TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nr_rq	= q->rq.count[READ] + q->rq.count[WRITE];
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
> +);
> +
> +TRACE_EVENT(block_split,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio,
> +		 unsigned int new_sector),
> +
> +	TP_ARGS(q, bio, new_sector),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev				)
> +		__field( sector_t,	sector				)
> +		__field( sector_t,	new_sector			)
> +		__array( char,		rwbs,		6		)
> +		__array( char,		comm,		TASK_COMM_LEN	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->new_sector	= new_sector;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> +	),
> +
> +	TP_printk("%d,%d %s %llu / %llu [%s]",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->new_sector, __entry->comm)
> +);
> +
> +TRACE_EVENT(block_remap,
> +
> +	TP_PROTO(struct request_queue *q, struct bio *bio, dev_t dev,
> +		 sector_t from),
> +
> +	TP_ARGS(q, bio, dev, from),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev		)
> +		__field( sector_t,	sector		)
> +		__field( unsigned int,	nr_sector	)
> +		__field( dev_t,		old_dev		)
> +		__field( sector_t,	old_sector	)
> +		__array( char,		rwbs,	6	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= bio->bi_bdev->bd_dev;
> +		__entry->sector		= bio->bi_sector;
> +		__entry->nr_sector	= bio->bi_size >> 9;
> +		__entry->old_dev	= dev;
> +		__entry->old_sector	= from;
> +		__fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
> +	),
> +
> +	TP_printk("%d,%d %s %llu + %u <- (%d,%d) %llu",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
> +		  __entry->sector, __entry->nr_sector,
> +		  MAJOR(__entry->old_dev), MINOR(__entry->old_dev),
> +		  __entry->old_sector)
> +);
> +
> +#endif /* _TRACE_BLOCK_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> +
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 848e5ce..e452519 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -45,7 +45,10 @@ obj-$(CONFIG_HW_BRANCH_TRACER) += trace_hw_branches.o
>  obj-$(CONFIG_POWER_TRACER) += trace_power.o
>  obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
>  obj-$(CONFIG_WORKQUEUE_TRACER) += trace_workqueue.o
> -obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
> +obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> +ifeq ($(CONFIG_BLOCK),y)
> +obj-$(CONFIG_EVENT_TRACING) += blktrace.o
> +endif
>  obj-$(CONFIG_EVENT_TRACING) += trace_events.o
>  obj-$(CONFIG_EVENT_TRACING) += trace_export.o
>  obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 05b4747..0181bf4 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -23,10 +23,14 @@
>  #include <linux/mutex.h>
>  #include <linux/debugfs.h>
>  #include <linux/time.h>
> -#include <trace/block.h>
>  #include <linux/uaccess.h>
> +
> +#include <trace/events/block.h>
> +
>  #include "trace_output.h"
>  
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +
>  static unsigned int blktrace_seq __read_mostly = 1;
>  
>  static struct trace_array *blk_tr;
> @@ -1657,3 +1661,75 @@ int blk_trace_init_sysfs(struct device *dev)
>  	return sysfs_create_group(&dev->kobj, &blk_trace_attr_group);
>  }
>  
> +#endif /* CONFIG_BLK_DEV_IO_TRACE */
> +
> +#ifdef CONFIG_EVENT_TRACING
> +
> +void __dump_pdu(char *buf, struct request *rq)
> +{
> +	int i, end;
> +	int len = min_t(int, rq->cmd_len, BLK_MAX_CDB);
> +	unsigned char *cmd = rq->cmd;
> +
> +	if (!blk_pc_request(rq)) {
> +		buf[0] = '\0';
> +		return;
> +	}
> +
> +	for (end = len - 1; end >= 0; end--)
> +		if (cmd[end])
> +			break;
> +	end++;
> +
> +	for (i = 0; i < len; i++) {
> +		buf += sprintf(buf, "%s%02x", i == 0 ? "" : " ", cmd[i]);
> +		if (i == end && end != len - 1) {
> +			sprintf(buf, " ..");
> +			break;
> +		}
> +	}
> +}
> +
> +void __fill_rwbs(char *rwbs, u32 rw, int bytes)
> +{
> +	int i = 0;
> +
> +	if (rw & WRITE)
> +		rwbs[i++] = 'W';
> +	else if (rw & 1 << BIO_RW_DISCARD)
> +		rwbs[i++] = 'D';
> +	else if (bytes)
> +		rwbs[i++] = 'R';
> +	else
> +		rwbs[i++] = 'N';
> +
> +	if (rw & 1 << BIO_RW_AHEAD)
> +		rwbs[i++] = 'A';
> +	if (rw & 1 << BIO_RW_BARRIER)
> +		rwbs[i++] = 'B';
> +	if (rw & 1 << BIO_RW_SYNCIO)
> +		rwbs[i++] = 'S';
> +	if (rw & 1 << BIO_RW_META)
> +		rwbs[i++] = 'M';
> +
> +	rwbs[i] = '\0';
> +}
> +
> +void __fill_rwbs_rq(char *rwbs, struct request *rq)
> +{
> +	int rw = rq->cmd_flags & 0x03;
> +	int bytes;
> +
> +	if (blk_discard_rq(rq))
> +		rw |= (1 << BIO_RW_DISCARD);
> +
> +	if (blk_pc_request(rq))
> +		bytes = rq->data_len;
> +	else
> +		bytes = rq->hard_nr_sectors << 9;
> +
> +	__fill_rwbs(rwbs, rw, bytes);
> +}
> +
> +#endif /* CONFIG_EVENT_TRACING */
> +
> diff --git a/mm/bounce.c b/mm/bounce.c
> index e590272..65f5e17 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -14,16 +14,15 @@
>  #include <linux/hash.h>
>  #include <linux/highmem.h>
>  #include <linux/blktrace_api.h>
> -#include <trace/block.h>
>  #include <asm/tlbflush.h>
>  
> +#include <trace/events/block.h>
> +
>  #define POOL_SIZE	64
>  #define ISA_POOL_SIZE	16
>  
>  static mempool_t *page_pool, *isa_page_pool;
>  
> -DEFINE_TRACE(block_bio_bounce);
> -
>  #ifdef CONFIG_HIGHMEM
>  static __init int init_emergency_pool(void)
>  {
> -- 
> 1.5.4.rc3
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-18  8:35 ` Ingo Molnar
@ 2009-05-19  6:04   ` Li Zefan
  0 siblings, 0 replies; 14+ messages in thread
From: Li Zefan @ 2009-05-19  6:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, KOSAKI Motohiro, LKML

Ingo Molnar wrote:
> * Li Zefan <lizf@cn.fujitsu.com> wrote:
> 
>> TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
>> these new capabilities to this tracepoint:
>>
>>   - zero-copy and per-cpu splice() tracing
>>   - binary tracing without printf overhead
>>   - structured logging records exposed under /debug/tracing/events
>>   - trace events embedded in function tracer output and other plugins
>>   - user-defined, per tracepoint filter expressions
>>   ...
> 
> Nice!
> 
>> Cons and problems:
>>
>>   - no dev_t info for the output of plug, unplug_timer and unplug_io events.
>>     no dev_t info for getrq and sleeprq events if bio == NULL.
>>     no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
> 
> Cannot we output the numeric major:minor pairs?
> 

No, we can't.

Take plug tracepoint for example, the only argument is a struct request_queue,
but we can't map from a queue to a device, since there is no 1:1 mapping.

That's why blktrace adds dev_t info in struct blk_trace, which is associated
to a queue.

>>   - for large packet commands, only 16 bytes of the command will be output.
>>     Because TRACE_EVENT doesn't support dynamic-sized arrays, though it
>>     supports dynamic-sized strings.
>>
>>   - a packet command is converted to a string in TP_assign, not TP_print.
>>     While blktrace do the convertion just before output.
> 
> Couldnt we do a memcpy instead of the snprintf() in __dump_pdu()? We 
> dont actually interpret the bytes there. We could extend the 
> in-kernel printk format with a 'dump raw memory in hex' type of 
> format specifier.
> 

Sure, it's do-able. The disavantage is then we can't do filtering
on __entry->cmd, because now it's unsigned char[], not a string.

> OTOH, packet requests are rather rare, right? So going to ASCII 
> there results in a simpler interface. In the !blk_pc_request(rq) 
> common case we just return early without any snprintf overhead.
> 

Right. :)

>>   - in blktrace, an event can have 2 different print formats, but 
>>     a TRACE_EVENT has a unique format. (see the output of getrq 
>>     and rq_insert)
> 
> Is this a problem?
> 

One of the defect is, we have __entry->cmd[] even it's not used
if !blk_pc_request(rq).

This can be avoided though, by using __string() (needs small modification)
instead of __array().

> I think a good way forward would be to benchmark the ioctl versus 
> the splice based TRACE_EVENT tracing (via some artificially high 
> rate event, to push things), and see where we are right now in terms 
> of overhead.
> 

I'll try.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-18 13:05 ` Frederic Weisbecker
@ 2009-05-19  6:11   ` Li Zefan
  0 siblings, 0 replies; 14+ messages in thread
From: Li Zefan @ 2009-05-19  6:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Jens Axboe, Steven Rostedt, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, KOSAKI Motohiro, LKML

Frederic Weisbecker wrote:
> On Thu, May 14, 2009 at 02:20:03PM +0800, Li Zefan wrote:
>> TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
>> these new capabilities to this tracepoint:
>>
>>   - zero-copy and per-cpu splice() tracing
>>   - binary tracing without printf overhead
>>   - structured logging records exposed under /debug/tracing/events
>>   - trace events embedded in function tracer output and other plugins
>>   - user-defined, per tracepoint filter expressions
>>   ...
>>
>> Cons and problems:
>>
>>   - no dev_t info for the output of plug, unplug_timer and unplug_io events.
>>     no dev_t info for getrq and sleeprq events if bio == NULL.
>>     no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
>>
>>   - for large packet commands, only 16 bytes of the command will be output.
>>     Because TRACE_EVENT doesn't support dynamic-sized arrays, though it
>>     supports dynamic-sized strings.
>>
>>   - a packet command is converted to a string in TP_assign, not TP_print.
>>     While blktrace do the convertion just before output.
>>
>>   - in blktrace, an event can have 2 different print formats, but a TRACE_EVENT
>>     has a unique format. (see the output of getrq and rq_insert)
> 
> I'm starting to think it would be nice to choose between several outputs
> in a trace event.
> Ie: perhaps we need a kind of per event flag, something simple, just to
> choose between several TP_printk output. Not sure how much it would
> (non) trivial to implement though...
> 

If a trace event wants several TP_printk output, probably it wants different
structs of trace entry, otherwise we'll be wasting memory.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-14  6:20 [RFC][PATCH] convert block trace points to TRACE_EVENT() Li Zefan
  2009-05-18  8:35 ` Ingo Molnar
  2009-05-18 13:05 ` Frederic Weisbecker
@ 2009-05-19 12:59 ` Jeff Moyer
  2009-05-19 13:08   ` Christoph Hellwig
  2 siblings, 1 reply; 14+ messages in thread
From: Jeff Moyer @ 2009-05-19 12:59 UTC (permalink / raw)
  To: Li Zefan
  Cc: Ingo Molnar, Jens Axboe, Steven Rostedt, Frederic Weisbecker,
	Tom Zanussi, Theodore Ts'o, Steven Whitehouse,
	KOSAKI Motohiro, LKML

Li Zefan <lizf@cn.fujitsu.com> writes:

> TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
> these new capabilities to this tracepoint:

>   - zero-copy and per-cpu splice() tracing
>   - binary tracing without printf overhead
>   - structured logging records exposed under /debug/tracing/events
>   - trace events embedded in function tracer output and other plugins
>   - user-defined, per tracepoint filter expressions
>   ...
>
> Cons and problems:
>
>   - no dev_t info for the output of plug, unplug_timer and unplug_io events.

I'm not sure I'm comfortable with this.  Having had to diagnose
performance problems using blktrace, I find the dev_t information for
plugs and unplugs especially interesting.  Maybe we can still figure
this out given the context in the trace, though.  Can you comment on
that?

>     no dev_t info for getrq and sleeprq events if bio == NULL.
>     no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
>
>   - for large packet commands, only 16 bytes of the command will be output.
>     Because TRACE_EVENT doesn't support dynamic-sized arrays, though it
>     supports dynamic-sized strings.
>
>   - a packet command is converted to a string in TP_assign, not TP_print.
>     While blktrace do the convertion just before output.
>
>   - in blktrace, an event can have 2 different print formats, but a TRACE_EVENT
>     has a unique format. (see the output of getrq and rq_insert)
>
>
> Following are some comparisons between TRACE_EVENT and blktrace:
>
> plug:
>   kjournald-480   [000]   303.084981: block_plug: [kjournald]
>   kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

I'm not a big fan of having to change my scripts!  Why did you switch
from the single letter output to the verbose strings?  Am I just being a
curmudgeon?

I have one further question, which will expose my ignorance on this
implementation.  Would the proposed patch change the information
available to the blktrace utility, or is this just for users such as
systemtap?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-19 12:59 ` Jeff Moyer
@ 2009-05-19 13:08   ` Christoph Hellwig
  2009-05-19 15:49     ` FUJITA Tomonori
  2009-05-20  8:38     ` Li Zefan
  0 siblings, 2 replies; 14+ messages in thread
From: Christoph Hellwig @ 2009-05-19 13:08 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Li Zefan, Ingo Molnar, Jens Axboe, Steven Rostedt,
	Frederic Weisbecker, Tom Zanussi, Theodore Ts'o,
	Steven Whitehouse, KOSAKI Motohiro, LKML

On Tue, May 19, 2009 at 08:59:29AM -0400, Jeff Moyer wrote:
> I'm not sure I'm comfortable with this.  Having had to diagnose
> performance problems using blktrace, I find the dev_t information for
> plugs and unplugs especially interesting.  Maybe we can still figure
> this out given the context in the trace, though.  Can you comment on
> that?

The problem is that right now we still allow request queues shared by
multiple gendisks, so we can't get at the gendisk.  The blktrace ioctl
interface gets around that by doing the setup with an ioctl on the
block device node and storing the dev_t at that point.

But I don't think we actually have any drivers left sharing a queue
that way, and I'd love to get rid of that for various reasons.

> from the single letter output to the verbose strings?  Am I just being a
> curmudgeon?
> 
> I have one further question, which will expose my ignorance on this
> implementation.  Would the proposed patch change the information
> available to the blktrace utility,

That's the output from /sys/kernel/debug/tracing/trace when the
block events are enabled.  The blktrace utility should continue to
work without changes, although it could be ported to use the ftrace
binary ring buffer directly in the future.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-19 13:08   ` Christoph Hellwig
@ 2009-05-19 15:49     ` FUJITA Tomonori
  2009-05-19 17:33       ` Jens Axboe
  2009-05-20  8:38     ` Li Zefan
  1 sibling, 1 reply; 14+ messages in thread
From: FUJITA Tomonori @ 2009-05-19 15:49 UTC (permalink / raw)
  To: hch
  Cc: jmoyer, lizf, mingo, jens.axboe, rostedt, fweisbec, tzanussi,
	tytso, swhiteho, kosaki.motohiro, linux-kernel

On Tue, 19 May 2009 09:08:41 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, May 19, 2009 at 08:59:29AM -0400, Jeff Moyer wrote:
> > I'm not sure I'm comfortable with this.  Having had to diagnose
> > performance problems using blktrace, I find the dev_t information for
> > plugs and unplugs especially interesting.  Maybe we can still figure
> > this out given the context in the trace, though.  Can you comment on
> > that?
> 
> The problem is that right now we still allow request queues shared by
> multiple gendisks, so we can't get at the gendisk.  The blktrace ioctl
> interface gets around that by doing the setup with an ioctl on the
> block device node and storing the dev_t at that point.
> 
> But I don't think we actually have any drivers left sharing a queue
> that way, and I'd love to get rid of that for various reasons.

There are still some (e.g. drivers/block/floppy.c)? I though that we
hit this when we tried to add the command filter feature though I
might be wrong (I can't recall).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-19 15:49     ` FUJITA Tomonori
@ 2009-05-19 17:33       ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2009-05-19 17:33 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: hch, jmoyer, lizf, mingo, rostedt, fweisbec, tzanussi, tytso,
	swhiteho, kosaki.motohiro, linux-kernel

On Wed, May 20 2009, FUJITA Tomonori wrote:
> On Tue, 19 May 2009 09:08:41 -0400
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Tue, May 19, 2009 at 08:59:29AM -0400, Jeff Moyer wrote:
> > > I'm not sure I'm comfortable with this.  Having had to diagnose
> > > performance problems using blktrace, I find the dev_t information for
> > > plugs and unplugs especially interesting.  Maybe we can still figure
> > > this out given the context in the trace, though.  Can you comment on
> > > that?
> > 
> > The problem is that right now we still allow request queues shared by
> > multiple gendisks, so we can't get at the gendisk.  The blktrace ioctl
> > interface gets around that by doing the setup with an ioctl on the
> > block device node and storing the dev_t at that point.
> > 
> > But I don't think we actually have any drivers left sharing a queue
> > that way, and I'd love to get rid of that for various reasons.
> 
> There are still some (e.g. drivers/block/floppy.c)? I though that we
> hit this when we tried to add the command filter feature though I
> might be wrong (I can't recall).

And mtd, iirc. But only a few. I'll fiddle up a patch to finally get rid
of this, it has been pending for... years.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-19 13:08   ` Christoph Hellwig
  2009-05-19 15:49     ` FUJITA Tomonori
@ 2009-05-20  8:38     ` Li Zefan
  2009-05-23 12:40       ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Li Zefan @ 2009-05-20  8:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Ingo Molnar, Jens Axboe, Steven Rostedt,
	Frederic Weisbecker, Tom Zanussi, Theodore Ts'o,
	Steven Whitehouse, KOSAKI Motohiro, LKML

Christoph Hellwig wrote:
> On Tue, May 19, 2009 at 08:59:29AM -0400, Jeff Moyer wrote:
>> I'm not sure I'm comfortable with this.  Having had to diagnose
>> performance problems using blktrace, I find the dev_t information for
>> plugs and unplugs especially interesting.  Maybe we can still figure
>> this out given the context in the trace, though.  Can you comment on
>> that?
> 
> The problem is that right now we still allow request queues shared by
> multiple gendisks, so we can't get at the gendisk.  The blktrace ioctl
> interface gets around that by doing the setup with an ioctl on the
> block device node and storing the dev_t at that point.
> 
> But I don't think we actually have any drivers left sharing a queue
> that way, and I'd love to get rid of that for various reasons.
> 

It will be great if this can be done.

>> from the single letter output to the verbose strings?  Am I just being a
>> curmudgeon?
>>
>> I have one further question, which will expose my ignorance on this
>> implementation.  Would the proposed patch change the information
>> available to the blktrace utility,
> 
> That's the output from /sys/kernel/debug/tracing/trace when the
> block events are enabled.  The blktrace utility should continue to
> work without changes, although it could be ported to use the ftrace
> binary ring buffer directly in the future.
> 

Yes, this patch doesn't affect ioctl-based blktrace or ftrace-plugin
blktrace at all.

To use blktracer:
  # echo 1 > /sys/block/sda/trace/enable
  # echo blk > /debug/tracing/current_tracer
  # cat /debug/tracing/trace_pipe

To use block trace events:
  # echo 'block:*' > /debug/tracing/set_event
  # cat /trace_pipe

You can mixed-use blktracer and trace events, to see how they differ
in the output.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-20  8:38     ` Li Zefan
@ 2009-05-23 12:40       ` Christoph Hellwig
  2009-05-24  5:15         ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2009-05-23 12:40 UTC (permalink / raw)
  To: Li Zefan
  Cc: Christoph Hellwig, Jeff Moyer, Ingo Molnar, Jens Axboe,
	Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, KOSAKI Motohiro, LKML

On Wed, May 20, 2009 at 04:38:54PM +0800, Li Zefan wrote:
> Yes, this patch doesn't affect ioctl-based blktrace or ftrace-plugin
> blktrace at all.

Is there any good reason for keeping the ftrace plugin once we have
the fully function event tracer support for the block events?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-23 12:40       ` Christoph Hellwig
@ 2009-05-24  5:15         ` KOSAKI Motohiro
  2009-05-24  8:57           ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: KOSAKI Motohiro @ 2009-05-24  5:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: kosaki.motohiro, Li Zefan, Jeff Moyer, Ingo Molnar, Jens Axboe,
	Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, LKML

> On Wed, May 20, 2009 at 04:38:54PM +0800, Li Zefan wrote:
> > Yes, this patch doesn't affect ioctl-based blktrace or ftrace-plugin
> > blktrace at all.
> 
> Is there any good reason for keeping the ftrace plugin once we have
> the fully function event tracer support for the block events?

waiting userland blktrace tool convert using event tracer? 

just guess :-)




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-24  5:15         ` KOSAKI Motohiro
@ 2009-05-24  8:57           ` Christoph Hellwig
  2009-05-24 13:47             ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2009-05-24  8:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Hellwig, Li Zefan, Jeff Moyer, Ingo Molnar, Jens Axboe,
	Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, LKML

On Sun, May 24, 2009 at 02:15:05PM +0900, KOSAKI Motohiro wrote:
> > Is there any good reason for keeping the ftrace plugin once we have
> > the fully function event tracer support for the block events?
> 
> waiting userland blktrace tool convert using event tracer? 

blktrace(8) uses the ioctl interface, which we should keep for a long
time for backwards compatiblity.  


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] convert block trace points to TRACE_EVENT()
  2009-05-24  8:57           ` Christoph Hellwig
@ 2009-05-24 13:47             ` KOSAKI Motohiro
  0 siblings, 0 replies; 14+ messages in thread
From: KOSAKI Motohiro @ 2009-05-24 13:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: kosaki.motohiro, Li Zefan, Jeff Moyer, Ingo Molnar, Jens Axboe,
	Steven Rostedt, Frederic Weisbecker, Tom Zanussi,
	Theodore Ts'o, Steven Whitehouse, LKML

> On Sun, May 24, 2009 at 02:15:05PM +0900, KOSAKI Motohiro wrote:
> > > Is there any good reason for keeping the ftrace plugin once we have
> > > the fully function event tracer support for the block events?
> > 
> > waiting userland blktrace tool convert using event tracer? 
> 
> blktrace(8) uses the ioctl interface, which we should keep for a long
> time for backwards compatiblity.  

Ah, I misunderstand your point. sorry.
I though you suggested to remove ioctl. but it's reverse. ok. :)

So, I guess same formatting make easy understanding for administrator, maybe.






^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-05-24 13:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-14  6:20 [RFC][PATCH] convert block trace points to TRACE_EVENT() Li Zefan
2009-05-18  8:35 ` Ingo Molnar
2009-05-19  6:04   ` Li Zefan
2009-05-18 13:05 ` Frederic Weisbecker
2009-05-19  6:11   ` Li Zefan
2009-05-19 12:59 ` Jeff Moyer
2009-05-19 13:08   ` Christoph Hellwig
2009-05-19 15:49     ` FUJITA Tomonori
2009-05-19 17:33       ` Jens Axboe
2009-05-20  8:38     ` Li Zefan
2009-05-23 12:40       ` Christoph Hellwig
2009-05-24  5:15         ` KOSAKI Motohiro
2009-05-24  8:57           ` Christoph Hellwig
2009-05-24 13:47             ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).