* [RFC][PATCH] ftrace interface for blktrace
From: Arnaldo Carvalho de Melo @ 2009-01-16 18:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker, linux-kernel

Hi Jens,

	The patch below adds an ftrace interface for blktrace, allowing
people to use it without requiring any userspace tools, using sysfs to set up
the act_mask, pid, start_lba and end_lba filters.

	Example usage:

[root@f10-1 ~]# mount -t debugfs none /sys/kernel/debug/
[root@f10-1 ~]# ls /sys/block/sda/sda1/trace/
act_mask  enable  end_lba  pid  start_lba
[root@f10-1 ~]# cat /sys/kernel/debug/tracing/available_tracers 
blk function_graph wakeup function sched_switch nop
[root@f10-1 ~]# echo blk > /sys/kernel/debug/tracing/current_tracer 
[root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace
# tracer: blk
#
# DEV   CPU TIMESTAMP     PID ACT FLG
#  |     |     |           |   |   |
[root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable 
[root@f10-1 ~]# tail /sys/kernel/debug/tracing/trace
  8,1    0   143.070085  1208  P  NS [tail]
  8,1    0   143.070087  1208  I RBS 4154375 + 8 [tail]
  8,1    0   143.070091  1208  A WBS 4155391 + 16 <- (8,1) 4155328
  8,1    0   143.070091  1208  Q   R 4155391 + 16 [tail]
  8,1    0   143.070094  1208  G  RB 4155391 + 16 [tail]
  8,1    0   143.070094  1208  I RBS 4155391 + 16 [tail]
  8,1    0   143.070097  1208  U  WS [tail] 2
  8,1    0   143.070101  1208  D  WB 4154375 + 8 [tail]
  8,1    0   143.070341  1208  D  WB 4155391 + 16 [tail]
  8,1    0   143.070635     0  C  RS 4154375 + 8 [0]
[root@f10-1 ~]#

	Filters can be set using:

[root@f10-1 ~]# echo write,sync > /sys/block/sda/sda1/trace/act_mask 
[root@f10-1 ~]# cat /sys/block/sda/sda1/trace/act_mask 
0xa
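
	(For the record, the 0xa is just the OR of the category bits in
include/linux/blktrace_api.h:

	BLK_TC_WRITE	= 1 << 1,	/* writes, 0x2 */
	BLK_TC_SYNC	= 1 << 3,	/* sync,   0x8 */

so "write,sync" becomes 0x2 | 0x8 == 0xa.)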

	One can go on interactively adding more filters, for instance:

[root@f10-1 ~]# echo `pidof kjournald` > /sys/block/sda/sda1/trace/pid
[root@f10-1 ~]# cat /sys/block/sda/sda1/trace/pid
423
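
	The start_lba/end_lba pair works the same way; e.g., to limit the
trace to a sector range (made up numbers):

[root@f10-1 ~]# echo 4150000 > /sys/block/sda/sda1/trace/start_lba
[root@f10-1 ~]# echo 4160000 > /sys/block/sda/sda1/trace/end_lba
[root@f10-1 ~]# cat /sys/block/sda/sda1/trace/end_lba
4160000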

	And use the existing ftrace infrastructure to, for instance,
replicate the btrace (blktrace -o - | blkparse) functionality by simply
using cat:

[root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
  8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
  8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
  8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
  8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
  8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
  8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
  8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
  8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288
CONTROL+C
[root@f10-1 ~]#

	This patch was done on Linus' tree; I'll now work with it on
linux-tip so that I can experiment with binary tracing, i.e. asking for
the output of /sys/kernel/debug/tracing/trace_pipe to be raw, exactly
the one provided by the relayfs-based variant, so that we will be able
to continue using blkparse to validate the trace entries, as a
debugging aid.
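
	Something like this is the idea (made up, no such raw pipe file
exists yet):

[root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe_raw | blkparse -i -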

	Also there are other interesting things that one gets from using
the ftrace infrastructure, such as being able to see just the traffic
that happens in a given cpuset (/sys/kernel/debug/tracing/tracing_cpumask),
and others that I plan to add, such as what Steve did in "[PATCH 2/2]
ftrace: add stack trace to function tracer", so that we could have
things like (made up example):

[root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
  8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
  8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
  8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
  8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
 <= sync_buffer
 <= __wait_on_bit
 <= out_of_line_wait_on_bit
 <= __wait_on_buffer
 <= sync_dirty_buffer
 <= journal_commit_transaction
 <= kjournald
  8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
  8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
  8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
  8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288

	More polishing is needed; I wrote the patch this way just to
show where things are different: new code for registering the tracer,
bits stolen from the userspace blktrace, and the sysfs configuration
at the bottom.

	Ideally we would move this to kernel/trace/block.c, break it
into smaller self-contained patches, etc.

Comments?

- Arnaldo

 block/blktrace.c      |  578 +++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/partitions/check.c |    7 
 2 files changed, 583 insertions(+), 2 deletions(-)

diff --git a/block/blktrace.c b/block/blktrace.c
index b0a2cae..10985f9 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -25,9 +25,13 @@
 #include <linux/time.h>
 #include <trace/block.h>
 #include <asm/uaccess.h>
+#include <../kernel/trace/trace.h>
 
 static unsigned int blktrace_seq __read_mostly = 1;
 
+static struct trace_array *blk_tr;
+static int __read_mostly  blk_tracer_enabled;
+
 /* Global reference count of probes */
 static DEFINE_MUTEX(blk_probe_mutex);
 static atomic_t blk_probes_ref = ATOMIC_INIT(0);
@@ -43,6 +47,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 {
 	struct blk_io_trace *t;
 
+	if (!bt->rchan)
+		return;
+
 	t = relay_reserve(bt->rchan, sizeof(*t) + len);
 	if (t) {
 		const int cpu = smp_processor_id();
@@ -90,6 +97,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
 	unsigned long flags;
 	char *buf;
 
+	if (!bt->msg_data)
+		return;
+
 	local_irq_save(flags);
 	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
 	va_start(args, fmt);
@@ -131,13 +141,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
 {
 	struct task_struct *tsk = current;
+	struct ring_buffer_event *event = NULL;
 	struct blk_io_trace *t;
 	unsigned long flags;
 	unsigned long *sequence;
 	pid_t pid;
 	int cpu;
 
-	if (unlikely(bt->trace_state != Blktrace_running))
+	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
 		return;
 
 	what |= ddir_act[rw & WRITE];
@@ -158,6 +169,25 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	 */
 	local_irq_save(flags);
 
+	if (blk_tr) {
+		struct trace_array_cpu *data;
+
+		tracing_record_cmdline(current);
+		cpu = raw_smp_processor_id();
+		data = blk_tr->data[cpu];
+
+		if (unlikely(atomic_read(&data->disabled)))
+			goto out_local_irq_restore;
+
+		event = ring_buffer_lock_reserve(blk_tr->buffer,
+						 sizeof(*t) + pdu_len, &flags);
+		if (!event)
+			goto out_local_irq_restore;
+
+		t = ring_buffer_event_data(event);
+		goto record_it;
+	}
+
 	if (unlikely(tsk->btrace_seq != blktrace_seq))
 		trace_note_tsk(bt, tsk);
 
@@ -169,19 +199,26 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
 		t->sequence = ++(*sequence);
 		t->time = ktime_to_ns(ktime_get());
+		t->cpu = cpu;
+record_it:
 		t->sector = sector;
 		t->bytes = bytes;
 		t->action = what;
 		t->pid = pid;
 		t->device = bt->dev;
-		t->cpu = cpu;
 		t->error = error;
 		t->pdu_len = pdu_len;
 
 		if (pdu_len)
 			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
+
+		if (blk_tr) {
+			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
+			trace_wake_up();
+		}
 	}
 
+out_local_irq_restore:
 	local_irq_restore(flags);
 }
 
@@ -888,3 +925,540 @@ static void blk_unregister_tracepoints(void)
 
 	tracepoint_synchronize_unregister();
 }
+
+/*
+ * struct blk_io_tracer formatting routines
+ */
+
+static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
+{
+        int i = 0;
+
+        if (t->action & BLK_TC_DISCARD)	   rwbs[i++] = 'D';
+        else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
+        else if (t->bytes)		   rwbs[i++] = 'R';
+        else				   rwbs[i++] = 'N';
+
+        if (t->action & BLK_TC_AHEAD)	   rwbs[i++] = 'A';
+        if (t->action & BLK_TC_BARRIER)	   rwbs[i++] = 'B';
+        if (t->action & BLK_TC_SYNC)	   rwbs[i++] = 'S';
+        if (t->action & BLK_TC_META)	   rwbs[i++] = 'M';
+
+        rwbs[i] = '\0';
+}
+
+static inline void *pdu_start(struct blk_io_trace *t)
+{
+	return t + 1;
+}
+
+static inline u32 t_sec(const struct blk_io_trace *t)
+{
+	return t->bytes >> 9;
+}
+
+static __u64 get_pdu_int(struct blk_io_trace *t)
+{
+	__u64 *val = pdu_start(t);
+
+	return be64_to_cpu(*val);
+}
+
+static void get_pdu_remap(struct blk_io_trace *t, struct blk_io_trace_remap *r)
+{
+	struct blk_io_trace_remap *__r = pdu_start(t);
+	__u64 sector = __r->sector;
+
+	r->device = be32_to_cpu(__r->device);
+	r->device_from = be32_to_cpu(__r->device_from);
+	r->sector = be64_to_cpu(sector);
+}
+
+static int blk_log_action(struct trace_iterator *iter, struct blk_io_trace *t,
+			  const char *act)
+{
+	char rwbs[6];
+	unsigned long long ts	= ns2usecs(iter->ts);
+	unsigned long usec_rem	= do_div(ts, USEC_PER_SEC);
+	unsigned secs		= (unsigned long)ts;
+
+	fill_rwbs(rwbs, t);
+	return trace_seq_printf(&iter->seq,
+				"%3d,%-3d %2d %5d.%06lu %5u %2s %3s ",
+				MAJOR(t->device), MINOR(t->device), iter->cpu,
+				secs, usec_rem, t->pid, act, rwbs);
+}
+
+static int blk_log_generic(struct trace_iterator *iter, struct blk_io_trace *t,
+			   const char *act)
+{
+	int ret = blk_log_action(iter, t, act);
+	if (ret) {
+		const char *cmd = trace_find_cmdline(t->pid);
+
+		if (t_sec(t))
+			ret = trace_seq_printf(&iter->seq, "%llu + %u [%s]\n",
+					       (unsigned long long)t->sector,
+					       t_sec(t), cmd);
+		else
+			ret = trace_seq_printf(&iter->seq, "[%s]\n", cmd);
+	}
+	return ret;
+}
+
+static int blk_log_with_err(struct trace_iterator *iter, struct blk_io_trace *t,
+			    const char *act)
+{
+	int ret = blk_log_action(iter, t, act);
+	if (ret) {
+		if (t_sec(t))
+			ret = trace_seq_printf(&iter->seq, "%llu + %u [%d]\n",
+					       (unsigned long long)t->sector,
+					       t_sec(t), t->error);
+		else
+			ret = trace_seq_printf(&iter->seq, "%llu [%d]\n",
+					       (unsigned long long)t->sector,
+					       t->error);
+	}
+	return ret;
+}
+
+static int blk_log_remap(struct trace_iterator *iter, struct blk_io_trace *t)
+{
+	int ret = blk_log_action(iter, t, "A");
+	if (ret) {
+		struct blk_io_trace_remap r = { .device = 0, };
+
+		get_pdu_remap(t, &r);
+		t->device = r.device_from;
+		ret = trace_seq_printf(&iter->seq, "%llu + %u <- (%d,%d) %llu\n",
+				       (unsigned long long)t->sector,
+				       t_sec(t), MAJOR(r.device), MINOR(r.device),
+				       (unsigned long long)r.sector);
+	}
+	return ret;
+}
+
+static int blk_log_plug(struct trace_iterator *iter, struct blk_io_trace *t)
+{
+	int ret = blk_log_action(iter, t, "P");
+	if (ret)
+		ret = trace_seq_printf(&iter->seq, "[%s]\n",
+				       trace_find_cmdline(t->pid));
+	return ret;
+}
+
+static int blk_log_unplug(struct trace_iterator *iter, struct blk_io_trace *t,
+			  const char *act)
+{
+	int ret = blk_log_action(iter, t, act);
+	if (ret)
+		ret = trace_seq_printf(&iter->seq, "[%s] %llu\n",
+				       trace_find_cmdline(t->pid),
+				       (unsigned long long)get_pdu_int(t));
+	return ret;
+}
+
+static int blk_log_split(struct trace_iterator *iter, struct blk_io_trace *t,
+			 const char *act)
+{
+	int ret = blk_log_action(iter, t, act);
+	if (ret)
+		ret = trace_seq_printf(&iter->seq, "%llu / %llu [%s]\n",
+				       (unsigned long long)t->sector,
+				       (unsigned long long)get_pdu_int(t),
+				       trace_find_cmdline(t->pid));
+	return ret;
+}
+
+/*
+ * struct tracer operations
+ */
+
+static void blk_tracer_print_header(struct seq_file *m)
+{
+	seq_puts(m, "# DEV   CPU TIMESTAMP     PID ACT FLG\n"
+		    "#  |     |     |           |   |   |\n");
+}
+
+static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct blk_io_trace *t = (struct blk_io_trace *)iter->ent;
+	const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1);
+	int ret;
+
+	switch (what) {
+	case __BLK_TA_ISSUE:	    ret = blk_log_generic(iter, t, "D"); break;
+	case __BLK_TA_INSERT:	    ret = blk_log_generic(iter, t, "I"); break;
+	case __BLK_TA_QUEUE:	    ret = blk_log_generic(iter, t, "Q"); break;
+	case __BLK_TA_BOUNCE:	    ret = blk_log_generic(iter, t, "B"); break;
+	case __BLK_TA_BACKMERGE:    ret = blk_log_generic(iter, t, "M"); break;
+	case __BLK_TA_FRONTMERGE:   ret = blk_log_generic(iter, t, "F"); break;
+	case __BLK_TA_GETRQ:	    ret = blk_log_generic(iter, t, "G"); break;
+	case __BLK_TA_SLEEPRQ:	    ret = blk_log_generic(iter, t, "S"); break;
+	case __BLK_TA_REQUEUE:	    ret = blk_log_with_err(iter, t, "R"); break;
+	case __BLK_TA_COMPLETE:	    ret = blk_log_with_err(iter, t, "C"); break;
+	case __BLK_TA_PLUG:	    ret = blk_log_plug(iter, t); break;
+	case __BLK_TA_UNPLUG_IO:    ret = blk_log_unplug(iter, t, "U"); break;
+	case __BLK_TA_UNPLUG_TIMER: ret = blk_log_unplug(iter, t, "UT"); break;
+	case __BLK_TA_REMAP:	    ret = blk_log_remap(iter, t); break;
+	case __BLK_TA_SPLIT:	    ret = blk_log_split(iter, t, "X"); break;
+	default:
+		ret = trace_seq_printf(s, "Bad pc action %x\n", what);
+		break;
+	}
+
+	if (ret == 0)
+		return TRACE_TYPE_PARTIAL_LINE;
+
+	return TRACE_TYPE_HANDLED;
+}
+
+static void blk_tracer_start(struct trace_array *tr)
+{
+	int cpu;
+
+	tr->time_start = ftrace_now(tr->cpu);
+
+	for_each_online_cpu(cpu)
+		tracing_reset(tr, cpu);
+
+	mutex_lock(&blk_probe_mutex);
+	if (atomic_add_return(1, &blk_probes_ref) == 1)
+		if (blk_register_tracepoints())
+			atomic_dec(&blk_probes_ref);
+	mutex_unlock(&blk_probe_mutex);
+}
+
+static int blk_tracer_init(struct trace_array *tr)
+{
+	blk_tr = tr;
+	blk_tracer_start(tr);
+	mutex_lock(&blk_probe_mutex);
+	blk_tracer_enabled++;
+	mutex_unlock(&blk_probe_mutex);
+	return 0;
+}
+
+static void blk_tracer_stop(struct trace_array *tr)
+{
+	mutex_lock(&blk_probe_mutex);
+	if (atomic_dec_and_test(&blk_probes_ref))
+		blk_unregister_tracepoints();
+	mutex_unlock(&blk_probe_mutex);
+}
+
+static void blk_tracer_reset(struct trace_array *tr)
+{
+	if (!atomic_read(&blk_probes_ref))
+		return;
+
+	mutex_lock(&blk_probe_mutex);
+	blk_tracer_enabled--;
+	WARN_ON(blk_tracer_enabled < 0);
+	mutex_unlock(&blk_probe_mutex);
+
+	blk_tracer_stop(tr);
+}
+
+static struct tracer blk_tracer __read_mostly = {
+	.name		= "blk",
+	.init		= blk_tracer_init,
+	.reset		= blk_tracer_reset,
+	.start		= blk_tracer_start,
+	.stop		= blk_tracer_stop,
+	.print_header	= blk_tracer_print_header,
+	.print_line	= blk_tracer_print_line,
+};
+
+static int __init init_blk_tracer(void)
+{
+	return register_tracer(&blk_tracer);
+}
+
+device_initcall(init_blk_tracer);
+
+static int blk_trace_remove_queue(struct request_queue *q)
+{
+	struct blk_trace *bt;
+
+	bt = xchg(&q->blk_trace, NULL);
+	if (bt == NULL)
+		return -EINVAL;
+
+	kfree(bt);
+	return 0;
+}
+
+/*
+ * Setup everything required to start tracing
+ */
+static int blk_trace_setup_queue(struct request_queue *q, dev_t dev)
+{
+	struct blk_trace *old_bt, *bt = NULL;
+	int ret;
+
+	ret = -ENOMEM;
+	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
+	if (!bt)
+		goto err;
+
+	bt->dev = dev;
+	bt->act_mask = (u16)-1;
+	bt->end_lba = -1ULL;
+	bt->trace_state = Blktrace_running;
+
+	old_bt = xchg(&q->blk_trace, bt);
+	if (old_bt != NULL) {
+		(void)xchg(&q->blk_trace, old_bt);
+		kfree(bt);
+		return -EBUSY;
+	}
+	return 0;
+err:
+	return ret;
+}
+
+/*
+ * sysfs interface to enable and configure tracing
+ */
+
+static ssize_t sysfs_blk_trace_enable_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	struct block_device *bdev;
+	ssize_t ret = -ENXIO;
+
+	lock_kernel();
+	bdev = bdget(part_devt(p));
+	if (bdev != NULL) {
+		struct request_queue *q = bdev_get_queue(bdev);
+
+		if (q != NULL) {
+			mutex_lock(&bdev->bd_mutex);
+			ret = sprintf(buf, "%u\n", !!q->blk_trace);
+			mutex_unlock(&bdev->bd_mutex);
+		}
+
+		bdput(bdev);
+	}
+
+	unlock_kernel();
+	return ret;
+}
+
+static ssize_t sysfs_blk_trace_enable_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t count)
+{
+	struct block_device *bdev;
+	struct request_queue *q;
+	struct hd_struct *p;
+	int value;
+	ssize_t ret = -ENXIO;
+
+	if (count == 0 || sscanf(buf, "%d", &value) != 1)
+		goto out;
+
+	lock_kernel();
+	p = dev_to_part(dev);
+	bdev = bdget(part_devt(p));
+	if (bdev == NULL)
+		goto out_unlock_kernel;
+
+	q = bdev_get_queue(bdev);
+	if (q == NULL)
+		goto out_bdput;
+
+	mutex_lock(&bdev->bd_mutex);
+	if (value)
+		ret = blk_trace_setup_queue(q, bdev->bd_dev);
+	else
+		ret = blk_trace_remove_queue(q);
+	mutex_unlock(&bdev->bd_mutex);
+
+	if (ret == 0)
+		ret = count;
+out_bdput:
+	bdput(bdev);
+out_unlock_kernel:
+	unlock_kernel();
+out:
+	return ret;
+}
+
+static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count);
+#define BLK_TRACE_DEVICE_ATTR(_name) \
+	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
+		    sysfs_blk_trace_attr_show, \
+		    sysfs_blk_trace_attr_store)
+
+static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR,
+		   sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store);
+static BLK_TRACE_DEVICE_ATTR(act_mask);
+static BLK_TRACE_DEVICE_ATTR(pid);
+static BLK_TRACE_DEVICE_ATTR(start_lba);
+static BLK_TRACE_DEVICE_ATTR(end_lba);
+
+static struct attribute *blk_trace_attrs[] = {
+	&dev_attr_enable.attr,
+	&dev_attr_act_mask.attr,
+	&dev_attr_pid.attr,
+	&dev_attr_start_lba.attr,
+	&dev_attr_end_lba.attr,
+	NULL
+};
+
+struct attribute_group blk_trace_attr_group = {
+	.name  = "trace",
+	.attrs = blk_trace_attrs,
+};
+
+static int blk_str2act_mask(const char *str)
+{
+	int mask = 0;
+	char *copy = kstrdup(str, GFP_KERNEL), *s;
+
+	if (copy == NULL)
+		return -ENOMEM;
+
+	s = strstrip(copy);
+
+	while (1) {
+		char *sep = strchr(s, ',');
+
+		if (sep != NULL)
+			*sep = '\0';
+
+		if (strcasecmp(s, "barrier") == 0)
+			mask |= BLK_TC_BARRIER;
+		else if (strcasecmp(s, "complete") == 0)
+			mask |= BLK_TC_COMPLETE;
+		else if (strcasecmp(s, "fs") == 0)
+			mask |= BLK_TC_FS;
+		else if (strcasecmp(s, "issue") == 0)
+			mask |= BLK_TC_ISSUE;
+		else if (strcasecmp(s, "pc") == 0)
+			mask |= BLK_TC_PC;
+		else if (strcasecmp(s, "queue") == 0)
+			mask |= BLK_TC_QUEUE;
+		else if (strcasecmp(s, "read") == 0)
+			mask |= BLK_TC_READ;
+		else if (strcasecmp(s, "requeue") == 0)
+			mask |= BLK_TC_REQUEUE;
+		else if (strcasecmp(s, "sync") == 0)
+			mask |= BLK_TC_SYNC;
+		else if (strcasecmp(s, "write") == 0)
+			mask |= BLK_TC_WRITE;
+
+		if (sep == NULL)
+			break;
+
+		s = sep + 1;
+	}
+	kfree(copy);
+
+	return mask;
+}
+
+static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	struct request_queue *q;
+	struct block_device *bdev;
+	ssize_t ret = -ENXIO;
+
+	lock_kernel();
+	bdev = bdget(part_devt(p));
+	if (bdev == NULL)
+		goto out_unlock_kernel;
+
+	q = bdev_get_queue(bdev);
+	if (q == NULL)
+		goto out_bdput;
+	mutex_lock(&bdev->bd_mutex);
+	if (q->blk_trace == NULL)
+		ret = sprintf(buf, "disabled\n");
+	else if (attr == &dev_attr_act_mask)
+		ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask);
+	else if (attr == &dev_attr_pid)
+		ret = sprintf(buf, "%u\n", q->blk_trace->pid);
+	else if (attr == &dev_attr_start_lba)
+		ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
+	else if (attr == &dev_attr_end_lba)
+		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
+	mutex_unlock(&bdev->bd_mutex);
+out_bdput:
+	bdput(bdev);
+out_unlock_kernel:
+	unlock_kernel();
+	return ret;
+}
+
+static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct block_device *bdev;
+	struct request_queue *q;
+	struct hd_struct *p;
+	u64 value;
+	ssize_t ret = -ENXIO;
+
+	if (count == 0)
+		goto out;
+
+	if (attr == &dev_attr_act_mask) {
+		if (sscanf(buf, "%llx", &value) != 1) {
+			/* Assume it is a list of trace category names */
+			int mask = blk_str2act_mask(buf);
+			if (mask < 0)
+				goto out;
+			value = mask;
+		}
+	} else if (sscanf(buf, "%llu", &value) != 1)
+		goto out;
+
+	lock_kernel();
+	p = dev_to_part(dev);
+	bdev = bdget(part_devt(p));
+	if (bdev == NULL)
+		goto out_unlock_kernel;
+
+	q = bdev_get_queue(bdev);
+	if (q == NULL)
+		goto out_bdput;
+
+	mutex_lock(&bdev->bd_mutex);
+	ret = 0;
+	if (q->blk_trace == NULL)
+		ret = blk_trace_setup_queue(q, bdev->bd_dev);
+
+	if (ret == 0) {
+		if (attr == &dev_attr_act_mask)
+			q->blk_trace->act_mask = value;
+		else if (attr == &dev_attr_pid)
+			q->blk_trace->pid = value;
+		else if (attr == &dev_attr_start_lba)
+			q->blk_trace->start_lba = value;
+		else if (attr == &dev_attr_end_lba)
+			q->blk_trace->end_lba = value;
+		ret = count;
+	}
+	mutex_unlock(&bdev->bd_mutex);
+out_bdput:
+	bdput(bdev);
+out_unlock_kernel:
+	unlock_kernel();
+out:
+	return ret;
+}
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 6d72024..01714ef 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -268,6 +268,10 @@ ssize_t part_fail_store(struct device *dev,
 }
 #endif
 
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+extern struct attribute_group blk_trace_attr_group;
+#endif
+
 static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
 static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
 static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
@@ -294,6 +298,9 @@ static struct attribute_group part_attr_group = {
 
 static struct attribute_group *part_attr_groups[] = {
 	&part_attr_group,
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	&blk_trace_attr_group,
+#endif
 	NULL
 };
 


* Re: [RFC][PATCH] ftrace interface for blktrace
From: Steven Rostedt @ 2009-01-16 19:15 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, linux-kernel


Arnaldo,

Cool stuff!

On Fri, 16 Jan 2009, Arnaldo Carvalho de Melo wrote:

> Hi Jens,
> 
> 	The patch below adds an ftrace interface for blktrace, allowing
> people to use it without requiring any userspace tools, using sysfs to set up
> the act_mask, pid, start_lba and end_lba filters.
> 
> 	Example usage:
> 
> [root@f10-1 ~]# mount -t debugfs none /sys/kernel/debug/

I think you are the only person that mounts debugfs on the "official"
location ;-)

> [root@f10-1 ~]# ls /sys/block/sda/sda1/trace/
> act_mask  enable  end_lba  pid  start_lba
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/available_tracers 
> blk function_graph wakeup function sched_switch nop
> [root@f10-1 ~]# echo blk > /sys/kernel/debug/tracing/current_tracer 

You must like to type.

-- Steve

> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace
> # tracer: blk
> #
> # DEV   CPU TIMESTAMP     PID ACT FLG
> #  |     |     |           |   |   |
> [root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable 
> [root@f10-1 ~]# tail /sys/kernel/debug/tracing/trace
>   8,1    0   143.070085  1208  P  NS [tail]
>   8,1    0   143.070087  1208  I RBS 4154375 + 8 [tail]
>   8,1    0   143.070091  1208  A WBS 4155391 + 16 <- (8,1) 4155328
>   8,1    0   143.070091  1208  Q   R 4155391 + 16 [tail]
>   8,1    0   143.070094  1208  G  RB 4155391 + 16 [tail]
>   8,1    0   143.070094  1208  I RBS 4155391 + 16 [tail]
>   8,1    0   143.070097  1208  U  WS [tail] 2
>   8,1    0   143.070101  1208  D  WB 4154375 + 8 [tail]
>   8,1    0   143.070341  1208  D  WB 4155391 + 16 [tail]
>   8,1    0   143.070635     0  C  RS 4154375 + 8 [0]
> [root@f10-1 ~]#
> 


* Re: [RFC][PATCH] ftrace interface for blktrace
From: Arnaldo Carvalho de Melo @ 2009-01-16 19:41 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, linux-kernel

Em Fri, Jan 16, 2009 at 02:15:55PM -0500, Steven Rostedt escreveu:
> 
> Arnaldo,
> 
> Cool stuff!
> 
> On Fri, 16 Jan 2009, Arnaldo Carvalho de Melo wrote:
> 
> > Hi Jens,
> > 
> > 	The patch below adds an ftrace interface for blktrace, allowing
> > people to use it without requiring any userspace tools, using sysfs to set up
> > the act_mask, pid, start_lba and end_lba filters.
> > 
> > 	Example usage:
> > 
> > [root@f10-1 ~]# mount -t debugfs none /sys/kernel/debug/
> 
> I think you are the only person that mounts debugfs on the "official"
> location ;-)
> 
> > [root@f10-1 ~]# ls /sys/block/sda/sda1/trace/
> > act_mask  enable  end_lba  pid  start_lba
> > [root@f10-1 ~]# cat /sys/kernel/debug/tracing/available_tracers 
> > blk function_graph wakeup function sched_switch nop
> > [root@f10-1 ~]# echo blk > /sys/kernel/debug/tracing/current_tracer 
> 
> You must like to type.

Nah, autocompletion and 'set -o vi' command line editing make it easy,
but yes, one could mount it as:

# mkdir /d
# mount -t debugfs none /d

:-)

Other than that, I guess the implementation must be OK, huh? :-)

Well, off to play fat boys soccer.

- Arnaldo 


* Re: [RFC][PATCH] ftrace interface for blktrace
From: Frederic Weisbecker @ 2009-01-17 13:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jens Axboe, Ingo Molnar, Steven Rostedt,
	linux-kernel

On Fri, Jan 16, 2009 at 04:16:11PM -0200, Arnaldo Carvalho de Melo wrote:
> Hi Jens,
> 
> 	The patch below adds an ftrace interface for blktrace, allowing
> people to use it without requiring any userspace tools, using sysfs to set up
> the act_mask, pid, start_lba and end_lba filters.
> 	Example usage:
> 
> [root@f10-1 ~]# mount -t debugfs none /sys/kernel/debug/
> [root@f10-1 ~]# ls /sys/block/sda/sda1/trace/
> act_mask  enable  end_lba  pid  start_lba


That reminds me it would be good to provide a more powerful tracer options
API, to set numeric or string values for the options, like the kernel
parameters... That way you would avoid implementing your own way to do it.
I should put my hands into it one day.
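
Something like this, say (pure sketch, none of these names exist):

	/* hypothetical API, just to show the idea */
	static struct tracer_param blk_params[] = {
		{ .name = "pid",      .type = TRACE_PARAM_UINT },
		{ .name = "act_mask", .type = TRACE_PARAM_STRING },
		{ }
	};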


> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/available_tracers 
> blk function_graph wakeup function sched_switch nop
> [root@f10-1 ~]# echo blk > /sys/kernel/debug/tracing/current_tracer 
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace
> # tracer: blk
> #
> # DEV   CPU TIMESTAMP     PID ACT FLG
> #  |     |     |           |   |   |
> [root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable 
> [root@f10-1 ~]# tail /sys/kernel/debug/tracing/trace
>   8,1    0   143.070085  1208  P  NS [tail]
>   8,1    0   143.070087  1208  I RBS 4154375 + 8 [tail]
>   8,1    0   143.070091  1208  A WBS 4155391 + 16 <- (8,1) 4155328
>   8,1    0   143.070091  1208  Q   R 4155391 + 16 [tail]
>   8,1    0   143.070094  1208  G  RB 4155391 + 16 [tail]
>   8,1    0   143.070094  1208  I RBS 4155391 + 16 [tail]
>   8,1    0   143.070097  1208  U  WS [tail] 2
>   8,1    0   143.070101  1208  D  WB 4154375 + 8 [tail]
>   8,1    0   143.070341  1208  D  WB 4155391 + 16 [tail]
>   8,1    0   143.070635     0  C  RS 4154375 + 8 [0]


nice!


> [root@f10-1 ~]#
> 
> 	Filters can be set using:
> 
> [root@f10-1 ~]# echo write,sync > /sys/block/sda/sda1/trace/act_mask 
> [root@f10-1 ~]# cat /sys/block/sda/sda1/trace/act_mask 
> 0xa

It would be better to see it as a string, the way you set it.

 
> 	One can go on interactively adding more filters, for instance:
> 
> [root@f10-1 ~]# echo `pidof kjournald` > /sys/block/sda/sda1/trace/pid
> [root@f10-1 ~]# cat /sys/block/sda/sda1/trace/pid
> 423


Or why not simply kjournald? :-)


> 	And use the existing ftrace infrastructure to, for instance,
> replicate the btrace (blktrace -o - | blkparse) functionality by simply
> using cat:
> 
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
>   8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
>   8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
>   8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
>   8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
>   8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
>   8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
>   8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
>   8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288
> CONTROL+C
> [root@f10-1 ~]#
> 
> 	This patch was done on Linus' tree; I'll now work with it on
> linux-tip so that I can experiment with binary tracing, i.e. asking for
> the output of /sys/kernel/debug/tracing/trace_pipe to be raw, exactly
> the one provided by the relayfs-based variant, so that we will be able
> to continue using blkparse to validate the trace entries, as a
> debugging aid.


But I guess some changes will be needed in the tracing core to make
it support a binary flow...


> 	Also there are other interesting things that one gets from using
> the ftrace infrastructure, such as being able to see just the traffic
> that happens in a given cpuset (/sys/kernel/debug/tracing/tracing_cpumask),
> and others that I plan to add, such as what Steve did in "[PATCH 2/2]
> ftrace: add stack trace to function tracer", so that we could have
> things like (made up example):
> 
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
>   8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
>   8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
>   8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
>   8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
>  <= sync_buffer
>  <= __wait_on_bit
>  <= out_of_line_wait_on_bit
>  <= __wait_on_buffer
>  <= sync_dirty_buffer
>  <= journal_commit_transaction
>  <= kjournald
>   8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
>   8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
>   8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
>   8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288


Cool, but I guess you will not need this ftrace plugin to do it:
stack trace entries have already been supported for a while.

 
> 	More polishing is needed; I wrote the patch this way just to
> show where things are different: new code for registering the tracer,
> bits stolen from the userspace blktrace, and the sysfs configuration
> at the bottom.
> 
> 	Ideally we would move this to kernel/trace/block.c, break it
> into smaller self-contained patches, etc.
> 
> Comments?
> 
> - Arnaldo
> 
>  block/blktrace.c      |  578 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/partitions/check.c |    7 
>  2 files changed, 583 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blktrace.c b/block/blktrace.c
> index b0a2cae..10985f9 100644
> --- a/block/blktrace.c
> +++ b/block/blktrace.c
> @@ -25,9 +25,13 @@
>  #include <linux/time.h>
>  #include <trace/block.h>
>  #include <asm/uaccess.h>
> +#include <../kernel/trace/trace.h>


Yeah, it really needs to be moved on kernel/trace :-)


>  static unsigned int blktrace_seq __read_mostly = 1;
>  
> +static struct trace_array *blk_tr;
> +static int __read_mostly  blk_tracer_enabled;
> +
>  /* Global reference count of probes */
>  static DEFINE_MUTEX(blk_probe_mutex);
>  static atomic_t blk_probes_ref = ATOMIC_INIT(0);
> @@ -43,6 +47,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
>  {
>  	struct blk_io_trace *t;
>  
> +	if (!bt->rchan)
> +		return;
> +
>  	t = relay_reserve(bt->rchan, sizeof(*t) + len);
>  	if (t) {
>  		const int cpu = smp_processor_id();
> @@ -90,6 +97,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
>  	unsigned long flags;
>  	char *buf;
>  
> +	if (!bt->msg_data)
> +		return;
> +
>  	local_irq_save(flags);
>  	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
>  	va_start(args, fmt);
> @@ -131,13 +141,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
>  {
>  	struct task_struct *tsk = current;
> +	struct ring_buffer_event *event = NULL;
>  	struct blk_io_trace *t;
>  	unsigned long flags;
>  	unsigned long *sequence;
>  	pid_t pid;
>  	int cpu;
>  
> -	if (unlikely(bt->trace_state != Blktrace_running))
> +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
>  		return;
>  
>  	what |= ddir_act[rw & WRITE];
> @@ -158,6 +169,25 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	 */
>  	local_irq_save(flags);


Is the local_irq_save still needed here? Note that the ring buffer is safe by itself.

> +	if (blk_tr) {
> +		struct trace_array_cpu *data;
> +
> +		tracing_record_cmdline(current);
> +		cpu = raw_smp_processor_id();


I think you can use smp_processor_id(). There is no risk of tracing recursion here.


> +		data = blk_tr->data[cpu];
> +
> +		if (unlikely(atomic_read(&data->disabled)))
> +			goto out_local_irq_restore;
> +
> +		event = ring_buffer_lock_reserve(blk_tr->buffer,
> +						 sizeof(*t) + pdu_len, &flags);
> +		if (!event)
> +			goto out_local_irq_restore;
> +
> +		t = ring_buffer_event_data(event);
> +		goto record_it;
> +	}
> +
>  	if (unlikely(tsk->btrace_seq != blktrace_seq))
>  		trace_note_tsk(bt, tsk);
>  
> @@ -169,19 +199,26 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
>  		t->sequence = ++(*sequence);
>  		t->time = ktime_to_ns(ktime_get());

You already have a timestamp for each entry on the ring buffer.

> +		t->cpu = cpu;


Ditto. But perhaps you can be preempted between the block action and the trace insertion.
I'm not sure of the context here... So perhaps you must keep it.


> +record_it:
>  		t->sector = sector;
>  		t->bytes = bytes;
>  		t->action = what;
>  		t->pid = pid;
>  		t->device = bt->dev;
> -		t->cpu = cpu;
>  		t->error = error;
>  		t->pdu_len = pdu_len;
>  
>  		if (pdu_len)
>  			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
> +
> +		if (blk_tr) {
> +			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
> +			trace_wake_up();
> +		}
>  	}
>  
> +out_local_irq_restore:
>  	local_irq_restore(flags);
>  }
>  
> @@ -888,3 +925,540 @@ static void blk_unregister_tracepoints(void)
>  
>  	tracepoint_synchronize_unregister();
>  }
> +
> +/*
> + * struct blk_io_tracer formatting routines
> + */
> +
> +static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
> +{
> +        int i = 0;
> +
> +        if (t->action & BLK_TC_DISCARD)	   rwbs[i++] = 'D';
> +        else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
> +        else if (t->bytes)		   rwbs[i++] = 'R';
> +        else				   rwbs[i++] = 'N';
> +
> +        if (t->action & BLK_TC_AHEAD)	   rwbs[i++] = 'A';
> +        if (t->action & BLK_TC_BARRIER)	   rwbs[i++] = 'B';
> +        if (t->action & BLK_TC_SYNC)	   rwbs[i++] = 'S';
> +        if (t->action & BLK_TC_META)	   rwbs[i++] = 'M';
> +
> +        rwbs[i] = '\0';
> +}
> +
> +static inline void *pdu_start(struct blk_io_trace *t)
> +{
> +	return t + 1;
> +}
> +
> +static inline u32 t_sec(const struct blk_io_trace *t)
> +{
> +	return t->bytes >> 9;
> +}
> +
> +static __u64 get_pdu_int(struct blk_io_trace *t)
> +{
> +	__u64 *val = pdu_start(t);
> +
> +	return be64_to_cpu(*val);
> +}
> +
> +static void get_pdu_remap(struct blk_io_trace *t, struct blk_io_trace_remap *r)
> +{
> +	struct blk_io_trace_remap *__r = pdu_start(t);
> +	__u64 sector = __r->sector;
> +
> +	r->device = be32_to_cpu(__r->device);
> +	r->device_from = be32_to_cpu(__r->device_from);
> +	r->sector = be64_to_cpu(sector);
> +}
> +
> +static int blk_log_action(struct trace_iterator *iter, struct blk_io_trace *t,
> +			  const char *act)
> +{
> +	char rwbs[6];
> +	unsigned long long ts	= ns2usecs(iter->ts);
> +	unsigned long usec_rem	= do_div(ts, USEC_PER_SEC);
> +	unsigned secs		= (unsigned long)ts;
> +
> +	fill_rwbs(rwbs, t);
> +	return trace_seq_printf(&iter->seq,
> +				"%3d,%-3d %2d %5d.%06lu %5u %2s %3s ",
> +				MAJOR(t->device), MINOR(t->device), iter->cpu,
> +				secs, usec_rem, t->pid, act, rwbs);
> +}
> +
> +static int blk_log_generic(struct trace_iterator *iter, struct blk_io_trace *t,
> +			   const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret) {
> +		const char *cmd = trace_find_cmdline(t->pid);
> +
> +		if (t_sec(t))
> +			ret = trace_seq_printf(&iter->seq, "%llu + %u [%s]\n",
> +					       (unsigned long long)t->sector,
> +					       t_sec(t), cmd);
> +		else
> +			ret = trace_seq_printf(&iter->seq, "[%s]\n", cmd);
> +	}
> +	return ret;
> +}
> +
> +static int blk_log_with_err(struct trace_iterator *iter, struct blk_io_trace *t,
> +			    const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret) {
> +		if (t_sec(t))
> +			ret = trace_seq_printf(&iter->seq, "%llu + %u [%d]\n",
> +					       (unsigned long long)t->sector,
> +					       t_sec(t), t->error);
> +		else
> +			ret = trace_seq_printf(&iter->seq, "%llu [%d]\n",
> +					       (unsigned long long)t->sector,
> +					       t->error);
> +	}
> +	return ret;
> +}
> +
> +static int blk_log_remap(struct trace_iterator *iter, struct blk_io_trace *t)
> +{
> +	int ret = blk_log_action(iter, t, "A");
> +	if (ret) {
> +		struct blk_io_trace_remap r = { .device = 0, };
> +
> +		get_pdu_remap(t, &r);
> +		t->device = r.device_from;
> +		ret = trace_seq_printf(&iter->seq, "%llu + %u <- (%d,%d) %llu\n",
> +				       (unsigned long long)t->sector,
> +				       t_sec(t), MAJOR(r.device), MINOR(r.device),
> +				       (unsigned long long)r.sector);
> +	}
> +	return ret;
> +}
> +
> +static int blk_log_plug(struct trace_iterator *iter, struct blk_io_trace *t)
> +{
> +	int ret = blk_log_action(iter, t, "P");
> +	if (ret)
> +		ret = trace_seq_printf(&iter->seq, "[%s]\n",
> +				       trace_find_cmdline(t->pid));
> +	return ret;
> +}
> +
> +static int blk_log_unplug(struct trace_iterator *iter, struct blk_io_trace *t,
> +			  const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret)
> +		ret = trace_seq_printf(&iter->seq, "[%s] %llu\n",
> +				       trace_find_cmdline(t->pid),
> +				       (unsigned long long)get_pdu_int(t));
> +	return ret;
> +}
> +
> +static int blk_log_split(struct trace_iterator *iter, struct blk_io_trace *t,
> +			 const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret)
> +		ret = trace_seq_printf(&iter->seq, "%llu / %llu [%s]\n",
> +				       (unsigned long long)t->sector,
> +				       (unsigned long long)get_pdu_int(t),
> +				       trace_find_cmdline(t->pid));
> +	return ret;
> +}
> +
> +/*
> + * struct tracer operations
> + */
> +
> +static void blk_tracer_print_header(struct seq_file *m)
> +{
> +	seq_puts(m, "# DEV   CPU TIMESTAMP     PID ACT FLG\n"
> +		    "#  |     |     |           |   |   |\n");
> +}
> +
> +static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct blk_io_trace *t = (struct blk_io_trace *)iter->ent;
> +	const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1);
> +	int ret;
> +
> +	switch (what) {
> +	case __BLK_TA_ISSUE:	    ret = blk_log_generic(iter, t, "D"); break;
> +	case __BLK_TA_INSERT:	    ret = blk_log_generic(iter, t, "I"); break;
> +	case __BLK_TA_QUEUE:	    ret = blk_log_generic(iter, t, "Q"); break;
> +	case __BLK_TA_BOUNCE:	    ret = blk_log_generic(iter, t, "B"); break;
> +	case __BLK_TA_BACKMERGE:    ret = blk_log_generic(iter, t, "M"); break;
> +	case __BLK_TA_FRONTMERGE:   ret = blk_log_generic(iter, t, "F"); break;
> +	case __BLK_TA_GETRQ:	    ret = blk_log_generic(iter, t, "G"); break;
> +	case __BLK_TA_SLEEPRQ:	    ret = blk_log_generic(iter, t, "S"); break;
> +	case __BLK_TA_REQUEUE:	    ret = blk_log_with_err(iter, t, "R"); break;
> +	case __BLK_TA_COMPLETE:	    ret = blk_log_with_err(iter, t, "C"); break;
> +	case __BLK_TA_PLUG:	    ret = blk_log_plug(iter, t); break;
> +	case __BLK_TA_UNPLUG_IO:    ret = blk_log_unplug(iter, t, "U"); break;
> +	case __BLK_TA_UNPLUG_TIMER: ret = blk_log_unplug(iter, t, "UT"); break;
> +	case __BLK_TA_REMAP:	    ret = blk_log_remap(iter, t); break;
> +	case __BLK_TA_SPLIT:	    ret = blk_log_split(iter, t, "X"); break;
> +	default:
> +		ret = trace_seq_printf(s, "Bad pc action %x\n", what);
> +		break;
> +	}
> +
> +	if (ret == 0)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	return TRACE_TYPE_HANDLED;
> +}
> +
> +static void blk_tracer_start(struct trace_array *tr)
> +{
> +	int cpu;
> +
> +	tr->time_start = ftrace_now(tr->cpu);
> +
> +	for_each_online_cpu(cpu)
> +		tracing_reset(tr, cpu);
> +
> +	mutex_lock(&blk_probe_mutex);
> +	if (atomic_add_return(1, &blk_probes_ref) == 1)
> +		if (blk_register_tracepoints())
> +			atomic_dec(&blk_probes_ref);
> +	mutex_unlock(&blk_probe_mutex);


Is this mutex really needed? You already use an atomic test that does
the guarding.
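
i.e. just (untested):

	if (atomic_add_return(1, &blk_probes_ref) == 1)
		if (blk_register_tracepoints())
			atomic_dec(&blk_probes_ref);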


> +}
> +
> +static int blk_tracer_init(struct trace_array *tr)
> +{
> +	blk_tr = tr;
> +	blk_tracer_start(tr);
> +	mutex_lock(&blk_probe_mutex);
> +	blk_tracer_enabled++;


I think you will not have more than one blk_tracer_init() call
while your tracer is enabled.

So, unless there are other callsites that use your tracer, I guess this counter
is useless.

> +	mutex_unlock(&blk_probe_mutex);
> +	return 0;
> +}
> +
> +static void blk_tracer_stop(struct trace_array *tr)
> +{
> +	mutex_lock(&blk_probe_mutex);
> +	if (atomic_dec_and_test(&blk_probes_ref))
> +		blk_unregister_tracepoints();
> +	mutex_unlock(&blk_probe_mutex);
> +}
> +
> +static void blk_tracer_reset(struct trace_array *tr)
> +{
> +	if (!atomic_read(&blk_probes_ref))
> +		return;
> +
> +	mutex_lock(&blk_probe_mutex);
> +	blk_tracer_enabled--;
> +	WARN_ON(blk_tracer_enabled < 0);


Ditto, I don't think that can happen.


> +	mutex_unlock(&blk_probe_mutex);
> +
> +	blk_tracer_stop(tr);
> +}
> +
> +static struct tracer blk_tracer __read_mostly = {
> +	.name		= "blk",
> +	.init		= blk_tracer_init,
> +	.reset		= blk_tracer_reset,
> +	.start		= blk_tracer_start,
> +	.stop		= blk_tracer_stop,
> +	.print_header	= blk_tracer_print_header,
> +	.print_line	= blk_tracer_print_line,
> +};
> +
> +static int __init init_blk_tracer(void)
> +{
> +	return register_tracer(&blk_tracer);
> +}
> +
> +device_initcall(init_blk_tracer);
> +
> +static int blk_trace_remove_queue(struct request_queue *q)
> +{
> +	struct blk_trace *bt;
> +
> +	bt = xchg(&q->blk_trace, NULL);
> +	if (bt == NULL)
> +		return -EINVAL;
> +
> +	kfree(bt);
> +	return 0;
> +}
> +
> +/*
> + * Setup everything required to start tracing
> + */
> +static int blk_trace_setup_queue(struct request_queue *q, dev_t dev)
> +{
> +	struct blk_trace *old_bt, *bt = NULL;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
> +	if (!bt)
> +		goto err;
> +
> +	bt->dev = dev;
> +	bt->act_mask = (u16)-1;

That looks strange at first glance.
Why not ~0? But that doesn't really matter...

> +	bt->end_lba = -1ULL;
> +	bt->trace_state = Blktrace_running;
> +
> +	old_bt = xchg(&q->blk_trace, bt);
> +	if (old_bt != NULL) {
> +		(void)xchg(&q->blk_trace, old_bt);
> +		kfree(bt);
> +		return -EBUSY;
> +	}
> +	return 0;
> +err:
> +	return ret;
> +}
> +
> +/*
> + * sysfs interface to enable and configure tracing
> + */
> +
> +static ssize_t sysfs_blk_trace_enable_show(struct device *dev,
> +					   struct device_attribute *attr,
> +					   char *buf)
> +{
> +	struct hd_struct *p = dev_to_part(dev);
> +	struct block_device *bdev;
> +	ssize_t ret = -ENXIO;
> +
> +	lock_kernel();

Ouch, I hope you will add a comment to explain why you are using the BKL
here; I guess this is block-related, but I'm not sure of anything :-)

> +	bdev = bdget(part_devt(p));
> +	if (bdev != NULL) {
> +		struct request_queue *q = bdev_get_queue(bdev);
> +
> +		if (q != NULL) {
> +			mutex_lock(&bdev->bd_mutex);
> +			ret = sprintf(buf, "%u\n", !!q->blk_trace);

I guess you can be a bit racy here and drop the mutex. That wouldn't matter.

> +			mutex_unlock(&bdev->bd_mutex);
> +		}
> +
> +		bdput(bdev);
> +	}
> +
> +	unlock_kernel();
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_enable_store(struct device *dev,
> +					    struct device_attribute *attr,
> +					    const char *buf, size_t count)
> +{
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct hd_struct *p;
> +	int value;
> +	ssize_t ret = -ENXIO;
> +
> +	if (count == 0 || sscanf(buf, "%d", &value) != 1)
> +		goto out;
> +
> +	lock_kernel();
> +	p = dev_to_part(dev);
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +
> +	mutex_lock(&bdev->bd_mutex);
> +	if (value)
> +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> +	else
> +		ret = blk_trace_remove_queue(q);
> +	mutex_unlock(&bdev->bd_mutex);
> +
> +	if (ret == 0)
> +		ret = count;
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +out:
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count);
> +#define BLK_TRACE_DEVICE_ATTR(_name) \
> +	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
> +		    sysfs_blk_trace_attr_show, \
> +		    sysfs_blk_trace_attr_store)
> +
> +static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR,
> +		   sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store);
> +static BLK_TRACE_DEVICE_ATTR(act_mask);
> +static BLK_TRACE_DEVICE_ATTR(pid);
> +static BLK_TRACE_DEVICE_ATTR(start_lba);
> +static BLK_TRACE_DEVICE_ATTR(end_lba);
> +
> +static struct attribute *blk_trace_attrs[] = {
> +	&dev_attr_enable.attr,
> +	&dev_attr_act_mask.attr,
> +	&dev_attr_pid.attr,
> +	&dev_attr_start_lba.attr,
> +	&dev_attr_end_lba.attr,
> +	NULL
> +};
> +
> +struct attribute_group blk_trace_attr_group = {
> +	.name  = "trace",
> +	.attrs = blk_trace_attrs,
> +};
> +
> +static int blk_str2act_mask(const char *str)
> +{
> +	int mask = 0;
> +	char *copy = kstrdup(str, GFP_KERNEL), *s;
> +
> +	if (copy == NULL)
> +		return -ENOMEM;
> +
> +	s = strstrip(copy);
> +
> +	while (1) {
> +		char *sep = strchr(s, ',');
> +
> +		if (sep != NULL)
> +			*sep = '\0';
> +
> +		if (strcasecmp(s, "barrier") == 0)
> +			mask |= BLK_TC_BARRIER;
> +		else if (strcasecmp(s, "complete") == 0)
> +			mask |= BLK_TC_COMPLETE;
> +		else if (strcasecmp(s, "fs") == 0)
> +			mask |= BLK_TC_FS;
> +		else if (strcasecmp(s, "issue") == 0)
> +			mask |= BLK_TC_ISSUE;
> +		else if (strcasecmp(s, "pc") == 0)
> +			mask |= BLK_TC_PC;
> +		else if (strcasecmp(s, "queue") == 0)
> +			mask |= BLK_TC_QUEUE;
> +		else if (strcasecmp(s, "read") == 0)
> +			mask |= BLK_TC_READ;
> +		else if (strcasecmp(s, "requeue") == 0)
> +			mask |= BLK_TC_REQUEUE;
> +		else if (strcasecmp(s, "sync") == 0)
> +			mask |= BLK_TC_SYNC;
> +		else if (strcasecmp(s, "write") == 0)
> +			mask |= BLK_TC_WRITE;
> +
> +		if (sep == NULL)
> +			break;
> +
> +		s = sep + 1;
> +	}
> +	kfree(copy);
> +
> +	return mask;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct hd_struct *p = dev_to_part(dev);
> +	struct request_queue *q;
> +	struct block_device *bdev;
> +	ssize_t ret = -ENXIO;
> +
> +	lock_kernel();
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +	mutex_lock(&bdev->bd_mutex);
> +	if (q->blk_trace == NULL)
> +		ret = sprintf(buf, "disabled\n");
> +	else if (attr == &dev_attr_act_mask)
> +		ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask);
> +	else if (attr == &dev_attr_pid)
> +		ret = sprintf(buf, "%u\n", q->blk_trace->pid);
> +	else if (attr == &dev_attr_start_lba)
> +		ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
> +	else if (attr == &dev_attr_end_lba)
> +		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
> +	mutex_unlock(&bdev->bd_mutex);
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct hd_struct *p;
> +	u64 value;
> +	ssize_t ret = -ENXIO;
> +
> +	if (count == 0)
> +		goto out;
> +
> +	if (attr == &dev_attr_act_mask) {
> +		if (sscanf(buf, "%llx", &value) != 1) {
> +			/* Assume it is a list of trace category names */
> +			int mask = blk_str2act_mask(buf);
> +			if (mask < 0)
> +				goto out;
> +			value = mask;
> +		}
> +	} else if (sscanf(buf, "%llu", &value) != 1)
> +		goto out;
> +
> +	lock_kernel();
> +	p = dev_to_part(dev);
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +
> +	mutex_lock(&bdev->bd_mutex);
> +	ret = 0;
> +	if (q->blk_trace == NULL)
> +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> +
> +	if (ret == 0) {
> +		if (attr == &dev_attr_act_mask)
> +			q->blk_trace->act_mask = value;
> +		else if (attr == &dev_attr_pid)
> +			q->blk_trace->pid = value;
> +		else if (attr == &dev_attr_start_lba)
> +			q->blk_trace->start_lba = value;
> +		else if (attr == &dev_attr_end_lba)
> +			q->blk_trace->end_lba = value;
> +		ret = count;
> +	}
> +	mutex_unlock(&bdev->bd_mutex);
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +out:
> +	return ret;
> +}
> diff --git a/fs/partitions/check.c b/fs/partitions/check.c
> index 6d72024..01714ef 100644
> --- a/fs/partitions/check.c
> +++ b/fs/partitions/check.c
> @@ -268,6 +268,10 @@ ssize_t part_fail_store(struct device *dev,
>  }
>  #endif
>  
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +extern struct attribute_group blk_trace_attr_group;
> +#endif
> +
>  static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
>  static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
>  static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
> @@ -294,6 +298,9 @@ static struct attribute_group part_attr_group = {
>  
>  static struct attribute_group *part_attr_groups[] = {
>  	&part_attr_group,
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +	&blk_trace_attr_group,
> +#endif
>  	NULL
>  };



This tracer looks good :-)

  



* Re: [RFC][PATCH] ftrace interface for blktrace
From: Ingo Molnar @ 2009-01-17 19:09 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jens Axboe, Steven Rostedt,
	Frederic Weisbecker, linux-kernel


* Arnaldo Carvalho de Melo <acme@redhat.com> wrote:

> Hi Jens,
> 
> 	The patch below adds an ftrace interface for blktrace, allowing
> people to use it without requiring any userspace tools, using sysfs to
> set up the act_mask, pid, start_lba and end_lba filters.

Very nice patch!

Jens, i'm wondering what your take on this direction is. I think the
consolidation effect is great and the built-in IO tracing capabilities are 
very nice.

We can still get the raw events too and do user-space post-processing, 
when that is desired - so this does not limit anything that blktrace was 
able to do before - it only extends on it.

Is there any particular blktrace feature you can think of that is not 
present in the blktrace ftrace plugin?

	Ingo


* Re: [RFC][PATCH] ftrace interface for blktrace
From: Jens Axboe @ 2009-01-17 19:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Frederic Weisbecker,
	linux-kernel

On Sat, Jan 17 2009, Ingo Molnar wrote:
> 
> * Arnaldo Carvalho de Melo <acme@redhat.com> wrote:
> 
> > Hi Jens,
> > 
> > 	The patch below adds an ftrace interface for blktrace, allowing
> > people to use it without requiring any userspace tools, using sysfs to
> > set up the act_mask, pid, start_lba and end_lba filters.
> 
> Very nice patch!
> 
> Jens, i'm wondering what your take on this direction is. I think the
> consolidation effect is great and the built-in IO tracing capabilities are 
> very nice.

I like the current patch, from a quick look. I'll look more soonish
(Monday). I've always liked the concept of being able to mix various
traces into the same stream, since it makes it MUCH easier to see what
on earth is going wrong. What I didn't like in the patches Acme did was
the enable part, I'll need to look into that. Again, from a quick look,
what is the BKL doing in there?!

> We can still get the raw events too and do user-space post-processing, 
> when that is desired - so this does not limit anything that blktrace was 
> able to do before - it only extends on it.
> 
> Is there any particular blktrace feature you can think of that is not 
> present in the blktrace ftrace plugin?

Can't answer that yet, I'll give it a more thorough look next week.

-- 
Jens Axboe



* Re: [RFC][PATCH] ftrace interface for blktrace
From: Arnaldo Carvalho de Melo @ 2009-01-17 20:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker, linux-kernel

Em Sat, Jan 17, 2009 at 08:14:30PM +0100, Jens Axboe escreveu:
> On Sat, Jan 17 2009, Ingo Molnar wrote:
> > 
> > * Arnaldo Carvalho de Melo <acme@redhat.com> wrote:
> > 
> > > Hi Jens,
> > > 
> > > 	The patch below adds an ftrace interface for blktrace, allowing
> > > people to use it without requiring any userspace tools, using sysfs to
> > > set up the act_mask, pid, start_lba and end_lba filters.
> > 
> > Very nice patch!
> > 
> > Jens, i'm wondering what your take on this direction is. I think the
> > consolidation effect is great and the built-in IO tracing capabilities are 
> > very nice.
> 
> I like the current patch, from a quick look. I'll look more soonish
> (Monday). I've always liked the concept of being able to mix various
> traces into the same stream, since it makes it MUCH easier to see what
> on earth is going wrong. What I didn't like in the patches Acme did was
> the enable part, I'll need to look into that. Again, from a quick look,
> what is the BKL doing in there?!

I'm just trying to keep the locking expectations of the previous, similar
path, in block/ioctl.c:blkdev_ioctl, where we set up q->blk_trace:

        case BLKTRACESTART:
        case BLKTRACESTOP:
        case BLKTRACESETUP:
        case BLKTRACETEARDOWN:
                lock_kernel();
                ret = blk_trace_ioctl(bdev, cmd, (char __user *) arg);
                unlock_kernel();
                break;

So I took the safest path and grabbed the big K lock.
 
> > We can still get the raw events too and do user-space post-processing, 
> > when that is desired - so this does not limit anything that blktrace was 
> > able to do before - it only extends on it.
> > 
> > Is there any particular blktrace feature you can think of that is not 
> > present in the blktrace ftrace plugin?
> 
> Can't answer that yet, I'll give it a more thorough look next week.

Looking forward to that. And I'll get it split into multiple patches,
hopefully by Monday.

- Arnaldo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] ftrace interface for blktrace
  2009-01-17 13:37 ` Frederic Weisbecker
@ 2009-01-17 20:22   ` Arnaldo Carvalho de Melo
  2009-01-19 13:33     ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2009-01-17 20:22 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Jens Axboe, Ingo Molnar, Steven Rostedt, linux-kernel

On Sat, Jan 17, 2009 at 02:37:39PM +0100, Frederic Weisbecker wrote:
> On Fri, Jan 16, 2009 at 04:16:11PM -0200, Arnaldo Carvalho de Melo wrote:
> > Hi Jens,
> > 
> > 	The patch below adds a ftrace interface for blktrace, allowing
> > people to use it without any required userspace tools, using sysfs to setup the
> > act_mask, pid, start_lba, end_lba.
> > 	Example usage:
> > 
> > [root@f10-1 ~]# mount -t debugfs none /sys/kernel/debug/
> > [root@f10-1 ~]# ls /sys/block/sda/sda1/trace/
> > act_mask  enable  end_lba  pid  start_lba
> 
> 
> That reminds me it would be good to provide a more powerful tracer options
> API, to set numeric or string values for the options, like the kernel
> parameters... So you would avoid having to implement your own way to do it.
> I should put my hands into it one day.

I thought about that as well, and did a very quick (as in almost didn't)
look at other places in the kernel for list-of-strings-to-bitmap
facilities, but couldn't find any after some 5 or 10 seconds :-)
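
What I was hoping to find was something generic along these lines. A
totally untested sketch, the name and signature are made up, but note
that mapping the Nth name in a NULL-terminated table to the Nth bit
would even match how the BLK_TC_* bits happen to be laid out:

static int str2mask(const char *str, const char * const *names)
{
	int i, mask = 0;
	char *copy = kstrdup(str, GFP_KERNEL), *s;

	if (copy == NULL)
		return -ENOMEM;

	s = strstrip(copy);

	while (1) {
		char *sep = strchr(s, ',');

		if (sep != NULL)
			*sep = '\0';

		/* Accept the Nth known name as the Nth bit */
		for (i = 0; names[i] != NULL; i++)
			if (strcasecmp(s, names[i]) == 0) {
				mask |= 1 << i;
				break;
			}

		if (sep == NULL)
			break;
		s = sep + 1;
	}
	kfree(copy);

	return mask;
}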
 
> > [root@f10-1 ~]# cat /sys/kernel/debug/tracing/available_tracers 
> > blk function_graph wakeup function sched_switch nop
> > [root@f10-1 ~]# echo blk > /sys/kernel/debug/tracing/current_tracer 
> > [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace
> > # tracer: blk
> > #
> > # DEV   CPU TIMESTAMP     PID ACT FLG
> > #  |     |     |           |   |   |
> > [root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable 
> > [root@f10-1 ~]# tail /sys/kernel/debug/tracing/trace
> >   8,1    0   143.070085  1208  P  NS [tail]
> >   8,1    0   143.070087  1208  I RBS 4154375 + 8 [tail]
> >   8,1    0   143.070091  1208  A WBS 4155391 + 16 <- (8,1) 4155328
> >   8,1    0   143.070091  1208  Q   R 4155391 + 16 [tail]
> >   8,1    0   143.070094  1208  G  RB 4155391 + 16 [tail]
> >   8,1    0   143.070094  1208  I RBS 4155391 + 16 [tail]
> >   8,1    0   143.070097  1208  U  WS [tail] 2
> >   8,1    0   143.070101  1208  D  WB 4154375 + 8 [tail]
> >   8,1    0   143.070341  1208  D  WB 4155391 + 16 [tail]
> >   8,1    0   143.070635     0  C  RS 4154375 + 8 [0]
> 
> 
> nice!

echo verbose > /d/tracing/iter_ctrl

will, in the next patchset, convert those one- (or two-) letter action (ACT)
codes into something more intelligible for the block layer impaired :)
 
> 
> > [root@f10-1 ~]#
> > 
> > 	Filters can be set using:
> > 
> > [root@f10-1 ~]# echo write,sync > /sys/block/sda/sda1/trace/act_mask 
> > [root@f10-1 ~]# cat /sys/block/sda/sda1/trace/act_mask 
> > 0xa
> 
> It would be better to see the thing as a string, like you set it.

Agreed, will do; I was just impatient to get the patch posted after too
much context switching over the last week.
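
For the record, the reading side I have in mind is just the inverse
table lookup. Untested sketch, mask_maps and the helper name are made
up, but the names and bits match what blk_str2act_mask() in the patch
accepts, so the 0xa above would come back as "write,sync":

static const struct {
	const char *name;
	int	    mask;
} mask_maps[] = {
	{ "read",	BLK_TC_READ },
	{ "write",	BLK_TC_WRITE },
	{ "barrier",	BLK_TC_BARRIER },
	{ "sync",	BLK_TC_SYNC },
	{ "queue",	BLK_TC_QUEUE },
	{ "requeue",	BLK_TC_REQUEUE },
	{ "issue",	BLK_TC_ISSUE },
	{ "complete",	BLK_TC_COMPLETE },
	{ "fs",		BLK_TC_FS },
	{ "pc",		BLK_TC_PC },
};

/* Format an act_mask back into the names accepted on write */
static ssize_t blk_act_mask2str(char *buf, int mask)
{
	int i;
	char *p = buf;

	for (i = 0; i < ARRAY_SIZE(mask_maps); i++) {
		if (mask & mask_maps[i].mask) {
			if (p != buf)
				*p++ = ',';
			p += sprintf(p, "%s", mask_maps[i].name);
		}
	}
	*p++ = '\n';

	return p - buf;
}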
 
> > 	One can go on interactively adding more filters, for instance:
> > 
> > [root@f10-1 ~]# echo `pidof kjournald` > /sys/block/sda/sda1/trace/pid
> > [root@f10-1 ~]# cat /sys/block/sda/sda1/trace/pid
> > 423
> 
> 
> Or why not simply kjournald? :-)

Unsure, perhaps

kjournald(423)

if iter_ctrl has "verbose"?
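
i.e. something like this in the sysfs show routine. Untested, and
whether trace_find_cmdline() still knows about that pid at show time is
another question:

	else if (attr == &dev_attr_pid) {
		/* Print "comm(pid)" when the verbose iter flag is set */
		if (trace_flags & TRACE_ITER_VERBOSE)
			ret = sprintf(buf, "%s(%u)\n",
				      trace_find_cmdline(q->blk_trace->pid),
				      q->blk_trace->pid);
		else
			ret = sprintf(buf, "%u\n", q->blk_trace->pid);
	}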
 
> 
> > 	And use the existing ftrace infrastructure to, for instance,
> > replicate the btrace (blktrace -o - | blkparse) functionality by simply
> > using cat:
> > 
> > [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
> >   8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
> >   8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
> >   8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
> >   8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
> >   8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
> >   8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
> >   8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
> >   8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288
> > CONTROL+C
> > [root@f10-1 ~]#
> > 
> > 	This patch was done on Linus tree, I'll now work with it on
> > linux-tip so that I can experiment with binary tracing, i.e to ask for
> > the output of /sys/kernel/debug/tracing/trace_pipe to be raw, exactly
> > the one provide by relayfs based variant, so that we will be able to
> > continue using blkparse to validate the trace entries, as a debugging
> > aid.
> 
> 
> But I guess there will be some changes to bring to the tracing core to
> make it support a binary flow...

I have to do some checking, but probably there will still be interest in
having the kernel do as little as possible for high volume tracing and
letting the, in this case, already existing raw post-processing tools do
the pretty-printing in userspace.
 
> > 	Also there are other interesting things that one gets from using
> > the ftrace infrastructure, such as being able to set see traffic that
> > happens in just a cpuset (/sys/kernel/debug/tracing/tracing_cpumask),
> > and others that I plan to add such as what Steve did in "[PATCH 2/2]
> > ftrace: add stack trace to function tracer", so that we could have
> > things like (made up example):
> > 
> > [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
> >   8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
> >   8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
> >   8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
> >   8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
> >  <= sync_buffer
> >  <= __wait_on_bit
> >  <= out_of_line_wait_on_bit
> >  <= __wait_on_buffer
> >  <= sync_dirty_buffer
> >  <= journal_commit_transaction
> >  <= kjournald
> >   8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
> >   8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
> >   8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
> >   8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288
> 
> 
> Cool, but I guess you will not need this ftrace plugin to do it.
> Stack trace entries for tracing have already been supported for a while.

Pardon my ignorance, but how could one combine blk + stack_trace without
any kind of support in the blk tracer? Of course I even _expect_ this to
be possible without much ado, but I haven't reached that enlightenment
yet :-)
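
My naive guess is that the tracer still has to emit the stack entry
itself, right after committing its own event, roughly like below
(untested, pc being the preempt count saved earlier), but I may well be
missing an existing hook:

	/* Commit the blk event, then place a stack trace entry right
	 * behind it in the same ring buffer when "stacktrace" is set */
	ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
	if (trace_flags & TRACE_ITER_STACKTRACE)
		__trace_stack(blk_tr, NULL, flags, 5, pc);
	trace_wake_up();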
 
> > 	More polishing is needed, I wrote the patch this way just to
> > show where things are different, with new code, for registering the
> > tracer, bits stolen from the userspace blktrace, allowing config via
> > sysfs at the bottom.
> > 
> > 	Ideally we would move this to kernel/trace/block.c, break it
> > into smaller self contained patches, etc.
> > 
> > Comments?
> > 
> > - Arnaldo
> > 
> >  block/blktrace.c      |  578 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  fs/partitions/check.c |    7 
> >  2 files changed, 583 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blktrace.c b/block/blktrace.c
> > index b0a2cae..10985f9 100644
> > --- a/block/blktrace.c
> > +++ b/block/blktrace.c
> > @@ -25,9 +25,13 @@
> >  #include <linux/time.h>
> >  #include <trace/block.h>
> >  #include <asm/uaccess.h>
> > +#include <../kernel/trace/trace.h>
> 
> 
> Yeah, it really needs to be moved on kernel/trace :-)
> 
> 
> >  static unsigned int blktrace_seq __read_mostly = 1;
> >  
> > +static struct trace_array *blk_tr;
> > +static int __read_mostly  blk_tracer_enabled;
> > +
> >  /* Global reference count of probes */
> >  static DEFINE_MUTEX(blk_probe_mutex);
> >  static atomic_t blk_probes_ref = ATOMIC_INIT(0);
> > @@ -43,6 +47,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
> >  {
> >  	struct blk_io_trace *t;
> >  
> > +	if (!bt->rchan)
> > +		return;
> > +
> >  	t = relay_reserve(bt->rchan, sizeof(*t) + len);
> >  	if (t) {
> >  		const int cpu = smp_processor_id();
> > @@ -90,6 +97,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
> >  	unsigned long flags;
> >  	char *buf;
> >  
> > +	if (!bt->msg_data)
> > +		return;
> > +
> >  	local_irq_save(flags);
> >  	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
> >  	va_start(args, fmt);
> > @@ -131,13 +141,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> >  		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
> >  {
> >  	struct task_struct *tsk = current;
> > +	struct ring_buffer_event *event = NULL;
> >  	struct blk_io_trace *t;
> >  	unsigned long flags;
> >  	unsigned long *sequence;
> >  	pid_t pid;
> >  	int cpu;
> >  
> > -	if (unlikely(bt->trace_state != Blktrace_running))
> > +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
> >  		return;
> >  
> >  	what |= ddir_act[rw & WRITE];
> > @@ -158,6 +169,25 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> >  	 */
> >  	local_irq_save(flags);
> 
> 
> Is the local_irq_save still needed here? Note that the ring buffer is safe by itself.

Oh well, I didn't even try to check that, just used a template from the
ctx tracer from some time ago; will recheck that.
 
> > +	if (blk_tr) {
> > +		struct trace_array_cpu *data;
> > +
> > +		tracing_record_cmdline(current);
> > +		cpu = raw_smp_processor_id();
> 
> 
> I think you can use smp_processor_id(). There is no risk of tracing recursion here.

another c'n'p mindless oversight
 
> 
> > +		data = blk_tr->data[cpu];
> > +
> > +		if (unlikely(atomic_read(&data->disabled)))
> > +			goto out_local_irq_restore;
> > +
> > +		event = ring_buffer_lock_reserve(blk_tr->buffer,
> > +						 sizeof(*t) + pdu_len, &flags);
> > +		if (!event)
> > +			goto out_local_irq_restore;
> > +		
> > +		t = ring_buffer_event_data(event);
> > +		goto record_it;
> > +	}
> > +
> >  	if (unlikely(tsk->btrace_seq != blktrace_seq))
> >  		trace_note_tsk(bt, tsk);
> >  
> > @@ -169,19 +199,26 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> >  		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
> >  		t->sequence = ++(*sequence);
> >  		t->time = ktime_to_ns(ktime_get());
> 
> You already have a timestamp for each entry on the ring buffer.

I know that, and please notice that this is only taken if we're using
the old relay based stuff (hint, look at the goto)

> > +		t->cpu = cpu;
> 
> 
> Ditto. But perhaps you can be preempted between the block action and the trace insertion.
> I'm not sure of the context here.. So perhaps you must keep it.

Ditto

> 
> > +record_it:
> >  		t->sector = sector;
> >  		t->bytes = bytes;
> >  		t->action = what;
> >  		t->pid = pid;
> >  		t->device = bt->dev;
> > -		t->cpu = cpu;
> >  		t->error = error;
> >  		t->pdu_len = pdu_len;
> >  
> >  		if (pdu_len)
> >  			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
> > +
> > +		if (blk_tr) {
> > +			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
> > +			trace_wake_up();
> > +		}
> >  	}
> >  
> > +out_local_irq_restore:
> >  	local_irq_restore(flags);
> >  }
> >  
> > @@ -888,3 +925,540 @@ static void blk_unregister_tracepoints(void)
> >  
> >  	tracepoint_synchronize_unregister();
> >  }
> > +
> > +/*
> > + * struct blk_io_tracer formatting routines
> > + */
> > +
> > +static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
> > +{
> > +        int i = 0;
> > +
> > +        if (t->action & BLK_TC_DISCARD)	   rwbs[i++] = 'D';
> > +        else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
> > +        else if (t->bytes)		   rwbs[i++] = 'R';
> > +        else				   rwbs[i++] = 'N';
> > +
> > +        if (t->action & BLK_TC_AHEAD)	   rwbs[i++] = 'A';
> > +        if (t->action & BLK_TC_BARRIER)	   rwbs[i++] = 'B';
> > +        if (t->action & BLK_TC_SYNC)	   rwbs[i++] = 'S';
> > +        if (t->action & BLK_TC_META)	   rwbs[i++] = 'M';
> > +
> > +        rwbs[i] = '\0';
> > +}
> > +
> > +static inline void *pdu_start(struct blk_io_trace *t)
> > +{
> > +	return t + 1;
> > +}
> > +
> > +static inline u32 t_sec(const struct blk_io_trace *t)
> > +{
> > +	return t->bytes >> 9;
> > +}
> > +
> > +static __u64 get_pdu_int(struct blk_io_trace *t)
> > +{
> > +	__u64 *val = pdu_start(t);
> > +
> > +	return be64_to_cpu(*val);
> > +}
> > +
> > +static void get_pdu_remap(struct blk_io_trace *t, struct blk_io_trace_remap *r)
> > +{
> > +	struct blk_io_trace_remap *__r = pdu_start(t);
> > +	__u64 sector = __r->sector;
> > +
> > +	r->device = be32_to_cpu(__r->device);
> > +	r->device_from = be32_to_cpu(__r->device_from);
> > +	r->sector = be64_to_cpu(sector);
> > +}
> > +
> > +static int blk_log_action(struct trace_iterator *iter, struct blk_io_trace *t,
> > +			  const char *act)
> > +{
> > +	char rwbs[6];
> > +	unsigned long long ts	= ns2usecs(iter->ts);
> > +	unsigned long usec_rem	= do_div(ts, USEC_PER_SEC);
> > +	unsigned secs		= (unsigned long)ts;
> > +
> > +	fill_rwbs(rwbs, t);
> > +	return trace_seq_printf(&iter->seq,
> > +				"%3d,%-3d %2d %5d.%06lu %5u %2s %3s ",
> > +				MAJOR(t->device), MINOR(t->device), iter->cpu,
> > +				secs, usec_rem, t->pid, act, rwbs);
> > +}
> > +
> > +static int blk_log_generic(struct trace_iterator *iter, struct blk_io_trace *t,
> > +			   const char *act)
> > +{
> > +	int ret = blk_log_action(iter, t, act);
> > +	if (ret) {
> > +		const char *cmd = trace_find_cmdline(t->pid);
> > +
> > +		if (t_sec(t))
> > +			ret = trace_seq_printf(&iter->seq, "%llu + %u [%s]\n",
> > +					       (unsigned long long)t->sector,
> > +					       t_sec(t), cmd);
> > +		else
> > +			ret = trace_seq_printf(&iter->seq, "[%s]\n", cmd);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_log_with_err(struct trace_iterator *iter, struct blk_io_trace *t,
> > +			    const char *act)
> > +{
> > +	int ret = blk_log_action(iter, t, act);
> > +	if (ret) {
> > +		if (t_sec(t))
> > +			ret = trace_seq_printf(&iter->seq, "%llu + %u [%d]\n",
> > +					       (unsigned long long)t->sector,
> > +					       t_sec(t), t->error);
> > +		else
> > +			ret = trace_seq_printf(&iter->seq, "%llu [%d]\n",
> > +					       (unsigned long long)t->sector,
> > +					       t->error);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_log_remap(struct trace_iterator *iter, struct blk_io_trace *t)
> > +{
> > +	int ret = blk_log_action(iter, t, "A");
> > +	if (ret) {
> > +		struct blk_io_trace_remap r = { .device = 0, };
> > +
> > +		get_pdu_remap(t, &r);
> > +		t->device = r.device_from;
> > +		ret = trace_seq_printf(&iter->seq, "%llu + %u <- (%d,%d) %llu\n",
> > +				       (unsigned long long)t->sector,
> > +				       t_sec(t), MAJOR(r.device), MINOR(r.device),
> > +				       (unsigned long long)r.sector);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_log_plug(struct trace_iterator *iter, struct blk_io_trace *t)
> > +{
> > +	int ret = blk_log_action(iter, t, "P");
> > +	if (ret)
> > +		ret = trace_seq_printf(&iter->seq, "[%s]\n",
> > +				       trace_find_cmdline(t->pid));
> > +	return ret;
> > +}
> > +
> > +static int blk_log_unplug(struct trace_iterator *iter, struct blk_io_trace *t,
> > +			  const char *act)
> > +{
> > +	int ret = blk_log_action(iter, t, act);
> > +	if (ret)
> > +		ret = trace_seq_printf(&iter->seq, "[%s] %u\n",
> > +				       trace_find_cmdline(t->pid),
> > +				       get_pdu_int(t));
> > +	return ret;
> > +}
> > +
> > +static int blk_log_split(struct trace_iterator *iter, struct blk_io_trace *t,
> > +			 const char *act)
> > +{
> > +	int ret = blk_log_action(iter, t, act);
> > +	if (ret)
> > +		ret = trace_seq_printf(&iter->seq, "%llu / %u [%s]\n",
> > +				       (unsigned long long)t->sector,
> > +				       get_pdu_int(t),
> > +				       trace_find_cmdline(t->pid));
> > +	return ret;
> > +}
> > +
> > +/*
> > + * struct tracer operations
> > + */
> > +
> > +static void blk_tracer_print_header(struct seq_file *m)
> > +{
> > +	seq_puts(m, "# DEV   CPU TIMESTAMP     PID ACT FLG\n"
> > +		    "#  |     |     |           |   |   |\n");
> > +}
> > +
> > +static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
> > +{
> > +	struct trace_seq *s = &iter->seq;
> > +	struct blk_io_trace *t = (struct blk_io_trace *)iter->ent;
> > +	const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1);
> > +	int ret;
> > +
> > +	switch (what) {
> > +	case __BLK_TA_ISSUE:	    ret = blk_log_generic(iter, t, "D"); break;
> > +	case __BLK_TA_INSERT:	    ret = blk_log_generic(iter, t, "I"); break;
> > +	case __BLK_TA_QUEUE:	    ret = blk_log_generic(iter, t, "Q"); break;
> > +	case __BLK_TA_BOUNCE:	    ret = blk_log_generic(iter, t, "B"); break;
> > +	case __BLK_TA_BACKMERGE:    ret = blk_log_generic(iter, t, "M"); break;
> > +	case __BLK_TA_FRONTMERGE:   ret = blk_log_generic(iter, t, "F"); break;
> > +	case __BLK_TA_GETRQ:	    ret = blk_log_generic(iter, t, "G"); break;
> > +	case __BLK_TA_SLEEPRQ:	    ret = blk_log_generic(iter, t, "S"); break;
> > +	case __BLK_TA_REQUEUE:	    ret = blk_log_with_err(iter, t, "R"); break;
> > +	case __BLK_TA_COMPLETE:	    ret = blk_log_with_err(iter, t, "C"); break;
> > +	case __BLK_TA_PLUG:	    ret = blk_log_plug(iter, t); break;
> > +	case __BLK_TA_UNPLUG_IO:    ret = blk_log_unplug(iter, t, "U"); break;
> > +	case __BLK_TA_UNPLUG_TIMER: ret = blk_log_unplug(iter, t, "UT"); break;
> > +	case __BLK_TA_REMAP:	    ret = blk_log_remap(iter, t); break;
> > +	case __BLK_TA_SPLIT:	    ret = blk_log_split(iter, t, "X"); break;
> > +	default:
> > +		ret = trace_seq_printf(s, "Bad pc action %x\n", what);
> > +		break;
> > +	}
> > +
> > +	if (ret == 0)
> > +		return TRACE_TYPE_PARTIAL_LINE;
> > +
> > +	return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +static void blk_tracer_start(struct trace_array *tr)
> > +{
> > +	int cpu;
> > +
> > +	tr->time_start = ftrace_now(tr->cpu);
> > +
> > +	for_each_online_cpu(cpu)
> > +		tracing_reset(tr, cpu);
> > +
> > +	mutex_lock(&blk_probe_mutex);
> > +	if (atomic_add_return(1, &blk_probes_ref) == 1)
> > +		if (blk_register_tracepoints())
> > +			atomic_dec(&blk_probes_ref);
> > +	mutex_unlock(&blk_probe_mutex);
> 
> 
> Is this mutex really needed? You already use an atomic test that makes
> your wall.
> 
> 
> > +}
> > +
> > +static int blk_tracer_init(struct trace_array *tr)
> > +{
> > +	blk_tr = tr;
> > +	blk_tracer_start(tr);
> > +	mutex_lock(&blk_probe_mutex);
> > +	blk_tracer_enabled++;
> 
> 
> I think you will not have more than one blk_tracer_init() call
> while your tracer is enabled.
> 
> So, unless there are other callsites that use your tracer, I guess this counter
> is useless.

Groan... I'll have to think about this and go over the other tracers as well...
 
> > +	mutex_unlock(&blk_probe_mutex);
> > +	return 0;
> > +}
> > +
> > +static void blk_tracer_stop(struct trace_array *tr)
> > +{
> > +	mutex_lock(&blk_probe_mutex);
> > +	if (atomic_dec_and_test(&blk_probes_ref))
> > +		blk_unregister_tracepoints();
> > +	mutex_unlock(&blk_probe_mutex);
> > +}
> > +
> > +static void blk_tracer_reset(struct trace_array *tr)
> > +{
> > +	if (!atomic_read(&blk_probes_ref))
> > +		return;
> > +
> > +	mutex_lock(&blk_probe_mutex);
> > +	blk_tracer_enabled--;
> > +	WARN_ON(blk_tracer_enabled < 0);
> 
> 
> Ditto, I don't think that can happen.
> 
> 
> > +	mutex_unlock(&blk_probe_mutex);
> > +
> > +	blk_tracer_stop(tr);
> > +}
> > +
> > +static struct tracer blk_tracer __read_mostly = {
> > +	.name		= "blk",
> > +	.init		= blk_tracer_init,
> > +	.reset		= blk_tracer_reset,
> > +	.start		= blk_tracer_start,
> > +	.stop		= blk_tracer_stop,
> > +	.print_header	= blk_tracer_print_header,
> > +	.print_line	= blk_tracer_print_line,
> > +};
> > +
> > +static int __init init_blk_tracer(void)
> > +{
> > +	return register_tracer(&blk_tracer);
> > +}
> > +
> > +device_initcall(init_blk_tracer);
> > +
> > +static int blk_trace_remove_queue(struct request_queue *q)
> > +{
> > +	struct blk_trace *bt;
> > +
> > +	bt = xchg(&q->blk_trace, NULL);
> > +	if (bt == NULL)
> > +		return -EINVAL;
> > +
> > +	kfree(bt);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Setup everything required to start tracing
> > + */
> > +static int blk_trace_setup_queue(struct request_queue *q, dev_t dev)
> > +{
> > +	struct blk_trace *old_bt, *bt = NULL;
> > +	int ret;
> > +
> > +	ret = -ENOMEM;
> > +	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
> > +	if (!bt)
> > +		goto err;
> > +
> > +	bt->dev = dev;
> > +	bt->act_mask = (u16)-1;
> 
> That looks strange at first sight.
> Why not ~0 ? But that doesn't really matter...
> 
> > +	bt->end_lba = -1ULL;
> > +	bt->trace_state = Blktrace_running;
> > +
> > +	old_bt = xchg(&q->blk_trace, bt);
> > +	if (old_bt != NULL) {
> > +		(void)xchg(&q->blk_trace, old_bt);
> > +		kfree(bt);
> > +		ret = -EBUSY;
> > +	}
> > +	return 0;
> > +err:
> > +	return ret;
> > +}
> > +
> > +/*
> > + * sysfs interface to enable and configure tracing
> > + */
> > +
> > +static ssize_t sysfs_blk_trace_enable_show(struct device *dev,
> > +					   struct device_attribute *attr,
> > +					   char *buf)
> > +{
> > +	struct hd_struct *p = dev_to_part(dev);
> > +	struct block_device *bdev;
> > +	ssize_t ret = -ENXIO;
> > +
> > +	lock_kernel();
> 
> Ouch, I hope you will add a comment to explain why you are using the BKL
> here; I guess this is block related, but I'm not sure of anything :-)

Look at my response to Jens, and more importantly, to whatever answers
he provides for my questions there :-)

> > +	bdev = bdget(part_devt(p));
> > +	if (bdev != NULL) {
> > +		struct request_queue *q = bdev_get_queue(bdev);
> > +
> > +		if (q != NULL) {
> > +			mutex_lock(&bdev->bd_mutex);
> > +			ret = sprintf(buf, "%u\n", !!q->blk_trace);
> 
> I guess you can be a bit racy here and drop the mutex. That wouldn't matter.
> 
> > +			mutex_unlock(&bdev->bd_mutex);
> > +		}
> > +
> > +		bdput(bdev);
> > +	}
> > +
> > +	unlock_kernel();
> > +	return ret;
> > +}
> > +
> > +static ssize_t sysfs_blk_trace_enable_store(struct device *dev,
> > +					    struct device_attribute *attr,
> > +					    const char *buf, size_t count)
> > +{
> > +	struct block_device *bdev;
> > +	struct request_queue *q;
> > +	struct hd_struct *p;
> > +	int value;
> > +	ssize_t ret = -ENXIO;
> > +
> > +	if (count == 0 || sscanf(buf, "%d", &value) != 1)
> > +		goto out;
> > +
> > +	lock_kernel();
> > +	p = dev_to_part(dev);
> > +	bdev = bdget(part_devt(p));
> > +	if (bdev == NULL)
> > +		goto out_unlock_kernel;
> > +
> > +	q = bdev_get_queue(bdev);
> > +	if (q == NULL)
> > +		goto out_bdput;
> > +
> > +	mutex_lock(&bdev->bd_mutex);
> > +	if (value)
> > +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> > +	else
> > +		ret = blk_trace_remove_queue(q);
> > +	mutex_unlock(&bdev->bd_mutex);
> > +
> > +	if (ret == 0)
> > +		ret = count;
> > +out_bdput:
> > +	bdput(bdev);
> > +out_unlock_kernel:
> > +	unlock_kernel();
> > +out:
> > +	return ret;
> > +}
> > +
> > +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> > +					 struct device_attribute *attr,
> > +					 char *buf);
> > +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> > +					  struct device_attribute *attr,
> > +					  const char *buf, size_t count);
> > +#define BLK_TRACE_DEVICE_ATTR(_name) \
> > +	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
> > +		    sysfs_blk_trace_attr_show, \
> > +		    sysfs_blk_trace_attr_store)
> > +
> > +static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR,
> > +		   sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store);
> > +static BLK_TRACE_DEVICE_ATTR(act_mask);
> > +static BLK_TRACE_DEVICE_ATTR(pid);
> > +static BLK_TRACE_DEVICE_ATTR(start_lba);
> > +static BLK_TRACE_DEVICE_ATTR(end_lba);
> > +
> > +static struct attribute *blk_trace_attrs[] = {
> > +	&dev_attr_enable.attr,
> > +	&dev_attr_act_mask.attr,
> > +	&dev_attr_pid.attr,
> > +	&dev_attr_start_lba.attr,
> > +	&dev_attr_end_lba.attr,
> > +	NULL
> > +};
> > +
> > +struct attribute_group blk_trace_attr_group = {
> > +	.name  = "trace",
> > +	.attrs = blk_trace_attrs,
> > +};
> > +
> > +static int blk_str2act_mask(const char *str)
> > +{
> > +	int mask = 0;
> > +	char *copy = kstrdup(str, GFP_KERNEL), *s;
> > +
> > +	if (copy == NULL)
> > +		return -ENOMEM;
> > +
> > +	s = strstrip(copy);
> > +
> > +	while (1) {
> > +		char *sep = strchr(s, ',');
> > +
> > +		if (sep != NULL)
> > +			*sep = '\0';
> > +
> > +		if (strcasecmp(s, "barrier") == 0)
> > +			mask |= BLK_TC_BARRIER;
> > +		else if (strcasecmp(s, "complete") == 0)
> > +			mask |= BLK_TC_COMPLETE;
> > +		else if (strcasecmp(s, "fs") == 0)
> > +			mask |= BLK_TC_FS;
> > +		else if (strcasecmp(s, "issue") == 0)
> > +			mask |= BLK_TC_ISSUE;
> > +		else if (strcasecmp(s, "pc") == 0)
> > +			mask |= BLK_TC_PC;
> > +		else if (strcasecmp(s, "queue") == 0)
> > +			mask |= BLK_TC_QUEUE;
> > +		else if (strcasecmp(s, "read") == 0)
> > +			mask |= BLK_TC_READ;
> > +		else if (strcasecmp(s, "requeue") == 0)
> > +			mask |= BLK_TC_REQUEUE;
> > +		else if (strcasecmp(s, "sync") == 0)
> > +			mask |= BLK_TC_SYNC;
> > +		else if (strcasecmp(s, "write") == 0)
> > +			mask |= BLK_TC_WRITE;
> > +
> > +		if (sep == NULL)
> > +			break;
> > +
> > +		s = sep + 1;
> > +	}
> > +	kfree(copy);
> > +
> > +	return mask;
> > +}
> > +
> > +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> > +					 struct device_attribute *attr,
> > +					 char *buf)
> > +{
> > +	struct hd_struct *p = dev_to_part(dev);
> > +	struct request_queue *q;
> > +	struct block_device *bdev;
> > +	ssize_t ret = -ENXIO;
> > +
> > +	lock_kernel();
> > +	bdev = bdget(part_devt(p));
> > +	if (bdev == NULL)
> > +		goto out_unlock_kernel;
> > +
> > +	q = bdev_get_queue(bdev);
> > +	if (q == NULL)
> > +		goto out_bdput;
> > +	mutex_lock(&bdev->bd_mutex);
> > +	if (q->blk_trace == NULL)
> > +		ret = sprintf(buf, "disabled\n");
> > +	else if (attr == &dev_attr_act_mask)
> > +		ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask);
> > +	else if (attr == &dev_attr_pid)
> > +		ret = sprintf(buf, "%u\n", q->blk_trace->pid);
> > +	else if (attr == &dev_attr_start_lba)
> > +		ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
> > +	else if (attr == &dev_attr_end_lba)
> > +		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
> > +	mutex_unlock(&bdev->bd_mutex);
> > +out_bdput:
> > +	bdput(bdev);
> > +out_unlock_kernel:
> > +	unlock_kernel();
> > +	return ret;
> > +}
> > +
> > +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> > +					  struct device_attribute *attr,
> > +					  const char *buf, size_t count)
> > +{
> > +	struct block_device *bdev;
> > +	struct request_queue *q;
> > +	struct hd_struct *p;
> > +	u64 value;
> > +	ssize_t ret = -ENXIO;
> > +
> > +	if (count == 0)
> > +		goto out;
> > +
> > +	if (attr == &dev_attr_act_mask) {
> > +		if (sscanf(buf, "%llx", &value) != 1) {
> > +			/* Assume it is a list of trace category names */
> > +			value = blk_str2act_mask(buf);
> > +			if (value < 0)
> > +				goto out;
> > +		}
> > +	} else if (sscanf(buf, "%llu", &value) != 1)
> > +		goto out;
> > +
> > +	lock_kernel();
> > +	p = dev_to_part(dev);
> > +	bdev = bdget(part_devt(p));
> > +	if (bdev == NULL)
> > +		goto out_unlock_kernel;
> > +
> > +	q = bdev_get_queue(bdev);
> > +	if (q == NULL)
> > +		goto out_bdput;
> > +
> > +	mutex_lock(&bdev->bd_mutex);
> > +	ret = 0;
> > +	if (q->blk_trace == NULL)
> > +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> > +
> > +	if (ret == 0) {
> > +		if (attr == &dev_attr_act_mask)
> > +			q->blk_trace->act_mask = value;
> > +		else if (attr == &dev_attr_pid)
> > +			q->blk_trace->pid = value;
> > +		else if (attr == &dev_attr_start_lba)
> > +			q->blk_trace->start_lba = value;
> > +		else if (attr == &dev_attr_end_lba)
> > +			q->blk_trace->end_lba = value;
> > +		ret = count;
> > +	}
> > +	mutex_unlock(&bdev->bd_mutex);
> > +out_bdput:
> > +	bdput(bdev);
> > +out_unlock_kernel:
> > +	unlock_kernel();
> > +out:
> > +	return ret;
> > +}
> > diff --git a/fs/partitions/check.c b/fs/partitions/check.c
> > index 6d72024..01714ef 100644
> > --- a/fs/partitions/check.c
> > +++ b/fs/partitions/check.c
> > @@ -268,6 +268,10 @@ ssize_t part_fail_store(struct device *dev,
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_BLK_DEV_IO_TRACE
> > +extern struct attribute_group blk_trace_attr_group;
> > +#endif
> > +
> >  static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
> >  static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
> >  static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
> > @@ -294,6 +298,9 @@ static struct attribute_group part_attr_group = {
> >  
> >  static struct attribute_group *part_attr_groups[] = {
> >  	&part_attr_group,
> > +#ifdef CONFIG_BLK_DEV_IO_TRACE
> > +	&blk_trace_attr_group,
> > +#endif
> >  	NULL
> >  };
> 
> 
> 
> This tracer looks good :-)

Thanks!

I hope that by dropping the requirement for a userspace bit (be it big or
small) we can set a precedent for this kind of tracing infrastructure now
that tracepoints are being proposed left and right.

After all, somebody said that userspace is just a test case for the
kernel, and for things like developer assists...

- Arnaldo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC][PATCH] ftrace interface for blktrace
  2009-01-17 20:22   ` Arnaldo Carvalho de Melo
@ 2009-01-19 13:33     ` Ingo Molnar
  0 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2009-01-19 13:33 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Frederic Weisbecker, Jens Axboe,
	Steven Rostedt, linux-kernel


* Arnaldo Carvalho de Melo <acme@redhat.com> wrote:

> I hope that by dropping the requirement for a userspace bit (be it big or 
> small) we can set a precedent for this kind of tracing infrastructure now 
> that tracepoints are being proposed left and right.
> 
> After all, somebody said that userspace is just a test case for the 
> kernel, and for things like developer assists...

Exactly.

Also, for certain tracing modes where we just do the run to gather 
statistics (but do not plan to capture any trace), it is more efficient to 
have an in-kernel histogram than to export gobs of data to user-space at 
great expense, just to throw away most of it there ;-)
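
For illustration (hypothetical code, nothing like this exists yet): a
handful of per-action counters next to the probe would reduce the whole
export to a screenful of numbers:

/* Hypothetical per-action histogram, indexed by the __BLK_TA_* code;
 * per-cpu counters left out for brevity. */
static unsigned long blk_act_hist[__BLK_TA_REMAP + 1];

static void blk_hist_note(u32 what)
{
	u16 act = what & ((1 << BLK_TC_SHIFT) - 1);

	if (act < ARRAY_SIZE(blk_act_hist))
		blk_act_hist[act]++;
}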

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC][PATCHv2] ftrace interface for blktrace
  2009-01-17 20:08     ` Arnaldo Carvalho de Melo
@ 2009-01-20 14:42       ` Arnaldo Carvalho de Melo
  2009-01-20 17:03         ` Steven Rostedt
  2009-01-20 23:23         ` Frederic Weisbecker
  0 siblings, 2 replies; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2009-01-20 14:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ingo Molnar, Steven Rostedt, Frederic Weisbecker, linux-kernel

On Sat, Jan 17, 2009 at 06:08:04PM -0200, Arnaldo Carvalho de Melo wrote:
> Em Sat, Jan 17, 2009 at 08:14:30PM +0100, Jens Axboe escreveu:
> > On Sat, Jan 17 2009, Ingo Molnar wrote:
> > > * Arnaldo Carvalho de Melo <acme@redhat.com> wrote:
> > > > Hi Jens,
> > > > 	The patch below adds a ftrace interface for blktrace, allowing 
> > > > people to use it without any required userspace tools, using sysfs to 
> > > > setup the act_mask, pid, start_lba, end_lba.
> > > 
> > > Very nice patch!
> > > 
> > > Jens, i'm wondering what your take on this direction is. I think the 
> > > consolidation effect is great and the built-in IO tracing capabilities are 
> > > very nice.
> > 
> > I like the current patch, from a quick look. I'll look more soonish
> > (monday). I've always liked the concept of being able to mix various
> > traces into the same stream, since it makes it MUCH easier to see what
> > on earth is going wrong. What I didn't like in the patches Acme did was
> > the enable part, I'll need to look into that. Again, from a quick look,
> > what is the BKL doing in there?!
> 
> I'm just trying to keep the locking expectations of the previous, similar
> path, in block/ioctl.c, blkdev_ioctl, where we set up q->blk_trace:
> 
>         case BLKTRACESTART:
>         case BLKTRACESTOP:
>         case BLKTRACESETUP:
>         case BLKTRACETEARDOWN:
>                 lock_kernel();
>                 ret = blk_trace_ioctl(bdev, cmd, (char __user *) arg);
>                 unlock_kernel();
>                 break;
> 
> So I took the safest path and grabbed the big K lock.
>  
> > > We can still get the raw events too and do user-space post-processing, 
> > > when that is desired - so this does not limit anything that blktrace was 
> > > able to do before - it only extends on it.
> > > 
> > > Is there any particular blktrace feature you can think of that is not 
> > > present in the blktrace ftrace plugin?
> > 
> > Can't answer that yet, I'll give it a more thorough look next week.
> 
> Looking forward to that. And will get it split into multiple patches,
> hopefully by monday.

OK, still as one patch, but with lots of improvements; let's see them
using examples:

Setup debugfs/tracing environment:

$ mkdir /d
$ mount -t debugfs none /d
$ ln -s /d/tracing /t

Now enable tracing on the desired device, setting up filters if needed:

$ echo 1 > /sys/block/sda/sda1/trace/enable

[root@f10-1 ~]# head /t/trace
# tracer: blk
#
            bash-955   [000]   565.025000:   8,1    A WBS 3932975 + 8 <- (8,1) 3932912
            bash-955   [000]   565.025000:   8,1    Q   R 3932975 + 8 [<idle>]
            bash-955   [000]   565.025000:   8,1    G  RB 3932975 + 8 [<idle>]
            bash-955   [000]   565.025000:   8,1    P  NS [<idle>]
            bash-955   [000]   565.025000:   8,1    I RBS 3932975 + 8 [<idle>]
            bash-955   [000]   565.025000:   8,1    U  WS [<idle>] 1
            bash-955   [000]   565.025000:   8,1    D  WB 3932975 + 8 [<idle>]
          <idle>-0     [000]   565.044000:   8,1    C  RS 3932975 + 8 [0]
[root@f10-1 ~]#

This is the standardized format used in the sched_switch, sched_wakeup and
other tracers.

This is done by adding a tracer option named "blk_classic", which starts
disabled, using this construct:

/* Select an alternative, minimalistic output than the original one */
#define TRACE_BLK_OPT_CLASSIC   0x1

static struct tracer_opt blk_tracer_opts[] = {
        /* Default disable the minimalistic output */
        { TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
        { }
};

static struct tracer_flags blk_tracer_flags = {
        .val  = 0,
        .opts = blk_tracer_opts,
};

With this option disabled, the ->print_line in our struct tracer always
returns TRACE_TYPE_UNHANDLED, leaving it to the struct trace_event ->trace
routine to pretty-print the entry if it is of type TRACE_BLK, which we now
register in init_blk_tracer using register_ftrace_event.

I did this because I want, at least while this patch is being considered by
Jens and the other blktrace developers, to also keep the classic format
available, activated by:

[root@f10-1 ~]# echo blk_classic > /t/trace_options 
[root@f10-1 ~]# cat /t/trace_options 
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock nostacktrace nosched-tree ftrace_printk noftrace_preempt nobranch annotate nouserstacktrace nosym-userobj noprintk-msg-only blk_classic

So that we use struct tracer ->print_line and have full control of what is
formatted:

[root@f10-1 ~]# tail /t/trace
  8,1    0   289.469000  1054  D  WB 526039 + 8 [ls]
  8,1    0   289.470000     0  C  RS 526039 + 8 [1751]
  8,1    0   289.470000  1054  A WBS 266439 + 8 <- (8,1) 266376
  8,1    0   289.470000  1054  Q   R 266439 + 8 [ls]
  8,1    0   289.470000  1054  G  RB 266439 + 8 [ls]
  8,1    0   289.470000  1054  P  NS [ls]
  8,1    0   289.470000  1054  I RBS 266439 + 8 [ls]
  8,1    0   289.470000  1054  U  WS [ls] 1
  8,1    0   289.470000  1054  D  WB 266439 + 8 [ls]
  8,1    0   289.470000     0  C  RS 266439 + 8 [4295]
[root@f10-1 ~]# 

Now back to noblk_classic mode...

# echo noblk_classic > /t/trace_options

... where we can mix trace events by doing...

# echo stacktrace > /t/trace_options

... and get something interesting:

[root@f10-1 ~]# tail /t/trace
              ls-1060  [000]   431.697000:   8,1    A WBS 12320847 + 8 <- (8,1) 12320784
              ls-1060  [000]   431.697000: 
 <= submit_bio
 <= submit_bh
 <= __ext3_get_inode_loc
 <= ext3_iget
 <= ext3_lookup
 <= do_lookup
 <= __link_path_walk
              ls-1060  [000]   431.697000:   8,1    Q   R 12320847 + 8 [ls]
              ls-1060  [000]   431.697000: 
 <= generic_make_request
 <= submit_bio
 <= submit_bh
 <= __ext3_get_inode_loc
 <= ext3_iget
 <= ext3_lookup
 <= do_lookup
              ls-1060  [000]   431.697000:   8,1    G  RB 12320847 + 8 [ls]
              ls-1060  [000]   431.697000: 
 <= get_request
 <= get_request_wait
 <= __make_request
 <= generic_make_request
 <= submit_bio
 <= submit_bh
 <= __ext3_get_inode_loc
              ls-1060  [000]   431.697000:   8,1    P  NS [ls]
              ls-1060  [000]   431.697000: 
 <= __make_request
 <= generic_make_request
 <= submit_bio
 <= submit_bh
 <= __ext3_get_inode_loc
 <= ext3_iget
 <= ext3_lookup
              ls-1060  [000]   431.697000:   8,1    I RBS 12320847 + 8 [ls]
              ls-1060  [000]   431.697000: 
 <= elv_insert
 <= __elv_add_request
 <= __make_request
 <= generic_make_request
 <= submit_bio
 <= submit_bh
 <= __ext3_get_inode_loc
<SNIP>

Now let's disable stacktraces:

# echo nostacktrace > /t/trace_options

And try doing things a bit more verbosely, using the TRACE_ITER_VERBOSE
flag that was already part of the existing ftrace infrastructure.

# echo verbose > /t/trace_options
# cat /t/trace_pipe 
              ls-1065  [000]   694.325000:   8,1   remap WBS 3670207 + 8 <- (8,1) 3670144
              ls-1065  [000]   694.325000:   8,1   queue   R 3670207 + 8 [ls]
              ls-1065  [000]   694.325000:   8,1   getrq  RB 3670207 + 8 [ls]
              ls-1065  [000]   694.325000:   8,1   plug  NS [ls]
              ls-1065  [000]   694.325000:   8,1   insert RBS 3670207 + 8 [ls]
              ls-1065  [000]   694.325000:   8,1   unplug_io  WS [ls] 1
              ls-1065  [000]   694.325000:   8,1   issue  WB 3670207 + 8 [ls]
          <idle>-0     [000]   694.343000:   8,1   complete  RS 3670207 + 8 [191]

Hint: look at the action (insert, unplug_io, issue, etc).

We can make it a bit more verbose using this iter flag; the above is
just a suggestion on using the standard flags interface.

Frederic, for now I'm using t->action as the "sub-type" we discussed.

Now that we have the noblk_classic mode, we can actually look at the
struct trace_event ->binary() way of doing things, plus a
don't-print-anything-just-call->binary() flag, and then we could
synthesize the old blktrace format and simply do:

cat /t/trace_pipe | blkparse -i -

As a test to check that everything is as the old blktrace(8) codebase
expects.
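
The handler itself should be little more than this untested sketch,
assuming trace_seq_putmem() is usable here, and modulo re-synthesizing
whatever fields blkparse still expects (see below):

/* Emit the raw blk_io_trace plus pdu, so that blkparse can consume
 * the trace_pipe output directly */
static int blk_trace_event_binary(struct trace_seq *s,
				  struct trace_entry *ent, int flags)
{
	const struct blk_io_trace *t = te_blk_io_trace(ent);

	if (!trace_seq_putmem(s, t, sizeof(*t) + t->pdu_len))
		return TRACE_TYPE_PARTIAL_LINE;

	return TRACE_TYPE_HANDLED;
}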

Jens, there are several fields in struct blk_io_trace that could go,
like pid, cpu and timestamp, because they are already in struct
trace_entry, and also sequence, if my understanding of what was discussed
here is not flawed:

http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3375544

Patch against yesterday's linux-tip,

Comments?

- Arnaldo

diff --git a/block/blktrace.c b/block/blktrace.c
index b0a2cae..21116c4 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -25,9 +25,27 @@
 #include <linux/time.h>
 #include <trace/block.h>
 #include <asm/uaccess.h>
+#include <../kernel/trace/trace_output.h>
 
 static unsigned int blktrace_seq __read_mostly = 1;
 
+static struct trace_array *blk_tr;
+static int __read_mostly  blk_tracer_enabled;
+
+/* Select an alternative, minimalistic output than the original one */
+#define TRACE_BLK_OPT_CLASSIC 	0x1
+
+static struct tracer_opt blk_tracer_opts[] = {
+	/* Default disable the minimalistic output */
+	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
+	{ }
+};
+
+static struct tracer_flags blk_tracer_flags = {
+	.val  = 0,
+	.opts = blk_tracer_opts,
+};
+
 /* Global reference count of probes */
 static DEFINE_MUTEX(blk_probe_mutex);
 static atomic_t blk_probes_ref = ATOMIC_INIT(0);
@@ -43,6 +61,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
 {
 	struct blk_io_trace *t;
 
+	if (!bt->rchan)
+		return;
+
 	t = relay_reserve(bt->rchan, sizeof(*t) + len);
 	if (t) {
 		const int cpu = smp_processor_id();
@@ -90,6 +111,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
 	unsigned long flags;
 	char *buf;
 
+	if (!bt->msg_data)
+		return;
+
 	local_irq_save(flags);
 	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
 	va_start(args, fmt);
@@ -131,13 +155,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
 {
 	struct task_struct *tsk = current;
+	struct ring_buffer_event *event = NULL;
 	struct blk_io_trace *t;
 	unsigned long flags;
 	unsigned long *sequence;
 	pid_t pid;
-	int cpu;
+	int cpu, pc = 0;
 
-	if (unlikely(bt->trace_state != Blktrace_running))
+	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
 		return;
 
 	what |= ddir_act[rw & WRITE];
@@ -150,6 +175,24 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	pid = tsk->pid;
 	if (unlikely(act_log_check(bt, what, sector, pid)))
 		return;
+	cpu = raw_smp_processor_id();
+
+	if (blk_tr) {
+		struct trace_entry *ent;
+		tracing_record_cmdline(current);
+
+		event = ring_buffer_lock_reserve(blk_tr->buffer,
+						 sizeof(*t) + pdu_len, &flags);
+		if (!event)
+			return;
+		
+		ent = ring_buffer_event_data(event);
+		t = (struct blk_io_trace *)ent;
+		pc = preempt_count();
+		tracing_generic_entry_update(ent, 0, pc);
+		ent->type = TRACE_BLK;
+		goto record_it;
+	}
 
 	/*
 	 * A word about the locking here - we disable interrupts to reserve
@@ -163,23 +206,33 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 
 	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
 	if (t) {
-		cpu = smp_processor_id();
 		sequence = per_cpu_ptr(bt->sequence, cpu);
 
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
 		t->sequence = ++(*sequence);
 		t->time = ktime_to_ns(ktime_get());
+		t->cpu = cpu;
+		t->pid = pid;
+record_it:
 		t->sector = sector;
 		t->bytes = bytes;
 		t->action = what;
-		t->pid = pid;
 		t->device = bt->dev;
-		t->cpu = cpu;
 		t->error = error;
 		t->pdu_len = pdu_len;
 
 		if (pdu_len)
 			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
+
+		if (blk_tr) {
+			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
+			if (pid != 0 &&
+			    (blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC) == 0 &&
+			    (trace_flags & TRACE_ITER_STACKTRACE) != 0)
+				__trace_stack(blk_tr, NULL, flags, 5, pc);
+			trace_wake_up();
+			return;
+		}
 	}
 
 	local_irq_restore(flags);
@@ -888,3 +941,584 @@ static void blk_unregister_tracepoints(void)
 
 	tracepoint_synchronize_unregister();
 }
+
+/*
+ * struct blk_io_tracer formatting routines
+ */
+
+static void fill_rwbs(char *rwbs, const struct blk_io_trace *t)
+{
+        int i = 0;
+
+        if (t->action & BLK_TC_DISCARD)	   rwbs[i++] = 'D';
+        else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
+        else if (t->bytes)		   rwbs[i++] = 'R';
+        else				   rwbs[i++] = 'N';
+
+        if (t->action & BLK_TC_AHEAD)	   rwbs[i++] = 'A';
+        if (t->action & BLK_TC_BARRIER)	   rwbs[i++] = 'B';
+        if (t->action & BLK_TC_SYNC)	   rwbs[i++] = 'S';
+        if (t->action & BLK_TC_META)	   rwbs[i++] = 'M';
+
+        rwbs[i] = '\0';
+}
+
+static inline
+const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent)
+{
+	return (const struct blk_io_trace *)ent;
+}
+
+static inline const void *pdu_start(const struct trace_entry *ent)
+{
+	return te_blk_io_trace(ent) + 1;
+}
+
+static inline u32 t_sec(const struct trace_entry *ent)
+{
+	return te_blk_io_trace(ent)->bytes >> 9;
+}
+
+static inline unsigned long long t_sector(const struct trace_entry *ent)
+{
+	return te_blk_io_trace(ent)->sector;
+}
+
+static inline __u16 t_error(const struct trace_entry *ent)
+{
+	return te_blk_io_trace(ent)->error;
+}
+
+static __u64 get_pdu_int(const struct trace_entry *ent)
+{
+	const __u64 *val = pdu_start(ent);
+	return be64_to_cpu(*val);
+}
+
+static void get_pdu_remap(const struct trace_entry *ent,
+			  struct blk_io_trace_remap *r)
+{
+	const struct blk_io_trace_remap *__r = pdu_start(ent);
+	__u64 sector = __r->sector;
+
+	r->device = be32_to_cpu(__r->device);
+	r->device_from = be32_to_cpu(__r->device_from);
+	r->sector = be64_to_cpu(sector);
+}
+
+static int blk_log_action_iter(struct trace_iterator *iter, const char *act)
+{
+	char rwbs[6];
+	unsigned long long ts  = ns2usecs(iter->ts);
+	unsigned long usec_rem = do_div(ts, USEC_PER_SEC);
+	unsigned secs	       = (unsigned long)ts;
+	const struct trace_entry *ent = iter->ent;
+	const struct blk_io_trace *t = (const struct blk_io_trace *)ent;
+
+	fill_rwbs(rwbs, t);
+
+	return trace_seq_printf(&iter->seq,
+				"%3d,%-3d %2d %5d.%06lu %5u %2s %3s ",
+				MAJOR(t->device), MINOR(t->device), iter->cpu,
+				secs, usec_rem, ent->pid, act, rwbs);
+}
+
+static int blk_log_action_seq(struct trace_seq *s, const struct blk_io_trace *t,
+			      const char *act)
+{
+	char rwbs[6];
+	fill_rwbs(rwbs, t);
+	return trace_seq_printf(s, "%3d,%-3d %2s %3s ",
+				MAJOR(t->device), MINOR(t->device), act, rwbs);
+}
+
+static int blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
+{
+	const char *cmd = trace_find_cmdline(ent->pid);
+
+	if (t_sec(ent))
+		return trace_seq_printf(s, "%llu + %u [%s]\n",
+					t_sector(ent), t_sec(ent), cmd);
+	return trace_seq_printf(s, "[%s]\n", cmd);
+}
+
+static int blk_log_with_error(struct trace_seq *s, const struct trace_entry *ent)
+{
+	if (t_sec(ent))
+		return trace_seq_printf(s, "%llu + %u [%d]\n", t_sector(ent),
+					t_sec(ent), t_error(ent));
+	return trace_seq_printf(s, "%llu [%d]\n", t_sector(ent), t_error(ent));
+}
+
+static int blk_log_remap(struct trace_seq *s, const struct trace_entry *ent)
+{
+	struct blk_io_trace_remap r = { .device = 0, };
+
+	get_pdu_remap(ent, &r);
+	return trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
+			       t_sector(ent),
+			       t_sec(ent), MAJOR(r.device), MINOR(r.device),
+			       (unsigned long long)r.sector);
+}
+
+static int blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
+{
+	return trace_seq_printf(s, "[%s]\n", trace_find_cmdline(ent->pid));
+}
+
+static int blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent)
+{
+	return trace_seq_printf(s, "[%s] %llu\n", trace_find_cmdline(ent->pid),
+				get_pdu_int(ent));
+}
+
+static int blk_log_split(struct trace_seq *s, const struct trace_entry *ent)
+{
+	return trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
+				get_pdu_int(ent), trace_find_cmdline(ent->pid));
+}
+
+/*
+ * struct tracer operations
+ */
+
+static void blk_tracer_print_header(struct seq_file *m)
+{
+	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
+		return;
+	seq_puts(m, "# DEV   CPU TIMESTAMP     PID ACT FLG\n"
+		    "#  |     |     |           |   |   |\n");
+}
+
+static void blk_tracer_start(struct trace_array *tr)
+{
+	int cpu;
+
+	tr->time_start = ftrace_now(tr->cpu);
+
+	for_each_online_cpu(cpu)
+		tracing_reset(tr, cpu);
+
+	mutex_lock(&blk_probe_mutex);
+	if (atomic_add_return(1, &blk_probes_ref) == 1)
+		if (blk_register_tracepoints())
+			atomic_dec(&blk_probes_ref);
+	mutex_unlock(&blk_probe_mutex);
+}
+
+static int blk_tracer_init(struct trace_array *tr)
+{
+	blk_tr = tr;
+	blk_tracer_start(tr);
+	mutex_lock(&blk_probe_mutex);
+	blk_tracer_enabled++;
+	mutex_unlock(&blk_probe_mutex);
+	return 0;
+}
+
+static void blk_tracer_stop(struct trace_array *tr)
+{
+	mutex_lock(&blk_probe_mutex);
+	if (atomic_dec_and_test(&blk_probes_ref))
+		blk_unregister_tracepoints();
+	mutex_unlock(&blk_probe_mutex);
+}
+
+static void blk_tracer_reset(struct trace_array *tr)
+{
+	if (!atomic_read(&blk_probes_ref))
+		return;
+
+	mutex_lock(&blk_probe_mutex);
+	blk_tracer_enabled--;
+	WARN_ON(blk_tracer_enabled < 0);
+	mutex_unlock(&blk_probe_mutex);
+
+	blk_tracer_stop(tr);
+}
+
+static struct {
+	const char *act[2];
+	int 	   (*print)(struct trace_seq *s, const struct trace_entry *ent);
+} what2act[] __read_mostly = {
+	[__BLK_TA_QUEUE]	= {{  "Q", "queue" }, 	   blk_log_generic },
+	[__BLK_TA_BACKMERGE]	= {{  "M", "backmerge" },  blk_log_generic },
+	[__BLK_TA_FRONTMERGE]	= {{  "F", "frontmerge" }, blk_log_generic },
+	[__BLK_TA_GETRQ]	= {{  "G", "getrq" },	   blk_log_generic },
+	[__BLK_TA_SLEEPRQ]	= {{  "S", "sleeprq" },	   blk_log_generic },
+	[__BLK_TA_REQUEUE]	= {{  "R", "requeue" },	   blk_log_with_error },
+	[__BLK_TA_ISSUE]	= {{  "D", "issue" },	   blk_log_generic },
+	[__BLK_TA_COMPLETE]	= {{  "C", "complete" },   blk_log_with_error },
+	[__BLK_TA_PLUG]		= {{  "P", "plug" },	   blk_log_plug },
+	[__BLK_TA_UNPLUG_IO]	= {{  "U", "unplug_io" },  blk_log_unplug },
+	[__BLK_TA_UNPLUG_TIMER]	= {{ "UT", "unplug_timer" }, blk_log_unplug },
+	[__BLK_TA_INSERT]	= {{  "I", "insert" },	   blk_log_generic },
+	[__BLK_TA_SPLIT]	= {{  "X", "split" },	   blk_log_split },
+	[__BLK_TA_BOUNCE]	= {{  "B", "bounce" },	   blk_log_generic },
+	[__BLK_TA_REMAP]	= {{  "A", "remap" },	   blk_log_remap },
+};
+
+static int blk_trace_event_print(struct trace_seq *s, struct trace_entry *ent,
+				 int flags)
+{
+	const struct blk_io_trace *t = (struct blk_io_trace *)ent;
+	const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1);
+	int ret;
+
+	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
+		ret = trace_seq_printf(s, "Bad pc action %x\n", what);
+	else {
+		const bool long_act = !!(trace_flags & TRACE_ITER_VERBOSE);
+		ret = blk_log_action_seq(s, t, what2act[what].act[long_act]);
+		if (ret)
+			ret = what2act[what].print(s, ent);
+	}
+
+	return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
+}
+
+static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
+{
+	const struct blk_io_trace *t;
+	u16 what;
+	int ret;
+
+	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
+		return TRACE_TYPE_UNHANDLED;
+
+	t = (const struct blk_io_trace *)iter->ent;
+	what = t->action & ((1 << BLK_TC_SHIFT) - 1);
+
+	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
+		ret = trace_seq_printf(&iter->seq, "Bad pc action %x\n", what);
+	else {
+		const bool long_act = !!(trace_flags & TRACE_ITER_VERBOSE);
+		ret = blk_log_action_iter(iter, what2act[what].act[long_act]);
+		if (ret)
+			ret = what2act[what].print(&iter->seq, iter->ent);
+	}
+
+	return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
+}
+
+static struct tracer blk_tracer __read_mostly = {
+	.name		= "blk",
+	.init		= blk_tracer_init,
+	.reset		= blk_tracer_reset,
+	.start		= blk_tracer_start,
+	.stop		= blk_tracer_stop,
+	.print_header	= blk_tracer_print_header,
+	.print_line	= blk_tracer_print_line,
+	.flags		= &blk_tracer_flags,
+};
+
+static struct trace_event trace_blk_event = {
+	.type	 	= TRACE_BLK,
+	.trace		= blk_trace_event_print,
+	.latency_trace	= blk_trace_event_print,
+	.raw		= trace_nop_print,
+	.hex		= trace_nop_print,
+	.binary		= trace_nop_print,
+};
+
+static int __init init_blk_tracer(void)
+{
+	if (!register_ftrace_event(&trace_blk_event)) {
+		pr_warning("Warning: could not register block events\n");
+		return 1;
+	}
+
+	if (register_tracer(&blk_tracer) != 0) {
+		pr_warning("Warning: could not register the block tracer\n");
+		unregister_ftrace_event(&trace_blk_event);
+		return 1;
+	}
+
+	return 0;
+}
+
+device_initcall(init_blk_tracer);
+
+static int blk_trace_remove_queue(struct request_queue *q)
+{
+	struct blk_trace *bt;
+
+	bt = xchg(&q->blk_trace, NULL);
+	if (bt == NULL)
+		return -EINVAL;
+
+	kfree(bt);
+	return 0;
+}
+
+/*
+ * Setup everything required to start tracing
+ */
+static int blk_trace_setup_queue(struct request_queue *q, dev_t dev)
+{
+	struct blk_trace *old_bt, *bt = NULL;
+	int ret;
+
+	ret = -ENOMEM;
+	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
+	if (!bt)
+		goto err;
+
+	bt->dev = dev;
+	bt->act_mask = (u16)-1;
+	bt->end_lba = -1ULL;
+	bt->trace_state = Blktrace_running;
+
+	old_bt = xchg(&q->blk_trace, bt);
+	if (old_bt != NULL) {
+		(void)xchg(&q->blk_trace, old_bt);
+		kfree(bt);
+		ret = -EBUSY;
+	}
+	return 0;
+err:
+	return ret;
+}
+
+/*
+ * sysfs interface to enable and configure tracing
+ */
+
+static ssize_t sysfs_blk_trace_enable_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	struct block_device *bdev;
+	ssize_t ret = -ENXIO;
+
+	lock_kernel();
+	bdev = bdget(part_devt(p));
+	if (bdev != NULL) {
+		struct request_queue *q = bdev_get_queue(bdev);
+
+		if (q != NULL) {
+			mutex_lock(&bdev->bd_mutex);
+			ret = sprintf(buf, "%u\n", !!q->blk_trace);
+			mutex_unlock(&bdev->bd_mutex);
+		}
+
+		bdput(bdev);
+	}
+
+	unlock_kernel();
+	return ret;
+}
+
+static ssize_t sysfs_blk_trace_enable_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t count)
+{
+	struct block_device *bdev;
+	struct request_queue *q;
+	struct hd_struct *p;
+	int value;
+	ssize_t ret = -ENXIO;
+
+	if (count == 0 || sscanf(buf, "%d", &value) != 1)
+		goto out;
+
+	lock_kernel();
+	p = dev_to_part(dev);
+	bdev = bdget(part_devt(p));
+	if (bdev == NULL)
+		goto out_unlock_kernel;
+
+	q = bdev_get_queue(bdev);
+	if (q == NULL)
+		goto out_bdput;
+
+	mutex_lock(&bdev->bd_mutex);
+	if (value)
+		ret = blk_trace_setup_queue(q, bdev->bd_dev);
+	else
+		ret = blk_trace_remove_queue(q);
+	mutex_unlock(&bdev->bd_mutex);
+
+	if (ret == 0)
+		ret = count;
+out_bdput:
+	bdput(bdev);
+out_unlock_kernel:
+	unlock_kernel();
+out:
+	return ret;
+}
+
+static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count);
+#define BLK_TRACE_DEVICE_ATTR(_name) \
+	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
+		    sysfs_blk_trace_attr_show, \
+		    sysfs_blk_trace_attr_store)
+
+static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR,
+		   sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store);
+static BLK_TRACE_DEVICE_ATTR(act_mask);
+static BLK_TRACE_DEVICE_ATTR(pid);
+static BLK_TRACE_DEVICE_ATTR(start_lba);
+static BLK_TRACE_DEVICE_ATTR(end_lba);
+
+static struct attribute *blk_trace_attrs[] = {
+	&dev_attr_enable.attr,
+	&dev_attr_act_mask.attr,
+	&dev_attr_pid.attr,
+	&dev_attr_start_lba.attr,
+	&dev_attr_end_lba.attr,
+	NULL
+};
+
+struct attribute_group blk_trace_attr_group = {
+	.name  = "trace",
+	.attrs = blk_trace_attrs,
+};
+
+static int blk_str2act_mask(const char *str)
+{
+	int mask = 0;
+	char *copy = kstrdup(str, GFP_KERNEL), *s;
+
+	if (copy == NULL)
+		return -ENOMEM;
+
+	s = strstrip(copy);
+
+	while (1) {
+		char *sep = strchr(s, ',');
+
+		if (sep != NULL)
+			*sep = '\0';
+
+		if (strcasecmp(s, "barrier") == 0)
+			mask |= BLK_TC_BARRIER;
+		else if (strcasecmp(s, "complete") == 0)
+			mask |= BLK_TC_COMPLETE;
+		else if (strcasecmp(s, "fs") == 0)
+			mask |= BLK_TC_FS;
+		else if (strcasecmp(s, "issue") == 0)
+			mask |= BLK_TC_ISSUE;
+		else if (strcasecmp(s, "pc") == 0)
+			mask |= BLK_TC_PC;
+		else if (strcasecmp(s, "queue") == 0)
+			mask |= BLK_TC_QUEUE;
+		else if (strcasecmp(s, "read") == 0)
+			mask |= BLK_TC_READ;
+		else if (strcasecmp(s, "requeue") == 0)
+			mask |= BLK_TC_REQUEUE;
+		else if (strcasecmp(s, "sync") == 0)
+			mask |= BLK_TC_SYNC;
+		else if (strcasecmp(s, "write") == 0)
+			mask |= BLK_TC_WRITE;
+
+		if (sep == NULL)
+			break;
+
+		s = sep + 1;
+	}
+	kfree(copy);
+
+	return mask;
+}
+
+static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct hd_struct *p = dev_to_part(dev);
+	struct request_queue *q;
+	struct block_device *bdev;
+	ssize_t ret = -ENXIO;
+
+	lock_kernel();
+	bdev = bdget(part_devt(p));
+	if (bdev == NULL)
+		goto out_unlock_kernel;
+
+	q = bdev_get_queue(bdev);
+	if (q == NULL)
+		goto out_bdput;
+	mutex_lock(&bdev->bd_mutex);
+	if (q->blk_trace == NULL)
+		ret = sprintf(buf, "disabled\n");
+	else if (attr == &dev_attr_act_mask)
+		ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask);
+	else if (attr == &dev_attr_pid)
+		ret = sprintf(buf, "%u\n", q->blk_trace->pid);
+	else if (attr == &dev_attr_start_lba)
+		ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
+	else if (attr == &dev_attr_end_lba)
+		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
+	mutex_unlock(&bdev->bd_mutex);
+out_bdput:
+	bdput(bdev);
+out_unlock_kernel:
+	unlock_kernel();
+	return ret;
+}
+
+static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct block_device *bdev;
+	struct request_queue *q;
+	struct hd_struct *p;
+	u64 value;
+	ssize_t ret = -ENXIO;
+
+	if (count == 0)
+		goto out;
+
+	if (attr == &dev_attr_act_mask) {
+		if (sscanf(buf, "%llx", &value) != 1) {
+			/* Assume it is a list of trace category names */
+			int mask = blk_str2act_mask(buf);
+			if (mask < 0)
+				goto out;
+			value = mask;
+		}
+	} else if (sscanf(buf, "%llu", &value) != 1)
+		goto out;
+
+	lock_kernel();
+	p = dev_to_part(dev);
+	bdev = bdget(part_devt(p));
+	if (bdev == NULL)
+		goto out_unlock_kernel;
+
+	q = bdev_get_queue(bdev);
+	if (q == NULL)
+		goto out_bdput;
+
+	mutex_lock(&bdev->bd_mutex);
+	ret = 0;
+	if (q->blk_trace == NULL)
+		ret = blk_trace_setup_queue(q, bdev->bd_dev);
+
+	if (ret == 0) {
+		if (attr == &dev_attr_act_mask)
+			q->blk_trace->act_mask = value;
+		else if (attr == &dev_attr_pid)
+			q->blk_trace->pid = value;
+		else if (attr == &dev_attr_start_lba)
+			q->blk_trace->start_lba = value;
+		else if (attr == &dev_attr_end_lba)
+			q->blk_trace->end_lba = value;
+		ret = count;
+	}
+	mutex_unlock(&bdev->bd_mutex);
+out_bdput:
+	bdput(bdev);
+out_unlock_kernel:
+	unlock_kernel();
+out:
+	return ret;
+}
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index 6d72024..01714ef 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -268,6 +268,10 @@ ssize_t part_fail_store(struct device *dev,
 }
 #endif
 
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+extern struct attribute_group blk_trace_attr_group;
+#endif
+
 static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
 static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
 static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
@@ -294,6 +298,9 @@ static struct attribute_group part_attr_group = {
 
 static struct attribute_group *part_attr_groups[] = {
 	&part_attr_group,
+#ifdef CONFIG_BLK_DEV_IO_TRACE
+	&blk_trace_attr_group,
+#endif
 	NULL
 };
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 54b7278..0d6e314 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -32,6 +32,7 @@ enum trace_type {
 	TRACE_KMEM_ALLOC,
 	TRACE_KMEM_FREE,
 	TRACE_POWER,
+	TRACE_BLK,
 
 	__TRACE_LAST_TYPE,
 };


* Re: [RFC][PATCHv2] ftrace interface for blktrace
  2009-01-20 14:42       ` [RFC][PATCHv2] " Arnaldo Carvalho de Melo
@ 2009-01-20 17:03         ` Steven Rostedt
  2009-01-20 17:58           ` Arnaldo Carvalho de Melo
  2009-01-20 23:23         ` Frederic Weisbecker
  1 sibling, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2009-01-20 17:03 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, linux-kernel


On Tue, 20 Jan 2009, Arnaldo Carvalho de Melo wrote:
> 
> Setup debugfs/tracing environment:
> 
> $ mkdir /d
> $ mount -t debugfs none /d
> $ ln -s /d/tracing /t

OK, now you are just making fun of me ;-)

> 
> Now enable tracing on the desired device, setting up filters if needed:
> 
> $ echo 1 > /sys/block/sda/sda1/trace/enable
> 
> root@f10-1 ~]# head /t/trace
> # tracer: blk
> #
>             bash-955   [000]   565.025000:   8,1    A WBS 3932975 + 8 <- (8,1) 3932912
>             bash-955   [000]   565.025000:   8,1    Q   R 3932975 + 8 [<idle>]
>             bash-955   [000]   565.025000:   8,1    G  RB 3932975 + 8 [<idle>]
>             bash-955   [000]   565.025000:   8,1    P  NS [<idle>]
>             bash-955   [000]   565.025000:   8,1    I RBS 3932975 + 8 [<idle>]
>             bash-955   [000]   565.025000:   8,1    U  WS [<idle>] 1
>             bash-955   [000]   565.025000:   8,1    D  WB 3932975 + 8 [<idle>]
>           <idle>-0     [000]   565.044000:   8,1    C  RS 3932975 + 8 [0]
> [root@f10-1 ~]#
> 
> This is the standardized format used in the sched_switch, sched_wakeup and
> other tracers.
> 
> This is done by adding a tracer option, that starts disabled, named
> "blk_classic", using this construct:
> 
> /* Select an alternative, minimalistic output than the original one */
> #define TRACE_BLK_OPT_CLASSIC   0x1
> 
> static struct tracer_opt blk_tracer_opts[] = {
>         /* Default disable the minimalistic output */
>         { TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
>         { }
> };
> 
> static struct tracer_flags blk_tracer_flags = {
>         .val  = 0,
>         .opts = blk_tracer_opts,
> };
> 
> With this disabled, the ->print_line in our struct tracer always returns
> TRACE_TYPE_UNHANDLED, leaving it to the struct trace_event ->trace routine to
> pretty-print the entry in the ring buffer if it is of type TRACE_BLK, which
> we now register in init_blk_tracer by using register_ftrace_event.
> 
> I did this because I want, at least while this patch is being considered by
> Jens and the other blktrace developers, to also have the classic format,
> activated by:
> 
> [root@f10-1 ~]# echo blk_classic > /t/trace_options 
> [root@f10-1 ~]# cat /t/trace_options 
> print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock nostacktrace nosched-tree ftrace_printk noftrace_preempt nobranch annotate nouserstacktrace nosym-userobj noprintk-msg-only blk_classic
> 
> So that we use struct tracer ->print_line and have full control of what is
> formatted:
> 
> [root@f10-1 ~]# tail /t/trace
>   8,1    0   289.469000  1054  D  WB 526039 + 8 [ls]
>   8,1    0   289.470000     0  C  RS 526039 + 8 [1751]
>   8,1    0   289.470000  1054  A WBS 266439 + 8 <- (8,1) 266376
>   8,1    0   289.470000  1054  Q   R 266439 + 8 [ls]
>   8,1    0   289.470000  1054  G  RB 266439 + 8 [ls]
>   8,1    0   289.470000  1054  P  NS [ls]
>   8,1    0   289.470000  1054  I RBS 266439 + 8 [ls]
>   8,1    0   289.470000  1054  U  WS [ls] 1
>   8,1    0   289.470000  1054  D  WB 266439 + 8 [ls]
>   8,1    0   289.470000     0  C  RS 266439 + 8 [4295]
> [root@f10-1 ~]# 
> 
> Now back to non_classic mode...
> 
> # echo noblk_classic > /t/trace_options
> 
> ... where we can mix trace events by doing...
> 
> # echo stacktrace > /t/trace_options
> 
> ... and get something interesting:
> 
> [root@f10-1 ~]# tail /t/trace
>               ls-1060  [000]   431.697000:   8,1    A WBS 12320847 + 8 <- (8,1) 12320784
>               ls-1060  [000]   431.697000: 
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>  <= ext3_iget
>  <= ext3_lookup
>  <= do_lookup
>  <= __link_path_walk
>               ls-1060  [000]   431.697000:   8,1    Q   R 12320847 + 8 [ls]
>               ls-1060  [000]   431.697000: 
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>  <= ext3_iget
>  <= ext3_lookup
>  <= do_lookup
>               ls-1060  [000]   431.697000:   8,1    G  RB 12320847 + 8 [ls]
>               ls-1060  [000]   431.697000: 
>  <= get_request
>  <= get_request_wait
>  <= __make_request
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>               ls-1060  [000]   431.697000:   8,1    P  NS [ls]
>               ls-1060  [000]   431.697000: 
>  <= __make_request
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>  <= ext3_iget
>  <= ext3_lookup
>               ls-1060  [000]   431.697000:   8,1    I RBS 12320847 + 8 [ls]
>               ls-1060  [000]   431.697000: 
>  <= elv_insert
>  <= __elv_add_request
>  <= __make_request
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
> <SNIP>
> 
> Now let's disable stacktraces:
> 
> # echo nostacktrace > /t/trace_options
> 
> And try doing things a bit more verbosely, using the TRACE_ITER_VERBOSE
> flag that was already part of the existing ftrace infrastructure.
> 
> # echo verbose > /t/trace_options
> # cat /t/trace_pipe 
>               ls-1065  [000]   694.325000:   8,1   remap WBS 3670207 + 8 <- (8,1) 3670144
>               ls-1065  [000]   694.325000:   8,1   queue   R 3670207 + 8 [ls]
>               ls-1065  [000]   694.325000:   8,1   getrq  RB 3670207 + 8 [ls]
>               ls-1065  [000]   694.325000:   8,1   plug  NS [ls]
>               ls-1065  [000]   694.325000:   8,1   insert RBS 3670207 + 8 [ls]
>               ls-1065  [000]   694.325000:   8,1   unplug_io  WS [ls] 1
>               ls-1065  [000]   694.325000:   8,1   issue  WB 3670207 + 8 [ls]
>           <idle>-0     [000]   694.343000:   8,1   complete  RS 3670207 + 8 [191]
> 
> Hint: look at the action (insert, unplug_io, issue, etc).
> 
> We can make it a bit more verbose using this iter flag; the above is
> just a suggestion on using the standard flags interface.
> 
> Frederic, for now I'm using t->action as the "sub-type" we discussed.
> 
> Now that we have noblk_classic mode we can actually look at the struct
> trace_event->binary() way of doing things + the
> don't-print-anything-just-call-me->binary() flag and then we could
> synthesize the old blktrace format and simply do:
> 
> cat /t/trace_pipe | blkparse -i -
> 
> As a test to check that everything is as the old blktrace(8) codebase
> expects.
> 
> Jens, there are several fields in struct blk_io_trace that could go,
> like pid, cpu, timestamp, because they are already in struct
> trace_entry, and sequence, if my understanding of what was discussed
> here is not flawed:
> 
> http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3375544
> 
> Patch against yesterday's linux-tip,
> 
> Comments?
> 
> - Arnaldo
> 
> diff --git a/block/blktrace.c b/block/blktrace.c
> index b0a2cae..21116c4 100644
> --- a/block/blktrace.c
> +++ b/block/blktrace.c
> @@ -25,9 +25,27 @@
>  #include <linux/time.h>
>  #include <trace/block.h>
>  #include <asm/uaccess.h>
> +#include <../kernel/trace/trace_output.h>
>  
>  static unsigned int blktrace_seq __read_mostly = 1;
>  
> +static struct trace_array *blk_tr;
> +static int __read_mostly  blk_tracer_enabled;
> +
> +/* Select an alternative, minimalistic output than the original one */
> +#define TRACE_BLK_OPT_CLASSIC 	0x1
> +
> +static struct tracer_opt blk_tracer_opts[] = {
> +	/* Default disable the minimalistic output */
> +	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
> +	{ }
> +};
> +
> +static struct tracer_flags blk_tracer_flags = {
> +	.val  = 0,
> +	.opts = blk_tracer_opts,
> +};
> +
>  /* Global reference count of probes */
>  static DEFINE_MUTEX(blk_probe_mutex);
>  static atomic_t blk_probes_ref = ATOMIC_INIT(0);
> @@ -43,6 +61,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
>  {
>  	struct blk_io_trace *t;
>  
> +	if (!bt->rchan)
> +		return;
> +
>  	t = relay_reserve(bt->rchan, sizeof(*t) + len);
>  	if (t) {
>  		const int cpu = smp_processor_id();
> @@ -90,6 +111,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
>  	unsigned long flags;
>  	char *buf;
>  
> +	if (!bt->msg_data)
> +		return;
> +
>  	local_irq_save(flags);
>  	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
>  	va_start(args, fmt);
> @@ -131,13 +155,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
>  {
>  	struct task_struct *tsk = current;
> +	struct ring_buffer_event *event = NULL;
>  	struct blk_io_trace *t;
>  	unsigned long flags;
>  	unsigned long *sequence;
>  	pid_t pid;
> -	int cpu;
> +	int cpu, pc = 0;
>  
> -	if (unlikely(bt->trace_state != Blktrace_running))
> +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
>  		return;
>  
>  	what |= ddir_act[rw & WRITE];
> @@ -150,6 +175,24 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	pid = tsk->pid;
>  	if (unlikely(act_log_check(bt, what, sector, pid)))
>  		return;
> +	cpu = raw_smp_processor_id();
> +
> +	if (blk_tr) {
> +		struct trace_entry *ent;
> +		tracing_record_cmdline(current);
> +
> +		event = ring_buffer_lock_reserve(blk_tr->buffer,
> +						 sizeof(*t) + pdu_len, &flags);
> +		if (!event)
> +			return;
> +		
> +		ent = ring_buffer_event_data(event);
> +		t = (struct blk_io_trace *)ent;
> +		pc = preempt_count();
> +		tracing_generic_entry_update(ent, 0, pc);
> +		ent->type = TRACE_BLK;
> +		goto record_it;
> +	}
>  
>  	/*
>  	 * A word about the locking here - we disable interrupts to reserve
> @@ -163,23 +206,33 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  
>  	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);

Do we still use relay?

-- Steve

>  	if (t) {
> -		cpu = smp_processor_id();
>  		sequence = per_cpu_ptr(bt->sequence, cpu);
>  
>  		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
>  		t->sequence = ++(*sequence);
>  		t->time = ktime_to_ns(ktime_get());
> +		t->cpu = cpu;
> +		t->pid = pid;
> +record_it:
>  		t->sector = sector;
>  		t->bytes = bytes;
>  		t->action = what;
> -		t->pid = pid;
>  		t->device = bt->dev;
> -		t->cpu = cpu;
>  		t->error = error;
>  		t->pdu_len = pdu_len;
>  
>  		if (pdu_len)
>  			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
> +
> +		if (blk_tr) {
> +			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
> +			if (pid != 0 &&
> +			    (blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC) == 0 &&
> +			    (trace_flags & TRACE_ITER_STACKTRACE) != 0)
> +				__trace_stack(blk_tr, NULL, flags, 5, pc);
> +			trace_wake_up();
> +			return;
> +		}
>  	}
>  
>  	local_irq_restore(flags);
> @@ -888,3 +941,584 @@ static void blk_unregister_tracepoints(void)
>  
>  	tracepoint_synchronize_unregister();
>  }


* Re: [RFC][PATCHv2] ftrace interface for blktrace
  2009-01-20 17:03         ` Steven Rostedt
@ 2009-01-20 17:58           ` Arnaldo Carvalho de Melo
  2009-01-20 18:23             ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2009-01-20 17:58 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, linux-kernel

On Tue, Jan 20, 2009 at 12:03:17PM -0500, Steven Rostedt wrote:
> 
> On Tue, 20 Jan 2009, Arnaldo Carvalho de Melo wrote:
> > 
> > Setup debugfs/tracing environment:
> > 
> > $ mkdir /d
> > $ mount -t debugfs none /d
> > $ ln -s /d/tracing /t
> 
> OK, now you are just making fun of me ;-)

8-)
 
> > Now enable tracing on the desired device, setting up filters if needed:
> > 
> > $ echo 1 > /sys/block/sda/sda1/trace/enable
> > 
> > root@f10-1 ~]# head /t/trace
> > # tracer: blk
> > #
> >             bash-955   [000]   565.025000:   8,1    A WBS 3932975 + 8 <- (8,1) 3932912
> >             bash-955   [000]   565.025000:   8,1    Q   R 3932975 + 8 [<idle>]
> >             bash-955   [000]   565.025000:   8,1    G  RB 3932975 + 8 [<idle>]
> >             bash-955   [000]   565.025000:   8,1    P  NS [<idle>]
> >             bash-955   [000]   565.025000:   8,1    I RBS 3932975 + 8 [<idle>]
> >             bash-955   [000]   565.025000:   8,1    U  WS [<idle>] 1
> >             bash-955   [000]   565.025000:   8,1    D  WB 3932975 + 8 [<idle>]
> >           <idle>-0     [000]   565.044000:   8,1    C  RS 3932975 + 8 [0]
> > [root@f10-1 ~]#
> > 
> > This is the standardized format used in the sched_switch, sched_wakeup and
> > other tracers.
> > 
> > This is done by adding a tracer option, that starts disabled, named
> > "blk_classic", using this construct:
> > 
> > /* Select an alternative, minimalistic output than the original one */
> > #define TRACE_BLK_OPT_CLASSIC   0x1
> > 
> > static struct tracer_opt blk_tracer_opts[] = {
> >         /* Default disable the minimalistic output */
> >         { TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
> >         { }
> > };
> > 
> > static struct tracer_flags blk_tracer_flags = {
> >         .val  = 0,
> >         .opts = blk_tracer_opts,
> > };
> > 
> > With this disabled, the ->print_line in our struct tracer always returns
> > TRACE_TYPE_UNHANDLED, leaving it to the struct trace_event ->trace routine to
> > pretty-print the entry in the ring buffer if it is of type TRACE_BLK, which
> > we now register in init_blk_tracer by using register_ftrace_event.
> > 
> > I did this because I want, at least while this patch is being considered by
> > Jens and the other blktrace developers, to also have the classic format,
> > activated by:
> > 
> > [root@f10-1 ~]# echo blk_classic > /t/trace_options 
> > [root@f10-1 ~]# cat /t/trace_options 
> > print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock nostacktrace nosched-tree ftrace_printk noftrace_preempt nobranch annotate nouserstacktrace nosym-userobj noprintk-msg-only blk_classic
> > 
> > So that we use struct tracer ->print_line and have full control of what is
> > formatted:
> > 
> > [root@f10-1 ~]# tail /t/trace
> >   8,1    0   289.469000  1054  D  WB 526039 + 8 [ls]
> >   8,1    0   289.470000     0  C  RS 526039 + 8 [1751]
> >   8,1    0   289.470000  1054  A WBS 266439 + 8 <- (8,1) 266376
> >   8,1    0   289.470000  1054  Q   R 266439 + 8 [ls]
> >   8,1    0   289.470000  1054  G  RB 266439 + 8 [ls]
> >   8,1    0   289.470000  1054  P  NS [ls]
> >   8,1    0   289.470000  1054  I RBS 266439 + 8 [ls]
> >   8,1    0   289.470000  1054  U  WS [ls] 1
> >   8,1    0   289.470000  1054  D  WB 266439 + 8 [ls]
> >   8,1    0   289.470000     0  C  RS 266439 + 8 [4295]
> > [root@f10-1 ~]# 
> > 
> > Now back to non_classic mode...
> > 
> > # echo noblk_classic > /t/trace_options
> > 
> > ... where we can mix trace events by doing...
> > 
> > # echo stacktrace > /t/trace_options
> > 
> > ... and get something interesting:
> > 
> > [root@f10-1 ~]# tail /t/trace
> >               ls-1060  [000]   431.697000:   8,1    A WBS 12320847 + 8 <- (8,1) 12320784
> >               ls-1060  [000]   431.697000: 
> >  <= submit_bio
> >  <= submit_bh
> >  <= __ext3_get_inode_loc
> >  <= ext3_iget
> >  <= ext3_lookup
> >  <= do_lookup
> >  <= __link_path_walk
> >               ls-1060  [000]   431.697000:   8,1    Q   R 12320847 + 8 [ls]
> >               ls-1060  [000]   431.697000: 
> >  <= generic_make_request
> >  <= submit_bio
> >  <= submit_bh
> >  <= __ext3_get_inode_loc
> >  <= ext3_iget
> >  <= ext3_lookup
> >  <= do_lookup
> >               ls-1060  [000]   431.697000:   8,1    G  RB 12320847 + 8 [ls]
> >               ls-1060  [000]   431.697000: 
> >  <= get_request
> >  <= get_request_wait
> >  <= __make_request
> >  <= generic_make_request
> >  <= submit_bio
> >  <= submit_bh
> >  <= __ext3_get_inode_loc
> >               ls-1060  [000]   431.697000:   8,1    P  NS [ls]
> >               ls-1060  [000]   431.697000: 
> >  <= __make_request
> >  <= generic_make_request
> >  <= submit_bio
> >  <= submit_bh
> >  <= __ext3_get_inode_loc
> >  <= ext3_iget
> >  <= ext3_lookup
> >               ls-1060  [000]   431.697000:   8,1    I RBS 12320847 + 8 [ls]
> >               ls-1060  [000]   431.697000: 
> >  <= elv_insert
> >  <= __elv_add_request
> >  <= __make_request
> >  <= generic_make_request
> >  <= submit_bio
> >  <= submit_bh
> >  <= __ext3_get_inode_loc
> > <SNIP>
> > 
> > Now let's disable stacktraces:
> > 
> > # echo nostacktrace > /t/trace_options
> > 
> > And try doing things a bit more verbosely, using the TRACE_ITER_VERBOSE
> > flag that was already part of the existing ftrace infrastructure.
> > 
> > # echo verbose > /t/trace_options
> > # cat /t/trace_pipe 
> >               ls-1065  [000]   694.325000:   8,1   remap WBS 3670207 + 8 <- (8,1) 3670144
> >               ls-1065  [000]   694.325000:   8,1   queue   R 3670207 + 8 [ls]
> >               ls-1065  [000]   694.325000:   8,1   getrq  RB 3670207 + 8 [ls]
> >               ls-1065  [000]   694.325000:   8,1   plug  NS [ls]
> >               ls-1065  [000]   694.325000:   8,1   insert RBS 3670207 + 8 [ls]
> >               ls-1065  [000]   694.325000:   8,1   unplug_io  WS [ls] 1
> >               ls-1065  [000]   694.325000:   8,1   issue  WB 3670207 + 8 [ls]
> >           <idle>-0     [000]   694.343000:   8,1   complete  RS 3670207 + 8 [191]
> > 
> > Hint: look at the action (insert, unplug_io, issue, etc).
> > 
> > We can make it a bit more verbose using this iter flag; the above is
> > just a suggestion on using the standard flags interface.
> > 
> > Frederic, for now I'm using t->action as the "sub-type" we discussed.
> > 
> > Now that we have noblk_classic mode we can actually look at the struct
> > trace_event->binary() way of doing things + the
> > don't-print-anything-just-call-me->binary() flag and then we could
> > synthesize the old blktrace format and simply do:
> > 
> > cat /t/trace_pipe | blkparse -i -
> > 
> > As a test to check that everything is as the old blktrace(8) codebase
> > expects.
> > 
> > Jens, there are several fields in struct blk_io_trace that could go,
> > like pid, cpu, timestamp, because they are already in struct
> > trace_entry, and sequence, if my understanding of what was discussed
> > here is not flawed:
> > 
> > http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3375544
> > 
> > Patch against yesterday's linux-tip,
> > 
> > Comments?
> > 
> > - Arnaldo
> > 
> > diff --git a/block/blktrace.c b/block/blktrace.c
> > index b0a2cae..21116c4 100644
> > --- a/block/blktrace.c
> > +++ b/block/blktrace.c
> > @@ -25,9 +25,27 @@
> >  #include <linux/time.h>
> >  #include <trace/block.h>
> >  #include <asm/uaccess.h>
> > +#include <../kernel/trace/trace_output.h>
> >  
> >  static unsigned int blktrace_seq __read_mostly = 1;
> >  
> > +static struct trace_array *blk_tr;
> > +static int __read_mostly  blk_tracer_enabled;
> > +
> > +/* Select an alternative, minimalistic output than the original one */
> > +#define TRACE_BLK_OPT_CLASSIC 	0x1
> > +
> > +static struct tracer_opt blk_tracer_opts[] = {
> > +	/* Default disable the minimalistic output */
> > +	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
> > +	{ }
> > +};
> > +
> > +static struct tracer_flags blk_tracer_flags = {
> > +	.val  = 0,
> > +	.opts = blk_tracer_opts,
> > +};
> > +
> >  /* Global reference count of probes */
> >  static DEFINE_MUTEX(blk_probe_mutex);
> >  static atomic_t blk_probes_ref = ATOMIC_INIT(0);
> > @@ -43,6 +61,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
> >  {
> >  	struct blk_io_trace *t;
> >  
> > +	if (!bt->rchan)
> > +		return;
> > +
> >  	t = relay_reserve(bt->rchan, sizeof(*t) + len);
> >  	if (t) {
> >  		const int cpu = smp_processor_id();
> > @@ -90,6 +111,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
> >  	unsigned long flags;
> >  	char *buf;
> >  
> > +	if (!bt->msg_data)
> > +		return;
> > +
> >  	local_irq_save(flags);
> >  	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
> >  	va_start(args, fmt);
> > @@ -131,13 +155,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> >  		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
> >  {
> >  	struct task_struct *tsk = current;
> > +	struct ring_buffer_event *event = NULL;
> >  	struct blk_io_trace *t;
> >  	unsigned long flags;
> >  	unsigned long *sequence;
> >  	pid_t pid;
> > -	int cpu;
> > +	int cpu, pc = 0;
> >  
> > -	if (unlikely(bt->trace_state != Blktrace_running))
> > +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
> >  		return;
> >  
> >  	what |= ddir_act[rw & WRITE];
> > @@ -150,6 +175,24 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> >  	pid = tsk->pid;
> >  	if (unlikely(act_log_check(bt, what, sector, pid)))
> >  		return;
> > +	cpu = raw_smp_processor_id();
> > +
> > +	if (blk_tr) {
> > +		struct trace_entry *ent;
> > +		tracing_record_cmdline(current);
> > +
> > +		event = ring_buffer_lock_reserve(blk_tr->buffer,
> > +						 sizeof(*t) + pdu_len, &flags);
> > +		if (!event)
> > +			return;
> > +		
> > +		ent = ring_buffer_event_data(event);
> > +		t = (struct blk_io_trace *)ent;
> > +		pc = preempt_count();
> > +		tracing_generic_entry_update(ent, 0, pc);
> > +		ent->type = TRACE_BLK;
> > +		goto record_it;
> > +	}
> >  
> >  	/*
> >  	 * A word about the locking here - we disable interrupts to reserve
> > @@ -163,23 +206,33 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> >  
> >  	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
> 
> Do we still use relay?

blkFtrace is an addition to the existing method for the moment; as soon
as we're all satisfied with functionality + performance, we may then
remove that code.
 
> -- Steve

- Arnaldo


* Re: [RFC][PATCHv2] ftrace interface for blktrace
  2009-01-20 17:58           ` Arnaldo Carvalho de Melo
@ 2009-01-20 18:23             ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2009-01-20 18:23 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Jens Axboe, Ingo Molnar, Frederic Weisbecker, linux-kernel

On Tue, Jan 20, 2009 at 03:58:35PM -0200, Arnaldo Carvalho de Melo wrote:
> On Tue, Jan 20, 2009 at 12:03:17PM -0500, Steven Rostedt wrote:
> > > @@ -163,23 +206,33 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
> > >  
> > >  	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
> > 
> > Do we still use relay?
> 
> blkFtrace is an addition to the existing method for the moment; as soon
> as we're all satisfied with functionality + performance, we may then
> remove that code.

I guess that is not a clear answer, so lemme try again:

The relay code is only used if the old method of starting a blk trace
and setting up filters is used, i.e. the IOCTL method.

When we use the sysfs interface to do that, we do not use the relay code
at all.

But most of the rest is shared.
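
To make that concrete, here is the heart of __blk_add_trace() after the
patch, condensed rather than quoted verbatim: the sysfs/ftrace path
reserves space in the ftrace ring buffer, the ioctl path in relay, and
both fall through to the same record_it code that fills in the
blk-specific fields:

	if (blk_tr) {				/* sysfs/ftrace path */
		event = ring_buffer_lock_reserve(blk_tr->buffer,
						 sizeof(*t) + pdu_len, &flags);
		if (!event)
			return;
		t = ring_buffer_event_data(event);
		goto record_it;
	}

	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len); /* ioctl path */
	if (t) {
		t->time = ktime_to_ns(ktime_get());
		t->cpu	= cpu;
		t->pid	= pid;
record_it:
		t->sector = sector;
		t->bytes  = bytes;
		t->action = what;
		/* etc: device, error, pdu_len/pdu_data */
	}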

Except for things like __trace_note_message and the interface provided
by the old blktrace code to insert annotations into the stream.

That will be provided by using standard ftrace methods such as
/t/trace_marker, as demonstrated by:

[root@f10-1 ~]# cat /t/trace_pipe & echo "bye, bye relay" > /t/trace_marker
[1] 1090
[root@f10-1 ~]#             bash-957   [000] 14727.302000: 0: bye, bye relay

- Arnaldo


* Re: [RFC][PATCHv2] ftrace interface for blktrace
  2009-01-20 14:42       ` [RFC][PATCHv2] " Arnaldo Carvalho de Melo
  2009-01-20 17:03         ` Steven Rostedt
@ 2009-01-20 23:23         ` Frederic Weisbecker
  1 sibling, 0 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2009-01-20 23:23 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jens Axboe, Ingo Molnar, Steven Rostedt,
	linux-kernel

On Tue, Jan 20, 2009 at 12:42:19PM -0200, Arnaldo Carvalho de Melo wrote:
> On Sat, Jan 17, 2009 at 06:08:04PM -0200, Arnaldo Carvalho de Melo wrote:
> > On Sat, Jan 17, 2009 at 08:14:30PM +0100, Jens Axboe wrote:
> > > On Sat, Jan 17 2009, Ingo Molnar wrote:
> > > > * Arnaldo Carvalho de Melo <acme@redhat.com> wrote:
> > > > > Hi Jens,
> > > > > 	The patch below adds a ftrace interface for blktrace, allowing 
> > > > > people to use it without any required userspace tools, using sysfs to 
> > > > > setup the act_mask, pid, start_lba, end_lba.
> > > > 
> > > > Very nice patch!
> > > > 
> > > > Jens, I'm wondering what your take on this direction is. I think the
> > > > consolidation effect is great and the built-in IO tracing capabilities are 
> > > > very nice.
> > > 
> > > I like the current patch, from a quick look. I'll look more soonish
> > > (monday). I've always liked the concept of being able to mix various
> > > traces into the same stream, since it makes it MUCH easier to see what
> > > on earth is going wrong. What I didn't like in the patches Acme did was
> > > the enable part; I'll need to look into that. Again, from a quick look,
> > > what is the BKL doing in there?!
> > 
> > I'm just trying to keep the locking expectations of the previous, similar
> > path, in block/ioctl.c, blkdev_ioctl, where we set up q->blk_trace:
> > 
> >         case BLKTRACESTART:
> >         case BLKTRACESTOP:
> >         case BLKTRACESETUP:
> >         case BLKTRACETEARDOWN:
> >                 lock_kernel();
> >                 ret = blk_trace_ioctl(bdev, cmd, (char __user *) arg);
> >                 unlock_kernel();
> >                 break;
> > 
> > So I took the safest path and grabbed the big K lock.
> >  
> > > > We can still get the raw events too and do user-space post-processing, 
> > > > when that is desired - so this does not limit anything that blktrace was 
> > > > able to do before - it only extends it.
> > > > 
> > > > Is there any particular blktrace feature you can think of that is not 
> > > > present in the blktrace ftrace plugin?
> > > 
> > > Can't answer that yet, I'll give it a more thorough look next week.
> > 
> > Looking forward to that. And will get it split into multiple patches,
> > hopefully by monday.
> 
> Ok, still as one patch, but with lots of improvements; let's see them
> using examples:
> 
> Setup debugfs/tracing environment:
> 
> $ mkdir /d
> $ mount -t debugfs none /d
> $ ln -s /d/tracing /t


:-)


> Now enable tracing on the desired device, setting up filters if needed:
> 
> $ echo 1 > /sys/block/sda/sda1/trace/enable
> 
> root@f10-1 ~]# head /t/trace
> # tracer: blk
> #
>             bash-955   [000]   565.025000:   8,1    A WBS 3932975 + 8 <- (8,1) 3932912
>             bash-955   [000]   565.025000:   8,1    Q   R 3932975 + 8 [<idle>]
>             bash-955   [000]   565.025000:   8,1    G  RB 3932975 + 8 [<idle>]
>             bash-955   [000]   565.025000:   8,1    P  NS [<idle>]
>             bash-955   [000]   565.025000:   8,1    I RBS 3932975 + 8 [<idle>]
>             bash-955   [000]   565.025000:   8,1    U  WS [<idle>] 1
>             bash-955   [000]   565.025000:   8,1    D  WB 3932975 + 8 [<idle>]
>           <idle>-0     [000]   565.044000:   8,1    C  RS 3932975 + 8 [0]
> [root@f10-1 ~]#
> 
> This is the standardized format used in the sched_switch, sched_wakeup and
> other tracers.
> 
> This is done by adding a tracer option, that starts disabled, named
> "blk_classic", using this construct:
> 
> /* Select an alternative, minimalistic output than the original one */
> #define TRACE_BLK_OPT_CLASSIC   0x1
> 
> static struct tracer_opt blk_tracer_opts[] = {
>         /* Default disable the minimalistic output */
>         { TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
>         { }
> };
> 
> static struct tracer_flags blk_tracer_flags = {
>         .val  = 0,
>         .opts = blk_tracer_opts,
> };
> 
> With this disabled, the ->print_line in our struct tracer always returns
> TRACE_TYPE_UNHANDLED, leaving it to the struct trace_event ->trace routine to
> pretty-print the entry in the ring buffer if it is of type TRACE_BLK, which
> we now register in init_blk_tracer by using register_ftrace_event.


Ok... I was about to make a patch to add a flag inside trace_event to let
a tracer decide whether it wants the standard trace headers.

But actually, your idea is good: why not just use an option flag to decide it?
This way, the presence of these headers can be enabled/disabled easily during
tracing.

I will send a patch for that soon. It will spare you this trick with
TRACE_TYPE_UNHANDLED, and your trace_event printer will be sufficient.
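
Something like this, I guess (just a sketch of what I have in mind; none
of it is in the tree yet, and the flag name and the trace_print_context()
helper are made up):

	#define TRACE_ITER_CONTEXT_INFO	(1 << 15)	/* made-up bit value */

	static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
	{
		struct trace_event *event = ftrace_find_event(iter->ent->type);

		/* the standard "comm-pid [cpu] timestamp:" prefix, now optional */
		if (trace_flags & TRACE_ITER_CONTEXT_INFO) {
			if (!trace_print_context(iter))
				return TRACE_TYPE_PARTIAL_LINE;
		}

		if (event)
			return event->trace(&iter->seq, iter->ent, 0);

		return TRACE_TYPE_UNHANDLED;
	}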

 
> I did this because I want, at least while this patch is being considered by
> Jens and the other blktrace developers, to also have the classic format,
> activated by:
> 
> [root@f10-1 ~]# echo blk_classic > /t/trace_options 
> [root@f10-1 ~]# cat /t/trace_options 
> print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock nostacktrace nosched-tree ftrace_printk noftrace_preempt nobranch annotate nouserstacktrace nosym-userobj noprintk-msg-only blk_classic
> 
> So that we use struct tracer ->print_line and have full control of what is
> formatted:
> 
> [root@f10-1 ~]# tail /t/trace
>   8,1    0   289.469000  1054  D  WB 526039 + 8 [ls]
>   8,1    0   289.470000     0  C  RS 526039 + 8 [1751]
>   8,1    0   289.470000  1054  A WBS 266439 + 8 <- (8,1) 266376
>   8,1    0   289.470000  1054  Q   R 266439 + 8 [ls]
>   8,1    0   289.470000  1054  G  RB 266439 + 8 [ls]
>   8,1    0   289.470000  1054  P  NS [ls]
>   8,1    0   289.470000  1054  I RBS 266439 + 8 [ls]
>   8,1    0   289.470000  1054  U  WS [ls] 1
>   8,1    0   289.470000  1054  D  WB 266439 + 8 [ls]
>   8,1    0   289.470000     0  C  RS 266439 + 8 [4295]
> [root@f10-1 ~]# 
> 
> Now back to non_classic mode...
> 
> # echo noblk_classic > /t/trace_options
> 
> ... where we can mix trace events by doing...
> 
> # echo stacktrace > /t/trace_options
> 
> ... and get something interesting:
> 
> [root@f10-1 ~]# tail /t/trace
>               ls-1060  [000]   431.697000:   8,1    A WBS 12320847 + 8 <- (8,1) 12320784
>               ls-1060  [000]   431.697000: 
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>  <= ext3_iget
>  <= ext3_lookup
>  <= do_lookup
>  <= __link_path_walk
>               ls-1060  [000]   431.697000:   8,1    Q   R 12320847 + 8 [ls]
>               ls-1060  [000]   431.697000: 
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>  <= ext3_iget
>  <= ext3_lookup
>  <= do_lookup
>               ls-1060  [000]   431.697000:   8,1    G  RB 12320847 + 8 [ls]
>               ls-1060  [000]   431.697000: 
>  <= get_request
>  <= get_request_wait
>  <= __make_request
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>               ls-1060  [000]   431.697000:   8,1    P  NS [ls]
>               ls-1060  [000]   431.697000: 
>  <= __make_request
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
>  <= ext3_iget
>  <= ext3_lookup
>               ls-1060  [000]   431.697000:   8,1    I RBS 12320847 + 8 [ls]
>               ls-1060  [000]   431.697000: 
>  <= elv_insert
>  <= __elv_add_request
>  <= __make_request
>  <= generic_make_request
>  <= submit_bio
>  <= submit_bh
>  <= __ext3_get_inode_loc
> <SNIP>
> 
> Now let's disable stacktraces:
> 
> # echo nostacktrace > /t/trace_options
> 
> And try doing things a bit more verbosely, using the TRACE_ITER_VERBOSE
> flag that was already part of the existing ftrace infrastructure.
> 
> # echo verbose > /t/trace_options
> # cat /t/trace_pipe 
>               ls-1065  [000]   694.325000:   8,1   remap WBS 3670207 + 8 <- (8,1) 3670144
>               ls-1065  [000]   694.325000:   8,1   queue   R 3670207 + 8 [ls]
>               ls-1065  [000]   694.325000:   8,1   getrq  RB 3670207 + 8 [ls]
>               ls-1065  [000]   694.325000:   8,1   plug  NS [ls]
>               ls-1065  [000]   694.325000:   8,1   insert RBS 3670207 + 8 [ls]
>               ls-1065  [000]   694.325000:   8,1   unplug_io  WS [ls] 1
>               ls-1065  [000]   694.325000:   8,1   issue  WB 3670207 + 8 [ls]
>           <idle>-0     [000]   694.343000:   8,1   complete  RS 3670207 + 8 [191]
> 
> Hint: look at the action (insert, unplug_io, issue, etc).
> 
> We can make it a bit more verbose using this iter flag; the above is
> just a suggestion on using the standard flags interface.


That looks good. You made good use of the trace options.

> 
> Frederic, for now I'm using t->action as the "sub-type" we discussed.
> 
> Now that we have noblk_classic mode we can actually look at the struct
> trace_event->binary() way of doing things + the
> don't-print-anything-just-call-me->binary() flag and then we could
> synthesize the old blktrace format and simply do:
> 
> cat /t/trace_pipe | blkparse -i -
> 
> As a test to check that everything is as the old blktrace(8) codebase
> expects.


Ok. So I will add this option, say TRACE_ITER_CONTEXT_INFO, to decide
whether you want these pid/cpu/time fields printed.
This way you will be able to provide your trace_event printer without
worrying about them.

Concerning these sub-events, I'm not sure what the best way to provide
them is... They should be generic, and not depend only on blk tracing, if
ftrace is to implement facilities for them.
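
Purely as an illustration of what I mean by generic (this is not code I
have written), such a sub-type could live next to ->type in the common
entry header, with the blk action being just one user of it:

	struct trace_entry {
		unsigned char	type;
		unsigned char	subtype;	/* new, tracer-defined, e.g. the blk action */
		/* ... the existing common fields (flags, pid, ...) ... */
	};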


> Jens, there are several fields in struct blk_io_trace that could go,
> like pid, cpu, timestamp, because they are already in struct
> trace_entry, and sequence, if my understanding of what was discussed
> here is not flawed:
> 
> http://kerneltrap.org/mailarchive/linux-kernel/2008/9/23/3375544
> 
> Patch against yesterday's linux-tip,
> 
> Comments?
> 
> - Arnaldo
> 
> diff --git a/block/blktrace.c b/block/blktrace.c
> index b0a2cae..21116c4 100644
> --- a/block/blktrace.c
> +++ b/block/blktrace.c
> @@ -25,9 +25,27 @@
>  #include <linux/time.h>
>  #include <trace/block.h>
>  #include <asm/uaccess.h>
> +#include <../kernel/trace/trace_output.h>
>  
>  static unsigned int blktrace_seq __read_mostly = 1;
>  
> +static struct trace_array *blk_tr;
> +static int __read_mostly  blk_tracer_enabled;
> +
> +/* Select an alternative, minimalistic output than the original one */
> +#define TRACE_BLK_OPT_CLASSIC 	0x1
> +
> +static struct tracer_opt blk_tracer_opts[] = {
> +	/* Default disable the minimalistic output */
> +	{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC ) },
> +	{ }
> +};
> +
> +static struct tracer_flags blk_tracer_flags = {
> +	.val  = 0,
> +	.opts = blk_tracer_opts,
> +};
> +
>  /* Global reference count of probes */
>  static DEFINE_MUTEX(blk_probe_mutex);
>  static atomic_t blk_probes_ref = ATOMIC_INIT(0);
> @@ -43,6 +61,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
>  {
>  	struct blk_io_trace *t;
>  
> +	if (!bt->rchan)
> +		return;
> +
>  	t = relay_reserve(bt->rchan, sizeof(*t) + len);
>  	if (t) {
>  		const int cpu = smp_processor_id();
> @@ -90,6 +111,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
>  	unsigned long flags;
>  	char *buf;
>  
> +	if (!bt->msg_data)
> +		return;
> +
>  	local_irq_save(flags);
>  	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
>  	va_start(args, fmt);
> @@ -131,13 +155,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
>  {
>  	struct task_struct *tsk = current;
> +	struct ring_buffer_event *event = NULL;
>  	struct blk_io_trace *t;
>  	unsigned long flags;
>  	unsigned long *sequence;
>  	pid_t pid;
> -	int cpu;
> +	int cpu, pc = 0;
>  
> -	if (unlikely(bt->trace_state != Blktrace_running))
> +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
>  		return;
>  
>  	what |= ddir_act[rw & WRITE];
> @@ -150,6 +175,24 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	pid = tsk->pid;
>  	if (unlikely(act_log_check(bt, what, sector, pid)))
>  		return;
> +	cpu = raw_smp_processor_id();
> +
> +	if (blk_tr) {
> +		struct trace_entry *ent;
> +		tracing_record_cmdline(current);
> +
> +		event = ring_buffer_lock_reserve(blk_tr->buffer,
> +						 sizeof(*t) + pdu_len, &flags);
> +		if (!event)
> +			return;
> +		
> +		ent = ring_buffer_event_data(event);
> +		t = (struct blk_io_trace *)ent;
> +		pc = preempt_count();
> +		tracing_generic_entry_update(ent, 0, pc);
> +		ent->type = TRACE_BLK;
> +		goto record_it;
> +	}
>  
>  	/*
>  	 * A word about the locking here - we disable interrupts to reserve
> @@ -163,23 +206,33 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  
>  	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
>  	if (t) {
> -		cpu = smp_processor_id();
>  		sequence = per_cpu_ptr(bt->sequence, cpu);
>  
>  		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
>  		t->sequence = ++(*sequence);
>  		t->time = ktime_to_ns(ktime_get());
> +		t->cpu = cpu;
> +		t->pid = pid;
> +record_it:
>  		t->sector = sector;
>  		t->bytes = bytes;
>  		t->action = what;
> -		t->pid = pid;
>  		t->device = bt->dev;
> -		t->cpu = cpu;
>  		t->error = error;
>  		t->pdu_len = pdu_len;
>  
>  		if (pdu_len)
>  			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
> +
> +		if (blk_tr) {
> +			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
> +			if (pid != 0 &&
> +			    (blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC) == 0 &&
> +			    (trace_flags & TRACE_ITER_STACKTRACE) != 0)
> +				__trace_stack(blk_tr, NULL, flags, 5, pc);
> +			trace_wake_up();
> +			return;
> +		}
>  	}
>  
>  	local_irq_restore(flags);
> @@ -888,3 +941,584 @@ static void blk_unregister_tracepoints(void)
>  
>  	tracepoint_synchronize_unregister();
>  }
> +
> +/*
> + * struct blk_io_tracer formatting routines
> + */
> +
> +static void fill_rwbs(char *rwbs, const struct blk_io_trace *t)
> +{
> +	int i = 0;
> +
> +	if (t->action & BLK_TC_ACT(BLK_TC_DISCARD))	rwbs[i++] = 'D';
> +	else if (t->action & BLK_TC_ACT(BLK_TC_WRITE))	rwbs[i++] = 'W';
> +	else if (t->bytes)				rwbs[i++] = 'R';
> +	else						rwbs[i++] = 'N';
> +
> +	if (t->action & BLK_TC_ACT(BLK_TC_AHEAD))	rwbs[i++] = 'A';
> +	if (t->action & BLK_TC_ACT(BLK_TC_BARRIER))	rwbs[i++] = 'B';
> +	if (t->action & BLK_TC_ACT(BLK_TC_SYNC))	rwbs[i++] = 'S';
> +	if (t->action & BLK_TC_ACT(BLK_TC_META))	rwbs[i++] = 'M';
> +
> +	rwbs[i] = '\0';
> +}
> +
> +static inline
> +const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent)
> +{
> +	return (const struct blk_io_trace *)ent;
> +}
> +
> +static inline const void *pdu_start(const struct trace_entry *ent)
> +{
> +	return te_blk_io_trace(ent) + 1;
> +}
> +
> +static inline u32 t_sec(const struct trace_entry *ent)
> +{
> +	return te_blk_io_trace(ent)->bytes >> 9;
> +}
> +
> +static inline unsigned long long t_sector(const struct trace_entry *ent)
> +{
> +	return te_blk_io_trace(ent)->sector;
> +}
> +
> +static inline __u16 t_error(const struct trace_entry *ent)
> +{
> +	return te_blk_io_trace(ent)->error;
> +}
> +
> +static __u64 get_pdu_int(const struct trace_entry *ent)
> +{
> +	const __u64 *val = pdu_start(ent);
> +	return be64_to_cpu(*val);
> +}
> +
> +static void get_pdu_remap(const struct trace_entry *ent,
> +			  struct blk_io_trace_remap *r)
> +{
> +	const struct blk_io_trace_remap *__r = pdu_start(ent);
> +	__u64 sector = __r->sector;
> +
> +	r->device = be32_to_cpu(__r->device);
> +	r->device_from = be32_to_cpu(__r->device_from);
> +	r->sector = be64_to_cpu(sector);
> +}
> +
> +static int blk_log_action_iter(struct trace_iterator *iter, const char *act)
> +{
> +	char rwbs[6];
> +	unsigned long long ts  = ns2usecs(iter->ts);
> +	unsigned long usec_rem = do_div(ts, USEC_PER_SEC);
> +	unsigned secs	       = (unsigned long)ts;
> +	const struct trace_entry *ent = iter->ent;
> +	const struct blk_io_trace *t = (const struct blk_io_trace *)ent;
> +
> +	fill_rwbs(rwbs, t);
> +
> +	return trace_seq_printf(&iter->seq,
> +				"%3d,%-3d %2d %5d.%06lu %5u %2s %3s ",
> +				MAJOR(t->device), MINOR(t->device), iter->cpu,
> +				secs, usec_rem, ent->pid, act, rwbs);
> +}
> +
> +static int blk_log_action_seq(struct trace_seq *s, const struct blk_io_trace *t,
> +			      const char *act)
> +{
> +	char rwbs[6];
> +	fill_rwbs(rwbs, t);
> +	return trace_seq_printf(s, "%3d,%-3d %2s %3s ",
> +				MAJOR(t->device), MINOR(t->device), act, rwbs);
> +}
> +
> +static int blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
> +{
> +	const char *cmd = trace_find_cmdline(ent->pid);
> +
> +	if (t_sec(ent))
> +		return trace_seq_printf(s, "%llu + %u [%s]\n",
> +					t_sector(ent), t_sec(ent), cmd);
> +	return trace_seq_printf(s, "[%s]\n", cmd);
> +}
> +
> +static int blk_log_with_error(struct trace_seq *s, const struct trace_entry *ent)
> +{
> +	if (t_sec(ent))
> +		return trace_seq_printf(s, "%llu + %u [%d]\n", t_sector(ent),
> +					t_sec(ent), t_error(ent));
> +	return trace_seq_printf(s, "%llu [%d]\n", t_sector(ent), t_error(ent));
> +}
> +
> +static int blk_log_remap(struct trace_seq *s, const struct trace_entry *ent)
> +{
> +	struct blk_io_trace_remap r = { .device = 0, };
> +
> +	get_pdu_remap(ent, &r);
> +	return trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
> +			       t_sector(ent),
> +			       t_sec(ent), MAJOR(r.device), MINOR(r.device),
> +			       (unsigned long long)r.sector);
> +}
> +
> +static int blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
> +{
> +	return trace_seq_printf(s, "[%s]\n", trace_find_cmdline(ent->pid));
> +}
> +
> +static int blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent)
> +{
> +	return trace_seq_printf(s, "[%s] %llu\n", trace_find_cmdline(ent->pid),
> +				get_pdu_int(ent));
> +}
> +
> +static int blk_log_split(struct trace_seq *s, const struct trace_entry *ent)
> +{
> +	return trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
> +				get_pdu_int(ent), trace_find_cmdline(ent->pid));
> +}
> +
> +/*
> + * struct tracer operations
> + */
> +
> +static void blk_tracer_print_header(struct seq_file *m)
> +{
> +	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
> +		return;
> +	seq_puts(m, "# DEV   CPU TIMESTAMP     PID ACT FLG\n"
> +		    "#  |     |     |           |   |   |\n");
> +}
> +
> +static void blk_tracer_start(struct trace_array *tr)
> +{
> +	int cpu;
> +
> +	tr->time_start = ftrace_now(tr->cpu);
> +
> +	for_each_online_cpu(cpu)
> +		tracing_reset(tr, cpu);
> +
> +	mutex_lock(&blk_probe_mutex);
> +	if (atomic_add_return(1, &blk_probes_ref) == 1)
> +		if (blk_register_tracepoints())
> +			atomic_dec(&blk_probes_ref);
> +	mutex_unlock(&blk_probe_mutex);
> +}


I don't remember if I talked about this mutex or about another one...
But your start callback is not supposed to race, neither with itself nor
with your reset callback.

So perhaps the mutexes are not necessary.

You will, as an example, find mutexes protecting the tracepoints in the
sched_switch tracer. This is because the boot tracer uses the sched_switch
tracer and enables/disables it dynamically, sometimes asynchronously
(no... actually that was when fastboot still used async initcalls).

But your tracer shouldn't actually be enabled/disabled in any way other
than through the current_tracer file, unless you plan to have other
callsites for it?
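
If that is the case, the start path could drop the mutex and keep only
the refcount. A sketch of the simplification, assuming writes to
current_tracer really are the only caller:

	static void blk_tracer_start(struct trace_array *tr)
	{
		int cpu;

		tr->time_start = ftrace_now(tr->cpu);

		for_each_online_cpu(cpu)
			tracing_reset(tr, cpu);

		/* no blk_probe_mutex: current_tracer writes serialize us */
		if (atomic_add_return(1, &blk_probes_ref) == 1)
			if (blk_register_tracepoints())
				atomic_dec(&blk_probes_ref);
	}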


> +static int blk_tracer_init(struct trace_array *tr)
> +{
> +	blk_tr = tr;
> +	blk_tracer_start(tr);
> +	mutex_lock(&blk_probe_mutex);
> +	blk_tracer_enabled++;
> +	mutex_unlock(&blk_probe_mutex);
> +	return 0;
> +}
> +
> +static void blk_tracer_stop(struct trace_array *tr)
> +{
> +	mutex_lock(&blk_probe_mutex);
> +	if (atomic_dec_and_test(&blk_probes_ref))
> +		blk_unregister_tracepoints();
> +	mutex_unlock(&blk_probe_mutex);
> +}


But... I didn't think about the stop callback. When is it used? I don't
remember... Perhaps there are racy situations I'm not seeing.


> +static void blk_tracer_reset(struct trace_array *tr)
> +{
> +	if (!atomic_read(&blk_probes_ref))
> +		return;
> +
> +	mutex_lock(&blk_probe_mutex);
> +	blk_tracer_enabled--;
> +	WARN_ON(blk_tracer_enabled < 0);
> +	mutex_unlock(&blk_probe_mutex);
> +
> +	blk_tracer_stop(tr);
> +}


blk_tracer_enabled can be a simple boolean, for the non-racy reason I gave :-)
I have fewer doubts here, since the stop callback is not involved.
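
Concretely, the init/reset pair could just do (a sketch):

	static bool blk_tracer_enabled __read_mostly;

	static int blk_tracer_init(struct trace_array *tr)
	{
		blk_tr = tr;
		blk_tracer_start(tr);
		blk_tracer_enabled = true;	/* no counter, no mutex */
		return 0;
	}

	static void blk_tracer_reset(struct trace_array *tr)
	{
		blk_tracer_enabled = false;
		blk_tracer_stop(tr);
	}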

> +
> +static struct {
> +	const char *act[2];
> +	int 	   (*print)(struct trace_seq *s, const struct trace_entry *ent);
> +} what2act[] __read_mostly = {
> +	[__BLK_TA_QUEUE]	= {{  "Q", "queue" }, 	   blk_log_generic },
> +	[__BLK_TA_BACKMERGE]	= {{  "M", "backmerge" },  blk_log_generic },
> +	[__BLK_TA_FRONTMERGE]	= {{  "F", "frontmerge" }, blk_log_generic },
> +	[__BLK_TA_GETRQ]	= {{  "G", "getrq" },	   blk_log_generic },
> +	[__BLK_TA_SLEEPRQ]	= {{  "S", "sleeprq" },	   blk_log_generic },
> +	[__BLK_TA_REQUEUE]	= {{  "R", "requeue" },	   blk_log_with_error },
> +	[__BLK_TA_ISSUE]	= {{  "D", "issue" },	   blk_log_generic },
> +	[__BLK_TA_COMPLETE]	= {{  "C", "complete" },   blk_log_with_error },
> +	[__BLK_TA_PLUG]		= {{  "P", "plug" },	   blk_log_plug },
> +	[__BLK_TA_UNPLUG_IO]	= {{  "U", "unplug_io" },  blk_log_unplug },
> +	[__BLK_TA_UNPLUG_TIMER]	= {{ "UT", "unplug_timer" }, blk_log_unplug },
> +	[__BLK_TA_INSERT]	= {{  "I", "insert" },	   blk_log_generic },
> +	[__BLK_TA_SPLIT]	= {{  "X", "split" },	   blk_log_split },
> +	[__BLK_TA_BOUNCE]	= {{  "B", "bounce" },	   blk_log_generic },
> +	[__BLK_TA_REMAP]	= {{  "A", "remap" },	   blk_log_remap },
> +};
> +
> +static int blk_trace_event_print(struct trace_seq *s, struct trace_entry *ent,
> +				 int flags)
> +{
> +	const struct blk_io_trace *t = (struct blk_io_trace *)ent;
> +	const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1);
> +	int ret;
> +
> +	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
> +		ret = trace_seq_printf(s, "Bad pc action %x\n", what);
> +	else {
> +		const bool long_act = !!(trace_flags & TRACE_ITER_VERBOSE);
> +		ret = blk_log_action_seq(s, t, what2act[what].act[long_act]);
> +		if (ret)
> +			ret = what2act[what].print(s, ent);
> +	}
> +
> +	return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
> +{
> +	const struct blk_io_trace *t;
> +	u16 what;
> +	int ret;
> +
> +	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
> +		return TRACE_TYPE_UNHANDLED;
> +
> +	t = (const struct blk_io_trace *)iter->ent;
> +	what = t->action & ((1 << BLK_TC_SHIFT) - 1);
> +
> +	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
> +		ret = trace_seq_printf(&iter->seq, "Bad pc action %x\n", what);
> +	else {
> +		const bool long_act = !!(trace_flags & TRACE_ITER_VERBOSE);
> +		ret = blk_log_action_iter(iter, what2act[what].act[long_act]);
> +		if (ret)
> +			ret = what2act[what].print(&iter->seq, iter->ent);
> +	}
> +
> +	return ret ? TRACE_TYPE_HANDLED : TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +static struct tracer blk_tracer __read_mostly = {
> +	.name		= "blk",
> +	.init		= blk_tracer_init,
> +	.reset		= blk_tracer_reset,
> +	.start		= blk_tracer_start,
> +	.stop		= blk_tracer_stop,
> +	.print_header	= blk_tracer_print_header,
> +	.print_line	= blk_tracer_print_line,
> +	.flags		= &blk_tracer_flags,
> +};
> +
> +static struct trace_event trace_blk_event = {
> +	.type	 	= TRACE_BLK,
> +	.trace		= blk_trace_event_print,
> +	.latency_trace	= blk_trace_event_print,
> +	.raw		= trace_nop_print,
> +	.hex		= trace_nop_print,
> +	.binary		= trace_nop_print,
> +};
> +
> +static int __init init_blk_tracer(void)
> +{
> +	if (!register_ftrace_event(&trace_blk_event)) {
> +		pr_warning("Warning: could not register block events\n");
> +		return 1;
> +	}
> +
> +	if (register_tracer(&blk_tracer) != 0) {
> +		pr_warning("Warning: could not register the block tracer\n");
> +		unregister_ftrace_event(&trace_blk_event);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +device_initcall(init_blk_tracer);
> +
> +static int blk_trace_remove_queue(struct request_queue *q)
> +{
> +	struct blk_trace *bt;
> +
> +	bt = xchg(&q->blk_trace, NULL);
> +	if (bt == NULL)
> +		return -EINVAL;
> +
> +	kfree(bt);
> +	return 0;
> +}
> +
> +/*
> + * Setup everything required to start tracing
> + */
> +static int blk_trace_setup_queue(struct request_queue *q, dev_t dev)
> +{
> +	struct blk_trace *old_bt, *bt = NULL;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
> +	if (!bt)
> +		goto err;
> +
> +	bt->dev = dev;
> +	bt->act_mask = (u16)-1;
> +	bt->end_lba = -1ULL;
> +	bt->trace_state = Blktrace_running;
> +
> +	old_bt = xchg(&q->blk_trace, bt);
> +	if (old_bt != NULL) {
> +		(void)xchg(&q->blk_trace, old_bt);
> +		kfree(bt);
> +		return -EBUSY;
> +	}
> +	return 0;
> +err:
> +	return ret;
> +}
> +
> +/*
> + * sysfs interface to enable and configure tracing
> + */
> +
> +static ssize_t sysfs_blk_trace_enable_show(struct device *dev,
> +					   struct device_attribute *attr,
> +					   char *buf)
> +{
> +	struct hd_struct *p = dev_to_part(dev);
> +	struct block_device *bdev;
> +	ssize_t ret = -ENXIO;
> +
> +	lock_kernel();
> +	bdev = bdget(part_devt(p));
> +	if (bdev != NULL) {
> +		struct request_queue *q = bdev_get_queue(bdev);
> +
> +		if (q != NULL) {
> +			mutex_lock(&bdev->bd_mutex);
> +			ret = sprintf(buf, "%u\n", !!q->blk_trace);
> +			mutex_unlock(&bdev->bd_mutex);
> +		}
> +
> +		bdput(bdev);
> +	}
> +
> +	unlock_kernel();
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_enable_store(struct device *dev,
> +					    struct device_attribute *attr,
> +					    const char *buf, size_t count)
> +{
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct hd_struct *p;
> +	int value;
> +	ssize_t ret = -ENXIO;
> +
> +	if (count == 0 || sscanf(buf, "%d", &value) != 1)
> +		goto out;
> +
> +	lock_kernel();
> +	p = dev_to_part(dev);
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +
> +	mutex_lock(&bdev->bd_mutex);
> +	if (value)
> +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> +	else
> +		ret = blk_trace_remove_queue(q);
> +	mutex_unlock(&bdev->bd_mutex);
> +
> +	if (ret == 0)
> +		ret = count;
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +out:
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count);
> +#define BLK_TRACE_DEVICE_ATTR(_name) \
> +	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
> +		    sysfs_blk_trace_attr_show, \
> +		    sysfs_blk_trace_attr_store)
> +
> +static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR,
> +		   sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store);
> +static BLK_TRACE_DEVICE_ATTR(act_mask);
> +static BLK_TRACE_DEVICE_ATTR(pid);
> +static BLK_TRACE_DEVICE_ATTR(start_lba);
> +static BLK_TRACE_DEVICE_ATTR(end_lba);
> +
> +static struct attribute *blk_trace_attrs[] = {
> +	&dev_attr_enable.attr,
> +	&dev_attr_act_mask.attr,
> +	&dev_attr_pid.attr,
> +	&dev_attr_start_lba.attr,
> +	&dev_attr_end_lba.attr,
> +	NULL
> +};
> +
> +struct attribute_group blk_trace_attr_group = {
> +	.name  = "trace",
> +	.attrs = blk_trace_attrs,
> +};
> +
> +static int blk_str2act_mask(const char *str)
> +{
> +	int mask = 0;
> +	char *copy = kstrdup(str, GFP_KERNEL), *s;
> +
> +	if (copy == NULL)
> +		return -ENOMEM;
> +
> +	s = strstrip(copy);
> +
> +	while (1) {
> +		char *sep = strchr(s, ',');
> +
> +		if (sep != NULL)
> +			*sep = '\0';
> +
> +		if (strcasecmp(s, "barrier") == 0)
> +			mask |= BLK_TC_BARRIER;
> +		else if (strcasecmp(s, "complete") == 0)
> +			mask |= BLK_TC_COMPLETE;
> +		else if (strcasecmp(s, "fs") == 0)
> +			mask |= BLK_TC_FS;
> +		else if (strcasecmp(s, "issue") == 0)
> +			mask |= BLK_TC_ISSUE;
> +		else if (strcasecmp(s, "pc") == 0)
> +			mask |= BLK_TC_PC;
> +		else if (strcasecmp(s, "queue") == 0)
> +			mask |= BLK_TC_QUEUE;
> +		else if (strcasecmp(s, "read") == 0)
> +			mask |= BLK_TC_READ;
> +		else if (strcasecmp(s, "requeue") == 0)
> +			mask |= BLK_TC_REQUEUE;
> +		else if (strcasecmp(s, "sync") == 0)
> +			mask |= BLK_TC_SYNC;
> +		else if (strcasecmp(s, "write") == 0)
> +			mask |= BLK_TC_WRITE;
> +
> +		if (sep == NULL)
> +			break;
> +
> +		s = sep + 1;
> +	}
> +	kfree(copy);
> +
> +	return mask;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct hd_struct *p = dev_to_part(dev);
> +	struct request_queue *q;
> +	struct block_device *bdev;
> +	ssize_t ret = -ENXIO;
> +
> +	lock_kernel();
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +	mutex_lock(&bdev->bd_mutex);
> +	if (q->blk_trace == NULL)
> +		ret = sprintf(buf, "disabled\n");
> +	else if (attr == &dev_attr_act_mask)
> +		ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask);
> +	else if (attr == &dev_attr_pid)
> +		ret = sprintf(buf, "%u\n", q->blk_trace->pid);
> +	else if (attr == &dev_attr_start_lba)
> +		ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
> +	else if (attr == &dev_attr_end_lba)
> +		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
> +	mutex_unlock(&bdev->bd_mutex);
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct hd_struct *p;
> +	u64 value;
> +	ssize_t ret = -ENXIO;
> +
> +	if (count == 0)
> +		goto out;
> +
> +	if (attr == &dev_attr_act_mask) {
> +		if (sscanf(buf, "%llx", &value) != 1) {
> +			/* Assume it is a list of trace category names */
> +			int mask = blk_str2act_mask(buf);
> +			if (mask < 0)
> +				goto out;
> +			value = mask;
> +		}
> +	} else if (sscanf(buf, "%llu", &value) != 1)
> +		goto out;
> +
> +	lock_kernel();
> +	p = dev_to_part(dev);
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +
> +	mutex_lock(&bdev->bd_mutex);
> +	ret = 0;
> +	if (q->blk_trace == NULL)
> +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> +
> +	if (ret == 0) {
> +		if (attr == &dev_attr_act_mask)
> +			q->blk_trace->act_mask = value;
> +		else if (attr == &dev_attr_pid)
> +			q->blk_trace->pid = value;
> +		else if (attr == &dev_attr_start_lba)
> +			q->blk_trace->start_lba = value;
> +		else if (attr == &dev_attr_end_lba)
> +			q->blk_trace->end_lba = value;
> +		ret = count;
> +	}
> +	mutex_unlock(&bdev->bd_mutex);
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +out:
> +	return ret;
> +}
> diff --git a/fs/partitions/check.c b/fs/partitions/check.c
> index 6d72024..01714ef 100644
> --- a/fs/partitions/check.c
> +++ b/fs/partitions/check.c
> @@ -268,6 +268,10 @@ ssize_t part_fail_store(struct device *dev,
>  }
>  #endif
>  
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +extern struct attribute_group blk_trace_attr_group;
> +#endif
> +
>  static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
>  static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
>  static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
> @@ -294,6 +298,9 @@ static struct attribute_group part_attr_group = {
>  
>  static struct attribute_group *part_attr_groups[] = {
>  	&part_attr_group,
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +	&blk_trace_attr_group,
> +#endif
>  	NULL
>  };
>  
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 54b7278..0d6e314 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -32,6 +32,7 @@ enum trace_type {
>  	TRACE_KMEM_ALLOC,
>  	TRACE_KMEM_FREE,
>  	TRACE_POWER,
> +	TRACE_BLK,
>  
>  	__TRACE_LAST_TYPE,
>  };


Ok, most of the parts above are blk-related, so they are harder for me to comment on :-)
But the tracing parts look good in my opinion.

I'm preparing the TRACE_ITER_CONTEXT_INFO patch.

Thanks,
Frederic.


