public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] queue barrier support
@ 2002-02-13 12:51 Jens Axboe
  2002-02-13 13:09 ` Martin Dalecki
  2002-02-13 14:51 ` Daniel Phillips
  0 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2002-02-13 12:51 UTC (permalink / raw)
  To: Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 1793 bytes --]

Hi,

Chris finally got me back in gear with the write barrier patch (for 2.4,
2.5 had a few missing pieces). The following changesets exists:

ChangeSet@1.295, 2002-02-13 13:22:40+01:00, axboe@burns.home.kernel.dk
  Misc small changes and fixes to properly support barrier writes for
  journalled file systems. The support was basically already there
  with the bio merge, since it had not been used at all there were a
  few loose ends though.

  http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/torvalds-20020205173056-16047-c1d11a41ed024864/cset@1.133.114.2?nav=index.html|ChangeSet@-1h


ChangeSet@1.296, 2002-02-13 13:23:34+01:00, axboe@burns.home.kernel.dk
  IDE barrier write support through write cache flushing.

  http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/torvalds-20020205173056-16047-c1d11a41ed024864/cset@1.133.114.3?nav=index.html|ChangeSet@-1h


ChangeSet@1.297, 2002-02-13 13:42:39+01:00, axboe@burns.home.kernel.dk
  Add support for SCSI drivers to indicate support for ordered tags

  http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/torvalds-20020205173056-16047-c1d11a41ed024864/cset@1.133.114.4?nav=index.html|ChangeSet@-1h


ChangeSet@1.298, 2002-02-13 13:43:04+01:00, axboe@burns.home.kernel.dk
  Add ordered tag support to the aic7xxx scsi driver

  http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/torvalds-20020205173056-16047-c1d11a41ed024864/cset@1.133.114.5?nav=index.html|ChangeSet@-1h


ChangeSet@1.299, 2002-02-13 13:44:18+01:00, axboe@burns.home.kernel.dk
  reiserfs barrier write support

  http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/torvalds-20020205173056-16047-c1d11a41ed024864/cset@1.133.114.6?nav=index.html|ChangeSet@-1h

Patches attached, comments welcome.

-- 
Jens Axboe


[-- Attachment #2: barrier-1-c1.295 --]
[-- Type: text/plain, Size: 5697 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.294   -> 1.295  
#	drivers/block/ll_rw_blk.c	1.46    -> 1.47   
#	include/linux/blkdev.h	1.31    -> 1.32   
#	 include/linux/bio.h	1.12    -> 1.13   
#	  include/linux/fs.h	1.81    -> 1.82   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/13	axboe@burns.home.kernel.dk	1.295
# Misc small changes and fixes to properly support barrier writes for
# journalled file systems. The support was basically already there
# with the bio merge, since it had not been used at all there were a
# few loose ends though.
# --------------------------------------------
#
diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c	Wed Feb 13 13:48:11 2002
+++ b/drivers/block/ll_rw_blk.c	Wed Feb 13 13:48:11 2002
@@ -123,6 +123,31 @@
 }
 
 /**
+ * blk_queue_ordered - does this queue support ordered writes
+ * @q:     the request queue
+ * @flag:  see below
+ *
+ * Description:
+ *   For journalled file systems, doing ordered writes on a commit
+ *   block instead of explicitly doing wait_on_buffer (which is bad
+ *   for performance) can be a big win. Block drivers supporting this
+ *   feature should call this function and indicate so.
+ *
+ *   SCSI drivers usually need to support ordered tags, while others
+ *   may have to do a complete drive cache flush if they are using write
+ *   back caching (or not and lying about it)
+ *
+ *   With this in mind, the values are
+ *             QUEUE_ORDERED_NONE:	the default, doesn't support barrier
+ *             QUEUE_ORDERED_TAG:	supports ordered tags
+ *             QUEUE_ORDERED_FLUSH:	supports barrier through cache flush
+ **/
+void blk_queue_ordered(request_queue_t *q, int flag)
+{
+	q->queue_barrier = flag;
+}
+
+/**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
  * @mfn: the alternate make_request function
@@ -1099,7 +1124,12 @@
 
 	spin_lock_prefetch(q->queue_lock);
 
-	barrier = test_bit(BIO_RW_BARRIER, &bio->bi_rw);
+	barrier = bio_barrier(bio);
+
+	if (barrier && q->queue_barrier == QUEUE_BARRIER_NONE) {
+		set_bit(BIO_OPNOTSUPP, &bio->bi_flags);
+		goto end_io;
+	}
 
 	spin_lock_irq(q->queue_lock);
 again:
@@ -1380,7 +1410,7 @@
 	BIO_BUG_ON(!bio->bi_size);
 	BIO_BUG_ON(!bio->bi_io_vec);
 
-	bio->bi_rw = rw;
+	bio->bi_rw |= rw;
 
 	if (rw & WRITE)
 		kstat.pgpgout += count;
@@ -1425,6 +1455,9 @@
 
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;
+
+	if (buffer_ordered(bh))
+		bio->bi_rw |= 1 << BIO_RW_BARRIER;
 
 	return submit_bio(rw, bio);
 }
diff -Nru a/include/linux/bio.h b/include/linux/bio.h
--- a/include/linux/bio.h	Wed Feb 13 13:48:11 2002
+++ b/include/linux/bio.h	Wed Feb 13 13:48:11 2002
@@ -101,6 +101,7 @@
 #define BIO_EOF		2	/* out-out-bounds error */
 #define BIO_SEG_VALID	3	/* nr_hw_seg valid */
 #define BIO_CLONED	4	/* doesn't own data */
+#define BIO_OPNOTSUPP	5	/* not supported */
 
 /*
  * bio bi_rw flags
@@ -123,7 +124,7 @@
 #define bio_offset(bio)		bio_iovec((bio))->bv_offset
 #define bio_sectors(bio)	((bio)->bi_size >> 9)
 #define bio_data(bio)		(page_address(bio_page((bio))) + bio_offset((bio)))
-#define bio_barrier(bio)	((bio)->bi_rw & (1 << BIO_BARRIER))
+#define bio_barrier(bio)	((bio)->bi_rw & (1 << BIO_RW_BARRIER))
 
 /*
  * will die
diff -Nru a/include/linux/blkdev.h b/include/linux/blkdev.h
--- a/include/linux/blkdev.h	Wed Feb 13 13:48:11 2002
+++ b/include/linux/blkdev.h	Wed Feb 13 13:48:11 2002
@@ -174,6 +174,8 @@
 	 */
 	unsigned long		queue_flags;
 
+	char			queue_barrier;
+
 	/*
 	 * protects queue structures from reentrancy
 	 */
@@ -202,6 +204,10 @@
 #define QUEUE_FLAG_PLUGGED	0	/* queue is plugged */
 #define QUEUE_FLAG_CLUSTER	1	/* cluster several segments into 1 */
 
+#define QUEUE_BARRIER_NONE	0	/* barrier not supported */
+#define QUEUE_BARRIER_TAG	1	/* barrier supported through tags */
+#define QUEUE_BARRIER_FLUSH	2	/* barrier supported through flush */
+
 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
 #define blk_mark_plugged(q)	set_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
 #define blk_queue_empty(q)	elv_queue_empty(q)
@@ -308,6 +314,7 @@
 extern void blk_queue_segment_boundary(request_queue_t *q, unsigned long);
 extern void blk_queue_assign_lock(request_queue_t *q, spinlock_t *);
 extern void blk_queue_prep_rq(request_queue_t *q, prep_rq_fn *pfn);
+extern void blk_queue_ordered(request_queue_t *, int);
 
 extern int blk_rq_map_sg(request_queue_t *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h	Wed Feb 13 13:48:11 2002
+++ b/include/linux/fs.h	Wed Feb 13 13:48:11 2002
@@ -221,6 +221,7 @@
 	BH_Wait_IO,	/* 1 if we should write out this buffer */
 	BH_launder,	/* 1 if we should throttle on this buffer */
 	BH_JBD,		/* 1 if it has an attached journal_head */
+	BH_Ordered,	/* 1 if this buffer is a write barrier */
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -278,6 +279,7 @@
 #define buffer_mapped(bh)	__buffer_state(bh,Mapped)
 #define buffer_new(bh)		__buffer_state(bh,New)
 #define buffer_async(bh)	__buffer_state(bh,Async)
+#define buffer_ordered(bh)	__buffer_state(bh,Ordered)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 

[-- Attachment #3: ide-barrier-1-c1.296 --]
[-- Type: text/plain, Size: 4254 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.295   -> 1.296  
#	drivers/ide/ide-probe.c	1.16    -> 1.17   
#	 include/linux/ide.h	1.11    -> 1.12   
#	   drivers/ide/ide.c	1.27    -> 1.28   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/13	axboe@burns.home.kernel.dk	1.296
# IDE barrier write support through write cache flushing.
# --------------------------------------------
#
diff -Nru a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
--- a/drivers/ide/ide-probe.c	Wed Feb 13 13:48:25 2002
+++ b/drivers/ide/ide-probe.c	Wed Feb 13 13:48:25 2002
@@ -633,6 +633,9 @@
 
 	/* This is a driver limit and could be eliminated. */
 	blk_queue_max_phys_segments(q, PRD_ENTRIES);
+
+	/* assume all drivers can do ordered flush, disable this later if not */
+	blk_queue_ordered(q, QUEUE_BARRIER_FLUSH);
 }
 
 /*
diff -Nru a/drivers/ide/ide.c b/drivers/ide/ide.c
--- a/drivers/ide/ide.c	Wed Feb 13 13:48:25 2002
+++ b/drivers/ide/ide.c	Wed Feb 13 13:48:25 2002
@@ -372,6 +372,32 @@
 	return system_bus_speed;
 }
 
+static struct request *ide_queue_flush_cmd(ide_drive_t *drive, struct request *rq, int post)
+{
+	struct request *flush_rq = &HWGROUP(drive)->wrq;
+
+	elv_remove_request(&drive->queue, rq);
+
+	memset(drive->special_buf, 0, sizeof(drive->special_buf));
+
+	ide_init_drive_cmd(flush_rq);
+
+	flush_rq->flags |= REQ_BARRIER;
+	flush_rq->buffer = drive->special_buf;
+	flush_rq->special = rq;
+
+	flush_rq->buffer[0] = (drive->id->cfs_enable_2 & 0x2400) ? WIN_FLUSH_CACHE_EXT : WIN_FLUSH_CACHE;
+
+	/*
+	 * just signal that this is the post-barrier flush
+	 */
+	if (post)
+		flush_rq->flags |= REQ_SPECIAL;
+
+	_elv_add_request(&drive->queue, flush_rq, 0, 0);
+	return flush_rq;
+}
+
 inline int __ide_end_request(ide_hwgroup_t *hwgroup, int uptodate, int nr_secs)
 {
 	ide_drive_t *drive = hwgroup->drive;
@@ -404,7 +430,17 @@
 		add_blkdev_randomness(major(rq->rq_dev));
 		blkdev_dequeue_request(rq);
         	hwgroup->rq = NULL;
-		end_that_request_last(rq);
+
+		/*
+		 * if this is a write barrier, flush the writecache before
+		 * allowing new requests to finsh and before signalling
+		 * completion of this request
+		 */
+		if (rq->flags & REQ_BARRIER)
+			ide_queue_flush_cmd(drive, rq, 1);
+		else
+			end_that_request_last(rq);
+
 		ret = 0;
 	}
 
@@ -757,8 +793,33 @@
 
 	blkdev_dequeue_request(rq);
 	HWGROUP(drive)->rq = NULL;
-	end_that_request_last(rq);
 
+	/*
+	 * if a cache flush fails, disable ordered write support
+	 */
+	if (rq->flags & REQ_BARRIER) {
+		struct request *real_rq = rq->special;
+
+		/*
+		 * best-effort currently, this ignores the fact that there
+		 * may be other barriers currently queued that we can't
+		 * honor any more
+		 */
+		if (err)
+			blk_queue_ordered(&drive->queue, QUEUE_BARRIER_NONE);
+
+		if (rq->flags & REQ_SPECIAL) 
+			end_that_request_last(real_rq);
+		else {
+			/*
+			 * just indicate that we did the pre flush
+			 */
+			real_rq->flags |= REQ_SPECIAL;
+			_elv_add_request(&drive->queue, real_rq, 0, 0);
+		}
+	}
+
+	end_that_request_last(rq);
 	spin_unlock_irqrestore(&ide_lock, flags);
 }
 
@@ -1376,10 +1437,16 @@
 		if (blk_queue_plugged(&drive->queue))
 			BUG();
 
+		rq = elv_next_request(&drive->queue);
+
 		/*
-		 * just continuing an interrupted request maybe
+		 * if rq is a barrier write, issue pre cache flush if not
+		 * already done
 		 */
-		rq = hwgroup->rq = elv_next_request(&drive->queue);
+		if (!(rq->flags ^ (REQ_BARRIER | REQ_SPECIAL)))
+			rq = ide_queue_flush_cmd(drive, rq, 0);
+
+		hwgroup->rq = rq;
 
 		/*
 		 * Some systems have trouble with IDE IRQs arriving while
diff -Nru a/include/linux/ide.h b/include/linux/ide.h
--- a/include/linux/ide.h	Wed Feb 13 13:48:25 2002
+++ b/include/linux/ide.h	Wed Feb 13 13:48:25 2002
@@ -448,6 +448,7 @@
 	byte		acoustic;	/* acoustic management */
 	unsigned int	failures;	/* current failure count */
 	unsigned int	max_failures;	/* maximum allowed failure count */
+	char		special_buf[4];	/* IDE_DRIVE_CMD, free use */
 } ide_drive_t;
 
 /*

[-- Attachment #4: scsi-barrier-1-c1.297 --]
[-- Type: text/plain, Size: 1894 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.296   -> 1.297  
#	drivers/scsi/hosts.c	1.8     -> 1.9    
#	drivers/scsi/hosts.h	1.8     -> 1.9    
#	 drivers/scsi/scsi.c	1.24    -> 1.25   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/13	axboe@burns.home.kernel.dk	1.297
# Add support for SCSI drivers to indicate support for ordered tags
# --------------------------------------------
#
diff -Nru a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
--- a/drivers/scsi/hosts.c	Wed Feb 13 13:50:02 2002
+++ b/drivers/scsi/hosts.c	Wed Feb 13 13:50:02 2002
@@ -230,6 +230,7 @@
 
     retval->select_queue_depths = tpnt->select_queue_depths;
     retval->max_sectors = tpnt->max_sectors;
+    retval->can_order = tpnt->can_order;
 
     if(!scsi_hostlist)
 	scsi_hostlist = retval;
diff -Nru a/drivers/scsi/hosts.h b/drivers/scsi/hosts.h
--- a/drivers/scsi/hosts.h	Wed Feb 13 13:50:02 2002
+++ b/drivers/scsi/hosts.h	Wed Feb 13 13:50:02 2002
@@ -286,6 +286,8 @@
 
     unsigned highmem_io:1;
 
+    unsigned can_order:1;
+
     /*
      * Name of proc directory
      */
@@ -386,6 +388,7 @@
     unsigned unchecked_isa_dma:1;
     unsigned use_clustering:1;
     unsigned highmem_io:1;
+    unsigned can_order:1;
 
     /*
      * Host has rejected a command because it was busy.
diff -Nru a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
--- a/drivers/scsi/scsi.c	Wed Feb 13 13:50:02 2002
+++ b/drivers/scsi/scsi.c	Wed Feb 13 13:50:02 2002
@@ -215,6 +215,8 @@
 
 	if (!SHpnt->use_clustering)
 		clear_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
+	if (SHpnt->can_order)
+		blk_queue_ordered(&SDpnt->request_queue, QUEUE_BARRIER_TAG);
 }
 
 #ifdef MODULE

[-- Attachment #5: aic7xxx-barrier-1-c1.298 --]
[-- Type: text/plain, Size: 1613 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.297   -> 1.298  
#	drivers/scsi/aic7xxx/aic7xxx_linux_host.h	1.5     -> 1.6    
#	drivers/scsi/aic7xxx/aic7xxx_linux.c	1.15    -> 1.16   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/13	axboe@burns.home.kernel.dk	1.298
# Add ordered tag support to the aic7xxx scsi driver
# --------------------------------------------
#
diff -Nru a/drivers/scsi/aic7xxx/aic7xxx_linux.c b/drivers/scsi/aic7xxx/aic7xxx_linux.c
--- a/drivers/scsi/aic7xxx/aic7xxx_linux.c	Wed Feb 13 13:50:23 2002
+++ b/drivers/scsi/aic7xxx/aic7xxx_linux.c	Wed Feb 13 13:50:23 2002
@@ -1635,6 +1635,9 @@
 			}
 		}
 
+		if (cmd->request.flags & REQ_BARRIER)
+			hscb->control |= MSG_ORDERED_Q_TAG;
+
 		hscb->cdb_len = cmd->cmd_len;
 		if (hscb->cdb_len <= 12) {
 			memcpy(hscb->shared_data.cdb, cmd->cmnd, hscb->cdb_len);
diff -Nru a/drivers/scsi/aic7xxx/aic7xxx_linux_host.h b/drivers/scsi/aic7xxx/aic7xxx_linux_host.h
--- a/drivers/scsi/aic7xxx/aic7xxx_linux_host.h	Wed Feb 13 13:50:23 2002
+++ b/drivers/scsi/aic7xxx/aic7xxx_linux_host.h	Wed Feb 13 13:50:23 2002
@@ -89,7 +89,8 @@
 	present: 0,		/* number of 7xxx's present   */\
 	unchecked_isa_dma: 0,	/* no memory DMA restrictions */\
 	use_clustering: ENABLE_CLUSTERING,			\
-	highmem_io: 1						\
+	highmem_io: 1,						\
+	can_order: 1,						\
 }
 
 #endif /* _AIC7XXX_LINUX_HOST_H_ */

[-- Attachment #6: reiser-barrier-1-c1.299 --]
[-- Type: text/plain, Size: 5750 bytes --]

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.298   -> 1.299  
#	fs/reiserfs/journal.c	1.27    -> 1.28   
#	 fs/reiserfs/super.c	1.28    -> 1.29   
#	include/linux/reiserfs_fs_sb.h	1.9     -> 1.10   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/13	axboe@burns.home.kernel.dk	1.299
# reiserfs barrier write support
# --------------------------------------------
#
diff -Nru a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c	Wed Feb 13 13:50:34 2002
+++ b/fs/reiserfs/journal.c	Wed Feb 13 13:50:34 2002
@@ -115,6 +115,21 @@
   return 0 ;
 }
 
+static int barrier_wait_on_buffer(struct super_block *s, struct buffer_head *bh)
+{
+    int ordered = test_and_clear_bit(BH_Ordered, &bh->b_state) ;
+    if (!reiserfs_barrier(s) || ordered) {
+        wait_on_buffer(bh) ;
+    }
+    return 0 ;
+}
+
+static void mark_barrier(struct super_block *s, struct buffer_head *bh)
+{
+	if (reiserfs_barrier(s))
+		set_bit(BH_Ordered, &bh->b_state);
+}
+
 static struct reiserfs_bitmap_node *
 allocate_bitmap_node(struct super_block *p_s_sb) {
   struct reiserfs_bitmap_node *bn ;
@@ -721,7 +736,7 @@
       bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s) ;
       tbh = get_hash_table(SB_JOURNAL_DEV(s), bn, s->s_blocksize) ;
 
-      wait_on_buffer(tbh) ;
+      barrier_wait_on_buffer(s, tbh) ;
       if (!buffer_uptodate(tbh)) {
 	reiserfs_panic(s, "journal-601, buffer write failed\n") ;
       }
@@ -741,9 +756,10 @@
 		   atomic_read(&(jl->j_commit_left)));
   }
 
+  mark_barrier(s, jl->j_commit_bh);
   mark_buffer_dirty(jl->j_commit_bh) ;
   ll_rw_block(WRITE, 1, &(jl->j_commit_bh)) ;
-  wait_on_buffer(jl->j_commit_bh) ;
+  barrier_wait_on_buffer(s, jl->j_commit_bh) ;
   if (!buffer_uptodate(jl->j_commit_bh)) {
     reiserfs_panic(s, "journal-615: buffer write failed\n") ;
   }
@@ -835,8 +851,9 @@
     jh->j_first_unflushed_offset = cpu_to_le32(offset) ;
     jh->j_mount_id = cpu_to_le32(SB_JOURNAL(p_s_sb)->j_mount_id) ;
     set_bit(BH_Dirty, &(SB_JOURNAL(p_s_sb)->j_header_bh->b_state)) ;
+    mark_barrier(p_s_sb, SB_JOURNAL(p_s_sb)->j_header_bh);
     ll_rw_block(WRITE, 1, &(SB_JOURNAL(p_s_sb)->j_header_bh)) ;
-    wait_on_buffer((SB_JOURNAL(p_s_sb)->j_header_bh)) ; 
+    barrier_wait_on_buffer(p_s_sb, (SB_JOURNAL(p_s_sb)->j_header_bh)) ; 
     if (!buffer_uptodate(SB_JOURNAL(p_s_sb)->j_header_bh)) {
       printk( "reiserfs: journal-837: IO error during journal replay\n" );
       return -EIO ;
@@ -1057,7 +1074,7 @@
 	if (!cn->bh) {
 	  reiserfs_panic(s, "journal-1011: cn->bh is NULL\n") ;
 	}
-	wait_on_buffer(cn->bh) ;
+	barrier_wait_on_buffer(s, cn->bh) ;
 	if (!cn->bh) {
 	  reiserfs_panic(s, "journal-1012: cn->bh is NULL\n") ;
 	}
@@ -1163,7 +1180,7 @@
         } else if (test_bit(BLOCK_NEEDS_FLUSH, &cn->state)) {
             clear_bit(BLOCK_NEEDS_FLUSH, &cn->state) ;
             if (!pjl && cn->bh) {
-                wait_on_buffer(cn->bh) ;
+                barrier_wait_on_buffer(s, cn->bh) ;
             }
             /* check again, someone could have logged while we scheduled */
             pjl = find_newer_jl_for_cn(cn) ;
diff -Nru a/fs/reiserfs/super.c b/fs/reiserfs/super.c
--- a/fs/reiserfs/super.c	Wed Feb 13 13:50:34 2002
+++ b/fs/reiserfs/super.c	Wed Feb 13 13:50:34 2002
@@ -380,6 +380,9 @@
     reiserfs_prepare_for_journal(s, SB_BUFFER_WITH_SB(s), 1) ;
     set_sb_umount_state( SB_DISK_SUPER_BLOCK(s), s->u.reiserfs_sb.s_mount_state );
     journal_mark_dirty(&th, s, SB_BUFFER_WITH_SB (s));
+
+    /* force full waits when unmounting */
+    s->u.reiserfs_sb.s_mount_opt &= ~(1 << REISERFS_BARRIER) ;
   }
 
   /* note, journal_release checks for readonly mount, and can decide not
@@ -528,6 +531,9 @@
 	    set_bit (REISERFS_NO_UNHASHED_RELOCATION, mount_options);
 	} else if (!strcmp (this_char, "hashed_relocation")) {
 	    set_bit (REISERFS_HASHED_RELOCATION, mount_options);
+	} else if (!strcmp (this_char, "barrier")) {
+	    set_bit (REISERFS_BARRIER, mount_options);
+	    printk("reiserfs: enabling write barrier code\n") ;
 	} else if (!strcmp (this_char, "test4")) {
 	    set_bit (REISERFS_TEST4, mount_options);
 	} else if (!strcmp (this_char, "nolog")) {
diff -Nru a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h
--- a/include/linux/reiserfs_fs_sb.h	Wed Feb 13 13:50:34 2002
+++ b/include/linux/reiserfs_fs_sb.h	Wed Feb 13 13:50:34 2002
@@ -405,12 +405,8 @@
 #define REISERFS_NO_BORDER 11
 #define REISERFS_NO_UNHASHED_RELOCATION 12
 #define REISERFS_HASHED_RELOCATION 13
-#define REISERFS_TEST4 14 
-
-#define REISERFS_TEST1 11
-#define REISERFS_TEST2 12
-#define REISERFS_TEST3 13
-#define REISERFS_TEST4 14 
+#define REISERFS_BARRIER 14
+#define REISERFS_TEST4 15 
 
 #define reiserfs_r5_hash(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << FORCE_R5_HASH))
 #define reiserfs_rupasov_hash(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << FORCE_RUPASOV_HASH))
@@ -419,6 +415,7 @@
 #define reiserfs_no_border(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_NO_BORDER))
 #define reiserfs_no_unhashed_relocation(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_NO_UNHASHED_RELOCATION))
 #define reiserfs_hashed_relocation(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_HASHED_RELOCATION))
+#define reiserfs_barrier(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_BARRIER))
 #define reiserfs_test4(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << REISERFS_TEST4))
 
 #define dont_have_tails(s) ((s)->u.reiserfs_sb.s_mount_opt & (1 << NOTAIL))

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 12:51 [PATCH] queue barrier support Jens Axboe
@ 2002-02-13 13:09 ` Martin Dalecki
  2002-02-13 13:13   ` Jens Axboe
  2002-02-13 14:51 ` Daniel Phillips
  1 sibling, 1 reply; 23+ messages in thread
From: Martin Dalecki @ 2002-02-13 13:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Jens Axboe wrote:

>Patches attached, comments welcome.
>
>diff -Nru a/include/linux/ide.h b/include/linux/ide.h
>--- a/include/linux/ide.h	Wed Feb 13 13:48:25 2002
>+++ b/include/linux/ide.h	Wed Feb 13 13:48:25 2002
>@@ -448,6 +448,7 @@
> 	byte		acoustic;	/* acoustic management */
> 	unsigned int	failures;	/* current failure count */
> 	unsigned int	max_failures;	/* maximum allowed failure count */
>+	char		special_buf[4];	/* IDE_DRIVE_CMD, free use */
> } ide_drive_t;
>
ide-barrier-1-c1.296:+  memset(drive->special_buf, 0, 
sizeof(drive->special_buf));
ide-barrier-1-c1.296:+  flush_rq->buffer = drive->special_buf;
ide-barrier-1-c1.296:+  char            special_buf[4]; /* 
IDE_DRIVE_CMD, free use */

I just don't see special_buf used anywhere. What is it supposed to be 
used for?
Is the intention to make it look like the SCSI code?

>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 13:09 ` Martin Dalecki
@ 2002-02-13 13:13   ` Jens Axboe
  2002-02-13 14:36     ` Martin Dalecki
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2002-02-13 13:13 UTC (permalink / raw)
  To: Martin Dalecki; +Cc: Linux Kernel

On Wed, Feb 13 2002, Martin Dalecki wrote:
> Jens Axboe wrote:
> 
> >Patches attached, comments welcome.
> >
> >diff -Nru a/include/linux/ide.h b/include/linux/ide.h
> >--- a/include/linux/ide.h	Wed Feb 13 13:48:25 2002
> >+++ b/include/linux/ide.h	Wed Feb 13 13:48:25 2002
> >@@ -448,6 +448,7 @@
> >	byte		acoustic;	/* acoustic management */
> >	unsigned int	failures;	/* current failure count */
> >	unsigned int	max_failures;	/* maximum allowed failure count */
> >+	char		special_buf[4];	/* IDE_DRIVE_CMD, free use */
> >} ide_drive_t;
> >
> ide-barrier-1-c1.296:+  memset(drive->special_buf, 0, 
> sizeof(drive->special_buf));
> ide-barrier-1-c1.296:+  flush_rq->buffer = drive->special_buf;
> ide-barrier-1-c1.296:+  char            special_buf[4]; /* 
> IDE_DRIVE_CMD, free use */
> 
> I just don't see special_buf used anywhere. What is it supposed to be 
> used for?
> Is the intention to make it look like the SCSI code?

See ide.c:ide_queue_flush_cmd(), I'm only using 1 of the bytes but it
has room for 4 like that is typically used when issuing an ata command
directly. So yes, it is used, I'm not adding stuff for fun :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 13:13   ` Jens Axboe
@ 2002-02-13 14:36     ` Martin Dalecki
  2002-02-13 14:41       ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Martin Dalecki @ 2002-02-13 14:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Jens Axboe wrote:

>On Wed, Feb 13 2002, Martin Dalecki wrote:
>
>>Jens Axboe wrote:
>>
>>>Patches attached, comments welcome.
>>>
>>>diff -Nru a/include/linux/ide.h b/include/linux/ide.h
>>>--- a/include/linux/ide.h	Wed Feb 13 13:48:25 2002
>>>+++ b/include/linux/ide.h	Wed Feb 13 13:48:25 2002
>>>@@ -448,6 +448,7 @@
>>>	byte		acoustic;	/* acoustic management */
>>>	unsigned int	failures;	/* current failure count */
>>>	unsigned int	max_failures;	/* maximum allowed failure count */
>>>+	char		special_buf[4];	/* IDE_DRIVE_CMD, free use */
>>>} ide_drive_t;
>>>
>>ide-barrier-1-c1.296:+  memset(drive->special_buf, 0, 
>>sizeof(drive->special_buf));
>>ide-barrier-1-c1.296:+  flush_rq->buffer = drive->special_buf;
>>ide-barrier-1-c1.296:+  char            special_buf[4]; /* 
>>IDE_DRIVE_CMD, free use */
>>
>>I just don't see special_buf used anywhere. What is it supposed to be 
>>used for?
>>Is the intention to make it look like the SCSI code?
>>
>
>See ide.c:ide_queue_flush_cmd(), I'm only using 1 of the bytes but it
>has room for 4 like that is typically used when issuing an ata command
>directly. So yes, it is used, I'm not adding stuff for fun :-)
>

OK I see now. Is this to become analogous to the sr_cmnd field for 
struct scsi_request?
It would make sense to first make them use the same software 
architecture at least and then
to merge as much of this driver mid-layer stuff as possible....



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 14:36     ` Martin Dalecki
@ 2002-02-13 14:41       ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2002-02-13 14:41 UTC (permalink / raw)
  To: Martin Dalecki; +Cc: Linux Kernel

On Wed, Feb 13 2002, Martin Dalecki wrote:
> >On Wed, Feb 13 2002, Martin Dalecki wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>Patches attached, comments welcome.
> >>>
> >>>diff -Nru a/include/linux/ide.h b/include/linux/ide.h
> >>>--- a/include/linux/ide.h	Wed Feb 13 13:48:25 2002
> >>>+++ b/include/linux/ide.h	Wed Feb 13 13:48:25 2002
> >>>@@ -448,6 +448,7 @@
> >>>	byte		acoustic;	/* acoustic management */
> >>>	unsigned int	failures;	/* current failure count */
> >>>	unsigned int	max_failures;	/* maximum allowed failure count */
> >>>+	char		special_buf[4];	/* IDE_DRIVE_CMD, free use */
> >>>} ide_drive_t;
> >>>
> >>ide-barrier-1-c1.296:+  memset(drive->special_buf, 0, 
> >>sizeof(drive->special_buf));
> >>ide-barrier-1-c1.296:+  flush_rq->buffer = drive->special_buf;
> >>ide-barrier-1-c1.296:+  char            special_buf[4]; /* 
> >>IDE_DRIVE_CMD, free use */
> >>
> >>I just don't see special_buf used anywhere. What is it supposed to be 
> >>used for?
> >>Is the intention to make it look like the SCSI code?
> >>
> >
> >See ide.c:ide_queue_flush_cmd(), I'm only using 1 of the bytes but it
> >has room for 4 like that is typically used when issuing an ata command
> >directly. So yes, it is used, I'm not adding stuff for fun :-)
> >
> 
> OK I see now. Is this to become analogous to the sr_cmnd field for 
> struct scsi_request?
> It would make sense to first make them use the same software 
> architecture at least and then
> to merge as much of this driver mid-layer stuff as possible....

Well dunno, typically users of IDE_DRIVE_CMD's are using ide_wait and
thus waits for completion. So request and arguments are almost always
allocated on the stack, which is fine in that case. For out-of-order
execution like the pre and post flushes in the barrier patch, there just
needed to be some bytes available for the data command.

The SCSI sr_cmnd[] cdb is different from the ATA register set, but in
fact in 2.5 the IDE layer could use the cdb in struct request as a temp
buffer for stuff like this. It's exactly what the ide-barrier patch
above could use. So yeah I could clean special_buf[] and just use

	rq->buffer = rq->cmd;

instead.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 12:51 [PATCH] queue barrier support Jens Axboe
  2002-02-13 13:09 ` Martin Dalecki
@ 2002-02-13 14:51 ` Daniel Phillips
  2002-02-13 15:18   ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Daniel Phillips @ 2002-02-13 14:51 UTC (permalink / raw)
  To: Jens Axboe, Linux Kernel

On February 13, 2002 01:51 pm, Jens Axboe wrote:
> Patches attached, comments welcome.

A meta-comment: the BK url's are wonderfully informative and useful, but they 
are long and _ugly_!  Is there anything that can be done about that?

-- 
Daniel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 14:51 ` Daniel Phillips
@ 2002-02-13 15:18   ` Jens Axboe
  2002-02-13 17:47     ` Andreas Dilger
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2002-02-13 15:18 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linux Kernel

On Wed, Feb 13 2002, Daniel Phillips wrote:
> On February 13, 2002 01:51 pm, Jens Axboe wrote:
> > Patches attached, comments welcome.
> 
> A meta-comment: the BK url's are wonderfully informative and useful, but they 
> are long and _ugly_!  Is there anything that can be done about that?

Yeah I like them too, maybe if I just figured out how to get BitKeeper
to dump full changeset info I could just inline them in the mail
instead. I'll look at up and try that next time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 15:18   ` Jens Axboe
@ 2002-02-13 17:47     ` Andreas Dilger
  0 siblings, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2002-02-13 17:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Daniel Phillips, Linux Kernel

On Feb 13, 2002  16:18 +0100, Jens Axboe wrote:
> On Wed, Feb 13 2002, Daniel Phillips wrote:
> > On February 13, 2002 01:51 pm, Jens Axboe wrote:
> > > Patches attached, comments welcome.
> > 
> > A meta-comment: the BK url's are wonderfully informative and useful, but
> > they are long and _ugly_!  Is there anything that can be done about that?
> 
> Yeah I like them too, maybe if I just figured out how to get BitKeeper
> to dump full changeset info I could just inline them in the mail
> instead. I'll look at up and try that next time.

bk send -wgzip_uu -r<rev> - > foo-<rev>.bk

This will dump a gzipped-uuencoded changset to the file.  The receiver
just do "| bk receive [repository] -avv" to import it on the other end.

My preferred format for sending BK CSETs is below.  The gzip_uu CSET
data only adds maybe 10% for large patches, and about doubles the size
of very small patches.  I also created a bz64 (bzip2 + base64) wrapper
which makes the CSET data smaller, but that is only useful if other BK
developers have this wrapper also.

Cheers, Andreas
=============================== bksend ====================================
#!/bin/sh
# A script to format BK changeset output in a manner that is easy to read.
# Andreas Dilger <adilger@turbolabs.com>  13/02/2002

PROG=bksend

usage() {
	echo "usage: $PROG -r<rev>"
	echo -e "\twhere <rev> is of the form '1.23', '1.23..', '1.23..1.27',"
	echo -e "\tor '+' to indicate the most recent revision"

	exit 1
}

case $1 in
-r) REV=$2; shift ;;
-r*) REV=`echo $1 | sed 's/^-r//'` ;;
*) echo "$PROG: no revision given, you probably don't want that";;
esac

[ -z "$REV" ] && usage

bk changes -r$REV
bk export -tpatch -du -h -r$REV
echo -e "\n================================================================\n\n"
bk send -wgzip_uu -r$REV -
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
@ 2002-02-13 18:26 James Bottomley
  2002-02-15  9:02 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: James Bottomley @ 2002-02-13 18:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-scsi

axboe@suse.de said:
> ChangeSet@1.297, 2002-02-13 13:42:39+01:00, axboe@burns.home.kernel.dk
>   Add support for SCSI drivers to indicate support for ordered tags
>   http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/
> torvalds-2002020517305 \ 6-16047-c1d11a41ed024864/cset@1.133.114.4?nav=
> index.html|ChangeSet@-1h

> ChangeSet@1.298, 2002-02-13 13:43:04+01:00, axboe@burns.home.kernel.dk
>   Add ordered tag support to the aic7xxx scsi driver

The rest of the aic7xxx code uses MSG_ORDERED_TASK rather than 
MSG_ORDERED_Q_TAG.  You have to scan through the headers to see that these are 
#defined the same.

A problem (that is probably only an issue for older drives) is that while 
technically the standard requires all 3 types of TAG to be supported if tag 
queueing is, some drives really only have simple tag support in their 
firmware, so you may need to add a blacklist for ordered tags on certain 
drives.

A further issue is that you haven't added anything to the error recovery code 
for this.  If error recovery is activated for the device at the reset level, 
all tags will be discarded by the device.  The eh will retry the failing 
command and then the other tagged commands will be re-issued from the 
scsi_bottom_half_handler (assuming the low level device driver immediately 
fails them with DID_RESET) in the order in which the low level driver failed 
them.  Thus you have potentially completely messed up the ordering when the 
commands all get retried.

James Bottomley




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 18:26 [PATCH] queue barrier support James Bottomley
@ 2002-02-15  9:02 ` Jens Axboe
  2002-02-15 15:15   ` James Bottomley
  2002-02-15 13:41 ` Chris Mason
  2002-02-16 10:20 ` Daniel Phillips
  2 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2002-02-15  9:02 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, linux-scsi

On Wed, Feb 13 2002, James Bottomley wrote:
> axboe@suse.de said:
> > ChangeSet@1.297, 2002-02-13 13:42:39+01:00, axboe@burns.home.kernel.dk
> >   Add support for SCSI drivers to indicate support for ordered tags
> >   http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/
> > torvalds-2002020517305 \ 6-16047-c1d11a41ed024864/cset@1.133.114.4?nav=
> > index.html|ChangeSet@-1h
> 
> > ChangeSet@1.298, 2002-02-13 13:43:04+01:00, axboe@burns.home.kernel.dk
> >   Add ordered tag support to the aic7xxx scsi driver
> 
> The rest of the aic7xxx code uses MSG_ORDERED_TASK rather than 
> MSG_ORDERED_Q_TAG.  You have to scan through the headers to see that
> these are #defined the same.

Ah, the MSG_ORDERED_TASK did probably not exist when I originally did
the patch (a year or so ago, iirc). I'll change the define to follow the
rest of the code.

> A problem (that is probably only an issue for older drives) is that
> while technically the standard requires all 3 types of TAG to be
> supported if tag queueing is, some drives really only have simple tag
> support in their firmware, so you may need to add a blacklist for
> ordered tags on certain drives.

Yes you are right.

> A further issue is that you haven't added anything to the error
> recovery code for this.  If error recovery is activated for the device
> at the reset level, all tags will be discarded by the device.  The eh
> will retry the failing command and then the other tagged commands will
> be re-issued from the scsi_bottom_half_handler (assuming the low level
> device driver immediately fails them with DID_RESET) in the order in
> which the low level driver failed them.  Thus you have potentially
> completely messed up the ordering when the commands all get retried.

I already have this fixed locally by just maintaining a fifo list of
queued commands so we can retry them in the correct order. For 2.5
there's a busy and free list (so no more scanning for a free command
either), I would imagine that the easiest for 2.4 is to just maintain a
busy list in addition to the current array of pending/free commands.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 18:26 [PATCH] queue barrier support James Bottomley
  2002-02-15  9:02 ` Jens Axboe
@ 2002-02-15 13:41 ` Chris Mason
  2002-02-16 10:20 ` Daniel Phillips
  2 siblings, 0 replies; 23+ messages in thread
From: Chris Mason @ 2002-02-15 13:41 UTC (permalink / raw)
  To: James Bottomley, Jens Axboe; +Cc: linux-kernel, linux-scsi



On Wednesday, February 13, 2002 01:26:12 PM -0500 James Bottomley <James.Bottomley@steeleye.com> wrote:

> axboe@suse.de said:
>> ChangeSet@1.297, 2002-02-13 13:42:39+01:00, axboe@burns.home.kernel.dk
>>   Add support for SCSI drivers to indicate support for ordered tags
>>   http://bitmover.com:8888//tmp/v2_logging/athlon.transmeta.com/
>> torvalds-2002020517305 \ 6-16047-c1d11a41ed024864/cset@1.133.114.4?nav=
>> index.html|ChangeSet@-1h
> 
>> ChangeSet@1.298, 2002-02-13 13:43:04+01:00, axboe@burns.home.kernel.dk
>>   Add ordered tag support to the aic7xxx scsi driver
> 
> The rest of the aic7xxx code uses MSG_ORDERED_TASK rather than 
> MSG_ORDERED_Q_TAG.  You have to scan through the headers to see that these are 
># defined the same.

Jens, my patch from yesterday has this fixed.

> 
> A problem (that is probably only an issue for older drives) is that while 
> technically the standard requires all 3 types of TAG to be supported if tag 
> queueing is, some drives really only have simple tag support in their 
> firmware, so you may need to add a blacklist for ordered tags on certain 
> drives.

Yes, this could get sticky.  Does anyone know if other OSes have already
done this?

> 
> A further issue is that you haven't added anything to the error recovery code 
> for this.  If error recovery is activated for the device at the reset level, 
> all tags will be discarded by the device.  The eh will retry the failing 
> command and then the other tagged commands will be re-issued from the 
> scsi_bottom_half_handler (assuming the low level device driver immediately 
> fails them with DID_RESET) in the order in which the low level driver failed 
> them.  Thus you have potentially completely messed up the ordering when the 
> commands all get retried.

I was wondering about this, we would need to change the error handler to 
fail all the requests after the barrier.  I was hoping the driver
did this for us ;-)

-chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15  9:02 ` Jens Axboe
@ 2002-02-15 15:15   ` James Bottomley
  2002-02-15 16:28     ` Chris Mason
  2002-02-15 16:43     ` Mike Anderson
  0 siblings, 2 replies; 23+ messages in thread
From: James Bottomley @ 2002-02-15 15:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: James Bottomley, Chris Mason, linux-kernel, linux-scsi

James.Bottomley@steeleye.com said:
> A further issue is that you haven't added anything to the error
> recovery code for this.  If error recovery is activated for the device
> at the reset level, all tags will be discarded by the device.  The eh
> will retry the failing command and then the other tagged commands will
> be re-issued from the scsi_bottom_half_handler (assuming the low level
> device driver immediately fails them with DID_RESET) in the order in
> which the low level driver failed them.  Thus you have potentially
> completely messed up the ordering when the commands all get retried.

axboe@suse.de said:
> I already have this fixed locally by just maintaining a fifo list of
> queued commands so we can retry them in the correct order. For 2.5
> there's a busy and free list (so no more scanning for a free command
> either), I would imagine that the easiest for 2.4 is to just maintain
> a busy list in addition to the current array of pending/free commands.

mason@suse.com said:
> I was wondering about this, we would need to change the error handler
> to  fail all the requests after the barrier.  I was hoping the driver
> did this for us ;-) 

Unfortunately, this is going to involve deep hackery inside the error handler. 
 The current initial premise is that it can simply retry the failing command 
by issuing an ABORT to the tag and resending it (which can cause a tag to move 
past your barrier).  In an error situation, it really wouldn't be wise to try 
to abort lots of potentially running tags to preserve the barrier ordering 
(because of the overload placed on a known failing component), so I think the 
error handler has to abandon the concept of aborting commands and move 
straight to device reset.  We then carefully resend the commands in FIFO order.

Additionally, you must handle the case that a device is reset by something 
else (in error handler terms, the cc_ua [check condition/unit attention]).  
Here also, the tags would have to be sent back down in FIFO order as soon as 
the condition is detected.

mason@suse.com said:
> Yes, this could get sticky.  Does anyone know if other OSes have
> already done this? 

Other OSs (well the ones I've heard about: Solaris and HP-UX) try to avoid 
ordered tags, mainly because of the performance impact they have---the drive 
tag service algorithms become inefficient in the presence of ordered tags 
since they're usually optimised for all simple tags.

James



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 15:15   ` James Bottomley
@ 2002-02-15 16:28     ` Chris Mason
  2002-02-15 16:51       ` James Bottomley
  2002-02-15 17:09       ` James Bottomley
  2002-02-15 16:43     ` Mike Anderson
  1 sibling, 2 replies; 23+ messages in thread
From: Chris Mason @ 2002-02-15 16:28 UTC (permalink / raw)
  To: James Bottomley, Jens Axboe; +Cc: linux-kernel, linux-scsi


On Friday, February 15, 2002 10:15:58 AM -0500 James Bottomley <James.Bottomley@steeleye.com> wrote:

> mason@suse.com said:
>> I was wondering about this, we would need to change the error handler
>> to  fail all the requests after the barrier.  I was hoping the driver
>> did this for us ;-) 
> 
> Unfortunately, this is going to involve deep hackery inside the error handler. 
>  The current initial premise is that it can simply retry the failing command 
> by issuing an ABORT to the tag and resending it (which can cause a tag to move 
> past your barrier).  In an error situation, it really wouldn't be wise to try 
> to abort lots of potentially running tags to preserve the barrier ordering 
> (because of the overload placed on a known failing component), so I think the 
> error handler has to abandon the concept of aborting commands and move 
> straight to device reset.  We then carefully resend the commands in FIFO order.
> 

Ok, I'll try to narrow the barrier usage a bit, I'm waiting on the
barrier write once it is sent, so I'm not worried about anything done
after the ordered tag.

write X log blocks   (simple tag)
write 1 commit block (ordered tag)
wait on all of them.

All I care about is knowing that all of the log blocks hit the disk
before the commit.  So, if one of the log blocks aborts, I want it
to abort the commit too.  Is this a little easier to implement?

> Additionally, you must handle the case that a device is reset by something 
> else (in error handler terms, the cc_ua [check condition/unit attention]).  
> Here also, the tags would have to be sent back down in FIFO order as soon as 
> the condition is detected.
> 
> mason@suse.com said:
>> Yes, this could get sticky.  Does anyone know if other OSes have
>> already done this? 
> 
> Other OSs (well the ones I've heard about: Solaris and HP-UX) try to avoid 
> ordered tags, mainly because of the performance impact they have---the drive 
> tag service algorithms become inefficient in the presence of ordered tags 
> since they're usually optimised for all simple tags.

I do see that on my drives ;-)  The main reason I think its worth trying
is because the commit block is directly adjacent to the last log block,
so I'm hoping the drive can optimize the commit (even though it is ordered)
better when the OS sends it directly after the last log block.

While I've got linux-scsi cc'd, I'll reask a question from yesterday.
Do the targets with write back caches usually ignore the order tag, 
doing the write in the most efficient way possible instead?

-chris




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 15:15   ` James Bottomley
  2002-02-15 16:28     ` Chris Mason
@ 2002-02-15 16:43     ` Mike Anderson
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Anderson @ 2002-02-15 16:43 UTC (permalink / raw)
  To: James Bottomley; +Cc: Jens Axboe, Chris Mason, linux-kernel, linux-scsi

James Bottomley [James.Bottomley@SteelEye.com] wrote:
> James.Bottomley@steeleye.com said:
> 
> Unfortunately, this is going to involve deep hackery inside the error handler. 
>  The current initial premise is that it can simply retry the failing command 
> by issuing an ABORT to the tag and resending it (which can cause a tag to move 
> past your barrier).  In an error situation, it really wouldn't be wise to try 
> to abort lots of potentially running tags to preserve the barrier ordering 
> (because of the overload placed on a known failing component), so I think the 
> error handler has to abandon the concept of aborting commands and move 
> straight to device reset.  We then carefully resend the commands in FIFO order.
> 
> Additionally, you must handle the case that a device is reset by something 
> else (in error handler terms, the cc_ua [check condition/unit attention]).  
> Here also, the tags would have to be sent back down in FIFO order as soon as 
> the condition is detected.

I agree on the hacker. You also will need to clean the pipe of the unit
attention post the reset or the first command you send down will
be coming right back into the error handler and could get you in to a
error recovery storm.


> 
> James
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michael Anderson
andmike@us.ibm.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 16:28     ` Chris Mason
@ 2002-02-15 16:51       ` James Bottomley
  2002-02-15 17:17         ` Chris Mason
  2002-02-15 22:30         ` Matthias Andree
  2002-02-15 17:09       ` James Bottomley
  1 sibling, 2 replies; 23+ messages in thread
From: James Bottomley @ 2002-02-15 16:51 UTC (permalink / raw)
  To: Chris Mason; +Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi

mason@suse.com said:
> While I've got linux-scsi cc'd, I'll reask a question from yesterday.
> Do the targets with write back caches usually ignore the order tag,
> doing the write in the most efficient way possible instead? 

I'll try to answer, although I'm not quite sure of the basis of the question.

No ordinary SCSI drive that's not on a battery backed circuit should *ever* 
use writeback caching.  They should all come by default as write through.  In 
this case, the tag is acknowleged as completed only when the write has made it 
to the medium.

If you alter the drive parameter page to turn on write back caching, it's 
caveat emptor.  Since you're now telling the drive that you consider hitting 
the cache to be equivalent to hitting the medium (because you know better 
about power failures and the like) it is undefined by the standards how the 
drives write from the cache to the medium---and you shouldn't care about this 
if you have arranged your system to do write back caching.  They will 
acknowlege the tag as completed as soon as it hits the cache, and the ordered 
tag will be order it commits to the cache.

Note, for high end disk arrays, the reverse is usually true since they often 
have internal battery backups for their caches and drives.

If you lose power to the drive that does write back caching before the cache 
flushes, all the indications you've given the journalling fs about write 
commits are voided and the fs is probably inconsistent and not recoverable by 
a journal replay.

Note also that on system shutdown, most devices that use write back caching 
are also expecting a cache flush instruction from the node, which Linux 
doesn't send.

James



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 16:28     ` Chris Mason
  2002-02-15 16:51       ` James Bottomley
@ 2002-02-15 17:09       ` James Bottomley
  1 sibling, 0 replies; 23+ messages in thread
From: James Bottomley @ 2002-02-15 17:09 UTC (permalink / raw)
  To: Chris Mason; +Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi

mason@suse.com said:
> Ok, I'll try to narrow the barrier usage a bit, I'm waiting on the
> barrier write once it is sent, so I'm not worried about anything done
> after the ordered tag.

> write X log blocks   (simple tag) write 1 commit block (ordered tag)
> wait on all of them.

> All I care about is knowing that all of the log blocks hit the disk
> before the commit.  So, if one of the log blocks aborts, I want it to
> abort the commit too.  Is this a little easier to implement? 

Actually, I'm afraid not.  The FS has the notion of a transaction to help it 
with this.  All the SCSI driver sees is a list of IOs with some requests for 
ordered tags.  If we go into error recovery and have to abort a tag, how do we 
know which of the outstanding ordered tags is actually the commit block for 
this transaction? (Remember also that a FS sits on a partition, not a physical 
SCSI device, so even if you single thread your commits per FS, there could 
still be multiple FSs on the same physical device).

That's why I think it's safer to abandon the abort entirely.  If our first 
line of error recovery is device reset, we know we've just cleared the entire 
outstanding tag queue and we can resend all the tags in FIFO order which means 
that we still don't need a concept of "transaction" at the SCSI level (which 
is a good thing, I think), all we need to know is the order in which the tags 
came to us.

Even if some tags were committed but not acknowledged at reset time, it's 
still safe to resend them in the correct ordered sequence because of 
transaction idempotence of block writes.

James



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 16:51       ` James Bottomley
@ 2002-02-15 17:17         ` Chris Mason
  2002-02-15 17:48           ` James Bottomley
  2002-02-15 22:30         ` Matthias Andree
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Mason @ 2002-02-15 17:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: Jens Axboe, linux-kernel, linux-scsi



On Friday, February 15, 2002 11:51:45 AM -0500 James Bottomley <James.Bottomley@steeleye.com> wrote:

> mason@suse.com said:
>> While I've got linux-scsi cc'd, I'll reask a question from yesterday.
>> Do the targets with write back caches usually ignore the order tag,
>> doing the write in the most efficient way possible instead? 
> 
> I'll try to answer, although I'm not quite sure of the basis of the question.
> 
> No ordinary SCSI drive that's not on a battery backed circuit should *ever* 
> use writeback caching.  They should all come by default as write through.  In 
> this case, the tag is acknowleged as completed only when the write has made it 
> to the medium.

Right, friends don't let friends use non-battery back write back caches ;-)

> 
> If you alter the drive parameter page to turn on write back caching, it's 
> caveat emptor.  Since you're now telling the drive that you consider hitting 
> the cache to be equivalent to hitting the medium (because you know better 
> about power failures and the like) it is undefined by the standards how the 
> drives write from the cache to the medium---and you shouldn't care about this 
> if you have arranged your system to do write back caching.  They will 
> acknowlege the tag as completed as soon as it hits the cache, and the ordered 
> tag will be order it commits to the cache.

Ok, this is what I was expecting.  The performance improvement from
using the ordered tags on single drives doesn't seem to be very big, and
given the problems with failed requests, I'm thinking about dropping the
scsi parts of the 2.4 barrier patch, and just worrying about making ide 
drives flush things correctly.  The hard stuff on error recovery can
be tackled in 2.5 and (maybe) ported back later.

But, if high end controllers with write back caching see improvements
because they can combine the log writes better with the patch, it might
be worth keeping, but modified in reiserfs so it flushes IDE caches
by default, but only sends ordered tags when the admin mounts 
with -o barrier_ordered or something.

thanks for the info
-chris


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 17:17         ` Chris Mason
@ 2002-02-15 17:48           ` James Bottomley
  0 siblings, 0 replies; 23+ messages in thread
From: James Bottomley @ 2002-02-15 17:48 UTC (permalink / raw)
  To: Chris Mason; +Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi

mason@suse.com said:
> I'm thinking about dropping the scsi parts of the 2.4 barrier patch,
> and just worrying about making ide  drives flush things correctly.
> The hard stuff on error recovery can be tackled in 2.5 and (maybe)
> ported back later.

That's probably best.  I do agree that 2.5 is the place to play around with 
SCSI error handling first.

I'm willing to help re-do the error handler, since I've always thought that 
abort isn't a good first line of defence because it actually adds to the 
command burden of a failing drive.

Jens, if you want to share the code you already have (or point me to the 
bitkeeper repository where you keep it) I'll look it over.

As far as the back-port to 2.4, as long as you only support SCSI drivers that 
use the new error handler (so we don't have to worry about the obsolete one) 
that should be reasonably easy.

James



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-15 16:51       ` James Bottomley
  2002-02-15 17:17         ` Chris Mason
@ 2002-02-15 22:30         ` Matthias Andree
  1 sibling, 0 replies; 23+ messages in thread
From: Matthias Andree @ 2002-02-15 22:30 UTC (permalink / raw)
  To: linux-kernel

On Fri, 15 Feb 2002, James Bottomley wrote:

> Note also that on system shutdown, most devices that use write back caching 
> are also expecting a cache flush instruction from the node, which Linux 
> doesn't send.

Hair splitting: it looks as though Andre Hedrick's IDE patch did this at
least. Of course, that does not affect SCSI drives, but since you wrote
"Linux", I thought this could be mentioned in this thread again. At
least, my ATA drives are powered down some seconds before my ATX PC is,
and the kernel says about flushing the cache.

-- 
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."         Benjamin Franklin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-13 18:26 [PATCH] queue barrier support James Bottomley
  2002-02-15  9:02 ` Jens Axboe
  2002-02-15 13:41 ` Chris Mason
@ 2002-02-16 10:20 ` Daniel Phillips
  2002-02-16 15:02   ` James Bottomley
  2002-02-25 20:55   ` 929-Emulex, ABTS Command Anamoly! Cindy Sweet
  2 siblings, 2 replies; 23+ messages in thread
From: Daniel Phillips @ 2002-02-16 10:20 UTC (permalink / raw)
  To: James Bottomley, Jens Axboe; +Cc: linux-kernel, linux-scsi

On February 13, 2002 07:26 pm, James Bottomley wrote:
> A problem (that is probably only an issue for older drives) is that while 
> technically the standard requires all 3 types of TAG to be supported if tag 
> queueing is, some drives really only have simple tag support in their 
> firmware, so you may need to add a blacklist for ordered tags on certain 
> drives.

>From user space, I hope.

-- 
Daniel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] queue barrier support
  2002-02-16 10:20 ` Daniel Phillips
@ 2002-02-16 15:02   ` James Bottomley
  2002-02-25 20:55   ` 929-Emulex, ABTS Command Anamoly! Cindy Sweet
  1 sibling, 0 replies; 23+ messages in thread
From: James Bottomley @ 2002-02-16 15:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: James Bottomley, Jens Axboe, linux-kernel, linux-scsi

On February 13, 2002 07:26 pm, James Bottomley wrote:
> A problem (that is probably only an issue for older drives) is that while 
> technically the standard requires all 3 types of TAG to be supported if tag 
> queueing is, some drives really only have simple tag support in their 
> firmware, so you may need to add a blacklist for ordered tags on certain 
> drives.

phillips@bonn-fries.net said:
> From user space, I hope. 

Well, following the current SCSI black/white list procedure, they would be in 
the static device_list table in scsi_scan.c, so no.

I agree that it's not a nice or friendly thing (and is certainly prone to 
delays when you have to get entries into the actual kernel code), but fixing 
this particular annoyance of the scsi subsystem (although I have actually 
thought about doing it) would be a project separate from the queue barrier 
support.

James






^ permalink raw reply	[flat|nested] 23+ messages in thread

* 929-Emulex, ABTS Command Anamoly!
  2002-02-16 10:20 ` Daniel Phillips
  2002-02-16 15:02   ` James Bottomley
@ 2002-02-25 20:55   ` Cindy Sweet
  2002-02-25 21:02     ` arjan
  1 sibling, 1 reply; 23+ messages in thread
From: Cindy Sweet @ 2002-02-25 20:55 UTC (permalink / raw)
  To: linux-kernel, linux-scsi

Is this the right group to discuss about the ABTS
command ??

__________________________________________________
Do You Yahoo!?
Yahoo! Sports - Coverage of the 2002 Olympic Games
http://sports.yahoo.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 929-Emulex, ABTS Command Anamoly!
  2002-02-25 20:55   ` 929-Emulex, ABTS Command Anamoly! Cindy Sweet
@ 2002-02-25 21:02     ` arjan
  0 siblings, 0 replies; 23+ messages in thread
From: arjan @ 2002-02-25 21:02 UTC (permalink / raw)
  To: Cindy Sweet; +Cc: linux-kernel

In article <20020225205527.76533.qmail@web20908.mail.yahoo.com> you wrote:
> Is this the right group to discuss about the ABTS
> command ??

no.

emulex is a binary only driver so it should be discussed with emulex
if you have problems...

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2002-02-25 21:04 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-02-13 18:26 [PATCH] queue barrier support James Bottomley
2002-02-15  9:02 ` Jens Axboe
2002-02-15 15:15   ` James Bottomley
2002-02-15 16:28     ` Chris Mason
2002-02-15 16:51       ` James Bottomley
2002-02-15 17:17         ` Chris Mason
2002-02-15 17:48           ` James Bottomley
2002-02-15 22:30         ` Matthias Andree
2002-02-15 17:09       ` James Bottomley
2002-02-15 16:43     ` Mike Anderson
2002-02-15 13:41 ` Chris Mason
2002-02-16 10:20 ` Daniel Phillips
2002-02-16 15:02   ` James Bottomley
2002-02-25 20:55   ` 929-Emulex, ABTS Command Anamoly! Cindy Sweet
2002-02-25 21:02     ` arjan
  -- strict thread matches above, loose matches on Subject: below --
2002-02-13 12:51 [PATCH] queue barrier support Jens Axboe
2002-02-13 13:09 ` Martin Dalecki
2002-02-13 13:13   ` Jens Axboe
2002-02-13 14:36     ` Martin Dalecki
2002-02-13 14:41       ` Jens Axboe
2002-02-13 14:51 ` Daniel Phillips
2002-02-13 15:18   ` Jens Axboe
2002-02-13 17:47     ` Andreas Dilger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox