public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH] 2.4.x write barriers (updated for ext3)
@ 2002-02-22 15:57 James Bottomley
  2002-02-22 16:10 ` Chris Mason
                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: James Bottomley @ 2002-02-22 15:57 UTC
  To: Stephen C. Tweedie, Chris Mason; +Cc: linux-kernel, James.Bottomley

> Most ext3 commits, in practice, are lazy, asynchronous commits, and we
> only need BH_Ordered_Tag for that, not *_Flush.  It would be easy
> enough to track whether a given transaction has any synchronous
> waiters, and if not, to use the async *_Tag request for the commit
> block instead of forcing a flush.

> We'd also have to track the sync status of the most recent
> transaction, too, so that on fsync of a non-dirty file/inode, we make
> sure that its data had been forced to disk by at least one synchronous
> flush.  

> But that's really only a win for SCSI, where proper async ordered tags
> are supported.  For IDE, the single BH_Ordered_Flush is quite
> sufficient.
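
As a rough illustration of the quoted proposal, the choice between the two 
request types might be modelled as below.  This is a user-space sketch, not 
code from the patch: struct txn, commit_mode_for(), txn_commit() and 
fsync_needs_flush() are hypothetical names, and the enum only stands in for 
the BH_Ordered_Tag and BH_Ordered_Flush buffer flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the two buffer flags discussed in the thread. */
enum commit_mode { COMMIT_ORDERED_TAG, COMMIT_ORDERED_FLUSH };

/* Hypothetical per-transaction bookkeeping. */
struct txn {
    int  sync_waiters;  /* callers blocked until this commit is on disk */
    bool flushed;       /* set if the commit block went out with a flush */
};

/* Lazy commits only need ordering; commits with waiters need a flush. */
static enum commit_mode commit_mode_for(const struct txn *t)
{
    assert(t->sync_waiters >= 0);
    return t->sync_waiters > 0 ? COMMIT_ORDERED_FLUSH : COMMIT_ORDERED_TAG;
}

/* Record how the commit block was written. */
static void txn_commit(struct txn *t)
{
    t->flushed = (commit_mode_for(t) == COMMIT_ORDERED_FLUSH);
}

/* fsync() of a file with nothing dirty: a flush is still owed if the
 * most recent transaction was committed asynchronously. */
static bool fsync_needs_flush(const struct txn *last)
{
    return !last->flushed;
}
```

The second helper covers the case raised in the second quoted paragraph: an 
fsync that finds nothing dirty still has to check how the last commit went out.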

Unfortunately, there's a hole in the SCSI spec that makes ordered tags 
extremely difficult to use in the way you want (although I think this is an 
accident; conceptually, I think they were supposed to be used for this).  For 
the interested, I attach the details at the bottom.

The easy way out of the problem, I think, is to impose the barrier as an 
effective queue plug in the SCSI mid-layer, so that after the mid-layer 
receives the barrier, it plugs the device queue from below, drains the drive 
tag queue, sends the barrier and unplugs the device queue on barrier I/O 
completion.
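
The plug, drain, send, unplug sequence can be sketched as a small state 
machine.  This is a user-space model under stated assumptions, not mid-layer 
code: struct dev_queue, midlayer_submit() and midlayer_complete() are 
hypothetical names, and commands that are refused are assumed to wait in an 
upper queue that is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum cmd_kind { CMD_NORMAL, CMD_BARRIER };

/* Hypothetical view of one device queue from the mid-layer. */
struct dev_queue {
    bool plugged;          /* no new commands may be issued */
    int  outstanding;      /* issued to the drive, not yet completed */
    bool barrier_pending;  /* barrier received, waiting for the drain */
    bool barrier_inflight; /* barrier issued, waiting for completion */
};

/* Returns true if the command goes to the drive now, false if it has
 * to wait above the plug. */
static bool midlayer_submit(struct dev_queue *q, enum cmd_kind kind)
{
    if (q->plugged)
        return false;

    if (kind == CMD_NORMAL) {
        q->outstanding++;
        return true;
    }

    /* Barrier: plug first, then issue only once the drive is empty. */
    q->plugged = true;
    if (q->outstanding == 0) {
        q->barrier_inflight = true;
        q->outstanding = 1;
        return true;
    }
    q->barrier_pending = true;
    return false;
}

/* Called once per command completion reported by the drive. */
static void midlayer_complete(struct dev_queue *q)
{
    assert(q->outstanding > 0);
    q->outstanding--;

    if (q->barrier_inflight) {
        /* The barrier I/O itself finished: unplug. */
        q->barrier_inflight = false;
        q->plugged = false;
    } else if (q->barrier_pending && q->outstanding == 0) {
        /* Drive tag queue has drained: send the barrier on its own. */
        q->barrier_pending = false;
        q->barrier_inflight = true;
        q->outstanding = 1;
    }
}
```

Plugging before the drain is what keeps later commands from slipping ahead of 
the barrier; the cost is that the drive idles while its queue empties.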

Ordinarily, this would produce extremely poor performance since you're 
effectively starving the device to implement the barrier.  However, in Linux 
it might just work because it will give the elevator more time to coalesce 
requests.

James Bottomley

Problems Using Ordered Tags as a Barrier
========================================

Note, the following is independent of re-ordering on error conditions which 
was discussed in a previous thread.  This discussion pertains to normal device 
operations.

The SCSI tag system allows all devices to have a dynamic queue.  This means 
that there is no a priori guarantee about how many tags the device will accept 
before the queue becomes full.

The problem comes because most drivers issue SCSI commands directly from the 
incoming I/O thread but complete them via the SCSI interrupt routine.  What 
this means is that if the SCSI device decides it has no more resources left, 
the driver won't be told until it receives an interrupt indicating that the 
queue is full and the particular I/O wasn't queued.  At this point, the user 
threads may have sent down several more I/Os, and worse still, the SCSI device 
may have accepted some of the later I/Os because the local conditions causing 
it to signal queue full may have abated.

As I read the standard, there's no way to avoid this problem, since the queue 
full signal entitles the device not to queue the command, and not to act on 
any ordering the command may have had.
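
To make the race concrete, here is a toy replay of it.  The drive model is an 
assumption for illustration only (struct drive, drive_offer(), 
drive_finish_one() and simulate_queue_full_race() are hypothetical names); 
command 2 plays the role of the ordered tag.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CMDS 8

/* Hypothetical drive: accepts a command only while a tag is free. */
struct drive {
    int    free_tags;            /* resources currently available */
    int    accepted[MAX_CMDS];   /* command ids in acceptance order */
    size_t n_accepted;
};

/* Returns 1 if queued, 0 for queue full (command dropped, and any
 * ordering it carried is ignored). */
static int drive_offer(struct drive *d, int cmd_id)
{
    if (d->free_tags == 0 || d->n_accepted == MAX_CMDS)
        return 0;
    d->free_tags--;
    d->accepted[d->n_accepted++] = cmd_id;
    return 1;
}

/* The drive finishes some work and frees one tag. */
static void drive_finish_one(struct drive *d)
{
    d->free_tags++;
}

/* Replays the race and copies the acceptance order into order[].
 * Returns the number of accepted commands. */
static size_t simulate_queue_full_race(int order[3])
{
    struct drive d = { .free_tags = 1 };
    int rejected;

    drive_offer(&d, 1);               /* accepted, drive is now full */
    rejected = !drive_offer(&d, 2);   /* ordered tag bounces */
    drive_finish_one(&d);             /* the local condition abates ... */
    drive_offer(&d, 3);               /* ... before the interrupt is seen */
    drive_finish_one(&d);
    if (rejected)
        drive_offer(&d, 2);           /* retry after the interrupt: too late */

    assert(d.n_accepted <= 3);
    for (size_t i = 0; i < d.n_accepted; i++)
        order[i] = d.accepted[i];
    return d.n_accepted;
}
```

In this replay the drive ends up with the commands in the order 1, 3, 2, so 
command 3 has overtaken the command that was meant to order it.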

The other problem is actually driver related, not SCSI.  Most HBA chips are 
intelligent beasts which can process independently of the host CPU.  
Unfortunately, implementing linked lists tends to be rather beyond their 
capabilities.  For this reason, most low level drivers have a certain number 
of queue slots (per device, per chip or whatever).  The usual act of feeding 
an I/O command to a device involves stuffing it in the first available free 
slot.  This can lead to command re-ordering because even though the HBA is 
sequentially processing slots in a round-robin fashion, you don't often know 
which slot it is currently looking at.  Also, the multi-threaded nature of tagged 
command queuing means that the slot will remain full long after the HBA has 
started processing it and moved on to the next slot.
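
A minimal model of the slot problem, assuming a four-slot ring and a chip that 
scans it round-robin.  The names (struct hba, hba_queue(), hba_scan_once()) 
and the slot count are made up for the example and do not describe any 
particular HBA.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4

enum slot_state { SLOT_FREE, SLOT_NEW, SLOT_STARTED };

/* Hypothetical HBA with a fixed ring of command slots. */
struct hba {
    enum slot_state state[NSLOTS];
    int    cmd[NSLOTS];
    size_t scan;   /* next slot the chip looks at; the driver can't see it */
};

/* Driver side: stuff the command into the first free slot.
 * Returns the slot index, or -1 if every slot is occupied. */
static int hba_queue(struct hba *h, int cmd_id)
{
    for (size_t i = 0; i < NSLOTS; i++) {
        if (h->state[i] == SLOT_FREE) {
            h->state[i] = SLOT_NEW;
            h->cmd[i] = cmd_id;
            return (int)i;
        }
    }
    return -1;
}

/* Chip side: one round-robin pass starting at the scan position.
 * Started slots stay occupied (completion is not modelled).
 * Records the issue order in issued[] and returns the count. */
static size_t hba_scan_once(struct hba *h, int issued[NSLOTS])
{
    size_t n = 0;

    assert(h->scan < NSLOTS);
    for (size_t step = 0; step < NSLOTS; step++) {
        size_t i = (h->scan + step) % NSLOTS;
        if (h->state[i] == SLOT_NEW) {
            h->state[i] = SLOT_STARTED;
            issued[n++] = h->cmd[i];
        }
    }
    return n;
}
```

With slots 1 and 2 still occupied by running commands and the chip positioned 
at slot 1, a command queued first lands in slot 0 and one queued second lands 
in slot 3, yet the chip reaches slot 3 first and issues the second command 
before the first.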

One possible get-out is to process the queue full signal internally (either in 
the interrupt routine or in the chip driver itself) and force a re-send of 
the non-queued tag until the drive actually accepts it.  This works as long as 
the looping is done at a level which prevents the device from accepting any 
more commands.  In general, this is nasty because it is effectively a busy wait 
inside the HBA and will block commands to all other devices until the device 
queue has drained sufficiently to accept the tag.

The other possibility would be to treat all pending commands for a particular 
device as queue full errors if we get that for one of them.  This would 
require the interrupt or chip script routine to complete all commands for the 
particular device as queue full, which would still be quite a large amount of 
work for device driver writers.

Finally, I think the driver ordering problem can be solved easily as long as 
an observation I have about your barrier is true.  It seems to me that the 
barrier is only semi-permeable: its purpose is to complete *after* a particular 
set of commands does.  Does this mean that it doesn't matter if later commands 
move through the barrier, and it only matters that earlier commands cannot 
move past it?  If this is true, then we can fix the slot problem simply 
by having a slot dedicated to barrier tags, so the processing engine goes over 
it once per cycle.  However, if it finds the barrier slot full, it doesn't 
issue the command until the *next* cycle, thus ensuring that all commands sent 
down before the barrier (plus a few after) are accepted by the device queue 
before we send the barrier with its ordered tag.
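
The dedicated barrier slot could look roughly like this.  It is a sketch under 
assumptions: the engine, its field names and engine_step() are hypothetical, 
and ordinary slots are freed as soon as they are issued, which a real chip 
would not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS  4
#define MAX_LOG 16

/* Hypothetical processing engine: NSLOTS ordinary slots plus one
 * barrier slot, visited as position NSLOTS in each cycle. */
struct engine {
    bool   slot_full[NSLOTS];
    int    slot_cmd[NSLOTS];
    bool   barrier_full;
    bool   barrier_seen;   /* barrier was found full on an earlier visit */
    int    barrier_cmd;
    size_t pos;            /* 0..NSLOTS-1 ordinary, NSLOTS = barrier slot */
    int    log[MAX_LOG];   /* commands in the order they were issued */
    size_t n_log;
};

static void engine_issue(struct engine *e, int cmd)
{
    assert(e->n_log < MAX_LOG);
    e->log[e->n_log++] = cmd;
}

/* Visit the slot at the current position, then advance by one. */
static void engine_step(struct engine *e)
{
    if (e->pos < NSLOTS) {
        if (e->slot_full[e->pos]) {
            engine_issue(e, e->slot_cmd[e->pos]);
            e->slot_full[e->pos] = false;   /* simplification, see above */
        }
    } else if (e->barrier_full) {
        if (e->barrier_seen) {
            /* Second visit: every ordinary slot has been passed once. */
            engine_issue(e, e->barrier_cmd);
            e->barrier_full = false;
            e->barrier_seen = false;
        } else {
            e->barrier_seen = true;         /* defer to the next cycle */
        }
    }
    e->pos = (e->pos + 1) % (NSLOTS + 1);
}
```

Suppose the engine is at slot 2, command 10 is already waiting in slot 0, the 
barrier (99) is then queued, and a later command 11 lands in slot 3.  Stepping 
the engine issues 11, then 10, then 99: the later command passes through the 
barrier, but the earlier one cannot end up behind it.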




* Re: [PATCH] 2.4.x write barriers (updated for ext3)
@ 2002-03-01 15:26 Dieter Nützel
  2002-03-01 16:00 ` James Bottomley
  0 siblings, 1 reply; 73+ messages in thread
From: Dieter Nützel @ 2002-03-01 15:26 UTC
  To: James Bottomley, Chris Mason; +Cc: Linux Kernel List

James Bottomley wrote:
> mason@suse.com said:
> > So, a little testing with scsi_info shows my scsi drives do have
> > writeback cache on.  great.  What's interesting is they must be doing
> > additional work for ordered tags.  If they were treating the block as
> > written once in cache, using the tags should not change  performance
> > at all.  But, I can clearly show the tags changing performance, and
> > hear the drive write pattern change when tags are on. 

> I checked all mine and they're write through.  However, I inherited all my 
> drives from an enterprise vendor so this might not be that surprising.

How did you check it?
Which scsi_info version?
Mine gave only the info below:

SunWave1 /home/nuetzel# scsi_info /dev/sda
SCSI_ID="0,0,0"
MODEL="IBM DDYS-T18350N"
FW_REV="S96H"
SunWave1 /home/nuetzel# scsi_info /dev/sdb
SCSI_ID="0,1,0"
MODEL="IBM DDRS-34560D"
FW_REV="DC1B"
SunWave1 /home/nuetzel# scsi_info /dev/sdc
SCSI_ID="0,2,0"
MODEL="IBM DDRS-34560W"
FW_REV="S71D"

But when I use "scsi-config" I get under "Cache Control Page":
Read cache enabled: Yes
Write cache enabled: No

I tested setting this by hand some months ago, but the speed 
didn't change in any way (ReiserFS).

> I can surmise why ordered tags kill performance on your drive: since an 
> ordered tag is required to affect the ordering of the write to the medium,
> not the cache, it is probably implemented with an implicit cache flush.
>
> Anyway, the attached patch against 2.4.18 (and I know it's rather gross
> code) will probe the cache type and try to set it to write through on boot.
> See what this does to your performance ordinarily, and also to your
> tagged write barrier performance.

Will test it over the weekend on 2.4.19-pre1aa1 with all Reiserfs 
2.4.18.pending patches applied.

Regards,
	Dieter

* [PATCH] 2.4.x write barriers (updated for ext3)
@ 2002-02-21 23:30 Chris Mason
  2002-02-22 14:19 ` Stephen C. Tweedie
  0 siblings, 1 reply; 73+ messages in thread
From: Chris Mason @ 2002-02-21 23:30 UTC
  To: Andrew Morton, Stephen C. Tweedie; +Cc: linux-kernel


Hi everyone,

I've changed the write barrier code around a little so the block layer 
isn't forced to fail barrier requests the queue can't handle.

This makes it much easier to add support for ide writeback
flushing to things like ext3 and lvm, where I want to make
the minimal possible changes to make things safe.

The full patch is at:
ftp.suse.com/pub/people/mason/patches/2.4.18/queue-barrier-8.diff

There might be additional spots in ext3 where ordering needs to be 
enforced; I've included the ext3 code below in hopes of getting 
some comments.

The only other change was to make reiserfs use the IDE flushing mode
by default.  It falls back to non-ordered calls on scsi.

-chris

--- linus.23/fs/jbd/commit.c Mon, 28 Jan 2002 09:51:50 -0500
+++ linus.23(w)/fs/jbd/commit.c Thu, 21 Feb 2002 17:11:00 -0500
@@ -595,7 +595,15 @@
                struct buffer_head *bh = jh2bh(descriptor);
                clear_bit(BH_Dirty, &bh->b_state);
                bh->b_end_io = journal_end_buffer_io_sync;
+
+               /* if we're on an ide device, setting BH_Ordered_Flush
+                  will force a write cache flush before and after the
+                  commit block.  Otherwise, it'll do nothing.  */
+
+               set_bit(BH_Ordered_Flush, &bh->b_state);
                submit_bh(WRITE, bh);
+               clear_bit(BH_Ordered_Flush, &bh->b_state);
+
                wait_on_buffer(bh);
                put_bh(bh);             /* One for getblk() */
                journal_unlock_journal_head(descriptor);









* [ANNOUNCE] FUSE: Filesystem in Userspace 0.95
@ 2002-01-10  9:55 Miklos Szeredi
  2002-01-13  3:10 ` Pavel Machek
  2002-01-22 19:07 ` Daniel Phillips
  0 siblings, 2 replies; 73+ messages in thread
From: Miklos Szeredi @ 2002-01-10  9:55 UTC
  To: linux-fsdevel, linux-kernel, avfs


FUSE 0.95 is available (download or CVS) from:

   http://sourceforge.net/projects/avf

What's new in 0.95 compared to 0.9:

   - Major performance improvements in both the kernel module and the
     library parts.

   - Small number of bugs fixed.  FUSE has been through some stress
     testing and no problems have turned up yet.

   - Library interface simplified.  A simple 'hello world' filesystem
     can now be implemented in less than 100 lines.

   - Python (by Jeff Epler) and Perl (by Mark Glines) bindings are in
     the works, and will be released some time in the future (now
     available through CVS).

Problems still remaining:

   - Security problems when FUSE is used by non-privileged users:

       o permissions on mountpoint can only be checked by kernel
         (patch exists)

       o user can intentionally block the page writeback operation,
         causing a system lockup.  I'm not sure this can be solved in
         a truly secure way.  Ideas?

Introduction for newbies:

  FUSE provides a simple interface for userspace programs to export a
  virtual filesystem to the Linux kernel.  FUSE also aims to provide a
  secure method for non privileged users to create and mount their own
  filesystem implementations.

  FUSE is available for the 2.4 (and later) kernel series.
  Installation is easy and does not need a kernel recompile.



Thread overview: 73+ messages
2002-02-22 15:57 [PATCH] 2.4.x write barriers (updated for ext3) James Bottomley
2002-02-22 16:10 ` Chris Mason
2002-02-22 16:13 ` Stephen C. Tweedie
2002-02-22 17:36   ` James Bottomley
2002-02-22 18:14     ` Chris Mason
2002-02-28 15:36       ` James Bottomley
2002-02-28 15:55         ` Chris Mason
2002-02-28 17:58           ` Mike Anderson
2002-02-28 18:12           ` Chris Mason
2002-03-01  2:08             ` James Bottomley
2002-03-03 22:11         ` Daniel Phillips
2002-03-04  3:34           ` Chris Mason
2002-03-04  5:05             ` Daniel Phillips
2002-03-04 15:03               ` James Bottomley
2002-03-04 17:04                 ` Stephen C. Tweedie
2002-03-04 17:16                   ` Chris Mason
2002-03-04 18:05                     ` Stephen C. Tweedie
2002-03-04 18:28                       ` James Bottomley
2002-03-04 19:55                         ` Stephen C. Tweedie
2002-03-04 19:48                       ` Daniel Phillips
2002-03-04 19:57                         ` Stephen C. Tweedie
2002-03-04 21:06                           ` Daniel Phillips
2002-03-05 14:58                             ` Stephen C. Tweedie
2002-03-05  7:48                         ` Jens Axboe
2002-03-04 19:51                     ` Daniel Phillips
2002-03-05  7:42                       ` Jens Axboe
2002-03-04 17:35                   ` James Bottomley
2002-03-04 17:48                     ` Chris Mason
2002-03-04 18:11                       ` James Bottomley
2002-03-04 18:41                         ` Chris Mason
2002-03-04 21:34                         ` Stephen C. Tweedie
2002-03-04 18:09                     ` Stephen C. Tweedie
2002-03-04  8:19             ` Helge Hafting
2002-03-04 14:57             ` James Bottomley
2002-03-04 17:24               ` Chris Mason
2002-03-04 19:02                 ` Daniel Phillips
2002-03-05  7:22               ` Jeremy Higdon
2002-03-05 23:01                 ` Daniel Phillips
2002-03-04  4:21           ` Jeremy Higdon
2002-03-04  5:31             ` Daniel Phillips
2002-03-04  6:09               ` Jeremy Higdon
2002-03-04  7:57                 ` Daniel Phillips
2002-03-05  7:09                   ` Jeremy Higdon
2002-03-05 22:56                     ` Daniel Phillips
2002-03-04 16:52                 ` Stephen C. Tweedie
2002-03-04 18:15                   ` Daniel Phillips
2002-03-05  7:40                     ` Jens Axboe
2002-03-05 22:29                       ` Daniel Phillips
2002-03-12  7:01                         ` Jens Axboe
2002-03-10  5:24                   ` Douglas Gilbert
2002-03-11 11:13                     ` Kurt Garloff
2002-03-12  1:17                       ` GOTO Masanori
2002-03-12  6:58                       ` Jens Axboe
2002-03-13 22:37                         ` Peter Osterlund
2002-03-11 11:34                     ` Stephen C. Tweedie
2002-03-11 17:15                       ` James Bottomley
2002-03-04 14:48           ` James Bottomley
2002-03-06 13:59             ` Daniel Phillips
2002-03-06 14:34               ` James Bottomley
2002-02-25 10:57 ` Helge Hafting
2002-02-25 15:04   ` James Bottomley
  -- strict thread matches above, loose matches on Subject: below --
2002-03-01 15:26 Dieter Nützel
2002-03-01 16:00 ` James Bottomley
2002-02-21 23:30 Chris Mason
2002-02-22 14:19 ` Stephen C. Tweedie
2002-02-22 15:26   ` Chris Mason
2002-01-10  9:55 [ANNOUNCE] FUSE: Filesystem in Userspace 0.95 Miklos Szeredi
2002-01-13  3:10 ` Pavel Machek
2002-01-21 10:18   ` Miklos Szeredi
2002-01-23 10:47     ` Pavel Machek
2002-01-22 19:07 ` Daniel Phillips
2002-01-23  2:33   ` [Avfs] " Justin Mason
2002-01-23  5:26     ` Daniel Phillips
