* [PATCH] barrier patch set
@ 2004-03-19 15:35 Jens Axboe
2004-03-19 16:30 ` Mika Penttilä
` (3 more replies)
0 siblings, 4 replies; 63+ messages in thread
From: Jens Axboe @ 2004-03-19 15:35 UTC (permalink / raw)
To: Linux Kernel; +Cc: Chris Mason
Hi,
A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I
have a few changes planned to support dm/md + sata, I'll do those
changes over the weekend.
Reiser has the best barrier support, ext3 works but only if things don't
go wrong. So only attempt to use the barrier feature on ext3 if on ide
drives, not SCSI nor SATA.
Also note that for reiser you need to add:
-o barrier=flush
while ext3 currently wants:
-o barrier=1
Cosmetic stuff that will get ironed out. You can find the patches here:
ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1-mm2/
ide-barrier-2.6.5-rc1-mm2-1
ide/core part
ext3-barrier-2.6.5-rc1-mm2-1
ext3 part
reiserfs-current-2.6.5-rc1-mm2-1
current reiser tree, get it here in parts:
ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4
(use series.mm for apply order)
reiserfs-barrier-2.6.5-rc1-mm2-1
reiser part.
or just apply
all-barrier-2.6.5-rc1-mm2-1
all rolled up into one patch.
--
Jens Axboe
^ permalink raw reply [flat|nested] 63+ messages in thread* Re: [PATCH] barrier patch set 2004-03-19 15:35 [PATCH] barrier patch set Jens Axboe @ 2004-03-19 16:30 ` Mika Penttilä 2004-03-19 18:16 ` Jens Axboe 2004-03-19 16:34 ` Jeff Garzik ` (2 subsequent siblings) 3 siblings, 1 reply; 63+ messages in thread From: Mika Penttilä @ 2004-03-19 16:30 UTC (permalink / raw) To: Jens Axboe; +Cc: Linux Kernel, Chris Mason Jens Axboe wrote: >Hi, > >A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I >have a few changes planned to support dm/md + sata, I'll do those >changes over the weekend. > >Reiser has the best barrier support, ext3 works but only if things don't >go wrong. So only attempt to use the barrier feature on ext3 if on ide >drives, not SCSI nor SATA. > > > What are these brutal pieces...? +static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq) +{ + if (rq->cmd[0] != 0x35) { + ide_end_request(drive, 0, 0); + return 1; + } + + if (!drive->wcache) { + ide_end_request(drive, 1, 0); + return 1; + } + + ide_fill_flush_cmd(drive, rq); + return 0; +} /* + * basic transformation support for scsi -> ata commands + */ + if (blk_pc_request(rq)) { + if (drive->media != ide_disk) + goto kill_rq; + if (ide_transform_pc_req(drive, rq)) + return ide_stopped; + } + --Mika ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 16:30 ` Mika Penttilä @ 2004-03-19 18:16 ` Jens Axboe 2004-03-19 18:44 ` Mika Penttilä 0 siblings, 1 reply; 63+ messages in thread From: Jens Axboe @ 2004-03-19 18:16 UTC (permalink / raw) To: Mika Penttilä; +Cc: Linux Kernel, Chris Mason On Fri, Mar 19 2004, Mika Penttil? wrote: > > > Jens Axboe wrote: > > >Hi, > > > >A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I > >have a few changes planned to support dm/md + sata, I'll do those > >changes over the weekend. > > > >Reiser has the best barrier support, ext3 works but only if things don't > >go wrong. So only attempt to use the barrier feature on ext3 if on ide > >drives, not SCSI nor SATA. > > > > > > > What are these brutal pieces...? > > > +static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq) > +{ > + if (rq->cmd[0] != 0x35) { > + ide_end_request(drive, 0, 0); > + return 1; > + } > + > + if (!drive->wcache) { > + ide_end_request(drive, 1, 0); > + return 1; > + } > + > + ide_fill_flush_cmd(drive, rq); > + return 0; > +} > > > /* > + * basic transformation support for scsi -> ata commands > + */ > + if (blk_pc_request(rq)) { > + if (drive->media != ide_disk) > + goto kill_rq; > + if (ide_transform_pc_req(drive, rq)) > + return ide_stopped; > + } Hmm, I thought it was pretty obvious, even just from the naming and comments. Right now, the block layer issued flush without data attached (ie a drive barrier without pinning it to a buffer) comes as a scsi synchronize cache command. I'm going to change this anyways and allow queue hook of a ->issue_flush_fn() that can just tailored to ide or scsi, _or_ dm/md and that sort of thing. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 18:16 ` Jens Axboe @ 2004-03-19 18:44 ` Mika Penttilä 2004-03-20 9:55 ` Jens Axboe 0 siblings, 1 reply; 63+ messages in thread From: Mika Penttilä @ 2004-03-19 18:44 UTC (permalink / raw) To: Jens Axboe; +Cc: Linux Kernel, Chris Mason Jens Axboe wrote: >On Fri, Mar 19 2004, Mika Penttil? wrote: > > >>Jens Axboe wrote: >> >> >> >>>Hi, >>> >>>A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I >>>have a few changes planned to support dm/md + sata, I'll do those >>>changes over the weekend. >>> >>>Reiser has the best barrier support, ext3 works but only if things don't >>>go wrong. So only attempt to use the barrier feature on ext3 if on ide >>>drives, not SCSI nor SATA. >>> >>> >>> >>> >>> >>What are these brutal pieces...? >> >> >>+static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq) >>+{ >>+ if (rq->cmd[0] != 0x35) { >>+ ide_end_request(drive, 0, 0); >>+ return 1; >>+ } >>+ >>+ if (!drive->wcache) { >>+ ide_end_request(drive, 1, 0); >>+ return 1; >>+ } >>+ >>+ ide_fill_flush_cmd(drive, rq); >>+ return 0; >>+} >> >> >>/* >>+ * basic transformation support for scsi -> ata commands >>+ */ >>+ if (blk_pc_request(rq)) { >>+ if (drive->media != ide_disk) >>+ goto kill_rq; >>+ if (ide_transform_pc_req(drive, rq)) >>+ return ide_stopped; >>+ } >> >> > >Hmm, I thought it was pretty obvious, even just from the naming and >comments. Right now, the block layer issued flush without data attached >(ie a drive barrier without pinning it to a buffer) comes as a scsi >synchronize cache command. I'm going to change this anyways and allow >queue hook of a ->issue_flush_fn() that can just tailored to ide or >scsi, _or_ dm/md and that sort of thing. > > I mean other BLOCK_PC requests than SYNCHRONIZE CACHE -> ide_end_request() and ide_stopped. --Mika ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 18:44 ` Mika Penttilä @ 2004-03-20 9:55 ` Jens Axboe 0 siblings, 0 replies; 63+ messages in thread From: Jens Axboe @ 2004-03-20 9:55 UTC (permalink / raw) To: Mika Penttilä; +Cc: Linux Kernel, Chris Mason On Fri, Mar 19 2004, Mika Penttil? wrote: > > > Jens Axboe wrote: > > >On Fri, Mar 19 2004, Mika Penttil? wrote: > > > > > >>Jens Axboe wrote: > >> > >> > >> > >>>Hi, > >>> > >>>A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I > >>>have a few changes planned to support dm/md + sata, I'll do those > >>>changes over the weekend. > >>> > >>>Reiser has the best barrier support, ext3 works but only if things don't > >>>go wrong. So only attempt to use the barrier feature on ext3 if on ide > >>>drives, not SCSI nor SATA. > >>> > >>> > >>> > >>> > >>> > >>What are these brutal pieces...? > >> > >> > >>+static int ide_transform_pc_req(ide_drive_t *drive, struct request *rq) > >>+{ > >>+ if (rq->cmd[0] != 0x35) { > >>+ ide_end_request(drive, 0, 0); > >>+ return 1; > >>+ } > >>+ > >>+ if (!drive->wcache) { > >>+ ide_end_request(drive, 1, 0); > >>+ return 1; > >>+ } > >>+ > >>+ ide_fill_flush_cmd(drive, rq); > >>+ return 0; > >>+} > >> > >> > >>/* > >>+ * basic transformation support for scsi -> ata commands > >>+ */ > >>+ if (blk_pc_request(rq)) { > >>+ if (drive->media != ide_disk) > >>+ goto kill_rq; > >>+ if (ide_transform_pc_req(drive, rq)) > >>+ return ide_stopped; > >>+ } > >> > >> > > > >Hmm, I thought it was pretty obvious, even just from the naming and > >comments. Right now, the block layer issued flush without data attached > >(ie a drive barrier without pinning it to a buffer) comes as a scsi > >synchronize cache command. I'm going to change this anyways and allow > >queue hook of a ->issue_flush_fn() that can just tailored to ide or > >scsi, _or_ dm/md and that sort of thing. > > > > > I mean other BLOCK_PC requests than SYNCHRONIZE CACHE -> > ide_end_request() and ide_stopped. That's a bug, good catch. Added to fixme list for next release :) -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 15:35 [PATCH] barrier patch set Jens Axboe 2004-03-19 16:30 ` Mika Penttilä @ 2004-03-19 16:34 ` Jeff Garzik 2004-03-19 18:19 ` Jens Axboe ` (2 more replies) 2004-03-19 16:48 ` Marc-Christian Petersen 2004-03-22 11:09 ` Andrew Morton 3 siblings, 3 replies; 63+ messages in thread From: Jeff Garzik @ 2004-03-19 16:34 UTC (permalink / raw) To: Jens Axboe; +Cc: Linux Kernel, Chris Mason, B.Zolnierkiewicz Jens Axboe wrote: > Cosmetic stuff that will get ironed out. You can find the patches here: > > ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1-mm2/ > > ide-barrier-2.6.5-rc1-mm2-1 > ide/core part WRT ATA and flush-cache... before the IDE pieces of this patch are merged, IMO it is a requirement that the entire flush-cache stuff gets a review. ide_get_error_location() is one of the important pieces (great!). Another important piece is to make sure that a drive's flush-cache capability is correctly deduced and set up from the identify-device. The steps look like - check identify-device word 83, bits 12 (flush cache) and 13 (flush cache ext) - issue set-features command to get flush-cache into proper state (enabled or disabled, as the user desires), if identify-device word 86 indicates it is not already in the state you seek. - re-read identify [packet] device page from device, make sure flush-cache[-ext] is enabled. A slacker could just make sure the set-features command completed successfully, but to be 100% correct you need to re-read the identify-device page. :/ NOTE 1: these steps are specific to the flush-cache command, and are only vaguely related to write caching (which must be tested-for and enabled separately). It is important to go through these steps separately for write-caching and flush-cache[-ext]. Write-caching and flush-cache-command state should always be considered separately, even though the two are often used together. I want to avoid the LG cdrom debacle, where not properly checking for, and setting up, flush-cache resulted in turning cdroms into bricks. NOTE 2: Don't forget to check for 0x0000 or 0xffff in word 86, of __only__ identify-packet-device. These special case meanings do not appear for ATA devices, only ATAPI devices. NOTE 3: flush-cache-ext should always be used on lba48 devices, even if the request was inside the lba28 limit. NOTE 4: flush-cache-ext does not exist on ATAPI devices. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 16:34 ` Jeff Garzik @ 2004-03-19 18:19 ` Jens Axboe 2004-03-19 23:01 ` Matthias Andree 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz 2 siblings, 0 replies; 63+ messages in thread From: Jens Axboe @ 2004-03-19 18:19 UTC (permalink / raw) To: Jeff Garzik; +Cc: Linux Kernel, Chris Mason, B.Zolnierkiewicz On Fri, Mar 19 2004, Jeff Garzik wrote: > Jens Axboe wrote: > >Cosmetic stuff that will get ironed out. You can find the patches here: > > > >ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1-mm2/ > > > >ide-barrier-2.6.5-rc1-mm2-1 > > ide/core part > > > WRT ATA and flush-cache... before the IDE pieces of this patch are > merged, IMO it is a requirement that the entire flush-cache stuff gets a > review. ide_get_error_location() is one of the important pieces Well what are you waiting for :-). Honestly, the bits have been "out there" for 2 years or so, so it's not like it's anything brand new. ide_get_error_location() is new for 2.6, I consider the 2.6 code to have basically 100% coverage where the 2.4 one never really did... > (great!). Another important piece is to make sure that a drive's > flush-cache capability is correctly deduced and set up from the > identify-device. The steps look like > > - check identify-device word 83, bits 12 (flush cache) and 13 (flush > cache ext) > - issue set-features command to get flush-cache into proper state > (enabled or disabled, as the user desires), if identify-device word 86 > indicates it is not already in the state you seek. > - re-read identify [packet] device page from device, make sure > flush-cache[-ext] is enabled. A slacker could just make sure the > set-features command completed successfully, but to be 100% correct you > need to re-read the identify-device page. :/ > > NOTE 1: these steps are specific to the flush-cache command, and are > only vaguely related to write caching (which must be tested-for and > enabled separately). It is important to go through these steps > separately for write-caching and flush-cache[-ext]. > > Write-caching and flush-cache-command state should always be considered > separately, even though the two are often used together. I want to > avoid the LG cdrom debacle, where not properly checking for, and setting > up, flush-cache resulted in turning cdroms into bricks. > > NOTE 2: Don't forget to check for 0x0000 or 0xffff in word 86, of > __only__ identify-packet-device. These special case meanings do not > appear for ATA devices, only ATAPI devices. > > NOTE 3: flush-cache-ext should always be used on lba48 devices, even if > the request was inside the lba28 limit. > > NOTE 4: flush-cache-ext does not exist on ATAPI devices. Only ide-disk is supported. But that's some great info, I'll go over the code this weekend and make sure we're doing ok and correct any deviations. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 16:34 ` Jeff Garzik 2004-03-19 18:19 ` Jens Axboe @ 2004-03-19 23:01 ` Matthias Andree 2004-03-20 0:02 ` Bartlomiej Zolnierkiewicz 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz 2 siblings, 1 reply; 63+ messages in thread From: Matthias Andree @ 2004-03-19 23:01 UTC (permalink / raw) To: Linux Kernel On Fri, 19 Mar 2004, Jeff Garzik wrote: > - issue set-features command to get flush-cache into proper state > (enabled or disabled, as the user desires), if identify-device word 86 > indicates it is not already in the state you seek. BTW, speaking of identify-device, hdparm -i (which uses HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state that I've previously set with -W0. This is an i386 machine w/ 2.6.5-rc1. Is HDIO_GET_IDENTITY working correctly? -- Matthias Andree Encrypt your mail: my GnuPG key ID is 0x052E7D95 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 23:01 ` Matthias Andree @ 2004-03-20 0:02 ` Bartlomiej Zolnierkiewicz 2004-03-20 1:48 ` Johannes Stezenbach 2004-03-20 18:52 ` Helge Hafting 0 siblings, 2 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 0:02 UTC (permalink / raw) To: Matthias Andree; +Cc: Linux Kernel On Saturday 20 of March 2004 00:01, Matthias Andree wrote: > On Fri, 19 Mar 2004, Jeff Garzik wrote: > > - issue set-features command to get flush-cache into proper state > > (enabled or disabled, as the user desires), if identify-device word 86 > > indicates it is not already in the state you seek. > > BTW, speaking of identify-device, hdparm -i (which uses > HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I > that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state > that I've previously set with -W0. This is an i386 machine w/ 2.6.5-rc1. > > Is HDIO_GET_IDENTITY working correctly? There were reports that on some drives you can't disable write cache and even (?) that some drives lie (WC still enabled but marked as disabled). Regards, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 0:02 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 1:48 ` Johannes Stezenbach 2004-03-20 2:13 ` Bartlomiej Zolnierkiewicz 2004-03-20 18:52 ` Helge Hafting 1 sibling, 1 reply; 63+ messages in thread From: Johannes Stezenbach @ 2004-03-20 1:48 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Matthias Andree, Linux Kernel Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 00:01, Matthias Andree wrote: > > > > BTW, speaking of identify-device, hdparm -i (which uses > > HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I > > that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state > > that I've previously set with -W0. This is an i386 machine w/ 2.6.5-rc1. > > > > Is HDIO_GET_IDENTITY working correctly? > > There were reports that on some drives you can't disable write cache > and even (?) that some drives lie (WC still enabled but marked as disabled). hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and both test bit 0x0020 of word 85. So it's unclear to me why they report a different write cache setting. I added a hexdump to dump_identity() in hdparm.c, and found that bit 0x0020 of word 85 is always set. BTW, 'cat /proc/ide/hda/identify' or 'hdparm -Istdin </dev/ide/hda/identify' reports the same value as hdparm -I, and that is consistent with the value I set with hdparm -W x. So, is HDIO_GET_IDENTITY broken? Johannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 1:48 ` Johannes Stezenbach @ 2004-03-20 2:13 ` Bartlomiej Zolnierkiewicz 2004-03-20 2:53 ` Johannes Stezenbach 2004-03-20 11:36 ` Matthias Andree 0 siblings, 2 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 2:13 UTC (permalink / raw) To: Johannes Stezenbach; +Cc: Matthias Andree, Linux Kernel On Saturday 20 of March 2004 02:48, Johannes Stezenbach wrote: > Bartlomiej Zolnierkiewicz wrote: > > On Saturday 20 of March 2004 00:01, Matthias Andree wrote: > > > BTW, speaking of identify-device, hdparm -i (which uses > > > HDIO_GET_IDENTITY) always returns "WriteCache=enabled" while hdparm -I > > > that uses HDIO_DRIVE_CMD with WIN_PIDENTIFY reports the "correct" state > > > that I've previously set with -W0. This is an i386 machine w/ > > > 2.6.5-rc1. > > > > > > Is HDIO_GET_IDENTITY working correctly? > > > > There were reports that on some drives you can't disable write cache > > and even (?) that some drives lie (WC still enabled but marked as > > disabled). Doh, I misunderstood the question. Correct answer is: everything is fine, RTFM (man hdparm). ;-) > hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and both > test bit 0x0020 of word 85. So it's unclear to me why they report a > different write cache setting. I added a hexdump to dump_identity() > in hdparm.c, and found that bit 0x0020 of word 85 is always set. or WIN_PIDENTIFY to be strict but -i returns _cached_ (read when the device was probed) identify data (uses HDIO_GET_IDENTIFY ioctl) -I reads _current_ data directly from the device (uses HDIO_DRIVE_CMD ioctl) > BTW, 'cat /proc/ide/hda/identify' or 'hdparm -Istdin > </dev/ide/hda/identify' reports the same value as hdparm -I, and that is > consistent with > the value I set with hdparm -W x. > > > So, is HDIO_GET_IDENTITY broken? No. Regards, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 2:13 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 2:53 ` Johannes Stezenbach 2004-03-20 16:03 ` Bartlomiej Zolnierkiewicz 2004-03-20 11:36 ` Matthias Andree 1 sibling, 1 reply; 63+ messages in thread From: Johannes Stezenbach @ 2004-03-20 2:53 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Matthias Andree, Linux Kernel Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 02:48, Johannes Stezenbach wrote: > > > hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and both > > test bit 0x0020 of word 85. So it's unclear to me why they report a > > different write cache setting. I added a hexdump to dump_identity() > > in hdparm.c, and found that bit 0x0020 of word 85 is always set. > > or WIN_PIDENTIFY to be strict but > > -i returns _cached_ (read when the device was probed) identify data > (uses HDIO_GET_IDENTIFY ioctl) > -I reads _current_ data directly from the device > (uses HDIO_DRIVE_CMD ioctl) Oh, right. But: HDIO_GET_IDENTITY returns drive->id, and surely drive->id is used internally. So isn't it a bug that drive->id is not updated when the write cache setting is changed? I think the barrier code uses drive->id to determine if the write cache is enabled. Johannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 2:53 ` Johannes Stezenbach @ 2004-03-20 16:03 ` Bartlomiej Zolnierkiewicz 0 siblings, 0 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 16:03 UTC (permalink / raw) To: Johannes Stezenbach; +Cc: Matthias Andree, Linux Kernel On Saturday 20 of March 2004 03:53, Johannes Stezenbach wrote: > Bartlomiej Zolnierkiewicz wrote: > > On Saturday 20 of March 2004 02:48, Johannes Stezenbach wrote: > > > hdparm -i and -I ultimately both interpret WIN_IDENTIFY result, and > > > both test bit 0x0020 of word 85. So it's unclear to me why they report > > > a different write cache setting. I added a hexdump to dump_identity() > > > in hdparm.c, and found that bit 0x0020 of word 85 is always set. > > > > or WIN_PIDENTIFY to be strict but > > > > -i returns _cached_ (read when the device was probed) identify data > > (uses HDIO_GET_IDENTIFY ioctl) > > -I reads _current_ data directly from the device > > (uses HDIO_DRIVE_CMD ioctl) > > Oh, right. > > But: HDIO_GET_IDENTITY returns drive->id, and surely drive->id > is used internally. So isn't it a bug that drive->id is not > updated when the write cache setting is changed? No, drive->id shouldn't be changed. > I think the barrier code uses drive->id to determine if the > write cache is enabled. The barrier code depends on drive->wcache. Regards, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 2:13 ` Bartlomiej Zolnierkiewicz 2004-03-20 2:53 ` Johannes Stezenbach @ 2004-03-20 11:36 ` Matthias Andree 2004-03-20 16:00 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 63+ messages in thread From: Matthias Andree @ 2004-03-20 11:36 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Johannes Stezenbach, Matthias Andree, Linux Kernel > Correct answer is: everything is fine, RTFM (man hdparm). ;-) Not everything is fine. hdparm documents -i returns inconsistent data. Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for instance. What is that good for? Mix & Match whatever is convenient? Are there systems where -I will not work? If there are none, hdparm 6.0 should be shipped without the -i option. -- Matthias Andree Encrypt your mail: my GnuPG key ID is 0x052E7D95 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 11:36 ` Matthias Andree @ 2004-03-20 16:00 ` Bartlomiej Zolnierkiewicz 2004-03-20 23:36 ` Johannes Stezenbach 0 siblings, 1 reply; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 16:00 UTC (permalink / raw) To: Matthias Andree; +Cc: Johannes Stezenbach, Matthias Andree, Linux Kernel On Saturday 20 of March 2004 12:36, Matthias Andree wrote: > > Correct answer is: everything is fine, RTFM (man hdparm). ;-) > > Not everything is fine. hdparm documents -i returns inconsistent data. > Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for > instance. What is that good for? Mix & Match whatever is convenient? I'm aware of this bug - driver shouldn't modify drive->id. Patches are welcomed. > Are there systems where -I will not work? If there are none, hdparm 6.0 It should always work. > should be shipped without the -i option. Why? It can be sometimes useful for debugging purposes and HDIO_GET_IDENTIFY ioctl is not going away in 2.6.x. Regards, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 16:00 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 23:36 ` Johannes Stezenbach 2004-03-21 1:33 ` Bartlomiej Zolnierkiewicz 0 siblings, 1 reply; 63+ messages in thread From: Johannes Stezenbach @ 2004-03-20 23:36 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Matthias Andree, Linux Kernel Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 12:36, Matthias Andree wrote: > > > Correct answer is: everything is fine, RTFM (man hdparm). ;-) > > > > Not everything is fine. hdparm documents -i returns inconsistent data. > > Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for > > instance. What is that good for? Mix & Match whatever is convenient? > > I'm aware of this bug - driver shouldn't modify drive->id. Patches are welcomed. Why? What's the reason for keeping out-of-date IDENTIFY data? And what about ide_driveid_update()? Is it a bug that it exists? This is all too confusing for me :-( Johannes ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 23:36 ` Johannes Stezenbach @ 2004-03-21 1:33 ` Bartlomiej Zolnierkiewicz 0 siblings, 0 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-21 1:33 UTC (permalink / raw) To: Johannes Stezenbach; +Cc: Matthias Andree, Linux Kernel On Sunday 21 of March 2004 00:36, Johannes Stezenbach wrote: > Bartlomiej Zolnierkiewicz wrote: > > On Saturday 20 of March 2004 12:36, Matthias Andree wrote: > > > > Correct answer is: everything is fine, RTFM (man hdparm). ;-) > > > > > > Not everything is fine. hdparm documents -i returns inconsistent data. > > > Most, but _NOT_ _EVERYTHING_ is cached: the multcount is updated, for > > > instance. What is that good for? Mix & Match whatever is convenient? > > > > I'm aware of this bug - driver shouldn't modify drive->id. Patches are > > welcomed. > > Why? What's the reason for keeping out-of-date IDENTIFY data? HDIO_GET_IDENTIFY ioctl which should die first. > And what about ide_driveid_update()? Is it a bug that > it exists? It should just check/copy relevant bits from the new identify (but without modifying drive->id). > This is all too confusing for me :-( Welcome in the wonderful world of IDE driver. ;-) Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 0:02 ` Bartlomiej Zolnierkiewicz 2004-03-20 1:48 ` Johannes Stezenbach @ 2004-03-20 18:52 ` Helge Hafting 2004-03-22 11:15 ` Matthias Andree 1 sibling, 1 reply; 63+ messages in thread From: Helge Hafting @ 2004-03-20 18:52 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Matthias Andree, Linux Kernel On Sat, Mar 20, 2004 at 01:02:39AM +0100, Bartlomiej Zolnierkiewicz wrote: [...] > There were reports that on some drives you can't disable write cache > and even (?) that some drives lie (WC still enabled but marked as disabled). > I think the simple solution of not supporting data integrity properly on such a broken disk is perfectly ok. Helge Hafting ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 18:52 ` Helge Hafting @ 2004-03-22 11:15 ` Matthias Andree 0 siblings, 0 replies; 63+ messages in thread From: Matthias Andree @ 2004-03-22 11:15 UTC (permalink / raw) To: Helge Hafting; +Cc: Bartlomiej Zolnierkiewicz, Matthias Andree, Linux Kernel On Sat, 20 Mar 2004, Helge Hafting wrote: > On Sat, Mar 20, 2004 at 01:02:39AM +0100, Bartlomiej Zolnierkiewicz wrote: > [...] > > There were reports that on some drives you can't disable write cache > > and even (?) that some drives lie (WC still enabled but marked as disabled). > > > I think the simple solution of not supporting data integrity properly > on such a broken disk is perfectly ok. At least Linux should warn the user that his data is heading for doom. -- Matthias Andree Encrypt your mail: my GnuPG key ID is 0x052E7D95 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 16:34 ` Jeff Garzik 2004-03-19 18:19 ` Jens Axboe 2004-03-19 23:01 ` Matthias Andree @ 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:14 ` Jeff Garzik ` (2 more replies) 2 siblings, 3 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-19 23:59 UTC (permalink / raw) To: Jeff Garzik, Jens Axboe; +Cc: Linux Kernel, Chris Mason On Friday 19 of March 2004 17:34, Jeff Garzik wrote: > Jens Axboe wrote: > > Cosmetic stuff that will get ironed out. You can find the patches here: > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1 > >-mm2/ > > > > ide-barrier-2.6.5-rc1-mm2-1 > > ide/core part Jens, am I right that you didn't do any changes/cleanups I asked you to do? Here they are once again (probably some new items added as a bonus). ;-) - do not use hwgroup->wrq (die!) and do not add drive->special_buf, just do what PM code does and other special commands do - use taskfile (yes, dirty stack allocation) - SCSI -> IDE transform should die, please use something like REQ_FLUSH and let subsystems deal with it - ide_get_error_location() is cool but clean other places doing same thing as you are duplicating existing code (please use u64 not sector_t - you are getting raw info from the disk) - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags? - why are we doing pre-flush? > WRT ATA and flush-cache... before the IDE pieces of this patch are > merged, IMO it is a requirement that the entire flush-cache stuff gets a > review. ide_get_error_location() is one of the important pieces > (great!). Another important piece is to make sure that a drive's > flush-cache capability is correctly deduced and set up from the > identify-device. The steps look like Yep. > - check identify-device word 83, bits 12 (flush cache) and 13 (flush > cache ext) > - issue set-features command to get flush-cache into proper state > (enabled or disabled, as the user desires), if identify-device word 86 > indicates it is not already in the state you seek. > - re-read identify [packet] device page from device, make sure > flush-cache[-ext] is enabled. A slacker could just make sure the > set-features command completed successfully, but to be 100% correct you > need to re-read the identify-device page. :/ The fact that spec says "supported" not "enabled" in description of word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is supported) Jeff, please note that these bits were introduced by ATA-6 spec and take a look at ATA-5 spec: ... FLUSH CACHE General feature set - Mandatory for all devices ... and ATA-4 spec: ... FLUSH CACHE General feature set - Optional for all devices ... IMO to test if FLUSH CACHE works we should just issue it during disk setup and check result. This way we can use FLUSH CACHE also on < ATA-6 devices (there is a lot of them). Cheers, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 0:14 ` Jeff Garzik 2004-03-20 0:40 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:17 ` Jeff Garzik 2004-03-20 9:53 ` Jens Axboe 2 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-20 0:14 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jens Axboe, Linux Kernel, Chris Mason Bartlomiej Zolnierkiewicz wrote: > The fact that spec says "supported" not "enabled" in description of word86 > makes me wonder - can they be disabled? (FLUSH CACHE is mandatory for General > feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is supported) Yes, that's why there are separate 'supported' and 'enabled' bits for each feature. Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits. These bits mirror each other, i.e. Word 83 and Word 86 have basically the same bits, except that Word 86 definitions change _slightly_ since the only bits that are relevant are the ones for features that can be disabled/enabled. You use set-features command to enable and disable these features, and then the result shows up in subsequent identify-device command output. If the driver is testing for a capability but does not enable it, then always use the 'enabled' set of bits, not the 'supported' set of bits. > Jeff, please note that these bits were introduced by ATA-6 spec > and take a look at ATA-5 spec: > > ... > FLUSH CACHE > General feature set > - Mandatory for all devices > ... > > and ATA-4 spec: > > ... > FLUSH CACHE > General feature set > - Optional for all devices > ... > > IMO to test if FLUSH CACHE works we should just issue it during disk setup > and check result. This way we can use FLUSH CACHE also on < ATA-6 devices > (there is a lot of them). I disagree. "just issue it" is how those LG cdrom drives got cooked. LG cdrom drives indicated in their identify-packet-device page that flush-cache was not supported... and then re-used the flush-cache ATA opcode for their vendor-specific download-firmware command. Combine that with a Linux patch that didn't properly check for flush-cache support. Result: brick. All drives that support flush-cache list the relevant bits in identify-device, even on pre-ATA-6 devices. Whether the feature was optional or mandantory, we can check the feature bits. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 0:14 ` Jeff Garzik @ 2004-03-20 0:40 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:42 ` Jeff Garzik 0 siblings, 1 reply; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 0:40 UTC (permalink / raw) To: Jeff Garzik; +Cc: Jens Axboe, Linux Kernel, Chris Mason On Saturday 20 of March 2004 01:14, Jeff Garzik wrote: > Bartlomiej Zolnierkiewicz wrote: > > The fact that spec says "supported" not "enabled" in description of > > word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory > > for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is > > supported) > > Yes, that's why there are separate 'supported' and 'enabled' bits for > each feature. > > Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits. > These bits mirror each other, i.e. Word 83 and Word 86 have basically > the same bits, except that Word 86 definitions change _slightly_ since > the only bits that are relevant are the ones for features that can be > disabled/enabled. > > You use set-features command to enable and disable these features, and > then the result shows up in subsequent identify-device command output. > > If the driver is testing for a capability but does not enable it, then > always use the 'enabled' set of bits, not the 'supported' set of bits. This is quite obvious but I am talking about confusing wording in description of word86 - for some features 'enabled' is used and for others 'supported' > > Jeff, please note that these bits were introduced by ATA-6 spec > > and take a look at ATA-5 spec: > > > > ... > > FLUSH CACHE > > General feature set > > - Mandatory for all devices > > ... > > > > and ATA-4 spec: > > > > ... > > FLUSH CACHE > > General feature set > > - Optional for all devices > > ... > > > > IMO to test if FLUSH CACHE works we should just issue it during disk > > setup and check result. This way we can use FLUSH CACHE also on < ATA-6 > > devices (there is a lot of them). > > I disagree. "just issue it" is how those LG cdrom drives got cooked. I'm aware of LG fun. Jens already stated that current barrier implementation is disk-only and I'm talking about disks only. If anybody reused CACHE FLUSH opcode for disk drive he/she deserves to loose. 8) > LG cdrom drives indicated in their identify-packet-device page that > flush-cache was not supported... and then re-used the flush-cache ATA > opcode for their vendor-specific download-firmware command. Combine > that with a Linux patch that didn't properly check for flush-cache > support. Result: brick. > > All drives that support flush-cache list the relevant bits in > identify-device, even on pre-ATA-6 devices. Whether the feature was > optional or mandantory, we can check the feature bits. Hm. so this is undocumented in the spec? Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 0:40 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 0:42 ` Jeff Garzik 2004-03-20 1:24 ` Bartlomiej Zolnierkiewicz 0 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-20 0:42 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jens Axboe, Linux Kernel, Chris Mason Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 01:14, Jeff Garzik wrote: > >>Bartlomiej Zolnierkiewicz wrote: >> >>>The fact that spec says "supported" not "enabled" in description of >>>word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory >>>for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is >>>supported) >> >>Yes, that's why there are separate 'supported' and 'enabled' bits for >>each feature. >> >>Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits. >>These bits mirror each other, i.e. Word 83 and Word 86 have basically >>the same bits, except that Word 86 definitions change _slightly_ since >>the only bits that are relevant are the ones for features that can be >>disabled/enabled. >> >>You use set-features command to enable and disable these features, and >>then the result shows up in subsequent identify-device command output. >> >>If the driver is testing for a capability but does not enable it, then >>always use the 'enabled' set of bits, not the 'supported' set of bits. > > > This is quite obvious but I am talking about confusing wording in description > of word86 - for some features 'enabled' is used and for others 'supported' Yeah, mainly the difference is communicating in the description of each word. Anyway, what I described is how things work :) For example, features that are always enabled in the drive are listed with both support and enabled bits set. The driver sees that, and does not issue a set-features command, because it does not need to. >>>IMO to test if FLUSH CACHE works we should just issue it during disk >>>setup and check result. This way we can use FLUSH CACHE also on < ATA-6 >>>devices (there is a lot of them). >> >>I disagree. "just issue it" is how those LG cdrom drives got cooked. > > > I'm aware of LG fun. Jens already stated that current barrier implementation > is disk-only and I'm talking about disks only. > > If anybody reused CACHE FLUSH opcode for disk drive he/she deserves to loose. > 8) Well... If you don't check the proper feature bits found in the spec, I blame the driver for ignoring the spec... :) >>All drives that support flush-cache list the relevant bits in >>identify-device, even on pre-ATA-6 devices. Whether the feature was >>optional or mandantory, we can check the feature bits. > > > Hm. so this is undocumented in the spec? ? When it was optional, there was a feature bit to test. When it became mandantory, the feature bit to test stayed in there. The feature bit is zero, otherwise. Makes it possible to use "just test the bit" and have things Just Work(tm). :) Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 0:42 ` Jeff Garzik @ 2004-03-20 1:24 ` Bartlomiej Zolnierkiewicz 2004-03-20 9:58 ` Jens Axboe 2004-03-20 10:21 ` Jeff Garzik 0 siblings, 2 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 1:24 UTC (permalink / raw) To: Jeff Garzik; +Cc: Jens Axboe, Linux Kernel, Chris Mason On Saturday 20 of March 2004 01:42, Jeff Garzik wrote: > Bartlomiej Zolnierkiewicz wrote: > > On Saturday 20 of March 2004 01:14, Jeff Garzik wrote: > >>Bartlomiej Zolnierkiewicz wrote: > >>>The fact that spec says "supported" not "enabled" in description of > >>>word86 makes me wonder - can they be disabled? (FLUSH CACHE is mandatory > >>>for General feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA > >>> is supported) > >> > >>Yes, that's why there are separate 'supported' and 'enabled' bits for > >>each feature. > >> > >>Words 82-84 are 'supported' bits. Words 85-87 are 'enabled' bits. > >>These bits mirror each other, i.e. Word 83 and Word 86 have basically > >>the same bits, except that Word 86 definitions change _slightly_ since > >>the only bits that are relevant are the ones for features that can be > >>disabled/enabled. > >> > >>You use set-features command to enable and disable these features, and > >>then the result shows up in subsequent identify-device command output. > >> > >>If the driver is testing for a capability but does not enable it, then > >>always use the 'enabled' set of bits, not the 'supported' set of bits. > > > > This is quite obvious but I am talking about confusing wording in > > description of word86 - for some features 'enabled' is used and for > > others 'supported' > > Yeah, mainly the difference is communicating in the description of each > word. > > Anyway, what I described is how things work :) For example, features > that are always enabled in the drive are listed with both support and > enabled bits set. The driver sees that, and does not issue a > set-features command, because it does not need to. Yep but in your original mail you suggested that we should explicitly enable FLUSH CACHE and FLUSH CACHE EXT features - there are even no subcommands to do this. ;-) > >>>IMO to test if FLUSH CACHE works we should just issue it during disk > >>>setup and check result. This way we can use FLUSH CACHE also on < ATA-6 > >>>devices (there is a lot of them). > >> > >>I disagree. "just issue it" is how those LG cdrom drives got cooked. > > > > I'm aware of LG fun. Jens already stated that current barrier > > implementation is disk-only and I'm talking about disks only. > > > > If anybody reused CACHE FLUSH opcode for disk drive he/she deserves to > > loose. 8) > > Well... If you don't check the proper feature bits found in the spec, I > blame the driver for ignoring the spec... :) Please point me to these bits in ATA-4 or ATA-5 spec. Have you checked them as I asked? > >>All drives that support flush-cache list the relevant bits in > >>identify-device, even on pre-ATA-6 devices. Whether the feature was > >>optional or mandantory, we can check the feature bits. > > > > Hm. so this is undocumented in the spec? > > ? When it was optional, there was a feature bit to test. When it > became mandantory, the feature bit to test stayed in there. The feature > bit is zero, otherwise. Makes it possible to use "just test the bit" > and have things Just Work(tm). :) I wish it was so simple. Here is an example to make it clear: model: WDC WD800JB-00CRA1 firmware: 17.07W77 word 0x83 is 4b01, word 0x86 is 0x0801 and drive of course supports CACHE FLUSH command. Cheers, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 1:24 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 9:58 ` Jens Axboe 2004-03-20 10:12 ` Jeff Garzik 2004-03-20 10:21 ` Jeff Garzik 1 sibling, 1 reply; 63+ messages in thread From: Jens Axboe @ 2004-03-20 9:58 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jeff Garzik, Linux Kernel, Chris Mason On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote: > > >>All drives that support flush-cache list the relevant bits in > > >>identify-device, even on pre-ATA-6 devices. Whether the feature was > > >>optional or mandantory, we can check the feature bits. > > > > > > Hm. so this is undocumented in the spec? > > > > ? When it was optional, there was a feature bit to test. When it > > became mandantory, the feature bit to test stayed in there. The feature > > bit is zero, otherwise. Makes it possible to use "just test the bit" > > and have things Just Work(tm). :) > > I wish it was so simple. Here is an example to make it clear: > > model: WDC WD800JB-00CRA1 firmware: 17.07W77 > word 0x83 is 4b01, word 0x86 is 0x0801 > > and drive of course supports CACHE FLUSH command. I agree with Bart, it's usually never that clear. Quit harping the stupid LG issue, they did something brain dead in the firmware and I almost have to say that they got what they deserved for doing something as _stupid_ as that. Jeff, it's wonderful to think that you can always rely on checking spec bits, but in reality it never really 'just works out' for any given set of hardware. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 9:58 ` Jens Axboe @ 2004-03-20 10:12 ` Jeff Garzik 2004-03-20 10:19 ` Jens Axboe 0 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-20 10:12 UTC (permalink / raw) To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Linux Kernel, Chris Mason Jens Axboe wrote: > I agree with Bart, it's usually never that clear. Quit harping the > stupid LG issue, they did something brain dead in the firmware and I > almost have to say that they got what they deserved for doing something > as _stupid_ as that. I use it because it's an excellent illustration of what happens when you ignore the spec. > Jeff, it's wonderful to think that you can always rely on checking spec > bits, but in reality it never really 'just works out' for any given set > of hardware. "just issue it and hope" is not a reasonable plan, IMO. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 10:12 ` Jeff Garzik @ 2004-03-20 10:19 ` Jens Axboe 2004-03-20 10:37 ` Jeff Garzik 0 siblings, 1 reply; 63+ messages in thread From: Jens Axboe @ 2004-03-20 10:19 UTC (permalink / raw) To: Jeff Garzik; +Cc: Bartlomiej Zolnierkiewicz, Linux Kernel, Chris Mason On Sat, Mar 20 2004, Jeff Garzik wrote: > Jens Axboe wrote: > >I agree with Bart, it's usually never that clear. Quit harping the > >stupid LG issue, they did something brain dead in the firmware and I > >almost have to say that they got what they deserved for doing something > >as _stupid_ as that. > > I use it because it's an excellent illustration of what happens when you > ignore the spec. Really, I think it's a much better demonstration of exactly how stupid hardware developers are at times... > >Jeff, it's wonderful to think that you can always rely on checking spec > >bits, but in reality it never really 'just works out' for any given set > >of hardware. > > "just issue it and hope" is not a reasonable plan, IMO. I agree with that as well. I just didn't agree with your rosy idea of thinking everything would always work if you just check the bits according to spec. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 10:19 ` Jens Axboe @ 2004-03-20 10:37 ` Jeff Garzik 2004-03-20 16:30 ` Bartlomiej Zolnierkiewicz 0 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-20 10:37 UTC (permalink / raw) To: Jens Axboe; +Cc: Bartlomiej Zolnierkiewicz, Linux Kernel, Chris Mason Jens Axboe wrote: > On Sat, Mar 20 2004, Jeff Garzik wrote: > >>Jens Axboe wrote: >> >>>I agree with Bart, it's usually never that clear. Quit harping the >>>stupid LG issue, they did something brain dead in the firmware and I >>>almost have to say that they got what they deserved for doing something >>>as _stupid_ as that. >> >>I use it because it's an excellent illustration of what happens when you >>ignore the spec. > > > Really, I think it's a much better demonstration of exactly how stupid > hardware developers are at times... No argument. But their behavior, however awful, _was_ reported in places where a spec-driven driver would have noticed... :) >>>Jeff, it's wonderful to think that you can always rely on checking spec >>>bits, but in reality it never really 'just works out' for any given set >>>of hardware. >> >>"just issue it and hope" is not a reasonable plan, IMO. > > > I agree with that as well. I just didn't agree with your rosy idea of > thinking everything would always work if you just check the bits > according to spec. Everything _will_ always work, if you check the bits according to spec. Not reporting the flush-cache feature bit, when it really the opcode exists, isn't a spec violation. The opposite is, and I haven't heard of any such drives :) AFAICS: * for ATA versions where flush-cache is optional, you must check the bit. otherwise, issuing flush-cache would be very unwise. * for ATA versions where flush-cache is mandatory... the argument can be made that issuing a flush-cache in the absence of the bit isn't a bad thing. I'm not sure I agree with that, but I'm willing to be convinced. "just check the bits" always works because it is the conservative choice... but it can lead to suboptimal (but valid!) behavior by ignoring some devices' flush-cache capability. If it's only a few drives out there that misreport their flush-cache bit, then who cares --> just more broken hardware, that we won't issue a flush-cache on. If it's a lot of drives, that changes things... Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 10:37 ` Jeff Garzik @ 2004-03-20 16:30 ` Bartlomiej Zolnierkiewicz 2004-03-21 18:12 ` Jeff Garzik 0 siblings, 1 reply; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 16:30 UTC (permalink / raw) To: Jeff Garzik, Jens Axboe; +Cc: Linux Kernel, Chris Mason On Saturday 20 of March 2004 11:37, Jeff Garzik wrote: > Jens Axboe wrote: > > On Sat, Mar 20 2004, Jeff Garzik wrote: > >>Jens Axboe wrote: > >>>I agree with Bart, it's usually never that clear. Quit harping the > >>>stupid LG issue, they did something brain dead in the firmware and I > >>>almost have to say that they got what they deserved for doing something > >>>as _stupid_ as that. > >> > >>I use it because it's an excellent illustration of what happens when you > >>ignore the spec. > > > > Really, I think it's a much better demonstration of exactly how stupid > > hardware developers are at times... > > No argument. But their behavior, however awful, _was_ reported in > places where a spec-driven driver would have noticed... :) > > >>>Jeff, it's wonderful to think that you can always rely on checking spec > >>>bits, but in reality it never really 'just works out' for any given set > >>>of hardware. > >> > >>"just issue it and hope" is not a reasonable plan, IMO. > > > > I agree with that as well. I just didn't agree with your rosy idea of > > thinking everything would always work if you just check the bits > > according to spec. > > Everything _will_ always work, if you check the bits according to spec. > Not reporting the flush-cache feature bit, when it really the > opcode exists, isn't a spec violation. The opposite is, and I haven't > heard of any such drives :) > > AFAICS: > * for ATA versions where flush-cache is optional, you must check the > bit. otherwise, issuing flush-cache would be very unwise. There is not flush-cache bit in both ATA-4 (command optional) and ATA-5 (command mandatory). > * for ATA versions where flush-cache is mandatory... the argument can > be made that issuing a flush-cache in the absence of the bit isn't a bad > thing. I'm not sure I agree with that, but I'm willing to be convinced. > > "just check the bits" always works because it is the conservative > choice... but it can lead to suboptimal (but valid!) behavior by > ignoring some devices' flush-cache capability. It's just damn too conservative. > If it's only a few drives out there that misreport their flush-cache > bit, then who cares --> just more broken hardware, that we won't issue a > flush-cache on. If it's a lot of drives, that changes things... They don't misreport! They comply with the spec (ATA-4 or ATA-5). Jeff, please RTFSPEC. ;-) Cheers, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 16:30 ` Bartlomiej Zolnierkiewicz @ 2004-03-21 18:12 ` Jeff Garzik 0 siblings, 0 replies; 63+ messages in thread From: Jeff Garzik @ 2004-03-21 18:12 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Jens Axboe, Linux Kernel, Chris Mason, Mudama, Eric Bartlomiej Zolnierkiewicz wrote: > They don't misreport! They comply with the spec (ATA-4 or ATA-5). > > Jeff, please RTFSPEC. ;-) Arrrrgh. Yes, you're right, and I stand corrected. One wonders why the bits appeared at all, then... I wonder if some ATA drives _don't_ support mandatory flush-cache command? grumble. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 1:24 ` Bartlomiej Zolnierkiewicz 2004-03-20 9:58 ` Jens Axboe @ 2004-03-20 10:21 ` Jeff Garzik 2004-03-20 15:54 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-20 10:21 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jens Axboe, Linux Kernel, Chris Mason Bartlomiej Zolnierkiewicz wrote: > Yep but in your original mail you suggested that we should explicitly enable > FLUSH CACHE and FLUSH CACHE EXT features - there are even no subcommands > to do this. ;-) Whoops, you're right. I was thinking about the general protocol for features. > I wish it was so simple. Here is an example to make it clear: > > model: WDC WD800JB-00CRA1 firmware: 17.07W77 > word 0x83 is 4b01, word 0x86 is 0x0801 > > and drive of course supports CACHE FLUSH command. What ATA revision? Sending down opcodes because they will "probably" work isn't the best idea in the world... Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 10:21 ` Jeff Garzik @ 2004-03-20 15:54 ` Bartlomiej Zolnierkiewicz 0 siblings, 0 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 15:54 UTC (permalink / raw) To: Jeff Garzik; +Cc: Jens Axboe, Linux Kernel, Chris Mason On Saturday 20 of March 2004 11:21, Jeff Garzik wrote: > > I wish it was so simple. Here is an example to make it clear: > > > > model: WDC WD800JB-00CRA1 firmware: 17.07W77 > > word 0x83 is 4b01, word 0x86 is 0x0801 > > > > and drive of course supports CACHE FLUSH command. > > What ATA revision? ATA-5 > Sending down opcodes because they will "probably" work isn't the best > idea in the world... It isn't but spec says that devices should abort unknown commands. 8) Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:14 ` Jeff Garzik @ 2004-03-20 0:17 ` Jeff Garzik 2004-03-20 9:53 ` Jens Axboe 2 siblings, 0 replies; 63+ messages in thread From: Jeff Garzik @ 2004-03-20 0:17 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jens Axboe, Linux Kernel, Chris Mason Bartlomiej Zolnierkiewicz wrote: > The fact that spec says "supported" not "enabled" in description of word86 > makes me wonder - can they be disabled? (FLUSH CACHE is mandatory for General > feature set and FLUSH CACHE EXT is mandatory if 48-bit LBA is supported) IOW, "mandantory" isn't really mandantory, if you scale over time... Always check the identify-device feature bits before using a feature set :) Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:14 ` Jeff Garzik 2004-03-20 0:17 ` Jeff Garzik @ 2004-03-20 9:53 ` Jens Axboe 2004-03-20 16:23 ` Bartlomiej Zolnierkiewicz 2 siblings, 1 reply; 63+ messages in thread From: Jens Axboe @ 2004-03-20 9:53 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jeff Garzik, Linux Kernel, Chris Mason On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote: > On Friday 19 of March 2004 17:34, Jeff Garzik wrote: > > Jens Axboe wrote: > > > Cosmetic stuff that will get ironed out. You can find the patches here: > > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/axboe/patches/v2.6/2.6.5-rc1 > > >-mm2/ > > > > > > ide-barrier-2.6.5-rc1-mm2-1 > > > ide/core part > > Jens, am I right that you didn't do any changes/cleanups I asked you to do? > Here they are once again (probably some new items added as a bonus). ;-) Probably, ide code was idle for some time :). As I said to Chris, there are a bunch of things I want to do to the code over the weekend, I wanted to get something out there before that though (raises the incentive to finish it) > - do not use hwgroup->wrq (die!) and do not add drive->special_buf, > just do what PM code does and other special commands do - use taskfile > (yes, dirty stack allocation) Doesn't work for split flush, ie issue a bunch of flushes to devices, then wait for them. I agree using ->wrq and special_buf is ugly as hell, though. > - SCSI -> IDE transform should die, please use something like REQ_FLUSH > and let subsystems deal with it That's what I wanted to avoid, adding more flags. However, if you see the comment in there this is being changed to ->issue_flush() instead. So it's dying, don't worry. > - ide_get_error_location() is cool but clean other places doing same thing > as you are duplicating existing code > (please use u64 not sector_t - you are getting raw info from the disk) Ok > - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags? Ehm, because it _is_ a REQ_BLOCK_PC? ;-) > - why are we doing pre-flush? To ensure previously written data is on platter first. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 9:53 ` Jens Axboe @ 2004-03-20 16:23 ` Bartlomiej Zolnierkiewicz 2004-03-20 16:27 ` Jens Axboe 2004-03-20 16:32 ` Chris Mason 0 siblings, 2 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 16:23 UTC (permalink / raw) To: Jens Axboe; +Cc: Jeff Garzik, Linux Kernel, Chris Mason On Saturday 20 of March 2004 10:53, Jens Axboe wrote: > On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote: > > - do not use hwgroup->wrq (die!) and do not add drive->special_buf, > > just do what PM code does and other special commands do - use taskfile > > (yes, dirty stack allocation) > > Doesn't work for split flush, ie issue a bunch of flushes to devices, > then wait for them. I agree using ->wrq and special_buf is ugly as hell, > though. I can't see why it doesn't work, please explain. > > - ide_get_error_location() is cool but clean other places doing same > > thing as you are duplicating existing code > > (please use u64 not sector_t - you are getting raw info from the disk) > > Ok Cool. > > - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags? > > Ehm, because it _is_ a REQ_BLOCK_PC? ;-) Ok, it is PC till SCSI->IDE transform, then it is no longer PC. :) > > - why are we doing pre-flush? > > To ensure previously written data is on platter first. I know this, I want to know what for you are doing this? Previously written data is already acknowledgment to the upper layers so you can't do much even if you hit error on flush cache. IMO if error happens we should just check if failed sector is of our ordered write if not well report it and continue. It's cleaner and can give some (small?) performance gain. Regards, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 16:23 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 16:27 ` Jens Axboe 2004-03-20 16:32 ` Chris Mason 1 sibling, 0 replies; 63+ messages in thread From: Jens Axboe @ 2004-03-20 16:27 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jeff Garzik, Linux Kernel, Chris Mason On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 10:53, Jens Axboe wrote: > > On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote: > > > - do not use hwgroup->wrq (die!) and do not add drive->special_buf, > > > just do what PM code does and other special commands do - use taskfile > > > (yes, dirty stack allocation) > > > > Doesn't work for split flush, ie issue a bunch of flushes to devices, > > then wait for them. I agree using ->wrq and special_buf is ugly as hell, > > though. > > I can't see why it doesn't work, please explain. You'd need to pass on some opaque buffer from the issuer to the flush cache function for that to work, which would be ugly. > > > - why does blkdev_issue_flush() add REQ_BLOCK_PC to rq->flags? > > > > Ehm, because it _is_ a REQ_BLOCK_PC? ;-) > > Ok, it is PC till SCSI->IDE transform, then it is no longer PC. :) Right, and during that transform, REQ_BLOCK_PC is cleared: + rq->flags &= ~REQ_BLOCK_PC; + rq->flags |= REQ_DRIVE_TASK | REQ_STARTED; (REQ_STARTED should be removed, btw). > > > - why are we doing pre-flush? > > > > To ensure previously written data is on platter first. > > I know this, I want to know what for you are doing this? > > Previously written data is already acknowledgment to the upper layers > so you can't do much even if you hit error on flush cache. IMO if > error happens we should just check if failed sector is of our ordered > write if not well report it and continue. It's cleaner and can give > some (small?) performance gain. This depends entirely on what you assume with a barrier and what the upper layers want. My implementation guarentees that existing data sent to drive is on platter before you issue the barrier. What is confusing is that you cannot pass back where the error occured, that probably needs some work. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 16:23 ` Bartlomiej Zolnierkiewicz 2004-03-20 16:27 ` Jens Axboe @ 2004-03-20 16:32 ` Chris Mason 2004-03-20 17:05 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 63+ messages in thread From: Chris Mason @ 2004-03-20 16:32 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jens Axboe, Jeff Garzik, Linux Kernel On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote: > > > - why are we doing pre-flush? > > > > To ensure previously written data is on platter first. > > I know this, I want to know what for you are doing this? > > Previously written data is already acknowledgment to the upper layers so you > can't do much even if you hit error on flush cache. IMO if error happens we > should just check if failed sector is of our ordered write if not well report > it and continue. It's cleaner and can give some (small?) performance gain. > The journaled filesystems need this. We need to make sure that before we write the commit block for a transaction, all the previous log blocks we're written are safely on media. Then we also need to make sure the commit block is on media. We end up with a log blocks, pre-flush, commit block, post-flush cycle, which is what gives the proper transaction ordering on disk. For data blocks we only need the post flush, which is why Jens made blkdev_issue_flush skip the pre-flush. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 16:32 ` Chris Mason @ 2004-03-20 17:05 ` Bartlomiej Zolnierkiewicz 2004-03-20 17:10 ` Chris Mason 2004-03-30 16:04 ` Stephen C. Tweedie 0 siblings, 2 replies; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 17:05 UTC (permalink / raw) To: Chris Mason; +Cc: Jens Axboe, Jeff Garzik, Linux Kernel On Saturday 20 of March 2004 17:32, Chris Mason wrote: > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote: > > > > - why are we doing pre-flush? > > > > > > To ensure previously written data is on platter first. > > > > I know this, I want to know what for you are doing this? > > > > Previously written data is already acknowledgment to the upper layers so > > you can't do much even if you hit error on flush cache. IMO if error > > happens we should just check if failed sector is of our ordered write if > > not well report it and continue. It's cleaner and can give some (small?) > > performance gain. > > The journaled filesystems need this. We need to make sure that before > we write the commit block for a transaction, all the previous log blocks > we're written are safely on media. Then we also need to make sure the > commit block is on media. For low-level driver it shouldn't really matter whether sectors to be written are the commit block for a transaction or the previous log blocks and in the current implementation it does matter. > We end up with a log blocks, pre-flush, commit block, post-flush cycle, > which is what gives the proper transaction ordering on disk. Jens, can you explain how this translates to the block layer? If "log blocks" is a separate request from "commit block", we can just do: log blocks, flush, commit block, flush cycle. > For data blocks we only need the post flush, which is why Jens made > blkdev_issue_flush skip the pre-flush. Yes. Thanks, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 17:05 ` Bartlomiej Zolnierkiewicz @ 2004-03-20 17:10 ` Chris Mason 2004-03-20 20:16 ` Bartlomiej Zolnierkiewicz 2004-03-30 16:04 ` Stephen C. Tweedie 1 sibling, 1 reply; 63+ messages in thread From: Chris Mason @ 2004-03-20 17:10 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Jens Axboe, Jeff Garzik, Linux Kernel On Sat, 2004-03-20 at 12:05, Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 17:32, Chris Mason wrote: > > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote: > > > > > - why are we doing pre-flush? > > The journaled filesystems need this. We need to make sure that before > > we write the commit block for a transaction, all the previous log blocks > > we're written are safely on media. Then we also need to make sure the > > commit block is on media. > > For low-level driver it shouldn't really matter whether sectors to be > written are the commit block for a transaction or the previous log blocks > and in the current implementation it does matter. > As Jens said, it depends on how you define barrier ;-) I define it as this io will be written after all the previous io and before any later io. It was originally written with scsi tags in mind as well, the FS side was the same for both. In the end, I'm not that picky though, any reasonable setup that gets the blocks on media is fine. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 17:10 ` Chris Mason @ 2004-03-20 20:16 ` Bartlomiej Zolnierkiewicz 2004-03-21 9:43 ` Jens Axboe 0 siblings, 1 reply; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-20 20:16 UTC (permalink / raw) To: Chris Mason; +Cc: Jens Axboe, Jeff Garzik, Linux Kernel On Saturday 20 of March 2004 18:10, Chris Mason wrote: > On Sat, 2004-03-20 at 12:05, Bartlomiej Zolnierkiewicz wrote: > > On Saturday 20 of March 2004 17:32, Chris Mason wrote: > > > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote: > > > > > > - why are we doing pre-flush? > > > > > > The journaled filesystems need this. We need to make sure that before > > > we write the commit block for a transaction, all the previous log > > > blocks we're written are safely on media. Then we also need to make > > > sure the commit block is on media. > > > > For low-level driver it shouldn't really matter whether sectors to be > > written are the commit block for a transaction or the previous log blocks > > and in the current implementation it does matter. > > As Jens said, it depends on how you define barrier ;-) I define it as > this io will be written after all the previous io and before any later > io. It was originally written with scsi tags in mind as well, the FS > side was the same for both. Yes, thanks for explaining this. I took a quick look at fs/jbd/ and now I think I understand the way barriers currently work. I assume that SCSI handles barriers by ordered tags, right? > In the end, I'm not that picky though, any reasonable setup that gets > the blocks on media is fine. Yep. Regards, Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 20:16 ` Bartlomiej Zolnierkiewicz @ 2004-03-21 9:43 ` Jens Axboe 0 siblings, 0 replies; 63+ messages in thread From: Jens Axboe @ 2004-03-21 9:43 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz; +Cc: Chris Mason, Jeff Garzik, Linux Kernel On Sat, Mar 20 2004, Bartlomiej Zolnierkiewicz wrote: > On Saturday 20 of March 2004 18:10, Chris Mason wrote: > > On Sat, 2004-03-20 at 12:05, Bartlomiej Zolnierkiewicz wrote: > > > On Saturday 20 of March 2004 17:32, Chris Mason wrote: > > > > On Sat, 2004-03-20 at 11:23, Bartlomiej Zolnierkiewicz wrote: > > > > > > > - why are we doing pre-flush? > > > > > > > > The journaled filesystems need this. We need to make sure that before > > > > we write the commit block for a transaction, all the previous log > > > > blocks we're written are safely on media. Then we also need to make > > > > sure the commit block is on media. > > > > > > For low-level driver it shouldn't really matter whether sectors to be > > > written are the commit block for a transaction or the previous log blocks > > > and in the current implementation it does matter. > > > > As Jens said, it depends on how you define barrier ;-) I define it as > > this io will be written after all the previous io and before any later > > io. It was originally written with scsi tags in mind as well, the FS > > side was the same for both. > > Yes, thanks for explaining this. > > I took a quick look at fs/jbd/ and now I think I understand the way barriers > currently work. I assume that SCSI handles barriers by ordered tags, right? That's the idea, yes. Needs some SCSI error handling work though, to always ensure correct ordering. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-20 17:05 ` Bartlomiej Zolnierkiewicz 2004-03-20 17:10 ` Chris Mason @ 2004-03-30 16:04 ` Stephen C. Tweedie 2004-03-30 19:19 ` Chris Mason 1 sibling, 1 reply; 63+ messages in thread From: Stephen C. Tweedie @ 2004-03-30 16:04 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Chris Mason, Jens Axboe, Jeff Garzik, Linux Kernel, Stephen Tweedie Hi, On Sat, 2004-03-20 at 17:05, Bartlomiej Zolnierkiewicz wrote: > Jens, can you explain how this translates to the block layer? > If "log blocks" is a separate request from "commit block", > we can just do: log blocks, flush, commit block, flush cycle. It's not so simple. There are two different operations here as far as the fs is concerned --- ordering ("barrier"), and waiting ("flush"). For most journaling purposes, a filesystem really doesn't have to know when data has hit disk --- it only needs to be assured that certain ordering constraints will be obeyed. So in SCSI terms we can submit the log blocks, then set an ORDERED queue tag on the commit block, and leave it at that. We don't have to wait until that ORDERED tag actually becomes persistent on disk --- we don't care, in most cases. This is the "barrier", and in journaling, we basically need a barrier before and after the commit block. The tricky bit is that _sometimes_ we find, later on, that we need to know when data has hit disk. If an application requests synchronised IO (such as fsync, O_SYNC or O_DIRECT) then we cannot legally return until the disk has told us that the data is persistent for certain. _That_ is a full flush, and it's distinct from the simple ordering constraint that we normally have for journal commit blocks. The distinction becomes important if flushing and barriers have significantly different performance characteristics. In the SCSI tagged command case, we can insert an ORDERED queue tag into the pipeline without having to wait for anything. But if we implement the barrier and the flush by the same mechanism, then the fs ends up waiting; we get write log blocks wait for IO queue to empty write commit block wait for IO queue to empty instead of write log blocks write commit block with ORDERED tag set so the commits are miles slower. Jens' patch happens to implement barriers via flushes on IDE, but at the block layer blkdev_issue_flush() and set_buffer_ordered() are quite distinct APIs. --Stephen ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 16:04 ` Stephen C. Tweedie @ 2004-03-30 19:19 ` Chris Mason 2004-03-30 21:50 ` Stephen C. Tweedie 0 siblings, 1 reply; 63+ messages in thread From: Chris Mason @ 2004-03-30 19:19 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Bartlomiej Zolnierkiewicz, Jens Axboe, Jeff Garzik, Linux Kernel On Tue, 2004-03-30 at 11:04, Stephen C. Tweedie wrote: [ barriers vs ordered writes ] > The distinction becomes important if flushing and barriers have > significantly different performance characteristics. In the SCSI tagged > command case, we can insert an ORDERED queue tag into the pipeline > without having to wait for anything. But if we implement the barrier > and the flush by the same mechanism, then the fs ends up waiting; we get > > write log blocks > wait for IO queue to empty > write commit block > wait for IO queue to empty > > instead of > > write log blocks > write commit block with ORDERED tag set > > so the commits are miles slower. > I think we're mixing a few concepts together. submit_bh(WRITE_BARRIER, bh) gives us an ordered write in whatever form the lower layers can provide. It also ensures that if you happen to call wait_on_buffer() for the barrier buffer, the wait won't return until the data is on media. The construct allows for the second type of commit you describe above, where wait_on_buffer is not called on the log blocks until after the commit block has been sent down the pipe. The reiserfs barrier patch does this now. The fact that IDE implements the barriers via a flush and that no other layer has barriers right now is secondary ;) I do want to revive some kind of ordering for scsi as well, but since IDE is a safety issue right now I wanted to concentrate there first. > Jens' patch happens to implement barriers via flushes on IDE, but at the > block layer blkdev_issue_flush() and set_buffer_ordered() are quite > distinct APIs. Yes, they are completely different. blkdev_issue_flush is the kernel recognizing that wait_on_buffer (or page or whatever) isn't always enough to make sure writes are really done. It has no ordering semantics at all, it just means "really write what you've already told me is written, unless you promise via battery backed cache that it will get to disk eventually" (roughly) -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 19:19 ` Chris Mason @ 2004-03-30 21:50 ` Stephen C. Tweedie 2004-03-30 22:13 ` Chris Mason 2004-03-30 22:21 ` Jeff Garzik 0 siblings, 2 replies; 63+ messages in thread From: Stephen C. Tweedie @ 2004-03-30 21:50 UTC (permalink / raw) To: Chris Mason Cc: Bartlomiej Zolnierkiewicz, Jens Axboe, Jeff Garzik, Linux Kernel, Stephen Tweedie Hi, On Tue, 2004-03-30 at 20:19, Chris Mason wrote: > I think we're mixing a few concepts together. submit_bh(WRITE_BARRIER, > bh) gives us an ordered write in whatever form the lower layers can > provide. It also ensures that if you happen to call wait_on_buffer() > for the barrier buffer, the wait won't return until the data is on > media. Right, but that's just how it works right now --- one doesn't _have_ to imply the other. You could easily imagine an implementation that implements barriers and flushing separately, and which does not do automatic flushing on completion of WRITE_BARRIER IOs. SCSI with writeback caching enabled might be one example of that. NBD/DRBD would be another likely candidate --- if you've got network latencies in the way, then a flushing sync may be far more expensive than a barrier propagation. Unfortunately, a lot of the cases we care about really have to do the barrier via flushing, so the benefit of keeping them separate is limited. For LVM/raid0, for example, we've got no way of preserving ordering between IOs on different drives, so a flush is necessary there unless we start journaling the low-level IOs to preserve order. > The fact that IDE implements the barriers via a flush and that no other > layer has barriers right now is secondary ;) I do want to revive some > kind of ordering for scsi as well, but since IDE is a safety issue right > now I wanted to concentrate there first. Right. > Yes, they are completely different. blkdev_issue_flush is the kernel > recognizing that wait_on_buffer (or page or whatever) isn't always > enough to make sure writes are really done. Yep. It scares me to think what performance characteristics we'll start seeing once that gets used everywhere it's needed, though. If every raw or O_DIRECT write needs a flush after it, databases are going to become very sensitive to flush performance. I guess disabling the flushing and using disks which tell the truth about data hitting the platter is the sane answer there. --Stephen ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 21:50 ` Stephen C. Tweedie @ 2004-03-30 22:13 ` Chris Mason 2004-03-31 14:03 ` Stephen C. Tweedie 2004-03-30 22:21 ` Jeff Garzik 1 sibling, 1 reply; 63+ messages in thread From: Chris Mason @ 2004-03-30 22:13 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Bartlomiej Zolnierkiewicz, Jens Axboe, Jeff Garzik, Linux Kernel On Tue, 2004-03-30 at 16:50, Stephen C. Tweedie wrote: > Hi, > > On Tue, 2004-03-30 at 20:19, Chris Mason wrote: > > > > I think we're mixing a few concepts together. submit_bh(WRITE_BARRIER, > > bh) gives us an ordered write in whatever form the lower layers can > > provide. It also ensures that if you happen to call wait_on_buffer() > > for the barrier buffer, the wait won't return until the data is on > > media. > > Right, but that's just how it works right now --- one doesn't _have_ to > imply the other. You could easily imagine an implementation that > implements barriers and flushing separately, and which does not do > automatic flushing on completion of WRITE_BARRIER IOs. SCSI with > writeback caching enabled might be one example of that. NBD/DRBD would > be another likely candidate --- if you've got network latencies in the > way, then a flushing sync may be far more expensive than a barrier > propagation. > Yes, that's true, although the barriers don't really imply a flush, it just implies that if you do use wait_on_* for flushing, it will report things accurately. > Unfortunately, a lot of the cases we care about really have to do the > barrier via flushing, so the benefit of keeping them separate is > limited. For LVM/raid0, for example, we've got no way of preserving > ordering between IOs on different drives, so a flush is necessary there > unless we start journaling the low-level IOs to preserve order. > Right. > Yep. It scares me to think what performance characteristics we'll start > seeing once that gets used everywhere it's needed, though. If every raw > or O_DIRECT write needs a flush after it, databases are going to become > very sensitive to flush performance. I guess disabling the flushing and > using disks which tell the truth about data hitting the platter is the > sane answer there. Most database benchmarks are done on scsi, and the blkdev_flush should be a noop there. For IDE based database and mail server benchmarks, the results won't be pretty. The reiserfs fsync code tries hard to only flush once, so if a commit is done then blkdev_flush isn't called. We might have to do a few other tricks to queue up multiple synchronous ios and only flush once. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:13 ` Chris Mason @ 2004-03-31 14:03 ` Stephen C. Tweedie 2004-03-31 14:27 ` Chris Mason 0 siblings, 1 reply; 63+ messages in thread From: Stephen C. Tweedie @ 2004-03-31 14:03 UTC (permalink / raw) To: Chris Mason Cc: Bartlomiej Zolnierkiewicz, Jens Axboe, Jeff Garzik, Linux Kernel, Stephen Tweedie Hi, On Tue, 2004-03-30 at 23:13, Chris Mason wrote: > Most database benchmarks are done on scsi, and the blkdev_flush should > be a noop there. For IDE based database and mail server benchmarks, the > results won't be pretty. Yep. I'm really not too worried about big database benchmarks -- those are very much special cases, using rather specialised storage setup (SCSI or FC, striped over lots of small disks rather than fewer large ones.) I'm much more concerned about your average LAMP user's mysql database, and how to keep performance sane on that. > The reiserfs fsync code tries hard to only flush once, so if a commit is > done then blkdev_flush isn't called. We might have to do a few other > tricks to queue up multiple synchronous ios and only flush once. Batching is really helpful when you've got lots of threads that can be coalesced, yes. ext3 does that for things like mail servers. I'm not sure whether the same tricks will apply to the various databases out there, though. --Stephen ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-31 14:03 ` Stephen C. Tweedie @ 2004-03-31 14:27 ` Chris Mason 2004-03-31 18:28 ` Ric Wheeler 0 siblings, 1 reply; 63+ messages in thread From: Chris Mason @ 2004-03-31 14:27 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Bartlomiej Zolnierkiewicz, Jens Axboe, Jeff Garzik, Linux Kernel On Wed, 2004-03-31 at 09:03, Stephen C. Tweedie wrote: > Hi, > > On Tue, 2004-03-30 at 23:13, Chris Mason wrote: > > > Most database benchmarks are done on scsi, and the blkdev_flush should > > be a noop there. For IDE based database and mail server benchmarks, the > > results won't be pretty. > > Yep. I'm really not too worried about big database benchmarks -- those > are very much special cases, using rather specialised storage setup > (SCSI or FC, striped over lots of small disks rather than fewer large > ones.) I'm much more concerned about your average LAMP user's mysql > database, and how to keep performance sane on that. > In some cases, it's going to be so much slower that it will look like the old code wasn't writing the data at all. I don't think there's much we can do about that. > > The reiserfs fsync code tries hard to only flush once, so if a commit is > > done then blkdev_flush isn't called. We might have to do a few other > > tricks to queue up multiple synchronous ios and only flush once. > > Batching is really helpful when you've got lots of threads that can be > coalesced, yes. ext3 does that for things like mail servers. I'm not > sure whether the same tricks will apply to the various databases out > there, though. We can do better in general when there's more then one process doing an fsync. reiserfs and ext3 both try to be smart about batching log commits, but I think we could do more to streamline the data writes. I'm playing with a few ideas, I'll post more when I've got real code to back things up. If there's only one process doing fsyncs, there's not much the kernel can do except provide an aio fsync call. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-31 14:27 ` Chris Mason @ 2004-03-31 18:28 ` Ric Wheeler 0 siblings, 0 replies; 63+ messages in thread From: Ric Wheeler @ 2004-03-31 18:28 UTC (permalink / raw) To: Chris Mason Cc: Stephen C. Tweedie, Bartlomiej Zolnierkiewicz, Jens Axboe, Jeff Garzik, Linux Kernel As one of the large users of the Jens and Chris's barrier support in 2.4, I am very motivated to help validate and benchmark the new version. Stephen, if you have a specific mysql workload or usage case, I can try to through that into the mix. We do a lot of Sleepycat DB testing, would results from that help? Ric Chris Mason wrote: >On Wed, 2004-03-31 at 09:03, Stephen C. Tweedie wrote: > > >>Hi, >> >>On Tue, 2004-03-30 at 23:13, Chris Mason wrote: >> >> >> >>>Most database benchmarks are done on scsi, and the blkdev_flush should >>>be a noop there. For IDE based database and mail server benchmarks, the >>>results won't be pretty. >>> >>> >>Yep. I'm really not too worried about big database benchmarks -- those >>are very much special cases, using rather specialised storage setup >>(SCSI or FC, striped over lots of small disks rather than fewer large >>ones.) I'm much more concerned about your average LAMP user's mysql >>database, and how to keep performance sane on that. >> >> >> >In some cases, it's going to be so much slower that it will look like >the old code wasn't writing the data at all. I don't think there's much >we can do about that. > > > >>>The reiserfs fsync code tries hard to only flush once, so if a commit is >>>done then blkdev_flush isn't called. We might have to do a few other >>>tricks to queue up multiple synchronous ios and only flush once. >>> >>> >>Batching is really helpful when you've got lots of threads that can be >>coalesced, yes. ext3 does that for things like mail servers. I'm not >>sure whether the same tricks will apply to the various databases out >>there, though. >> >> > >We can do better in general when there's more then one process doing an >fsync. reiserfs and ext3 both try to be smart about batching log >commits, but I think we could do more to streamline the data writes. > >I'm playing with a few ideas, I'll post more when I've got real code to >back things up. > >If there's only one process doing fsyncs, there's not much the kernel >can do except provide an aio fsync call. > >-chris > > > > >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ > > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 21:50 ` Stephen C. Tweedie 2004-03-30 22:13 ` Chris Mason @ 2004-03-30 22:21 ` Jeff Garzik 2004-03-30 22:36 ` Chris Wedgwood ` (2 more replies) 1 sibling, 3 replies; 63+ messages in thread From: Jeff Garzik @ 2004-03-30 22:21 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Chris Mason, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel Stephen C. Tweedie wrote: > Yep. It scares me to think what performance characteristics we'll start > seeing once that gets used everywhere it's needed, though. If every raw > or O_DIRECT write needs a flush after it, databases are going to become > very sensitive to flush performance. I guess disabling the flushing and > using disks which tell the truth about data hitting the platter is the > sane answer there. For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't return until the data is on the platter. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:21 ` Jeff Garzik @ 2004-03-30 22:36 ` Chris Wedgwood 2004-03-30 22:39 ` Jeff Garzik 2004-03-30 22:40 ` Bartlomiej Zolnierkiewicz 2004-03-31 14:08 ` Stephen C. Tweedie 2 siblings, 1 reply; 63+ messages in thread From: Chris Wedgwood @ 2004-03-30 22:36 UTC (permalink / raw) To: Jeff Garzik Cc: Stephen C. Tweedie, Chris Mason, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel > For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which > don't return until the data is on the platter. On modern drives how reliable is this? At one point disk-scrubbing software which used FUA (to ensure data was being written to the platters) showed that some drives completely ignore this. Has the state of things changed significantly that we can assume this is very rare or might we need to have to kind of whitelist/blacklist system? --cw ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:36 ` Chris Wedgwood @ 2004-03-30 22:39 ` Jeff Garzik 2004-03-30 22:41 ` Chris Wedgwood 0 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-30 22:39 UTC (permalink / raw) To: Chris Wedgwood Cc: Stephen C. Tweedie, Chris Mason, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel Chris Wedgwood wrote: >>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which >>don't return until the data is on the platter. > > > On modern drives how reliable is this? At one point disk-scrubbing > software which used FUA (to ensure data was being written to the > platters) showed that some drives completely ignore this. I'm suspicious of this, because of Bart's point... I haven't seen any PATA disks that did FUA, so it sounds like broken software. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:39 ` Jeff Garzik @ 2004-03-30 22:41 ` Chris Wedgwood 0 siblings, 0 replies; 63+ messages in thread From: Chris Wedgwood @ 2004-03-30 22:41 UTC (permalink / raw) To: Jeff Garzik Cc: Stephen C. Tweedie, Chris Mason, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel On Tue, Mar 30, 2004 at 05:39:26PM -0500, Jeff Garzik wrote: > I'm suspicious of this, because of Bart's point... I haven't seen > any PATA disks that did FUA, so it sounds like broken software. I was kinda hoping that the response would be "all modern SATA and PATA dirves" because ideally this would be a nice thing to have. Maybe we should just test in a few places and see if it works (timing will show obviously). ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:21 ` Jeff Garzik 2004-03-30 22:36 ` Chris Wedgwood @ 2004-03-30 22:40 ` Bartlomiej Zolnierkiewicz 2004-03-30 22:38 ` Jeff Garzik 2004-03-31 14:08 ` Stephen C. Tweedie 2 siblings, 1 reply; 63+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-30 22:40 UTC (permalink / raw) To: Jeff Garzik, Stephen C. Tweedie; +Cc: Chris Mason, Jens Axboe, Linux Kernel On Wednesday 31 of March 2004 00:21, Jeff Garzik wrote: > Stephen C. Tweedie wrote: > > Yep. It scares me to think what performance characteristics we'll start > > seeing once that gets used everywhere it's needed, though. If every raw > > or O_DIRECT write needs a flush after it, databases are going to become > > very sensitive to flush performance. I guess disabling the flushing and > > using disks which tell the truth about data hitting the platter is the > > sane answer there. > > For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't > return until the data is on the platter. Do you know of any drive supporting it? I don't. Bartlomiej ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:40 ` Bartlomiej Zolnierkiewicz @ 2004-03-30 22:38 ` Jeff Garzik 0 siblings, 0 replies; 63+ messages in thread From: Jeff Garzik @ 2004-03-30 22:38 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Stephen C. Tweedie, Chris Mason, Jens Axboe, Linux Kernel Bartlomiej Zolnierkiewicz wrote: > On Wednesday 31 of March 2004 00:21, Jeff Garzik wrote: > >>Stephen C. Tweedie wrote: >> >>>Yep. It scares me to think what performance characteristics we'll start >>>seeing once that gets used everywhere it's needed, though. If every raw >>>or O_DIRECT write needs a flush after it, databases are going to become >>>very sensitive to flush performance. I guess disabling the flushing and >>>using disks which tell the truth about data hitting the platter is the >>>sane answer there. >> >>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't >>return until the data is on the platter. > > > Do you know of any drive supporting it? I don't. Newer SATAs do... Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-30 22:21 ` Jeff Garzik 2004-03-30 22:36 ` Chris Wedgwood 2004-03-30 22:40 ` Bartlomiej Zolnierkiewicz @ 2004-03-31 14:08 ` Stephen C. Tweedie 2004-03-31 14:21 ` Chris Mason 2004-03-31 21:27 ` Jeff Garzik 2 siblings, 2 replies; 63+ messages in thread From: Stephen C. Tweedie @ 2004-03-31 14:08 UTC (permalink / raw) To: Jeff Garzik Cc: Chris Mason, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel, Stephen Tweedie Hi, On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote: > For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't > return until the data is on the platter. fsync() is still really nasty, because that can require that we wait on IO that was submitted by the VM before we knew that there was a synchronous IO wait coming. SCSI also has an FUA bit that can make a difference if you've got writeback caching enabled. (And FUA on read can bypass drive writethrough caches, too, for media verification.) --Stephen ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-31 14:08 ` Stephen C. Tweedie @ 2004-03-31 14:21 ` Chris Mason 2004-03-31 21:26 ` Jeff Garzik 2004-03-31 21:27 ` Jeff Garzik 1 sibling, 1 reply; 63+ messages in thread From: Chris Mason @ 2004-03-31 14:21 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Jeff Garzik, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel On Wed, 2004-03-31 at 09:08, Stephen C. Tweedie wrote: > Hi, > > On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote: > > > For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't > > return until the data is on the platter. > > fsync() is still really nasty, because that can require that we wait on > IO that was submitted by the VM before we knew that there was a > synchronous IO wait coming. Yes, it gets ugly in a hurry. Jeff, look at the whole thread about the O_DIRECT read vs buffered write races. I don't think we can use FUA for fsync or O_SYNC without using it for every write. We might be able to get away with using it on O_DIRECT. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-31 14:21 ` Chris Mason @ 2004-03-31 21:26 ` Jeff Garzik 2004-03-31 22:09 ` Chris Mason 0 siblings, 1 reply; 63+ messages in thread From: Jeff Garzik @ 2004-03-31 21:26 UTC (permalink / raw) To: Chris Mason Cc: Stephen C. Tweedie, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel Chris Mason wrote: > On Wed, 2004-03-31 at 09:08, Stephen C. Tweedie wrote: > >>Hi, >> >>On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote: >> >> >>>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't >>>return until the data is on the platter. >> >>fsync() is still really nasty, because that can require that we wait on >>IO that was submitted by the VM before we knew that there was a >>synchronous IO wait coming. > > > Yes, it gets ugly in a hurry. Jeff, look at the whole thread about the > O_DIRECT read vs buffered write races. I don't think we can use FUA for Yes, I'm aware of the thread... > fsync or O_SYNC without using it for every write. Why not for O_SYNC? Is some crazy userspace application flipping this bit on and off rapidly? > We might be able to get away with using it on O_DIRECT. Nod. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-31 21:26 ` Jeff Garzik @ 2004-03-31 22:09 ` Chris Mason 0 siblings, 0 replies; 63+ messages in thread From: Chris Mason @ 2004-03-31 22:09 UTC (permalink / raw) To: Jeff Garzik Cc: Stephen C. Tweedie, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel On Wed, 2004-03-31 at 16:26, Jeff Garzik wrote: > > > Yes, it gets ugly in a hurry. Jeff, look at the whole thread about the > > O_DIRECT read vs buffered write races. I don't think we can use FUA for > > Yes, I'm aware of the thread... > > > > fsync or O_SYNC without using it for every write. > > Why not for O_SYNC? Is some crazy userspace application flipping this > bit on and off rapidly? > For both fsync and O_SYNC, the pages we want to write synchronously are also available for some other part of the kernel to write async. Since we do know the write is going to be O_SYNC when we are marking the pages dirty, we could mark them dirty_fua or something as well. Even assuming we can deal with the data=ordered ext3/reiserfs issues, it makes the writeback for O_SYNC yet another corner case to check, and one where we have no useful way to make sure the fua bit really got set on all the writes for a given O_SYNC (unless we pin the page and check each one after the writes are complete). Since O_DIRECT is much less complex I think we should start there. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-31 14:08 ` Stephen C. Tweedie 2004-03-31 14:21 ` Chris Mason @ 2004-03-31 21:27 ` Jeff Garzik 1 sibling, 0 replies; 63+ messages in thread From: Jeff Garzik @ 2004-03-31 21:27 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Chris Mason, Bartlomiej Zolnierkiewicz, Jens Axboe, Linux Kernel Stephen C. Tweedie wrote: > Hi, > > On Tue, 2004-03-30 at 23:21, Jeff Garzik wrote: > > >>For IDE, O_DIRECT and O_SYNC can use special "FUA" commands, which don't >>return until the data is on the platter. > > > fsync() is still really nasty, because that can require that we wait on > IO that was submitted by the VM before we knew that there was a > synchronous IO wait coming. SCSI also has an FUA bit that can make a > difference if you've got writeback caching enabled. (And FUA on read > can bypass drive writethrough caches, too, for media verification.) Agreed, but I did not mention fsync(), since that would not be appropriate to FUA like O_DIRECT or O_SYNC might be. Jeff ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 15:35 [PATCH] barrier patch set Jens Axboe 2004-03-19 16:30 ` Mika Penttilä 2004-03-19 16:34 ` Jeff Garzik @ 2004-03-19 16:48 ` Marc-Christian Petersen 2004-03-19 18:19 ` Jens Axboe 2004-03-22 11:09 ` Andrew Morton 3 siblings, 1 reply; 63+ messages in thread From: Marc-Christian Petersen @ 2004-03-19 16:48 UTC (permalink / raw) To: linux-kernel; +Cc: Jens Axboe, Chris Mason On Friday 19 March 2004 16:35, Jens Axboe wrote: Hi Jens, > A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I > have a few changes planned to support dm/md + sata, I'll do those > changes over the weekend. > Reiser has the best barrier support, ext3 works but only if things don't > go wrong. So only attempt to use the barrier feature on ext3 if on ide > drives, not SCSI nor SATA. > reiserfs-barrier-2.6.5-rc1-mm2-1 > reiser part. is this intended? ;) -rw------- 1 axboe axboe 3377 Mar 19 07:32 reiserfs-barrier-2.6.5-rc1-mm2-1.bz2 -rw------- 1 axboe axboe 248 Mar 19 07:32 reiserfs-barrier-2.6.5-rc1-mm2-1.bz2.sign -rw------- 1 axboe axboe 3473 Mar 19 07:32 reiserfs-barrier-2.6.5-rc1-mm2-1.gz -rw------- 1 axboe axboe 248 Mar 19 07:32 reiserfs-barrier-2.6.5-rc1-mm2-1.gz.sign -rw------- 1 axboe axboe 248 Mar 19 07:32 reiserfs-barrier-2.6.5-rc1-mm2-1.sign means permission denied for us. ciao, Marc ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 16:48 ` Marc-Christian Petersen @ 2004-03-19 18:19 ` Jens Axboe 0 siblings, 0 replies; 63+ messages in thread From: Jens Axboe @ 2004-03-19 18:19 UTC (permalink / raw) To: Marc-Christian Petersen; +Cc: linux-kernel, Chris Mason On Fri, Mar 19 2004, Marc-Christian Petersen wrote: > On Friday 19 March 2004 16:35, Jens Axboe wrote: > > Hi Jens, > > > A first release of a collected barrier patchset for 2.6.5-rc1-mm2. I > > have a few changes planned to support dm/md + sata, I'll do those > > changes over the weekend. > > Reiser has the best barrier support, ext3 works but only if things don't > > go wrong. So only attempt to use the barrier feature on ext3 if on ide > > drives, not SCSI nor SATA. > > reiserfs-barrier-2.6.5-rc1-mm2-1 > > reiser part. > > is this intended? ;) > > -rw------- 1 axboe axboe 3377 Mar 19 07:32 > reiserfs-barrier-2.6.5-rc1-mm2-1.bz2 > -rw------- 1 axboe axboe 248 Mar 19 07:32 > reiserfs-barrier-2.6.5-rc1-mm2-1.bz2.sign > -rw------- 1 axboe axboe 3473 Mar 19 07:32 > reiserfs-barrier-2.6.5-rc1-mm2-1.gz > -rw------- 1 axboe axboe 248 Mar 19 07:32 > reiserfs-barrier-2.6.5-rc1-mm2-1.gz.sign > -rw------- 1 axboe axboe 248 Mar 19 07:32 > reiserfs-barrier-2.6.5-rc1-mm2-1.sign > > means permission denied for us. Corrected, thanks. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-19 15:35 [PATCH] barrier patch set Jens Axboe ` (2 preceding siblings ...) 2004-03-19 16:48 ` Marc-Christian Petersen @ 2004-03-22 11:09 ` Andrew Morton 2004-03-22 11:10 ` Jens Axboe 3 siblings, 1 reply; 63+ messages in thread From: Andrew Morton @ 2004-03-22 11:09 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, mason Jens Axboe <axboe@suse.de> wrote: > > A first release of a collected barrier patchset for 2.6.5-rc1-mm2. The tagging of BIOs with set_buffer_ordered() or WRITE_BARRIER is a little awkward. Take the case of an ext2 fsync() or even an ext3 fsync() which frequently will not trigger a commit. If we must perform the barrier by tagging the final BIO, that will be tricky to implement. I could set some new field in struct writeback_control and rework the mpage code, but working out "this is the final BIO for this operation" is a fairly hard thing to do. sys_sync() would require even more VFS surgery. Generally, it would be much preferable to use the blkdev_issue_flush() API. What is the status of that? ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH] barrier patch set 2004-03-22 11:09 ` Andrew Morton @ 2004-03-22 11:10 ` Jens Axboe 0 siblings, 0 replies; 63+ messages in thread From: Jens Axboe @ 2004-03-22 11:10 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, mason On Mon, Mar 22 2004, Andrew Morton wrote: > Jens Axboe <axboe@suse.de> wrote: > > > > A first release of a collected barrier patchset for 2.6.5-rc1-mm2. > > The tagging of BIOs with set_buffer_ordered() or WRITE_BARRIER is a little > awkward. > > Take the case of an ext2 fsync() or even an ext3 fsync() which frequently > will not trigger a commit. If we must perform the barrier by tagging the > final BIO, that will be tricky to implement. I could set some new field in > struct writeback_control and rework the mpage code, but working out "this > is the final BIO for this operation" is a fairly hard thing to do. > sys_sync() would require even more VFS surgery. Yeah, it's not very pretty if you have to track the last sumit. Chris complained about that in 2.4 as well :-) > Generally, it would be much preferable to use the blkdev_issue_flush() > API. What is the status of that? It'll be fully supported. -- Jens Axboe ^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2004-03-31 22:09 UTC | newest] Thread overview: 63+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-03-19 15:35 [PATCH] barrier patch set Jens Axboe 2004-03-19 16:30 ` Mika Penttilä 2004-03-19 18:16 ` Jens Axboe 2004-03-19 18:44 ` Mika Penttilä 2004-03-20 9:55 ` Jens Axboe 2004-03-19 16:34 ` Jeff Garzik 2004-03-19 18:19 ` Jens Axboe 2004-03-19 23:01 ` Matthias Andree 2004-03-20 0:02 ` Bartlomiej Zolnierkiewicz 2004-03-20 1:48 ` Johannes Stezenbach 2004-03-20 2:13 ` Bartlomiej Zolnierkiewicz 2004-03-20 2:53 ` Johannes Stezenbach 2004-03-20 16:03 ` Bartlomiej Zolnierkiewicz 2004-03-20 11:36 ` Matthias Andree 2004-03-20 16:00 ` Bartlomiej Zolnierkiewicz 2004-03-20 23:36 ` Johannes Stezenbach 2004-03-21 1:33 ` Bartlomiej Zolnierkiewicz 2004-03-20 18:52 ` Helge Hafting 2004-03-22 11:15 ` Matthias Andree 2004-03-19 23:59 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:14 ` Jeff Garzik 2004-03-20 0:40 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:42 ` Jeff Garzik 2004-03-20 1:24 ` Bartlomiej Zolnierkiewicz 2004-03-20 9:58 ` Jens Axboe 2004-03-20 10:12 ` Jeff Garzik 2004-03-20 10:19 ` Jens Axboe 2004-03-20 10:37 ` Jeff Garzik 2004-03-20 16:30 ` Bartlomiej Zolnierkiewicz 2004-03-21 18:12 ` Jeff Garzik 2004-03-20 10:21 ` Jeff Garzik 2004-03-20 15:54 ` Bartlomiej Zolnierkiewicz 2004-03-20 0:17 ` Jeff Garzik 2004-03-20 9:53 ` Jens Axboe 2004-03-20 16:23 ` Bartlomiej Zolnierkiewicz 2004-03-20 16:27 ` Jens Axboe 2004-03-20 16:32 ` Chris Mason 2004-03-20 17:05 ` Bartlomiej Zolnierkiewicz 2004-03-20 17:10 ` Chris Mason 2004-03-20 20:16 ` Bartlomiej Zolnierkiewicz 2004-03-21 9:43 ` Jens Axboe 2004-03-30 16:04 ` Stephen C. Tweedie 2004-03-30 19:19 ` Chris Mason 2004-03-30 21:50 ` Stephen C. Tweedie 2004-03-30 22:13 ` Chris Mason 2004-03-31 14:03 ` Stephen C. Tweedie 2004-03-31 14:27 ` Chris Mason 2004-03-31 18:28 ` Ric Wheeler 2004-03-30 22:21 ` Jeff Garzik 2004-03-30 22:36 ` Chris Wedgwood 2004-03-30 22:39 ` Jeff Garzik 2004-03-30 22:41 ` Chris Wedgwood 2004-03-30 22:40 ` Bartlomiej Zolnierkiewicz 2004-03-30 22:38 ` Jeff Garzik 2004-03-31 14:08 ` Stephen C. Tweedie 2004-03-31 14:21 ` Chris Mason 2004-03-31 21:26 ` Jeff Garzik 2004-03-31 22:09 ` Chris Mason 2004-03-31 21:27 ` Jeff Garzik 2004-03-19 16:48 ` Marc-Christian Petersen 2004-03-19 18:19 ` Jens Axboe 2004-03-22 11:09 ` Andrew Morton 2004-03-22 11:10 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox