From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S264389AbTLBWpb (ORCPT ); Tue, 2 Dec 2003 17:45:31 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S264395AbTLBWpb (ORCPT ); Tue, 2 Dec 2003 17:45:31 -0500 Received: from tmr-02.dsl.thebiz.net ([216.238.38.204]:53512 "EHLO gatekeeper.tmr.com") by vger.kernel.org with ESMTP id S264389AbTLBWp3 (ORCPT ); Tue, 2 Dec 2003 17:45:29 -0500 To: linux-kernel@vger.kernel.org Path: gatekeeper.tmr.com!davidsen From: davidsen@tmr.com (bill davidsen) Newsgroups: mail.linux-kernel Subject: Re: libata in 2.4.24? Date: 2 Dec 2003 22:34:20 GMT Organization: TMR Associates, Schenectady NY Message-ID: References: <3FCB8312.3050703@rackable.com> <87iskz9hp6.fsf@stark.dyndns.tv> <20031202190646.GA9043@gtf.org> <877k1f9e1g.fsf@stark.dyndns.tv> X-Trace: gatekeeper.tmr.com 1070404460 14203 192.168.12.62 (2 Dec 2003 22:34:20 GMT) X-Complaints-To: abuse@tmr.com Originator: davidsen@gatekeeper.tmr.com Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org In article <877k1f9e1g.fsf@stark.dyndns.tv>, Greg Stark wrote: | Jeff Garzik writes: | > > This doesn't happen with SCSI disks where multiple requests can be pending so | > > there's no urgency to reporting a false success. The request doesn't complete | > > until the write hits disk. As a result SCSI disks are reliable for database | > > operation and IDE disks aren't unless write caching is disabled. | > | > This is not really true. | > | > Regardless of TCQ, if the OS driver has not issued a FLUSH CACHE (IDE) | > or SYNCHRONIZE CACHE (SCSI), then the data is not guaranteed to be on | > the disk media. Plain and simple. | | That doesn't agree with people's experience. People seem to find that SCSI | drives never cache writes. This sort of makes sense since there's just not | much reason to report a write success before the write can be performed. | There's no performance advantage as long as more requests can be queued up. I hope you mean the drives don't report completion until the data is on the platter, clearly the data is cached in the drive until it can be written. | | | > If fsync(2) returns without a flush-cache, then your data is not | > guaranteed to be on the disk. And as you noted, flush-cache destroys | > performance. | | It's my understanding that it doesn't. There was some discussion in the past | month about making the drivers issue syncs for journalled filesystems, but | even then the idea of adding it to fsync or O_SYNC files wasn't the | motivation. With O_SYNC files there is the possibility of having a don't cache bit in the packet to the drive, even with write caching. With fsync I don't see any way to do it after the fact for only some of the data in the drive cache. That's just an observation. Clearly with a completion status coming back after actual completion O_SYNC or fsync reduce to "wait for the ack from the drive." | | | > There are three levels: | > | > a) Data is successfully transferred to the controller/drive queue (TCQ). | > b) Data is successfully transferred to the drive's internal buffers. | > c) The drive successfully transfers data to the media. | | Only the third is of interest to Postgres or other databases. In fact, I | suspect only the third is of interest to other systems that are supposed to be | reliable like MTAs etc. I think Wietse and others would be shocked if they | were told fsync wasn't guaranteed to have waited until the writes had actually | hit the media. I think for reliability fsync has to flush cache, regardless of the performance hit. I think a drive would be unusably slow if you did it after each O_SYNC write, so that's probably not practical. Clearly the best solution is a full SCSI implementation over PATA/SATA, but that would eliminate some of the justification for SCSI devices at premium prices. -- bill davidsen CTO, TMR Associates, Inc Doing interesting things with little computers since 1979.