From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vladislav Bolkhovitin
Subject: Re: [sqlite] light weight write barriers
Date: Tue, 23 Oct 2012 15:53:11 -0400
Message-ID: <5086F5A7.9090406@vlnb.net>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: General Discussion of SQLite Database, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, drh@hwaci.com
To: 杨苏立 Yang Su Li
Return-path:
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:
> I am not quite whether I should ask this question here, but in terms
> of light weight barrier/fsync, could anyone tell me why the device
> driver / OS provide the barrier interface other than some other
> abstractions anyway? I am sorry if this sounds like a stupid questions
> or it has been discussed before....
>
> I mean, most of the time, we only need some ordering in writes; not
> complete order, but partial,very simple topological order. And a
> barrier seems to be a heavy weighted solution to achieve this anyway:
> you have to finish all writes before the barrier, then start all
> writes issued after the barrier. That is some ordering which is much
> stronger than what we need, isn't it?
>
> As most of the time the order we need do not involve too many blocks
> (certainly a lot less than all the cached blocks in the system or in
> the disk's cache), that topological order isn't likely to be very
> complicated, and I image it could be implemented efficiently in a
> modern device, which already has complicated caching/garbage
> collection/whatever going on internally. Particularly, it seems not
> too hard to be implemented on top of SCSI's ordered/simple task mode?

Yes, SCSI has full support for ORDERED/SIMPLE commands, designed for
exactly that task: keeping a steady flow of commands even when some of
them are ordered. It also has the facilities needed to handle command
errors without unexpected reordering of subsequent commands (ACA, etc.).
Together these let you get full storage performance by keeping the pipe
full, to use a networking term. I can easily imagine real-life
configurations where this brings 2+ times more performance than queue
flushing.

In fact, AFAIK, AIX requires storage to support ordered commands and ACA.

Implementation should be relatively easy as well, because all transports
naturally have the link as the point of serialization. So all you need in
a multithreaded environment is to pass some SN (sequence number) from the
point where each ORDERED command is created to the point where it is sent
to the link, and to make sure that no SIMPLE command can ever cross an
ORDERED command. You can see how this is implemented in SCST in an
elegant and lockless manner (for SIMPLE commands).

But historically, for some reason, Linux storage developers got stuck on
the "barriers" concept, which is obviously not the same as ORDERED
commands, and hence had a lot of trouble with its ambiguous semantics. As
far as I can tell, the reason was a lack of sufficiently deep SCSI
understanding (how to handle errors, the belief that ACA is a legacy of
parallel SCSI times, etc.).

Hopefully, the storage developers will eventually realize the value of
ordered commands and learn the corresponding SCSI facilities for dealing
with them. This value is quite easy to demonstrate if you know where to
look and do not blindly refuse the possibility. I have already tried to
explain it a couple of times, but without success.

Until that happens, people will keep coming back again and again with
these simple questions: why must the queue be flushed for any ordered
operation? Isn't that obvious overkill?

Vlad
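
For readers who want something concrete, here is a minimal toy model of
the sequence-number idea described in the message: every command is
stamped with an SN at creation, and SIMPLE commands may be reordered
freely as long as none of them crosses an ORDERED command. This is only a
sketch under stated assumptions, not SCST code. The names Command,
Submitter, schedule and respects_ordering are hypothetical, the model is
single-threaded (no lockless machinery), and sorting by LBA merely stands
in for whatever reordering a real device would choose.

```python
from dataclasses import dataclass
from typing import List

SIMPLE = "SIMPLE"
ORDERED = "ORDERED"


@dataclass(frozen=True)
class Command:
    sn: int    # sequence number assigned at creation (hypothetical field)
    attr: str  # task attribute: SIMPLE or ORDERED
    lba: int   # target block; used here only to pick a reordering


class Submitter:
    """Stamps each command with a monotonically increasing SN."""

    def __init__(self) -> None:
        self._next_sn = 0

    def submit(self, attr: str, lba: int) -> Command:
        cmd = Command(self._next_sn, attr, lba)
        self._next_sn += 1
        return cmd


def schedule(commands: List[Command]) -> List[Command]:
    """Pick an execution order without flushing the whole queue.

    SIMPLE commands between two ORDERED commands are reordered (by LBA,
    as a stand-in for a device's own policy). Each ORDERED command keeps
    its place: after everything submitted before it, before everything
    submitted after it.
    """
    out: List[Command] = []
    batch: List[Command] = []
    for cmd in sorted(commands, key=lambda c: c.sn):
        if cmd.attr == ORDERED:
            out.extend(sorted(batch, key=lambda c: c.lba))
            batch = []
            out.append(cmd)
        else:
            batch.append(cmd)
    out.extend(sorted(batch, key=lambda c: c.lba))
    return out


def respects_ordering(execution: List[Command]) -> bool:
    """True if no command was executed across an ORDERED command."""
    for i, earlier in enumerate(execution):
        for later in execution[i + 1:]:
            # 'earlier' ran first although it was submitted later.
            swapped = earlier.sn > later.sn
            if swapped and ORDERED in (earlier.attr, later.attr):
                return False
    return True


s = Submitter()
cmds = [
    s.submit(SIMPLE, 90),
    s.submit(SIMPLE, 10),
    s.submit(ORDERED, 50),
    s.submit(SIMPLE, 70),
    s.submit(SIMPLE, 20),
]
order = schedule(cmds)
print([c.sn for c in order])     # [1, 0, 2, 4, 3]
print(respects_ordering(order))  # True
```

In this model the two SIMPLE commands on each side of the ORDERED one are
swapped, yet nothing crosses it, so the ordering constraint holds while
commands keep flowing.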