From: Jamie Lokier
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 03:25:20 +0000
Message-ID: <20090121032520.GA2816@shareable.org>
References: <20090120164726.GA24891@wotan.suse.de>
 <20090120183120.GD27464@shareable.org>
 <20090121012900.GD24891@wotan.suse.de>
In-Reply-To: <20090121012900.GD24891@wotan.suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
To: Nick Piggin
Cc: linux-fsdevel@vger.kernel.org

Nick Piggin wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
>
> Do you know which databases do this? It would be nice to ask for their
> input and see whether it helps them (I presume it is an OSS database,
> because the "big" ones just use direct I/O and manage their own
> buffers, right?)

I just found this (PostgreSQL):

    http://markmail.org/message/injyo7coein7o3xz

Tom Lane writes (on org.postgresql.pgsql-hackers):
> Greg Stark writes:
> > Come to think of it I wonder whether there's anything to be gained by
> > using smaller files for tables. Instead of 1G files maybe 256M files
> > or something like that to reduce the hit of fsyncing a file.
>
> Actually probably not. The weak part of our current approach is that
> we tell the kernel "sync this file", then "sync that file", etc, in a
> more or less random order. This leads to a probably non-optimal
> sequence of disk accesses to complete a checkpoint. What we would
> really like is a way to tell the kernel "sync all these files, and let
> me know when you're done" --- then the kernel and hardware have some
> shot at scheduling all the writes in an intelligent fashion.
>
> sync_file_range() is not that exactly, but since it lets you request
> syncing and then go back and wait for the syncs later, we could get
> the desired effect with two passes over the file list. (If the file
> list is longer than our allowed number of open files, though, the
> extra opens/closes could hurt.)
>
> Smaller files would make the I/O scheduling problem worse, not better.

So if you can make commit-to-multiple-files-in-optimal-I/O-scheduling-order
work, that would be even better ;-)

Seems to me the PostgreSQL case could be improved by issuing parallel
fdatasync() calls, each in its own thread. Not optimal, exactly, but
more parallelism for the I/O scheduler to work with. (Still limited by
the I/O request queue filling up with big flushes, so one fdatasync()
could end up starving the others.) Rough sketches of both approaches
follow below.

-- Jamie
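
As an illustration of the two-pass idea Tom Lane describes above, here
is a minimal userspace sketch. It assumes an array of already-open file
descriptors; sync_files_two_pass, fds and nfds are made-up names, not
anything Postgres actually has:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Pass 1 starts writeback on every file without waiting, so the block
 * layer sees requests for all the files at once and can reorder them;
 * pass 2 goes back and waits for the writeback to complete. */
static int sync_files_two_pass(int *fds, int nfds)
{
        int i, err = 0;

        for (i = 0; i < nfds; i++)
                if (sync_file_range(fds[i], 0, 0,
                                    SYNC_FILE_RANGE_WRITE)) {
                        perror("sync_file_range (write)");
                        err = -1;
                }

        for (i = 0; i < nfds; i++)
                if (sync_file_range(fds[i], 0, 0,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WAIT_AFTER)) {
                        perror("sync_file_range (wait)");
                        err = -1;
                }

        return err;
}

(offset == 0 with nbytes == 0 means "through to end of file".) The
catch, of course, is that sync_file_range() only pushes data pages: it
doesn't commit the metadata needed to find those pages after a crash,
and it doesn't flush the drive's write cache, which is part of why an
fsync_range() is interesting in the first place.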
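
And a rough sketch of the parallel-fdatasync idea, one fdatasync() per
thread; again the names are invented, and a real version would cap the
number of threads and report errors back properly:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Thread body: flush one file's data blocks (fdatasync skips the
 * pure-timestamp inode writes that fsync would also force out). */
static void *fdatasync_worker(void *arg)
{
        int fd = *(int *)arg;

        if (fdatasync(fd) != 0)
                perror("fdatasync");
        return NULL;
}

static void sync_files_parallel(int *fds, int nfds)
{
        pthread_t *tids = calloc(nfds, sizeof(*tids));
        int i;

        /* All the flushes are in flight together, so the I/O scheduler
         * can merge and reorder across files instead of draining each
         * file serially. */
        for (i = 0; i < nfds; i++)
                pthread_create(&tids[i], NULL, fdatasync_worker, &fds[i]);

        /* The checkpoint is done only when every flush has returned. */
        for (i = 0; i < nfds; i++)
                pthread_join(tids[i], NULL);

        free(tids);
}

This gets the overlap, but as noted above, once the request queue fills
up the threads just contend with each other; it's still a workaround
for not having a "sync all these ranges, then tell me" call in the
kernel.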