From: Jamie Lokier
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 03:25:20 +0000
Message-ID: <20090121032520.GA2816@shareable.org>
References: <20090120164726.GA24891@wotan.suse.de>
 <20090120183120.GD27464@shareable.org>
 <20090121012900.GD24891@wotan.suse.de>
In-Reply-To: <20090121012900.GD24891@wotan.suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
To: Nick Piggin
Cc: linux-fsdevel@vger.kernel.org

Nick Piggin wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
>
> Do you know which databases do this? It would be nice to ask for their
> input and see whether it helps them (I presume it is an OSS database,
> because the "big" ones just use direct I/O and manage their own
> buffers, right?)

I just found this (PostgreSQL):

    http://markmail.org/message/injyo7coein7o3xz

Tom Lane writes (on org.postgresql.pgsql-hackers):
> Greg Stark writes:
> > Come to think of it I wonder whether there's anything to be gained by
> > using smaller files for tables. Instead of 1G files maybe 256M files
> > or something like that to reduce the hit of fsyncing a file.
>
> Actually probably not. The weak part of our current approach is that
> we tell the kernel "sync this file", then "sync that file", etc, in a
> more or less random order. This leads to a probably non-optimal
> sequence of disk accesses to complete a checkpoint. What we would
> really like is a way to tell the kernel "sync all these files, and let
> me know when you're done" --- then the kernel and hardware have some
> shot at scheduling all the writes in an intelligent fashion.
>
> sync_file_range() is not that exactly, but since it lets you request
> syncing and then go back and wait for the syncs later, we could get
> the desired effect with two passes over the file list. (If the file
> list is longer than our allowed number of open files, though, the
> extra opens/closes could hurt.)
>
> Smaller files would make the I/O scheduling problem worse, not better.

So if you can make commit-to-multiple-files-in-optimal-I/O-scheduling-order
work, that would be even better ;-)

Seems to me the PostgreSQL case could be improved by issuing parallel
fdatasync() calls, each in its own thread. Not optimal, exactly, but
more parallelism for the I/O scheduler to work with. (Still limited by
the I/O request queue filling up with big flushes, so one fdatasync()
could end up starving the others.) Rough sketches of both approaches
follow below.

-- Jamie
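
As an illustration of the two-pass idea Tom Lane describes above, here
is a minimal userspace sketch. It assumes an array of already-open file
descriptors; sync_files_two_pass, fds and nfds are made-up names, not
anything Postgres actually has:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Pass 1 starts writeback on every file without waiting, so the block
 * layer sees requests for all the files at once and can reorder them;
 * pass 2 goes back and waits for the writeback to complete. */
static int sync_files_two_pass(int *fds, int nfds)
{
        int i, err = 0;

        for (i = 0; i < nfds; i++)
                if (sync_file_range(fds[i], 0, 0,
                                    SYNC_FILE_RANGE_WRITE)) {
                        perror("sync_file_range (write)");
                        err = -1;
                }

        for (i = 0; i < nfds; i++)
                if (sync_file_range(fds[i], 0, 0,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WAIT_AFTER)) {
                        perror("sync_file_range (wait)");
                        err = -1;
                }

        return err;
}

(offset == 0 with nbytes == 0 means "through to end of file".) The
catch, of course, is that sync_file_range() only pushes data pages: it
doesn't commit the metadata needed to find those pages after a crash,
and it doesn't flush the drive's write cache, which is part of why an
fsync_range() is interesting in the first place.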
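
And a rough sketch of the parallel-fdatasync idea, one fdatasync() per
thread; again the names are invented, and a real version would cap the
number of threads and report errors back properly:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Thread body: flush one file's data blocks (fdatasync skips the
 * pure-timestamp inode writes that fsync would also force out). */
static void *fdatasync_worker(void *arg)
{
        int fd = *(int *)arg;

        if (fdatasync(fd) != 0)
                perror("fdatasync");
        return NULL;
}

static void sync_files_parallel(int *fds, int nfds)
{
        pthread_t *tids = calloc(nfds, sizeof(*tids));
        int i;

        /* All the flushes are in flight together, so the I/O scheduler
         * can merge and reorder across files instead of draining each
         * file serially. */
        for (i = 0; i < nfds; i++)
                pthread_create(&tids[i], NULL, fdatasync_worker, &fds[i]);

        /* The checkpoint is done only when every flush has returned. */
        for (i = 0; i < nfds; i++)
                pthread_join(tids[i], NULL);

        free(tids);
}

This gets the overlap, but as noted above, once the request queue fills
up the threads just contend with each other; it's still a workaround
for not having a "sync all these ranges, then tell me" call in the
kernel.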