Re: filesystem transactions API

From: Jamie Lokier <jamie@shareable.org>
To: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: John Stoffel <john@stoffel.org>,
	"Artem B. Bityuckiy" <dedekind@oktetlabs.ru>,
	Ville Herva <v@iki.fi>,
	Linux Filesystem Development <linux-fsdevel@vger.kernel.org>,
	linux-kernel@vger.kernel.org
Subject: Re: filesystem transactions API
Date: Tue, 26 Apr 2005 16:47:08 +0100
Message-ID: <20050426154708.GC14297@mail.shareable.org>
In-Reply-To: <1114528782.13568.8.camel@lade.trondhjem.org>

Trond Myklebust wrote:
> > Jamie> No.  A transaction means that _all_ processes will see the
> > Jamie> whole transaction or not.
> > 
> > This is really hard.  How do you handle the case where process X
> > starts a transaction modifies files a, b & c, but process Y has file b
> > open for writing, and never lets it go?  Or the file gets unlinked?  
> 
> That is why implementing it as a form of lock makes sense.

The problem with making them exclusive locks is that you halt the
system for the duration of the transaction.  If it's a big transaction
such as updating 1000 files for a package update, that blocks a lot of
programs for a long time, and it's not necessary.

And, because that's a potential denial of service, you have to limit
the size of transactions and their duration, especially for ordinary
users.  That makes transactions a lot less useful than they can be.

I would implement them as a combination of a time-limited lock and an
abortable transaction, with file & directory reads establishing
prerequisites.

While the transaction lock is held, everything read (i.e. read byte
ranges, lock byte ranges, directory lookups, and stat results) causes
the corresponding range or inode to be exclusively locked for this
transaction, and is also recorded in the prerequisite set for this
transaction.  Everything written (i.e. byte ranges or any other
filesystem-modifying operation) is queued.
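
To make the bookkeeping concrete, here is a toy userspace model of that
lock phase.  It is only a sketch, not kernel code: the Txn class, its
attribute names, and the use of a plain dict as the "filesystem" are
all made up for illustration, and whole paths stand in for byte ranges
and inodes.

```python
class Txn:
    """Toy model of a transaction during the lock phase.

    'fs' is a dict of path -> bytes standing in for a filesystem
    (an assumption of this sketch, not a real interface).
    """

    def __init__(self, fs):
        self.fs = fs
        self.locked = set()    # exclusively locked for this transaction
        self.prereqs = set()   # everything this transaction has read
        self.queued = []       # pending (path, data) writes

    def read(self, path):
        # A read both locks the object and records it as a prerequisite.
        self.locked.add(path)
        self.prereqs.add(path)
        return self.fs.get(path)

    def write(self, path, data):
        # Writes are queued; nothing touches 'fs' until the close.
        self.queued.append((path, data))
```

Nothing another process can see changes until the transaction is
closed; the only side effects so far are the locks and the recorded
prerequisites.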

If the transaction lock timeout is reached before the transaction is
closed, all the exclusive locks for this transaction are released, and
the transaction lock itself is released, but the prerequisite set
continues to be recorded.

If at any time, another process tries to modify any of the information
in the transaction's prerequisite set, then firstly: if the
transaction lock is held, the other process is blocked until that lock
is released.  Secondly: if the other process successfully modifies
information in the transaction's prerequisite set, the transaction is
aborted.  All further operations in this transaction will fail,
including reads, writes, and the final close which commits writes.
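
The rule for a conflicting writer can be sketched as a small decision
function.  Again this is a toy model under assumed names (Txn,
lock_timeout, modify_by_other and the returned strings are all
hypothetical); a real implementation would put the other process to
sleep rather than return "blocked".

```python
class Txn:
    """Minimal transaction state needed for the conflict rule."""

    def __init__(self):
        self.prereqs = set()    # paths this transaction has read
        self.lock_held = True   # lock phase: time-limited exclusive lock
        self.aborted = False

    def lock_timeout(self):
        # On timeout the locks go away; prerequisites stay recorded.
        self.lock_held = False


def modify_by_other(txn, path):
    """Outcome when another process tries to modify 'path'."""
    if path not in txn.prereqs:
        return "ok"                 # unrelated to the transaction
    if txn.lock_held:
        return "blocked"            # must wait until the lock is released
    txn.aborted = True              # modification succeeds, txn is dead
    return "ok, transaction aborted"
```

A blocked writer simply retries once the lock has been released, at
which point its modification goes through and the transaction is
aborted.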

Finally, when the transaction is closed, either it fails because
prerequisites were modified, or it commits all the pending filesystem
modifications of this transaction.
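
As a sketch of the close step, assuming the queued writes are kept as
(path, data) pairs and the "filesystem" is just a dict: close_txn,
TxnAborted and the convention that data=None means unlink are
illustrative names and choices, not a proposed interface.  A real
implementation would also have to make the commit atomic with respect
to other processes, which this single-threaded toy does not model.

```python
class TxnAborted(Exception):
    """Raised at close when a prerequisite was modified."""


def close_txn(fs, queued, aborted):
    """Commit all queued writes to 'fs', or fail without writing any."""
    if aborted:
        raise TxnAborted("prerequisites were modified; nothing committed")
    for path, data in queued:
        if data is None:
            fs.pop(path, None)   # None models an unlink in this sketch
        else:
            fs[path] = data
```

Either every queued modification is applied, or the close fails and
the filesystem is left exactly as the other processes made it.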

Why two phases?

The second phase, with no exclusive locking, is to allow ordinary
users to use transactions without blocking other processes or hogging
excessive system resources.  It allows other processes to progress
while a big transaction is in progress.  In other words, it prevents
some kinds of denial-of-service, allows arbitrarily large transactions
as long as there's enough space in the filesystem, and is generally
better.

The first phase, with exclusive locking, uses a randomised timeout for
the lock.  This is to prevent starvation of transacting processes by
other processes.  It's analogous to the problem of readers starving
writers in some kinds of read-write locks.  The randomised timeout is
to prevent mutual starvation between two or more transacting
processes, which might otherwise get into synchronised livelock.
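
For what it's worth, the randomisation itself can be as simple as
jittering a base timeout.  The function name, the one-second base and
the 50% spread below are arbitrary assumptions for illustration;
nothing above fixes the actual distribution or values.

```python
import random


def lock_timeout(base=1.0, spread=0.5, rng=random):
    """Return a lock timeout in seconds, drawn uniformly from
    [base * (1 - spread), base * (1 + spread)]."""
    return base * rng.uniform(1.0 - spread, 1.0 + spread)
```

Two transacting processes that keep knocking each other out are then
unlikely to keep retrying in lock-step, because each attempt holds its
lock for a different length of time.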

Enjoy :)
-- Jamie
