From: Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
To: Chris Mason <chris.mason@oracle.com>,
Sage Weil <sage@newdream.net>,
Andrey Kuzmin <andrey.v.kuzmin@gmail.com>,
linux-btrfs@vger.kernel.org
Subject: Re: [RFC] big fat transaction ioctl
Date: Wed, 11 Nov 2009 18:41:06 +0300
Message-ID: <2a31deca0911110741xb3529cbi982d982ef171de9f@mail.gmail.com>
In-Reply-To: <20091111150356.GC5566@think>
On Wed, Nov 11, 2009 at 6:03 PM, Chris Mason <chris.mason@oracle.com> wrote:
> On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote:
>> On Tue, 10 Nov 2009, Andrey Kuzmin wrote:
>>
>> > On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil <sage@newdream.net> wrote:
>> > > Hi all,
>> > >
>> > > This is an alternative approach to atomic user transactions for btrfs.
>> > > The old start/end ioctls suffer from some basic limitations, namely
>> > >
>> > >  - We can't properly reserve space ahead of time to avoid ENOSPC part
>> > >    way through the transaction, and
>> > >  - The process may die (seg fault, SIGKILL) part way through the
>> > >    transaction. Currently when that happens the partial transaction will
>> > >    commit.
>> > >
>> > > This patch implements an ioctl that lets the application completely
>> > > specify the entire transaction in a single syscall. If the process gets
>> > > killed or seg faults part way through, the entire transaction will still
>> > > complete.
>> > >
>> > > The goal is to atomically commit updates to multiple files, xattrs,
>> > > directories. But this is still a file system: we don't get rollback if
>> > > things go wrong. Instead, do what we can up front to make sure things
>> > > will work out. And if things do go wrong, optionally prevent a partial
>> > > result from reaching the disk.
>> >
>> > Why not snapshot respective root (doesn't work if transaction spans
>> > multiple file-systems, but this doesn't look like a real-world
>> > limitation), run txn against that snapshot and rollback on failure
>> > instead? Snapshots are writable, cheap, and this looks like a real
>> > transaction abort mechanism.
>>
>> Good question. :)
>>
>> I hadn't looked into this before, but I think the snapshots could be used
>> to achieve both atomicity and rollback. If userspace uses an rw mutex to
>> quiesce writes, it can make sure all transactions complete before creating
>> a snapshot (commit). The problem with this currently is the create
>> snapshot ioctl is relatively slow... it calls commit_transaction, which
>> blocks until everything reaches disk. I think to perform well this
>> approach would need a hook to start a commit and then return as soon as it
>> can guarantee that any subsequent operation's start_transaction can't join
>> in that commit.
>>
>> This may be a better way to go about this, though. Does that sound
>> reasonable, Chris?
>
> Yes, we could do this, but I don't think it will perform very well
> compared to your multi-operation ioctl. It really does depend on how
> often you need to do atomic ops (my guess is very).
>
> Honestly you'll get better performance with a simple write-ahead log
> from userland:
Write-ahead logging is necessary anyway if the aim is to provide
transactional semantics to an application. But, at the same time, without
a snapshot there is no synchronization between the log and the
file-system state.
Regards,
Andrey
>
> step1: write redo log somewhere in the FS, with enough information to
> bring all the objects you're about to touch to a consistent state.
> step2: fsync the log
> step3: do your operations
> step4: append a record to the undo log that invalidates the last log
> op, or just truncate it to zero.
> step5: fsync the log.
>
> The big advantage of the log is that you won't be tied to btrfs, but
> it's two fsyncs where the big transaction framework does none. This
> should allow you to turn on the fast fsync log again, but I think the
> multi-operation ioctl would do that as well.
>
> -chris
>
>
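The five steps quoted above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not code from the thread:
the names run_transaction, apply_ops and recover are hypothetical, the
"operations" are reduced to whole-file rewrites so that replaying the log
is idempotent, and the log is a single JSON record. The sketch also
fsyncs each data file before clearing the log, which is a conservative
addition that is not part of the quoted step list.

```python
import json
import os


def apply_ops(ops):
    """Idempotently (re)apply whole-file writes given as {path: contents}."""
    for path, data in ops.items():
        with open(path, "w") as f:
            f.write(data)
            f.flush()
            # Assumption: make the data durable before the log is cleared.
            os.fsync(f.fileno())


def clear_log(log_path):
    """Truncate the log to zero and fsync it (steps 4 and 5)."""
    with open(log_path, "w") as log:
        log.flush()
        os.fsync(log.fileno())


def run_transaction(log_path, ops):
    # step 1: write a redo log with enough information to redo every op
    with open(log_path, "w") as log:
        json.dump(ops, log)
        log.flush()
        # step 2: fsync the log
        os.fsync(log.fileno())
    # step 3: do the operations
    apply_ops(ops)
    # steps 4 and 5: invalidate the log, then fsync it
    clear_log(log_path)


def recover(log_path):
    """Run at startup: replay a complete log left behind by a crash.

    Returns True if a logged transaction was replayed.
    """
    if not os.path.exists(log_path) or os.path.getsize(log_path) == 0:
        return False  # nothing was in flight
    with open(log_path) as log:
        try:
            ops = json.load(log)
        except ValueError:
            # Torn log: the crash hit while the log was being written,
            # so step 3 had not started and there is nothing to redo.
            ops = None
    if ops is not None:
        apply_ops(ops)
    clear_log(log_path)
    return ops is not None
```

An application would call recover() once at startup and then wrap each
multi-file update in run_transaction(). Note that the log here is kept
consistent only by the application itself; nothing ties it to the
file-system state beyond the order of the fsync calls.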