* external journal questions
@ 2006-02-21 15:04 Jure Pečar
  2006-02-21 17:07 ` Jeff Mahoney
  2006-02-23 10:21 ` David Masover
  0 siblings, 2 replies; 5+ messages in thread
From: Jure Pečar @ 2006-02-21 15:04 UTC (permalink / raw)
  To: reiserfs-list


Hi all,

Now that solid state disks are getting affordable (Gigabyte iRam, for
example), it makes sense to use one as an external journal with full
data journaling, so that it absorbs all the small writes and dumps them
to the disks in a single sequential write on every journal flush.

I know how to configure that under ext3: simply set up an external
journal and mount the filesystem with data=journal and commit=600 or
some such value. But I'm not so sure about reiserfs.
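
For reference, the ext3 setup described above might look roughly like the commands assembled below. This is a sketch only: the device names and mount point are hypothetical placeholders, the script just prints the command lines rather than running them, and the journal device must be created with the same block size as the filesystem that uses it.

```python
# Hypothetical names, chosen for illustration only; substitute your own.
JOURNAL_DEV = "/dev/sdb1"   # assumed: partition on the solid state device
DATA_DEV = "/dev/sda1"      # assumed: the large data volume
MOUNT_POINT = "/srv/mail"   # assumed mount point

commands = [
    # Format the fast device as a dedicated external journal.
    ["mke2fs", "-O", "journal_dev", JOURNAL_DEV],
    # Create the ext3 filesystem and attach the external journal to it.
    ["mke2fs", "-j", "-J", f"device={JOURNAL_DEV}", DATA_DEV],
    # Mount with full data journaling and a 600-second commit interval.
    ["mount", "-t", "ext3", "-o", "data=journal,commit=600",
     DATA_DEV, MOUNT_POINT],
]

# Nothing is executed; the commands are only printed for review.
for argv in commands:
    print(" ".join(argv))
```

Whether reiserfs accepts the same commit option is exactly the open question here, so no reiserfs equivalent is shown.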

I know it now supports full data journaling, but I can't find any docs
that mention a commit mount option. Does it work at all, and does it
work the same way as for ext3?

Also, I've seen patches by Jeff Mahoney from November last year that
optimize some external journal defaults. I'm thinking about a 7 TB or so
Coraid AoE device with a 4 GB iRam as external journal ... does anyone
run something like this in production? How well does it work?

Jeff, are your patches already in the Linus tree or do I have to use
some -mm or SUSE kernel?


-- 

Jure Pečar
http://jure.pecar.org



* Re: external journal questions
  2006-02-21 15:04 external journal questions Jure Pečar
@ 2006-02-21 17:07 ` Jeff Mahoney
  2006-02-23 10:21 ` David Masover
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff Mahoney @ 2006-02-21 17:07 UTC (permalink / raw)
  To: Jure Pečar; +Cc: reiserfs-list

Hi Jure -

The kernel patches have been part of mainline since Nov 30, 2005. Any
kernel newer than 2.6.15-rc4 will contain them.

You may also want to patch your reiserfsprogs with the patch I posted
along with the kernel patch. This just adjusts reiserfsprogs to use some
sane defaults. I've attached it for convenience.

-Jeff

-- 
Jeff Mahoney
SUSE Labs

[-- Attachment #2: [PATCH] reiserfsprogs: changes for better external journal defaults.eml --]
[-- Type: application/x-crossover-eml, Size: 10920 bytes --]


* Re: external journal questions
  2006-02-21 15:04 external journal questions Jure Pečar
  2006-02-21 17:07 ` Jeff Mahoney
@ 2006-02-23 10:21 ` David Masover
  2006-02-24 10:58   ` Jure Pečar
  1 sibling, 1 reply; 5+ messages in thread
From: David Masover @ 2006-02-23 10:21 UTC (permalink / raw)
  To: Jure Pečar; +Cc: reiserfs-list

Jure Pečar wrote:
> Hi all,
> 
> Now that solid state disks are getting affordable (Gigabyte iRam, for
> example), it makes sense to use them as external journal with full data
> journaling, so they cache all the small writes and dump them to disks
> in one single sequential write on every journal flush.

Where can I find the paper on why this makes sense?  Because offhand, it
doesn't, unless you're hoping that the majority of transactions can be
flushed on boot, rather than unrolled.

> I know how to configure that under ext3. Simply set up external journal
> and mount filesystem with data=journal and commit=600 or some such
> value. But I'm not so sure about reiserfs.

v3, I don't know, it's probably closer to the way ext3 works.  Closer,
not exactly, because ext3's on-disk format is ext2 + a journal file, so
it's going to be the easiest to move to another device.

I'm going to assume you aren't talking about v4, since this sounds like
a mission-critical production-style environment.  As I understand it, v4
has a completely different way of doing journaling.



I'm replying to you, not because I actually have an answer for you, but
because your case seems interesting, and I'm curious how Reiser4 handles it.

Currently, my v4 is built like this:  I have 2 gigs of RAM and about a
350 gig Reiser4 partition.  I have a custom patch that replaces the
sys_fsync system call with a stub, because fsync performance is worse in
Reiser4 than in anything else, and because fsync gets horrendously
abused by so many programs I use.  Flushing on my system is a privilege,
not a right, and abusing the fsync call means I've revoked that privilege.

So, basically, application says "Oh my GOD and all that is holy, this
piece of data MUST be written to disk now!"  The OS ignores it, and puts
it in the write buffer with everything else.  At this point, nothing
will touch the disk until I need RAM (even for more disk cache), or I
run a "sync" call myself (not an fsync call), or I unmount the FS (shut
the box down).

This makes performance much better, but it sucks for me when my
not-entirely-stable box (overclocked, has proprietary nVidia graphics
drivers) decides to crash.  Now, I have to say, Reiser4 has proven quite
durable, and I haven't actually had significant corruption.  I have,
however, lost the small amounts of data that never made it to disk.

Now, fsync is basically a kind of transaction, and in the future,
Reiser4 will be doing all kinds of transactions.  Could be an email
server, for instance.  The mail server wants to complete the transaction
of adding a new mail to someone's inbox, or at least to some local
spool, before it tells the other server that it successfully received
the message.  Now, obviously, we don't want every message immediately
forced onto the disk, because that would kill performance.  But if we
don't do that, if we let those transactions stay in RAM, then when the
mailserver loses power, it WILL lose messages.

Lose, as in permanently.  Now, imagine it's something even more
important, like a bank computer.  You can't just "roll back" someone's
mortgage payment, can you?

So, in order for performance not to suck, and to keep fragmentation
down, we want a bunch of transactions to be flushed to disk at once.
But in order to not lose data, we want every important transaction to
immediately be guaranteed not to be lost.

So, you buy some battery-backed RAM or some such, and flush your
transactions there first, and be sure they are successfully written to
the "journal" device before you OK the purchase or accept the email or
whatever, and then, when that device starts to fill up, you flush it out
to the real hard disk.
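
The write-to-the-journal-first scheme described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `JournaledStore` class, its record format, and the `flush_threshold` parameter are invented for illustration, ordinary files stand in for the battery-backed journal device and the main disk, and no crash recovery (journal replay) is implemented.

```python
import os
import tempfile


class JournaledStore:
    """Toy model: durable on the journal first, on the main store later."""

    def __init__(self, journal_path, data_dir, flush_threshold=4):
        self.journal_path = journal_path        # stands in for the fast device
        self.data_dir = data_dir                # stands in for the real disk
        self.flush_threshold = flush_threshold  # hypothetical "journal is full" limit
        self.pending = []                       # accepted, not yet on the main store

    def commit(self, name, payload):
        # Append the record to the journal and force it to stable storage.
        with open(self.journal_path, "ab") as journal:
            journal.write(f"{name} {len(payload)}\n".encode() + payload)
            journal.flush()
            os.fsync(journal.fileno())
        self.pending.append((name, payload))
        if len(self.pending) >= self.flush_threshold:
            self.flush()
        return True  # only at this point is it safe to acknowledge

    def flush(self):
        # Move everything accepted so far to the main store in one batch.
        for name, payload in self.pending:
            with open(os.path.join(self.data_dir, name), "wb") as out:
                out.write(payload)
                out.flush()
                os.fsync(out.fileno())
        self.pending.clear()
        # The data is on the main store now, so the journal can be emptied.
        open(self.journal_path, "wb").close()


with tempfile.TemporaryDirectory() as tmp:
    store = JournaledStore(os.path.join(tmp, "journal"), tmp, flush_threshold=2)
    store.commit("msg1", b"hello")
    journal_size_after_one = os.path.getsize(store.journal_path)
    store.commit("msg2", b"world")  # reaches the threshold and triggers a flush
    journal_size_after_flush = os.path.getsize(store.journal_path)
    flushed = sorted(f for f in os.listdir(tmp) if f != "journal")
    print(journal_size_after_one, journal_size_after_flush, flushed)
```

The point of the sketch is the ordering: the acknowledgement depends only on the journal write, while the slower write to the main store is deferred and batched.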

Problem is, I see nowhere for this to fit in the current model of
Reiser4.  As I understand it, there is no concept of a separate
"journal" device, or of writing a file twice, because the vast majority
of writes are simply written out to disk in the new location, and then
the "commit" is updating the metadata to point to the new location and
free the old.

But, at least some code must already be there, right?  Because I know at
least in theory, some writes happen twice -- things like updates to a
database file.  Tiny changes to huge files would be written once to a
new location, and then back to the old, so the file stays in the same
place on disk, and doesn't get fragmented.

Could that logic be adapted to write first to some journal device, and
then to the original location on disk?  And to use the same
memory-pressure strategy that currently drives the decision to flush
from RAM to disk?  (That is, be lazy about moving stuff from the journal
device to the real medium...)

Could it be done flexibly?  For instance, have a number of "journal
devices", from your RAM all the way to your real disk, and be able to
specify which ones are faster (be lazy about moving from a faster device
to a slower one) and which ones are stable (at what point can we be sure
the data won't be lost to a power failure)?  Because obviously, there
will be degrees of persistent storage, just as there are degrees of
volatile storage, from swap to RAM to cache to register.
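
As a rough illustration of what such a description of tiers could look like, here is a small Python sketch. The `Tier` type, the tier names, and the latency figures are all made up for the example; nothing like this exists in Reiser4 as far as this thread establishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    latency_us: float  # assumed typical write latency, illustrative only
    stable: bool       # does data here survive a power failure?


# Hypothetical hierarchy, fastest first.
TIERS = [
    Tier("ram", 0.1, stable=False),
    Tier("battery-backed ram", 5.0, stable=True),
    Tier("disk", 8000.0, stable=True),
]


def commit_tier(tiers):
    """Return the fastest tier at which a transaction is safe from power loss."""
    return min((t for t in tiers if t.stable), key=lambda t: t.latency_us)


print(commit_tier(TIERS).name)
```

With a table like this, "be lazy about moving from a faster device to a slower one" becomes a policy over the list, and the commit point is simply the first stable entry.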

I would argue that, while an attempt could be made to do this
transparently, it shouldn't be.  For instance, Laptop Mode is a set of
patches to try to not use the hard drive at all, but whenever you have
to spin it up, flush everything you can.  It's sort of lazy writes to
save battery.

I would argue that the filesystem can and should know about lazy writes,
even if you still need Laptop Mode to tell it to flush on reads.  And
the filesystem can and should know about fast, nonvolatile storage.

But, why that's a good idea, I'm not sure of right now, because it's
bedtime.  Ask me tomorrow, or write your own rant.



Well, end of rant.  Someone else gets to have fun coding this, because I
have to do some Real Work.  As in School.



* Re: external journal questions
  2006-02-23 10:21 ` David Masover
@ 2006-02-24 10:58   ` Jure Pečar
  2006-02-24 16:57     ` Jonathan Briggs
  0 siblings, 1 reply; 5+ messages in thread
From: Jure Pečar @ 2006-02-24 10:58 UTC (permalink / raw)
  To: reiserfs-list

On Thu, 23 Feb 2006 04:21:46 -0600
David Masover <ninja@slaphack.com> wrote:

> Where can I find the paper on why this makes sense?  Because offhand,
> it doesn't, unless you're hoping that the majority of transactions
> can be flushed on boot, rather than unrolled.

I can't point you to any specific paper, but imagine running a
large mail server for hundreds of thousands of users. Plenty of
small, random I/O, with almost as many writes as reads. That's where
an SSD for the journal makes sense.
 
> I'm going to assume you aren't talking about v4, since this sounds
> like a mission-critical production-style environment.  As I
> understand it, v4 has a completely different way of doing journaling.
 
Right and right.
 
> I'm replying to you, not because I actually have an answer for you,
> but because your case seems interesting, and I'm curious how Reiser4
> handles it.

Check namesys.com on "wandering logs" :)

> Problem is, I see nowhere for this to fit in the current model of
> Reiser4.  As I understand it, there is no concept of a separate
> "journal" device, or of writing a file twice, because the vast
> majority of writes are simply written out to disk in the new
> location, and then the "commit" is updating the metadata to point to
> the new location and free the old.

I suppose this "wandering logs" concept is going to be much better than
the "journal file/device" concept ext3 uses, but right now it sounds
like it needs some more optimization work.

The cost we all want to avoid here is seek time. Even today, it's
still measured in milliseconds, which is several orders of magnitude
slower than the rate at which gigahertz CPUs tick. Reiser4 is well on
its way to reducing this cost by spending some more CPU cycles, but
because I need a solution "yesterday" (welcome to the real world, as
they say :), I'm looking for a more traditional approach: an SSD
journal.
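
To put rough numbers on that gap, here is a small back-of-the-envelope calculation. The 8 ms seek time and the 2 GHz clock are assumed round figures, not measurements of any particular hardware.

```python
import math

SEEK_TIME_MS = 8   # assumed average seek time of a disk
CPU_MHZ = 2000     # assumed 2 GHz CPU

cycles_per_ms = CPU_MHZ * 1000                   # 1 MHz = 1000 cycles per ms
cycles_per_seek = SEEK_TIME_MS * cycles_per_ms   # CPU cycles spent waiting on one seek
seeks_per_second = 1000 // SEEK_TIME_MS          # upper bound on random I/Os per second
orders_of_magnitude = math.log10(cycles_per_seek)

print(f"cycles per seek:     {cycles_per_seek:,}")
print(f"seeks per second:    {seeks_per_second}")
print(f"orders of magnitude: {orders_of_magnitude:.1f}")
```

Under these assumptions a single seek costs about sixteen million CPU cycles, so trading extra CPU work for fewer seeks is cheap as long as it really does avoid the seeks.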


-- 

Jure Pečar
http://jure.pecar.org



* Re: external journal questions
  2006-02-24 10:58   ` Jure Pečar
@ 2006-02-24 16:57     ` Jonathan Briggs
  0 siblings, 0 replies; 5+ messages in thread
From: Jonathan Briggs @ 2006-02-24 16:57 UTC (permalink / raw)
  To: Reiserfs mail-list

It does sound like Reiser4 will have problems with the small email files
scenario.  Each email file will be written out and immediately sync'd,
so R4 will not have any opportunity to build up its usual delayed
allocation and wandering log.

With an external journal on SSD, data journaling, and Ext3 or Reiser3,
the small email files become very efficient because the journal writes
move in a simple linear pattern across the journal.  fsync() can return
as soon as the data is in the journal, and the file system can then move
data from the journal to the actual disk in its own time.
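
A crude cost model makes the difference visible. The seek time and transfer rate below are assumed round numbers, the function names are invented for the example, and the model deliberately ignores the later write-back from the journal to the main disk, which happens off the fsync() path.

```python
SEEK_MS = 8.0              # assumed average seek plus rotational delay
TRANSFER_MS_PER_KB = 0.02  # assumed, about 50 MB/s sequential throughput


def in_place_cost_ms(messages, size_kb):
    """Each message is synced to its own location: one seek per message."""
    return messages * (SEEK_MS + size_kb * TRANSFER_MS_PER_KB)


def journaled_cost_ms(messages, size_kb):
    """Messages are appended back to back: one seek, then linear writes."""
    return SEEK_MS + messages * size_kb * TRANSFER_MS_PER_KB


in_place = in_place_cost_ms(1000, 4)
journaled = journaled_cost_ms(1000, 4)
print(f"in place:  {in_place:.0f} ms")
print(f"journaled: {journaled:.0f} ms")
```

For a thousand 4 KB messages the model puts the in-place case at roughly eight seconds of synchronous waiting and the journaled case at under a tenth of a second, and a solid state journal would shrink the remaining seek term further.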
-- 
Jonathan Briggs <jbriggs@esoft.com>
eSoft, Inc.
