public inbox for linux-kernel@vger.kernel.org
* Idea about a disc backed ram filesystem
@ 2006-06-08 20:33 ` Sascha Nitsch
  2006-06-08 20:43   ` Lennart Sorensen
                     ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Sascha Nitsch @ 2006-06-08 20:33 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Hi,

this is (as of this writing) just an idea.

=== current state ===
Currently we have ram filesystems (like tmpfs) and disc based file systems
(ext2/3, xfs, <insert your fav. fs>).

tmpfs is extremely fast but loses its data on restarts, crashes
and power outages. Disc access is slow compared to a ram based fs.

=== the idea ===
My idea is to mix them into the following hybrid:
- mount the new fs over an existing dir as an overlay
- all files overlayed are still accessible
- after the first read, the file stays in memory (like a file cache)
- all writes are flushed out to the underlying fs (maybe done async)
- all reads are served from the memory cache once the file has been cached
- the cache stays until the partition is unmounted
- the maximum size of the overlayed filesystem could be physical ram/2 (like tmpfs)
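A toy userspace sketch of the intended behavior (illustration only, not kernel code; `OverlayCache` and its methods are invented names):

```python
import os

class OverlayCache:
    """Toy model of the proposed overlay: reads are cached in RAM after
    the first access, writes update RAM and are immediately written
    through to the underlying directory."""

    def __init__(self, backing_dir):
        self.backing_dir = backing_dir
        self.cache = {}  # filename -> bytes, kept until "unmount"

    def _path(self, name):
        return os.path.join(self.backing_dir, name)

    def read(self, name):
        # First read populates the cache; later reads never touch disc.
        if name not in self.cache:
            with open(self._path(name), "rb") as f:
                self.cache[name] = f.read()
        return self.cache[name]

    def write(self, name, data):
        # Update the cache and flush to the underlying fs (synchronous
        # here; the proposal also considers doing this asynchronously).
        self.cache[name] = data
        with open(self._path(name), "wb") as f:
            f.write(data)

    def unmount(self):
        self.cache.clear()
```

Here writes are flushed synchronously; the async variant would queue the write-through instead of doing it inline.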

=== advantages ===
once the files are read, no more "slow" disc reading is needed => huge read
speed improvements (like on tmpfs)
if the writing is done asynchronously, write speeds would be as fast as on a
tmpfs => huge write speedup
if done synchronously, write speed is almost as fast as a native disc fs
the ram fs would be immune to data loss from reboots or controlled shutdowns
with synchronous writes, the fs would also be immune to crashes/power
outages (with the usual exceptions, as on a disc fs)

=== disadvantage ===
possible higher memory usage (see implementation ideas below)

=== usages ===
a possible usage scenario is any storage where a
smallish set of files gets read/written a lot, like databases.
Definition of "smallish": let's say up to 50% of physical ram size.
Depending on architecture and money spent, this can be a lot :)

=== implementation ideas ===
One note first:
I don't know the fs internals of the kernel (yet), so these ideas might not
work, but you should get the idea.

One idea is to build a complete virtual filesystem that connects to the VFS
layer and hands the writes through to the "original" fs driver.
The caching would be done in that layer. This might cause double caching
(in the io cache) and might waste memory.
But this idea would enable async writes (done when the disc has
less to do) and would give write speed improvements.

The other idea would be to modify the existing filesystem cache algorithm to
have a flag "always keep this file in memory".

The second one may be easier to do and may cause fewer side effects, but
might not enable async writes.
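The flag idea amounts to an eviction policy with pinned entries; a toy sketch (invented names, and the real page cache works on pages rather than whole files):

```python
from collections import OrderedDict

class PinnedLRU:
    """Toy LRU cache with an "always keep in memory" flag per entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (value, pinned)

    def put(self, key, value, pinned=False):
        self.entries[key] = (value, pinned)
        self.entries.move_to_end(key)
        self._evict()

    def get(self, key):
        value, pinned = self.entries[key]
        self.entries.move_to_end(key)  # mark as recently used
        return value

    def _evict(self):
        # Evict least-recently-used *unpinned* entries when over capacity.
        while len(self.entries) > self.capacity:
            for key, (_, pinned) in self.entries.items():
                if not pinned:
                    del self.entries[key]
                    break
            else:
                break  # everything is pinned: nothing we may evict
```

With the database files pinned, a burst of other traffic can no longer push them out of the cache.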

Since this overlay is done in the kernel, no other process could change the
files under the overlay.
Remote FS must be excluded from the cache layer (for obvious reasons).

Any kind of feedback is welcome.

If this has been discussed earlier, sorry for double posting. I haven't found
anything like this in the archives. Just point me in the right direction.

Regards,

Sascha Nitsch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 20:33 ` Idea about a disc backed ram filesystem Sascha Nitsch
@ 2006-06-08 20:43   ` Lennart Sorensen
  2006-06-08 21:12     ` Sash
  2006-06-08 21:51   ` Horst von Brand
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Lennart Sorensen @ 2006-06-08 20:43 UTC (permalink / raw)
  To: Sascha Nitsch; +Cc: Linux Kernel Mailing List

On Thu, Jun 08, 2006 at 10:33:13PM +0200, Sascha Nitsch wrote:
> this is (as of this writing) just an idea.
> 
> === current state ===
> Currently we have ram filesystems (like tmpfs) and disc based file systems
> (ext2/3, xfs, <insert your fav. fs>).
> 
> tmpfs is extremely fast but suffers from data losses from restarts, crashes
> and power outages. Disc access is slow against a ram based fs.
> 
> === the idea ===
> My idea is to mix them to the following hybrid:
> - mount the new fs over an existing dir as an overlay
> - all files overlayed are still accessible
> - after the first read, the file stays in memory (like a file cache)
> - all writes are flushed out to the underlying fs (maybe done async)
> - all reads are always done from the memory cache unless they are not cached
>   yet
> - the cache stays until the partition is unmounted
> - the maximum size of the overlayed filesystem could be physical ram/2 (like tmpfs)
> 
> === advantages ===
> once the files are read, no more "slow" disc reading is needed=> huge read
> speed improvements (like on tmpfs)
> if the writing is done asyncronous, write speeds would be as fast as a
> tmpfs => huge write speedup
> if done syncronous, write speed almost as fast as native disc fs
> the ram fs would be imune against data loss from reboots or controled shutdown
> if syncronous write is done, the fs would be imune to crashes/power
> outages (with the usual exceptions like on disc fs)
> 
> === disadvantage ===
> possible higher memory usage (see implementation ideas below)
> 
> === usages ===
> possible usage scenarios could be any storage where a
> smaller set of files get read/written a lot, like databases
> definition of smaller: lets say up to 50% of physical ram size.
> Depending on architecture and money spent, this can be a lot :)
> 
> === implementation ideas ===
> One note first:
> I don't know the fs internals of the kernel (yet), so these ideas might not
> work, but you should get the idea.
> 
> One idea is to build a complete virtual filesystem that connects to the VFS
> layer and hands the writes through to the "original" fs driver.
> The caching would be done in that layer. This might cause double caching
> (in the io cache) and might waste memory.
> But this idea would enable the possibility of async writes (when the disc has
> less to do) and gives write speed improves.
> 
> The other idea would be to modify the existing filesystem cache algorithm to
> have a flag "always keep this file in memory".
> 
> The second one may be easier to do and may cause less side effects, but
> might not enable async writes.
> 
> Since this overlay is done in the kernel, no other process could change the
> files under the overlay.
> Remote FS must be excluded from the cache layer (for obvious reasons).
> 
> Any kind of feedback is welcome.
> 
> If this has been discussed earlier, sorry for double posting. I haven't found
> anything like this in the archives. Just point me in the right direction.

I am a bit puzzled.  How is your idea different in use from the current
caching system that the kernel already applies to reads of all block
devices, other than essentially locking the cached data into ram rather
than letting it get kicked out when it isn't used?  Writing is similarly
cached unless the application asks for it not to be cached.  It is
flushed out within a certain amount of time, or when there is an idle
period.  I fail to see how having to explicitly mark something as
cached in ram and locked there is an improvement over simply
caching anything that is used a lot from any disk.  Your idea also
appears to break any application that asks for sync, since you take over
control of when things are flushed to disk.
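For reference, the sync contract mentioned above is what applications already rely on via fsync(2); any layer that silently defers the flush would break code like this (plain userspace Python):

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "data")

# Ordinary buffered write: the data lands in the page cache first and the
# kernel flushes it to disc later (periodically, or under memory pressure).
with open(path, "wb") as f:
    f.write(b"buffered")

# An application that needs durability *now* asks for it explicitly:
with open(path, "wb") as f:
    f.write(b"durable")
    f.flush()             # drain userspace buffers
    os.fsync(f.fileno())  # block until the kernel has pushed it to disc
```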

I just don't get it. :)

Len Sorensen

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 20:43   ` Lennart Sorensen
@ 2006-06-08 21:12     ` Sash
  2006-06-09 10:45       ` Jan Engelhardt
  0 siblings, 1 reply; 22+ messages in thread
From: Sash @ 2006-06-08 21:12 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Thursday, 8 June 2006 22:43, you wrote:
> On Thu, Jun 08, 2006 at 10:33:13PM +0200, Sascha Nitsch wrote:
> > ....
> > ....
> 
> I am a bit puzzled.  How is your idea different in use than the current
> caching system that the kernel already applies to reads of all block
> devices, other than essentially locking the cached data into ram, rather
> than letting it get kicked out if it isn't used.  Writing is similarly
> cached unless the application asks for it to not be cached.  It is
> flushed out within a certain amount of time, or when there is an idle
> period.  I fail to see where having to explicitly specify something as
> having to be cached in ram and locked in is an improvement over simply
> caching anything that is used a lot from any disk.  Your idea also
> appears to break any application that asks for sync since you take over
> control of when things are flushed to disk. 
> 
> I just don't get it. :)
> 
> Len Sorensen
> 

True, my idea is indeed similar to the existing cache; that's why it inspired
one of my implementation ideas.
If you have ever had the chance to run a database application on a tmpfs, you
got to "experience" the difference :)

The idea was simply born from wanting the speed of tmpfs but with the safety of
permanent data storage across reboots/crashes, without user level app modification.

The problem with the current cache implementation is that I don't have much
control over what stays cached and what doesn't (which is fine for normal usage).

On a normal server with mixed load, my database's cached pages get evicted and
the memory gets used for other stuff (like mail or webserver caches). If I access
the database files again, they have to be reloaded from disc, which slows things
down. The same applies to other applications as well; this is just an example from
my daily work: a ~1GB database on a 2GB ram box, with a lot of disc io because
of cache misses at a read/write ratio of ~20:1. Putting that DB into RAM is
dangerous because of the data loss risk.

The idea enables me to have a defined set of files/dirs permanently cached,
taking the choice away from the kernel (for a fixed amount of memory and files).
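For comparison, userspace can already *hint* at this today with posix_fadvise(2), though it cannot pin anything; a small POSIX-only sketch:

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "db.dat")
with open(path, "wb") as f:
    f.write(b"x" * 4096)

fd = os.open(path, os.O_RDONLY)
try:
    if hasattr(os, "posix_fadvise"):  # POSIX-only interface
        # Advisory only: the kernel may start readahead and keep these
        # pages cached, but under memory pressure they can still be
        # evicted -- exactly the guarantee the proposal wants and
        # fadvise does not give.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    data = os.read(fd, 4096)
finally:
    os.close(fd)
```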

You are right, the idea in its current form may break applications that ask for
sync. Maybe the implementation can honor that by accessing those files
directly.

If someone has a better idea for getting the desired effect, feel free to post
it here.

One of the reasons I posted the idea here is to have some useful comments
from people with far more kernel/fs knowledge than I have.

I hope I could clear the clouds a bit.

Sascha Nitsch

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 20:33 ` Idea about a disc backed ram filesystem Sascha Nitsch
  2006-06-08 20:43   ` Lennart Sorensen
@ 2006-06-08 21:51   ` Horst von Brand
  2006-06-08 22:39     ` Joshua Hudson
  2006-06-08 22:48   ` Matheus Izvekov
  2006-06-09  6:33   ` Andi Kleen
  3 siblings, 1 reply; 22+ messages in thread
From: Horst von Brand @ 2006-06-08 21:51 UTC (permalink / raw)
  To: Sascha Nitsch; +Cc: Linux Kernel Mailing List

Sascha Nitsch <Sash_lkl@linuxhowtos.org> wrote:
> this is (as of this writing) just an idea.
> 
> === current state ===
> Currently we have ram filesystems (like tmpfs) and disc based file systems
> (ext2/3, xfs, <insert your fav. fs>).

Right.

> tmpfs is extremely fast but suffers from data losses from restarts, crashes
> and power outages.

Part of the design tradeoffs.

>                    Disc access is slow against a ram based fs.

On-disk filesystems (and block device handling) are designed around that
fact.

> === the idea ===
> My idea is to mix them to the following hybrid:
> - mount the new fs over an existing dir as an overlay
> - all files overlayed are still accessible
> - after the first read, the file stays in memory (like a file cache)
> - all writes are flushed out to the underlying fs (maybe done async)
> - all reads are always done from the memory cache unless they are not cached
>   yet
> - the cache stays until the partition is unmounted
> - the maximum size of the overlayed filesystem could be physical ram/2 (like tmpfs)

But the current on-disk filesystems use caching of data in RAM extensively,
/without/ having to keep whole files in memory, just the pieces
currently in active use. Your proposal takes that RAM away from the regular
caches, so it would be much /slower/ than the current on-disk filesystems.

BTW, many of the live-CD distributions do exactly this (RAM overlay over a
CD-based filesystem).
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 21:51   ` Horst von Brand
@ 2006-06-08 22:39     ` Joshua Hudson
  0 siblings, 0 replies; 22+ messages in thread
From: Joshua Hudson @ 2006-06-08 22:39 UTC (permalink / raw)
  To: linux-kernel

This just *screams* block-layer. If anybody feels up to it,
try making a modified loopback device that implements
an independent, fixed-size write-through cache using vmalloc and such.

I have a hunch it won't really improve performance much.
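Such a fixed-size write-through cache could be modelled in a few lines of userspace Python (a toy stand-in for the loopback idea; a bytearray plays the backing device, and the names are invented):

```python
class WriteThroughBlockCache:
    """Toy fixed-size write-through cache in front of a 'block device'
    (a bytearray stands in for the real backing store)."""

    def __init__(self, backing, block_size=512, max_blocks=64):
        self.backing = backing
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = {}  # block number -> cached bytes

    def read_block(self, n):
        if n not in self.blocks:
            if len(self.blocks) >= self.max_blocks:
                # Fixed-size cache: crudely drop the oldest cached block.
                self.blocks.pop(next(iter(self.blocks)))
            off = n * self.block_size
            self.blocks[n] = bytes(self.backing[off:off + self.block_size])
        return self.blocks[n]

    def write_block(self, n, data):
        assert len(data) == self.block_size
        self.blocks[n] = bytes(data)                    # update the cache...
        off = n * self.block_size
        self.backing[off:off + self.block_size] = data  # ...and write through
```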

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 20:33 ` Idea about a disc backed ram filesystem Sascha Nitsch
  2006-06-08 20:43   ` Lennart Sorensen
  2006-06-08 21:51   ` Horst von Brand
@ 2006-06-08 22:48   ` Matheus Izvekov
  2006-06-08 23:40     ` Måns Rullgård
  2006-06-09  2:17     ` Horst von Brand
  2006-06-09  6:33   ` Andi Kleen
  3 siblings, 2 replies; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-08 22:48 UTC (permalink / raw)
  To: Sascha Nitsch; +Cc: Linux Kernel Mailing List

On 6/8/06, Sascha Nitsch <Sash_lkl@linuxhowtos.org> wrote:
> Hi,
>
> this is (as of this writing) just an idea.
>
> === current state ===
> Currently we have ram filesystems (like tmpfs) and disc based file systems
> (ext2/3, xfs, <insert your fav. fs>).
>
> tmpfs is extremely fast but suffers from data losses from restarts, crashes
> and power outages. Disc access is slow against a ram based fs.
>
> === the idea ===
> My idea is to mix them to the following hybrid:
> - mount the new fs over an existing dir as an overlay
> - all files overlayed are still accessible
> - after the first read, the file stays in memory (like a file cache)
> - all writes are flushed out to the underlying fs (maybe done async)
> - all reads are always done from the memory cache unless they are not cached
>   yet
> - the cache stays until the partition is unmounted
> - the maximum size of the overlayed filesystem could be physical ram/2 (like tmpfs)
>
> === advantages ===
> once the files are read, no more "slow" disc reading is needed=> huge read
> speed improvements (like on tmpfs)
> if the writing is done asyncronous, write speeds would be as fast as a
> tmpfs => huge write speedup
> if done syncronous, write speed almost as fast as native disc fs
> the ram fs would be imune against data loss from reboots or controled shutdown
> if syncronous write is done, the fs would be imune to crashes/power
> outages (with the usual exceptions like on disc fs)
>
> === disadvantage ===
> possible higher memory usage (see implementation ideas below)
>
> === usages ===
> possible usage scenarios could be any storage where a
> smaller set of files get read/written a lot, like databases
> definition of smaller: lets say up to 50% of physical ram size.
> Depending on architecture and money spent, this can be a lot :)
>
> === implementation ideas ===
> One note first:
> I don't know the fs internals of the kernel (yet), so these ideas might not
> work, but you should get the idea.
>
> One idea is to build a complete virtual filesystem that connects to the VFS
> layer and hands the writes through to the "original" fs driver.
> The caching would be done in that layer. This might cause double caching
> (in the io cache) and might waste memory.
> But this idea would enable the possibility of async writes (when the disc has
> less to do) and gives write speed improves.
>
> The other idea would be to modify the existing filesystem cache algorithm to
> have a flag "always keep this file in memory".
>
> The second one may be easier to do and may cause less side effects, but
> might not enable async writes.
>
> Since this overlay is done in the kernel, no other process could change the
> files under the overlay.
> Remote FS must be excluded from the cache layer (for obvious reasons).
>
> Any kind of feedback is welcome.
>
> If this has been discussed earlier, sorry for double posting. I haven't found
> anything like this in the archives. Just point me in the right direction.
>
> Regards,
>
> Sascha Nitsch

I had a somewhat similar idea; once I have time to implement it I'll
submit a patch.
My idea consists of adding the capability to specify a device when
mounting tmpfs. If you don't specify any device, tmpfs continues to
behave the way it currently does. But if you do, once the data no longer
fits in ram (or hits some other limit), tmpfs will flush things to this
device. My intention is to reuse the swap code for this, so you mount a
tmpfs passing the dev node of some unused swap device, and it works
just like tmpfs with a dedicated swap partition.
So I hope it would be damn fast because of the simple disk format, and
of course all the data is lost when you umount it.
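A rough userspace model of that spill-to-device behavior (toy Python; the real implementation would reuse the swap code, and these names are invented):

```python
import os

class SpillingStore:
    """Toy model: keep values in RAM up to ram_limit bytes; beyond that,
    push the oldest entries out to a backing file (standing in for the
    dedicated swap device)."""

    def __init__(self, backing_path, ram_limit):
        self.ram = {}          # key -> bytes, newest last
        self.spilled = {}      # key -> (offset, length) in the backing file
        self.backing = open(backing_path, "w+b")
        self.ram_limit = ram_limit

    def _ram_bytes(self):
        return sum(len(v) for v in self.ram.values())

    def put(self, key, data):
        self.ram[key] = data
        # Spill the oldest entries once RAM use exceeds the limit.
        while self._ram_bytes() > self.ram_limit and len(self.ram) > 1:
            victim = next(iter(self.ram))
            value = self.ram.pop(victim)
            self.backing.seek(0, os.SEEK_END)
            self.spilled[victim] = (self.backing.tell(), len(value))
            self.backing.write(value)

    def get(self, key):
        if key in self.ram:
            return self.ram[key]
        offset, length = self.spilled[key]
        self.backing.seek(offset)
        return self.backing.read(length)

    def umount(self):
        # Like tmpfs: everything is gone after umount.
        name = self.backing.name
        self.backing.close()
        os.unlink(name)
        self.ram.clear()
        self.spilled.clear()
```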

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 22:48   ` Matheus Izvekov
@ 2006-06-08 23:40     ` Måns Rullgård
  2006-06-09  1:01       ` Matheus Izvekov
  2006-06-09  2:17     ` Horst von Brand
  1 sibling, 1 reply; 22+ messages in thread
From: Måns Rullgård @ 2006-06-08 23:40 UTC (permalink / raw)
  To: linux-kernel

"Matheus Izvekov" <mizvekov@gmail.com> writes:

> My idea consisted of adding the capability to specify a device for
> tmpfs mounting. if you dont specify any device, tmpfs continues to
> behave the way it currently is. But if you do, once data doesnt fit on
> ram (or some other limit) anymore, it will flush things to this
> device. my intention was to reuse swap code for this, so you mount a
> tmpfs passing the dev node of some unused swap device, and it works
> just like tmpfs with a dedicated swap partition.

I don't see what advantage this would have over normal tmpfs.

-- 
Måns Rullgård
mru@inprovide.com


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 23:40     ` Måns Rullgård
@ 2006-06-09  1:01       ` Matheus Izvekov
  2006-06-09  8:52         ` Måns Rullgård
  0 siblings, 1 reply; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-09  1:01 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

On 6/8/06, Måns Rullgård <mru@inprovide.com> wrote:
> "Matheus Izvekov" <mizvekov@gmail.com> writes:
>
> > My idea consisted of adding the capability to specify a device for
> > tmpfs mounting. if you dont specify any device, tmpfs continues to
> > behave the way it currently is. But if you do, once data doesnt fit on
> > ram (or some other limit) anymore, it will flush things to this
> > device. my intention was to reuse swap code for this, so you mount a
> > tmpfs passing the dev node of some unused swap device, and it works
> > just like tmpfs with a dedicated swap partition.
>
> I don't see what advantage this would have over normal tmpfs.
>
> --

The difference is that the swap device is exclusive for the tmpfs mount.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 22:48   ` Matheus Izvekov
  2006-06-08 23:40     ` Måns Rullgård
@ 2006-06-09  2:17     ` Horst von Brand
  2006-06-09  4:59       ` Matheus Izvekov
  1 sibling, 1 reply; 22+ messages in thread
From: Horst von Brand @ 2006-06-09  2:17 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Sascha Nitsch, Linux Kernel Mailing List

Matheus Izvekov <mizvekov@gmail.com> wrote:

[...]

> I had a somewhat similar idea, once i have time to implement it ill
> submit a patch.
> My idea consisted of adding the capability to specify a device for
> tmpfs mounting. if you dont specify any device, tmpfs continues to
> behave the way it currently is. But if you do, once data doesnt fit on
> ram (or some other limit) anymore, it will flush things to this
> device. my intention was to reuse swap code for this, so you mount a
> tmpfs passing the dev node of some unused swap device, and it works
> just like tmpfs with a dedicated swap partition.

tmpfs does use swap currently. Giving tmpfs a dedicated swap space is dumb,
as it takes away the possibility of using that space for swapping when it is not
in use by tmpfs (and vice versa).
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-09  2:17     ` Horst von Brand
@ 2006-06-09  4:59       ` Matheus Izvekov
  2006-06-09 13:43         ` Horst von Brand
  0 siblings, 1 reply; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-09  4:59 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Sascha Nitsch, Linux Kernel Mailing List

On 6/8/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> tmpfs does use swap currently. Giving tmpfs a dedicated swap space is dumb,
> as it takes away the possibility of using that space for swapping when not
> in use by tmpfs (and viceversa).

The idea is not dumb per se. Maybe you want your applications to swap
to one device (or not swap at all) and a tmpfs mount to swap to
another. For me at least it would make a difference.
I don't use swap at all; I have enough ram for all my processes. And I've
seen that for some workloads, putting a temporary directory on tmpfs
gives huge speed improvements. But just occasionally, the space used
in this temp dir will not fit in my ram, and in that case swapping
would be fine. The problem is that currently there is no way to enforce
this.
Ditto for the fact that, when you have several swap devices with
different performance, there is no way to give priorities/rules
that control who uses which device.
When someone gets to implement those features, this wouldn't be needed
anymore. But that seems far enough away to justify a more immediate
workaround.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 20:33 ` Idea about a disc backed ram filesystem Sascha Nitsch
                     ` (2 preceding siblings ...)
  2006-06-08 22:48   ` Matheus Izvekov
@ 2006-06-09  6:33   ` Andi Kleen
  3 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2006-06-09  6:33 UTC (permalink / raw)
  To: Sascha Nitsch; +Cc: linux-kernel

Sascha Nitsch <Sash_lkl@linuxhowtos.org> writes:
> - all files overlayed are still accessible
> - after the first read, the file stays in memory (like a file cache)

Linux has very aggressive file caching and does this effectively
by default for every file system.

Sounds like you're trying to reinvent the wheel. 

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-09  1:01       ` Matheus Izvekov
@ 2006-06-09  8:52         ` Måns Rullgård
  0 siblings, 0 replies; 22+ messages in thread
From: Måns Rullgård @ 2006-06-09  8:52 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Måns Rullgård, linux-kernel


Matheus Izvekov said:
> On 6/8/06, Måns Rullgård <mru@inprovide.com> wrote:
>> "Matheus Izvekov" <mizvekov@gmail.com> writes:
>>
>> > My idea consisted of adding the capability to specify a device for
>> > tmpfs mounting. if you dont specify any device, tmpfs continues to
>> > behave the way it currently is. But if you do, once data doesnt fit on
>> > ram (or some other limit) anymore, it will flush things to this
>> > device. my intention was to reuse swap code for this, so you mount a
>> > tmpfs passing the dev node of some unused swap device, and it works
>> > just like tmpfs with a dedicated swap partition.
>>
>> I don't see what advantage this would have over normal tmpfs.
>
> The difference is that the swap device is exclusive for the tmpfs mount.

Yes, and what would the advantage of that be?  Sounds to me like you'd only
end up wasting swap space.

-- 
Måns Rullgård
mru@inprovide.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-08 21:12     ` Sash
@ 2006-06-09 10:45       ` Jan Engelhardt
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Engelhardt @ 2006-06-09 10:45 UTC (permalink / raw)
  To: Sash; +Cc: Linux Kernel Mailing List

>> I am a bit puzzled.  How is your idea different in use than the current
>> caching system that the kernel already applies to reads of all block
>>...
>
>The idea was simply born to have a fast tmpfs but with the safety of permanent
>data storage in case of reboots/crashes without user level app modification.
>

When do you want to write to disk?
Any time? That would hurt the "fast" attribute, in which case
you don't need a ramfs.
Not any time? Potential loss of data.
Hm.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-09  4:59       ` Matheus Izvekov
@ 2006-06-09 13:43         ` Horst von Brand
  2006-06-09 15:07           ` Matheus Izvekov
  0 siblings, 1 reply; 22+ messages in thread
From: Horst von Brand @ 2006-06-09 13:43 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

Matheus Izvekov <mizvekov@gmail.com> wrote:
> On 6/8/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> > tmpfs does use swap currently. Giving tmpfs a dedicated swap space is dumb,
> > as it takes away the possibility of using that space for swapping when not
> > in use by tmpfs (and viceversa).

> The idea is not dumb per se. Maybe you want your applications to swap
> to one device (or not swap at all) and a tmpfs mount to swap to
> another.

Why? If one device is faster, you'd want to prefer that one for swapping
/and/ tmpfs. If not, I don't see the point. Except for limiting maximal
sizes of tmpfs or swap, but limiting the latter doesn't make much sense (why
go OOM even though swap /is/ available?), and the former can be set on mount.

>          For me at least it would make a difference.

How?

> I dont use swap at all, have enough ram for all my processes.

What is your beef then?

>                                                               And ive
> seen that for some workloads, setting a temporary directory as tmpfs
> gives huge speed improvements. But just occasionally, the space used
> in this temp dir will not fit in my ram, so in this case swapping
> would be fine. The problem is, currently there is no way to enforce
> this.

That is exactly how tmpfs works, and has worked that way from the
beginning. If it doesn't for you, it is a bug to report.

> Ditto for the fact that, when you have many swap devices set, each
> with different performances, there is no way to give priorities/rules
> to enforce who uses each device.

There are priorities: See swapon(8). It has worked this way from day one
(or for as long as I can remember, in any case). The "who gets to use swap
and who doesn't" you can control partially via pinning processes to RAM or
limiting their memory use.
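The swapon(8) rule is: allocate from the highest-priority device with free space, round-robin among devices of equal priority. A sketch of that selection logic (illustrative only, not the kernel's actual allocator; round-robin is approximated by "least used first"):

```python
def next_swap_device(devices, used):
    """Pick the device for the next swapped-out page, swapon(8)-style.
    devices: {name: (priority, capacity_in_pages)}
    used:    {name: pages_already_used}
    Returns the chosen device name, or None if all swap is full."""
    candidates = [
        (prio, used[name], name)
        for name, (prio, cap) in devices.items()
        if used[name] < cap
    ]
    if not candidates:
        return None  # all swap devices are full
    # Highest priority first; among equals, least used (~round-robin).
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return candidates[0][2]
```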

> When someone gets to implement those features,

Done already, as far as it makes sense.

>                                                this wouldnt be needed
> anymore.

Case closed.

>          But that seems far away enough to justify a more immediate
> workaround.

On some level you /have/ to trust the system to do things right. It has
much more detailed information (and better response time) than you could
ever hope to get. Besides, adding even more knobs to fiddle just makes the
system more complex (and thus bloated/slow) and harder to manage, for a
limited gain in niche situations.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-09 13:43         ` Horst von Brand
@ 2006-06-09 15:07           ` Matheus Izvekov
  2006-06-09 18:43             ` Lee Revell
  2006-06-09 23:37             ` Horst von Brand
  0 siblings, 2 replies; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-09 15:07 UTC (permalink / raw)
  To: Horst von Brand; +Cc: Sascha Nitsch, Linux Kernel Mailing List

On 6/9/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> Matheus Izvekov <mizvekov@gmail.com> wrote:
> > On 6/8/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> > > tmpfs does use swap currently. Giving tmpfs a dedicated swap space is dumb,
> > > as it takes away the possibility of using that space for swapping when not
> > > in use by tmpfs (and viceversa).
>
> > The idea is not dumb per se. Maybe you want your applications to swap
> > to one device (or not swap at all) and a tmpfs mount to swap to
> > another.
>
> Why? If one device is faster, you'd want to prefer that one for swapping
> /and/ tmpfs. If not, I don't see the point. Except for limiting maximal
> sizes of tmpfs or swap, but limiting the later doesn't make much sense (why
> go OOM even though swap /is/ available?), and the former can be set on mount.
>
> >          For me at least it would make a difference.
>
> How?
>

Ok, but the reality is that, even if I set up a swap partition with the
laziest swappiness, it will swap my processes out. Is there a
practical way to pin all processes to ram, or otherwise tell the vm to
never swap any process? If there is, then you are right, there is no
point in doing this.

> > I dont use swap at all, have enough ram for all my processes.
>
> What is your beef then?
>

I just wanted to have no swap for my processes, but swap for
my tmpfs mount, as I explained. For my usage, there is no point in
having swap for processes. If something gets to use that much ram,
something has gone wrong, and it should die right away instead of making my
system unusable for several minutes until swap is full too, at which point
it dies anyway.

> >                                                               And ive
> > seen that for some workloads, setting a temporary directory as tmpfs
> > gives huge speed improvements. But just occasionally, the space used
> > in this temp dir will not fit in my ram, so in this case swapping
> > would be fine. The problem is, currently there is no way to enforce
> > this.
>
> That is exactly how tmpfs works, and has worked that way from the
> beginning. If it doesn't for you, it is a bug to report.
>

I know it works like this; my point was the separation.

> > Ditto for the fact that, when you have many swap devices set, each
> > with different performances, there is no way to give priorities/rules
> > to enforce who uses each device.
>
> There are priorities: See swapon(8). It has worked this way from day one
> (or for as long as I can remember, in any case). The "who gets to use swap
> and who doesn't" you can control partially via pinning processes to RAM or
> limiting their memory use.
>
> > When someone gets to implement those features,
>
> Done already, as far as it makes sense.

Good to know, except that there is no way in the universe the
algorithm can be smart enough to be optimal for all usage cases, so
some hand fiddling can be desired.

>
> >                                                this wouldn't be needed
> > anymore.
>
> Case closed.
>
> >          But that seems far away enough to justify a more immediate
> > workaround.
>
> On some level you /have/ to trust the system to do things right. It has
> much more detailed information (and better response time) than you could
> ever hope to get. Besides, adding even more knobs to fiddle just makes the
> system more complex (and thus bloated/slow) and harder to manage, for a
> limited gain in niche situations.

If it adds so much overhead, it can always be a compile option. The
system has many knobs already; it's always a compromise. I'm not giving
any proof that what I described is a good compromise, but at least it
can be.

> --
> Dr. Horst H. von Brand                   User #22616 counter.li.org
> Departamento de Informatica                     Fono: +56 32 654431
> Universidad Tecnica Federico Santa Maria              +56 32 654239
> Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Idea about a disc backed ram filesystem
  2006-06-09 15:07           ` Matheus Izvekov
@ 2006-06-09 18:43             ` Lee Revell
  2006-06-09 19:27               ` Matheus Izvekov
  2006-06-09 23:37             ` Horst von Brand
  1 sibling, 1 reply; 22+ messages in thread
From: Lee Revell @ 2006-06-09 18:43 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

On Fri, 2006-06-09 at 12:07 -0300, Matheus Izvekov wrote:
> Ok, but the reality is that, even if I set up a swap partition with the
> laziest swappiness, it will swap my processes out. Is there a
> practical way to pin all processes to RAM or otherwise tell the VM to
> never swap any process? If there is, then you are right, there is no
> point in doing this.
> 

echo 0 > /proc/sys/vm/swappiness
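And to make it stick across reboots (assuming the usual sysctl.conf
handling on your distro):

```shell
# Applied at boot by the sysctl init script on most distros:
echo 'vm.swappiness = 0' >> /etc/sysctl.conf
```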

Lee



* Re: Idea about a disc backed ram filesystem
  2006-06-09 18:43             ` Lee Revell
@ 2006-06-09 19:27               ` Matheus Izvekov
  2006-06-09 19:31                 ` Lee Revell
  0 siblings, 1 reply; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-09 19:27 UTC (permalink / raw)
  To: Lee Revell; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

On 6/9/06, Lee Revell <rlrevell@joe-job.com> wrote:
> On Fri, 2006-06-09 at 12:07 -0300, Matheus Izvekov wrote:
> > Ok, but the reality is that, even if I set up a swap partition with the
> > laziest swappiness, it will swap my processes out. Is there a
> > practical way to pin all processes to RAM or otherwise tell the VM to
> > never swap any process? If there is, then you are right, there is no
> > point in doing this.
> >
>
> echo 0 > /proc/sys/vm/swappiness
>
> Lee

Sorry, I took a look at the code which handles this, and swappiness = 0
doesn't seem to imply that process memory will never be swapped out.


* Re: Idea about a disc backed ram filesystem
  2006-06-09 19:27               ` Matheus Izvekov
@ 2006-06-09 19:31                 ` Lee Revell
  2006-06-09 19:43                   ` Matheus Izvekov
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Revell @ 2006-06-09 19:31 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

On Fri, 2006-06-09 at 16:27 -0300, Matheus Izvekov wrote:
> On 6/9/06, Lee Revell <rlrevell@joe-job.com> wrote:
> > On Fri, 2006-06-09 at 12:07 -0300, Matheus Izvekov wrote:
> > > Ok, but the reality is that, even if I set up a swap partition with the
> > > laziest swappiness, it will swap my processes out. Is there a
> > > practical way to pin all processes to RAM or otherwise tell the VM to
> > > never swap any process? If there is, then you are right, there is no
> > > point in doing this.
> > >
> >
> > echo 0 > /proc/sys/vm/swappiness
> >
> > Lee
> 
> Sorry, I took a look at the code which handles this, and swappiness = 0
> doesn't seem to imply that process memory will never be swapped out.
> 

OK, then use mlockall().

Lee



* Re: Idea about a disc backed ram filesystem
  2006-06-09 19:31                 ` Lee Revell
@ 2006-06-09 19:43                   ` Matheus Izvekov
  2006-06-09 20:03                     ` Lee Revell
  0 siblings, 1 reply; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-09 19:43 UTC (permalink / raw)
  To: Lee Revell; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

On 6/9/06, Lee Revell <rlrevell@joe-job.com> wrote:
> On Fri, 2006-06-09 at 16:27 -0300, Matheus Izvekov wrote:
> > Sorry, I took a look at the code which handles this, and swappiness = 0
> > doesn't seem to imply that process memory will never be swapped out.
> >
>
> OK, then use mlockall().
>
> Lee
>
>

If I make init call mlockall(), would all child processes be mlocked too?
If not, using this to enforce a system-wide policy seems a bit hacky
and non-trivial.


* Re: Idea about a disc backed ram filesystem
  2006-06-09 19:43                   ` Matheus Izvekov
@ 2006-06-09 20:03                     ` Lee Revell
  2006-06-09 21:23                       ` Matheus Izvekov
  0 siblings, 1 reply; 22+ messages in thread
From: Lee Revell @ 2006-06-09 20:03 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

On Fri, 2006-06-09 at 16:43 -0300, Matheus Izvekov wrote:
> On 6/9/06, Lee Revell <rlrevell@joe-job.com> wrote:
> > On Fri, 2006-06-09 at 16:27 -0300, Matheus Izvekov wrote:
> > > Sorry, I took a look at the code which handles this, and swappiness = 0
> > > doesn't seem to imply that process memory will never be swapped out.
> > >
> >
> > OK, then use mlockall().
> >
> > Lee
> >
> >
> 
> If I make init call mlockall(), would all child processes be mlocked too?

No.

> If not, using this to enforce a system-wide policy seems a bit hacky
> and non-trivial.
> 

Well, what you are trying to do seems hacky.  What real-world problem
are you trying to solve that setting swappiness to 0 is not sufficient
for?

Lee





* Re: Idea about a disc backed ram filesystem
  2006-06-09 20:03                     ` Lee Revell
@ 2006-06-09 21:23                       ` Matheus Izvekov
  0 siblings, 0 replies; 22+ messages in thread
From: Matheus Izvekov @ 2006-06-09 21:23 UTC (permalink / raw)
  To: Lee Revell; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

On 6/9/06, Lee Revell <rlrevell@joe-job.com> wrote:
> Well, what you are trying to do seems hacky.  What real world problem
> are you trying to solve that setting swappiness to 0 is not sufficient
> for?
>
> Lee

For my usage, having processes swap is a complete loss. I have enough
RAM, and if some process doesn't fit into RAM I would rather have it
killed than have it swap. Swap activity hogs my system, and probably
either the process would fill up the swap and die anyway, or it would
be too slow to be usable and I would kill it.

But I have some processes which gain a considerable performance benefit
if they do their temporary work on tmpfs. The problem is that, just
sometimes, their temporary data doesn't fit into RAM. In that case
swapping would be just fine. The simple data format on disk is a gain
when the stuff you are working on doesn't need to survive
unmounting/power loss.
Now I've considered two alternatives:

1) Create a new filesystem, a very simple one which only stores the
data on disk; all the other metadata (superblock etc.) is kept in
kernel memory, and the page cache does its usual work of keeping fresh
data in RAM.
2) Modify tmpfs to accept a device: when things don't fit in RAM,
they would be flushed to this device first, while there is space
available, and then ultimately revert to swap when it's present.

It seems to me that both approaches would converge to the same thing,
but 2) is better because there would be no duplicated functionality,
and it would get to keep the cool name.
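Until someone implements that, swap priorities give at least part of the
device separation today (the device names below are made up; needs root):

```shell
# Higher priority wins: pages go to the fast device until it is
# full, and only then to the slow one.
swapon -p 10 /dev/fast_swap    # hypothetical fast disc/partition
swapon -p 1  /dev/slow_swap    # hypothetical slow disc/partition
cat /proc/swaps                # lists each swap device with its priority
```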


* Re: Idea about a disc backed ram filesystem
  2006-06-09 15:07           ` Matheus Izvekov
  2006-06-09 18:43             ` Lee Revell
@ 2006-06-09 23:37             ` Horst von Brand
  1 sibling, 0 replies; 22+ messages in thread
From: Horst von Brand @ 2006-06-09 23:37 UTC (permalink / raw)
  To: Matheus Izvekov; +Cc: Horst von Brand, Sascha Nitsch, Linux Kernel Mailing List

Matheus Izvekov <mizvekov@gmail.com> wrote:
> On 6/9/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> > Matheus Izvekov <mizvekov@gmail.com> wrote:
> > > On 6/8/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:

> > > > tmpfs does use swap currently. Giving tmpfs a dedicated swap space
> > > > is dumb, as it takes away the possibility of using that space for
> > > > swapping when not in use by tmpfs (and viceversa).

> > > The idea is not dumb per se. Maybe you want your applications to swap
> > > to one device (or not swap at all) and a tmpfs mount to swap to
> > > another.

> > Why? If one device is faster, you'd want to prefer that one for
> > swapping /and/ tmpfs. If not, I don't see the point. Except for
> > limiting maximal sizes of tmpfs or swap, but limiting the latter doesn't
> > make much sense (why go OOM even though swap /is/ available?), and the
> > former can be set on mount.

> > >          For me at least it would make a difference.

> > How?

> Ok, but the reality is that, even if I set up a swap partition with the
> laziest swappiness, it will swap my processes out.

When you run out of RAM, or the RAM can be put to better use than keeping
stale process data around (you do realize that program code is paged
directly from the executable, don't you?).

>                                                     Is there a
> practical way to pin all processes to RAM or otherwise tell the VM to
> never swap any process? If there is, then you are right, there is no
> point in doing this.

There is: Max out the RAM on your machine. Don't ever run large processes.
Don't ever read large files.

> > > I don't use swap at all; I have enough RAM for all my processes.

> > What is your beef then?

> I just wanted to have no swap for my processes, but I wanted swap for
> my tmpfs mount, as I explained.

Use a regular filesystem for /tmp, Linux is pretty good at caching file
data. If it isn't enough, complain /with data/ and /details/ of how it
isn't enough...

>                                 For my usage, there is no point in
> having swap for processes. If something gets to use that much RAM,
> something's gone wrong, and it should die anyway instead of making my
> system unusable for several minutes until swap is full too, at which
> point it dies anyway.

OK. But in any case, this is rare? So it would not make that much of a
difference...

> > >                                                               And I've
> > > seen that for some workloads, setting a temporary directory as tmpfs
> > > gives huge speed improvements. But just occasionally, the space used
> > > in this temp dir will not fit in my RAM, so in this case swapping
> > > would be fine. The problem is, currently there is no way to enforce
> > > this.

> > That is exactly how tmpfs works, and has worked that way from the
> > beginning. If it doesn't for you, it is a bug to report.

> I know it works like this; my point was the separation.

Again, I fail to see the point.

> > > Ditto for the fact that, when you have many swap devices set, each
> > > with different performances, there is no way to give priorities/rules
> > > to enforce who uses each device.
> >
> > There are priorities: See swapon(8). It has worked this way from day one
> > (or for as long as I can remember, in any case). The "who gets to use swap
> > and who doesn't" you can control partially via pinning processes to RAM or
> > limiting their memory use.
> >
> > > When someone gets to implement those features,
> >
> > Done already, as far as it makes sense.

> Good to know, except that there is no way in the universe the
> algorithm can be smart enough to be optimal for all usage cases,

Right.

>                                                                 so
> some hand fiddling can be desired.

Desired, yes; but not all desires can (or deserve to) be granted...

> > >                                                this wouldn't be needed
> > > anymore.

> > Case closed.

> > >          But that seems far away enough to justify a more immediate
> > > workaround.

> > On some level you /have/ to trust the system to do things right. It has
> > much more detailed information (and better response time) than you could
> > ever hope to get. Besides, adding even more knobs to fiddle just makes the
> > system more complex (and thus bloated/slow) and harder to manage, for a
> > limited gain in niche situations.

> If it adds so much overhead, it can always be a compile option.

The overhead is not "just" runtime overhead; it is also developer time
consumption, and more complex testing (need to check that it works both
with and without), ...

>                                                                 The
> system has many knobs already; it's always a compromise. I'm not giving
> any proof that what I described is a good compromise, but at least it
> can be.

I've given no proof that "just another knob" is a bad idea either, but the
road to massive suckage is paved with "just another little feature"...
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


end of thread, other threads:[~2006-06-09 23:38 UTC | newest]

Thread overview: 22+ messages
     [not found] <Sash_lkl@linuxhowtos.org>
2006-06-08 20:33 ` Idea about a disc backed ram filesystem Sascha Nitsch
2006-06-08 20:43   ` Lennart Sorensen
2006-06-08 21:12     ` Sash
2006-06-09 10:45       ` Jan Engelhardt
2006-06-08 21:51   ` Horst von Brand
2006-06-08 22:39     ` Joshua Hudson
2006-06-08 22:48   ` Matheus Izvekov
2006-06-08 23:40     ` Måns Rullgård
2006-06-09  1:01       ` Matheus Izvekov
2006-06-09  8:52         ` Måns Rullgård
2006-06-09  2:17     ` Horst von Brand
2006-06-09  4:59       ` Matheus Izvekov
2006-06-09 13:43         ` Horst von Brand
2006-06-09 15:07           ` Matheus Izvekov
2006-06-09 18:43             ` Lee Revell
2006-06-09 19:27               ` Matheus Izvekov
2006-06-09 19:31                 ` Lee Revell
2006-06-09 19:43                   ` Matheus Izvekov
2006-06-09 20:03                     ` Lee Revell
2006-06-09 21:23                       ` Matheus Izvekov
2006-06-09 23:37             ` Horst von Brand
2006-06-09  6:33   ` Andi Kleen
