linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC] nilfs2: continuous snapshotting file system
@ 2008-08-20  2:45 Ryusuke Konishi
  2008-08-20  7:43 ` Andrew Morton
  2008-08-20  9:47 ` Andi Kleen
  0 siblings, 2 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-20  2:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

This is a kernel patch for the NILFS2 file system, which was previously
announced in [1].  Since the original code did not comply with the Linux
kernel coding style, I have rewritten much of it.

NILFS2 is a log-structured file system (LFS) supporting ``continuous
snapshotting''.  In addition to versioning the entire file system, users
can restore files and namespaces that were mistakenly overwritten or
destroyed just a few seconds ago.

NILFS2 creates a checkpoint every few seconds or per synchronous write
(unless there is no change).  Users can select significant versions among
the continuously created checkpoints and change them into snapshots,
which are preserved until they are changed back into checkpoints.

There is no limit on the number of snapshots until the volume gets
full.  Each snapshot is mountable as a read-only file system
concurrently with the writable mount, which is convenient for online
backup.  It should also suit time-machine-like user environments and
appliances.

Please see [2] for details on the project.

Other features are:

- Quick crash recovery on mount (like a conventional LFS)
- B-tree based management of files, inodes, and other metadata, including
  snapshots.
- 64-bit data structures; support for many files, large files, and large
  disks.
- Online disk space reclamation by a userland daemon, which can maintain
  multiple snapshots.
- Reduced use of write barriers without sacrificing reliability; barriers
  are enabled by default.
- Easy and fast snapshot administration

Some impressive benchmark results on SSDs are shown in [3]; however, the
current NILFS2 performance is sensitive to the machine environment due to
its immature implementation.

It has many TODO items:

- performance improvement (better block I/O submission)
- better integration of b-tree node cache with filemap and buffer code.
- cleanups, further simplification.
- atime support
- extended attributes support
- POSIX ACL support
- Quota support

The patch against 2.6.27-rc3 (hopefully applicable to the next -mm
tree) is available at:

http://www.nilfs.org/pub/patch/nilfs2-continuous-snapshotting-file-system.patch

It is not yet divided into pieces (sorry).  Unlike the original code
available at [4], this patch removes many code lines that supported past
kernel versions, as well as peculiar debug code.

The userland tools are included in the nilfs-utils package, which is
available from [4].  Details on the tools are described in the man
pages included in the package.

Here are some examples:

- To use nilfs2 as a local file system, simply:

 # mkfs -t nilfs2 /dev/block_device
 # mount -t nilfs2 /dev/block_device /dir

  This will also invoke the cleaner through the mount helper program
  (mount.nilfs2).

- Checkpoints and snapshots are managed by the following commands.
  Their manpages are included in the nilfs-utils package above.

   lscp     list checkpoints or snapshots.
   mkcp     make a checkpoint or a snapshot.
   chcp     change an existing checkpoint to a snapshot or vice versa.
   rmcp     invalidate specified checkpoint(s).

  For example,

 # chcp ss 2

  changes checkpoint No. 2 into a snapshot.

- To mount a snapshot,

 # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir

  where <cno> is the checkpoint number of the snapshot.

- A more illustrative example can be found in [5].

Thank you,
Ryusuke Konishi, NILFS Team, NTT.

1. NILFS version2 now available
http://marc.info/?l=linux-fsdevel&m=118187597808509&w=2

2. NILFS homepage
http://www.nilfs.org/en/index.html

3. Dongjun Shin, About SSD
http://www.usenix.org/event/lsf08/tech/shin_SSD.pdf

4. Source archive
http://www.nilfs.org/en/download.html

5. Using NILFS
http://www.nilfs.org/en/about_nilfs.html


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20  2:45 [PATCH RFC] nilfs2: continuous snapshotting file system Ryusuke Konishi
@ 2008-08-20  7:43 ` Andrew Morton
  2008-08-20  8:22   ` Pekka Enberg
  2008-08-20 16:13   ` Ryusuke Konishi
  2008-08-20  9:47 ` Andi Kleen
  1 sibling, 2 replies; 55+ messages in thread
From: Andrew Morton @ 2008-08-20  7:43 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-fsdevel, linux-kernel

On Wed, 20 Aug 2008 11:45:05 +0900 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> wrote:

> This is a kernel patch of NILFS2 file system which was previously
> announced in [1].  Since the original code did not comply with the
> Linux Coding Style, I've rewritten it a lot.

I expected this email two years ago :)

> NILFS2 is a log-structured file system (LFS) supporting ``continuous
> snapshotting''.  In addition to versioning capability of the entire
> file system, users can even restore files and namespaces mistakenly
> overwritten or destroyed just a few seconds ago.
> 
> NILFS2 creates a number of checkpoints every few seconds or per
> synchronous write basis (unless there is no change).  Users can select
> significant versions among continuously created checkpoints, and can
> change them into snapshots which will be preserved until they are
> changed back to checkpoints.

What approach does it take to garbage collection?

> There is no limit on the number of snapshots until the volume gets
> full.  Each snapshot is mountable as a read-only file system
> concurrently with its writable mount, and this feature is convenient
> for online backup.  It will be also favorable for time-machine like
> user environment or appliances.
> 
> Please see [2] for details on the project.
> 
> Other features are:
> 
> - Quick crash recovery on-mount (like conventional LFS)
> - B-tree based file, inode, and other meta data management including
>   snapshots.
> - 64-bit data structures; support many files, large files and disks.
> - Online disk space reclamation by userland daemon, which can maintain
>   multiple snapshots.
> - Less use of barrier with keeping reliability. The barrier is enabled
>   by default.
> - Easy and quickly performable snapshot administration
> 
> Some impressive benchmark results on SSD are shown in [3],

heh.  It wipes the floor with everything, including btrfs.

But a log-based fs will do that, initially.  What will the performance
look like after a month or two's usage?

> however the
> current NILFS2 performance is sensitive to machine environment due to
> its immature implementation.
> 
> It has many TODO items:
> 
> - performance improvement (better block I/O submission)
> - better integration of b-tree node cache with filemap and buffer code.
> - cleanups, further simplification.
> - atime support
> - extendend attributes support
> - POSIX ACL support
> - Quota support
> 
> The patch against 2.6.27-rc3 (hopefully applicable to the next -mm
> tree) is available at:
> 
> http://www.nilfs.org/pub/patch/nilfs2-continuous-snapshotting-file-system.patch

Needs a few fixes for recent linux-next changes.

I queued it up without looking at it, just for a bit of review and
compile-coverage testing.

> It is not yet divided into pieces (sorry).  Unlike original code
> available at [4], many code lines to support past kernel versions and
> peculiar debug code are removed in this patch.

Yes, please do that splitup and let's get down to reviewing it.

> The userland tools are included in nilfs-utils package, which is
> available from [4].  Details on the tools are described in the man
> pages included in the package.
> 
> Here is an example:
> 
> - To use nilfs2 as a local file system, simply:
> 
>  # mkfs -t nilfs2 /dev/block_device
>  # mount -t nilfs2 /dev/block_device /dir
> 
>   This will also invoke the cleaner through the mount helper program
>   (mount.nilfs2).
> 
> - Checkpoints and snapshots are managed by the following commands.
>   Their manpages are included in the nilfs-utils package above.
> 
>    lscp     list checkpoints or snapshots.
>    mkcp     make a checkpoint or a snapshot.
>    chcp     change an existing checkpoint to a snapshot or vice versa.
>    rmcp     invalidate specified checkpoint(s).
> 
>   For example,
> 
>  # chcp ss 2
> 
>   changes the checkpoint No. 2 into snapshot.
> 
> - To mount a snapshot,
> 
>  # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir
> 
>   where <cno> is the checkpoint number of the snapshot.
> 
> - More illustrative example is found in [5].
> 
> Thank you,
> Ryusuke Konishi, NILFS Team, NTT.
> 
> 1. NILFS version2 now available
> http://marc.info/?l=linux-fsdevel&m=118187597808509&w=2
> 
> 2. NILFS homepage
> http://www.nilfs.org/en/index.html
> 
> 3. Dongjun Shin, About SSD
> http://www.usenix.org/event/lsf08/tech/shin_SSD.pdf

Interesting document, that.

> 4. Source archive
> http://www.nilfs.org/en/download.html
> 
> 5. Using NILFS
> http://www.nilfs.org/en/about_nilfs.html



* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20  7:43 ` Andrew Morton
@ 2008-08-20  8:22   ` Pekka Enberg
  2008-08-20 18:47     ` Ryusuke Konishi
  2008-08-20 16:13   ` Ryusuke Konishi
  1 sibling, 1 reply; 55+ messages in thread
From: Pekka Enberg @ 2008-08-20  8:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ryusuke Konishi, linux-fsdevel, linux-kernel

On Wed, Aug 20, 2008 at 10:43 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>> It is not yet divided into pieces (sorry).  Unlike original code
>> available at [4], many code lines to support past kernel versions and
>> peculiar debug code are removed in this patch.
>
> Yes, please do that splitup and let's get down to reviewing it.

Hmm, this looks a bit scary:

> +/*
> + * Low-level nilfs pages, page functions
> + * Reviews should be made to adapt these to the common pagemap and buffer code.
> + */
> +static struct nilfs_pages {
> +       spinlock_t              lru_lock;
> +       struct list_head        active;
> +       struct list_head        inactive;
> +       unsigned long           nr_active;
> +       unsigned long           nr_inactive;
> +       struct rw_semaphore     shrink_sem;
> +} nilfs_pages;
> +
> +/*
> + * XXX per-cpu pagevecs may be able to reduce the overhead of list handlings
> + *
> + * static DEFINE_PER_CPU(struct pagevec, nilfs_lru_active) = { 0, };
> + * static DEFINE_PER_CPU(struct pagevec, nilfs_lru_inactive) = { 0, };
> + */
> +
> +void nilfs_pages_init(void)
> +{
> +       INIT_LIST_HEAD(&nilfs_pages.active);
> +       INIT_LIST_HEAD(&nilfs_pages.inactive);
> +       spin_lock_init(&nilfs_pages.lru_lock);
> +       init_rwsem(&nilfs_pages.shrink_sem);
> +       nilfs_pages.nr_active = 0;
> +       nilfs_pages.nr_inactive = 0;
> +}

(a) why does NILFS need this and (b) why aren't these patches against
generic mm/*.c?

                                    Pekka


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20  2:45 [PATCH RFC] nilfs2: continuous snapshotting file system Ryusuke Konishi
  2008-08-20  7:43 ` Andrew Morton
@ 2008-08-20  9:47 ` Andi Kleen
  2008-08-21  4:57   ` Ryusuke Konishi
  1 sibling, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2008-08-20  9:47 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-fsdevel, linux-kernel

Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> writes:

> It has many TODO items:

How stable is the on-disk format? If the file system makes mainline,
your user base would likely increase significantly. Users then tend
to have a reasonable expectation that they can still mount old file
systems later on newer kernels (although not necessarily the other way
round).

-Andi


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20  7:43 ` Andrew Morton
  2008-08-20  8:22   ` Pekka Enberg
@ 2008-08-20 16:13   ` Ryusuke Konishi
  2008-08-20 21:25     ` Szabolcs Szakacsits
  2008-08-26 10:16     ` Jörn Engel
  1 sibling, 2 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-20 16:13 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, linux-kernel

>On Wed, 20 Aug 2008 11:45:05 +0900 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> wrote:
>
>> This is a kernel patch of NILFS2 file system which was previously
>> announced in [1].  Since the original code did not comply with the
>> Linux Coding Style, I've rewritten it a lot.
>
>I expected this email two years ago :)

I'm sorry for that. Thank you for remembering this file system :)

>What approach does it take to garbage collection?

Lifetime information is maintained for each (virtualized) disk block
address to judge whether a given disk block can be eliminated.

The garbage collector (GC) of NILFS2 works as follows:

1. GC never removes snapshots, i.e. the checkpoints marked as snapshots.
   Plain checkpoints are not protected from GC except for recent ones.

2. Disk blocks that belong to neither a snapshot nor a recent checkpoint
   are eliminable.  For a given disk block, GC checks the state of every
   checkpoint whose serial number falls within the block's lifetime, and
   judges the block not eliminable if at least one snapshot or recent
   checkpoint is found there.

3. GC reclaims disk space in units of segments (a segment is an equally
   sized division of the disk).

   For a selected segment, removable blocks are simply ignored, and
   unremovable blocks (live blocks) are copied into a new log appended to
   the segment currently being written.

   Once all the live blocks have been copied into the new log, the segment
   becomes free and reusable.

4. To make disk blocks relocatable, NILFS2 maintains a table file (called
   the DAT) which maps virtual disk block addresses to real block
   addresses.  The lifetime information is recorded in the DAT per virtual
   block address.

The current NILFS2 GC simply reclaims from the oldest segment, so the disk
partition acts like a ring buffer (this behaviour can be changed by
replacing the userland daemon).
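
For illustration, here is a minimal userspace-style C sketch of the
eliminability check described above.  It is not NILFS2 code; the structure
and the helpers (cno_is_snapshot(), protect_cno) are hypothetical stand-ins
for the DAT and checkpoint-file lookups mentioned in items 2 and 4.

#include <stdbool.h>

typedef unsigned long long cno_t;            /* checkpoint serial number */

struct dat_lifetime {
	cno_t start;  /* checkpoint in which the block became live */
	cno_t end;    /* checkpoint in which it was overwritten/deleted,
	                 or 0 if it is still live */
};

bool cno_is_snapshot(cno_t cno);             /* hypothetical cpfile lookup */

/*
 * Sketch only: a block is eliminable only if no snapshot and no recent
 * checkpoint (cno >= protect_cno) falls within its lifetime [start, end).
 * A real implementation would consult the sorted snapshot list rather
 * than probing every checkpoint number in the range.
 */
bool block_is_eliminable(const struct dat_lifetime *lt, cno_t protect_cno)
{
	cno_t cno;

	if (lt->end == 0 || lt->end > protect_cno)
		return false;   /* still live in a protected recent checkpoint */

	for (cno = lt->start; cno < lt->end; cno++)
		if (cno_is_snapshot(cno))
			return false;   /* a snapshot still references it */

	return true;
}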

>> Some impressive benchmark results on SSD are shown in [3],
>
>heh.  It wipes the floor with everything, including btrfs.
>
>But a log-based fs will do that, initially.  What will the performace
>look like after a month or two's usage?

I've been using NILFS2 for my home directory for several months, and so
far I haven't noticed any notable performance degradation.  Later, I'd
like to try a benchmark on a server.
Anyhow, I still have many things to do for performance.

>> however the
>> current NILFS2 performance is sensitive to machine environment due to
>> its immature implementation.
>> .. 
>> It is not yet divided into pieces (sorry).  Unlike original code
>> available at [4], many code lines to support past kernel versions and
>> peculiar debug code are removed in this patch.
>
>Yes, please do that splitup and let's get down to reviewing it.

Sure, I will.

With regards,
Ryusuke Konishi


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20  8:22   ` Pekka Enberg
@ 2008-08-20 18:47     ` Ryusuke Konishi
  0 siblings, 0 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-20 18:47 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Wed, Aug 20, 2008 at 11:22 AM, Pekka Enberg wrote:
>Hmm, this looks bit scary:
>
>> +/*
>> + * Low-level nilfs pages, page functions
>> + * Reviews should be made to adapt these to the common pagemap and buffer code.
>> + */
>> +static struct nilfs_pages {
>> +       spinlock_t              lru_lock;
>> +       struct list_head        active;
>> +       struct list_head        inactive;
>> +       unsigned long           nr_active;
>> +       unsigned long           nr_inactive;
>> +       struct rw_semaphore     shrink_sem;
>> +} nilfs_pages;
>> +
>> +/*
>> + * XXX per-cpu pagevecs may be able to reduce the overhead of list handlings
>> + *
>> + * static DEFINE_PER_CPU(struct pagevec, nilfs_lru_active) = { 0, };
>> + * static DEFINE_PER_CPU(struct pagevec, nilfs_lru_inactive) = { 0, };
>> + */
>> +
>> +void nilfs_pages_init(void)
>> +{
>> +       INIT_LIST_HEAD(&nilfs_pages.active);
>> +       INIT_LIST_HEAD(&nilfs_pages.inactive);
>> +       spin_lock_init(&nilfs_pages.lru_lock);
>> +       init_rwsem(&nilfs_pages.shrink_sem);
>> +       nilfs_pages.nr_active = 0;
>> +       nilfs_pages.nr_inactive = 0;
>> +}

Yeah, it's a bothersome part.
I'd like to eliminate this peculiar code by using the standard mm/
functions or bd_inode, but that work is still pending.

It's mainly used to maintain the pages held by struct nilfs_btnode_cache,
which is a per-inode additional page cache used to store buffers of the
B-tree.

Incidentally, for data blocks, the mm/ page cache is used as in other
file systems.
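
For illustration, a minimal sketch of what backing such a per-inode b-tree
node cache with an ordinary address_space might look like, using standard
helpers such as find_or_create_page(); the function itself is hypothetical
and assumes one buffer per page for simplicity:

#include <linux/pagemap.h>
#include <linux/buffer_head.h>

/* Sketch only: look up or create the page holding a b-tree node block
 * in a private per-inode mapping, and return its (single) buffer. */
static struct buffer_head *
btnode_get(struct address_space *mapping, pgoff_t index, unsigned blocksize)
{
	struct page *page;
	struct buffer_head *bh;

	page = find_or_create_page(mapping, index, GFP_NOFS);
	if (!page)
		return NULL;
	if (!page_has_buffers(page))
		create_empty_buffers(page, blocksize, 0);
	bh = page_buffers(page);
	get_bh(bh);                 /* hold the buffer across unlock/release */
	unlock_page(page);
	page_cache_release(page);
	return bh;
}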

>(a) why does NILFS need this and (b) why aren't these patches against
>generic mm/*.c?

(a) I believe this is historical, but I will confirm the reason
    why filemap was not adopted.

(b) Because I think it should be eliminated rather than integrated 
    into mm/ at this point.

Thank you for the comment.

Regards,
Ryusuke Konishi


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20 16:13   ` Ryusuke Konishi
@ 2008-08-20 21:25     ` Szabolcs Szakacsits
  2008-08-20 21:39       ` Andrew Morton
  2008-08-21 12:51       ` [PATCH RFC] nilfs2: continuous snapshotting file system Chris Mason
  2008-08-26 10:16     ` Jörn Engel
  1 sibling, 2 replies; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-20 21:25 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: Andrew Morton, linux-fsdevel, linux-kernel


On Thu, 21 Aug 2008, Ryusuke Konishi wrote:
> >> Some impressive benchmark results on SSD are shown in [3],
> >
> >heh.  It wipes the floor with everything, including btrfs.

It seems the benchmark was done over half a year ago. It's questionable
how relevant the performance comparison is today against actively
developed file systems ...

> >But a log-based fs will do that, initially.  What will the performace
> >look like after a month or two's usage?
> 
> I'm using NILFS2 for my home directory for serveral months, but so far
> I don't feel notable performance degradation. 

I ran compilebench on kernel 2.6.26 with freshly formatted volumes. 
The behavior of NILFS2 was interesting.

Its performance rapidly degrades to the lowest level ever measured
(< 1 MB/s), but after a while it recovers and gives consistent numbers.
However, it's still very far from the current unstable btrfs performance.
The results are reproducible.

                    MB/s    Runtime (s)
                   -----    -----------
  btrfs unstable   17.09        572
  ext3             13.24        877
  btrfs 0.16       12.33        793
  nilfs2 2nd+ runs 11.29        674
  ntfs-3g           8.55        865
  reiserfs          8.38        966
  nilfs2 1st run    4.95       3800
  xfs               1.88       3901

	Szaka

-- 
NTFS-3G:  http://ntfs-3g.org



* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20 21:25     ` Szabolcs Szakacsits
@ 2008-08-20 21:39       ` Andrew Morton
  2008-08-20 21:48         ` Szabolcs Szakacsits
  2008-08-21  2:12         ` Dave Chinner
  2008-08-21 12:51       ` [PATCH RFC] nilfs2: continuous snapshotting file system Chris Mason
  1 sibling, 2 replies; 55+ messages in thread
From: Andrew Morton @ 2008-08-20 21:39 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: konishi.ryusuke, linux-fsdevel, linux-kernel

On Thu, 21 Aug 2008 00:25:55 +0300 (MET DST)
Szabolcs Szakacsits <szaka@ntfs-3g.org> wrote:

> 
> On Thu, 21 Aug 2008, Ryusuke Konishi wrote:
> > >> Some impressive benchmark results on SSD are shown in [3],
> > >
> > >heh.  It wipes the floor with everything, including btrfs.
> 
> It seems the benchmark was done over half year ago. It's questionable how 
> relevant today the performance comparison is with actively developed file 
> systems ...
> 
> > >But a log-based fs will do that, initially.  What will the performace
> > >look like after a month or two's usage?
> > 
> > I'm using NILFS2 for my home directory for serveral months, but so far
> > I don't feel notable performance degradation. 
> 
> I ran compilebench on kernel 2.6.26 with freshly formatted volumes. 
> The behavior of NILFS2 was interesting.
> 
> Its peformance rapidly degrades to the lowest ever measured level 
> (< 1 MB/s) but after a while it recovers and gives consistent numbers.
> However it's still very far from the current unstable btrfs performance. 
> The results are reproducible.
> 
>                     MB/s    Runtime (s)
>                    -----    -----------
>   btrfs unstable   17.09        572
>   ext3             13.24        877
>   btrfs 0.16       12.33        793
>   nilfs2 2nd+ runs 11.29        674
>   ntfs-3g           8.55        865
>   reiserfs          8.38        966
>   nilfs2 1st run    4.95       3800
>   xfs               1.88       3901

err, what the heck happened to xfs?  Is this usual?


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20 21:39       ` Andrew Morton
@ 2008-08-20 21:48         ` Szabolcs Szakacsits
  2008-08-21  2:12         ` Dave Chinner
  1 sibling, 0 replies; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-20 21:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: konishi.ryusuke, linux-fsdevel, linux-kernel


On Wed, 20 Aug 2008, Andrew Morton wrote:
> On Thu, 21 Aug 2008 00:25:55 +0300 (MET DST)
> Szabolcs Szakacsits <szaka@ntfs-3g.org> wrote:
> 
> > 
> > On Thu, 21 Aug 2008, Ryusuke Konishi wrote:
> > > >> Some impressive benchmark results on SSD are shown in [3],
> > > >
> > > >heh.  It wipes the floor with everything, including btrfs.
> > 
> > It seems the benchmark was done over half year ago. It's questionable how 
> > relevant today the performance comparison is with actively developed file 
> > systems ...
> > 
> > > >But a log-based fs will do that, initially.  What will the performace
> > > >look like after a month or two's usage?
> > > 
> > > I'm using NILFS2 for my home directory for serveral months, but so far
> > > I don't feel notable performance degradation. 
> > 
> > I ran compilebench on kernel 2.6.26 with freshly formatted volumes. 
> > The behavior of NILFS2 was interesting.
> > 
> > Its peformance rapidly degrades to the lowest ever measured level 
> > (< 1 MB/s) but after a while it recovers and gives consistent numbers.
> > However it's still very far from the current unstable btrfs performance. 
> > The results are reproducible.
> > 
> >                     MB/s    Runtime (s)
> >                    -----    -----------
> >   btrfs unstable   17.09        572
> >   ext3             13.24        877
> >   btrfs 0.16       12.33        793
> >   nilfs2 2nd+ runs 11.29        674
> >   ntfs-3g           8.55        865
> >   reiserfs          8.38        966
> >   nilfs2 1st run    4.95       3800
> >   xfs               1.88       3901
> 
> err, what the heck happened to xfs?  Is this usual?

vmstat typically shows that xfs does ... "nothing". It uses no CPU time and 
doesn't wait for I/O either. 

	Szaka

--
NTFS-3G:  http://ntfs-3g.org



* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20 21:39       ` Andrew Morton
  2008-08-20 21:48         ` Szabolcs Szakacsits
@ 2008-08-21  2:12         ` Dave Chinner
  2008-08-21  2:46           ` Szabolcs Szakacsits
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2008-08-21  2:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Szabolcs Szakacsits, konishi.ryusuke, linux-fsdevel, linux-kernel

On Wed, Aug 20, 2008 at 02:39:16PM -0700, Andrew Morton wrote:
> On Thu, 21 Aug 2008 00:25:55 +0300 (MET DST)
> Szabolcs Szakacsits <szaka@ntfs-3g.org> wrote:
> > I ran compilebench on kernel 2.6.26 with freshly formatted volumes. 
> > The behavior of NILFS2 was interesting.
> > 
> > Its peformance rapidly degrades to the lowest ever measured level 
> > (< 1 MB/s) but after a while it recovers and gives consistent numbers.
> > However it's still very far from the current unstable btrfs performance. 
> > The results are reproducible.
> > 
> >                     MB/s    Runtime (s)
> >                    -----    -----------
> >   btrfs unstable   17.09        572
> >   ext3             13.24        877
> >   btrfs 0.16       12.33        793
> >   nilfs2 2nd+ runs 11.29        674
> >   ntfs-3g           8.55        865
> >   reiserfs          8.38        966
> >   nilfs2 1st run    4.95       3800
> >   xfs               1.88       3901
> 
> err, what the heck happened to xfs?  Is this usual?

No, definitely not usual. I suspect it's from an old mkfs and
barriers being used.  What is the output of mkfs.xfs when you make
the filesystem, and what mount options are being used?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-21  2:12         ` Dave Chinner
@ 2008-08-21  2:46           ` Szabolcs Szakacsits
  2008-08-21  5:15             ` XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system) Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-21  2:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andrew Morton, linux-fsdevel, linux-kernel


On Thu, 21 Aug 2008, Dave Chinner wrote:
> On Wed, Aug 20, 2008 at 02:39:16PM -0700, Andrew Morton wrote:
> > On Thu, 21 Aug 2008 00:25:55 +0300 (MET DST)
> > Szabolcs Szakacsits <szaka@ntfs-3g.org> wrote:
> > > I ran compilebench on kernel 2.6.26 with freshly formatted volumes. 
> > > The behavior of NILFS2 was interesting.
> > > 
> > > Its peformance rapidly degrades to the lowest ever measured level 
> > > (< 1 MB/s) but after a while it recovers and gives consistent numbers.
> > > However it's still very far from the current unstable btrfs performance. 
> > > The results are reproducible.
> > > 
> > >                     MB/s    Runtime (s)
> > >                    -----    -----------
> > >   btrfs unstable   17.09        572
> > >   ext3             13.24        877
> > >   btrfs 0.16       12.33        793
> > >   nilfs2 2nd+ runs 11.29        674
> > >   ntfs-3g           8.55        865
> > >   reiserfs          8.38        966
> > >   nilfs2 1st run    4.95       3800
> > >   xfs               1.88       3901
> > 
> > err, what the heck happened to xfs?  Is this usual?
> 
> No, definitely not usual. I suspect it's from an old mkfs and
> barriers being used.  What is the output of the xfs.mkfs when
> you make the filesystem and what mount options being used?

Everything is default.

  % rpm -qf =mkfs.xfs
  xfsprogs-2.9.8-7.1 

which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is the 
latest stable mkfs.xfs. Its output is

meta-data=/dev/sda8              isize=256    agcount=4, agsize=1221440 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4885760, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

Kernel xfs log:

SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
SGI XFS Quota Management subsystem
XFS mounting filesystem sda8
Ending clean XFS mount for filesystem: sda8

	Szaka

--
NTFS-3G:  http://ntfs-3g.org



* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20  9:47 ` Andi Kleen
@ 2008-08-21  4:57   ` Ryusuke Konishi
  0 siblings, 0 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-21  4:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-fsdevel, linux-kernel

On Wed, 20 Aug 2008 11:47:18 +0200 Andi Kleen wrote:
>> It has many TODO items:
>
>How stable is the on-disk format?

It's almost stable.
I hope not to make any major changes to the on-disk format that
would affect compatibility.

Some unsupported features like atime, EA, and ACLs should be
carefully confirmed before merging to mainline, though I believe
they won't require a major change.

>If the file system makes mainline
>your user base would likely increase significantly. Users then tend
>to have a reasonable exception that they can still mount old file systems
>later on newer kernels (although not necessarily the other way round)

Yes I know.  Thank you for the important advice.


* XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  2:46           ` Szabolcs Szakacsits
@ 2008-08-21  5:15             ` Dave Chinner
  2008-08-21  6:00               ` gus3
  2008-08-21  6:04               ` Dave Chinner
  0 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-21  5:15 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs

On Thu, Aug 21, 2008 at 05:46:00AM +0300, Szabolcs Szakacsits wrote:
> On Thu, 21 Aug 2008, Dave Chinner wrote:
> > On Wed, Aug 20, 2008 at 02:39:16PM -0700, Andrew Morton wrote:
> > > On Thu, 21 Aug 2008 00:25:55 +0300 (MET DST)
> > > Szabolcs Szakacsits <szaka@ntfs-3g.org> wrote:
> > > > I ran compilebench on kernel 2.6.26 with freshly formatted volumes. 
> > > > The behavior of NILFS2 was interesting.
> > > > 
> > > > Its peformance rapidly degrades to the lowest ever measured level 
> > > > (< 1 MB/s) but after a while it recovers and gives consistent numbers.
> > > > However it's still very far from the current unstable btrfs performance. 
> > > > The results are reproducible.
> > > > 
> > > >                     MB/s    Runtime (s)
> > > >                    -----    -----------
> > > >   btrfs unstable   17.09        572
> > > >   ext3             13.24        877
> > > >   btrfs 0.16       12.33        793
> > > >   nilfs2 2nd+ runs 11.29        674
> > > >   ntfs-3g           8.55        865
> > > >   reiserfs          8.38        966
> > > >   nilfs2 1st run    4.95       3800
> > > >   xfs               1.88       3901
> > > 
> > > err, what the heck happened to xfs?  Is this usual?
> > 
> > No, definitely not usual. I suspect it's from an old mkfs and
> > barriers being used.  What is the output of the xfs.mkfs when
> > you make the filesystem and what mount options being used?
> 
> Everything is default.
> 
>   % rpm -qf =mkfs.xfs
>   xfsprogs-2.9.8-7.1 
> 
> which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is the 
> latest stable mkfs.xfs. Its output is
> 
> meta-data=/dev/sda8              isize=256    agcount=4, agsize=1221440 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=4885760, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096  
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Ok, I thought it might be the tiny log, but it didn't improve anything
here when I increased the log size or the log buffer size.

Looking at the block trace, I think elevator merging is somewhat busted. I'm
seeing adjacent I/Os being dispatched without having been merged, e.g.:

104,48   1     2139     4.803090086  4175  Q   W 18540712 + 8 [pdflush]
104,48   1     2140     4.803092492  4175  G   W 18540712 + 8 [pdflush]
104,48   1     2141     4.803094875  4175  P   N [pdflush]
104,48   1     2142     4.803096205  4175  I   W 18540712 + 8 [pdflush]
104,48   1     2143     4.803160324  4175  Q   W 18540720 + 40 [pdflush]
104,48   1     2144     4.803162724  4175  M   W 18540720 + 40 [pdflush]
104,48   1     2145     4.803231701  4175  Q   W 18540760 + 48 [pdflush]
104,48   1     2146     4.803234223  4175  M   W 18540760 + 48 [pdflush]
.....
104,48   1     2163     4.803844214  4175  Q   W 18541032 + 56 [pdflush]
104,48   1     2164     4.803846694  4175  M   W 18541032 + 56 [pdflush]
104,48   1     2165     4.803932321  4175  Q   W 18541088 + 48 [pdflush]
104,48   1     2166     4.803937177  4175  G   W 18541088 + 48 [pdflush]
104,48   1     2167     4.803940416  4175  I   W 18541088 + 48 [pdflush]
104,48   1     2168     4.804005265  4175  Q   W 18541136 + 24 [pdflush]
104,48   1     2169     4.804007664  4175  M   W 18541136 + 24 [pdflush]
.....
104,48   1     2183     4.804518129  4175  D   W 18540712 + 376 [pdflush]
104,48   1     2184     4.804537981  4175  D   W 18541088 + 248 [pdflush]

In entry 2165, a new request is made rather than merging the
existing, adjacent request that is already open. The result is we
then dispatch two I/Os instead of one.

Also, CFQ appears to not be merging WRITE_SYNC bios or issuing them
with any urgency.  The result of this is that it stalls the XFS
transaction subsystem by capturing all the log buffers in the
elevator and not issuing them. e.g.:

104,48   0      149     0.107856547  4160  Q  WS 35624860 + 128 [pdflush]
104,48   0      150     0.107861855  4160  G  WS 35624860 + 128 [pdflush]
104,48   0      151     0.107865332  4160  I   W 35624860 + 128 [pdflush]
...
104,48   0      162     0.120791581  4159  Q  WS 35624988 + 128 [python]
104,48   0      163     0.120805714  4159  G  WS 35624988 + 128 [python]
104,48   0      164     0.120813427  4159  I   W 35624988 + 128 [python]
104,48   0      165     0.132109889  4159  Q  WS 35625116 + 128 [python]
104,48   0      166     0.132128642  4159  G  WS 35625116 + 128 [python]
104,48   0      167     0.132132988  4159  I   W 35625116 + 128 [python]
104,48   0      168     0.143612843  4159  Q  WS 35625244 + 128 [python]
104,48   0      169     0.143640248  4159  G  WS 35625244 + 128 [python]
104,48   0      170     0.143644697  4159  I   W 35625244 + 128 [python]
104,48   0      171     0.158243553  4159  Q  WS 35625372 + 128 [python]
104,48   0      172     0.158261652  4159  G  WS 35625372 + 128 [python]
104,48   0      173     0.158266233  4159  I   W 35625372 + 128 [python]
104,48   0      174     0.171342555  4159  Q  WS 35625500 + 128 [python]
104,48   0      175     0.171360707  4159  G  WS 35625500 + 128 [python]
104,48   0      176     0.171365036  4159  I   W 35625500 + 128 [python]
104,48   0      177     0.183936429  4159  Q  WS 35625628 + 128 [python]
104,48   0      178     0.183955172  4159  G  WS 35625628 + 128 [python]
104,48   0      179     0.183959726  4159  I   W 35625628 + 128 [python]
...
104,48   0      180     0.194008953  4159  Q  WS 35625756 + 128 [python]
104,48   0      181     0.194027120  4159  G  WS 35625756 + 128 [python]
104,48   0      182     0.194031311  4159  I   W 35625756 + 128 [python]
...
104,48   0      191     0.699915104     0  D   W 35624860 + 128 [swapper]
...
104,48   0      196     0.700513279     0  C   W 35624860 + 128 [0]
...
104,48   0      198     0.711808579  4159  Q  WS 35625884 + 128 [python]
104,48   0      199     0.711826259  4159  G  WS 35625884 + 128 [python]
104,48   0      200     0.711830589  4159  I   W 35625884 + 128 [python]
104,48   0      201     0.711848493  4159  D   W 35624988 + 128 [python]
104,48   0      202     0.711861868  4159  D   W 35625116 + 128 [python]
104,48   0      203     0.711868531  4159  D   W 35625244 + 128 [python]
104,48   0      204     0.711874967  4159  D   W 35625372 + 128 [python]
....
104,48   1       72     0.900288147     0  D   W 35625500 + 128 [swapper]
104,48   1       73     0.900296058     0  D   W 35625628 + 128 [swapper]
104,48   1       74     0.900302401     0  D   W 35625756 + 128 [swapper]
104,48   1       75     0.900308516     0  D   W 35625884 + 128 [swapper]
.....

here we see all 8 log buffers written and queued in ~95ms. At this point
(0.194s into the trace) the log stalls because we've used all the log
buffers and have to wait for I/O to complete. The filesystem effectively
sits idle now for half a second waiting for I/O to be dispatched.

At 0.699s, we have a single buffer issued and it completes in 500
*microseconds* (NVRAM on raid controller). We do completion
processing, fill and dispatch that buffer in under 10ms (on a 1GHz
P3) at which point we dispatch the 4 oldest remaining buffers. 200ms
later, we dispatch the remainder.

Effectively, the elevator has stalled all transactions in the
filesystem for close to 700ms by not dispatching the SYNC_WRITE
buffers, and all the bios could have been merged into a single 512k
I/O when they were to be dispatched. I guess the only way to prevent
this really is to issue explicit unplugs....
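
For illustration, a minimal sketch (not XFS code) of what such an explicit
unplug could look like, assuming the blk_unplug() helper of the 2.6.2x
block layer; log_force_dispatch() and "log_bdev" are hypothetical:

#include <linux/blkdev.h>

/* Sketch only: kick the elevator so that queued SYNC_WRITE log buffers
 * are dispatched now rather than sitting in the queue. */
static void log_force_dispatch(struct block_device *log_bdev)
{
	struct request_queue *q = bdev_get_queue(log_bdev);

	if (q)
		blk_unplug(q);  /* runs q->unplug_fn, starting dispatch */
}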

On 2.6.24:

104,48   0      975     1.707253442  2761  Q  WS 35753545 + 128 [python]
104,48   0      976     1.707268811  2761  G  WS 35753545 + 128 [python]
104,48   0      977     1.707275455  2761  I   W 35753545 + 128 [python]
104,48   0      978     1.728703316  2761  Q  WS 35753673 + 128 [python]
104,48   0      979     1.728714289  2761  M  WS 35753673 + 128 [python]
104,48   0      980     1.761603632  2761  Q  WS 35753801 + 128 [python]
104,48   0      981     1.761614498  2761  M  WS 35753801 + 128 [python]
104,48   0      982     1.784522988  2761  Q  WS 35753929 + 128 [python]
104,48   0      983     1.784533351  2761  M  WS 35753929 + 128 [python]
....
104,48   0     1125     2.475132431     0  D   W 35753545 + 512 [swapper]

The I/Os are merged, but there's still that 700ms delay before dispatch.
I was looking at this a while back but didn't get around to finishing it
off, i.e.:

http://oss.sgi.com/archives/xfs/2008-01/msg00151.html
http://oss.sgi.com/archives/xfs/2008-01/msg00152.html

I'll have a bit more of a look at this w.r.t. compilebench performance,
because it seems like a similar set of problems that I was seeing back
then...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  5:15             ` XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system) Dave Chinner
@ 2008-08-21  6:00               ` gus3
  2008-08-21  6:14                 ` Dave Chinner
  2008-08-21  6:04               ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: gus3 @ 2008-08-21  6:00 UTC (permalink / raw)
  To: Szabolcs Szakacsits, Dave Chinner
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs

--- On Wed, 8/20/08, Dave Chinner <david@fromorbit.com> wrote:

> Ok, I thought it might be the tiny log, but it didn't
> improve anything
> here when increased the log size, or the log buffer size.
> 
> Looking at the block trace, I think elevator merging is
> somewhat busted. I'm
> seeing adjacent I/Os being dispatched without having been
> merged.  e.g:

[snip]

> Also, CFQ appears to not be merging WRITE_SYNC bios or
> issuing them
> with any urgency.  The result of this is that it stalls the
> XFS
> transaction subsystem by capturing all the log buffers in
> the
> elevator and not issuing them. e.g.:

[snip]

> The I/Os are merged, but there's still that 700ms delay
> before dispatch.
> i was looking at this a while back but didn't get to
> finishing it off.
> i.e.:
> 
> http://oss.sgi.com/archives/xfs/2008-01/msg00151.html
> http://oss.sgi.com/archives/xfs/2008-01/msg00152.html
> 
> I'll have a bit more of a look at this w.r.t to
> compilebench performance,
> because it seems like a similar set of problems that I was
> seeing back
> then...

I concur with your observation, esp. w.r.t. XFS and CFQ clashing:

http://gus3.typepad.com/i_am_therefore_i_think/2008/07/finding-the-fas.html

CFQ is the default on most Linux systems AFAIK; for decent XFS performance
one needs to switch to "noop" or "deadline". I wasn't sure if it was broken
code, or simply base assumptions in conflict (XFS vs. CFQ). Your log output
sheds light on the matter for me, thanks.


* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  5:15             ` XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system) Dave Chinner
  2008-08-21  6:00               ` gus3
@ 2008-08-21  6:04               ` Dave Chinner
  2008-08-21  8:07                 ` Aaron Carroll
                                   ` (2 more replies)
  1 sibling, 3 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-21  6:04 UTC (permalink / raw)
  To: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

On Thu, Aug 21, 2008 at 03:15:08PM +1000, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 05:46:00AM +0300, Szabolcs Szakacsits wrote:
> > On Thu, 21 Aug 2008, Dave Chinner wrote:
> > Everything is default.
> > 
> >   % rpm -qf =mkfs.xfs
> >   xfsprogs-2.9.8-7.1 
> > 
> > which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is the 
> > latest stable mkfs.xfs. Its output is
> > 
> > meta-data=/dev/sda8              isize=256    agcount=4, agsize=1221440 blks
> >          =                       sectsz=512   attr=2
> > data     =                       bsize=4096   blocks=4885760, imaxpct=25
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096  
> > log      =internal log           bsize=4096   blocks=2560, version=2
> >          =                       sectsz=512   sunit=0 blks, lazy-count=0
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Ok, I thought it might be the tiny log, but it didn't improve anything
> here when increased the log size, or the log buffer size.

One thing I just found out - my old *laptop* is 4-5x faster than the
10krpm scsi disk behind an old cciss raid controller.  I'm wondering
if the long delays in dispatch are caused by an interaction with CTQ,
but I can't change it on the cciss raid controllers. Are you using
ctq/ncq on your machine?  If so, can you reduce the depth to
something less than 4 and see what difference that makes?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  6:00               ` gus3
@ 2008-08-21  6:14                 ` Dave Chinner
  2008-08-21  7:00                   ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2008-08-21  6:14 UTC (permalink / raw)
  To: gus3; +Cc: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

On Wed, Aug 20, 2008 at 11:00:07PM -0700, gus3 wrote:
> --- On Wed, 8/20/08, Dave Chinner <david@fromorbit.com> wrote:
> 
> > Ok, I thought it might be the tiny log, but it didn't improve
> > anything here when increased the log size, or the log buffer
> > size.
> > 
> > Looking at the block trace, I think elevator merging is somewhat
> > busted. I'm seeing adjacent I/Os being dispatched without having
> > been merged.  e.g:
> 
> [snip]
> 
> > Also, CFQ appears to not be merging WRITE_SYNC bios or issuing
> > them with any urgency.  The result of this is that it stalls the
> > XFS transaction subsystem by capturing all the log buffers in
> > the elevator and not issuing them. e.g.:
> 
> [snip]
> 
> > The I/Os are merged, but there's still that 700ms delay before
> > dispatch.  i was looking at this a while back but didn't get to
> > finishing it off.  i.e.:
> > 
> > http://oss.sgi.com/archives/xfs/2008-01/msg00151.html
> > http://oss.sgi.com/archives/xfs/2008-01/msg00152.html
> > 
> > I'll have a bit more of a look at this w.r.t to compilebench
> > performance, because it seems like a similar set of problems
> > that I was seeing back then...
> 
> I concur your observation, esp. w.r.t. XFS and CFQ clashing:
> 
> http://gus3.typepad.com/i_am_therefore_i_think/2008/07/finding-the-fas.html
> 
> CFQ is the default on most Linux systems AFAIK; for decent XFS
> performance one needs to switch to "noop" or "deadline". I wasn't
> sure if it was broken code, or simply base assumptions in conflict
> (XFS vs. CFQ). Your log output sheds light on the matter for me,
> thanks.

I'm wondering if these elevators are just getting too smart for
their own good.  W.r.t. the above test, deadline was about twice
as slow as CFQ - it does immediate dispatch on SYNC_WRITE bios and
so caused more seeks than CFQ and hence went slower. noop had
similar dispatch latency problems to CFQ, so it wasn't any
faster either.

I think that we need to issue explicit unplugs to get the log I/O
dispatched the way we want on all elevators and stop trying to
give elevators implicit hints by abusing the bio types and hoping
they do the right thing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  6:14                 ` Dave Chinner
@ 2008-08-21  7:00                   ` Nick Piggin
  2008-08-21  8:53                     ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2008-08-21  7:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Thursday 21 August 2008 16:14, Dave Chinner wrote:

> I think that we need to issue explicit unplugs to get the log I/O
> dispatched the way we want on all elevators and stop trying to
> give elevators implicit hints by abusing the bio types and hoping
> they do the right thing....

FWIW, my explicit plugging idea is still hanging around in one of
Jens' block trees (actually he refreshed it a couple of months ago).

It provides an API for VM or filesystems to plug and unplug
requests coming out of the current process, and it can reduce the
need to idle the queue. Needs more performance analysis and tuning
though.

But existing plugging is below the level of the elevators, and should
only kick in for at most tens of ms at queue idle events, so it sounds
like it may not be your problem. Elevators will need some hint to give
priority to specific requests -- either via the current threads's io
priority, or information attached to bios.
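
For illustration, a minimal sketch of how a filesystem might batch its log
I/O with an explicit per-process plug of the kind described above.  The
names blk_start_plug()/blk_finish_plug() are taken from the form this API
later took in mainline, not from the patch set in Jens' tree, and
submit_log_bios() is hypothetical:

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/fs.h>

/* Sketch only: hold back dispatch while a batch of log bios is queued,
 * then hand the whole batch to the elevator at once. */
static void submit_log_bios(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);          /* per-process plug */
	for (i = 0; i < nr; i++)
		submit_bio(WRITE_SYNC, bios[i]);
	blk_finish_plug(&plug);         /* flush the batch to the elevator */
}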



* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  6:04               ` Dave Chinner
@ 2008-08-21  8:07                 ` Aaron Carroll
  2008-08-21  8:25                 ` Dave Chinner
  2008-08-21 11:53                 ` Matthew Wilcox
  2 siblings, 0 replies; 55+ messages in thread
From: Aaron Carroll @ 2008-08-21  8:07 UTC (permalink / raw)
  To: david; +Cc: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

[-- Attachment #1: Type: text/plain, Size: 713 bytes --]

Dave Chinner wrote:
> One thing I just found out - my old *laptop* is 4-5x faster than the
> 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> if the long delays in dispatch is caused by an interaction with CTQ
> but I can't change it on the cciss raid controllers. Are you using
> ctq/ncq on your machine?  If so, can you reduce the depth to
> something less than 4 and see what difference that makes?

I've been benchmarking on a cciss card, and patched the driver to
control the queue depth via sysfs.  Maybe you'll find it useful...

The original patch was for 2.6.24, but that won't apply on git head.
I fixed it for 2.6.27, and it seems to work fine.  Both are attached.


   -- Aaron


[-- Attachment #2: cciss_qdepth-2.6.24.patch --]
[-- Type: text/plain, Size: 4025 bytes --]

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 55bd35c..709c419 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -474,7 +474,7 @@ static CommandList_struct *cmd_alloc(ctlr_info_t *h, int get_from_pool)
 
 		do {
 			i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
-			if (i == h->nr_cmds)
+			if (i >= h->qdepth_max)
 				return NULL;
 		} while (test_and_set_bit
 			 (i & (BITS_PER_LONG - 1),
@@ -1257,7 +1257,7 @@ static void cciss_check_queues(ctlr_info_t *h)
 	 * in case the interrupt we serviced was from an ioctl and did not
 	 * free any new commands.
 	 */
-	if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) == h->nr_cmds)
+	if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) >= h->qdepth_max)
 		return;
 
 	/* We have room on the queue for more commands.  Now we need to queue
@@ -1276,7 +1276,7 @@ static void cciss_check_queues(ctlr_info_t *h)
 		/* check to see if we have maxed out the number of commands
 		 * that can be placed on the queue.
 		 */
-		if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) == h->nr_cmds) {
+		if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) >= h->qdepth_max) {
 			if (curr_queue == start_queue) {
 				h->next_to_run =
 				    (start_queue + 1) % (h->highest_lun + 1);
@@ -3075,6 +3075,7 @@ static int __devinit cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
 			c->product_name = products[i].product_name;
 			c->access = *(products[i].access);
 			c->nr_cmds = products[i].nr_cmds;
+			c->qdepth_max = products[i].nr_cmds;
 			break;
 		}
 	}
@@ -3095,6 +3096,7 @@ static int __devinit cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
 			c->product_name = products[i-1].product_name;
 			c->access = *(products[i-1].access);
 			c->nr_cmds = products[i-1].nr_cmds;
+			c->qdepth_max = products[i-1].nr_cmds;
 			printk(KERN_WARNING "cciss: This is an unknown "
 				"Smart Array controller.\n"
 				"cciss: Please update to the latest driver "
@@ -3346,6 +3348,44 @@ static void free_hba(int i)
 	kfree(p);
 }
 
+static inline ctlr_info_t *cciss_get_ctlr_info(struct device *dev)
+{
+	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
+	return pci_get_drvdata(pdev);
+}
+
+static ssize_t cciss_show_queue_depth(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	ctlr_info_t *ctlr = cciss_get_ctlr_info(dev);
+	BUG_ON(!ctlr);
+
+	return sprintf(buf, "%u\n", ctlr->qdepth_max);
+}
+
+static ssize_t cciss_store_queue_depth(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	ctlr_info_t *ctlr = cciss_get_ctlr_info(dev);
+	unsigned long qdepth_max;
+
+	BUG_ON(!ctlr);
+	qdepth_max = simple_strtoul(buf, NULL, 10);
+
+	if (qdepth_max < 1)
+		qdepth_max = 1;
+	else if (qdepth_max > ctlr->nr_cmds)
+		qdepth_max = ctlr->nr_cmds;
+
+	ctlr->qdepth_max = (unsigned)qdepth_max;
+	return count;
+}
+
+static struct device_attribute cciss_queue_depth =
+		__ATTR(queue_depth, S_IRUGO | S_IWUSR,
+			&cciss_show_queue_depth,
+			&cciss_store_queue_depth);
+
 /*
  *  This is it.  Find all the controllers and register them.  I really hate
  *  stealing all these major device numbers.
@@ -3450,6 +3490,11 @@ static int __devinit cciss_init_one(struct pci_dev *pdev,
 	       ((hba[i]->nr_cmds + BITS_PER_LONG -
 		 1) / BITS_PER_LONG) * sizeof(unsigned long));
 
+	/* Setup queue_depth sysfs entry */
+	rc = device_create_file(&pdev->dev, &cciss_queue_depth);
+	if (rc)
+		goto clean4;
+
 #ifdef CCISS_DEBUG
 	printk(KERN_DEBUG "Scanning for drives on controller cciss%d\n", i);
 #endif				/* CCISS_DEBUG */
diff --git a/drivers/block/cciss.h b/drivers/block/cciss.h
index b70988d..6a4a38a 100644
--- a/drivers/block/cciss.h
+++ b/drivers/block/cciss.h
@@ -60,6 +60,7 @@ struct ctlr_info
 	void __iomem *vaddr;
 	unsigned long paddr;
 	int 	nr_cmds; /* Number of commands allowed on this controller */
+	unsigned qdepth_max;  /* userspace queue depth limit */
 	CfgTable_struct __iomem *cfgtable;
 	int	interrupts_enabled;
 	int	major;

[-- Attachment #3: cciss_qdepth-2.6.27.patch --]
[-- Type: text/plain, Size: 3974 bytes --]

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index b73116e..066577f 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -480,7 +480,7 @@ static CommandList_struct *cmd_alloc(ctlr_info_t *h, int get_from_pool)
 
 		do {
 			i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
-			if (i == h->nr_cmds)
+			if (i >= h->qdepth_max)
 				return NULL;
 		} while (test_and_set_bit
 			 (i & (BITS_PER_LONG - 1),
@@ -1259,7 +1259,7 @@ static void cciss_check_queues(ctlr_info_t *h)
 	 * in case the interrupt we serviced was from an ioctl and did not
 	 * free any new commands.
 	 */
-	if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) == h->nr_cmds)
+	if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) >= h->qdepth_max)
 		return;
 
 	/* We have room on the queue for more commands.  Now we need to queue
@@ -1278,7 +1278,7 @@ static void cciss_check_queues(ctlr_info_t *h)
 		/* check to see if we have maxed out the number of commands
 		 * that can be placed on the queue.
 		 */
-		if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) == h->nr_cmds) {
+		if ((find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds)) >= h->qdepth_max) {
 			if (curr_queue == start_queue) {
 				h->next_to_run =
 				    (start_queue + 1) % (h->highest_lun + 1);
@@ -3253,6 +3253,7 @@ static int __devinit cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
 			c->product_name = products[i].product_name;
 			c->access = *(products[i].access);
 			c->nr_cmds = c->max_commands - 4;
+			c->qdepth_max = c->nr_cmds;
 			break;
 		}
 	}
@@ -3273,6 +3274,7 @@ static int __devinit cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
 			c->product_name = products[i-1].product_name;
 			c->access = *(products[i-1].access);
 			c->nr_cmds = c->max_commands - 4;
+			c->qdepth_max = c->nr_cmds;
 			printk(KERN_WARNING "cciss: This is an unknown "
 				"Smart Array controller.\n"
 				"cciss: Please update to the latest driver "
@@ -3392,6 +3394,44 @@ static void free_hba(int i)
 	kfree(p);
 }
 
+static inline ctlr_info_t *cciss_get_ctlr_info(struct device *dev)
+{
+	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
+	return pci_get_drvdata(pdev);
+}
+
+static ssize_t cciss_show_queue_depth(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	ctlr_info_t *ctlr = cciss_get_ctlr_info(dev);
+	BUG_ON(!ctlr);
+
+	return sprintf(buf, "%u\n", ctlr->qdepth_max);
+}
+
+static ssize_t cciss_store_queue_depth(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	ctlr_info_t *ctlr = cciss_get_ctlr_info(dev);
+	unsigned long qdepth_max;
+
+	BUG_ON(!ctlr);
+	qdepth_max = simple_strtoul(buf, NULL, 10);
+
+	if (qdepth_max < 1)
+		qdepth_max = 1;
+	else if (qdepth_max > ctlr->nr_cmds)
+		qdepth_max = ctlr->nr_cmds;
+
+	ctlr->qdepth_max = (unsigned)qdepth_max;
+	return count;
+}
+
+static struct device_attribute cciss_queue_depth =
+		__ATTR(queue_depth, S_IRUGO | S_IWUSR,
+			&cciss_show_queue_depth,
+			&cciss_store_queue_depth);
+
 /*
  *  This is it.  Find all the controllers and register them.  I really hate
  *  stealing all these major device numbers.
@@ -3496,6 +3536,11 @@ static int __devinit cciss_init_one(struct pci_dev *pdev,
 	       ((hba[i]->nr_cmds + BITS_PER_LONG -
 		 1) / BITS_PER_LONG) * sizeof(unsigned long));
 
+	/* Setup queue_depth sysfs entry */
+	rc = device_create_file(&pdev->dev, &cciss_queue_depth);
+	if (rc)
+		goto clean4;
+
 	hba[i]->num_luns = 0;
 	hba[i]->highest_lun = -1;
 	for (j = 0; j < CISS_MAX_LUN; j++) {
diff --git a/drivers/block/cciss.h b/drivers/block/cciss.h
index 24a7efa..91dcac6 100644
--- a/drivers/block/cciss.h
+++ b/drivers/block/cciss.h
@@ -62,6 +62,7 @@ struct ctlr_info
 	void __iomem *vaddr;
 	unsigned long paddr;
 	int 	nr_cmds; /* Number of commands allowed on this controller */
+	unsigned qdepth_max;  /* userspace queue depth limit */
 	CfgTable_struct __iomem *cfgtable;
 	int	interrupts_enabled;
 	int	major;


* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  6:04               ` Dave Chinner
  2008-08-21  8:07                 ` Aaron Carroll
@ 2008-08-21  8:25                 ` Dave Chinner
  2008-08-21 11:02                   ` Martin Steigerwald
  2008-08-21 17:10                   ` Szabolcs Szakacsits
  2008-08-21 11:53                 ` Matthew Wilcox
  2 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-21  8:25 UTC (permalink / raw)
  To: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 03:15:08PM +1000, Dave Chinner wrote:
> > On Thu, Aug 21, 2008 at 05:46:00AM +0300, Szabolcs Szakacsits wrote:
> > > On Thu, 21 Aug 2008, Dave Chinner wrote:
> > > Everything is default.
> > > 
> > >   % rpm -qf =mkfs.xfs
> > >   xfsprogs-2.9.8-7.1 
> > > 
> > > which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is the 
> > > latest stable mkfs.xfs. Its output is
> > > 
> > > meta-data=/dev/sda8              isize=256    agcount=4, agsize=1221440 blks
> > >          =                       sectsz=512   attr=2
> > > data     =                       bsize=4096   blocks=4885760, imaxpct=25
> > >          =                       sunit=0      swidth=0 blks
> > > naming   =version 2              bsize=4096  
> > > log      =internal log           bsize=4096   blocks=2560, version=2
> > >          =                       sectsz=512   sunit=0 blks, lazy-count=0
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > Ok, I thought it might be the tiny log, but it didn't improve anything
> > here when increased the log size, or the log buffer size.
> 
> One thing I just found out - my old *laptop* is 4-5x faster than the
> 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> if the long delays in dispatch is caused by an interaction with CTQ
> but I can't change it on the cciss raid controllers. Are you using
> ctq/ncq on your machine?  If so, can you reduce the depth to
> something less than 4 and see what difference that makes?

Just to point out - this is not a new problem - I can reproduce
it on 2.6.24 as well as 2.6.26. Likewise, my laptop shows XFS
being faster than ext3 on both 2.6.24 and 2.6.26. So the difference
is something related to the disk subsystem on the server....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  7:00                   ` Nick Piggin
@ 2008-08-21  8:53                     ` Dave Chinner
  2008-08-21  9:33                       ` Nick Piggin
  2008-08-21 14:52                       ` Chris Mason
  0 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-21  8:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Thu, Aug 21, 2008 at 05:00:39PM +1000, Nick Piggin wrote:
> On Thursday 21 August 2008 16:14, Dave Chinner wrote:
> 
> > I think that we need to issue explicit unplugs to get the log I/O
> > dispatched the way we want on all elevators and stop trying to
> > give elevators implicit hints by abusing the bio types and hoping
> > they do the right thing....
> 
> FWIW, my explicit plugging idea is still hanging around in one of
> Jens' block trees (actually he refreshed it a couple of months ago).
> 
> It provides an API for VM or filesystems to plug and unplug
> requests coming out of the current process, and it can reduce the
> need to idle the queue. Needs more performance analysis and tuning
> though.

We've already got plenty of explicit unplugs in XFS to get stuff
moving quickly - I'll just have to add another....

> But existing plugging is below the level of the elevators, and should
> only kick in for at most tens of ms at queue idle events, so it sounds
> like it may not be your problem. Elevators will need some hint to give
> priority to specific requests -- either via the current threads's io
> priority, or information attached to bios.

It's getting too bloody complex, IMO. What is right for one elevator
is wrong for another, so as a filesystem developer I have to pick
one to target. With the way the elevators have been regressing,
improving and changing behaviour, I am starting to think that I
should be picking the noop scheduler. Any 'advanced' scheduler that
is slower than the same test on the noop scheduler needs fixing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  8:53                     ` Dave Chinner
@ 2008-08-21  9:33                       ` Nick Piggin
  2008-08-21 17:08                         ` Dave Chinner
  2008-08-21 14:52                       ` Chris Mason
  1 sibling, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2008-08-21  9:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Thursday 21 August 2008 18:53, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 05:00:39PM +1000, Nick Piggin wrote:
> > On Thursday 21 August 2008 16:14, Dave Chinner wrote:
> > > I think that we need to issue explicit unplugs to get the log I/O
> > > dispatched the way we want on all elevators and stop trying to
> > > give elevators implicit hints by abusing the bio types and hoping
> > > they do the right thing....
> >
> > FWIW, my explicit plugging idea is still hanging around in one of
> > Jens' block trees (actually he refreshed it a couple of months ago).
> >
> > It provides an API for VM or filesystems to plug and unplug
> > requests coming out of the current process, and it can reduce the
> > need to idle the queue. Needs more performance analysis and tuning
> > though.
>
> We've already got plenty of explicit unplugs in XFS to get stuff
> moving quickly - I'll just have to add another....

That doesn't really help at the elevator, though.


> > But existing plugging is below the level of the elevators, and should
> > only kick in for at most tens of ms at queue idle events, so it sounds
> > like it may not be your problem. Elevators will need some hint to give
> > priority to specific requests -- either via the current threads's io
> > priority, or information attached to bios.
>
> It's getting too bloody complex, IMO. What is right for one elevator
> is wrong for another, so as a filesystem developer I have to pick
> one to target.

I don't really see it as too complex. If you know how you want the
request to be handled, then it should be possible to implement.


> With the way the elevators have been regressing, 
> improving and changing behaviour,

AFAIK deadline, AS, and noop haven't significantly changed for years.


> I am starting to think that I 
> should be picking the noop scheduler.
> Any 'advanced' scheduler that 
> is slower than the same test on the noop scheduler needs fixing...

I disagree. On devices with no seek penalty or their own queueing,
noop is often the best choice. Same for specialized apps that do
their own disk scheduling.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  8:25                 ` Dave Chinner
@ 2008-08-21 11:02                   ` Martin Steigerwald
  2008-08-21 15:00                     ` Martin Steigerwald
  2008-08-21 17:10                   ` Szabolcs Szakacsits
  1 sibling, 1 reply; 55+ messages in thread
From: Martin Steigerwald @ 2008-08-21 11:02 UTC (permalink / raw)
  To: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

Am Donnerstag 21 August 2008 schrieb Dave Chinner:
> On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > On Thu, Aug 21, 2008 at 03:15:08PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 21, 2008 at 05:46:00AM +0300, Szabolcs Szakacsits wrote:
> > > > On Thu, 21 Aug 2008, Dave Chinner wrote:
> > > > Everything is default.
> > > >
> > > >   % rpm -qf =mkfs.xfs
> > > >   xfsprogs-2.9.8-7.1
> > > >
> > > > which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is
> > > > the latest stable mkfs.xfs. Its output is
> > > >
> > > > meta-data=/dev/sda8              isize=256    agcount=4,
> > > > agsize=1221440 blks =                       sectsz=512   attr=2
> > > > data     =                       bsize=4096   blocks=4885760,
> > > > imaxpct=25 =                       sunit=0      swidth=0 blks
> > > > naming   =version 2              bsize=4096
> > > > log      =internal log           bsize=4096   blocks=2560,
> > > > version=2 =                       sectsz=512   sunit=0 blks,
> > > > lazy-count=0 realtime =none                   extsz=4096  
> > > > blocks=0, rtextents=0
> > >
> > > Ok, I thought it might be the tiny log, but it didn't improve
> > > anything here when increased the log size, or the log buffer size.
> >
> > One thing I just found out - my old *laptop* is 4-5x faster than the
> > 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> > if the long delays in dispatch is caused by an interaction with CTQ
> > but I can't change it on the cciss raid controllers. Are you using
> > ctq/ncq on your machine?  If so, can you reduce the depth to
> > something less than 4 and see what difference that makes?
>
> Just to point out - this is not a new problem - I can reproduce
> it on 2.6.24 as well as 2.6.26. Likewise, my laptop shows XFS
> being faster than ext3 on both 2.6.24 and 2.6.26. So the difference
> is something related to the disk subsystem on the server....

Interesting. I switched from cfq to deadline some time ago due to abysmal 
XFS performance on parallel IO - an aptitude upgrade while doing desktop 
stuff. Just my subjective perception, but I have seen it crawl, even 
stall for 5-10 seconds at times. I found deadline to be way faster 
initially, but then it occasionally happened that IO for desktop tasks 
was basically stalled for even longer, say 15 seconds or more, under 
parallel IO. However, I can't remember having this problem with the 
latest kernel, 2.6.26.2.

I am now testing with cfq again, on a ThinkPad T42 internal 160 GB 
hard disk with barriers enabled. But you say it only happens on certain 
servers, so I might have seen something different.

Thus I had the rough feeling that something is wrong with at least CFQ and 
XFS together, but I couldn't prove it back then. I have no idea how to 
easily build a reproducible test case. Maybe a script that unpacks 
kernel source archives while I try to use the desktop...

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  6:04               ` Dave Chinner
  2008-08-21  8:07                 ` Aaron Carroll
  2008-08-21  8:25                 ` Dave Chinner
@ 2008-08-21 11:53                 ` Matthew Wilcox
  2008-08-21 15:56                   ` Dave Chinner
  2 siblings, 1 reply; 55+ messages in thread
From: Matthew Wilcox @ 2008-08-21 11:53 UTC (permalink / raw)
  To: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> One thing I just found out - my old *laptop* is 4-5x faster than the
> 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> if the long delays in dispatch is caused by an interaction with CTQ
> but I can't change it on the cciss raid controllers. Are you using
> ctq/ncq on your machine?  If so, can you reduce the depth to
> something less than 4 and see what difference that makes?

I don't think that's going to make a difference when using CFQ.  I did
some tests that showed that CFQ would never issue more than one IO at a
time to a drive.  This was using sixteen userspace threads, each doing a
4k direct I/O to the same location.  When using noop, I would get 70k
IOPS and when using CFQ I'd get around 40k IOPS.
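
(Not the actual harness used for those numbers - just a rough shell
approximation of that workload shape, 16 concurrent 4k O_DIRECT writers
hitting the same offset, with the test file path as a placeholder.  The
per-write dd fork makes it much coarser than a real O_DIRECT benchmark, so
treat it only as the shape of the load while watching "iostat -x 1" under
different schedulers:)

  for i in $(seq 16); do
      ( for n in $(seq 10000); do
            dd if=/dev/zero of=/mnt/test/di.bin bs=4k count=1 \
               oflag=direct conv=notrunc 2>/dev/null
        done ) &
  done
  wait      # meanwhile compare IOPS with "iostat -x 1" in another terminal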

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20 21:25     ` Szabolcs Szakacsits
  2008-08-20 21:39       ` Andrew Morton
@ 2008-08-21 12:51       ` Chris Mason
  1 sibling, 0 replies; 55+ messages in thread
From: Chris Mason @ 2008-08-21 12:51 UTC (permalink / raw)
  To: Szabolcs Szakacsits
  Cc: Ryusuke Konishi, Andrew Morton, linux-fsdevel, linux-kernel

On Thu, 2008-08-21 at 00:25 +0300, Szabolcs Szakacsits wrote:
> On Thu, 21 Aug 2008, Ryusuke Konishi wrote:
> > >> Some impressive benchmark results on SSD are shown in [3],
> > >
> > >heh.  It wipes the floor with everything, including btrfs.
> 
> It seems the benchmark was done over half year ago. It's questionable how 
> relevant today the performance comparison is with actively developed file 
> systems ...
> 

I'd expect that nilfs continues to win postmark.  Btrfs splits data and
metadata into different parts of the disk, so at best btrfs is going to
produce two streams of writes into the SSD while nilfs is doing one.
Most consumer ssds still benefit from huge writes, and so nilfs is
pretty optimal in that case.

The main benefit of the split for btrfs is being able to have different
duplication policies for metadata and data, and faster fsck times
because the metadata is more compact.  Over time that may prove less
relevant on SSD, and changing it in btrfs is just flipping a few bits
during allocation.

-chris



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  8:53                     ` Dave Chinner
  2008-08-21  9:33                       ` Nick Piggin
@ 2008-08-21 14:52                       ` Chris Mason
  1 sibling, 0 replies; 55+ messages in thread
From: Chris Mason @ 2008-08-21 14:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Nick Piggin, gus3, Szabolcs Szakacsits, Andrew Morton,
	linux-fsdevel, linux-kernel, xfs

On Thu, 2008-08-21 at 18:53 +1000, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 05:00:39PM +1000, Nick Piggin wrote:
> > On Thursday 21 August 2008 16:14, Dave Chinner wrote:
> > 
> > > I think that we need to issue explicit unplugs to get the log I/O
> > > dispatched the way we want on all elevators and stop trying to
> > > give elevators implicit hints by abusing the bio types and hoping
> > > they do the right thing....
> > 
> > FWIW, my explicit plugging idea is still hanging around in one of
> > Jens' block trees (actually he refreshed it a couple of months ago).
> > 
> > It provides an API for VM or filesystems to plug and unplug
> > requests coming out of the current process, and it can reduce the
> > need to idle the queue. Needs more performance analysis and tuning
> > though.
> 
> We've already got plenty of explicit unplugs in XFS to get stuff
> moving quickly - I'll just have to add another....
> 

I did some compilebench runs with xfs this morning, creating 30 kernel
trees on the same machine I posted btrfs and xfs numbers with last week.
Btrfs gets between 60 and 75MB/s average depending on the mount options
used; ext4 gets around 60MB/s.

This is a single sata drive that can run at 100MB/s streaming writes.
The numbers show XFS is largely log bound, and that turning off barriers
makes a huge difference.  I'd be happy to try another run with explicit
unplugging somewhere in the transaction commit path.

I think the most relevant number is the count of MB written at the end
of blkparse. I'm not sure why the 4ag XFS writes less, but the numbers
do include calling sync at the end.  None of the filesystems were doing
barriers in these numbers:

Ext4                                9036MiB
Btrfs metadata dup                  9190MiB
Btrfs metadata dup no inline files 10280MiB
XFS 4ag, nobarrier                 14299MiB
XFS 1ag, nobarrier                 17836MiB

This is a long way of saying the xfs log isn't optimal for these kinds
of operations, which isn't really news.  I'm not ripping on xfs here,
this is just one tiny benchmark.

I uploaded some graphs of the IO here:

http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs
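
(The per-filesystem totals above come straight out of the blkparse summary;
a rough way to reproduce them and graphs like these - device, mount point
and benchmark flags are illustrative:

  blktrace -d /dev/sdb -o xfs-run &               # trace the scratch device
  ./compilebench -D /mnt/scratch -i 30 -r 30      # or whatever run is being measured
  sync
  kill %1                                         # stop blktrace

  blkparse -i xfs-run | tail -30                  # summary includes total MiB read/written
  seekwatcher -t xfs-run -o xfs-run.png           # IO graph from the same trace

seekwatcher options here are from memory of the 0.x tool; check its README.)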


XFS:

*** 4ag, 128m log, logbsize=256k
intial create total runs 30 avg 7.48 MB/s (user 0.52s sys 1.04s)

*** 4ag, 128m log, logbsize=256k, nobarrier
intial create total runs 30 avg 21.58 MB/s (user 0.51s sys 1.04s)
http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs/xfs-4ag-nobarrier.png

*** 1ag, 128m log, logbsize=256k, nobarrier
intial create total runs 30 avg 26.28 MB/s (user 0.50s sys 1.15s)
http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs/xfs-nobarrier-1ag.png

It is hard to see in the graph, but it looks like the log is in the
first 128MB of the drive.  If we give XFS an external log device:

*** 1ag 128m external log, logbsize=256k, nobarrier
intial create total runs 30 avg 38.44 MB/s (user 0.51s sys 1.09s)

This graph shows that the log is running more or less seek-free at
30-60MB/s for the whole run.  I'd expect the explicit unplugging to help
the most in this config?

http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs/xfs-external-log-disk.png

Here is the main disk during the run:
http://oss.oracle.com/~mason/seekwatcher/compilebench-30/xfs/xfs-external-log-main-disk.png


*** 1ag 128m external log, logbsize=256k, nobarrier, deadline
intial create total runs 30 avg 34.00 MB/s (user 0.51s sys 1.07s)

Deadline didn't help on this box.

-chris



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21 11:02                   ` Martin Steigerwald
@ 2008-08-21 15:00                     ` Martin Steigerwald
  0 siblings, 0 replies; 55+ messages in thread
From: Martin Steigerwald @ 2008-08-21 15:00 UTC (permalink / raw)
  To: linux-xfs
  Cc: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

[-- Attachment #1: Type: text/plain, Size: 6210 bytes --]

Am Donnerstag 21 August 2008 schrieb Martin Steigerwald:
> Am Donnerstag 21 August 2008 schrieb Dave Chinner:
> > On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 21, 2008 at 03:15:08PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 21, 2008 at 05:46:00AM +0300, Szabolcs Szakacsits 
wrote:
> > > > > On Thu, 21 Aug 2008, Dave Chinner wrote:
> > > > > Everything is default.
> > > > >
> > > > >   % rpm -qf =mkfs.xfs
> > > > >   xfsprogs-2.9.8-7.1
> > > > >
> > > > > which, according to ftp://oss.sgi.com/projects/xfs/cmd_tars, is
> > > > > the latest stable mkfs.xfs. Its output is
> > > > >
> > > > > meta-data=/dev/sda8              isize=256    agcount=4,
> > > > > agsize=1221440 blks =                       sectsz=512   attr=2
> > > > > data     =                       bsize=4096   blocks=4885760,
> > > > > imaxpct=25 =                       sunit=0      swidth=0 blks
> > > > > naming   =version 2              bsize=4096
> > > > > log      =internal log           bsize=4096   blocks=2560,
> > > > > version=2 =                       sectsz=512   sunit=0 blks,
> > > > > lazy-count=0 realtime =none                   extsz=4096
> > > > > blocks=0, rtextents=0
> > > >
> > > > Ok, I thought it might be the tiny log, but it didn't improve
> > > > anything here when increased the log size, or the log buffer
> > > > size.
> > >
> > > One thing I just found out - my old *laptop* is 4-5x faster than
> > > the 10krpm scsi disk behind an old cciss raid controller.  I'm
> > > wondering if the long delays in dispatch is caused by an
> > > interaction with CTQ but I can't change it on the cciss raid
> > > controllers. Are you using ctq/ncq on your machine?  If so, can you
> > > reduce the depth to something less than 4 and see what difference
> > > that makes?
> >
> > Just to point out - this is not a new problem - I can reproduce
> > it on 2.6.24 as well as 2.6.26. Likewise, my laptop shows XFS
> > being faster than ext3 on both 2.6.24 and 2.6.26. So the difference
> > is something related to the disk subsystem on the server....
>
> Interesting. I switched from cfq to deadline some time ago, due to
> abysmal XFS performance on parallel IO - aptitude upgrade and doing
> desktop stuff. Just my subjective perception, but I have seen it crawl,
> even stall for 5-10 seconds easily at times. I found deadline to be way
> faster initially, but then it rarely happened that IO for desktop tasks
> is basically stalled for even longer, say 15 seconds or more, on
> parallel IO. However I can't remember having this problem with the last
> kernel 2.6.26.2.
>
> I am now testing with cfq again. On a ThinkPad T42 internal 160 GB
> harddisk with barriers enabled. But you tell, it only happens on
> certain servers, so I might have seen something different.
>
> Thus I had the rough feeling that something is wrong with at least CFQ
> and XFS together, but I couldn't prove it back then. I have no idea how
> to easily do a reproducable test case. Maybe having a script that
> unpacks kernel source archives while I try to use the desktop...

Okay, some numbers attached:

- On XFS: barrier versus nobarrier makes quite a difference with 
compilebench, and also when rm -rf'ing the large directory tree it leaves 
behind. While I did not time the barrier-enabled compilebench 
directory deletion, I am pretty sure it took way longer. vmstat also 
shows higher throughput without barriers.
 
- On XFS: CFQ versus NOOP does not seem to make that much of a difference, 
at least not with barriers enabled (didn't test without). With NOOP, 
responsiveness was even worse than with CFQ. Opening a context menu on a 
webpage link displayed in Konqueror could easily take a minute or more. I 
think it should never take that long for the OS to respond to user 
input.

- Ext3, NILFS, BTRFS with CFQ: all perform quite well, especially btrfs. The 
nilfs test isn't complete; likely due to checkpoints, the 4G I dedicated 
to it were not enough for the compilebench run to finish (a sketch for 
reclaiming checkpoint space follows right below).
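
(If it really is the checkpoints pinning the space, something along these
lines might free it between runs - only a sketch; exact syntax is in the
lscp/rmcp man pages from nilfs-utils, and snapshots, unlike plain
checkpoints, are never reclaimed:

  lscp                  # list the checkpoints on the nilfs2 volume
  rmcp 2 3 4            # invalidate some plain checkpoints, numbers from lscp
  # nilfs_cleanerd then reclaims the freed segments in the background; a lower
  # protection_period in /etc/nilfs_cleanerd.conf (assumed setting name) lets
  # it reclaim recent checkpoints sooner
)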

So at least here, the performance degradation with XFS seems more related to 
barriers than to the scheduler choice - at least when it comes to the two 
options CFQ and NOOP. But no, I won't switch barriers off permanently on my 
laptop. ;) It would be nice if the performance impact of barriers could be 
reduced a bit, though.

In the end I appear to be seeing something different from the I/O scheduler 
issue discussed here.

Anyway, subjectively I am quite happy with XFS performance nonetheless. But 
since I can't switch from XFS to ext3 or btrfs in a second, I can't 
really compare subjective impressions. Maybe the desktop would respond faster 
with ext3 or btrfs? Who knows?

I think a script which does extensive automated testing would be fine:

- have some basic settings like

SCRATCH_DEV=/dev/sda8 (this should be a real partition in order to be able 
to test barriers which do not work over LVM / device mapper)

SCRATCH_MNT=/mnt/test

- have an array of pre-pre-test setups like

[ echo "cfq" >/sys/block/sda/queue/scheduler ]
[ echo "deadline" >/sys/block/sda/queue/scheduler ]
[ echo "anticipatory" >/sys/block/sda/queue/scheduler ]
[ echo "noop" >/sys/block/sda/queue/scheduler ]

- have an array of pre-test setups like

[ mkfs.xfs -f $SCRATCH_DEV
mount $SCRATCH_DEV $SCRATCH_MNT ]
[ mkfs.xfs -f $SCRATCH_DEV
mount -o nobarrier $SCRATCH_DEV $SCRATCH_MNT ]
[ mkfs.xfs -f $SCRATCH_DEV
mount -o logbsize=256k $SCRATCH_DEV $SCRATCH_MNT ]
[ mkfs.btrfs $SCRATCH_DEV
mount $SCRATCH_DEV $SCRATCH_MNT ]

- have an array of tests like

[ ./compilebench -D /mnt/zeit-btrfs -i 5 -r 10 ]
[ postmark whatever ]
[ iozone whatever ]

- and let it run every combination of those array elements unattended 
(over night;-)

- have all results collected, with the settings for each run and basic 
machine info, in one easy-to-share text file

- then, as an additional feature, let it test responsiveness during each 
running test: make sure there are some files that are not in the 
cache, access one of those files once in a while, and measure 
how long it takes the filesystem to respond (a rough sketch of such a 
driver script follows below)
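
A very rough sketch of such a driver script (assuming compilebench and the
mkfs tools are in $PATH; device names, mount points and the test list are
placeholders, and the responsiveness probe from the last point is left out):

  #!/bin/sh
  SCRATCH_DEV=/dev/sda8      # must be a real partition so barriers work
  SCRATCH_MNT=/mnt/test
  BLKDEV=sda                 # disk underneath SCRATCH_DEV
  RESULTS=results-$(hostname)-$(date +%Y%m%d).txt

  run_one() {  # $1=scheduler  $2=mkfs command  $3=mount options  $4=test command
      echo "$1" > /sys/block/$BLKDEV/queue/scheduler
      $2 "$SCRATCH_DEV" >/dev/null 2>&1 || return
      mount $3 "$SCRATCH_DEV" "$SCRATCH_MNT" || return
      echo "=== sched=$1 mkfs='$2' opts='$3' test='$4'" >> "$RESULTS"
      ( cd "$SCRATCH_MNT" && eval "$4" ) >> "$RESULTS" 2>&1
      umount "$SCRATCH_MNT"
  }

  for sched in cfq deadline anticipatory noop; do
      run_one "$sched" "mkfs.xfs -f" ""             "compilebench -D . -i 5 -r 10"
      run_one "$sched" "mkfs.xfs -f" "-o nobarrier" "compilebench -D . -i 5 -r 10"
      run_one "$sched" "mkfs.btrfs"  ""             "compilebench -D . -i 5 -r 10"
  done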

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: filesystem-benchmarks-compilebench-2008-08-21.txt --]
[-- Type: text/plain, Size: 20700 bytes --]



martin@shambhala:~> date
Do 21. Aug 13:27:49 CEST 2008

shambhala:~> cat /proc/version
Linux version 2.6.26.2-tp42-toi-3.0-rc7a-xfs-ticket-patch
(martin@shambala) (gcc version 4.3.1 (Debian 4.3.1-8) ) #1 PREEMPT Wed
Aug 13 10:10:11 CEST 2008

shambhala:~> apt-show-versions | egrep "(btrfs|nilfs)"
btrfs-modules-2.6.26.2-tp42-toi-3.0-rc7a-xfs-ticket-patch 0.15-1+1
installed: No available version in archive
btrfs-source/lenny uptodate 0.15-1
btrfs-tools/lenny uptodate 0.15-2
nilfs2-modules-2.6.26.2-tp42-toi-3.0-rc7a-xfs-ticket-patch 2.0.4-1+1
installed: No available version in archive
nilfs2-source/sid uptodate 2.0.4-1
nilfs2-tools/lenny uptodate 2.0.5-1

shambhala:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.80GHz
stepping        : 6
cpu MHz         : 600.000
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov
pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe bts est tm2
bogomips        : 1197.54
clflush size    : 64
power management:


martin@shambhala:~> cat /proc/mounts | tail -4
/dev/mapper/shambala-ext3 /mnt/zeit-ext3 ext3
rw,errors=continue,data=ordered 0 0
/dev/mapper/shambala-nilfs /mnt/zeit-nilfs2 nilfs2 rw 0 0
/dev/mapper/shambala-btrfs /mnt/zeit-btrfs btrfs rw 0 0
/dev/mapper/shambala-xfs /mnt/zeit-xfs xfs
rw,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0
martin@shambhala:~> df -hT | tail -8
/dev/mapper/shambala-ext3
              ext3    4,0G  137M  3,7G   4% /mnt/zeit-ext3
/dev/mapper/shambala-nilfs
            nilfs2    4,0G   16M  3,8G   1% /mnt/zeit-nilfs2
/dev/mapper/shambala-btrfs
             btrfs    4,0G   40K  4,0G   1% /mnt/zeit-btrfs
/dev/mapper/shambala-xfs
               xfs    4,0G  4,2M  4,0G   1% /mnt/zeit-xfs

shambhala:~> xfs_info /mnt/zeit-xfs
meta-data=/dev/mapper/shambala-xfs isize=256    agcount=4, agsize=262144
blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1048576, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

martin@shambhala:~> cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]


XFS without barriers, since device mapper doesn't support barrier
requests (http://bugzilla.kernel.org/show_bug.cgi?id=9554):


shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D /mnt/zeit-xfs -i 5 -r
10
using working directory /mnt/zeit-xfs, 5 intial dirs 10 runs
native unpatched native-0 222MB in 25.88 seconds (8.59 MB/s)
native patched native-0 109MB in 5.82 seconds (18.84 MB/s)
native patched compiled native-0 691MB in 33.69 seconds (20.53 MB/s)
create dir kernel-0 222MB in 20.38 seconds (10.91 MB/s)
create dir kernel-1 222MB in 27.27 seconds (8.15 MB/s)
create dir kernel-2 222MB in 26.69 seconds (8.33 MB/s)
create dir kernel-3 222MB in 25.17 seconds (8.83 MB/s)
create dir kernel-4 222MB in 29.52 seconds (7.53 MB/s)
patch dir kernel-2 109MB in 38.54 seconds (2.85 MB/s)
compile dir kernel-2 691MB in 41.60 seconds (16.62 MB/s)
compile dir kernel-4 680MB in 49.46 seconds (13.76 MB/s)
patch dir kernel-4 691MB in 118.19 seconds (5.85 MB/s)
read dir kernel-4 in 77.09 11.89 MB/s
read dir kernel-3 in 30.91 7.19 MB/s
create dir kernel-3116 222MB in 42.73 seconds (5.20 MB/s)
clean kernel-4 691MB in 6.48 seconds (106.73 MB/s)
read dir kernel-1 in 32.08 6.93 MB/s
stat dir kernel-0 in 6.94 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 8.75 MB/s (user 2.05s sys 3.72s)
create total runs 1 avg 5.20 MB/s (user 2.40s sys 5.34s)
patch total runs 2 avg 4.35 MB/s (user 0.83s sys 3.93s)
compile total runs 2 avg 15.19 MB/s (user 0.56s sys 2.90s)
clean total runs 1 avg 106.73 MB/s (user 0.07s sys 0.40s)
read tree total runs 2 avg 7.06 MB/s (user 1.93s sys 3.94s)
read compiled tree total runs 1 avg 11.89 MB/s (user 2.29s sys 6.22s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 6.94 seconds (user 1.13s sys 0.94s)
no runs for stat compiled tree


With barriers on an already heavily populated filesystem - I don't have
an empty one on a raw partition at hand at the moment and I for sure
won't empty this one:

martin@shambhala:~> df -hT | grep /home
/dev/sda5      xfs    112G  104G  8,2G  93% /home

shambhala:~> df -hiT | grep /home
/dev/sda5      xfs       34M    751K     33M    3% /home

shambhala:~> xfs_db -rx /dev/sda5
xfs_db> frag
actual 726986, ideal 703687, fragmentation factor 3.20%
xfs_db> quit
shambhala:~>

martin@shambhala:~> cat /proc/mounts | grep "/home "
/dev/sda5 /home xfs rw,relatime,attr2,logbufs=8,logbsize=256k,noquota 0
0

shambhala:~> xfs_info /home
meta-data=/dev/sda5              isize=256    agcount=6, agsize=4883256
blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=29299536,
imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D
/home/martin/Zeit/compilebench -i 5 -r 10
using working directory /home/martin/Zeit/compilebench, 5 intial dirs 10
runs
native unpatched native-0 222MB in 117.37 seconds (1.89 MB/s)
native patched native-0 109MB in 27.46 seconds (3.99 MB/s)
native patched compiled native-0 691MB in 48.03 seconds (14.40 MB/s)
create dir kernel-0 222MB in 83.55 seconds (2.66 MB/s)
create dir kernel-1 222MB in 86.01 seconds (2.59 MB/s)
create dir kernel-2 222MB in 71.61 seconds (3.11 MB/s)
create dir kernel-3 222MB in 71.73 seconds (3.10 MB/s)
create dir kernel-4 222MB in 61.61 seconds (3.61 MB/s)
patch dir kernel-2 109MB in 63.14 seconds (1.74 MB/s)
compile dir kernel-2 691MB in 45.61 seconds (15.16 MB/s)
compile dir kernel-4 680MB in 50.13 seconds (13.58 MB/s)
patch dir kernel-4 691MB in 154.38 seconds (4.48 MB/s)
read dir kernel-4 in 95.04 9.65 MB/s
read dir kernel-3 in 49.49 4.49 MB/s
create dir kernel-3116 222MB in 79.44 seconds (2.80 MB/s)
clean kernel-4 691MB in 8.64 seconds (80.05 MB/s)
read dir kernel-1 in 71.40 3.11 MB/s
stat dir kernel-0 in 14.44 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 3.01 MB/s (user 2.34s sys 4.30s)
create total runs 1 avg 2.80 MB/s (user 2.36s sys 4.12s)
patch total runs 2 avg 3.11 MB/s (user 0.91s sys 4.07s)
compile total runs 2 avg 14.37 MB/s (user 0.60s sys 2.76s)
clean total runs 1 avg 80.05 MB/s (user 0.09s sys 0.45s)
read tree total runs 2 avg 3.80 MB/s (user 2.00s sys 4.05s)
read compiled tree total runs 1 avg 9.65 MB/s (user 2.36s sys 6.42s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 14.44 seconds (user 1.17s sys 1.07s)
no runs for stat compiled tree

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> rm -rf /home/martin/Zeit/compilebench

I didn't measure it, but it took *ages*, with rm -rf mostly in D
state. Judging by the hard disk noise, a lot of seeks were involved.

vmstat 1 during the rm -rf:

 0  0   2784 748048     20 247160    0    0   160  4628  352 1224 15 14 71  0
 0  0   2784 748056     20 247308    0    0   148  3848  298  442 11 10 79  0
 0  0   2784 747996     20 247428    0    0   120  3377  260  449  9  9 82  0
 0  0   2784 747764     20 247580    0    0   152  4364  324 1094 20 10 70  0
 1  0   2784 747452     20 247736    0    0   156  4356  279  814 15 11 74  0
 0  0   2784 747408     20 247900    0    0   164  4112  360 1131 13 13 74  0
 0  0   2784 747136     20 248064    0    0   164  5128  318  855 16 10 74  0
 0  0   2784 746780     20 248208    0    0   144  4353  305 1066 20 12 68  0
 0  0   2784 746204     20 248336    0    0   128  5388  275  966 14 11 75  0
 1  0   2784 748352     20 248468    0    0   132  5384  314 1234 22 11 67  0
 0  0   2784 748104     20 248604    0    0   136  4873  284  807 16 11 73  0


Same game on the same productively used partition, but now without barriers:

shambhala:~> mount -o remount,nobarrier /home
shambhala:~> cat /proc/mounts | grep "/home "
/dev/sda5 /home xfs
rw,relatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> mkdir /home/martin/Zeit/compilebench

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D
/home/martin/Zeit/compilebench -i 5 -r 10
using working directory /home/martin/Zeit/compilebench, 5 intial dirs 10
runs
native unpatched native-0 222MB in 51.44 seconds (4.32 MB/s)
native patched native-0 109MB in 12.69 seconds (8.64 MB/s)
native patched compiled native-0 691MB in 51.75 seconds (13.36 MB/s)
create dir kernel-0 222MB in 47.64 seconds (4.67 MB/s)
create dir kernel-1 222MB in 53.40 seconds (4.16 MB/s)
create dir kernel-2 222MB in 48.04 seconds (4.63 MB/s)
create dir kernel-3 222MB in 38.26 seconds (5.81 MB/s)
create dir kernel-4 222MB in 34.15 seconds (6.51 MB/s)
patch dir kernel-2 109MB in 50.61 seconds (2.17 MB/s)
compile dir kernel-2 691MB in 37.94 seconds (18.23 MB/s)
compile dir kernel-4 680MB in 45.32 seconds (15.02 MB/s)
patch dir kernel-4 691MB in 107.27 seconds (6.45 MB/s)
read dir kernel-4 in 82.18 11.16 MB/s
read dir kernel-3 in 42.35 5.25 MB/s
create dir kernel-3116 222MB in 38.27 seconds (5.81 MB/s)
clean kernel-4 691MB in 5.92 seconds (116.82 MB/s)
read dir kernel-1 in 73.63 3.02 MB/s
stat dir kernel-0 in 13.77 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 5.16 MB/s (user 2.21s sys 4.23s)
create total runs 1 avg 5.81 MB/s (user 2.18s sys 4.89s)
patch total runs 2 avg 4.31 MB/s (user 0.90s sys 4.05s)
compile total runs 2 avg 16.62 MB/s (user 0.59s sys 3.05s)
clean total runs 1 avg 116.82 MB/s (user 0.09s sys 0.41s)
read tree total runs 2 avg 4.14 MB/s (user 1.90s sys 4.02s)
read compiled tree total runs 1 avg 11.16 MB/s (user 2.28s sys 6.36s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 13.77 seconds (user 1.19s sys 1.01s)
no runs for stat compiled tree


Not as fast as on the clean XFS LV, but still, in almost every case, nearly
twice as fast as with barriers.


shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> time rm -rf
/home/martin/Zeit/compilebench
rm -rf /home/martin/Zeit/compilebench  0,32s user 19,19s system 15% cpu
2:09,79 total

This is definitely faster than before. I didn't measure the exact time on
the first occasion, but it took ages.

vmstat 1 during the rm -rf indicated much higher metadata throughput:

 3  0   2780 827696     20 162492    0    0   280 11109  449  865 31 15 52  2
 0  0   2780 827304     20 162816    0    0   324  6656  468 1009 57  8  7 28
 2  0   2636 828992     20 163364    0    0   540  5317  350  545 30 10 30 31
 2  1   2636 837488     20 164020    0    0   656  7691  394  650 39 12  0 49
 0  0   2224 960360     20 164516    0    0   496 12060  420  549 13 26 56  5
 0  0   2224 959988     20 164904    0    0   388 13704  425  792 16 23 61  0
 0  0   2224 959864     20 165128    0    0   224  6209  363  503 12 10 78  0
 1  0   2224 959376     20 165540    0    0   412 14886  392  513 12 22 66  0


Now with barriers again, but with "noop" as scheduler:

shambhala:~> mount -o remount,barrier /home
shambhala:~> cat /proc/mounts | grep /home
/dev/sda5 /home xfs rw,relatime,attr2,logbufs=8,logbsize=256k,noquota 0
0
shambhala:~> echo "noop" >/sys/block/sda/queue/scheduler
shambhala:~> cat /sys/block/sda/queue/scheduler
[noop] anticipatory deadline cfq

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> mkdir /home/martin/Zeit/compilebench

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D
/home/martin/Zeit/compilebench -i 5 -r 10
using working directory /home/martin/Zeit/compilebench, 5 intial dirs 10
runs
native unpatched native-0 222MB in 97.42 seconds (2.28 MB/s)
native patched native-0 109MB in 20.72 seconds (5.29 MB/s)
native patched compiled native-0 691MB in 46.37 seconds (14.91 MB/s)
create dir kernel-0 222MB in 84.12 seconds (2.64 MB/s)
create dir kernel-1 222MB in 95.18 seconds (2.34 MB/s)
create dir kernel-2 222MB in 74.57 seconds (2.98 MB/s)
create dir kernel-3 222MB in 71.81 seconds (3.10 MB/s)
create dir kernel-4 222MB in 64.77 seconds (3.43 MB/s)
patch dir kernel-2 109MB in 81.22 seconds (1.35 MB/s)
compile dir kernel-2 691MB in 41.87 seconds (16.52 MB/s)
compile dir kernel-4 680MB in 50.35 seconds (13.52 MB/s)
patch dir kernel-4 691MB in 151.03 seconds (4.58 MB/s)
read dir kernel-4 in 82.83 11.07 MB/s
read dir kernel-3 in 48.49 4.59 MB/s
create dir kernel-3116 222MB in 79.43 seconds (2.80 MB/s)
clean kernel-4 691MB in 15.51 seconds (44.59 MB/s)
read dir kernel-1 in 75.36 2.95 MB/s
stat dir kernel-0 in 14.65 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 2.90 MB/s (user 2.35s sys 4.56s)
create total runs 1 avg 2.80 MB/s (user 2.18s sys 3.92s)
patch total runs 2 avg 2.96 MB/s (user 0.87s sys 4.07s)
compile total runs 2 avg 15.02 MB/s (user 0.60s sys 2.73s)
clean total runs 1 avg 44.59 MB/s (user 0.07s sys 0.44s)
read tree total runs 2 avg 3.77 MB/s (user 2.03s sys 3.82s)
read compiled tree total runs 1 avg 11.07 MB/s (user 2.29s sys 6.24s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 14.65 seconds (user 1.12s sys 1.00s)
no runs for stat compiled tree

Some tests run a bit faster, but at the cost of responsiveness to I/O
outside the test (opening a new webpage in Konqueror). Some do not run
faster at all.

Seems that write barriers on/off make the bigger difference here.

One last XFS data point:

vmstat 1 during an rm -rf while switching XFS from nobarrier to
barrier:

 0  0   1976 422236   1784 516840    0    0   508 17160  410  540  7 23 70  0
 1  0   1976 420624   1784 517576    0    0   736 26904  539 1032 14 35 51  0
 0  0   1976 419176   1784 518152    0    0   576 23842  486 1060 17 33 50  0
 0  0   1976 418316   1784 518460    0    0   308 12812  317  552  6 18 76  0
 2  0   1976 417392   1784 518776    0    0   316 16689  360  882  2 23 75  0
 8  0   1976 432948   1784 519252    0    0   476 16710  452  630  8 39 53  0
 0  0   1976 432892   1784 519392    0    0   140  4146  371 1564 14 26 60  0
 0  0   1976 432628   1784 519572    0    0   180  3844  340  660 11 10 79  0
 0  0   1976 432496   1784 519736    0    0   164  3852  328  534  9  8 83  0
 0  0   1976 432372   1784 519920    0    0   176  4100  359  788 19 11 70  0

It's obvious where it was switched to barrier ;)



Now the other filesystems with CFQ enabled.

Ext3:

shambhala:~> echo "cfq" >/sys/block/sda/queue/scheduler
shambhala:~> cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D /mnt/zeit-ext3 -i 5
-r 10
using working directory /mnt/zeit-ext3, 5 intial dirs 10 runs
native unpatched native-0 222MB in 16.90 seconds (13.16 MB/s)
native patched native-0 109MB in 4.63 seconds (23.69 MB/s)
native patched compiled native-0 691MB in 39.78 seconds (17.39 MB/s)
create dir kernel-0 222MB in 12.24 seconds (18.17 MB/s)
create dir kernel-1 222MB in 16.71 seconds (13.31 MB/s)
create dir kernel-2 222MB in 18.50 seconds (12.02 MB/s)
create dir kernel-3 222MB in 18.25 seconds (12.18 MB/s)
create dir kernel-4 222MB in 27.24 seconds (8.16 MB/s)
patch dir kernel-2 109MB in 29.26 seconds (3.75 MB/s)
compile dir kernel-2 691MB in 53.41 seconds (12.95 MB/s)
compile dir kernel-4 680MB in 55.24 seconds (12.32 MB/s)
patch dir kernel-4 691MB in 108.66 seconds (6.36 MB/s)
read dir kernel-4 in 79.38 11.55 MB/s
read dir kernel-3 in 21.65 10.27 MB/s
create dir kernel-3116 222MB in 28.22 seconds (7.88 MB/s)
clean kernel-4 691MB in 17.05 seconds (40.56 MB/s)
read dir kernel-1 in 23.67 9.39 MB/s
stat dir kernel-0 in 9.63 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 12.77 MB/s (user 1.96s sys 3.24s)
create total runs 1 avg 7.88 MB/s (user 1.57s sys 2.39s)
patch total runs 2 avg 5.06 MB/s (user 0.78s sys 3.92s)
compile total runs 2 avg 12.64 MB/s (user 0.54s sys 3.75s)
clean total runs 1 avg 40.56 MB/s (user 0.08s sys 0.36s)
read tree total runs 2 avg 9.83 MB/s (user 1.82s sys 4.32s)
read compiled tree total runs 1 avg 11.55 MB/s (user 2.32s sys 7.02s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 9.63 seconds (user 1.11s sys 0.89s)
no runs for stat compiled tree


nilfs2:

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D /mnt/zeit-nilfs2 -i 5
-r 10
using working directory /mnt/zeit-nilfs2, 5 intial dirs 10 runs
native unpatched native-0 222MB in 20.28 seconds (10.97 MB/s)
native patched native-0 109MB in 8.83 seconds (12.42 MB/s)
native patched compiled native-0 691MB in 42.44 seconds (16.30 MB/s)
create dir kernel-0 222MB in 20.89 seconds (10.65 MB/s)
create dir kernel-1 222MB in 21.13 seconds (10.52 MB/s)
create dir kernel-2 222MB in 20.22 seconds (11.00 MB/s)
create dir kernel-3 222MB in 21.60 seconds (10.30 MB/s)
create dir kernel-4 222MB in 20.63 seconds (10.78 MB/s)
patch dir kernel-2 109MB in 20.97 seconds (5.23 MB/s)
compile dir kernel-2 691MB in 44.40 seconds (15.58 MB/s)
Traceback (most recent call last):
  File "./compilebench", line 631, in <module>
    total_runs += func(dset, rnd)
  File "./compilebench", line 368, in compile_one_dir
    mbs = run_directory(ch[0], dir, "compile dir")
  File "./compilebench", line 241, in run_directory
    fp.write(buf[:cur])
IOError: [Errno 28] No space left on device


Okay, possibly due to those 11 checkpoints it stored. Seems I would need
more than 4 GB for the test to complete. But enough testing for today
;).


btrfs 0.15:

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D /mnt/zeit-btrfs -i 5
-r 10
using working directory /mnt/zeit-btrfs, 5 intial dirs 10 runs
native unpatched native-0 222MB in 13.61 seconds (16.34 MB/s)
native patched native-0 109MB in 3.12 seconds (35.15 MB/s)
native patched compiled native-0 691MB in 28.84 seconds (23.98 MB/s)
create dir kernel-0 222MB in 10.99 seconds (20.23 MB/s)
create dir kernel-1 222MB in 13.95 seconds (15.94 MB/s)
create dir kernel-2 222MB in 14.99 seconds (14.83 MB/s)
create dir kernel-3 222MB in 15.00 seconds (14.82 MB/s)
create dir kernel-4 222MB in 16.16 seconds (13.76 MB/s)
patch dir kernel-2 109MB in 30.09 seconds (3.64 MB/s)
compile dir kernel-2 691MB in 58.05 seconds (11.91 MB/s)
compile dir kernel-4 680MB in 55.23 seconds (12.32 MB/s)
patch dir kernel-4 691MB in 134.20 seconds (5.15 MB/s)
read dir kernel-4 in 108.58 8.44 MB/s
read dir kernel-3 in 43.47 5.12 MB/s
create dir kernel-3116 222MB in 27.81 seconds (8.00 MB/s)
clean kernel-4 691MB in 17.63 seconds (39.23 MB/s)
read dir kernel-1 in 70.31 3.16 MB/s
stat dir kernel-0 in 32.85 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 15.92 MB/s (user 1.06s sys 5.43s)
create total runs 1 avg 8.00 MB/s (user 1.17s sys 7.41s)
patch total runs 2 avg 4.40 MB/s (user 0.88s sys 10.55s)
compile total runs 2 avg 12.12 MB/s (user 0.56s sys 5.34s)
clean total runs 1 avg 39.23 MB/s (user 0.05s sys 2.30s)
read tree total runs 2 avg 4.14 MB/s (user 1.85s sys 10.00s)
read compiled tree total runs 1 avg 8.44 MB/s (user 2.19s sys 16.50s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 32.85 seconds (user 1.01s sys 3.35s)
no runs for stat compiled tree




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21 11:53                 ` Matthew Wilcox
@ 2008-08-21 15:56                   ` Dave Chinner
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-21 15:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Szabolcs Szakacsits, Andrew Morton, linux-fsdevel, linux-kernel,
	xfs

On Thu, Aug 21, 2008 at 05:53:10AM -0600, Matthew Wilcox wrote:
> On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > One thing I just found out - my old *laptop* is 4-5x faster than the
> > 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> > if the long delays in dispatch is caused by an interaction with CTQ
> > but I can't change it on the cciss raid controllers. Are you using
> > ctq/ncq on your machine?  If so, can you reduce the depth to
> > something less than 4 and see what difference that makes?
> 
> I don't think that's going to make a difference when using CFQ.  I did
> some tests that showed that CFQ would never issue more than one IO at a
> time to a drive.  This was using sixteen userspace threads, each doing a
> 4k direct I/O to the same location.  When using noop, I would get 70k
> IOPS and when using CFQ I'd get around 40k IOPS.

Not obviously the same sort of issue. The traces clearly show
multiple nested dispatches and completions so CTQ is definitely
active...

Anyway, after a teeth-pulling exercise of finding the
latest firmware for the machine in a format I could apply, I
upgraded the firmware throughout the machine (disks, raid
controller, system, etc.) and XFS is a *lot* faster. In fact,
it is mostly back to within +/- a small amount of ext3.

run complete:
==========================================================================
				  avg MB/s       user       sys
			runs	 xfs   ext3    xfs ext3    xfs ext3
intial create total      30	6.36   6.29   4.48 3.79   7.03 5.22
create total              7	5.20   5.68   4.47 3.69   7.34 5.23
patch total               6	4.53   5.87   2.26 1.96   6.27 4.86
compile total             9    16.46   9.61   1.74 1.72   9.02 9.74
clean total               4   478.50 553.22   0.09 0.06   0.92 0.70
read tree total           2    13.07  15.62   2.39 2.19   3.68 3.44
read compiled tree        1    53.94  60.91   2.57 2.71   7.35 7.27
delete tree total         3    15.94s  6.82s  1.38 1.06   4.10 1.49
delete compiled tree      1    24.07s  8.70s  1.58 1.18   5.56 2.30
stat tree total           5	3.30s  3.22s  1.09 1.07   0.61 0.53
stat compiled tree total  3	2.93s  3.85s  1.17 1.22   0.59 0.55


The blocktrace looks very regular, too. All the big bursts of
dispatch and completion are gone as are the latencies on
log I/Os. It would appear that ext3 is not sensitive to
concurrent I/O latency like XFS is...

At this point, I'm still interested to know whether the original
results were run with ctq/ncq enabled and, if so, whether it is
introducing latencies or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  9:33                       ` Nick Piggin
@ 2008-08-21 17:08                         ` Dave Chinner
  2008-08-22  2:29                           ` Nick Piggin
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2008-08-21 17:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Thu, Aug 21, 2008 at 07:33:34PM +1000, Nick Piggin wrote:
> > > But existing plugging is below the level of the elevators, and should
> > > only kick in for at most tens of ms at queue idle events, so it sounds
> > > like it may not be your problem. Elevators will need some hint to give
> > > priority to specific requests -- either via the current threads's io
> > > priority, or information attached to bios.
> >
> > It's getting too bloody complex, IMO. What is right for one elevator
> > is wrong for another, so as a filesystem developer I have to pick
> > one to target.
> 
> I don't really see it as too complex. If you know how you want the
> request to be handled, then it should be possible to implement.

That is the problem in a nutshell. Nobody can keep up with all
the shiny new stuff that is being implemented, let alone the
subtle behavioural differences that accumulate through such
change...

> > With the way the elevators have been regressing, 
> > improving and changing behaviour,
> 
> AFAIK deadline, AS, and noop haven't significantly changed for years.

Yet they've regularly shown performance regressions because other
stuff has been changing around them, right?

> > I am starting to think that I 
> > should be picking the noop scheduler.
> > Any 'advanced' scheduler that 
> > is slower than the same test on the noop scheduler needs fixing...
> 
> I disagree. On devices with no seek penalty or their own queueing,
> noop is often the best choice. Same for specialized apps that do
> their own disk scheduling.

A filesystem is nothing but a complex disk scheduler that
has to handle vastly larger queues than an elevator. If the
filesystem doesn't get its disk scheduling right, then the
elevator is irrelevant because nothing will fix the I/O
problems in the filesystem algorithms.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21  8:25                 ` Dave Chinner
  2008-08-21 11:02                   ` Martin Steigerwald
@ 2008-08-21 17:10                   ` Szabolcs Szakacsits
  2008-08-21 17:33                     ` Szabolcs Szakacsits
  1 sibling, 1 reply; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-21 17:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs


On Thu, 21 Aug 2008, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > 
> > One thing I just found out - my old *laptop* is 4-5x faster than the
> > 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> > if the long delays in dispatch is caused by an interaction with CTQ
> > but I can't change it on the cciss raid controllers. Are you using
> > ctq/ncq on your machine?  

It's a laptop and has NCQ. It makes no difference if NCQ is enabled or 
disabled. The problem seems to be XFS only.

> > If so, can you reduce the depth to something less than 4 and see what 
> > difference that makes?
> 
> Just to point out - this is not a new problem - I can reproduce
> it on 2.6.24 as well as 2.6.26. Likewise, my laptop shows XFS
> being faster than ext3 on both 2.6.24 and 2.6.26. So the difference
> is something related to the disk subsystem on the server....

XFS definitely stalls somewhere: stats show virtually no CPU usage and no 
time spent waiting for IO. No other file system produces similar output.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 3146180   7848 600868    0    0     0  4128  790  549  0  2 98  0
 0  0      0 3145200   7848 601524    0    0     0  2372  766  516  0  2 98  0
 1  0      0 3144328   7848 602260    0    0     0  2924  792  542  1  2 98  0
 0  1      0 3143824   7856 602664    0    0     0  4116  732  426  0  2 53 45
 1  0      0 3143068   7856 603136    0    0     0  4676  756  534  0  3 95  1
 0  0      0 3142652   7856 603540    0    0     0  6577  756  436  0  0 100  0
 0  0      0 3141952   7856 604100    0    0     0  5840  764  498  1  3 96  0
 0  0      0 3141424   7856 604544    0    0     0  4752  761  386  0  0 99  0
 0  0      0 3140860   7856 604916    0    0     0  6477  785  495  0  1 98  0
 0  0      0 3139980   7856 605468    0    0     0  2840  743  370  1  2 97  0
 0  0      0 3138464   7856 606884    0    0     0  4902  795  421  0  4 96  0
 0  0      0 3137636   7856 607696    0    0     0  4364  739  395  0  1 99  0
 0  0      0 3136520   7856 608220    0    0     0  6160  774  566  0  2 97  0

	Szaka

-- 
NTFS-3G:  http://ntfs-3g.org

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21 17:10                   ` Szabolcs Szakacsits
@ 2008-08-21 17:33                     ` Szabolcs Szakacsits
  2008-08-22  2:24                       ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-21 17:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs


On Thu, 21 Aug 2008, Szabolcs Szakacsits wrote:
> On Thu, 21 Aug 2008, Dave Chinner wrote:
> > On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > > 
> > > One thing I just found out - my old *laptop* is 4-5x faster than the
> > > 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> > > if the long delays in dispatch is caused by an interaction with CTQ
> > > but I can't change it on the cciss raid controllers. Are you using
> > > ctq/ncq on your machine?  
> 
> It's a laptop and has NCQ. It makes no difference if NCQ is enabled or 
> disabled. The problem seems to be XFS only.

The 'nobarrier' mount option made a big improvement:

                    MB/s    Runtime (s)
                   -----    -----------
  btrfs unstable   17.09        572
  ext3             13.24        877
  btrfs 0.16       12.33        793
  nilfs2 2nd+ runs 11.29        674
  ntfs-3g           8.55        865
  reiserfs          8.38        966
  xfs nobarrier     7.89        949
  nilfs2 1st run    4.95       3800
  xfs               1.88       3901

	Szaka
 
-- 
NTFS-3G:  http://ntfs-3g.org

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21 17:33                     ` Szabolcs Szakacsits
@ 2008-08-22  2:24                       ` Dave Chinner
  2008-08-22  6:49                         ` Martin Steigerwald
  2008-08-22 12:44                         ` Szabolcs Szakacsits
  0 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-22  2:24 UTC (permalink / raw)
  To: Szabolcs Szakacsits; +Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs

On Thu, Aug 21, 2008 at 08:33:50PM +0300, Szabolcs Szakacsits wrote:
> 
> On Thu, 21 Aug 2008, Szabolcs Szakacsits wrote:
> > On Thu, 21 Aug 2008, Dave Chinner wrote:
> > > On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > > > 
> > > > One thing I just found out - my old *laptop* is 4-5x faster than the
> > > > 10krpm scsi disk behind an old cciss raid controller.  I'm wondering
> > > > if the long delays in dispatch is caused by an interaction with CTQ
> > > > but I can't change it on the cciss raid controllers. Are you using
> > > > ctq/ncq on your machine?  
> > 
> > It's a laptop and has NCQ. It makes no difference if NCQ is enabled or 
> > disabled. The problem seems to be XFS only.
> 
> The 'nobarrier' mount option made a big improvement:
> 
>                     MB/s    Runtime (s)
>                    -----    -----------
>   btrfs unstable   17.09        572
>   ext3             13.24        877
>   btrfs 0.16       12.33        793
>   nilfs2 2nd+ runs 11.29        674
>   ntfs-3g           8.55        865
>   reiserfs          8.38        966
>   xfs nobarrier     7.89        949
>   nilfs2 1st run    4.95       3800
>   xfs               1.88       3901

Interesting. Barriers make only a little difference on my laptop;
10-20% slower. But yes, barriers will have this effect on XFS.

If you've got NCQ, then you'd do better to turn off write caching
on the drive, turn off barriers and use NCQ to give you back the
performance that the write cache used to. That is, of course,
assuming the NCQ implementation doesn't suck....
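
(Roughly, assuming the disk is /dev/sda and the filesystem is mounted at
/mnt - untested, adjust the names to your setup:)

 # hdparm -W 0 /dev/sda                    # turn off the drive write cache
 # cat /sys/block/sda/device/queue_depth   # > 1 means NCQ is in use
 # mount -o remount,nobarrier /mnt         # and drop the barriers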

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-21 17:08                         ` Dave Chinner
@ 2008-08-22  2:29                           ` Nick Piggin
  2008-08-25  1:59                             ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: Nick Piggin @ 2008-08-22  2:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Friday 22 August 2008 03:08, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 07:33:34PM +1000, Nick Piggin wrote:

> > I don't really see it as too complex. If you know how you want the
> > request to be handled, then it should be possible to implement.
>
> That is the problem in a nutshell. Nobody can keep up with all
> the shiny new stuff that is being implemented, let alone the
> subtle behavioural differences that accumulate through such
> change...

I'm not sure exactly what you mean.. I certainly have not been keeping
up with all the changes here as I'm spending most of my time on other
things lately...

But from what I see, you've got a fairly good handle on analysing the
elevator behaviour (if only the end result). So if you were to tell
Jens that "these blocks" need more priority, or not to contribute to
a process's usage quota, etc. then I'm sure improvements could be
made.

Or am I completely misunderstanding you? :)


> > > With the way the elevators have been regressing,
> > > improving and changing behaviour,
> >
> > AFAIK deadline, AS, and noop haven't significantly changed for years.
>
> Yet they've regularly shown performance regressions because other
> stuff has been changing around them, right?

Is this rhetorical? Because I don't see how *they* could be showing
regular performance regressions. Deadline literally had its last
behaviour change nearly a year ago, and before that was before
recorded (git) history.

AS hasn't changed much more frequently, although I will grant that it
and CFQ add a lot more complexity. So I would always compare results
with deadline or noop.


> > > I am starting to think that I
> > > should be picking the noop scheduler.
> > > Any 'advanced' scheduler that
> > > is slower than the same test on the noop scheduler needs fixing...
> >
> > I disagree. On devices with no seek penalty or their own queueing,
> > noop is often the best choice. Same for specialized apps that do
> > their own disk scheduling.
>
> A filesystem is nothing but a complex disk scheduler that
> has to handle vastly larger queues than an elevator. If the
> filesystem doesn't get its disk scheduling right, then the
> elevator is irrelevant because nothing will fix the I/O
> problems in the filesystem algorithms.....

I wouldn't say it is so black and white if you have multiple processes
submitting IO. You get more opportunities to sort and merge things in
the disk scheduler, and you can do things like fairness and anticipatory
scheduling. But if XFS does enough of what you need, then by all means
use noop. There is an in-kernel API to change it (although it's
designed more for block devices than filesystems so it might not work
exactly for you).
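
(From userspace the same thing can be done at runtime through sysfs, e.g.
assuming the device is sda:

 # cat /sys/block/sda/queue/scheduler        # current one shown in brackets
 # echo noop > /sys/block/sda/queue/scheduler

or with 'elevator=noop' on the kernel command line to change the default.)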

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-22  2:24                       ` Dave Chinner
@ 2008-08-22  6:49                         ` Martin Steigerwald
  2008-08-22 12:44                         ` Szabolcs Szakacsits
  1 sibling, 0 replies; 55+ messages in thread
From: Martin Steigerwald @ 2008-08-22  6:49 UTC (permalink / raw)
  To: linux-xfs
  Cc: Dave Chinner, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Friday 22 August 2008 Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 08:33:50PM +0300, Szabolcs Szakacsits wrote:
> > On Thu, 21 Aug 2008, Szabolcs Szakacsits wrote:
> > > On Thu, 21 Aug 2008, Dave Chinner wrote:
> > > > On Thu, Aug 21, 2008 at 04:04:18PM +1000, Dave Chinner wrote:
> > > > > One thing I just found out - my old *laptop* is 4-5x faster
> > > > > than the 10krpm scsi disk behind an old cciss raid controller. 
> > > > > I'm wondering if the long delays in dispatch is caused by an
> > > > > interaction with CTQ but I can't change it on the cciss raid
> > > > > controllers. Are you using ctq/ncq on your machine?
> > >
> > > It's a laptop and has NCQ. It makes no difference if NCQ is enabled
> > > or disabled. The problem seems to be XFS only.
> >
> > The 'nobarrier' mount option made a big improvement:
> >
> >                     MB/s    Runtime (s)
> >                    -----    -----------
> >   btrfs unstable   17.09        572
> >   ext3             13.24        877
> >   btrfs 0.16       12.33        793
> >   nilfs2 2nd+ runs 11.29        674
> >   ntfs-3g           8.55        865
> >   reiserfs          8.38        966
> >   xfs nobarrier     7.89        949
> >   nilfs2 1st run    4.95       3800
> >   xfs               1.88       3901
>
> Interesting. Barriers make only a little difference on my laptop;
> 10-20% slower. But yes, barriers will have this effect on XFS.
>
> If you've got NCQ, then you'd do better to turn off write caching
> on the drive, turn off barriers and use NCQ to give you back the
> performance that the write cache used to. That is, of course,
> assuming the NCQ implementation doesn't suck....

See my other post with performance numbers:

Barriers appear to make more than a 50% difference on my laptop for some
operations; on other operations they hardly make a difference at all.
I bet it goes slow mainly when creating or deleting lots of small files.
Looking at vmstat 1 during an rm -rf of a compilebench leftover directory
while switching off barriers shows a difference of even more than 50% in
metadata throughput.

It has this controller

00:1f.1 IDE interface: Intel Corporation 82801DBM (ICH4-M) IDE Controller 
(rev 01)

and this drive

---------------------------------------------------------------------
shambhala:~> hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       Hitachi HTS541616J9AT00
        Serial Number:      SB0442SJDVDDHH
        Firmware Revision:  SB4OA70H
Standards:
        Used: ATA/ATAPI-7 T13 1532D revision 1
        Supported: 7 6 5 4
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  312581808
        device size with M = 1024*1024:      152627 MBytes
        device size with M = 1000*1000:      160041 MBytes (160 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Standby timer values: spec'd by Vendor, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Advanced power management level: 254
        Recommended acoustic management value: 128, current value: 128
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=240ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
           *    Advanced Power Management feature set
                Power-Up In Standby feature set
           *    SET_FEATURES required to spinup after power up
                Address Offset Reserved Area Boot
           *    SET_MAX security extension
           *    Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
           *    IDLE_IMMEDIATE with UNLOAD
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
        not     supported: enhanced erase
        82min for SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000cca525da17b6
        NAA             : 5
        IEEE OUI        : cca
        Unique ID       : 525da17b6
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by the jumper
---------------------------------------------------------------------

with the libata driver, which doesn't use FUA although it is advertised above:

---------------------------------------------------------------------
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't 
support DPO or FUA
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't 
support DPO or FUA
sd 0:0:0:0: [sda] Starting disk
---------------------------------------------------------------------

So AFAIK that should be without NCQ since it's not a SATA drive, and
apparently it's also without FUA (maybe due to the controller?). Maybe the
bad results are due to the lack of NCQ and FUA?
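
A quick way to double check, assuming the disk really is sda (a
queue_depth of 1 should mean no NCQ is being used):

shambhala:~> cat /sys/block/sda/device/queue_depth
shambhala:~> dmesg | grep -iE 'ncq|fua'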

Here are the relevant parts from my other mail:

---------------------------------------------------------------------
With barriers on an already heavily populated filesystem - I don't have
an empty one on a raw partition at hand at the moment and I for sure
won't empty this one:

martin@shambhala:~> df -hT | grep /home
/dev/sda5      xfs    112G  104G  8,2G  93% /home

shambhala:~> df -hiT | grep /home
/dev/sda5      xfs       34M    751K     33M    3% /home

shambhala:~> xfs_db -rx /dev/sda5
xfs_db> frag
actual 726986, ideal 703687, fragmentation factor 3.20%
xfs_db> quit
shambhala:~>

martin@shambhala:~> cat /proc/mounts | grep "/home "
/dev/sda5 /home xfs rw,relatime,attr2,logbufs=8,logbsize=256k,noquota 0 0

shambhala:~> xfs_info /home
meta-data=/dev/sda5              isize=256    agcount=6, agsize=4883256 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=29299536, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D
/home/martin/Zeit/compilebench -i 5 -r 10
using working directory /home/martin/Zeit/compilebench, 5 intial dirs 10
runs
native unpatched native-0 222MB in 117.37 seconds (1.89 MB/s)
native patched native-0 109MB in 27.46 seconds (3.99 MB/s)
native patched compiled native-0 691MB in 48.03 seconds (14.40 MB/s)
create dir kernel-0 222MB in 83.55 seconds (2.66 MB/s)
create dir kernel-1 222MB in 86.01 seconds (2.59 MB/s)
create dir kernel-2 222MB in 71.61 seconds (3.11 MB/s)
create dir kernel-3 222MB in 71.73 seconds (3.10 MB/s)
create dir kernel-4 222MB in 61.61 seconds (3.61 MB/s)
patch dir kernel-2 109MB in 63.14 seconds (1.74 MB/s)
compile dir kernel-2 691MB in 45.61 seconds (15.16 MB/s)
compile dir kernel-4 680MB in 50.13 seconds (13.58 MB/s)
patch dir kernel-4 691MB in 154.38 seconds (4.48 MB/s)
read dir kernel-4 in 95.04 9.65 MB/s
read dir kernel-3 in 49.49 4.49 MB/s
create dir kernel-3116 222MB in 79.44 seconds (2.80 MB/s)
clean kernel-4 691MB in 8.64 seconds (80.05 MB/s)
read dir kernel-1 in 71.40 3.11 MB/s
stat dir kernel-0 in 14.44 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 3.01 MB/s (user 2.34s sys 4.30s)
create total runs 1 avg 2.80 MB/s (user 2.36s sys 4.12s)
patch total runs 2 avg 3.11 MB/s (user 0.91s sys 4.07s)
compile total runs 2 avg 14.37 MB/s (user 0.60s sys 2.76s)
clean total runs 1 avg 80.05 MB/s (user 0.09s sys 0.45s)
read tree total runs 2 avg 3.80 MB/s (user 2.00s sys 4.05s)
read compiled tree total runs 1 avg 9.65 MB/s (user 2.36s sys 6.42s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 14.44 seconds (user 1.17s sys 1.07s)
no runs for stat compiled tree

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> rm -rf /home/martin/Zeit/compilebench

I didn't measure it, but it took *ages* while rm -rf was mostly in D
state. According to the hard disk noise, a lot of seeks were involved.

vmstat 1 during the rm -rf:

 0  0   2784 748048     20 247160    0    0   160  4628  352 1224 15 14 71  0
 0  0   2784 748056     20 247308    0    0   148  3848  298  442 11 10 79  0
 0  0   2784 747996     20 247428    0    0   120  3377  260  449  9  9 82  0
 0  0   2784 747764     20 247580    0    0   152  4364  324 1094 20 10 70  0
 1  0   2784 747452     20 247736    0    0   156  4356  279  814 15 11 74  0
 0  0   2784 747408     20 247900    0    0   164  4112  360 1131 13 13 74  0
 0  0   2784 747136     20 248064    0    0   164  5128  318  855 16 10 74  0
 0  0   2784 746780     20 248208    0    0   144  4353  305 1066 20 12 68  0
 0  0   2784 746204     20 248336    0    0   128  5388  275  966 14 11 75  0
 1  0   2784 748352     20 248468    0    0   132  5384  314 1234 22 11 67  0
 0  0   2784 748104     20 248604    0    0   136  4873  284  807 16 11 73  0

Same game on same productively used partition, but now without barriers:

shambhala:~> mount -o remount,nobarrier /home
shambhala:~> cat /proc/mounts | grep "/home "
/dev/sda5 /home xfs
rw,relatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> mkdir /home/martin/Zeit/compilebench

shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> ./compilebench -D
/home/martin/Zeit/compilebench -i 5 -r 10
using working directory /home/martin/Zeit/compilebench, 5 intial dirs 10
runs
native unpatched native-0 222MB in 51.44 seconds (4.32 MB/s)
native patched native-0 109MB in 12.69 seconds (8.64 MB/s)
native patched compiled native-0 691MB in 51.75 seconds (13.36 MB/s)
create dir kernel-0 222MB in 47.64 seconds (4.67 MB/s)
create dir kernel-1 222MB in 53.40 seconds (4.16 MB/s)
create dir kernel-2 222MB in 48.04 seconds (4.63 MB/s)
create dir kernel-3 222MB in 38.26 seconds (5.81 MB/s)
create dir kernel-4 222MB in 34.15 seconds (6.51 MB/s)
patch dir kernel-2 109MB in 50.61 seconds (2.17 MB/s)
compile dir kernel-2 691MB in 37.94 seconds (18.23 MB/s)
compile dir kernel-4 680MB in 45.32 seconds (15.02 MB/s)
patch dir kernel-4 691MB in 107.27 seconds (6.45 MB/s)
read dir kernel-4 in 82.18 11.16 MB/s
read dir kernel-3 in 42.35 5.25 MB/s
create dir kernel-3116 222MB in 38.27 seconds (5.81 MB/s)
clean kernel-4 691MB in 5.92 seconds (116.82 MB/s)
read dir kernel-1 in 73.63 3.02 MB/s
stat dir kernel-0 in 13.77 seconds

run complete:
========================================================================
==
intial create total runs 5 avg 5.16 MB/s (user 2.21s sys 4.23s)
create total runs 1 avg 5.81 MB/s (user 2.18s sys 4.89s)
patch total runs 2 avg 4.31 MB/s (user 0.90s sys 4.05s)
compile total runs 2 avg 16.62 MB/s (user 0.59s sys 3.05s)
clean total runs 1 avg 116.82 MB/s (user 0.09s sys 0.41s)
read tree total runs 2 avg 4.14 MB/s (user 1.90s sys 4.02s)
read compiled tree total runs 1 avg 11.16 MB/s (user 2.28s sys 6.36s)
no runs for delete tree
no runs for delete compiled tree
stat tree total runs 1 avg 13.77 seconds (user 1.19s sys 1.01s)
no runs for stat compiled tree


Not as fast as on the clean XFS LV, but still almost every time nearly
twice as fast as with barriers.


shambhala:/home/martin/Linux/Dateisysteme/Performance-Messung/
compilebench/compilebench-0.6> time rm -rf
/home/martin/Zeit/compilebench
rm -rf /home/martin/Zeit/compilebench  0,32s user 19,19s system 15% cpu
2:09,79 total

This is definitely faster than before. I didn't measure the exact time on
the first occasion, but it took ages.

vmstat 1 during the rm -rf indicated much higher metadata throughput:

 3  0   2780 827696     20 162492    0    0   280 11109  449  865 31 15 52  2
 0  0   2780 827304     20 162816    0    0   324  6656  468 1009 57  8  7 28
 2  0   2636 828992     20 163364    0    0   540  5317  350  545 30 10 30 31
 2  1   2636 837488     20 164020    0    0   656  7691  394  650 39 12  0 49
 0  0   2224 960360     20 164516    0    0   496 12060  420  549 13 26 56  5
 0  0   2224 959988     20 164904    0    0   388 13704  425  792 16 23 61  0
 0  0   2224 959864     20 165128    0    0   224  6209  363  503 12 10 78  0
 1  0   2224 959376     20 165540    0    0   412 14886  392  513 12 22 66  0

[...]

As last XFS thing:

vmstat 1 during an rm -rf while switching XFS from nobarrier back to
barrier:

 0  0   1976 422236   1784 516840    0    0   508 17160  410  540  7 23 70  0
 1  0   1976 420624   1784 517576    0    0   736 26904  539 1032 14 35 51  0
 0  0   1976 419176   1784 518152    0    0   576 23842  486 1060 17 33 50  0
 0  0   1976 418316   1784 518460    0    0   308 12812  317  552  6 18 76  0
 2  0   1976 417392   1784 518776    0    0   316 16689  360  882  2 23 75  0
 8  0   1976 432948   1784 519252    0    0   476 16710  452  630  8 39 53  0
 0  0   1976 432892   1784 519392    0    0   140  4146  371 1564 14 26 60  0
 0  0   1976 432628   1784 519572    0    0   180  3844  340  660 11 10 79  0
 0  0   1976 432496   1784 519736    0    0   164  3852  328  534  9  8 83  0
 0  0   1976 432372   1784 519920    0    0   176  4100  359  788 19 11 70  0

It's obvious where it was switched to barrier ;)
---------------------------------------------------------------------

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-22  2:24                       ` Dave Chinner
  2008-08-22  6:49                         ` Martin Steigerwald
@ 2008-08-22 12:44                         ` Szabolcs Szakacsits
  2008-08-23 12:52                           ` Szabolcs Szakacsits
  1 sibling, 1 reply; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-22 12:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs


On Fri, 22 Aug 2008, Dave Chinner wrote:
> On Thu, Aug 21, 2008 at 08:33:50PM +0300, Szabolcs Szakacsits wrote:
>
> > The 'nobarrier' mount option made a big improvement:
> 
> Interesting. Barriers make only a little difference on my laptop;
> 10-20% slower. But yes, barriers will have this effect on XFS.
> 
> If you've got NCQ, then you'd do better to turn off write caching
> on the drive, turn off barriers and use NCQ to give you back the
> performance that the write cache used to. That is, of course,
> assuming the NCQ implementation doesn't suck....

Write cache off, nobarrier and AHCI NCQ lowered the XFS result:

                               MB/s    Runtime (s)
                              -----    -----------
  btrfs unstable              17.09        572
  ext3                        13.24        877
  btrfs 0.16                  12.33        793
  ntfs-3g unstable            11.52        673
  nilfs2 2nd+ runs            11.29        674
  reiserfs                     8.38        966
  xfs nobarrier                7.89        949
  nilfs2 1st run               4.95       3800
  xfs nobarrier, ncq, wc off   3.81       1973
  xfs                          1.88       3901

	Szaka

-- 
NTFS-3G:  http://ntfs-3g.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-22 12:44                         ` Szabolcs Szakacsits
@ 2008-08-23 12:52                           ` Szabolcs Szakacsits
  0 siblings, 0 replies; 55+ messages in thread
From: Szabolcs Szakacsits @ 2008-08-23 12:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andrew Morton, linux-fsdevel, linux-kernel, xfs


On Fri, 22 Aug 2008, Szabolcs Szakacsits wrote:
> On Fri, 22 Aug 2008, Dave Chinner wrote:
> > On Thu, Aug 21, 2008 at 08:33:50PM +0300, Szabolcs Szakacsits wrote:
> >
> > > The 'nobarrier' mount option made a big improvement:
> > 
> > Interesting. Barriers make only a little difference on my laptop;
> > 10-20% slower. But yes, barriers will have this effect on XFS.
> > 
> > If you've got NCQ, then you'd do better to turn off write caching
> > on the drive, turn off barriers and use NCQ to give you back the
> > performance that the write cache used to. That is, of course,
> > assuming the NCQ implementation doesn't suck....
> 
> Write cache off, nobarrier and AHCI NCQ lowered the XFS result:
> 
>                                MB/s    Runtime (s)
>                               -----    -----------
>   btrfs unstable              17.09        572
>   ext3                        13.24        877
>   btrfs 0.16                  12.33        793
>   ntfs-3g unstable            11.52        673
>   nilfs2 2nd+ runs            11.29        674
>   reiserfs                     8.38        966
>   xfs nobarrier                7.89        949
>   nilfs2 1st run               4.95       3800
>   xfs nobarrier, ncq, wc off   3.81       1973
>   xfs                          1.88       3901

Retested with a different disk, SATA-II, NCQ, capable of 70-110 MB/s 
read/write:

                               MB/s    Runtime (s)
                              -----    -----------
  btrfs unstable, no dup      51.42        168
  btrfs unstable              42.67        197
  ext4 2.6.26                 35.63        245
  nilfs2 2nd+ runs            26.43        287
  ntfs-3g unstable            21.41        370
  ext3                        19.92        559
  xfs nobarrier               14.17        562
  reiserfs                    13.11        595
  nilfs2 1st run              12.06       3719
  xfs nobarrier, ncq, wc off   6.89       1070
  xfs                          1.95       3786

	Szaka

-- 
NTFS-3G:  http://ntfs-3g.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-22  2:29                           ` Nick Piggin
@ 2008-08-25  1:59                             ` Dave Chinner
  2008-08-25  4:32                               ` Nick Piggin
  2008-08-25 12:01                               ` Jamie Lokier
  0 siblings, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-25  1:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Fri, Aug 22, 2008 at 12:29:10PM +1000, Nick Piggin wrote:
> On Friday 22 August 2008 03:08, Dave Chinner wrote:
> > On Thu, Aug 21, 2008 at 07:33:34PM +1000, Nick Piggin wrote:
> 
> > > I don't really see it as too complex. If you know how you want the
> > > request to be handled, then it should be possible to implement.
> >
> > That is the problem in a nutshell. Nobody can keep up with all
> > the shiny new stuff that is being implemented, let alone the
> > subtle behavioural differences that accumulate through such
> > change...
> 
> I'm not sure exactly what you mean.. I certainly have not been keeping
> up with all the changes here as I'm spending most of my time on other
> things lately...
> 
> But from what I see, you've got a fairly good handle on analysing the
> elevator behaviour (if only the end result).

Only from having to do this analysis over and over again trying to
understand what has changed in the elevator that has negated the
effect of some previous optimisation....
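
(For reference, that sort of analysis can be done by watching blktrace
output while the workload runs - device name assumed:

 # blktrace -d /dev/sda -o - | blkparse -i -

and looking at how long requests sit between being queued (Q) and issued
to the driver (D).)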

> So if you were to tell
> Jens that "these blocks" need more priority, or not to contribute to
> a process's usage quota, etc. then I'm sure improvements could be
> made.

It's exactly this sort of complexity that is the problem. When the
behaviour of such things change, filesystems that are optimised for
the previous behaviour are not updated - we're not even aware that
the elevator has been changed in some subtle manner that breaks
the optimisations that have been done.

To keep on top of this, we keep adding new variations and types and
expect the filesystems to make best use of them (without
documentation) to optimise for certain situations. Example - the
new(ish) BIO_META tag that only CFQ understands. I can change the
way XFS issues bios to use this tag to make CFQ behave the same way
it used to w.r.t. metadata I/O from XFS, but then the deadline and
AS will probably regress because they don't understand that tag and
still need the old optimisations that just got removed. Ditto for
prioritised bio dispatch - CFQ supports it but none of the others
do.

IOWs, I am left with a choice - optimise for a specific elevator
(CFQ) to the detriment of all others (noop, as, deadline), or make
the filesystem work best with the simple elevator (noop) and
consider the smarter schedulers deficient if they are slower than
the noop elevator....

> Or am I completely misunderstanding you? :)

You're suggesting that I add complexity to solve the too much complexity
problem.... ;)

> > > > With the way the elevators have been regressing,
> > > > improving and changing behaviour,
> > >
> > > AFAIK deadline, AS, and noop haven't significantly changed for years.
> >
> > Yet they've regularly shown performance regressions because other
> > stuff has been changing around them, right?
> 
> Is this rhetorical? Because I don't see how *they* could be showing
> regular performance regressions.

I get private email fairly often asking questions as to why XFS is
slower going from, say, 2.6.23 to 2.6.24 and then speeds back up in
2.6.25. I've seen a number of cases where the answer to this was that
elevator 'x' with XFS in 2.6.x was for some reason much,
much slower than the others on that workload on that hardware.

As seen earlier in this thread, this can be caused by a problem with
the hardware, firmware, configuration, driver bugs, etc - there are
so many combinations of variables that can cause performance issues
that often the only 'macro' level change that you can make to avoid
them is to switch schedulers. IOWs, while a specific scheduler has
not changed, the code around it has changed sufficiently for a
specific elevator to show a regression compared to the other
elevators.....

Basically, the complexity of the interactions between the
filesystems, elevators and the storage devices is such that there
are transient second order effects occurring that are not reported
widely because they are easily worked around by switching elevators.

> Deadline literally had its last
> behaviour change nearly a year ago, and before that was before
> recorded (git) history.
> 
> AS hasn't changed much more frequently, although I will grant that it
> and CFQ add a lot more complexity. So I would always compare results
> with deadline or noop.

Which can still change due to things like changes in merging behaviour.
Granted, it is less complex, but we can still have subtle changes
having a major impact on less commonly run workloads...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-25  1:59                             ` Dave Chinner
@ 2008-08-25  4:32                               ` Nick Piggin
  2008-08-25 12:01                               ` Jamie Lokier
  1 sibling, 0 replies; 55+ messages in thread
From: Nick Piggin @ 2008-08-25  4:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: gus3, Szabolcs Szakacsits, Andrew Morton, linux-fsdevel,
	linux-kernel, xfs

On Monday 25 August 2008 11:59, Dave Chinner wrote:
> On Fri, Aug 22, 2008 at 12:29:10PM +1000, Nick Piggin wrote:

> > So if you were to tell
> > Jens that "these blocks" need more priority, or not to contribute to
> > a process's usage quota, etc. then I'm sure improvements could be
> > made.
>
> It's exactly this sort of complexity that is the problem. When the
> behaviour of such things change, filesystems that are optimised for
> the previous behaviour are not updated - we're not even aware that
> the elevator has been changed in some subtle manner that breaks
> the optimisations that have been done.
>
> To keep on top of this, we keep adding new variations and types and
> expect the filesystems to make best use of them (without
> documentation) to optimise for certain situations. Example - the
> new(ish) BIO_META tag that only CFQ understands. I can change the
> way XFS issues bios to use this tag to make CFQ behave the same way
> it used to w.r.t. metadata I/O from XFS, but then the deadline and
> AS will probably regress because they don't understand that tag and
> still need the old optimisations that just got removed. Ditto for
> prioritised bio dispatch - CFQ supports it but none of the others
> do.

I don't know why AS or DL would regress though. What old optimizations
would you be referring to?


> IOWs, I am left with a choice - optimise for a specific elevator
> (CFQ) to the detriment of all others (noop, as, deadline), or make
> the filesystem work best with the simple elevator (noop) and
> consider the smarter schedulers deficient if they are slower than
> the noop elevator....

I don't think this is necessarily such a bad thing to do. It would
be very helpful of course if you could report the workloads where one
is slower than noop so that we can work out what is going wrong and
how we can improve performance with the others.
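
Something as crude as the following (device, mount point and benchmark
invocation all assumed) would already be useful for comparing them on a
given workload:

 # for e in noop deadline anticipatory cfq; do
 >     echo $e > /sys/block/sda/queue/scheduler
 >     sync; echo 3 > /proc/sys/vm/drop_caches
 >     echo "=== $e ==="
 >     ./compilebench -D /mnt/test -i 5 -r 10
 > done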


> > Or am I completely misunderstanding you? :)
>
> You're suggesting that I add complexity to solve the too much complexity
> problem.... ;)

Actually, if it's too much complexity that's the problem for you, then I
do think testing with noop or deadline is a valid thing to do.


> > Is this rhetorical? Because I don't see how *they* could be showing
> > regular performance regressions.
>
> I get private email fairly often asking questions as to why XFS is
> slower going from, say, 2.6.23 to 2.6.24 and then speeds back up in
> 2.6.25. I've seen a number of cases where the answer to this was that
> elevator 'x' with XFS in 2.6.x was for some reason much,
> much slower than the others on that workload on that hardware.
>
> As seen earlier in this thread, this can be caused by a problem with
> the hardware, firmware, configuration, driver bugs, etc - there are
> so many combinations of variables that can cause performance issues
> that often the only 'macro' level change that you can make to avoid
> them is to switch schedulers. IOWs, while a specific scheduler has
> not changed, the code around it has changed sufficiently for a
> specific elevator to show a regression compared to the other
> elevators.....

Fair enough, and you're saying noop isn't so fragile to these other
things changing. I would expect deadline to be pretty good too, in
that regard.


> Basically, the complexity of the interactions between the
> filesystems, elevators and the storage devices is such that there
> are transient second order effects occurring that are not reported
> widely because they are easily worked around by switching elevators.

Well then I don't have a good answer, sorry :P

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-25  1:59                             ` Dave Chinner
  2008-08-25  4:32                               ` Nick Piggin
@ 2008-08-25 12:01                               ` Jamie Lokier
  2008-08-26  3:07                                 ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Jamie Lokier @ 2008-08-25 12:01 UTC (permalink / raw)
  To: Nick Piggin, gus3, Szabolcs Szakacsits, Andrew Morton,
	linux-fsdevel

Dave Chinner wrote:
> To keep on top of this, we keep adding new variations and types and
> expect the filesystems to make best use of them (without
> documentation) to optimise for certain situations. Example - the
> new(ish) BIO_META tag that only CFQ understands. I can change the
> way XFS issues bios to use this tag to make CFQ behave the same way
> it used to w.r.t. metadata I/O from XFS, but then the deadline and
> AS will probably regress because they don't understand that tag and
> still need the old optimisations that just got removed. Ditto for
> prioritised bio dispatch - CFQ supports it but none of the others
> do.


There's nothing wrong with adding BIO_META (for example) and other
hints in _principle_.  You should be able to ignore it with no adverse
effects.  If its not used by a filesystem (and there's nothing else
competing to use the same disk), I would hope to see the same
performance as other kernels which don't have it.

If the elevators are being changed in such a way that old filesystem
code which doesn't use new hint bits is running significantly slower,
surely that's blatant elevator regression, and that's where the bugs
should be reported and fixed?

-- Jamie

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-25 12:01                               ` Jamie Lokier
@ 2008-08-26  3:07                                 ` Dave Chinner
  2008-08-26  3:50                                   ` david
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2008-08-26  3:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Nick Piggin, gus3, Szabolcs Szakacsits, Andrew Morton,
	linux-fsdevel, linux-kernel, xfs

On Mon, Aug 25, 2008 at 01:01:47PM +0100, Jamie Lokier wrote:
> Dave Chinner wrote:
> > To keep on top of this, we keep adding new variations and types and
> > expect the filesystems to make best use of them (without
> > documentation) to optimise for certain situations. Example - the
> > new(ish) BIO_META tag that only CFQ understands. I can change the
> > way XFS issues bios to use this tag to make CFQ behave the same way
> > it used to w.r.t. metadata I/O from XFS, but then the deadline and
> > AS will probably regress because they don't understand that tag and
> > still need the old optimisations that just got removed. Ditto for
> > prioritised bio dispatch - CFQ supports it but none of the others
> > do.
> 
> There's nothing wrong with adding BIO_META (for example) and other
> hints in _principle_.  You should be able to ignore it with no adverse
> effects.  If its not used by a filesystem (and there's nothing else
> competing to use the same disk), I would hope to see the same
> performance as other kernels which don't have it.

Right, but it's what we need to do to make use of that optimisation
that is the problem. For XFS, it needs to replace the current
BIO_SYNC hints we use (even for async I/O) to get metadata
dispatched quickly. i.e. CFQ looks at the sync flag first then the
meta flag.  Hence to take advantage of it, we need to remove the
BIO_SYNC hints we currently use which will change the behaviour on
all other elevators as a side effect.

This is the optimisation problem I'm referring to - the BIO_SYNC
usage was done years ago to get metadata dispatched quickly because
that is what all the elevators did with sync I/O. Now to optimise
for CFQ we need to remove that BIO_SYNC optimisation which is still
valid for the other elevators....

> If the elevators are being changed in such a way that old filesystem
> code which doesn't use new hint bits is running significantly slower,
> surely that's blatant elevator regression, and that's where the bugs
> should be reported and fixed?

Sure, but in reality getting ppl to go through the pain of triage is
extremely rare because it only takes 10s to change elevators and 
make the problem go away...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-26  3:07                                 ` Dave Chinner
@ 2008-08-26  3:50                                   ` david
  2008-08-27  1:20                                     ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: david @ 2008-08-26  3:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jamie Lokier, Nick Piggin, gus3, Szabolcs Szakacsits,
	Andrew Morton, linux-fsdevel, linux-kernel, xfs

On Tue, 26 Aug 2008, Dave Chinner wrote:

> 
> On Mon, Aug 25, 2008 at 01:01:47PM +0100, Jamie Lokier wrote:
>> Dave Chinner wrote:
>>> To keep on top of this, we keep adding new variations and types and
>>> expect the filesystems to make best use of them (without
>>> documentation) to optimise for certain situations. Example - the
>>> new(ish) BIO_META tag that only CFQ understands. I can change the
>>> way XFS issues bios to use this tag to make CFQ behave the same way
>>> it used to w.r.t. metadata I/O from XFS, but then the deadline and
>>> AS will probably regress because they don't understand that tag and
>>> still need the old optimisations that just got removed. Ditto for
>>> prioritised bio dispatch - CFQ supports it but none of the others
>>> do.
>>
>> There's nothing wrong with adding BIO_META (for example) and other
>> hints in _principle_.  You should be able to ignore it with no adverse
>> effects.  If it's not used by a filesystem (and there's nothing else
>> competing to use the same disk), I would hope to see the same
>> performance as other kernels which don't have it.
>
> Right, but it's what we need to do to make use of that optimisation
> that is the problem. For XFS, it needs to replace the current
> BIO_SYNC hints we use (even for async I/O) to get metadata
> dispatched quickly. i.e. CFQ looks at the sync flag first then the
> meta flag.  Hence to take advantage of it, we need to remove the
> BIO_SYNC hints we currently use which will change the behaviour on
> all other elevators as a side effect.
>
> This is the optimisation problem I'm referring to - the BIO_SYNC
> usage was done years ago to get metadata dispatched quickly because
> that is what all the elevators did with sync I/O. Now to optimise
> for CFQ we need to remove that BIO_SYNC optimisation which is still
> valid for the other elevators....
>
>> If the elevators are being changed in such a way that old filesystem
>> code which doesn't use new hint bits is running significantly slower,
>> surely that's blatant elevator regression, and that's where the bugs
>> should be reported and fixed?
>
> Sure, but in reality getting ppl to go through the pain of triage is
> extremely rare because it only takes 10s to change elevators and
> make the problem go away...

it sounds as if the various flag definitions have been evolving, would it 
be worthwhile to step back and try to get the various filesystem folks to 
brainstorm together on what types of hints they would _like_ to see 
supported?

it sounds like you are using 'sync' for things where you really should be 
saying 'metadata' (or 'journal contents'), it's happened to work well 
enough in the past, but it's forcing you to keep tweaking the filesystems. 
it may be better to try and define things from the filesystem point of 
view and let the elevators do the tweaking.

basically I'm proposing a complete rethink of the filesystem <-> elevator 
interface.

David Lang

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-20 16:13   ` Ryusuke Konishi
  2008-08-20 21:25     ` Szabolcs Szakacsits
@ 2008-08-26 10:16     ` Jörn Engel
  2008-08-26 16:54       ` Ryusuke Konishi
  1 sibling, 1 reply; 55+ messages in thread
From: Jörn Engel @ 2008-08-26 10:16 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Thu, 21 August 2008 01:13:45 +0900, Ryusuke Konishi wrote:
> >On Wed, 20 Aug 2008 11:45:05 +0900 Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> wrote:
> 
> 4. To make disk blocks relocatable, NILFS2 maintains a table file (called DAT)
>    which maps virtual disk blocks addresses to usual block addresses.
>    The lifetime information is recorded in the DAT per virtual block address.

Interesting approach.  Does that mean that every block lookup involves
two disk accesses, one for the DAT and one for the actual block?

> The current NILFS2 GC simply reclaims from the oldest segment, so the disk
> partition acts like a ring buffer. (this behaviour can be changed by 
> replacing userland daemon).

Is this userland daemon really necessary?  I do all that stuff in
kernelspace and the amount of code I have is likely less than would be
necessary for the userspace interface alone.  Apart from creating a
plethora of research papers, I never saw much use for pluggable
cleaners.

Did you encounter any nasty deadlocks and how did you solve them?
Finding deadlocks in the vfs-interaction became a hobby of mine when
testing logfs and at least one other lfs seems to have had similar
problems - they exported the inode_lock in their patch. ;)

Jörn

-- 
Consensus is no proof!
-- John Naisbitt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-26 10:16     ` Jörn Engel
@ 2008-08-26 16:54       ` Ryusuke Konishi
  2008-08-27 18:13         ` Jörn Engel
  2008-08-27 18:19         ` Jörn Engel
  0 siblings, 2 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-26 16:54 UTC (permalink / raw)
  To: Jorn Engel; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Tue, 26 Aug 2008 12:16:19 +0200, Jorn Engel wrote:
>On Thu, 21 August 2008 01:13:45 +0900, Ryusuke Konishi wrote:
>> 
>> 4. To make disk blocks relocatable, NILFS2 maintains a table file (called DAT)
>>    which maps virtual disk blocks addresses to usual block addresses.
>>    The lifetime information is recorded in the DAT per virtual block address.
>
>Interesting approach.  Does that mean that every block lookup involves
>two disk accesses, one for the DAT and one for the actual block?

Simply stated, yes.

But the actual number of disk accesses will be fewer because the DAT is
cached like regular files and read-ahead is also applied.
The cache for the DAT works well enough.

>> The current NILFS2 GC simply reclaims from the oldest segment, so the disk
>> partition acts like a ring buffer. (this behaviour can be changed by 
>> replacing userland daemon).
>
>Is this userland daemon really necessary?  I do all that stuff in
>kernelspace and the amount of code I have is likely less than would be
>necessary for the userspace interface alone.  Apart from creating a
>plethora of research papers, I never saw much use for pluggable
>cleaners.

Well, that sounds reasonable.
Still, I cannot say which is better for now.
One colleague intends to develop other types of cleaners, and another
colleague has experimentally made a cleaner with a GUI.
In addition, there are possibilities to integrate attractive features
like defragmentation, background data verification, or remote backups.

>Did you encounter any nasty deadlocks and how did you solve them?
>Finding deadlocks in the vfs-interaction became a hobby of mine when
>testing logfs and at least one other lfs seems to have had similar
>problems - they exported the inode_lock in their patch. ;)
>
>Jorn

Yeah, it was a very tough battle :)
Read is OK.  But write was hard.  I looked at the vfs code over and over
again.
We've implemented NILFS without bringing specific changes into the vfs.
However, if we can find a common basis for LFSes, I'm glad to cooperate
with you.
Though I don't know whether exporting inode_lock is the case or not ;)

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-26  3:50                                   ` david
@ 2008-08-27  1:20                                     ` Dave Chinner
  2008-08-27 21:54                                       ` david
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2008-08-27  1:20 UTC (permalink / raw)
  To: david
  Cc: Jamie Lokier, Nick Piggin, gus3, Szabolcs Szakacsits,
	Andrew Morton, linux-fsdevel, linux-kernel, xfs

On Mon, Aug 25, 2008 at 08:50:14PM -0700, david@lang.hm wrote:
> it sounds as if the various flag definitions have been evolving, would it 
> be worthwhile to step back and try to get the various filesystem folks to
> brainstorm together on what types of hints they would _like_ to see  
> supported?

Three types:

	1. immediate dispatch - merge first with adjacent requests
	   then dispatch
	2. delayed dispatch - queue for a short while to allow
	   merging of requests from above
	3. bulk data - queue and merge. dispatch is completely
	   controlled by the elevator

Basically most metadata and log writes would fall into category 2,
with every logbufs/2 log writes or every log force using category
1 to prevent log I/O from being stalled too long by other I/O.

Data writes from the filesystem would appear as category 3 (read and write)
and are subject to the specific elevator scheduling. That is, things
like the CFQ ionice throttling would work on the bulk data queue,
but not the other queues that the filesystem is using for metadata.

Tagging the I/O as a sync I/O can still be done, but that only
affects category 3 scheduling - category 1 or 2 would do the same
thing whether sync or async....

> it sounds like you are using 'sync' for things where you really should be 
> saying 'metadata' (or 'journal contents'), it's happened to work well  
> enough in the past, but it's forcing you to keep tweaking the 
> filesystems.

Right, because there was no 'metadata' tagging, and 'sync' happened
to do exactly what we needed on all elevators at the time.

> it may be better to try and define things from the 
> filesystem point of view and let the elevators do the tweaking.
>
> basically I'm proposing a complete rethink of the filesystem <-> elevator 
> interface.

Yeah, I've been saying that for a while w.r.t. the filesystem/block
layer interfaces, esp. now with discard requests, data integrity,
device alignment information, barriers, etc being exposed by the
layers below the filesystem, but with no interface for filesystems
to be able to access that information...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-26 16:54       ` Ryusuke Konishi
@ 2008-08-27 18:13         ` Jörn Engel
  2008-08-27 18:19         ` Jörn Engel
  1 sibling, 0 replies; 55+ messages in thread
From: Jörn Engel @ 2008-08-27 18:13 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Wed, 27 August 2008 01:54:30 +0900, Ryusuke Konishi wrote:
> 
> Yeah, it was a very tough battle :)
> Read is OK.  But write was hard.  I looked at the vfs code over and over
> again.
> We've implemented NILFS without bringing specific changes into the vfs.
> However, if we can find a common basis for LFSes, I'm glad to cooperate
> with you.
> Though I don't know whether exporting inode_lock is the case or not ;)

Well, I was looking more for something like a list of problems and
solutions.  Partially because I am plain curious and partially because I
know those are the problem areas of any log-structured filesystem and
they deserve special attention in a review.

In logfs, garbage collection may read (and write) any inode and any
block from any file.  And since garbage collection may be called from
writepage() and write_inode(), the fun included:

P: iget() on the inode being currently written back and locked.
S: Split I_LOCK into I_LOCK and I_SYNC.  Has been merged upstream.

P: iget() on an inode in I_FREEING or I_WILL_FREE state.
S: Add inodes to a list in drop_inode() and remove them again in
   destroy_inode().  iget() in GC context is wrapped in a method that
   checks said list first and return an inode from the list when
   applicable.  Used to hold inode_lock to prevent races, but a
   logfs-local lock is actually sufficient.

If either of the two problems above is solved by calling
ilookup5_nowait() I bet you a fiver that a race with data corruption is
lurking somewhere in the area.

P: find_get_page() or some variant on a page handed to
   logfs_writepage().
S: Use the one available page flag, PG_owner_priv_1 to mark pages that
   are waiting for the single-threaded logfs write path.  If any page GC
   needs is locked, check for PG_owner_priv_1 and if it is set, just use
   the page anyway.  Whoever has set the flag cannot clear it until GC
   has finished.
   If the flag is not set, the page might still be somewhere in the
   logfs write path - before setting the page.  So simply do the check
   in a loop, call schedule() each time, knock on wood and keep your
   fingers crossed that the page will either become unlocked or set
   PG_owner_priv_1 sometime soon.  I'm not proud of this solution but
   know no better one.

So something like the above for nilfs would be useful.  And maybe, just
to be on the safe side, try the following testcase overnight:
- Create tiny filesystem (32M or so).
- Fill filesystem 100% with a single file.
- Rewrite random parts of the file in an endless loop.

Or even better, combine this testcase with some automated system crashes
and do an fsck every time the system comes back up. ;)
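
A rough sketch of the basic loop (loop device, sizes and names picked
arbitrarily, untested, assuming mkfs accepts a device that small):

 # dd if=/dev/zero of=/tmp/nilfs.img bs=1M count=32
 # losetup /dev/loop0 /tmp/nilfs.img
 # mkfs -t nilfs2 /dev/loop0
 # mount -t nilfs2 /dev/loop0 /mnt
 # dd if=/dev/zero of=/mnt/file bs=1M               # fill until ENOSPC
 # blocks=$(( $(stat -c %s /mnt/file) / 4096 ))
 # while true; do
 >     dd if=/dev/urandom of=/mnt/file bs=4k count=1 \
 >        seek=$((RANDOM % blocks)) conv=notrunc 2>/dev/null
 > done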

Jörn

-- 
Money doesn't make you happy.
Happiness doesn't fill your stomach.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-26 16:54       ` Ryusuke Konishi
  2008-08-27 18:13         ` Jörn Engel
@ 2008-08-27 18:19         ` Jörn Engel
  2008-08-29  6:29           ` Ryusuke Konishi
  1 sibling, 1 reply; 55+ messages in thread
From: Jörn Engel @ 2008-08-27 18:19 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Wed, 27 August 2008 01:54:30 +0900, Ryusuke Konishi wrote:
> On Tue, 26 Aug 2008 12:16:19 +0200, Jorn Engel wrote:
> >
> >Interesting approach.  Does that mean that every block lookup involves
> >two disk accesses, one for the DAT and one for the actual block?
> 
> Simply stated, yes.
> 
> But the actual number of disk accesses will be fewer because the DAT is
> cached like regular files and read-ahead is also applied.
> The cache for the DAT works well enough.

Yep.  It is not a bad tradeoff.  You pay with some extra seeks when the
filesystem is freshly mounted but gain a lot of simplicity in garbage
collection.

More questions.  I believe the first two answers are no, but I would like
to be sure.
Do you support compression?
Do you do wear leveling or scrubbing?
How does garbage collection work?  In particular, when the filesystem
runs out of free space, do you depend on the userspace daemon to make
some policy decisions or can the kernel make progress on its own?

Jörn

-- 
There are two ways of constructing a software design: one way is to make
it so simple that there are obviously no deficiencies, and the other is
to make it so complicated that there are no obvious deficiencies.
-- C. A. R. Hoare

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-27  1:20                                     ` Dave Chinner
@ 2008-08-27 21:54                                       ` david
  2008-08-28  1:08                                         ` Dave Chinner
  0 siblings, 1 reply; 55+ messages in thread
From: david @ 2008-08-27 21:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jamie Lokier, Nick Piggin, gus3, Szabolcs Szakacsits,
	Andrew Morton, linux-fsdevel, linux-kernel, xfs

On Wed, 27 Aug 2008, Dave Chinner wrote:

> On Mon, Aug 25, 2008 at 08:50:14PM -0700, david@lang.hm wrote:
>> it sounds as if the various flag definitions have been evolving, would it
>> be worthwhile to step back and try to get the various filesystem folks to
>> brainstorm together on what types of hints they would _like_ to see
>> supported?
>
> Three types:
>
> 	1. immediate dispatch - merge first with adjacent requests
> 	   then dispatch
> 	2. delayed dispatch - queue for a short while to allow
> 	   merging of requests from above
> 	3. bulk data - queue and merge. dispatch is completely
> 	   controlled by the elevator

does this list change if you consider the fact that there may be a raid 
array or some more complex structure for the block device instead of a 
simple single disk partition?

since I am suggesting re-thinking the filesystem <-> elevator interface, 
is there anything you need to have the elevator tell the filesystem? (I'm 
thinking that this may be the path for the filesystem to learn things 
about the block device that's under it, is it a raid array, a solid-state 
drive, etc)

David Lang

> Basically most metadata and log writes would fall into category 2,
> with every logbufs/2 log writes or every log force using category
> 1 to prevent log I/O from being stalled too long by other I/O.
>
> Data writes from the filesystem would appear as category 3 (read and write)
> and are subject to the specific elevator scheduling. That is, things
> like the CFQ ionice throttling would work on the bulk data queue,
> but not the other queues that the filesystem is using for metadata.
>
> Tagging the I/O as a sync I/O can still be done, but that only
> affects category 3 scheduling - category 1 or 2 would do the same
> thing whether sync or async....
>
>> it sounds like you are using 'sync' for things where you really should be
>> saying 'metadata' (or 'journal contents'), it's happened to work well
>> enough in the past, but it's forcing you to keep tweaking the
>> filesystems.
>
> Right, because there was no 'metadata' tagging, and 'sync' happened
> to do exactly what we needed on all elevators at the time.
>
>> it may be better to try and define things from the
>> filesystem point of view and let the elevators do the tweaking.
>>
>> basically I'm proposing a complete rethink of the filesystem <-> elevator
>> interface.
>
> Yeah, I've been saying that for a while w.r.t. the filesystem/block
> layer interfaces, esp. now with discard requests, data integrity,
> device alignment information, barriers, etc being exposed by the
> layers below the filesystem, but with no interface for filesystems
> to be able to access that information...
>
> Cheers,
>
> Dave.
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system)
  2008-08-27 21:54                                       ` david
@ 2008-08-28  1:08                                         ` Dave Chinner
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Chinner @ 2008-08-28  1:08 UTC (permalink / raw)
  To: david
  Cc: Jamie Lokier, Nick Piggin, gus3, Szabolcs Szakacsits,
	Andrew Morton, linux-fsdevel, linux-kernel, xfs

On Wed, Aug 27, 2008 at 02:54:28PM -0700, david@lang.hm wrote:
> On Wed, 27 Aug 2008, Dave Chinner wrote:
>
>> On Mon, Aug 25, 2008 at 08:50:14PM -0700, david@lang.hm wrote:
>>> it sounds as if the various flag definitions have been evolving; would it
>>> be worthwhile to step back and try to get the various filesystem folks to
>>> brainstorm together on what types of hints they would _like_ to see
>>> supported?
>>
>> Three types:
>>
>> 	1. immediate dispatch - merge first with adjacent requests
>> 	   then dispatch
>> 	2. delayed dispatch - queue for a short while to allow
>> 	   merging of requests from above
>> 	3. bulk data - queue and merge. dispatch is completely
>> 	   controlled by the elevator
>
> does this list change if you consider the fact that there may be a raid  
> array or some more complex structure for the block device instead of a  
> simple single disk partition?

No. The whole point of immediate dispatch is that those I/Os are
extremely latency sensitive (i.e. the whole fs can stall waiting on
them), so it doesn't matter what the end target is. The faster the
storage subsystem, the more important it is to dispatch those
I/Os immediately to keep the pipes filled...

> since I am suggesting re-thinking the filesystem <-> elevator interface,  
> is there anything you need to have the elevator tell the filesystem? (I'm 
> thinking that this may be the path for the filesystem to learn things  
> about the block device that's under it: is it a raid array, a solid-state 
> drive, etc)

Not so much the elevator, but the block layer in general. That is:

	- capability reporting
		- barriers and type
		- discard support
		- integrity support
		- maximum number of I/Os that can be in flight
		  before congestion occurs
	- geometry of the underlying storage
		- independent domains within the device (e.g. boundaries
		  of linear concatenations)
		- stripe unit/width per domain
		- optimal I/O size per domain
		- latency characteristics per domain
	- notifiers to indicate change of status due to device
	  hotplug back up to the filesystem
		- barrier status change
		- geometry changes due to on-line volume modification
		  (e.g. raid5/6 rebuild after adding a new disk,
		   added another disk to a linear concat, etc)

I'm sure there's more, but that's the list quickly off the top of
my head.
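
Spelled out as data structures, that wish list might look roughly like
the following; this is only a sketch, and none of these types or fields
exist in the block layer today:

#include <stdint.h>

struct blk_domain_geometry {
        uint64_t start;                 /* boundary of this domain, e.g. one leg of a concat */
        uint64_t len;
        uint32_t stripe_unit;           /* in 512-byte sectors */
        uint32_t stripe_width;
        uint32_t optimal_io_size;
        uint32_t typical_latency_us;    /* rough latency characteristic */
};

struct blk_capabilities {
        unsigned barriers:1;            /* barriers supported */
        unsigned discard:1;             /* discard requests supported */
        unsigned integrity:1;           /* data integrity support */
        uint32_t max_inflight;          /* I/Os in flight before congestion */
        uint32_t nr_domains;            /* independent domains within the device */
        struct blk_domain_geometry *domains;
};

/* A filesystem could register a callback to be told when any of the
 * above changes, e.g. after a raid rebuild or adding a disk to a concat. */
typedef void (*blk_change_notifier)(const struct blk_capabilities *caps,
                                    void *fs_private);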

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-27 18:19         ` Jörn Engel
@ 2008-08-29  6:29           ` Ryusuke Konishi
  2008-08-29  8:40             ` Arnd Bergmann
  2008-08-29 10:45             ` Jörn Engel
  0 siblings, 2 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-29  6:29 UTC (permalink / raw)
  To: Jorn Engel; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

Hi, Jorn

I'll reply to the latter mail first.

On Wed, 27 Aug 2008 20:19:04 +0200,  Jorn Engel wrote:
>More questions.  I believe the first two answers are no, but would like
>to be sure.
>Do you support compression?

No. (as you guessed)

>Do you do wear leveling or scrubbing?

NILFS does not support scrubbing. (as you guessed)
Under the current GC daemon, logs are written sequentially and circularly
across the partition, and as you know, this leads to wear levelling
except for the superblock.

>How does garbage collection work?  In particular, when the filesystem
>runs out of free space, do you depend on the userspace daemon to make
>some policy decisions or can the kernel make progress on its own?

The GC of NILFS depends on the userspace daemon to make policy decisions.
NILFS cannot reclaim disk space on its own though it can work 
(i.e. read, write, or do other operations) without the daemon.
After it runs out of free space, disk full errors will be returned
until GC makes new space.

But, usually the GC will make enough disk space in the background
before that occurs.
The userland GC daemon, which runs in the background, starts to reclaim
logs (to be precise, segments) if there are logs (segments) older than
a certain period, which we call the ``protection period''.
If no such logs are found, it goes to sleep.
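
In outline, the daemon's main loop amounts to something like this; it
is only a sketch of the policy just described, not the actual cleaner
code, and the helper names are invented:

#include <unistd.h>

#define PROTECTION_PERIOD 3600          /* seconds; configurable in practice */

/* Invented helpers standing in for the real segment bookkeeping. */
extern long oldest_segment_age(void);           /* seconds, or -1 if nothing to reclaim */
extern void reclaim_oldest_segment(void);       /* copy live blocks forward, free the segment */

void cleaner_loop(void)
{
        for (;;) {
                long age = oldest_segment_age();

                if (age >= PROTECTION_PERIOD)
                        /* Old enough: no longer protected, so reclaim it. */
                        reclaim_oldest_segment();
                else
                        /* Only recent (protected) segments left; sleep a while. */
                        sleep(10);
        }
}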

Regards,
Ryusuke

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29  6:29           ` Ryusuke Konishi
@ 2008-08-29  8:40             ` Arnd Bergmann
  2008-08-29 10:51               ` konishi.ryusuke
  2008-08-29 10:45             ` Jörn Engel
  1 sibling, 1 reply; 55+ messages in thread
From: Arnd Bergmann @ 2008-08-29  8:40 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: Jorn Engel, Andrew Morton, linux-fsdevel, linux-kernel

On Friday 29 August 2008, Ryusuke Konishi wrote:
> >Do you do wear leveling or scrubbing?
> 
> NILFS does not support scrubbing. (as you guessed)
> Under the current GC daemon, logs are written sequentially and circularly
> across the partition, and as you know, this leads to wear levelling
> except for the superblock.

I don't see how that would cope with file systems that have a lot
of static data. The classic problem of most cheap devices that implement
wear leveling in hardware is that they never move data in an erase block
that is used for read-only data. If 90% of the file system is read-only,
your wear leveling will only work on 10% of the medium, wearing it down
10 times faster than it should.

Can the GC daemon handle this case, e.g. by moving around aging read-only
erase blocks?

	Arnd <><

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29  6:29           ` Ryusuke Konishi
  2008-08-29  8:40             ` Arnd Bergmann
@ 2008-08-29 10:45             ` Jörn Engel
  2008-08-29 16:37               ` Ryusuke Konishi
  1 sibling, 1 reply; 55+ messages in thread
From: Jörn Engel @ 2008-08-29 10:45 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: Andrew Morton, linux-fsdevel, linux-kernel

On Fri, 29 August 2008 15:29:35 +0900, Ryusuke Konishi wrote:
> On Wed, 27 Aug 2008 20:19:04 +0200,  Jorn Engel wrote:
> 
> >Do you do wear leveling or scrubbing?
> 
> NILFS does not support scrubbing. (as you guessed)
> Under the current GC daemon, logs are written sequentially and circularly
> across the partition, and as you know, this leads to wear levelling
> except for the superblock.

I am a bit confused here.  My picture of log-structured filesystems was
always that writes go round-robin _within_ a segment, but new segments
can be picked in any order.  So there is a good chance of some segments
simply never being picked and others constantly being reused.

If nilfs works in the same way, it will by design spread the writes
somewhat better than ext3, to pick an example, but can still lead to
local wear-out if e.g. 98% of the filesystem is full and the remaining
2% receive a high write load.

True wear leveling requires a bit more work.  Either some probabilistic
garbage collection of any random segment, as jffs2 does, or storing some
write counters and keeping them roughly level as logfs does.
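
The probabilistic variant boils down to something like the sketch
below; it only illustrates the idea and is not actual jffs2 (or logfs)
code:

#include <stdlib.h>

#define NR_SEGMENTS 1024

extern int dirtiest_segment(void);      /* the usual cost/benefit victim choice */

/* Mostly pick the dirtiest segment, but once in a while pick any
 * segment at all, so erases eventually reach purely static data too. */
int pick_gc_victim(void)
{
        if (rand() % 100 == 0)
                return rand() % NR_SEGMENTS;
        return dirtiest_segment();
}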

> >How does garbage collection work?  In particular, when the filesystem
> >runs out of free space, do you depend on the userspace daemon to make
> >some policy decisions or can the kernel make progress on its own?
> 
> The GC of NILFS depends on the userspace daemon to make policy decisions.
> NILFS cannot reclaim disk space on its own though it can work 
> (i.e. read, write, or do other operations) without the daemon.
> After it runs out of free space, disk full errors will be returned
> until GC makes new space.

This looks problematic.  In logfs I was very careful to define a
"filesystem full" condition that is independent of GC.  So with a single
writer, -ENOSPC always means the filesystem is full and the only way to
gain some free space is by deleting data again.

In nilfs it appears possible that a single writer receives -ENOSPC and
can simply continue writing until - magically - there is space again
because the GC daemon woke up and freed some more.  That is unexpected,
to say the least.

Which is also one of the reasons why I don't like the userspace daemon
approach very much.  Decent behaviour now requires that you block the
writes, wake up the userspace daemon and wait for it to do its job.  Or
you would have to implement a backup-daemon in kernelspace which gets
called into whenever -ENOSPC would be returned otherwise.
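
The "block the writer and kick the daemon" behaviour would look roughly
like this (invented names, sketch only):

#include <errno.h>

/* Invented helpers; a real implementation would use proper wait queues. */
extern int free_segments(void);
extern void wake_cleaner_daemon(void);
extern void wait_for_cleaner(int timeout_ms);

int reserve_segment_or_enospc(void)
{
        if (free_segments() > 0)
                return 0;

        /* Out of space: wake the userspace GC and give it a short window
         * to reclaim something before failing with -ENOSPC. */
        wake_cleaner_daemon();
        wait_for_cleaner(5000);

        return free_segments() > 0 ? 0 : -ENOSPC;
}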

> But, usually the GC will make enough disk space in the background
> before that occurs.

Usually, yes.  You just have to make sure that in the unusual cases the
filesystem continues to behave correctly. ;)

Jörn

-- 
Homo Sapiens is a goal, not a description.
-- unknown
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29  8:40             ` Arnd Bergmann
@ 2008-08-29 10:51               ` konishi.ryusuke
  2008-08-29 11:04                 ` Jörn Engel
  0 siblings, 1 reply; 55+ messages in thread
From: konishi.ryusuke @ 2008-08-29 10:51 UTC (permalink / raw)
  To: arnd; +Cc: joern, akpm, linux-fsdevel, linux-kernel

On Fri, 29 Aug 2008 10:40:06 +0200, Arnd Bergmann wrote:
> On Friday 29 August 2008, Ryusuke Konishi wrote:
> > >Do you do wear leveling or scrubbing?
> > 
> > NILFS does not support scrubbing. (as you guessed)
> > Under the current GC daemon, logs are written sequentially and circularly
> > across the partition, and as you know, this leads to wear levelling
> > except for the superblock.
> 
> I don't see how that would cope with file systems that have a lot
> of static data. The classic problem of most cheap devices that implement
> wear leveling in hardware is that they never move data in an erase block
> that is used for read-only data. If 90% of the file system is read-only,
> your wear leveling will only work on 10% of the medium, wearing it down
> 10 times faster than it should.
>
> Can the GC daemon handle this case, e.g. by moving around aging read-only
> erase blocks?

Yeah, exactly.  Thank you for this comment.

To minimize aging of the device itself, the userland GC daemon would
need another cleaning policy.  So, in that sense, the answer to the
above question is NO.

Since the primary purpose of NILFS is providing continuous
snapshotting, the GC is not necessarily designed with such a requirement
in mind.

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29 10:51               ` konishi.ryusuke
@ 2008-08-29 11:04                 ` Jörn Engel
  0 siblings, 0 replies; 55+ messages in thread
From: Jörn Engel @ 2008-08-29 11:04 UTC (permalink / raw)
  To: konishi.ryusuke; +Cc: arnd, akpm, linux-fsdevel, linux-kernel

On Fri, 29 August 2008 19:51:56 +0900, konishi.ryusuke@lab.ntt.co.jp wrote:
> 
> Since the primary purpose of NILFS is providing continuous
> snapshotting, the GC is not necessarily designed with such requirement
> in mind.

No shame in that.  One fine day I would like to have a filesystem that
combines all the neat tricks from the half-dozen new filesystems that
are currently under development.  Until then, people will simply have to
pick which one matches their personal requirements best.

Jörn

-- 
There are three principal ways to lose money: wine, women, and engineers.
While the first two are more pleasant, the third is by far the more certain.
-- Baron Rothschild

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29 10:45             ` Jörn Engel
@ 2008-08-29 16:37               ` Ryusuke Konishi
  2008-08-29 19:16                 ` Jörn Engel
  0 siblings, 1 reply; 55+ messages in thread
From: Ryusuke Konishi @ 2008-08-29 16:37 UTC (permalink / raw)
  To: joern; +Cc: akpm, linux-fsdevel, linux-kernel

On Fri, 29 Aug 2008 12:45:00 +0200, Jörn Engel wrote:
> > The GC of NILFS depends on the userspace daemon to make policy decisions.
> > NILFS cannot reclaim disk space on its own though it can work 
> > (i.e. read, write, or do other operations) without the daemon.
> > After it runs out of free space, disk full errors will be returned
> > until GC makes new space.
> 
> This looks problematic.  In logfs I was very careful to define a
> "filesystem full" condition that is independent of GC.  So with a single
> writer, -ENOSPC always means the filesystem is full and the only way to
> gain some free space is by deleting data again.
> ...
> > But, usually the GC will make enough disk space in the background
> > before that occurs.
> 
> Usually, yes.  You just have to make sure that in the unusual cases the
> filesystem continues to behave correctly. ;)

As a side remark, the GC of nilfs runs in the background; it is not
started only after the filesystem runs out of free space.  Basically the
intended meaning of -ENOSPC is the same: it does not mean the GC is
ongoing, but that deletion is required.  Of course this depends on the
GC having kept up with the write load, so the meaning is not strictly
guaranteed.  But at least I won't return -ENOSPC so easily, and will
handle it more gracefully if needed.

On the other hand, there are some differences in premise because nilfs
aims at accumulating past user data and makes it a top priority to
keep data which is overwritten by recent updates.  If users want to
preserve a lot of data in nilfs, the chance of running out of disk
space is higher than with regular file systems.

Cheers,
Ryusuke
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29 16:37               ` Ryusuke Konishi
@ 2008-08-29 19:16                 ` Jörn Engel
  2008-09-01 12:25                   ` Ryusuke Konishi
  0 siblings, 1 reply; 55+ messages in thread
From: Jörn Engel @ 2008-08-29 19:16 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: akpm, linux-fsdevel, linux-kernel

On Sat, 30 August 2008 01:37:29 +0900, Ryusuke Konishi wrote:
> 
> As a side remark, the GC of nilfs runs in the background; it is not
> started only after the filesystem runs out of free space.  Basically the
> intended meaning of -ENOSPC is the same: it does not mean the GC is
> ongoing, but that deletion is required.  Of course this depends on the
> GC having kept up with the write load, so the meaning is not strictly
> guaranteed.  But at least I won't return -ENOSPC so easily, and will
> handle it more gracefully if needed.
> 
> On the other hand, there are some differences in premise because nilfs
> aims at accumulating past user data and makes it a top priority to
> keep data which is overwritten by recent updates.  If users want to
> preserve a lot of data in nilfs, the chance of running out of disk
> space is higher than with regular file systems.

Hm, good point.  With continuous snapshots the rules of the game change
considerably.  So maybe it is ok to depend on the userspace daemon here,
because the space is unreclaimable anyway.

What is the policy on deleting continuous snapshots?  Or can it even be
configured by the administrator (which would be cool)?

Jörn

-- 
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC] nilfs2: continuous snapshotting file system
  2008-08-29 19:16                 ` Jörn Engel
@ 2008-09-01 12:25                   ` Ryusuke Konishi
  0 siblings, 0 replies; 55+ messages in thread
From: Ryusuke Konishi @ 2008-09-01 12:25 UTC (permalink / raw)
  To: joern; +Cc: akpm, linux-fsdevel, linux-kernel

On Fri, 29 Aug 2008 21:16:22 +0200, Jörn Engel wrote:
> On Sat, 30 August 2008 01:37:29 +0900, Ryusuke Konishi wrote:
> > On the other hand, there are some differences in premise because nilfs
> > aims at accumulating past user data and makes it a top priority to
> > keep data which is overwritten by recent updates.  If users want to
> > preserve a lot of data in nilfs, the chance of running out of disk
> > space is higher than with regular file systems.
> 
> Hm, good point.  With continuous snapshots the rules of the game change
> considerably.  So maybe it is ok to depend on the userspace daemon here,
> because the space is unreclaimable anyway.
> 
> What is the policy on deleting continuous snapshots?  Or can it even be
> configured by the administrator (which would be cool)?

First, nilfs never deletes checkpoints marked as snapshots, nor recent
checkpoints whose age is less than the ``protection period''.  These
are the ground rules.
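
Expressed as a check (illustrative only; the structure and function
here are invented, not the actual nilfs code), the ground rules amount
to:

#include <time.h>

struct checkpoint {
        time_t  ctime;          /* creation time of the checkpoint */
        int     is_snapshot;    /* explicitly marked as a snapshot by the user */
};

/* A checkpoint may be reclaimed only if it is not a snapshot and is
 * older than the protection period. */
int checkpoint_is_removable(const struct checkpoint *cp,
                            time_t now, time_t protection_period)
{
        if (cp->is_snapshot)
                return 0;       /* snapshots are never deleted */
        if (now - cp->ctime < protection_period)
                return 0;       /* still protected */
        return 1;               /* removable; the current policy takes these oldest-first */
}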

Based on these rules, the userland GC daemon may delete any of the
removable checkpoints.  The current GC simply deletes removable
checkpoints in chronological order.  More sophisticated policies, for
example one that detects landmark checkpoints and tries to keep them
(a known policy in versioning filesystems), are conceivable.

But I feel the current policy is simple and satisfactory, so I'd like
to leave the other policies to someone who wants to implement them
(e.g. one of my colleagues).

Regards,
Ryusuke Konishi
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2008-09-01 12:26 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-08-20  2:45 [PATCH RFC] nilfs2: continuous snapshotting file system Ryusuke Konishi
2008-08-20  7:43 ` Andrew Morton
2008-08-20  8:22   ` Pekka Enberg
2008-08-20 18:47     ` Ryusuke Konishi
2008-08-20 16:13   ` Ryusuke Konishi
2008-08-20 21:25     ` Szabolcs Szakacsits
2008-08-20 21:39       ` Andrew Morton
2008-08-20 21:48         ` Szabolcs Szakacsits
2008-08-21  2:12         ` Dave Chinner
2008-08-21  2:46           ` Szabolcs Szakacsits
2008-08-21  5:15             ` XFS vs Elevators (was Re: [PATCH RFC] nilfs2: continuous snapshotting file system) Dave Chinner
2008-08-21  6:00               ` gus3
2008-08-21  6:14                 ` Dave Chinner
2008-08-21  7:00                   ` Nick Piggin
2008-08-21  8:53                     ` Dave Chinner
2008-08-21  9:33                       ` Nick Piggin
2008-08-21 17:08                         ` Dave Chinner
2008-08-22  2:29                           ` Nick Piggin
2008-08-25  1:59                             ` Dave Chinner
2008-08-25  4:32                               ` Nick Piggin
2008-08-25 12:01                               ` Jamie Lokier
2008-08-26  3:07                                 ` Dave Chinner
2008-08-26  3:50                                   ` david
2008-08-27  1:20                                     ` Dave Chinner
2008-08-27 21:54                                       ` david
2008-08-28  1:08                                         ` Dave Chinner
2008-08-21 14:52                       ` Chris Mason
2008-08-21  6:04               ` Dave Chinner
2008-08-21  8:07                 ` Aaron Carroll
2008-08-21  8:25                 ` Dave Chinner
2008-08-21 11:02                   ` Martin Steigerwald
2008-08-21 15:00                     ` Martin Steigerwald
2008-08-21 17:10                   ` Szabolcs Szakacsits
2008-08-21 17:33                     ` Szabolcs Szakacsits
2008-08-22  2:24                       ` Dave Chinner
2008-08-22  6:49                         ` Martin Steigerwald
2008-08-22 12:44                         ` Szabolcs Szakacsits
2008-08-23 12:52                           ` Szabolcs Szakacsits
2008-08-21 11:53                 ` Matthew Wilcox
2008-08-21 15:56                   ` Dave Chinner
2008-08-21 12:51       ` [PATCH RFC] nilfs2: continuous snapshotting file system Chris Mason
2008-08-26 10:16     ` Jörn Engel
2008-08-26 16:54       ` Ryusuke Konishi
2008-08-27 18:13         ` Jörn Engel
2008-08-27 18:19         ` Jörn Engel
2008-08-29  6:29           ` Ryusuke Konishi
2008-08-29  8:40             ` Arnd Bergmann
2008-08-29 10:51               ` konishi.ryusuke
2008-08-29 11:04                 ` Jörn Engel
2008-08-29 10:45             ` Jörn Engel
2008-08-29 16:37               ` Ryusuke Konishi
2008-08-29 19:16                 ` Jörn Engel
2008-09-01 12:25                   ` Ryusuke Konishi
2008-08-20  9:47 ` Andi Kleen
2008-08-21  4:57   ` Ryusuke Konishi
