From: Andreas Dilger <adilger@sun.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, Curt Wohlgemuth <curtw@google.com>
Subject: Re: Of block allocation algorithms, fsck times, and file fragmentation
Date: Wed, 06 May 2009 05:50:29 -0600
Message-ID: <20090506115029.GF3209@webber.adilger.int>
In-Reply-To: <E1M1fIm-0001Hw-0f@closure.thunk.org>
On May 06, 2009 07:28 -0400, Theodore Ts'o wrote:
> So that's the good news. However, the block allocation shows that we
> are doing something... strange.  An e2fsck -E fragcheck report shows
> that the large files seem to be written out in 8 megabyte chunks:
>
> 1313(f): expecting 51200 actual extent phys 53248 log 2048 len 2048
> 1351(f): expecting 53248 actual extent phys 57344 log 2048 len 2048
> 1351(f): expecting 59392 actual extent phys 67584 log 4096 len 4096
> 1351(f): expecting 71680 actual extent phys 73728 log 8192 len 2048
> 1351(f): expecting 75776 actual extent phys 77824 log 10240 len 2048
> 1574(f): expecting 77824 actual extent phys 81920 log 6144 len 2048
> 1574(f): expecting 83968 actual extent phys 86016 log 8192 len 12288
> 1574(f): expecting 98304 actual extent phys 100352 log 20480 len 32768
Two things might be involved here:
- IIRC mballoc limits its extent searches to 8MB, so that it doesn't
  waste a lot of cycles looking for huge free chunks when there aren't
  any (a rough sketch of this cap follows after the second point).  For
  Lustre that didn't make much difference, since the largest possible
  IO size at the server is 1MB.  That said, if we have huge delalloc
  files it might make sense to search for more space, possibly whole
  free groups for files > 128MB in size.  Scanning the buddy bitmaps
  isn't very expensive, but loading tens of thousands of them in a
  large filesystem IS.
- it might also relate to pdflush limiting the background writeout from
  a single file and flushing the delalloc pages in a round-robin
  manner.  Without delalloc the blocks would already have been
  allocated at write() time, so the writeout order didn't affect the
  on-disk layout.  With delalloc we may now have an unpleasant
  interaction between how pdflush writes out the dirty pages and how
  the files are allocated on disk.
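To make the first point concrete, here is a rough, purely illustrative
sketch (not the actual fs/ext4/mballoc.c code; normalize_goal and the
constants are names I made up) of how a request-normalization step that
caps the allocation goal at 8MB would chop a large delalloc file into
the ~2048-block extents (8MB with 4KB blocks) visible in the fragcheck
output above:

    #include <stdint.h>

    #define MB(x)               ((uint64_t)(x) << 20)
    #define MAX_PREALLOC_BYTES  MB(8)   /* assumed per-request search cap */

    /*
     * Illustrative only: round a delalloc writeout request up to a
     * power-of-two goal, but never past the 8MB cap.  A 100MB file
     * therefore gets allocated in at most 8MB pieces, matching the
     * "len 2048" pattern (2048 x 4KB blocks) in the fragcheck report.
     */
    static uint64_t normalize_goal(uint64_t request_bytes)
    {
            uint64_t goal = MB(1);

            while (goal < request_bytes && goal < MAX_PREALLOC_BYTES)
                    goal <<= 1;

            return goal;
    }

Letting files > 128MB search past this cap (possibly for whole free
groups, as suggested above) would change the pattern, at the cost of
loading many more buddy bitmaps.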
> Thinking this was perhaps rsync's fault, I tried the experiment where I
> copied the files using tar:
>
> tar -cf - -C /mnt2 . | tar -xpf - -C /mnt .
>
> However, the same pattern was visible.  Tar definitely copies files
> one at a time, so this must be an artifact of the page writeback
> algorithms.
If you can run a similar test with an fsync after each file, I suspect
the layout will be correct.  Alternatively, if the kernel did the
equivalent of "fallocate(KEEP_SIZE)" for the file as soon as writeout
started, it would avoid any interaction between pdflush and the file
allocation.
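Both ideas are easy to prototype from userspace.  Below is a minimal,
hedged sketch (copy_one_file and everything else here are my own names,
error handling trimmed; it assumes a glibc/kernel combination that
exposes the fallocate() wrapper and FALLOC_FL_KEEP_SIZE) of a
one-file-at-a-time copier that reserves space up front and fsyncs
before moving on:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Sketch of the two ideas above, done from userspace:
     *  1. fallocate(FALLOC_FL_KEEP_SIZE) reserves the extent before
     *     pdflush ever sees a dirty page (approximating the proposed
     *     allocate-at-writeout-start behaviour in the kernel);
     *  2. fsync() before the next file resolves the delayed allocation
     *     while only this file has dirty pages outstanding.
     */
    static int copy_one_file(int src_fd, int dst_fd, off_t src_size)
    {
            char buf[1 << 16];
            ssize_t n;

            /* reserve space without changing i_size; non-fatal if unsupported */
            if (fallocate(dst_fd, FALLOC_FL_KEEP_SIZE, 0, src_size) != 0)
                    perror("fallocate");

            while ((n = read(src_fd, buf, sizeof(buf))) > 0)
                    if (write(dst_fd, buf, n) != n)
                            return -1;

            /* force allocation + writeout before the next file is copied */
            if (fsync(dst_fd) != 0)
                    return -1;

            return n < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
            struct stat st;
            int src, dst, rc;

            if (argc != 3)
                    return 1;

            src = open(argv[1], O_RDONLY);
            dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (src < 0 || dst < 0 || fstat(src, &st) != 0)
                    return 1;

            rc = copy_one_file(src, dst, st.st_size);
            close(src);
            close(dst);
            return rc ? 1 : 0;
    }

If the 8MB pattern really comes from the pdflush/delalloc interaction
rather than from mballoc itself, copying with something like this
should show contiguous extents in the fragcheck report.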
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.