From: Andreas Dilger <adilger@sun.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, Curt Wohlgemuth <curtw@google.com>
Subject: Re: Of block allocation algorithms, fsck times, and file fragmentation
Date: Wed, 06 May 2009 05:50:29 -0600
Message-ID: <20090506115029.GF3209@webber.adilger.int>
In-Reply-To: <E1M1fIm-0001Hw-0f@closure.thunk.org>
On May 06, 2009 07:28 -0400, Theodore Ts'o wrote:
> So that's the good news. However, the block allocation shows that we
> are doing something... strange.  An e2fsck -E fragcheck report shows
> that the large files seem to be written out in 8 megabyte chunks:
>
> 1313(f): expecting 51200 actual extent phys 53248 log 2048 len 2048
> 1351(f): expecting 53248 actual extent phys 57344 log 2048 len 2048
> 1351(f): expecting 59392 actual extent phys 67584 log 4096 len 4096
> 1351(f): expecting 71680 actual extent phys 73728 log 8192 len 2048
> 1351(f): expecting 75776 actual extent phys 77824 log 10240 len 2048
> 1574(f): expecting 77824 actual extent phys 81920 log 6144 len 2048
> 1574(f): expecting 83968 actual extent phys 86016 log 8192 len 12288
> 1574(f): expecting 98304 actual extent phys 100352 log 20480 len 32768
Two things might be involved here:
- IIRC mballoc limits its extent searches to 8MB, so that it doesn't
  waste a lot of cycles looking for huge free chunks when there aren't
  any (a rough sketch of this cap follows after the second point).  For
  Lustre that didn't make much difference, since the largest possible
  IO size at the server is 1MB.  That said, if we have huge delalloc
  files it might make sense to search for more space, possibly whole
  free groups for files > 128MB in size.  Scanning the buddy bitmaps
  isn't very expensive, but loading tens of thousands of them in a
  large filesystem IS.
- it might also relate to pdflush limiting the background writeout from
  a single file and flushing the delalloc pages in a round-robin
  manner.  Without delalloc the blocks would already have been
  allocated at write() time, so the writeout order didn't affect the
  on-disk layout.  With delalloc we may now have an unpleasant
  interaction between how pdflush writes out the dirty pages and how
  the files are allocated on disk.
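To make the first point concrete, here is a rough, purely illustrative
sketch (not the actual fs/ext4/mballoc.c code; normalize_goal and the
constants are names I made up) of how a request-normalization step that
caps the allocation goal at 8MB would chop a large delalloc file into
the ~2048-block extents (8MB with 4KB blocks) visible in the fragcheck
output above:

    #include <stdint.h>

    #define MB(x)               ((uint64_t)(x) << 20)
    #define MAX_PREALLOC_BYTES  MB(8)   /* assumed per-request search cap */

    /*
     * Illustrative only: round a delalloc writeout request up to a
     * power-of-two goal, but never past the 8MB cap.  A 100MB file
     * therefore gets allocated in at most 8MB pieces, matching the
     * "len 2048" pattern (2048 x 4KB blocks) in the fragcheck report.
     */
    static uint64_t normalize_goal(uint64_t request_bytes)
    {
            uint64_t goal = MB(1);

            while (goal < request_bytes && goal < MAX_PREALLOC_BYTES)
                    goal <<= 1;

            return goal;
    }

Letting files > 128MB search past this cap (possibly for whole free
groups, as suggested above) would change the pattern, at the cost of
loading many more buddy bitmaps.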
> Thinking this was perhaps rsync's fault, I tried the experiment where I
> copied the files using tar:
>
> tar -cf - -C /mnt2 . | tar -xpf - -C /mnt .
>
> However, the same pattern was visible.  Tar definitely copies files
> one at a time, so this must be an artifact of the page writeback
> algorithms.
If you can run a similar test with an fsync after each file, I suspect
the layout will be correct.  Alternatively, if the kernel did the
equivalent of "fallocate(KEEP_SIZE)" for the file as soon as writeout
started, it would avoid any interaction between pdflush and the file
allocation.
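Both ideas are easy to prototype from userspace.  Below is a minimal,
hedged sketch (copy_one_file and everything else here are my own names,
error handling trimmed; it assumes a glibc/kernel combination that
exposes the fallocate() wrapper and FALLOC_FL_KEEP_SIZE) of a
one-file-at-a-time copier that reserves space up front and fsyncs
before moving on:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Sketch of the two ideas above, done from userspace:
     *  1. fallocate(FALLOC_FL_KEEP_SIZE) reserves the extent before
     *     pdflush ever sees a dirty page (approximating the proposed
     *     allocate-at-writeout-start behaviour in the kernel);
     *  2. fsync() before the next file resolves the delayed allocation
     *     while only this file has dirty pages outstanding.
     */
    static int copy_one_file(int src_fd, int dst_fd, off_t src_size)
    {
            char buf[1 << 16];
            ssize_t n;

            /* reserve space without changing i_size; non-fatal if unsupported */
            if (fallocate(dst_fd, FALLOC_FL_KEEP_SIZE, 0, src_size) != 0)
                    perror("fallocate");

            while ((n = read(src_fd, buf, sizeof(buf))) > 0)
                    if (write(dst_fd, buf, n) != n)
                            return -1;

            /* force allocation + writeout before the next file is copied */
            if (fsync(dst_fd) != 0)
                    return -1;

            return n < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
            struct stat st;
            int src, dst, rc;

            if (argc != 3)
                    return 1;

            src = open(argv[1], O_RDONLY);
            dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (src < 0 || dst < 0 || fstat(src, &st) != 0)
                    return 1;

            rc = copy_one_file(src, dst, st.st_size);
            close(src);
            close(dst);
            return rc ? 1 : 0;
    }

If the 8MB pattern really comes from the pdflush/delalloc interaction
rather than from mballoc itself, copying with something like this
should show contiguous extents in the fragcheck report.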
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.