linux-fsdevel.vger.kernel.org archive mirror
From: Nick Piggin <npiggin@suse.de>
To: linux-fsdevel@vger.kernel.org
Subject: [patch] fsblock preview
Date: Mon, 15 Sep 2008 10:30:14 +0200
Message-ID: <20080915083014.GA3407@wotan.suse.de>
In-Reply-To: <20080914221500.GH27080@wotan.suse.de>

OK, vger doesn't seem to like my patch, so I'll have to give a url to it,
sorry.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fsblock/2.6.27-rc5/fsb-preview.patch

I've been doing some work on fsblock again lately, so in case anybody might
find it interesting, here is a "preview" patch. Basically it compiles and
runs OK for me here, under a few stress tests. I wouldn't say it is close to
bug free, and it still needs a lot of bits and pieces polished up, like error
handling.

I've also just stripped out the large block size support in the patch I'm
mailing out... I have been developing with ext2 without large block size
support, so those paths have rotted a bit, and besides, they still really
need a few more changes to some VM paths.

Since I last posted fsblock, there have been some big changes:

- Using a per block spinlock to protect most access now. This eliminates
  some races I had against dirtying vs cleaning, and with fsblock
  refcounting and reclaim.

- fsblock_no_cache aka "nobh" mode now works well due to the above. When
  /proc/sys/vm/fsblock_no_cache is 1, you never get fsblocks hanging around
  longer than they have to. You also would never be subject to the circular
  referencing "orphan" pages that buffer heads are subject to.

- RCU is gone. This is actually a good thing because in "nobh" mode, some
  workloads will rapidly allocate and free the structures, and that can
  be costly with RCU.

- struct fsblock has shrunk to 32 bytes on 64-bit, less than 1/3 the size
  of struct buffer_head, although absolute size doesn't matter so much now
  (because of no_cache mode). I even have an optional feature, "bdflush",
  that increases the size... although I do want to keep it within 64 bytes
  (one cacheline).

- Added an "intermediate" mode which provides a ->data pointer in struct
  fsblock_meta, which makes it trivial to transition filesystems to
  fsblock (although they would not be able to support superpage blocks).

- Added ext2 intermediate support.

- Had to modify the VM a little bit in order to close races with freeing a
  page's fsblock before it can be cleaned (or still has a chance to be
  dirtied via mmap). fsblock of course ensures that zero memory allocations
  are required in the writeout path.

- Lockless pagecache has been merged in mainline, which means the largest
  granularity of synchronisation anywhere in the fsblock core code is on a
  per-page basis (buffer uses per-inode private_lock). This is one of the
  reasons I am skeptical that keeping pagecache state in extents is better: it
  would be rather impressive if it could match the straight line speed or
  scalability of fsblock.

- However, I *have* always agreed that it makes sense to keep (some) block
  state in extents, because that is going to change much less frequently, and
  should be represented with fewer extents provided the filesystem layout is
  reasonable. So I've written a (very) basic extent cache for block mappings,
  which can be used by filesystems that don't have good in-memory block
  mapping structures themselves (like ext2, for example). No reclaim for this
  at present, I should just add a simple shrinker.

- bdflush... it's commented out so it won't build by default, but basically
  because fsblock properly keeps block dirty state in sync with page dirty
  state, I can keep a sorted structure of dirty fsblocks per device, and do
  writeout based on that rather than the fragile walking over inodes that
  pdflush does. Of course it won't work with delayed allocation, so something
  would have to be figured out for that (perhaps allocate all outstanding
  blocks before each writeout pass).

  The thing I like about bdflush is that it can easily do nice submit
  ordering of inter-file as well as file/metadata blocks for writeout. I
  don't know if it will come to anything, but at least it is not tightly
  coupled with the core fsblock stuff. It's a bit hacky at the moment ;)

- Still not using a private bdev for fsblock filesystems... I never got around
  to figuring out how to do this. This means that sometimes funny things will
  happen with block_dev device if pages and buffers try to use it. It mostly
  works OK but is a hack that I need to fix.

- Finally, for those not listening last time: I'm doing block sizes larger
  than page size (up to 16MB IIRC, but easily expandable to much higher) with
  fsblock using exactly the same data structures, although I haven't included
  that in the patch here.
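
To make the "basic extent cache for block mappings" idea above a bit more
concrete, here is a rough userspace sketch. It is not taken from the patch:
the names struct extent, struct extent_cache and extent_cache_lookup() are
made up for illustration, and it assumes the cache is simply a sorted,
non-overlapping array of extents searched by binary search. The real code
may well use a different structure.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical extent: 'len' logical (file) blocks starting at 'lblk'
 * map to consecutive device blocks starting at 'pblk'.
 */
struct extent {
	uint64_t lblk;
	uint64_t pblk;
	uint64_t len;
};

/*
 * Hypothetical cache: an array of extents sorted by 'lblk', with no
 * overlaps between entries.
 */
struct extent_cache {
	const struct extent *ext;
	size_t nr;
};

/*
 * Binary search for the extent covering logical block 'lblk'.
 * Returns 1 and stores the device block in *pblk on a hit, 0 on a miss
 * (a miss would mean asking the filesystem for the mapping).
 */
static int extent_cache_lookup(const struct extent_cache *ec,
			       uint64_t lblk, uint64_t *pblk)
{
	size_t lo = 0, hi = ec->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct extent *e = &ec->ext[mid];

		if (lblk < e->lblk) {
			hi = mid;		/* target lies before this extent */
		} else if (lblk - e->lblk >= e->len) {
			lo = mid + 1;		/* target lies after this extent */
		} else {
			*pblk = e->pblk + (lblk - e->lblk);
			return 1;
		}
	}
	return 0;
}
```

The point of the sketch is only that a reasonably laid out file needs few
entries: one extent covers a whole run of contiguous blocks, so a lookup
touches far less state than per-block tracking would.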

