From: Neil Brown <neilb@suse.de>
To: Nick Piggin <npiggin@suse.de>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [patch] fsblock preview
Date: Tue, 16 Sep 2008 21:35:16 +1000
Message-ID: <18639.39412.462554.307046@notabene.brown>
In-Reply-To: message from Nick Piggin on Monday September 15
On Monday September 15, npiggin@suse.de wrote:
> OK, vger doesn't seem to like my patch, so I'll have to give a url to it,
> sorry.
Thanks! (A URL is lots better than nothing.)
Some comments.
1/ I think you need to include a design document as a patch to
Documentation/something. This would help me not to ask the same
question twice, such as:
2/ Why do you allow pages in the same file to have either blocks or
buffers. Surely it should be set-in-concrete on a per-address-space
(and probably per-filesystem) basis whether buffer_heads or blocks
or something else is used.
I know I've asked this before so I found the answer in my mail
archives.
Some filesystems store their metadata in the same address-space
that block_device.c uses for reading/writing the device. For this
to work they must both use the same buffering strategy.
My feeling is that the complexity that you add to switch between
buffers and block on a per-page basis is too great a cost, and you
should avoid this.
My preferred approach would be to require the filesystem to use a
separate address-space to store the metadata (even if it is still
physically indexed). However I suspect that would cause breakage
for things like tune2fs as the caches would no longer be coherent.
(Would that actually be a problem for anything other than the
filesystem superblock? In which case, just continue to use
buffer_heads for the superblock and fsblock for the rest).
The alternative would be for block_device.c to either use buffers
or blocks depending on how the filesystem is using it.
So on first open, default to using buffers. When the filesystem
bd_claims the device it can switch it to using blocks. It already
must potentially flush the whole cache if it is changing block
size. This is just a small extra step. You would probably just
replace the address_space_operations structure in something similar
to set_blocksize, between sync_blockdev and kill_bdev.
This might end up being more code, but it would remove some real
ugliness from your current patch.
At the very least, I think that the difference between using
buffer_heads and fsblocks should be entirely encoded in the
PageBlocks flag, and that PagePrivate should be left to mean
exactly what it currently does: That the ->private field is in
use by something.
3/ Why do blocks hold a refcount on the page? This seems wrong.
The setting of PagePrivate (or PageBlock) is the equivalent of a
refcount. If it is set, then releasepage will be called when
the VM wants to drop the page, and invalidatepage will be called
when it insists on dropping the page. No need for a refcount.
I found a hint in the patch which suggests that this is needed for
superpage support. So when you truncate a file at the middle of a
superpage, fsblock can still hold on to the pages that
truncate_inode_pages wants to get rid of. However as you didn't
include that I cannot yet comment on whether that seems
justifiable.
As I think more on this, I'm wondering if I've missed something
(entirely possible). If each block does indeed hold a refcount on
the page, then freeing idle pages which happen to have blocks
attached is not ever going to happen. This would be ok if you
dropped the fsblocks as soon as they weren't in use, but your
commentary seems to suggest that is an option, not a certainty.
Confused.
4/ In truncate_inode_pages_range, you have:
-		if (page_index > end) {
-			next = page_index;
+		next = page_index+1;
+		if (next-1 > end)
 			break;
-		}
-		if (page_index > next)
-			next = page_index;
-		next++;
This reminds me of very similar code in invalidate_mapping_pages
which has the comment
/*
* We really shouldn't be looking at the ->index of an
* unlocked page. But we're not allowed to lock these
* pages. So we rely upon nobody altering the ->index
* of this (pinned-by-us) page.
*/
(inserted by a patch attributed to me, but I think the text is from
Andrew Morton).
The theory is that until you get the page lock, the ->mapping and
->index of a page can change (I'm not sure how. Maybe splice??).
Note that truncate_complete_page (Which is called fairly shortly
after the patch included above) starts with
if (page->mapping != mapping)
return;
which seems to justify the need for caution.
Now, back to the change that you made:
The "before" code will only ever increase next based on the
(possibly random) value retrieved from ->index. Your "after" code
always sets it explicitly. It could be argued that this is not
as safe.
I'm not really sure how important this all is, and whether ->index
really can give a surprise value in this context. But I thought I
would mention it, and suggest that if you really want to make this
change which (along with the reversal of trylock_page and testing
PageWriteback) doesn't seem to be directly related to fsblock, then
maybe it should be a separate patch so it could be reviewed more
effectively. Either way, it would be good to have the code in
'truncate' and 'invalidate' as similar as possible, including
comments.
5/ Still in truncate_inode_pages_range: what is the point of making
any changes in there at all? They seem to be just syntactic
rearrangement with no obvious gain.
I can see that you add a test for PageWriteback before
trylock_page, which could avoid taking a page lock in some cases so
that might be a micro-optimisation.
I can also see that you have dropped a call to cond_resched(),
which probably needs some justification.
In any case, this should all be in a separate patch.
6/ I would love to have more words explaining the new
address_space_operations. We even have a file to contain them:
Documentation/filesystems/vfs.txt.
I'm guessing that the "any private data" is typically
buffers/blocks holding metadata (e.g. indirect blocks)
Hmmm.. looking further it seems that courtesy of struct mb_assoc,
one fsblock of metadata can be shared by multiple address_spaces.
I guess that is for block-allocation bitmaps etc.
I'm beginning to wonder if this belongs in the address_space rather
than directly in the inode. Part of my reasoning for this is that
I didn't think that i_data should normally be used in generic code,
i_mapping should be used instead. But all accesses to private_list
and assoc_mapping use i_data, suggesting that it is really part of
the inode.
I realise that you didn't make it this way, but if it is a bad
design (which I'm beginning to suspect) then we should not
perpetuate it by putting the 'release' and 'sync' operations in the
address_space. They should rather be inode_operations, I think.
7/ Those kmallocs in fsblock_init are a worry. Surely it isn't hard
to come up with a scheme for static allocation.
And I think you should remove "if (i < 30)" and just leave it as
a simple 'else'.
8/ Given that sector_t is unsigned, code like
+ block->block_nr = -1;
(in init_blocks) bothers me.
+ sector_t last_block_nr = (sector_t)-1ULL;
Bothers me even more. "-1" as an unsigned long long ??
I would much prefer you used (~0ULL). But maybe that's just me.
9/ I'm a bit confused by fsblock_writeout_data and its interaction
with writeout_block and the PageWriteback flag. This is in the
subpage case.
fsblock_writeout_data seems to want to write out lots of blocks
in sequential order.
It will find a block, lock the page, and (if the block is still
dirty) start writeout using writeout_block. This will set
PageWriteback on the page.
fsblock_writeout_data will often find the next block which is in
the same page, but as PageWriteback is now set, it will not try to
lock and writeout that block. So I can imagine it writing out
only the first block of each page. Surely this cannot be
intended. Am I missing something?
10/ list_add_tail(new, head) is better than list_add(new, head->prev)
11/ nit: checkpatch.pl reports
total: 67 errors, 141 warnings, 7239 lines checked
I thought to run it because I noticed white-space-damage in
blkdev_get_block.
12/ In 'invalidate_block' you have commented out lock_block, and
put in a comment "why lock?".
The reason to lock is to wait for pending writeout to complete.
If the writeout was triggered by the VFS/VM, then the page will
have PageWriteback set and we will already have waited for that to
clear. However if the writeout was triggered internally by the
filesystem (such as for data ordering guarantees) then
PageWriteback will not have been set, but the block (not the page)
will be locked.
At least that is how I understand buffer_heads to work. However
for fsblock you set PageWriteback whenever you write out any block
on the page. As noted in point 9 above, I'm not convinced that is
the correct thing to do.
And finally:
13/
+ /*
+ * XXX: maybe use private alloc functions so fses can embed block into
+ * their fs-private block rather than using ->private? Maybe ->private
+ * is easier though...
+ */
Yes, that's just what I was going to say.
I find 'struct fsblock_meta' slightly confusing. It extends
'struct fsblock' in two ways.
1/ It embeds fsblock in itself, just adding a 'data' pointer.
2/ It sometimes sets ->private to a 'struct mb_assoc'.
So it looks like you "have a bob each way". Both embedding and using
->private.
It seems to me that you have a core 'fsblock' subsystem, and a
secondary fsblock_meta subsystem which would be useful for some
filesystems and useless to other filesystems which handle metadata
in a different way.
It would be nice if these two were clearly delineated.
It would also be nice if filesystems could extend fsblock in what
ever way suited them.
Using embedding seems to be the most common approach in Linux these
days, and saves a pointer. Using ->private has the advantage of
allowing you to allocate little arrays of fsblocks in common code
because you "know" how big they should be. More importantly it
allows 'for_each_block' to work fairly easily.
Given that the fs would need to allocate its own 'private' info
anyway, for_each_block is really the only benefit of using
->private.
It shouldn't be too hard to use (say) 8 bits for fsblock to record
the size of the full structure (much as you currently record the
number of blocks-per-page) and then for_each_block can use that.
You could even save a couple of bits by allocating one int plus
N filesystem_block structures, and recording 'N' and 'sizeof
filesystem_block' in the 'int'. You could even put the spinlock
bit in that int rather than (ab)using the lsb of page->private.
I think that will do for now.
NeilBrown