From: Andrew Morton <akpm@osdl.org>
To: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
paul.clements@steeleye.com
Subject: Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
Date: Fri, 12 May 2006 23:59:34 -0700
Message-ID: <20060512235934.4f609019.akpm@osdl.org>
In-Reply-To: <17509.22160.118149.49714@cse.unsw.edu.au>
Neil Brown <neilb@suse.de> wrote:
>
> On Friday May 12, akpm@osdl.org wrote:
> > NeilBrown <neilb@suse.de> wrote:
> > >
> > > If md is asked to store a bitmap in a file, it tries to hold onto the
> > > page cache pages for that file, manipulate them directly, and call a
> > > cocktail of operations to write the file out. I don't believe this is
> > > a supportable approach.
> >
> > erk. I think it's better than...
> >
> > > This patch changes the approach to use the same approach as swap files.
> > > i.e. bmap is used to enumerate all the block address of parts of the file
> > > and we write directly to those blocks of the device.
> >
> > That's going in at a much lower level. Even swapfiles don't assume
> > buffer_heads.
>
> I'm not "assuming" buffer_heads. I'm creating buffer heads and using
> them for my own purposes. These are my pages and my buffer heads.
> None of them belong to the filesystem.
Right, so it's incoherent with pagecache and userspace can no longer
usefully read this file.
> The buffer_heads are simply a convenient data-structure to record the
> several block addresses for each page. I could have equally created
> an array storing all the addresses, and built the required bios by
> hand at write time. But buffer_heads did most of the work for me, so
> I used them.
OK.
> Yes, it is a lower level, but
> 1/ I am certain that there will be no kmalloc problems and
> 2/ Because it is exactly the level used by swapfile, I know that it
> is sufficiently 'well defined' that no-one is going to break it.
It would be nicer of course to actually use the mm/page_io.c code. That
would involve implementing swap_bmap() and reimplementing the
get_swap_bio() stuff in terms of a_ops->bmap().
But the swap code can afford to skip blockruns which aren't page-sized and
it uses that capability nicely. You cannot do that.
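
To illustrate the point about page-sized block runs, here is a minimal userspace sketch (not kernel code). It assumes a hypothetical block_map[] array holding the device block that a bmap-style lookup would return for each file block, with 0 meaning a hole. The helper names are invented for illustration. A swapfile can simply ignore any page whose blocks are not one contiguous run; a bitmap file has to write every page, so it cannot.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if every block backing page `page` is mapped (non-zero) and
 * the blocks form one physically contiguous run, else 0.
 * block_map[i] is the (hypothetical) device block for file block i.
 */
static int page_is_contiguous(const unsigned long *block_map, size_t nblocks,
                              size_t page, size_t blocks_per_page)
{
    size_t first = page * blocks_per_page;
    size_t i;

    assert(blocks_per_page > 0);
    if (first + blocks_per_page > nblocks)
        return 0;               /* partial page at end of file */
    if (block_map[first] == 0)
        return 0;               /* hole */
    for (i = 1; i < blocks_per_page; i++)
        if (block_map[first + i] != block_map[first] + i)
            return 0;           /* fragmented page */
    return 1;
}

/* Count the pages a swap-style user could keep; the rest are skipped. */
static size_t count_contiguous_pages(const unsigned long *block_map,
                                     size_t nblocks, size_t blocks_per_page)
{
    size_t pages = nblocks / blocks_per_page;
    size_t page, usable = 0;

    for (page = 0; page < pages; page++)
        usable += page_is_contiguous(block_map, nblocks, page,
                                     blocks_per_page);
    return usable;
}
```

With four blocks per page, a map whose second page is split across two extents and whose third page is a hole yields one usable page. Swap would carry on with that one page; the bitmap code needs all three, which is why it has to track several block addresses per page.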
> >
> > All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me.
> > Operating at the pagecache a_ops level looked better, and more
> > filesystem-independent.
>
> If you really want filesystem independence, you need to use vfs_read
> and vfs_write to read/write the file.
yup.
> I have a patch which did that,
> but decided that the possibility of kmalloc failure at awkward times
> would make that not suitable.
submit_bh() can and will allocate memory, although most decent device
drivers should be OK.
There are tricks we can do with writepage. If the backing filesystem uses
buffer_heads and if you hold a ref on the page then we know that there
won't be any buffer_head allocations nor any disk reads in the writepage
path. It'll go direct into bio_alloc+submit_bio, just as you're doing now.
IOW: no gain.
> So I now use vfs_read to read in the file (just like knfsd) and
> bmap/submit_bh to write out the file (just like swapfile).
>
> I don't think a_ops really provides an interface that I can use, partly
> because, as I said in a previous email, it isn't really a public
> interface to a filesystem.
It's publicer than bmap+submit_bh!
> >
> > I haven't looked at this patch at all closely yet. Do I really need to?
>
> I assume you are asking that because you hope I will retract the
> patch.
Was kinda hoping that would be the outcome. It's rather gruesome, using
set_fs()+vfs_read() on one side and submit_bh() on the other.
Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
is OK?
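
The EOF concern can be made concrete with a small userspace sketch. It assumes the page size is a multiple of the filesystem block size; blocks_in_page() is a hypothetical helper, not anything in the patch. For the last page of a file whose size is not a multiple of PAGE_SIZE, only some of the blocks in that page exist on disk, so only those should be mapped and written.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of filesystem blocks in page `page_index` that lie at least
 * partly inside a file of `file_size` bytes. Pages wholly beyond EOF
 * have no blocks.
 */
static size_t blocks_in_page(size_t file_size, size_t page_index,
                             size_t page_size, size_t block_size)
{
    size_t page_start, bytes;

    assert(block_size > 0 && page_size % block_size == 0);

    page_start = page_index * page_size;
    if (page_start >= file_size)
        return 0;

    bytes = file_size - page_start;
    if (bytes > page_size)
        bytes = page_size;

    /* Round up: a partially filled final block still needs writing. */
    return (bytes + block_size - 1) / block_size;
}
```

For a 5000-byte file with 4096-byte pages and 1024-byte blocks, page 0 has four blocks and page 1 has one. Writing four blocks for page 1 would touch three blocks that were never mapped for the file.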
The loss of pagecache coherency seems sad. I assume there's never a
requirement for userspace to read this file.
invalidate_inode_pages() is best-effort. If someone else has the page
locked or if the page is mmapped then the attempt to take down the
pagecache will fail. That's relatively-OK, because it'll just lead to
userspace seeing wrong stuff, and we've already conceded that.
But if the pagecache is dirty then invalidate_inode_pages() will leave it
dirty and the VM will later write it back, corrupting your bitmap file.
You should get i_writecount, fsync the file and then run
invalidate_inode_pages().
Or not run invalidate_inode_pages() - it doesn't gain anything and will
just reduce the observability of bugs. Better off leaving the pagecache
there all the time so that any rarely-occurring bugs become all-the-time
bugs.
You might as well use kernel_read() too, if you insist on being oddball ;)