* [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure
@ 2007-10-15 8:28 Christian Borntraeger
2007-10-15 14:06 ` Nick Piggin
0 siblings, 1 reply; 77+ messages in thread
From: Christian Borntraeger @ 2007-10-15 8:28 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o
Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit unmaintained,
so I decided to send the patch to you :-).
I have CCed Ted, who worked on the code in the 90s. I could not find a current
email address for Chad Page.
We have seen ramdisk-based install systems where some pages of mapped
libraries and programs were suddenly zeroed under memory pressure. This
should not happen, as the ramdisk avoids freeing its pages by keeping them
dirty all the time.
It turns out that there is a case where the VM makes a ramdisk page clean
without telling the ramdisk driver.
Under memory pressure shrink_zone runs and ends up calling shrink_active_list.
There it checks buffer_heads_over_limit and, if that is true, calls
pagevec_strip. pagevec_strip calls try_to_release_page. If the mapping has no
releasepage callback, try_to_free_buffers is called instead. try_to_free_buffers
now has special logic for some file systems that marks a dirty page clean if
all of its buffers are clean. That is what happened in our test case.
The solution is to provide a no-op releasepage callback for the ramdisk driver.
This prevents try_to_free_buffers from being called on ramdisk pages.
To trigger the problem, you have to make buffer_heads_over_limit true, which
means one of the following (a sketch of how the limit is computed follows the list):
- lower max_buffer_heads
or
- have a system with lots of high memory
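The reason lots of high memory matters is that the limit is sized against low
memory only, so on a box with lots of high memory the page cache can carry far
more buffer heads than the low-memory-based limit allows. Roughly, paraphrased
from fs/buffer.c of this vintage (not verbatim):

	/* buffer_init(): allow buffer heads to occupy about 10% of low
	 * memory (ZONE_NORMAL); highmem is not counted. */
	nrpages = (nr_free_buffer_pages() * 10) / 100;
	max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));

	/* recalc_bh_state(): the flag flips once the total number of
	 * allocated buffer heads crosses that limit. */
	buffer_heads_over_limit = (tot > max_buffer_heads);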
Andrew, if there are no objections, please apply. The patch applies against
today's git.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
drivers/block/rd.c | 13 +++++++++++++
1 files changed, 13 insertions(+)
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
return 0;
}
+/*
+ * releasepage is called by pagevec_strip/try_to_release_page if
+ * buffer_heads_over_limit is true. Without a releasepage function
+ * try_to_free_buffers is called instead. That can unset the dirty
+ * bit of our ram disk pages, which will then eventually be freed, even
+ * if the page is still in use.
+ */
+static int ramdisk_releasepage(struct page *page, gfp_t dummy)
+{
+ return 0;
+}
+
static const struct address_space_operations ramdisk_aops = {
.readpage = ramdisk_readpage,
.prepare_write = ramdisk_prepare_write,
@@ -196,6 +208,7 @@ static const struct address_space_operat
.writepage = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
+ .releasepage = ramdisk_releasepage,
};
static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
--
^ permalink raw reply [flat|nested] 77+ messages in thread* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 8:28 [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure Christian Borntraeger @ 2007-10-15 14:06 ` Nick Piggin 2007-10-15 9:05 ` Christian Borntraeger 2007-10-15 9:16 ` [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure Andrew Morton 0 siblings, 2 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-15 14:06 UTC (permalink / raw) To: Christian Borntraeger Cc: Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Monday 15 October 2007 18:28, Christian Borntraeger wrote: > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit > unmaintained, so decided to sent the patch to you :-). > I have CCed Ted, who did work on the code in the 90s. I found no current > email address of Chad Page. This really needs to be fixed... I can't make up my mind between the approaches to fixing it. On one hand, I would actually prefer to really mark the buffers dirty (as in: Eric's fix for this problem[*]) than this patch, and this seems a bit like a bandaid... On the other hand, the wound being covered by the bandaid is actually the code in the buffer layer that does this latent "cleaning" of the page because it sadly doesn't really keep track of the pagecache state. But it *still* feels like we should be marking the rd page's buffers dirty which should avoid this problem anyway. [*] However, hmm, with Eric's patch I guess we'd still have a hole where filesystems that write their buffers by hand think they are "cleaning" these things and we're back to square one. That could be fixed by marking the buffers dirty again? Why were Eric's patches dropped, BTW? I don't remember. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 14:06 ` Nick Piggin @ 2007-10-15 9:05 ` Christian Borntraeger 2007-10-15 14:38 ` Nick Piggin 2007-10-15 9:16 ` [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure Andrew Morton 1 sibling, 1 reply; 77+ messages in thread From: Christian Borntraeger @ 2007-10-15 9:05 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Am Montag, 15. Oktober 2007 schrieb Nick Piggin: > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit > > unmaintained, so decided to sent the patch to you :-). > > I have CCed Ted, who did work on the code in the 90s. I found no current > > email address of Chad Page. > > This really needs to be fixed... I obviously agree ;-) We have seen this problem happen several times. > I can't make up my mind between the approaches to fixing it. > > On one hand, I would actually prefer to really mark the buffers > dirty (as in: Eric's fix for this problem[*]) than this patch, > and this seems a bit like a bandaid... I have never seen these patches, so I cannot comment on them. > > On the other hand, the wound being covered by the bandaid is > actually the code in the buffer layer that does this latent > "cleaning" of the page because it sadly doesn't really keep > track of the pagecache state. But it *still* feels like we > should be marking the rd page's buffers dirty which should > avoid this problem anyway. Yes, that would solve the problem as well. As long as we fix the problem, I am happy. On the other hand, do you see any obvious problem with this "bandaid"? Christian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 9:05 ` Christian Borntraeger @ 2007-10-15 14:38 ` Nick Piggin 2007-10-15 18:38 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-15 14:38 UTC (permalink / raw) To: Christian Borntraeger, Eric W. Biederman Cc: Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Monday 15 October 2007 19:05, Christian Borntraeger wrote: > Am Montag, 15. Oktober 2007 schrieb Nick Piggin: > > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: > > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit > > > unmaintained, so decided to sent the patch to you :-). > > > I have CCed Ted, who did work on the code in the 90s. I found no > > > current email address of Chad Page. > > > > This really needs to be fixed... > > I obviously agree ;-) > We have seen this problem happen several times. > > > I can't make up my mind between the approaches to fixing it. > > > > On one hand, I would actually prefer to really mark the buffers > > dirty (as in: Eric's fix for this problem[*]) than this patch, > > and this seems a bit like a bandaid... > > I have never seen these patches, so I cannot comment on them. > > On the other hand, the wound being covered by the bandaid is > > actually the code in the buffer layer that does this latent > > "cleaning" of the page because it sadly doesn't really keep > > track of the pagecache state. But it *still* feels like we > > should be marking the rd page's buffers dirty which should > > avoid this problem anyway. > > Yes, that would solve the problem as well. As long as we fix > the problem, I am happy. On the other hand, do you see any > obvious problem with this "bandaid"? I don't think so -- in fact, it could be the best candidate for a minimal fix for stable kernels (anyone disagree? if not, maybe you could also send this to the stable maintainers?). But I do want to have this fixed in a "nice" way. eg. I'd like it to mark the buffers dirty because that actually results in more reuse of generic kernel code, and also should make rd behave more naturally (I like using it to test filesystems because it can expose a lot more concurrency than something like loop on tmpfs). It should also be possible to actually have rd's buffer heads get reclaimed as well, preferably while exercising the common buffer paths and without writing much new code. All of that is secondary to fixing the data corruption problem of course! But the fact that those alternate patches do exist now means I want to just bring them into the discussion again before merging one or the other. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 14:38 ` Nick Piggin @ 2007-10-15 18:38 ` Eric W. Biederman 2007-10-15 22:37 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-15 18:38 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Monday 15 October 2007 19:05, Christian Borntraeger wrote: >> Am Montag, 15. Oktober 2007 schrieb Nick Piggin: >> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: >> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit >> > > unmaintained, so decided to sent the patch to you :-). >> > > I have CCed Ted, who did work on the code in the 90s. I found no >> > > current email address of Chad Page. >> > >> > This really needs to be fixed... >> >> I obviously agree ;-) >> We have seen this problem happen several times. >> >> > I can't make up my mind between the approaches to fixing it. >> > >> > On one hand, I would actually prefer to really mark the buffers >> > dirty (as in: Eric's fix for this problem[*]) than this patch, >> > and this seems a bit like a bandaid... >> >> I have never seen these patches, so I cannot comment on them. > >> > On the other hand, the wound being covered by the bandaid is >> > actually the code in the buffer layer that does this latent >> > "cleaning" of the page because it sadly doesn't really keep >> > track of the pagecache state. But it *still* feels like we >> > should be marking the rd page's buffers dirty which should >> > avoid this problem anyway. >> >> Yes, that would solve the problem as well. As long as we fix >> the problem, I am happy. On the other hand, do you see any >> obvious problem with this "bandaid"? > > I don't think so -- in fact, it could be the best candidate for > a minimal fix for stable kernels (anyone disagree? if not, maybe > you could also send this to the stable maintainers?). A minor one. It still leaves us with buffer heads out of sync with struct page. > But I do want to have this fixed in a "nice" way. eg. I'd like > it to mark the buffers dirty because that actually results in > more reuse of generic kernel code, and also should make rd > behave more naturally (I like using it to test filesystems > because it can expose a lot more concurrency than something like > loop on tmpfs). It should also be possible to actually have > rd's buffer heads get reclaimed as well, preferably while > exercising the common buffer paths and without writing much new > code. We actually allow that currently for clean pages which is part of what makes this tricky. > All of that is secondary to fixing the data corruption problem > of course! But the fact that those alternate patches do exist now > means I want to just bring them into the discussion again before > merging one or the other. The core of my original fix was to modify init_page_buffers so that when we added buffers to a dirty page the buffers became dirty. Modifying the generic code is a bit spooky because it requires us to audit the kernel to make certain nothing else depends on the current behavior in odd ways. Although since init_page_buffers is only called when we are adding buffer heads to an existing page I still think that was the proper change. 
The historical reason for my patches not getting merged the first time is there was some weird issue with reiserfs ramdisks and so Andrew disabled the code, and then dropped it when he had discovered he had the patch disabled for several releases. I don't think any causal relationship was ever established. But I didn't hear enough about the reiserfs ramdisk issue, to make a guess what was going on. So it looks to me like the important invariant we need to maintain is that when a ramdisk page is dirty it always has buffers and those buffers are dirty as well. With a little care we can ensure this happens with just modifications to rd.c Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 18:38 ` Eric W. Biederman @ 2007-10-15 22:37 ` Eric W. Biederman 2007-10-15 22:40 ` [PATCH] rd: Preserve the dirty bit in init_page_buffers() Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-15 22:37 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o ebiederm@xmission.com (Eric W. Biederman) writes: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > >> On Monday 15 October 2007 19:05, Christian Borntraeger wrote: >>> Am Montag, 15. Oktober 2007 schrieb Nick Piggin: >>> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: >>> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit >>> > > unmaintained, so decided to sent the patch to you :-). >>> > > I have CCed Ted, who did work on the code in the 90s. I found no >>> > > current email address of Chad Page. >>> > >>> > This really needs to be fixed... >>> >>> I obviously agree ;-) >>> We have seen this problem happen several times. >>> >>> > I can't make up my mind between the approaches to fixing it. >>> > >>> > On one hand, I would actually prefer to really mark the buffers >>> > dirty (as in: Eric's fix for this problem[*]) than this patch, >>> > and this seems a bit like a bandaid... >>> >>> I have never seen these patches, so I cannot comment on them. >> >>> > On the other hand, the wound being covered by the bandaid is >>> > actually the code in the buffer layer that does this latent >>> > "cleaning" of the page because it sadly doesn't really keep >>> > track of the pagecache state. But it *still* feels like we >>> > should be marking the rd page's buffers dirty which should >>> > avoid this problem anyway. >>> >>> Yes, that would solve the problem as well. As long as we fix >>> the problem, I am happy. On the other hand, do you see any >>> obvious problem with this "bandaid"? >> >> I don't think so -- in fact, it could be the best candidate for >> a minimal fix for stable kernels (anyone disagree? if not, maybe >> you could also send this to the stable maintainers?). > > A minor one. It still leaves us with buffer heads out of sync with > struct page. > >> But I do want to have this fixed in a "nice" way. eg. I'd like >> it to mark the buffers dirty because that actually results in >> more reuse of generic kernel code, and also should make rd >> behave more naturally (I like using it to test filesystems >> because it can expose a lot more concurrency than something like >> loop on tmpfs). It should also be possible to actually have >> rd's buffer heads get reclaimed as well, preferably while >> exercising the common buffer paths and without writing much new >> code. > > We actually allow that currently for clean pages which is part > of what makes this tricky. > >> All of that is secondary to fixing the data corruption problem >> of course! But the fact that those alternate patches do exist now >> means I want to just bring them into the discussion again before >> merging one or the other. > > The core of my original fix was to modify init_page_buffers so that > when we added buffers to a dirty page the buffers became dirty. > > Modifying the generic code is a bit spooky because it requires us > to audit the kernel to make certain nothing else depends on the > current behavior in odd ways. 
Although since init_page_buffers > is only called when we are adding buffer heads to an existing > page I still think that was the proper change. > > The historical reason for my patches not getting merged the first > time is there was some weird issue with reiserfs ramdisks and so > Andrew disabled the code, and then dropped it when he had discovered > he had the patch disabled for several releases. I don't think > any causal relationship was ever established. But I didn't > hear enough about the reiserfs ramdisk issue, to make a guess > what was going on. > > So it looks to me like the important invariant we need to maintain > is that when a ramdisk page is dirty it always has buffers and those > buffers are dirty as well. With a little care we can ensure this > happens with just modifications to rd.c Hah. I looked over my last round of patches again and I have been able to verify by review the parts I was a little iffy about and I have found where in my cleanups I had missed a layering violation in the ramdisk code, and removed some needed code. Which probably accounts for the reiserfs ramdisk problems. Updated patches in a minute. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH] rd: Preserve the dirty bit in init_page_buffers()
2007-10-15 22:37 ` Eric W. Biederman
@ 2007-10-15 22:40 ` Eric W. Biederman
2007-10-15 22:42 ` [PATCH] rd: Mark ramdisk buffers heads dirty Eric W. Biederman
2007-10-16 8:12 ` [PATCH] rd: Preserve the dirty bit in init_page_buffers() Nick Piggin
0 siblings, 2 replies; 77+ messages in thread
From: Eric W. Biederman @ 2007-10-15 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o

The problem:

When we are trying to free buffers try_to_free_buffers() will look at
ramdisk pages with clean buffer heads and remove the dirty bit from the
page. Resulting in ramdisk pages with data that get removed from the
page cache. Ouch!

Buffer heads appear on ramdisk pages when a filesystem calls __getblk()
which through a series of function calls eventually calls
init_page_buffers().

So to fix the mismatch between buffer head and page state this patch
modifies init_page_buffers() to transfer the dirty bit from the page to
the buffer heads like we currently do for the uptodate bit.

This patch is safe as only __getblk calls init_page_buffers, and there
are only two implementations of block devices cached in the page cache.
The generic implementation in block_dev.c and the implementation in
rd.c.

The generic implementation of block devices always does everything in
terms of buffer heads so it always has buffer heads allocated before a
page is marked dirty so this change does not affect it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/buffer.c | 3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 75b51df..8b87beb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh = head;
 	int uptodate = PageUptodate(page);
+	int dirty = PageDirty(page);
 
 	do {
 		if (!buffer_mapped(bh)) {
@@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct block_device *bdev,
 			bh->b_blocknr = block;
 			if (uptodate)
 				set_buffer_uptodate(bh);
+			if (dirty)
+				set_buffer_dirty(bh);
 			set_buffer_mapped(bh);
 		}
 		block++;
--
1.5.3.rc6.17.g1911

^ permalink raw reply related [flat|nested] 77+ messages in thread
* [PATCH] rd: Mark ramdisk buffers heads dirty
2007-10-15 22:40 ` [PATCH] rd: Preserve the dirty bit in init_page_buffers() Eric W. Biederman
@ 2007-10-15 22:42 ` Eric W. Biederman
2007-10-16 7:56 ` Christian Borntraeger
2007-10-16 8:19 ` [PATCH] rd: Mark ramdisk buffers heads dirty Nick Piggin
1 sibling, 2 replies; 77+ messages in thread
From: Eric W. Biederman @ 2007-10-15 22:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o

I have not observed this case but it is possible to get a dirty page
cache with clean buffer heads if we get a clean ramdisk page with
buffer heads generated by a filesystem calling __getblk and then write
to that page from user space through the block device. Then we just
need to hit the proper window and try_to_free_buffers() will mark that
page clean and eventually drop it. Ouch!

To fix this use the generic __set_page_dirty_buffers in the ramdisk
code so that when we mark a page dirty we also mark its buffer heads
dirty.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/block/rd.c | 13 +------------
 1 files changed, 1 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 701ea77..84163da 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space *mapping,
 	return 0;
 }
 
-/*
- * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
- * want them to contribute to dirty memory accounting.
- */
-static int ramdisk_set_page_dirty(struct page *page)
-{
-	if (!TestSetPageDirty(page))
-		return 1;
-	return 0;
-}
-
 static const struct address_space_operations ramdisk_aops = {
 	.readpage = ramdisk_readpage,
 	.prepare_write = ramdisk_prepare_write,
 	.commit_write = ramdisk_commit_write,
 	.writepage = ramdisk_writepage,
-	.set_page_dirty = ramdisk_set_page_dirty,
+	.set_page_dirty = __set_page_dirty_buffers,
 	.writepages = ramdisk_writepages,
 };

--
1.5.3.rc6.17.g1911

^ permalink raw reply related [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-15 22:42 ` [PATCH] rd: Mark ramdisk buffers heads dirty Eric W. Biederman @ 2007-10-16 7:56 ` Christian Borntraeger 2007-10-16 9:22 ` Eric W. Biederman 2007-10-17 16:14 ` Christian Borntraeger 2007-10-16 8:19 ` [PATCH] rd: Mark ramdisk buffers heads dirty Nick Piggin 1 sibling, 2 replies; 77+ messages in thread From: Christian Borntraeger @ 2007-10-16 7:56 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman: > fs/buffer.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > drivers/block/rd.c | 13 +------------ > 1 files changed, 1 insertions(+), 12 deletions(-) Your patches look sane so far. I have applied both patches, and the problem seems gone. I will try to get these patches to our testers. As long as they dont find new problems: Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Nick, Eric. What is the best patch for the stable series? Both patches from Eric or mine? I CCed stable, so they know that something is coming. Christian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-16 7:56 ` Christian Borntraeger @ 2007-10-16 9:22 ` Eric W. Biederman 2007-10-17 16:14 ` Christian Borntraeger 1 sibling, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 9:22 UTC (permalink / raw) To: Christian Borntraeger Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Christian Borntraeger <borntraeger@de.ibm.com> writes: > Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman: > >> fs/buffer.c | 3 +++ >> 1 files changed, 3 insertions(+), 0 deletions(-) >> drivers/block/rd.c | 13 +------------ >> 1 files changed, 1 insertions(+), 12 deletions(-) > > Your patches look sane so far. I have applied both patches, and the problem > seems gone. I will try to get these patches to our testers. > > As long as they dont find new problems: Sounds good. Please make certain to test reiserfs as well as ext2+ as it seems to strain the ramdisk code more aggressively. > Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> > > Nick, Eric. What is the best patch for the stable series? Both patches from > Eric or mine? I CCed stable, so they know that something is coming. My gut feel says my patches assuming everything tests out ok, as they actually fix the problem instead of papering over it, and there isn't really a size difference. In addition the change to init_page_buffers is interesting all by itself. With that patch we now have the option of running block devices in mode where we don't bother to cache the buffer heads which should reduce memory pressure a bit. Not that an enhancement like that will show up in stable, but... Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-16 7:56 ` Christian Borntraeger 2007-10-16 9:22 ` Eric W. Biederman @ 2007-10-17 16:14 ` Christian Borntraeger 2007-10-17 17:57 ` Eric W. Biederman 1 sibling, 1 reply; 77+ messages in thread From: Christian Borntraeger @ 2007-10-17 16:14 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Eric, Am Dienstag, 16. Oktober 2007 schrieb Christian Borntraeger: > Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman: > > > fs/buffer.c | 3 +++ > > 1 files changed, 3 insertions(+), 0 deletions(-) > > drivers/block/rd.c | 13 +------------ > > 1 files changed, 1 insertions(+), 12 deletions(-) > > Your patches look sane so far. I have applied both patches, and the problem > seems gone. I will try to get these patches to our testers. > > As long as they dont find new problems: Our testers did only a short test, and then they were stopped by problems with reiserfs. At the moment I cannot say for sure if your patch caused this, but we got the following BUG ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr storage. ------------[ cut here ]------------ kernel BUG at /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117! illegal operation: 0001 [#1] Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur CPU: 3 Not tainted Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88) Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs]) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 Krnl GPRS: 00000002 7411b5c8 0000002b 00000000 7b04d000 00000001 00000000 76d1de00 7513eec0 00000003 00000012 77f77680 7411b608 fb343b7e fb34404a 7513ee50 Krnl Code: fb344374: a7210002 tmll %r2,2 fb344378: a7840004 brc 8,fb344380 fb34437c: a7f40001 brc 15,fb34437e >fb344380: 5810d8c2 l %r1,2242(%r13) fb344384: 5820b03c l %r2,60(%r11) fb344388: 0de1 basr %r14,%r1 fb34438a: 5810d90e l %r1,2318(%r13) fb34438e: 5820b03c l %r2,60(%r11) Looking at the code, this really seems related to dirty buffers, so your patch is the main suspect at the moment. if (!barrier) { /* If there was a write error in the journal - we can't commit * this transaction - it will be invalid and, if successful, * will just end up propagating the write error out to * the file system. */ if (likely(!retval && !reiserfs_is_journal_aborted (journal))) { if (buffer_dirty(jl->j_commit_bh)) 1117----> BUG(); mark_buffer_dirty(jl->j_commit_bh) ; sync_dirty_buffer(jl->j_commit_bh) ; } } Christian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 16:14 ` Christian Borntraeger @ 2007-10-17 17:57 ` Eric W. Biederman 2007-10-17 19:14 ` Chris Mason 2007-10-17 21:48 ` [PATCH] rd: Mark ramdisk buffers heads dirty Christian Borntraeger 0 siblings, 2 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 17:57 UTC (permalink / raw) To: Christian Borntraeger Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Christian Borntraeger <borntraeger@de.ibm.com> writes: > Eric, > > Am Dienstag, 16. Oktober 2007 schrieb Christian Borntraeger: >> Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman: >> >> > fs/buffer.c | 3 +++ >> > 1 files changed, 3 insertions(+), 0 deletions(-) >> > drivers/block/rd.c | 13 +------------ >> > 1 files changed, 1 insertions(+), 12 deletions(-) >> >> Your patches look sane so far. I have applied both patches, and the problem >> seems gone. I will try to get these patches to our testers. >> >> As long as they dont find new problems: > > Our testers did only a short test, and then they were stopped by problems with > reiserfs. At the moment I cannot say for sure if your patch caused this, but > we got the following BUG Thanks. > ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr > storage. > ------------[ cut here ]------------ > kernel BUG at > /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117! > illegal operation: 0001 [#1] > Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur > CPU: 3 Not tainted > Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88) > Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs]) > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 > Krnl GPRS: 00000002 7411b5c8 0000002b 00000000 > 7b04d000 00000001 00000000 76d1de00 > 7513eec0 00000003 00000012 77f77680 > 7411b608 fb343b7e fb34404a 7513ee50 > Krnl Code: fb344374: a7210002 tmll %r2,2 > fb344378: a7840004 brc 8,fb344380 > fb34437c: a7f40001 brc 15,fb34437e > >fb344380: 5810d8c2 l %r1,2242(%r13) > fb344384: 5820b03c l %r2,60(%r11) > fb344388: 0de1 basr %r14,%r1 > fb34438a: 5810d90e l %r1,2318(%r13) > fb34438e: 5820b03c l %r2,60(%r11) > > > Looking at the code, this really seems related to dirty buffers, so your patch > is the main suspect at the moment. Sounds reasonable. > if (!barrier) { > /* If there was a write error in the journal - we can't commit > * this transaction - it will be invalid and, if successful, > * will just end up propagating the write error out to > * the file system. */ > if (likely(!retval && !reiserfs_is_journal_aborted (journal))) { > if (buffer_dirty(jl->j_commit_bh)) > 1117----> BUG(); > mark_buffer_dirty(jl->j_commit_bh) ; > sync_dirty_buffer(jl->j_commit_bh) ; > } > } Grr. I'm not certain how to call that. Given that I should also be able to trigger this case by writing to the block device through the buffer cache (to the write spot at the write moment) this feels like a reiserfs bug. Although I feel screaming about filesystems that go BUG instead of remount read-only.... At the same time I increasingly don't think we should allow user space to dirty or update our filesystem metadata buffer heads. That seems like asking for trouble. Did you have both of my changes applied? To init_page_buffer() and to the ramdisk_set_dirty_page? I'm guessing it was the change to ramdisk_set_dirty_page.... Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 17:57 ` Eric W. Biederman @ 2007-10-17 19:14 ` Chris Mason 2007-10-17 20:29 ` Eric W. Biederman 2007-10-17 21:48 ` [PATCH] rd: Mark ramdisk buffers heads dirty Christian Borntraeger 1 sibling, 1 reply; 77+ messages in thread From: Chris Mason @ 2007-10-17 19:14 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Wed, 2007-10-17 at 11:57 -0600, Eric W. Biederman wrote: > Christian Borntraeger <borntraeger@de.ibm.com> writes: > > > Eric, > > > > Am Dienstag, 16. Oktober 2007 schrieb Christian Borntraeger: > >> Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman: > >> > >> > fs/buffer.c | 3 +++ > >> > 1 files changed, 3 insertions(+), 0 deletions(-) > >> > drivers/block/rd.c | 13 +------------ > >> > 1 files changed, 1 insertions(+), 12 deletions(-) > >> > >> Your patches look sane so far. I have applied both patches, and the problem > >> seems gone. I will try to get these patches to our testers. > >> > >> As long as they dont find new problems: > > > > Our testers did only a short test, and then they were stopped by problems with > > reiserfs. At the moment I cannot say for sure if your patch caused this, but > > we got the following BUG > > Thanks. > > > ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr > > storage. > > ------------[ cut here ]------------ > > kernel BUG at > > /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117! > > illegal operation: 0001 [#1] > > Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur > > CPU: 3 Not tainted > > Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88) > > Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs]) > > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 > > Krnl GPRS: 00000002 7411b5c8 0000002b 00000000 > > 7b04d000 00000001 00000000 76d1de00 > > 7513eec0 00000003 00000012 77f77680 > > 7411b608 fb343b7e fb34404a 7513ee50 > > Krnl Code: fb344374: a7210002 tmll %r2,2 > > fb344378: a7840004 brc 8,fb344380 > > fb34437c: a7f40001 brc 15,fb34437e > > >fb344380: 5810d8c2 l %r1,2242(%r13) > > fb344384: 5820b03c l %r2,60(%r11) > > fb344388: 0de1 basr %r14,%r1 > > fb34438a: 5810d90e l %r1,2318(%r13) > > fb34438e: 5820b03c l %r2,60(%r11) > > > > > > Looking at the code, this really seems related to dirty buffers, so your patch > > is the main suspect at the moment. > > Sounds reasonable. > > > if (!barrier) { > > /* If there was a write error in the journal - we can't commit > > * this transaction - it will be invalid and, if successful, > > * will just end up propagating the write error out to > > * the file system. */ > > if (likely(!retval && !reiserfs_is_journal_aborted (journal))) { > > if (buffer_dirty(jl->j_commit_bh)) > > 1117----> BUG(); > > mark_buffer_dirty(jl->j_commit_bh) ; > > sync_dirty_buffer(jl->j_commit_bh) ; > > } > > } > > Grr. I'm not certain how to call that. > > Given that I should also be able to trigger this case by writing to > the block device through the buffer cache (to the write spot at the > write moment) this feels like a reiserfs bug. > Although I feel > screaming about filesystems that go BUG instead of remount read-only.... > In this case, the commit block isn't allowed to be dirty before reiserfs decides it is safe to write it. 
The journal code expects it is the only spot in the kernel setting buffer heads dirty, and it only does so after the rest of the log blocks are safely on disk. Given this is a ramdisk, the check can be ignored, but I'd rather not sprinkle if (ram_disk) into the FS code.... > At the same time I increasingly don't think we should allow user space > to dirty or update our filesystem metadata buffer heads. That seems > like asking for trouble. > Demanding trouble ;) -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 19:14 ` Chris Mason @ 2007-10-17 20:29 ` Eric W. Biederman 2007-10-17 20:54 ` Chris Mason 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 20:29 UTC (permalink / raw) To: Chris Mason Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Chris Mason <chris.mason@oracle.com> writes: > In this case, the commit block isn't allowed to be dirty before reiserfs > decides it is safe to write it. The journal code expects it is the only > spot in the kernel setting buffer heads dirty, and it only does so after > the rest of the log blocks are safely on disk. Ok. So the journal code here fundamentally depends on being able to control the order of the writes, and something else being able to set the buffer head dirty messes up that control. > Given this is a ramdisk, the check can be ignored, but I'd rather not > sprinkle if (ram_disk) into the FS code.... It makes no sense to implement a ramdisk in a way that requires this. >> At the same time I increasingly don't think we should allow user space >> to dirty or update our filesystem metadata buffer heads. That seems >> like asking for trouble. >> > > Demanding trouble ;) Looks like it. There are even comments in jbd about the same class of problems. Apparently dump and tune2fs on mounted filesystems have triggered some of these issues. The practical question is any of this trouble worth handling. Thinking about it. I don't believe anyone has ever intentionally built a filesystem tool that depends on being able to modify a file systems metadata buffer heads while the filesystem is running, and doing that would seem to be fragile as it would require a lot of cooperation between the tool and the filesystem about how the filesystem uses and implement things. Now I guess I need to see how difficult a patch would be to give filesystems magic inodes to keep their metadata buffer heads in. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 20:29 ` Eric W. Biederman @ 2007-10-17 20:54 ` Chris Mason 2007-10-17 21:30 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Chris Mason @ 2007-10-17 20:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Wed, 2007-10-17 at 14:29 -0600, Eric W. Biederman wrote: > Chris Mason <chris.mason@oracle.com> writes: > > > In this case, the commit block isn't allowed to be dirty before reiserfs > > decides it is safe to write it. The journal code expects it is the only > > spot in the kernel setting buffer heads dirty, and it only does so after > > the rest of the log blocks are safely on disk. > > Ok. So the journal code here fundamentally depends on being able to > control the order of the writes, and something else being able to set > the buffer head dirty messes up that control. > Right. > >> At the same time I increasingly don't think we should allow user space > >> to dirty or update our filesystem metadata buffer heads. That seems > >> like asking for trouble. > >> > > > > Demanding trouble ;) > > Looks like it. There are even comments in jbd about the same class > of problems. Apparently dump and tune2fs on mounted filesystems have > triggered some of these issues. The practical question is any of this > trouble worth handling. > > Thinking about it. I don't believe anyone has ever intentionally built > a filesystem tool that depends on being able to modify a file systems > metadata buffer heads while the filesystem is running, and doing that > would seem to be fragile as it would require a lot of cooperation > between the tool and the filesystem about how the filesystem uses and > implement things. > That's right. For example, ext2 is doing directories in the page cache of the directory inode, so there's a cache alias between the block device page cache and the directory inode page cache. > Now I guess I need to see how difficult a patch would be to give > filesystems magic inodes to keep their metadata buffer heads in. Not hard, the block device inode is already a magic inode for metadata buffer heads. You could just make another one attached to the bdev. But, I don't think I fully understand the problem you're trying to solve? -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 20:54 ` Chris Mason @ 2007-10-17 21:30 ` Eric W. Biederman 2007-10-17 22:58 ` Chris Mason 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 21:30 UTC (permalink / raw) To: Chris Mason Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Chris Mason <chris.mason@oracle.com> writes: >> Thinking about it. I don't believe anyone has ever intentionally built >> a filesystem tool that depends on being able to modify a file systems >> metadata buffer heads while the filesystem is running, and doing that >> would seem to be fragile as it would require a lot of cooperation >> between the tool and the filesystem about how the filesystem uses and >> implement things. >> > > That's right. For example, ext2 is doing directories in the page cache > of the directory inode, so there's a cache alias between the block > device page cache and the directory inode page cache. > >> Now I guess I need to see how difficult a patch would be to give >> filesystems magic inodes to keep their metadata buffer heads in. > > Not hard, the block device inode is already a magic inode for metadata > buffer heads. You could just make another one attached to the bdev. > > But, I don't think I fully understand the problem you're trying to > solve? So the start: When we write buffers from the buffer cache we clear buffer_dirty but not PageDirty So try_to_free_buffers() will mark any page with clean buffer_heads that is not clean itself clean. The ramdisk set pages dirty to keep them from being removed from the page cache, just like ramfs. Unfortunately when those dirty ramdisk pages get buffers on them and those buffers all go clean and we are trying to reclaim buffer_heads we drop those pages from the page cache. Ouch! We can fix the ramdisk by setting making certain that buffer_heads on ramdisk pages stay dirty as well. The problem is this breaks filesystems like reiserfs and ext3 that expect to be able to make buffer_heads clean sometimes. There are other ways to solve this for ramdisks, such as changing where ramdisks are stored. However fixing the ramdisks this way still leaves the general problem that there are other paths to the filesystem metadata buffers, and those other paths cause the code to be complicated and buggy. So I'm trying to see if we can untangle this Gordian knot, so the code because more easily maintainable. To make the buffer cache a helper library instead of require infrastructure, it looks like two things need to happen. - Move metadata buffer heads off block devices page cache entries, - Communicate the mappings of data pages to block device sectors in a generic way without buffer heads. How we ultimately fix the ramdisk tends to depend on how we untangle the buffer head problem. Right now the only simple solution is to suppress try_to_free_buffers, which is a bit ugly. We can also come up with a completely separate store for the pages in the buffer cache but if we wind up moving the metadata buffer heads anyway then that should not be necessary. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 21:30 ` Eric W. Biederman @ 2007-10-17 22:58 ` Chris Mason 2007-10-17 23:28 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Chris Mason @ 2007-10-17 22:58 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Wed, 2007-10-17 at 15:30 -0600, Eric W. Biederman wrote: > Chris Mason <chris.mason@oracle.com> writes: > > >> Thinking about it. I don't believe anyone has ever intentionally built > >> a filesystem tool that depends on being able to modify a file systems > >> metadata buffer heads while the filesystem is running, and doing that > >> would seem to be fragile as it would require a lot of cooperation > >> between the tool and the filesystem about how the filesystem uses and > >> implement things. > >> > > > > That's right. For example, ext2 is doing directories in the page cache > > of the directory inode, so there's a cache alias between the block > > device page cache and the directory inode page cache. > > > >> Now I guess I need to see how difficult a patch would be to give > >> filesystems magic inodes to keep their metadata buffer heads in. > > > > Not hard, the block device inode is already a magic inode for metadata > > buffer heads. You could just make another one attached to the bdev. > > > > But, I don't think I fully understand the problem you're trying to > > solve? > > > So the start: > When we write buffers from the buffer cache we clear buffer_dirty > but not PageDirty > > So try_to_free_buffers() will mark any page with clean buffer_heads > that is not clean itself clean. > > The ramdisk set pages dirty to keep them from being removed from the > page cache, just like ramfs. > So, the problem is using the Dirty bit to indicate pinned. You're completely right that our current setup of buffer heads and pages and filesystem metadata is complex and difficult. But, moving the buffer heads off of the page cache pages isn't going to make it any easier to use dirty as pinned, especially in the face of buffer_head users for file data pages. You've already seen Nick fsblock code, but you can see my general approach to replacing buffer heads here: http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h (alpha quality implementation in extent_map.c and users in inode.c) The basic idea is to do extent based record keeping for mapping and state of things in the filesystem, and to avoid attaching these things to the page. > Unfortunately when those dirty ramdisk pages get buffers on them and > those buffers all go clean and we are trying to reclaim buffer_heads > we drop those pages from the page cache. Ouch! > > We can fix the ramdisk by setting making certain that buffer_heads > on ramdisk pages stay dirty as well. The problem is this breaks > filesystems like reiserfs and ext3 that expect to be able to make > buffer_heads clean sometimes. > > There are other ways to solve this for ramdisks, such as changing > where ramdisks are stored. However fixing the ramdisks this way > still leaves the general problem that there are other paths to the > filesystem metadata buffers, and those other paths cause the code > to be complicated and buggy. > > So I'm trying to see if we can untangle this Gordian knot, so the > code because more easily maintainable. 
> Don't get me wrong, I'd love to see a simple and coherent fix for what reiserfs and ext3 do with buffer head state, but I think for the short term you're best off pinning the ramdisk pages via some other means. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 22:58 ` Chris Mason @ 2007-10-17 23:28 ` Eric W. Biederman 2007-10-18 0:03 ` Chris Mason 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 23:28 UTC (permalink / raw) To: Chris Mason Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Chris Mason <chris.mason@oracle.com> writes: > So, the problem is using the Dirty bit to indicate pinned. You're > completely right that our current setup of buffer heads and pages and > filesystpem metadata is complex and difficult. > > But, moving the buffer heads off of the page cache pages isn't going to > make it any easier to use dirty as pinned, especially in the face of > buffer_head users for file data pages. Let me specific. Not moving buffer_heads off of page cache pages, moving buffer_heads off of the block devices page cache pages. My problem is the coupling of how block devices are cached and the implementation of buffer heads, and by removing that coupling we can generally make things better. Currently that coupling means silly things like all block devices are cached in low memory. Which probably isn't what you want if you actually have a use for block devices. For the ramdisk case in particular what this means is that there are no more users that create buffer_head mappings on the block device cache so using the dirty bit will be safe. Further it removes the nasty possibility of user space messing with metadata buffer head state. So the only way those cases can happen is a code bug, or a hardware bug. So I think by removing these unnecessary code paths things will become easier to work with. > You've already seen Nick fsblock code, but you can see my general > approach to replacing buffer heads here: > > http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h > > (alpha quality implementation in extent_map.c and users in inode.c) The > basic idea is to do extent based record keeping for mapping and state of > things in the filesystem, and to avoid attaching these things to the > page. Interesting. Something to dig into. > Don't get me wrong, I'd love to see a simple and coherent fix for what > reiserfs and ext3 do with buffer head state, but I think for the short > term you're best off pinning the ramdisk pages via some other means. Yes. And the problem is hard enough to trigger that a short term fix is actually of debatable value. The reason this hasn't shown up more frequently is that it only ever triggers if you are in the buffer head reclaim state, which on a 64bit box means you have to use < 4K buffers and have your ram cache another block device. That plus most people use initramfs these days. For the short term we have Christian's other patch which simply disables calling try_to_free_buffers. Although that really feels like a hack to me. For 2.6.25 I think I have a shot at fixing these things cleanly. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 23:28 ` Eric W. Biederman @ 2007-10-18 0:03 ` Chris Mason 2007-10-18 3:27 ` Eric W. Biederman 2007-10-18 3:59 ` [RFC][PATCH] block: Isolate the buffer cache in it's own mappings Eric W. Biederman 0 siblings, 2 replies; 77+ messages in thread From: Chris Mason @ 2007-10-18 0:03 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote: > Chris Mason <chris.mason@oracle.com> writes: > > > So, the problem is using the Dirty bit to indicate pinned. You're > > completely right that our current setup of buffer heads and pages and > > filesystpem metadata is complex and difficult. > > > > But, moving the buffer heads off of the page cache pages isn't going to > > make it any easier to use dirty as pinned, especially in the face of > > buffer_head users for file data pages. > > Let me specific. Not moving buffer_heads off of page cache pages, > moving buffer_heads off of the block devices page cache pages. > > My problem is the coupling of how block devices are cached and the > implementation of buffer heads, and by removing that coupling > we can generally make things better. Currently that coupling > means silly things like all block devices are cached in low memory. > Which probably isn't what you want if you actually have a use > for block devices. > > For the ramdisk case in particular what this means is that there > are no more users that create buffer_head mappings on the block > device cache so using the dirty bit will be safe. Ok, we move the buffer heads off to a different inode, and that indoe has pages. The pages on the inode still need to get pinned, how does that pinning happen? The problem you described where someone cleans a page because the buffer heads are clean happens already without help from userland. So, keeping the pages away from userland won't save them from cleaning. Sorry if I'm reading your suggestion wrong... -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-18 0:03 ` Chris Mason @ 2007-10-18 3:27 ` Eric W. Biederman 2007-10-18 3:59 ` [RFC][PATCH] block: Isolate the buffer cache in it's own mappings Eric W. Biederman 1 sibling, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-18 3:27 UTC (permalink / raw) To: Chris Mason Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Chris Mason <chris.mason@oracle.com> writes: > On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote: >> Chris Mason <chris.mason@oracle.com> writes: >> >> > So, the problem is using the Dirty bit to indicate pinned. You're >> > completely right that our current setup of buffer heads and pages and >> > filesystpem metadata is complex and difficult. >> > >> > But, moving the buffer heads off of the page cache pages isn't going to >> > make it any easier to use dirty as pinned, especially in the face of >> > buffer_head users for file data pages. >> >> Let me specific. Not moving buffer_heads off of page cache pages, >> moving buffer_heads off of the block devices page cache pages. >> >> My problem is the coupling of how block devices are cached and the >> implementation of buffer heads, and by removing that coupling >> we can generally make things better. Currently that coupling >> means silly things like all block devices are cached in low memory. >> Which probably isn't what you want if you actually have a use >> for block devices. >> >> For the ramdisk case in particular what this means is that there >> are no more users that create buffer_head mappings on the block >> device cache so using the dirty bit will be safe. > > Ok, we move the buffer heads off to a different inode, and that indoe > has pages. The pages on the inode still need to get pinned, how does > that pinning happen? > The problem you described where someone cleans a page because the buffer > heads are clean happens already without help from userland. So, keeping > the pages away from userland won't save them from cleaning. > > Sorry if I'm reading your suggestion wrong... Yes. I was suggesting to continue to pin the pages for the page cache pages block device inode, and having the buffer cache live on some other inode. Thus not causing me problems by adding clean buffer_heads to my dirty pages. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-18 0:03 ` Chris Mason 2007-10-18 3:27 ` Eric W. Biederman @ 2007-10-18 3:59 ` Eric W. Biederman 2007-10-18 4:32 ` Andrew Morton 2007-10-18 5:10 ` Nick Piggin 1 sibling, 2 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-18 3:59 UTC (permalink / raw) To: Chris Mason Cc: Christian Borntraeger, Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable If filesystems care at all they want absolute control over the buffer cache. Controlling which buffers are dirty and when. Because we keep the buffer cache in the page cache for the block device we have not quite been giving filesystems that control leading to really weird bugs. In addition this tieing of the implemetation of block device caching and the buffer cache has resulted in a much more complicated and limited implementation then necessary. Block devices for example don't need buffer_heads, and it is perfectly reasonable to cache block devices in high memory. To start untangling the worst of this mess this patch introduces a second block device inode for the buffer cache. All buffer cache operations are diverted to that use the new bd_metadata_inode, which keeps the weirdness of the metadata requirements isolated in their own little world. This should enable future cleanups to diverge and simplify the address_space_operations of the buffer cache and block device page cache. As a side effect of this cleanup the current ramdisk code should be safe from dropping pages because we never place any buffer heads on ramdisk pages. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- fs/block_dev.c | 45 ++++++++++++++++++++++++++++++++------------- fs/buffer.c | 17 ++++++++++++----- fs/ext3/dir.c | 2 +- fs/ext4/dir.c | 2 +- fs/fat/inode.c | 2 +- include/linux/fs.h | 1 + 6 files changed, 48 insertions(+), 21 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 379a446..87a5760 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -59,10 +59,12 @@ static sector_t max_block(struct block_device *bdev) /* Kill _all_ buffers and pagecache , dirty or not.. 
*/ static void kill_bdev(struct block_device *bdev) { - if (bdev->bd_inode->i_mapping->nrpages == 0) - return; - invalidate_bh_lrus(); - truncate_inode_pages(bdev->bd_inode->i_mapping, 0); + if (bdev->bd_inode->i_mapping->nrpages) + truncate_inode_pages(bdev->bd_inode->i_mapping, 0); + if (bdev->bd_metadata_inode->i_mapping->nrpages) { + truncate_inode_pages(bdev->bd_metadata_inode->i_mapping, 0); + invalidate_bh_lrus(); + } } int set_blocksize(struct block_device *bdev, int size) @@ -80,6 +82,7 @@ int set_blocksize(struct block_device *bdev, int size) sync_blockdev(bdev); bdev->bd_block_size = size; bdev->bd_inode->i_blkbits = blksize_bits(size); + bdev->bd_metadata_inode->i_blkbits = blksize_bits(size); kill_bdev(bdev); } return 0; @@ -114,7 +117,7 @@ static int blkdev_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh, int create) { - if (iblock >= max_block(I_BDEV(inode))) { + if (iblock >= max_block(inode->i_bdev)) { if (create) return -EIO; @@ -126,7 +129,7 @@ blkdev_get_block(struct inode *inode, sector_t iblock, */ return 0; } - bh->b_bdev = I_BDEV(inode); + bh->b_bdev = inode->i_bdev; bh->b_blocknr = iblock; set_buffer_mapped(bh); return 0; @@ -136,7 +139,7 @@ static int blkdev_get_blocks(struct inode *inode, sector_t iblock, struct buffer_head *bh, int create) { - sector_t end_block = max_block(I_BDEV(inode)); + sector_t end_block = max_block(inode->i_bdev); unsigned long max_blocks = bh->b_size >> inode->i_blkbits; if ((iblock + max_blocks) > end_block) { @@ -152,7 +155,7 @@ blkdev_get_blocks(struct inode *inode, sector_t iblock, } } - bh->b_bdev = I_BDEV(inode); + bh->b_bdev = inode->i_bdev; bh->b_blocknr = iblock; bh->b_size = max_blocks << inode->i_blkbits; if (max_blocks) @@ -167,7 +170,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode), + return blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_bdev, iov, offset, nr_segs, blkdev_get_blocks, NULL); } @@ -244,7 +247,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t pos, unsigned long nr_segs) { struct inode *inode = iocb->ki_filp->f_mapping->host; - unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode))); + unsigned blkbits = blksize_bits(bdev_hardsect_size(inode->i_bdev); unsigned blocksize_mask = (1 << blkbits) - 1; unsigned long seg = 0; /* iov segment iterator */ unsigned long nvec; /* number of bio vec needed */ @@ -292,7 +295,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, /* bio_alloc should not fail with GFP_KERNEL flag */ bio = bio_alloc(GFP_KERNEL, nvec); - bio->bi_bdev = I_BDEV(inode); + bio->bi_bdev = inode->i_bdev; bio->bi_end_io = blk_end_aio; bio->bi_private = iocb; bio->bi_sector = pos >> blkbits; @@ -498,6 +501,8 @@ static void bdev_clear_inode(struct inode *inode) } list_del_init(&bdev->bd_list); spin_unlock(&bdev_lock); + iput(bdev->bd_metadata_inode); + bdev->bd_metadata_inode = NULL; } static const struct super_operations bdev_sops = { @@ -566,7 +571,7 @@ static LIST_HEAD(all_bdevs); struct block_device *bdget(dev_t dev) { struct block_device *bdev; - struct inode *inode; + struct inode *inode, *md_inode; inode = iget5_locked(bd_mnt->mnt_sb, hash(dev), bdev_test, bdev_set, &dev); @@ -574,6 +579,11 @@ struct block_device *bdget(dev_t dev) if (!inode) return NULL; + /* Get an anonymous inode for the filesystem metadata cache */ + md_inode = 
new_inode(bd_mnt->mnt_sb); + if (!md_inode) + return NULL; + bdev = &BDEV_I(inode)->bdev; if (inode->i_state & I_NEW) { @@ -582,12 +592,19 @@ struct block_device *bdget(dev_t dev) bdev->bd_block_size = (1 << inode->i_blkbits); bdev->bd_part_count = 0; bdev->bd_invalidated = 0; + bdev->bd_metadata_inode = md_inode; inode->i_mode = S_IFBLK; inode->i_rdev = dev; inode->i_bdev = bdev; inode->i_data.a_ops = &def_blk_aops; mapping_set_gfp_mask(&inode->i_data, GFP_USER); inode->i_data.backing_dev_info = &default_backing_dev_info; + md_inode->i_mode = S_IFBLK; + md_inode->i_rdev = dev; + md_inode->i_bdev = bdev; + md_inode->i_data.a_ops = &def_blk_aops; + mapping_set_gfp_mask(&md_inode->i_data, GFP_USER); + md_inode->i_data.backing_dev_info = &default_backing_dev_info; spin_lock(&bdev_lock); list_add(&bdev->bd_list, &all_bdevs); spin_unlock(&bdev_lock); @@ -604,7 +621,7 @@ long nr_blockdev_pages(void) long ret = 0; spin_lock(&bdev_lock); list_for_each_entry(bdev, &all_bdevs, bd_list) { - ret += bdev->bd_inode->i_mapping->nrpages; + ret += bdev->bd_metadata_inode->i_mapping->nrpages; } spin_unlock(&bdev_lock); return ret; @@ -1099,6 +1116,7 @@ void bd_set_size(struct block_device *bdev, loff_t size) unsigned bsize = bdev_hardsect_size(bdev); bdev->bd_inode->i_size = size; + bdev->bd_metadata_inode->i_size = size; while (bsize < PAGE_CACHE_SIZE) { if (size & bsize) break; @@ -1106,6 +1124,7 @@ void bd_set_size(struct block_device *bdev, loff_t size) } bdev->bd_block_size = bsize; bdev->bd_inode->i_blkbits = blksize_bits(bsize); + bdev->bd_metadata_inode->i_blkbits = blksize_bits(bsize); } EXPORT_SYMBOL(bd_set_size); diff --git a/fs/buffer.c b/fs/buffer.c index faceb5e..2c044b6 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -166,8 +166,11 @@ int sync_blockdev(struct block_device *bdev) { int ret = 0; - if (bdev) + if (bdev) { ret = filemap_write_and_wait(bdev->bd_inode->i_mapping); + if (!ret) + ret = filemap_write_and_wait(bdev->bd_metadata_inode->i_mapping); + } return ret; } EXPORT_SYMBOL(sync_blockdev); @@ -261,7 +264,7 @@ EXPORT_SYMBOL(thaw_bdev); static struct buffer_head * __find_get_block_slow(struct block_device *bdev, sector_t block) { - struct inode *bd_inode = bdev->bd_inode; + struct inode *bd_inode = bdev->bd_metadata_inode; struct address_space *bd_mapping = bd_inode->i_mapping; struct buffer_head *ret = NULL; pgoff_t index; @@ -347,12 +350,16 @@ out: void invalidate_bdev(struct block_device *bdev) { struct address_space *mapping = bdev->bd_inode->i_mapping; + struct address_space *meta_mapping = bdev->bd_metadata_inode->i_mapping; - if (mapping->nrpages == 0) + if (mapping->nrpages) + invalidate_mapping_pages(mapping, 0, -1); + + if (meta_mapping->nrpages == 0) return; invalidate_bh_lrus(); - invalidate_mapping_pages(mapping, 0, -1); + invalidate_mapping_pages(meta_mapping, 0, -1); } /* @@ -1009,7 +1016,7 @@ static struct page * grow_dev_page(struct block_device *bdev, sector_t block, pgoff_t index, int size) { - struct inode *inode = bdev->bd_inode; + struct inode *inode = bdev->bd_metadata_inode; struct page *page; struct buffer_head *bh; diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c index c2c3491..a46305e 100644 --- a/fs/ext3/dir.c +++ b/fs/ext3/dir.c @@ -140,7 +140,7 @@ static int ext3_readdir(struct file * filp, (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) page_cache_sync_readahead( - sb->s_bdev->bd_inode->i_mapping, + sb->s_bdev->bd_metadata_inode->i_mapping, &filp->f_ra, filp, index, 1); filp->f_ra.prev_pos = (loff_t)index << PAGE_CACHE_SHIFT; diff 
--git a/fs/ext4/dir.c b/fs/ext4/dir.c index e11890a..eaab1db 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -139,7 +139,7 @@ static int ext4_readdir(struct file * filp, (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) page_cache_sync_readahead( - sb->s_bdev->bd_inode->i_mapping, + sb->s_bdev->bd_metadata_inode->i_mapping, &filp->f_ra, filp, index, 1); filp->f_ra.prev_pos = (loff_t)index << PAGE_CACHE_SHIFT; diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 46b8a67..a8485b6 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -1484,7 +1484,7 @@ int fat_flush_inodes(struct super_block *sb, struct inode *i1, struct inode *i2) if (!ret && i2) ret = writeback_inode(i2); if (!ret) { - struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; + struct address_space *mapping = sb->s_bdev->bd_metadata_inode->i_mapping; ret = filemap_flush(mapping); } return ret; diff --git a/include/linux/fs.h b/include/linux/fs.h index f70d52c..aeac9d3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -519,6 +519,7 @@ struct address_space { struct block_device { dev_t bd_dev; /* not a kdev_t - it's a search key */ struct inode * bd_inode; /* will die */ + struct inode * bd_metadata_inode; int bd_openers; struct mutex bd_mutex; /* open/close mutex */ struct semaphore bd_mount_sem; -- 1.5.3.rc6.17.g1911 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 77+ messages in thread
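One point worth noting about the patch above: ordinary users of the buffer-cache API should not need any changes, because __bread()/sb_bread() reach the page cache through __find_get_block_slow() and grow_dev_page(), which the patch redirects to bd_metadata_inode. A hypothetical caller, just to illustrate (example_read_super_block is not a real function):

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Hypothetical illustration: the usual metadata read path is unchanged. */
static int example_read_super_block(struct super_block *sb)
{
	struct buffer_head *bh;

	/* With the patch, the page behind bh comes from
	 * bd_metadata_inode->i_mapping instead of the block device's
	 * own page cache; the caller cannot tell the difference. */
	bh = sb_bread(sb, 1);
	if (!bh)
		return -EIO;
	/* ... interpret bh->b_data ... */
	brelse(bh);
	return 0;
}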
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-18 3:59 ` [RFC][PATCH] block: Isolate the buffer cache in it's own mappings Eric W. Biederman @ 2007-10-18 4:32 ` Andrew Morton 2007-10-19 21:27 ` Eric W. Biederman 2007-10-18 5:10 ` Nick Piggin 1 sibling, 1 reply; 77+ messages in thread From: Andrew Morton @ 2007-10-18 4:32 UTC (permalink / raw) To: Eric W. Biederman Cc: Chris Mason, Christian Borntraeger, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Wed, 17 Oct 2007 21:59:02 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote: > If filesystems care at all they want absolute control over the buffer > cache. Controlling which buffers are dirty and when. Because we > keep the buffer cache in the page cache for the block device we have > not quite been giving filesystems that control leading to really weird > bugs. > > In addition this tieing of the implemetation of block device caching > and the buffer cache has resulted in a much more complicated and > limited implementation then necessary. Block devices for example > don't need buffer_heads, and it is perfectly reasonable to cache > block devices in high memory. > > To start untangling the worst of this mess this patch introduces a > second block device inode for the buffer cache. All buffer cache > operations are diverted to that use the new bd_metadata_inode, which > keeps the weirdness of the metadata requirements isolated in their > own little world. I don't think we little angels want to tread here. There are so many weirdo things out there which will break if we bust the coherence between the fs and /dev/hda1. Online resize, online fs checkers, various local tools which people have hacked up to look at metadata in a live fs, direct-io access to the underlying fs, heaven knows how many boot loader installers, etc. Cerainly I couldn't enumerate tham all. The mere thought of all this scares the crap out of me. I don't actually see what the conceptual problem is with the existing implementation. The buffer_head is a finer-grained view onto the blockdev's pagecache: it provides additional states and additional locking against a finer-grained section of the page. It works well. Yeah, the highmem thing is a bit of a problem (but waning in importance). But we can fix that by teaching individual filesystems about kmap and then tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount time. If anyone cares, which they don't. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-18 4:32 ` Andrew Morton @ 2007-10-19 21:27 ` Eric W. Biederman 2007-10-21 4:24 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-19 21:27 UTC (permalink / raw) To: Andrew Morton Cc: Chris Mason, Christian Borntraeger, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Andrew Morton <akpm@linux-foundation.org> writes: > > I don't think we little angels want to tread here. There are so many > weirdo things out there which will break if we bust the coherence between > the fs and /dev/hda1. We broke coherence between the fs and /dev/hda1 when we introduced the page cache years ago, and weird hacky cases like unmap_underlying_metadata don't change that. Currently only metadata is more or less in sync with the contents of /dev/hda1. > Online resize, online fs checkers, various local > tools which people have hacked up to look at metadata in a live fs, > direct-io access to the underlying fs, heaven knows how many boot loader > installers, etc. Cerainly I couldn't enumerate tham all. Well I took a look at ext3. For online resize all of the writes are done by the fs not by the user space tool. For e2fsck of a read-only filesystem currently we do cache the buffers for the super block and reexamine those blocks when we mount read-only. Which makes my patch by itself unsafe. If however ext3 and anyone else who does things like that were to reread the data and not to merely reexamine the data we should be fine. Fundamentally doing anything like this requires some form of synchronization, and if that synchronization does not exist today there will be bugs. Further decoupling things only makes that requirement clearer. Unfortunately because of things like the ext3 handling of remounting from ro to rw this doesn't fall into the quick trivial fix category :( > I don't actually see what the conceptual problem is with the existing > implementation. The buffer_head is a finer-grained view onto the > blockdev's pagecache: it provides additional states and additional locking > against a finer-grained section of the page. It works well. The buffer_head itself seems to be a reasonable entity. The buffer cache is a monster. It does not follow the ordinary rules of the page cache, making it extremely hard to reason about. Currently in the buffer cache there are buffer_heads we are not allowed to make dirty which hold dirty data. Some filesystems panic the kernel when they notice this. Others like ext3 use a different bit to remember that the buffer is dirty. Because of ordering considerations the buffer cache does not hold a consistent view of what has been scheduled for being written to disk. It instead holds partially complete pages. The only place we should ever clear the dirty bit is just before calling write_page but try_to_free_buffers clears the dirty bit! We have buffers on pages without a mapping! In general the buffer cache violates a primary rule for comprehensible programming having. The buffer cache does not have a clear enough definition that it is clear what things are bugs and what things are features. 99% of the weird strange behavior in rd.c is because of the buffer cache not following the normal rules. > Yeah, the highmem thing is a bit of a problem (but waning in importance). > But we can fix that by teaching individual filesystems about kmap and then > tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount > time. 
If anyone cares, which they don't. This presumes I want to use a filesystem on my block device. Where I would care most is when I am doing things like fsck or mkfs on an unmounted filesystem. Where having buffer_heads is just extra memory pressure slowing things down, and similarly for highmem. We have to sync the filesystem before mounting but we have to do that anyway for all of the non metadata so that isn't new. Anyway my main objective was to get a good grasp on the buffer cache and the mm layer again. Which I now more or less have. While I think the buffer cache needs a bunch of tender loving care before it becomes sane I have other projects that I intend to complete before I try anything in this area. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
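The behaviour Eric is objecting to lives in try_to_free_buffers(); a condensed sketch of the 2.6.23-era logic in fs/buffer.c (simplified, not a verbatim copy) shows where the page dirty bit gets cleared:

#include <linux/buffer_head.h>
#include <linux/pagemap.h>

/* Condensed sketch of fs/buffer.c::try_to_free_buffers(), 2.6.23 era. */
int try_to_free_buffers(struct page *page)
{
	struct address_space * const mapping = page->mapping;
	struct buffer_head *buffers_to_free = NULL;
	int ret;

	BUG_ON(!PageLocked(page));
	if (PageWriteback(page))
		return 0;

	spin_lock(&mapping->private_lock);
	/* drop_buffers() only succeeds if every buffer is clean and unused */
	ret = drop_buffers(page, &buffers_to_free);

	/*
	 * Filesystems such as ext3 write their buffers by hand, so a dirty
	 * page can legitimately sit over clean buffers; the page is cleaned
	 * here so the VM sees that the IO has happened.  For rd.c no IO
	 * ever happened, so the now-clean page is later dropped and the
	 * ramdisk contents read back as zeroes.
	 */
	if (ret)
		cancel_dirty_page(page, PAGE_CACHE_SIZE);
	spin_unlock(&mapping->private_lock);

	/* ... free the detached buffer_heads and clear page->private ... */
	return ret;
}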
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-19 21:27 ` Eric W. Biederman @ 2007-10-21 4:24 ` Nick Piggin 2007-10-21 4:53 ` Eric W. Biederman 2007-10-22 0:15 ` David Chinner 0 siblings, 2 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-21 4:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Chris Mason, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Saturday 20 October 2007 07:27, Eric W. Biederman wrote: > Andrew Morton <akpm@linux-foundation.org> writes: > > I don't think we little angels want to tread here. There are so many > > weirdo things out there which will break if we bust the coherence between > > the fs and /dev/hda1. > > We broke coherence between the fs and /dev/hda1 when we introduced > the page cache years ago, Not for metadata. And I wouldn't expect many filesystem analysis tools to care about data. > and weird hacky cases like > unmap_underlying_metadata don't change that. unmap_underlying_metadata isn't about raw block device access at all, though (if you write to the filesystem via the blockdevice when it isn't expecting it, it's going to blow up regardless). > Currently only > metadata is more or less in sync with the contents of /dev/hda1. It either is or it isn't, right? And it is, isn't it? (at least for the common filesystems). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-21 4:24 ` Nick Piggin @ 2007-10-21 4:53 ` Eric W. Biederman 2007-10-21 5:36 ` Nick Piggin 2007-10-22 0:15 ` David Chinner 1 sibling, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 4:53 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Chris Mason, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote: >> Andrew Morton <akpm@linux-foundation.org> writes: >> > I don't think we little angels want to tread here. There are so many >> > weirdo things out there which will break if we bust the coherence between >> > the fs and /dev/hda1. >> >> We broke coherence between the fs and /dev/hda1 when we introduced >> the page cache years ago, > > Not for metadata. And I wouldn't expect many filesystem analysis > tools to care about data. Well tools like dump certainly weren't happy when we made the change. >> and weird hacky cases like >> unmap_underlying_metadata don't change that. > > unmap_underlying_metadata isn't about raw block device access at > all, though (if you write to the filesystem via the blockdevice > when it isn't expecting it, it's going to blow up regardless). Well my goal with separating things is so that we could decouple two pieces of code that have different usage scenarios, and where supporting both scenarios simultaneously appears to me to needlessly complicate the code. Added to that we could then tune the two pieces of code for their different users. >> Currently only >> metadata is more or less in sync with the contents of /dev/hda1. > > It either is or it isn't, right? And it is, isn't it? (at least > for the common filesystems). ext2 doesn't store directories in the buffer cache. Journaling filesystems and filesystems that do ordered writes game the buffer cache. Putting in data that should not yet be written to disk. That gaming is where reiserfs goes BUG and where JBD moves the dirty bit to a different dirty bit. So as far as I can tell what is in the buffer cache is not really in sync with what should be on disk at any given movement except when everything is clean. My suspicion is that actually reading from disk is likely to give a more coherent view of things. Because there at least we have the writes as they are expected to be seen by fsck to recover the data, and a snapshot there should at least be recoverable. Whereas a snapshot provides not such guarantees. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
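The "moves the dirty bit to a different dirty bit" remark refers to the jbd pattern of taking over a buffer's dirty state when it joins a transaction. Roughly (a simplified sketch using the helpers declared in include/linux/jbd.h; the real logic lives in fs/jbd/transaction.c and the example_* function names are made up):

#include <linux/jbd.h>

/* When a buffer comes under journal control, its normal dirty bit is
 * transferred to the jbd-private BH_JBDDirty bit so that ordinary
 * writeback cannot write it out before the transaction commits. */
static void example_take_over_dirty_bit(struct buffer_head *bh)
{
	if (test_clear_buffer_dirty(bh))
		set_buffer_jbddirty(bh);
}

/* When the journal is done with the buffer, the dirty bit is handed
 * back so the data reaches disk through the normal writeback paths. */
static void example_hand_back_dirty_bit(struct buffer_head *bh)
{
	if (test_clear_buffer_jbddirty(bh))
		mark_buffer_dirty(bh);
}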
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-21 4:53 ` Eric W. Biederman @ 2007-10-21 5:36 ` Nick Piggin 2007-10-21 7:09 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-21 5:36 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Chris Mason, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Sunday 21 October 2007 14:53, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote: > >> Andrew Morton <akpm@linux-foundation.org> writes: > >> > I don't think we little angels want to tread here. There are so many > >> > weirdo things out there which will break if we bust the coherence > >> > between the fs and /dev/hda1. > >> > >> We broke coherence between the fs and /dev/hda1 when we introduced > >> the page cache years ago, > > > > Not for metadata. And I wouldn't expect many filesystem analysis > > tools to care about data. > > Well tools like dump certainly weren't happy when we made the change. Doesn't that give you any suspicion that other tools mightn't be happy if we make this change, then? > >> and weird hacky cases like > >> unmap_underlying_metadata don't change that. > > > > unmap_underlying_metadata isn't about raw block device access at > > all, though (if you write to the filesystem via the blockdevice > > when it isn't expecting it, it's going to blow up regardless). > > Well my goal with separating things is so that we could decouple two > pieces of code that have different usage scenarios, and where > supporting both scenarios simultaneously appears to me to needlessly > complicate the code. > > Added to that we could then tune the two pieces of code for their > different users. I don't see too much complication from it. If we can actually simplify things or make useful tuning, maybe it will be worth doing. > >> Currently only > >> metadata is more or less in sync with the contents of /dev/hda1. > > > > It either is or it isn't, right? And it is, isn't it? (at least > > for the common filesystems). > > ext2 doesn't store directories in the buffer cache. Oh that's what you mean. OK, agreed there. But for the filesystems and types of metadata that can now expect to have coherency, doing this will break that expectation. Again, I have no opinions either way on whether we should do that in the long run. But doing it as a kneejerk response to braindead rd.c code is wrong because of what *might* go wrong and we don't know about. > Journaling filesystems and filesystems that do ordered writes > game the buffer cache. Putting in data that should not yet > be written to disk. That gaming is where reiserfs goes BUG > and where JBD moves the dirty bit to a different dirty bit. Filesystems really want better control of writeback, I think. This isn't really a consequence of the unified blockdev pagecache / metadata buffer cache, it is just that most of the important things they do are with metadata. If they have their own metadata inode, then they'll need to game the cache for it, or the writeback code for that inode somehow too. > So as far as I can tell what is in the buffer cache is not really > in sync with what should be on disk at any given movement except > when everything is clean. Naturally. It is a writeback cache. > My suspicion is that actually reading from disk is likely to > give a more coherent view of things. 
Because there at least > we have the writes as they are expected to be seen by fsck > to recover the data, and a snapshot there should at least > be recoverable. Whereas a snapshot provides not such guarantees. ext3 fsck I don't think is supposed to be run under a read/write filesystem, so it's going to explode if you do that regardless. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-21 5:36 ` Nick Piggin @ 2007-10-21 7:09 ` Eric W. Biederman 0 siblings, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 7:09 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Chris Mason, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Sunday 21 October 2007 14:53, Eric W. Biederman wrote: >> Nick Piggin <nickpiggin@yahoo.com.au> writes: >> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote: >> >> Andrew Morton <akpm@linux-foundation.org> writes: >> >> > I don't think we little angels want to tread here. There are so many >> >> > weirdo things out there which will break if we bust the coherence >> >> > between the fs and /dev/hda1. >> >> >> >> We broke coherence between the fs and /dev/hda1 when we introduced >> >> the page cache years ago, >> > >> > Not for metadata. And I wouldn't expect many filesystem analysis >> > tools to care about data. >> >> Well tools like dump certainly weren't happy when we made the change. > > Doesn't that give you any suspicion that other tools mightn't > be happy if we make this change, then? I read a representative sample of the relevant tools before replying to Andrew. >> >> and weird hacky cases like >> >> unmap_underlying_metadata don't change that. >> > >> > unmap_underlying_metadata isn't about raw block device access at >> > all, though (if you write to the filesystem via the blockdevice >> > when it isn't expecting it, it's going to blow up regardless). >> >> Well my goal with separating things is so that we could decouple two >> pieces of code that have different usage scenarios, and where >> supporting both scenarios simultaneously appears to me to needlessly >> complicate the code. >> >> Added to that we could then tune the two pieces of code for their >> different users. > > I don't see too much complication from it. If we can actually > simplify things or make useful tuning, maybe it will be worth > doing. That was my feeling that we could simplify things. The block layer page cache operations certainly. I know in the filesystems that use the buffer cache like reiser and JBD they could stop worrying about the buffers becoming mysteriously dirty. Beyond that I think there is a lot of opportunity I just haven't looked much yet. >> >> Currently only >> >> metadata is more or less in sync with the contents of /dev/hda1. >> > >> > It either is or it isn't, right? And it is, isn't it? (at least >> > for the common filesystems). >> >> ext2 doesn't store directories in the buffer cache. > > Oh that's what you mean. OK, agreed there. But for the filesystems > and types of metadata that can now expect to have coherency, doing > this will break that expectation. > > Again, I have no opinions either way on whether we should do that > in the long run. But doing it as a kneejerk response to braindead > rd.c code is wrong because of what *might* go wrong and we don't > know about. The rd.c code is perfectly valid if someone wasn't forcing buffer heads on it's pages. It is a conflict of expectations. Regardless I didn't do it as a kneejerk and I don't think that patch should be merged at this time. I proposed it because as I see it that starts untangling the mess that is the buffer cache. rd.c was just my entry point into understanding how all of those pieces work. 
I was doing my best to completely explore my options and what the code was doing before settling on the fix for rd.c >> Journaling filesystems and filesystems that do ordered writes >> game the buffer cache. Putting in data that should not yet >> be written to disk. That gaming is where reiserfs goes BUG >> and where JBD moves the dirty bit to a different dirty bit. > > Filesystems really want better control of writeback, I think. > This isn't really a consequence of the unified blockdev pagecache > / metadata buffer cache, it is just that most of the important > things they do are with metadata. Yes. > If they have their own metadata inode, then they'll need to game > the cache for it, or the writeback code for that inode somehow > too. Yes. Although they will at least get the guarantee that no one else is dirtying their pages at strange times. >> So as far as I can tell what is in the buffer cache is not really >> in sync with what should be on disk at any given movement except >> when everything is clean. > > Naturally. It is a writeback cache. Not that so much as the order in which things go into the cache does not match the order the blocks go to disk. >> My suspicion is that actually reading from disk is likely to >> give a more coherent view of things. Because there at least >> we have the writes as they are expected to be seen by fsck >> to recover the data, and a snapshot there should at least >> be recoverable. Whereas a snapshot provides not such guarantees. > > ext3 fsck I don't think is supposed to be run under a read/write > filesystem, so it's going to explode if you do that regardless. Yes. I was thinking of dump or something like that here. Where we simply read out the data and try to make some coherent sense of it. If we see a version of the metadata that points to things that have not been finished yet or is in the process of being written to that could be a problem. When going through the buffer cache as far as I can tell people don't use little things like page lock when writing data so the page cache reads can potentially race with what should be atomic writes. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-21 4:24 ` Nick Piggin 2007-10-21 4:53 ` Eric W. Biederman @ 2007-10-22 0:15 ` David Chinner 1 sibling, 0 replies; 77+ messages in thread From: David Chinner @ 2007-10-22 0:15 UTC (permalink / raw) To: Nick Piggin Cc: Eric W. Biederman, Andrew Morton, Chris Mason, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Sun, Oct 21, 2007 at 02:24:46PM +1000, Nick Piggin wrote: > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote: > > Currently only > > metadata is more or less in sync with the contents of /dev/hda1. > > It either is or it isn't, right? And it is, isn't it? (at least > for the common filesystems). It is not true for XFS - it's metadata is not in sync with /dev/<block> at all as all the cached metadata is kept in a different address space to the raw block device. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-18 3:59 ` [RFC][PATCH] block: Isolate the buffer cache in it's own mappings Eric W. Biederman 2007-10-18 4:32 ` Andrew Morton @ 2007-10-18 5:10 ` Nick Piggin 2007-10-19 21:35 ` Eric W. Biederman 1 sibling, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-18 5:10 UTC (permalink / raw) To: Eric W. Biederman Cc: Chris Mason, Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Thursday 18 October 2007 13:59, Eric W. Biederman wrote: > If filesystems care at all they want absolute control over the buffer > cache. Controlling which buffers are dirty and when. Because we > keep the buffer cache in the page cache for the block device we have > not quite been giving filesystems that control leading to really weird > bugs. Mmm. Like I said, when a live filesystem is mounted on a bdev, it isn't like you want userspace to go dancing around on it without knowing exactly what it is doing. The kernel more or less does the right thing here with respect to the *state* of the data[*] (that is, buffer heads and pagecache). It's when you actually start changing the data itself around is when you'll blow up the filesystem. [*] The ramdisk code is simply buggy, right? (and not the buffer cache) The idea of your patch in theory is OK, but Andrew raises valid points about potential coherency problems, I think. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings. 2007-10-18 5:10 ` Nick Piggin @ 2007-10-19 21:35 ` Eric W. Biederman 0 siblings, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-19 21:35 UTC (permalink / raw) To: Nick Piggin Cc: Chris Mason, Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > > [*] The ramdisk code is simply buggy, right? (and not the buffer > cache) From the perspective of the ramdisk it expects the buffer cache to simply be a user of the page cache, and thus the buffer cache is horribly buggy. From the perspective of the buffer cache, it is still the block device cache in the kernel and the way it works sets the rules for how caching should be done in the kernel, and it doesn't give any mind to this new-fangled page cache thingy. > The idea of your patch in theory is OK, but Andrew raises valid > points about potential coherency problems, I think. There are certainly implementation issues in various filesystems to overcome before remounting read-write after doing a fsck on a read-only filesystem will work as it does today. So my patch is incomplete. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread

* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 17:57 ` Eric W. Biederman 2007-10-17 19:14 ` Chris Mason @ 2007-10-17 21:48 ` Christian Borntraeger 2007-10-17 22:22 ` Eric W. Biederman 1 sibling, 1 reply; 77+ messages in thread From: Christian Borntraeger @ 2007-10-17 21:48 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Am Mittwoch, 17. Oktober 2007 schrieb Eric W. Biederman: > Did you have both of my changes applied? > To init_page_buffer() and to the ramdisk_set_dirty_page? Yes, I removed my patch and applied both patches from you. Christian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 21:48 ` [PATCH] rd: Mark ramdisk buffers heads dirty Christian Borntraeger @ 2007-10-17 22:22 ` Eric W. Biederman 2007-10-18 9:26 ` Christian Borntraeger 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 22:22 UTC (permalink / raw) To: Christian Borntraeger Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Christian Borntraeger <borntraeger@de.ibm.com> writes: > Am Mittwoch, 17. Oktober 2007 schrieb Eric W. Biederman: >> Did you have both of my changes applied? >> To init_page_buffer() and to the ramdisk_set_dirty_page? > > Yes, I removed my patch and applied both patches from you. Thanks. Grr. Inconsistent rules on a core piece of infrastructure. It looks like that if there is any trivial/minimal fix it is based on your patch suppressing try_to_free_buffers. Ugh. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-17 22:22 ` Eric W. Biederman @ 2007-10-18 9:26 ` Christian Borntraeger 2007-10-19 22:46 ` Eric W. Biederman 2007-10-19 22:51 ` [PATCH] rd: Use a private inode for backing storage Eric W. Biederman 0 siblings, 2 replies; 77+ messages in thread From: Christian Borntraeger @ 2007-10-18 9:26 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Am Donnerstag, 18. Oktober 2007 schrieb Eric W. Biederman: > Grr. Inconsistent rules on a core piece of infrastructure. > It looks like that if there is any trivial/minimal fix it > is based on your patch suppressing try_to_free_buffers. Ugh. > > Eric Ok. What do you think about having my patch for 2.6.23 stable and for 2.6.24, and doing a nicer fix (an rd rewrite, for example) post 2.6.24? Christian -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-18 9:26 ` Christian Borntraeger @ 2007-10-19 22:46 ` Eric W. Biederman 0 siblings, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-19 22:46 UTC (permalink / raw) To: Christian Borntraeger Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Christian Borntraeger <borntraeger@de.ibm.com> writes: > Am Donnerstag, 18. Oktober 2007 schrieb Eric W. Biederman: >> Grr. Inconsistent rules on a core piece of infrastructure. >> It looks like that if there is any trivial/minimal fix it >> is based on your patch suppressing try_to_free_buffers. Ugh. >> >> Eric > > Ok. What do you think about having my patch for 2.6.23 stable and for 2.6.24, > and doing a nicer fix (an rd rewrite, for example) post 2.6.24? Looking at it. If we don't get carried away, using our own private inode is barely more difficult than stomping on releasepage, and in a number of ways a whole lot more subtle. At least for 2.6.24 I think it makes a sane fix, and quite possibly for a backport as well. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* [PATCH] rd: Use a private inode for backing storage 2007-10-18 9:26 ` Christian Borntraeger 2007-10-19 22:46 ` Eric W. Biederman @ 2007-10-19 22:51 ` Eric W. Biederman 2007-10-21 4:28 ` Nick Piggin 1 sibling, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-19 22:51 UTC (permalink / raw) To: Christian Borntraeger Cc: Andrew Morton, Nick Piggin, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Currently the ramdisk tries to keep the block device page cache pages from being marked clean and dropped from memory. That fails for filesystems that use the buffer cache because the buffer cache is not an ordinary buffer cache user and depends on the generic block device address space operations being used. To fix all of those associated problems this patch allocates a private inode to store the ramdisk pages in. The result is slightly more memory used for metadata, an extra copying when reading or writing directly to the block device, and changing the software block size does not loose the contents of the ramdisk. Most of all this ensures we don't loose data during normal use of the ramdisk. I deliberately avoid the cleanup that is now possible because this patch is intended to be a bug fix. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- drivers/block/rd.c | 19 ++++++++++++++++--- 1 files changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/block/rd.c b/drivers/block/rd.c index 08176d2..a52f153 100644 --- a/drivers/block/rd.c +++ b/drivers/block/rd.c @@ -62,6 +62,7 @@ /* Various static variables go here. Most are used only in the RAM disk code. */ +static struct inode *rd_inode[CONFIG_BLK_DEV_RAM_COUNT]; static struct gendisk *rd_disks[CONFIG_BLK_DEV_RAM_COUNT]; static struct block_device *rd_bdev[CONFIG_BLK_DEV_RAM_COUNT];/* Protected device data */ static struct request_queue *rd_queue[CONFIG_BLK_DEV_RAM_COUNT]; @@ -267,7 +268,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, static int rd_make_request(struct request_queue *q, struct bio *bio) { struct block_device *bdev = bio->bi_bdev; - struct address_space * mapping = bdev->bd_inode->i_mapping; + struct address_space * mapping = rd_inode[MINOR(bdev->bd_dev)]->i_mapping; sector_t sector = bio->bi_sector; unsigned long len = bio->bi_size >> 9; int rw = bio_data_dir(bio); @@ -312,6 +313,7 @@ static int rd_ioctl(struct inode *inode, struct file *file, mutex_lock(&bdev->bd_mutex); if (bdev->bd_openers <= 2) { truncate_inode_pages(bdev->bd_inode->i_mapping, 0); + truncate_inode_pages(rd_inode[iminor(inode)]->i_mapping, 0); error = 0; } mutex_unlock(&bdev->bd_mutex); @@ -344,20 +346,30 @@ static int rd_open(struct inode *inode, struct file *filp) unsigned unit = iminor(inode); if (rd_bdev[unit] == NULL) { + struct inode *ramdisk_inode; struct block_device *bdev = inode->i_bdev; struct address_space *mapping; unsigned bsize; gfp_t gfp_mask; + ramdisk_inode = new_inode(bdev->bd_inode->i_sb); + if (!ramdisk_inode) + return -ENOMEM; + inode = igrab(bdev->bd_inode); rd_bdev[unit] = bdev; + rd_inode[unit] = ramdisk_inode; bdev->bd_openers++; bsize = bdev_hardsect_size(bdev); bdev->bd_block_size = bsize; inode->i_blkbits = blksize_bits(bsize); inode->i_size = get_capacity(bdev->bd_disk)<<9; - mapping = inode->i_mapping; + ramdisk_inode->i_mode = S_IFBLK; + ramdisk_inode->i_bdev = bdev; + ramdisk_inode->i_rdev = bdev->bd_dev; + + mapping = ramdisk_inode->i_mapping; mapping->a_ops = &ramdisk_aops; mapping->backing_dev_info = &rd_backing_dev_info; 
bdev->bd_inode_backing_dev_info = &rd_file_backing_dev_info; @@ -377,7 +389,7 @@ static int rd_open(struct inode *inode, struct file *filp) * for the page allocator emergency pools to keep the ramdisk * driver happy. */ - gfp_mask = mapping_gfp_mask(mapping); + gfp_mask = GFP_USER; gfp_mask &= ~(__GFP_FS|__GFP_IO); gfp_mask |= __GFP_HIGH; mapping_set_gfp_mask(mapping, gfp_mask); @@ -409,6 +421,7 @@ static void __exit rd_cleanup(void) del_gendisk(rd_disks[i]); put_disk(rd_disks[i]); blk_cleanup_queue(rd_queue[i]); + iput(rd_inode[i]); } unregister_blkdev(RAMDISK_MAJOR, "ramdisk"); -- 1.5.3.rc6.17.g1911 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 77+ messages in thread
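With the patch above, the only code that touches the private inode's pages is rd_blkdev_pagecache_IO() (reached via rd_make_request) and the BLKFLSBUF truncate in rd_ioctl(). The write half of that path is roughly the following (a condensed sketch, not the actual rd.c function; rd_copy_to_backing_store is a made-up name for illustration):

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Sketch of writing one chunk into the private mapping.  Because the
 * buffer cache never sees these pages, nothing can "clean" them behind
 * the driver's back.
 */
static int rd_copy_to_backing_store(struct address_space *mapping,
				    sector_t sector, const char *src, int size)
{
	pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
	unsigned int offset = (sector << 9) & ~PAGE_CACHE_MASK;
	struct page *page;
	char *dst;

	page = grab_cache_page(mapping, index);		/* private mapping */
	if (!page)
		return -ENOMEM;

	dst = kmap_atomic(page, KM_USER0);
	memcpy(dst + offset, src, size);
	kunmap_atomic(dst, KM_USER0);

	SetPageUptodate(page);
	SetPageDirty(page);		/* still "pinned" by staying dirty */
	unlock_page(page);
	page_cache_release(page);
	return 0;
}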
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-19 22:51 ` [PATCH] rd: Use a private inode for backing storage Eric W. Biederman @ 2007-10-21 4:28 ` Nick Piggin 2007-10-21 5:10 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-21 4:28 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Saturday 20 October 2007 08:51, Eric W. Biederman wrote: > Currently the ramdisk tries to keep the block device page cache pages > from being marked clean and dropped from memory. That fails for > filesystems that use the buffer cache because the buffer cache is not > an ordinary buffer cache user and depends on the generic block device > address space operations being used. > > To fix all of those associated problems this patch allocates a private > inode to store the ramdisk pages in. > > The result is slightly more memory used for metadata, an extra copying > when reading or writing directly to the block device, and changing the > software block size does not loose the contents of the ramdisk. Most > of all this ensures we don't loose data during normal use of the > ramdisk. > > I deliberately avoid the cleanup that is now possible because this > patch is intended to be a bug fix. This just breaks coherency again like the last patch. That's a really bad idea especially for stable (even if nothing actually was to break, we'd likely never know about it anyway). Christian's patch should go upstream and into stable. For 2.6.25-6, my rewrite should just replace what's there. Using address spaces to hold the ramdisk pages just confuses the issue even if they *aren't* actually wired up to the vfs at all. Saving 20 lines is not a good reason to use them. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 4:28 ` Nick Piggin @ 2007-10-21 5:10 ` Eric W. Biederman 2007-10-21 5:24 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 5:10 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Saturday 20 October 2007 08:51, Eric W. Biederman wrote: >> Currently the ramdisk tries to keep the block device page cache pages >> from being marked clean and dropped from memory. That fails for >> filesystems that use the buffer cache because the buffer cache is not >> an ordinary buffer cache user and depends on the generic block device >> address space operations being used. >> >> To fix all of those associated problems this patch allocates a private >> inode to store the ramdisk pages in. >> >> The result is slightly more memory used for metadata, an extra copying >> when reading or writing directly to the block device, and changing the >> software block size does not loose the contents of the ramdisk. Most >> of all this ensures we don't loose data during normal use of the >> ramdisk. >> >> I deliberately avoid the cleanup that is now possible because this >> patch is intended to be a bug fix. > > This just breaks coherency again like the last patch. That's a > really bad idea especially for stable (even if nothing actually > was to break, we'd likely never know about it anyway). Not a chance. The only way we make it to that inode is through block device I/O so it lives at exactly the same level in the hierarchy as a real block device. My patch is the considered rewrite boiled down to it's essentials and made a trivial patch. It fundamentally fixes the problem, and doesn't attempt to reconcile the incompatible expectations of the ramdisk code and the buffer cache. > Christian's patch should go upstream and into stable. For 2.6.25-6, > my rewrite should just replace what's there. Using address spaces > to hold the ramdisk pages just confuses the issue even if they > *aren't* actually wired up to the vfs at all. Saving 20 lines is > not a good reason to use them. Well is more like saving 100 lines. Not having to reexamine complicated infrastructure code and doing things the same way ramfs is. I think that combination is a good reason. Especially since I can do with a 16 line patch as I just demonstrated. It is a solid and simple incremental change. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 5:10 ` Eric W. Biederman @ 2007-10-21 5:24 ` Nick Piggin 2007-10-21 6:48 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-21 5:24 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Sunday 21 October 2007 15:10, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Saturday 20 October 2007 08:51, Eric W. Biederman wrote: > >> Currently the ramdisk tries to keep the block device page cache pages > >> from being marked clean and dropped from memory. That fails for > >> filesystems that use the buffer cache because the buffer cache is not > >> an ordinary buffer cache user and depends on the generic block device > >> address space operations being used. > >> > >> To fix all of those associated problems this patch allocates a private > >> inode to store the ramdisk pages in. > >> > >> The result is slightly more memory used for metadata, an extra copying > >> when reading or writing directly to the block device, and changing the > >> software block size does not loose the contents of the ramdisk. Most > >> of all this ensures we don't loose data during normal use of the > >> ramdisk. > >> > >> I deliberately avoid the cleanup that is now possible because this > >> patch is intended to be a bug fix. > > > > This just breaks coherency again like the last patch. That's a > > really bad idea especially for stable (even if nothing actually > > was to break, we'd likely never know about it anyway). > > Not a chance. Yes it does. It is exactly breaking the coherency between block device and filesystem metadata coherency that Andrew cared about. Whether or not that matters, that is a much bigger conceptual change than simply using slightly more (reclaimable) memory in some situations that my patch does. If you want to try convincing people to break that coherency, fine, but it has to be done consistently and everywhere rather than for a special case in rd.c. > The only way we make it to that inode is through block > device I/O so it lives at exactly the same level in the hierarchy as > a real block device. No, it doesn't. A real block device driver does have its own buffer cache as it's backing store. It doesn't know about readpage or writepage or set_page_dirty or buffers or pagecache. > My patch is the considered rewrite boiled down > to it's essentials and made a trivial patch. What's the considered rewrite here? The rewrite I posted is the only one so far that's come up that I would consider [worthy], while these patches are just more of the same wrongness. > It fundamentally fixes the problem, and doesn't attempt to reconcile > the incompatible expectations of the ramdisk code and the buffer cache. It fixes the bug in question, but not because it makes any fundamental improvement to the conceptual ickyness of rd.c. I don't know what you mean about attempting to reconcile the incompatible [stuff]? > > Christian's patch should go upstream and into stable. For 2.6.25-6, > > my rewrite should just replace what's there. Using address spaces > > to hold the ramdisk pages just confuses the issue even if they > > *aren't* actually wired up to the vfs at all. Saving 20 lines is > > not a good reason to use them. > > Well is more like saving 100 lines. Not having to reexamine complicated > infrastructure code and doing things the same way ramfs is. 
Using radix_tree_insert instead of add_to_page_cache is hardly complicated. If anything it is simpler because it isn't actually confusing the issues and it is much better fit with real block device drivers, which is actually the thing that is familiar to block device driver maintainers. > I think > that combination is a good reason. Especially since I can do with a > 16 line patch as I just demonstrated. It is a solid and simple > incremental change. My opinion is that it is a much bigger behavioural change because it results in incoherent buffer cache / blockdev cache, and it results in code which is still ugly and conceptually wrong. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
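For comparison, the radix-tree approach Nick is describing looks roughly like the following (a sketch only, with hypothetical names; not his actual rewrite): the pages live in a driver-private radix tree, outside any address_space, so neither the buffer cache nor the VM ever sees them.

#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/gfp.h>
#include <linux/mm.h>

struct rd_dev {
	spinlock_t		lock;		/* protects the tree */
	struct radix_tree_root	pages;		/* INIT_RADIX_TREE(&dev->pages, GFP_ATOMIC) at setup */
};

/* Find the page backing 'sector', allocating and inserting it on demand. */
static struct page *rd_lookup_or_insert_page(struct rd_dev *dev, sector_t sector)
{
	pgoff_t idx = sector >> (PAGE_SHIFT - 9);
	struct page *page;

	spin_lock(&dev->lock);
	page = radix_tree_lookup(&dev->pages, idx);
	spin_unlock(&dev->lock);
	if (page)
		return page;

	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
	if (!page)
		return NULL;

	if (radix_tree_preload(GFP_NOIO)) {
		__free_page(page);
		return NULL;
	}
	spin_lock(&dev->lock);
	if (radix_tree_insert(&dev->pages, idx, page)) {
		/* lost a race: someone else inserted this index first */
		__free_page(page);
		page = radix_tree_lookup(&dev->pages, idx);
	}
	spin_unlock(&dev->lock);
	radix_tree_preload_end();

	return page;
}

The trade-off, as argued in this thread, is that such a backing store has no coherency at all with any view of /dev/ram0 through the block device's page cache or buffer cache.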
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 5:24 ` Nick Piggin @ 2007-10-21 6:48 ` Eric W. Biederman 2007-10-21 7:28 ` Christian Borntraeger 2007-10-21 9:39 ` Nick Piggin 0 siblings, 2 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 6:48 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > Yes it does. It is exactly breaking the coherency between block > device and filesystem metadata coherency that Andrew cared about. > Whether or not that matters, that is a much bigger conceptual > change than simply using slightly more (reclaimable) memory in > some situations that my patch does. > > If you want to try convincing people to break that coherency, > fine, but it has to be done consistently and everywhere rather than > for a special case in rd.c. Nick. Reread the patch. The only thing your arguments have established for me is that this patch is not obviously correct. Which makes it ineligible for a back port. Frankly I suspect the whole issue is too subtle and rare to make any backport make any sense. My apologies Christian. >> The only way we make it to that inode is through block >> device I/O so it lives at exactly the same level in the hierarchy as >> a real block device. > > No, it doesn't. A real block device driver does have its own > buffer cache as it's backing store. It doesn't know about > readpage or writepage or set_page_dirty or buffers or pagecache. Well those pages are only accessed through rd_blkdev_pagecache_IO and rd_ioctl. The address space operations can (after my patch) be deleted or be replaced by their generic versions. I just didn't take that step because it was an unnecessary change and I wanted the minimal change for a backport. >> My patch is the considered rewrite boiled down >> to it's essentials and made a trivial patch. > > What's the considered rewrite here? The rewrite I posted is the > only one so far that's come up that I would consider [worthy], > while these patches are just more of the same wrongness. Well it looks like you were blind when you read the patch. Because the semantics between the two are almost identical, except I managed to implement BLKFLSBUF in a backwards compatible way by flushing both the buffer cache and my private cache. You failed to flush the buffer cache in your implementation. Yes. I use an inode 99% for its mapping and the mapping 99% for its radix_tree. But having truncate_inode_pages and grab_cache_page continue to work sure is convenient. I certainly think it makes it a lot simpler to audit the code to change just one thing at a time (the backing store) than to rip out and replace everything and then try and prove that the two patches are equivalent. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 6:48 ` Eric W. Biederman @ 2007-10-21 7:28 ` Christian Borntraeger 2007-10-21 8:23 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread
From: Christian Borntraeger @ 2007-10-21 7:28 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Nick Piggin, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable

On Sunday, 21 October 2007, Eric W. Biederman wrote:
> Nick. Reread the patch. The only thing your arguments have
> established for me is that this patch is not obviously correct. Which
> makes it ineligible for a back port. Frankly I suspect the whole
> issue is to subtle and rare to make any backport make any sense. My
> apologies Christian.

About being rare: when I force the VM to be more aggressive about reclaiming buffers by using the following patch:

--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3225,7 +3225,7 @@ void __init buffer_init(void)
 	 * Limit the bh occupancy to 10% of ZONE_NORMAL
 	 */
 	nrpages = (nr_free_buffer_pages() * 10) / 100;
-	max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
+	max_buffer_heads = 0;
 	hotcpu_notifier(buffer_cpu_notify, 0);
 }

I can actually cause data corruption within seconds. So I think the problem is real enough to be worth fixing.

I still don't fully understand what issues you have with my patch.
- it obviously fixes the problem
- I am not aware of any regression it introduces
- it's small

One concern you had was the fact that buffer heads are out of sync with struct pages. Testing your first patch revealed that this is actually needed by reiserfs - and maybe others. I can also see that my patch looks a bit like a bandaid that cobbles the rd pieces together.

Is there anything else that makes my patch unmergeable in your opinion?

Christian
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 7:28 ` Christian Borntraeger @ 2007-10-21 8:23 ` Eric W. Biederman 2007-10-21 9:56 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 8:23 UTC (permalink / raw) To: Christian Borntraeger Cc: Nick Piggin, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Christian Borntraeger <borntraeger@de.ibm.com> writes: > Am Sonntag, 21. Oktober 2007 schrieb Eric W. Biederman: >> Nick. Reread the patch. The only thing your arguments have >> established for me is that this patch is not obviously correct. Which >> makes it ineligible for a back port. Frankly I suspect the whole >> issue is to subtle and rare to make any backport make any sense. My >> apologies Christian. > > About being rare, when I force the VM to be more aggressive reclaiming buffer > by using the following patch: > --- linux-2.6.orig/fs/buffer.c > +++ linux-2.6/fs/buffer.c > @@ -3225,7 +3225,7 @@ void __init buffer_init(void) > * Limit the bh occupancy to 10% of ZONE_NORMAL > */ > nrpages = (nr_free_buffer_pages() * 10) / 100; > - max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); > + max_buffer_heads = 0; > hotcpu_notifier(buffer_cpu_notify, 0); > } > > I can actually cause data corruption within some seconds. So I think the > problem is real enough to be worth fixing. Let me put it another way. Looking at /proc/slabinfo I can get 37 buffer_heads per page. I can allocate 10% of memory in buffer_heads before we start to reclaim them. So it requires just over 3.7 buffer_heads on very page of low memory to even trigger this case. That is a large 1k filesystem or a weird sized partition, that we have written to directly. That makes this condition very rare in practice without your patch. Especially since even after we reach the above condition we have to have enough vm pressure to find a page with clean buffer heads that is dirty in the ramdisk. While it can be done deterministically usually it is pretty hard to trigger and pretty easy to work around by simply using partition sizes that are a multiple of 4k and 4k block sized filesystems. > I still dont fully understand what issues you have with my patch. > - it obviously fixes the problem > - I am not aware of any regression it introduces > - its small My primary issue with your patch is that it continues the saga the trying to use buffer cache to store the data which is a serious review problem, and clearly not what we want to do long term. > One concern you had, was the fact that buffer heads are out of sync with > struct pages. Testing your first patch revealed that this is actually needed > by reiserfs - and maybe others. > I can also see, that my patch looks a bit like a bandaid that cobbles the rd > pieces together. > Is there anything else, that makes my patch unmergeable in your > opinion? For linus's tree the consensus is that to fix rd.c that we need to have a backing store that is stored somewhere besides in the page cache/buffer cache for /dev/ram0. Doing that prevents all of the weird issues. Now we have the question of which patch gets us there. I contend I have implemented it with my last little patch that this thread is a reply to. Nick hasn't seen that just yet. So if we have a small patch that can implement the proper long term fix I contend we are in better shape. As for backports we can worry about that after we get something sane merged upstream. 
Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
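To make the arithmetic above concrete, a back-of-the-envelope check using the numbers quoted in the message (not exact struct sizes, so treat the constants as approximate):

/*
 * ~37 buffer_heads per 4K slab page (from /proc/slabinfo)
 *     => sizeof(struct buffer_head) ~ 4096 / 37 ~ 110 bytes
 *
 * buffer_init():
 *     nrpages          = nr_free_buffer_pages() * 10 / 100
 *     max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head))
 *                      ~ 0.10 * nr_lowmem_pages * 37
 *                      ~ 3.7  * nr_lowmem_pages
 *
 * So buffer_heads_over_limit only becomes true when, on average, more
 * than ~3.7 buffer heads hang off every page of low memory, e.g. a
 * 1k-block filesystem (4 buffer heads per 4K page) written through the
 * block device, which is why the window is hard to hit by accident.
 */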
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 8:23 ` Eric W. Biederman @ 2007-10-21 9:56 ` Nick Piggin 2007-10-21 18:39 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-21 9:56 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Sunday 21 October 2007 18:23, Eric W. Biederman wrote: > Christian Borntraeger <borntraeger@de.ibm.com> writes: > Let me put it another way. Looking at /proc/slabinfo I can get > 37 buffer_heads per page. I can allocate 10% of memory in > buffer_heads before we start to reclaim them. So it requires just > over 3.7 buffer_heads on very page of low memory to even trigger > this case. That is a large 1k filesystem or a weird sized partition, > that we have written to directly. On a highmem machine it it could be relatively common. > > I still dont fully understand what issues you have with my patch. > > - it obviously fixes the problem > > - I am not aware of any regression it introduces > > - its small > > My primary issue with your patch is that it continues the saga the > trying to use buffer cache to store the data which is a serious > review problem, and clearly not what we want to do long term. You don't want to change that for a stable patch, however. It fixes the bug. > > One concern you had, was the fact that buffer heads are out of sync with > > struct pages. Testing your first patch revealed that this is actually > > needed by reiserfs - and maybe others. > > I can also see, that my patch looks a bit like a bandaid that cobbles the > > rd pieces together. > > > > Is there anything else, that makes my patch unmergeable in your > > opinion? > > For linus's tree the consensus is that to fix rd.c that we > need to have a backing store that is stored somewhere besides > in the page cache/buffer cache for /dev/ram0. Doing that prevents > all of the weird issues. > > Now we have the question of which patch gets us there. I contend > I have implemented it with my last little patch that this thread > is a reply to. Nick hasn't seen that just yet. Or ever will. It wasn't that my whole argument against it is based on that I mistakenly thought your patch served the bdev inode directly from its backing store. > So if we have a small patch that can implement the proper long > term fix I contend we are in better shape. I just don't think what you have is the proper fix. Calling into the core vfs and vm because right now it does something that works for you but is completely unrelated to what you are conceptually doing is not the right fix. Also, the patch I posted is big because it did other stuff with dynamically allocated ramdisks from loop (ie. a modern rewrite). As it is applied to rd.c and split into chunks, the actual patch to switch to the new backing store isn't actually that big. I'll submit it to -mm after things stabilise after the merge window too. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 9:56 ` Nick Piggin @ 2007-10-21 18:39 ` Eric W. Biederman 2007-10-22 1:56 ` Nick Piggin 2007-10-22 13:11 ` Chris Mason 0 siblings, 2 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 18:39 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote: >> Christian Borntraeger <borntraeger@de.ibm.com> writes: > >> Let me put it another way. Looking at /proc/slabinfo I can get >> 37 buffer_heads per page. I can allocate 10% of memory in >> buffer_heads before we start to reclaim them. So it requires just >> over 3.7 buffer_heads on very page of low memory to even trigger >> this case. That is a large 1k filesystem or a weird sized partition, >> that we have written to directly. > > On a highmem machine it it could be relatively common. Possibly. But the same proportions still hold. 1k filesystems are not the default these days and ramdisks are relatively uncommon. The memory quantities involved are all low mem. This is an old problem. If it was common I suspect we would have noticed and fixed it long ago. As far as I can tell this problem dates back to 2.5.13 in the commit below (which starts clearing the dirty bit in try_to_free_buffers). The rd.c had earlier made the transition from using BH_protected to using the dirty bit to lock it into the page cache sometime earlier. >commit acc98edfe3abf531df6cc0ba87f857abdd3552ad >Author: akpm <akpm> >Date: Tue Apr 30 20:52:10 2002 +0000 > > [PATCH] writeback from address spaces > > [ I reversed the order in which writeback walks the superblock's > dirty inodes. It sped up dbench's unlink phase greatly. I'm > such a sleaze ] > > The core writeback patch. Switches file writeback from the dirty > buffer LRU over to address_space.dirty_pages. > > - The buffer LRU is removed > > - The buffer hash is removed (uses blockdev pagecache lookups) > > - The bdflush and kupdate functions are implemented against > address_spaces, via pdflush. > > - The relationship between pages and buffers is changed. > > - If a page has dirty buffers, it is marked dirty > - If a page is marked dirty, it *may* have dirty buffers. > - A dirty page may be "partially dirty". block_write_full_page > discovers this. > > - A bunch of consistency checks of the form > > if (!something_which_should_be_true()) > buffer_error(); > > have been introduced. These fog the code up but are important for > ensuring that the new buffer/page code is working correctly. > > - New locking (inode.i_bufferlist_lock) is introduced for exclusion > from try_to_free_buffers(). This is needed because set_page_dirty > is called under spinlock, so it cannot lock the page. But it > needs access to page->buffers to set them all dirty. > > i_bufferlist_lock is also used to protect inode.i_dirty_buffers. > > - fs/inode.c has been split: all the code related to file data writeback > has been moved into fs/fs-writeback.c > > - Code related to file data writeback at the address_space level is in > the new mm/page-writeback.c > > - try_to_free_buffers() is now non-blocking > > - Switches vmscan.c over to understand that all pages with dirty data > are now marked dirty. > > - Introduces a new a_op for VM writeback: > > ->vm_writeback(struct page *page, int *nr_to_write) > > this is a bit half-baked at present. 
The intent is that the address_space > is given the opportunity to perform clustered writeback. To allow it to > opportunistically write out disk-contiguous dirty data which may be in other zones. > To allow delayed-allocate filesystems to get good disk layout. > > - Added address_space.io_pages. Pages which are being prepared for > writeback. This is here for two reasons: > > 1: It will be needed later, when BIOs are assembled direct > against pagecache, bypassing the buffer layer. It avoids a > deadlock which would occur if someone moved the page back onto the > dirty_pages list after it was added to the BIO, but before it was > submitted. (hmm. This may not be a problem with PG_writeback logic). > > 2: Avoids a livelock which would occur if some other thread is continually > redirtying pages. > > - There are two known performance problems in this code: > > 1: Pages which are locked for writeback cause undesirable > blocking when they are being overwritten. A patch which leaves > pages unlocked during writeback comes later in the series. > > > 2: While inodes are under writeback, they are locked. This > causes namespace lookups against the file to get unnecessarily > blocked in wait_on_inode(). This is a fairly minor problem. > > I don't have a fix for this at present - I'll fix this when I > attach dirty address_spaces direct to super_blocks. > > - The patch vastly increases the amount of dirty data which the > kernel permits highmem machines to maintain. This is because the > balancing decisions are made against the amount of memory in the > 2: While inodes are under writeback, they are locked. This > causes namespace lookups against the file to get unnecessarily > blocked in wait_on_inode(). This is a fairly minor problem. > > I don't have a fix for this at present - I'll fix this when I > attach dirty address_spaces direct to super_blocks. > > - The patch vastly increases the amount of dirty data which the > kernel permits highmem machines to maintain. This is because the > balancing decisions are made against the amount of memory in the > machine, not against the amount of buffercache-allocatable memory. > > This may be very wrong, although it works fine for me (2.5 gigs). > > We can trivially go back to the old-style throttling with > s/nr_free_pagecache_pages/nr_free_buffer_pages/ in > balance_dirty_pages(). But better would be to allow blockdev > mappings to use highmem (I'm thinking about this one, slowly). And > to move writer-throttling and writeback decisions into the VM (modulo > the file-overwriting problem). > > - Drops 24 bytes from struct buffer_head. More to come. > > - There's some gunk like super_block.flags:MS_FLUSHING which needs to > be killed. Need a better way of providing collision avoidance > between pdflush threads, to prevent more than one pdflush thread > working a disk at the same time. > > The correct way to do that is to put a flag in the request queue to > say "there's a pdlfush thread working this disk". This is easy to > do: just generalise the "ra_pages" pointer to point at a struct which > includes ra_pages and the new collision-avoidance flag. > > BKrev: 3ccf03faM0no6SEm3ltNUkHf4BH1ag > You don't want to change that for a stable patch, however. > It fixes the bug. No it avoids the bug which is something slightly different. Further I contend that it is not obviously correct that there are no other side effects (because it doesn't actually fix the bug), and that makes it of dubious value for a backport. 
If I had to slap a patch on there at this point just implementing an empty try_to_release_page (which disables try_to_free_buffers) would be my choice. I just think something that has existed at least potentially for the entire 2.6 series, and is easy to avoid is a bit dubious to fix now. > I just don't think what you have is the proper fix. Calling > into the core vfs and vm because right now it does something > that works for you but is completely unrelated to what you > are conceptually doing is not the right fix. I think there is a strong conceptual relation and other code doing largely the same thing is already in the kernel (ramfs). Plus my gut feel says shared code will make maintenance easier. You do have a point that the reuse may not be perfect and if that is the case we need to watch out for the potential to mess things up. So far I don't see any problems with the reuse. > Also, the patch I posted is big because it did other stuff > with dynamically allocated ramdisks from loop (ie. a modern > rewrite). As it is applied to rd.c and split into chunks, the > actual patch to switch to the new backing store isn't actually > that big. I'll submit it to -mm after things stabilise after > the merge window too. Sounds like a plan. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 18:39 ` Eric W. Biederman @ 2007-10-22 1:56 ` Nick Piggin 2007-10-22 13:11 ` Chris Mason 1 sibling, 0 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-22 1:56 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Monday 22 October 2007 04:39, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote: > >> Christian Borntraeger <borntraeger@de.ibm.com> writes: > >> > >> Let me put it another way. Looking at /proc/slabinfo I can get > >> 37 buffer_heads per page. I can allocate 10% of memory in > >> buffer_heads before we start to reclaim them. So it requires just > >> over 3.7 buffer_heads on very page of low memory to even trigger > >> this case. That is a large 1k filesystem or a weird sized partition, > >> that we have written to directly. > > > > On a highmem machine it it could be relatively common. > > Possibly. But the same proportions still hold. 1k filesystems > are not the default these days and ramdisks are relatively uncommon. > The memory quantities involved are all low mem. You don't need 1K filesystems to have buffers attached though, of course. You can hit the limit with a 4K filesystem with less than 8GB in pagecache I'd say. > > You don't want to change that for a stable patch, however. > > It fixes the bug. > > No it avoids the bug which is something slightly different. > Further I contend that it is not obviously correct that there > are no other side effects (because it doesn't actually fix the > bug), and that makes it of dubious value for a backport. The bug in question isn't exactly that it uses buffercache for its backing store (although that's obviously rather hairy as well). It's this specific problem sequence. And it does fix the problem. > If I had to slap a patch on there at this point just implementing > an empty try_to_release_page (which disables try_to_free_buffers) > would be my choice. How is that better? Now you're making the exact same change for all filesystems that you didn't think was obviously correct for rd.c. > > I just don't think what you have is the proper fix. Calling > > into the core vfs and vm because right now it does something > > that works for you but is completely unrelated to what you > > are conceptually doing is not the right fix. > > I think there is a strong conceptual relation and other code > doing largely the same thing is already in the kernel (ramfs). Plus > my gut feel says shared code will make maintenance easier. ramfs is rather a different case. Filesystems intimately know about the pagecache. > You do have a point that the reuse may not be perfect and if that > is the case we need to watch out for the potential to mess things > up. > > So far I don't see any problems with the reuse. It's just wrong. I guess if you don't see that by now, then we have to just disagree. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 18:39 ` Eric W. Biederman 2007-10-22 1:56 ` Nick Piggin @ 2007-10-22 13:11 ` Chris Mason 1 sibling, 0 replies; 77+ messages in thread From: Chris Mason @ 2007-10-22 13:11 UTC (permalink / raw) To: Eric W. Biederman Cc: Nick Piggin, Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Sun, 21 Oct 2007 12:39:30 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote: > >> Christian Borntraeger <borntraeger@de.ibm.com> writes: > > > >> Let me put it another way. Looking at /proc/slabinfo I can get > >> 37 buffer_heads per page. I can allocate 10% of memory in > >> buffer_heads before we start to reclaim them. So it requires just > >> over 3.7 buffer_heads on very page of low memory to even trigger > >> this case. That is a large 1k filesystem or a weird sized > >> partition, that we have written to directly. > > > > On a highmem machine it it could be relatively common. > > Possibly. But the same proportions still hold. 1k filesystems > are not the default these days and ramdisks are relatively uncommon. > The memory quantities involved are all low mem. It is definitely common during run time. It was seen in practice enough to be reproducible and get fixed for the non-ramdisk case. The big underlying question is how which ramdisk usage case are we shooting for. Keeping the ram disk pages off the LRU can certainly help the VM if larger ramdisks used at runtime are very common. Otherwise, I'd say to keep it as simple as possible and use Eric's patch. By simple I'm not counting lines of code, I'm counting overall readability between something everyone knows (page cache usage) and something specific to ramdisks (Nick's patch). -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 6:48 ` Eric W. Biederman 2007-10-21 7:28 ` Christian Borntraeger @ 2007-10-21 9:39 ` Nick Piggin 2007-10-21 17:56 ` Eric W. Biederman 1 sibling, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-21 9:39 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Sunday 21 October 2007 16:48, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > Yes it does. It is exactly breaking the coherency between block > > device and filesystem metadata coherency that Andrew cared about. > > Whether or not that matters, that is a much bigger conceptual > > change than simply using slightly more (reclaimable) memory in > > some situations that my patch does. > > > > If you want to try convincing people to break that coherency, > > fine, but it has to be done consistently and everywhere rather than > > for a special case in rd.c. > > Nick. Reread the patch. The only thing your arguments have > established for me is that this patch is not obviously correct. Which > makes it ineligible for a back port. OK, I missed that you set the new inode's aops to the ramdisk_aops rather than the bd_inode. Which doesn't make a lot of sense because you just have a lot of useless aops there now. > Frankly I suspect the whole > issue is to subtle and rare to make any backport make any sense. My > apologies Christian. It's a data corruption issue. I think it should be fixed. > >> The only way we make it to that inode is through block > >> device I/O so it lives at exactly the same level in the hierarchy as > >> a real block device. > > > > No, it doesn't. A real block device driver does have its own > > buffer cache as it's backing store. It doesn't know about > > readpage or writepage or set_page_dirty or buffers or pagecache. > > Well those pages are only accessed through rd_blkdev_pagecache_IO > and rd_ioctl. Wrong. It will be via the LRU, will get ->writepage() called, block_invalidate_page, etc. And I guess also via sb->s_inodes, where drop_pagecache_sb might do stuff to it (although it probably escapes harm). But you're right that it isn't the obviously correct fix for the problem. > >> My patch is the considered rewrite boiled down > >> to it's essentials and made a trivial patch. > > > > What's the considered rewrite here? The rewrite I posted is the > > only one so far that's come up that I would consider [worthy], > > while these patches are just more of the same wrongness. > > Well it looks like you were blind when you read the patch. If you think it is a nice way to go, then I think you are blind ;) > Because the semantics between the two are almost identical, > except I managed to implement BLKFLSBUF in a backwards compatible > way by flushing both the buffer cache and my private cache. You > failed to flush the buffer cache in your implementation. Obviously a simple typo that can be fixed by adding one line of code. > Yes. I use an inode 99% for it's mapping and the mapping 99% for it's > radix_tree. But having truncate_inode_pages and grab_cache_page > continue to work sure is convenient. It's horrible. And using truncate_inode_pages / grab_cache_page and new_inode is an incredible argument to save a few lines of code. You obviously didn't realise your so called private pages would get accessed via the LRU, for example. 
This is making a relatively larger logical change than my patch, because now as well as having a separate buffer cache and backing store, you are also making the backing store pages visible to the VM. > I certainly think it makes it a > lot simpler to audit the code to change just one thing at a time (the > backing store) then to rip out and replace everything and then try and > prove that the two patches are equivalent. I think it's a bad idea just to stir the shit. We should take the simple fix for the problem, and then fix it properly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
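A rough sketch of the reclaim path being referred to here, reconstructed from the 2.6.23-era mm/vmscan.c call chain (comment form, simplified, and not a trace):

/*
 * shrink_zone()
 *   -> shrink_inactive_list()
 *        -> shrink_page_list()
 *             -> pageout(page, mapping)
 *                  -> mapping->a_ops->writepage(page, &wbc)
 *             -> try_to_release_page(page, gfp)   (if the page carries
 *                                                  private/buffer data)
 *
 * Any page inserted into an inode's mapping is reachable this way, so
 * whatever address_space_operations are left on the backing inode can
 * be exercised by reclaim, independently of rd_blkdev_pagecache_IO().
 */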
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 9:39 ` Nick Piggin @ 2007-10-21 17:56 ` Eric W. Biederman 2007-10-22 0:29 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-21 17:56 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable Nick Piggin <nickpiggin@yahoo.com.au> writes: > OK, I missed that you set the new inode's aops to the ramdisk_aops > rather than the bd_inode. Which doesn't make a lot of sense because > you just have a lot of useless aops there now. Not totally useless as you have mentioned they are accessed by the lru so we still need something there just not much. >> Frankly I suspect the whole >> issue is to subtle and rare to make any backport make any sense. My >> apologies Christian. > > It's a data corruption issue. I think it should be fixed. Of course it should be fixed. I just don't know if a backport makes sense. The problem once understood is hard to trigger and easy to avoid. >> >> The only way we make it to that inode is through block >> >> device I/O so it lives at exactly the same level in the hierarchy as >> >> a real block device. >> > >> > No, it doesn't. A real block device driver does have its own >> > buffer cache as it's backing store. It doesn't know about >> > readpage or writepage or set_page_dirty or buffers or pagecache. >> >> Well those pages are only accessed through rd_blkdev_pagecache_IO >> and rd_ioctl. > > Wrong. It will be via the LRU, will get ->writepage() called, > block_invalidate_page, etc. And I guess also via sb->s_inodes, where > drop_pagecache_sb might do stuff to it (although it probably escapes > harm). But you're right that it isn't the obviously correct fix for > the problem. Yes. We will be accessed via the LRU. Yes I knew that. The truth is whatever we do we will be visible to the LRU. My preferences run towards having something that is less of a special case then a new potentially huge cache that is completely unaccounted for, that we have to build and maintain all of the infrastructure for independently. The ramdisk code doesn't seem interesting enough long term to get people to do independent maintenance. With my patch we are on the road to being just like the ramdisk for maintenance purposes code except having a different GFP mask. I think I might have to send in a patch that renames block_invalidatepage to block_invalidate_page which is how everyone seems to be spelling it. Anyway nothing in the ramdisk code sets PagePrivate o block_invalidagepage should never be called, nor try_to_free_buffers for that matter. >> >> My patch is the considered rewrite boiled down >> >> to it's essentials and made a trivial patch. >> > >> > What's the considered rewrite here? The rewrite I posted is the >> > only one so far that's come up that I would consider [worthy], >> > while these patches are just more of the same wrongness. >> >> Well it looks like you were blind when you read the patch. > > If you think it is a nice way to go, then I think you are > blind ;) Well we each have different tastes. I think my patch is a sane sensible small incremental step that does just what is needed to fix the problem. It doesn't drag a lot of other stuff into the problem like a rewrite would so we can easily verify it. 
>> Because the semantics between the two are almost identical, >> except I managed to implement BLKFLSBUF in a backwards compatible >> way by flushing both the buffer cache and my private cache. You >> failed to flush the buffer cache in your implementation. > > Obviously a simple typo that can be fixed by adding one line > of code. Quite true. I noticed the breakage in mine because the patch was so simple, and I only had to worry about one aspect. With a rewrite I missed it because there was so much other noise I couldn't see any issues. >> Yes. I use an inode 99% for it's mapping and the mapping 99% for it's >> radix_tree. But having truncate_inode_pages and grab_cache_page >> continue to work sure is convenient. > > It's horrible. And using truncate_inode_pages / grab_cache_page and > new_inode is an incredible argument to save a few lines of code. You > obviously didn't realise your so called private pages would get > accessed via the LRU, for example. I did but but that is relatively minor. Using the pagecache this way for this purpose is a well established idiom in the kernel so I didn't figure I was introducing anything to hard to maintain. > This is making a relatively > larger logical change than my patch, because now as well as having > a separate buffer cache and backing store, you are also making the > backing store pages visible to the VM. I am continuing to have the backing store pages visible to the VM, and from that perspective it is a smaller change then where we are today. Not that we can truly hide pages from the VM. >> I certainly think it makes it a >> lot simpler to audit the code to change just one thing at a time (the >> backing store) then to rip out and replace everything and then try and >> prove that the two patches are equivalent. > > I think it's a bad idea just to stir the shit. We should take the > simple fix for the problem, and then fix it properly. The code in rd.c isn't terrible, and it isn't shit. There is only one fundamental problem with it. rd.c is fundamentally incompatible with the buffer cache. Currently rd.c is a legitimate if a bit ugly user of the page cache. The ugliness comes from working around the buffer cache placing buffer heads on it's pages. With my patch I stop using the same pages as the buffer cache which removes that one fundamental problem. Once we get the problem actually fixed I have no problem with cleaning up the code. I even have patches queued I just believe in only changing one thing at a time if I can. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
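For the PagePrivate point above, the guard being relied on looks roughly like this (paraphrased from memory of the 2.6.23 sources, so check the tree before quoting it):

/* mm/swap.c, approximately: only pages that actually carry buffers
 * (PagePrivate) are handed to try_to_release_page(). */
void pagevec_strip(struct pagevec *pvec)
{
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		if (PagePrivate(page) && !TestSetPageLocked(page)) {
			if (PagePrivate(page))
				try_to_release_page(page, 0);
			unlock_page(page);
		}
	}
}

/* A backing page whose driver never sets PagePrivate is skipped here,
 * so neither block_invalidatepage() nor try_to_free_buffers() should
 * ever see it. */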
* Re: [PATCH] rd: Use a private inode for backing storage 2007-10-21 17:56 ` Eric W. Biederman @ 2007-10-22 0:29 ` Nick Piggin 0 siblings, 0 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-22 0:29 UTC (permalink / raw) To: Eric W. Biederman Cc: Christian Borntraeger, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o, stable On Monday 22 October 2007 03:56, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > OK, I missed that you set the new inode's aops to the ramdisk_aops > > rather than the bd_inode. Which doesn't make a lot of sense because > > you just have a lot of useless aops there now. > > Not totally useless as you have mentioned they are accessed by > the lru so we still need something there just not much. Well, the ones that are unused are totally useless ;) I would have expected them to be removed. > >> Frankly I suspect the whole > >> issue is to subtle and rare to make any backport make any sense. My > >> apologies Christian. > > > > It's a data corruption issue. I think it should be fixed. > > Of course it should be fixed. I just don't know if a backport > makes sense. The problem once understood is hard to trigger > and easy to avoid. I mean backported. That's just me though, I don't know the nuances of -stable releases. It could be that they rather not risk introducing something worse which would be fair enough. > >> Well those pages are only accessed through rd_blkdev_pagecache_IO > >> and rd_ioctl. > > > > Wrong. It will be via the LRU, will get ->writepage() called, > > block_invalidate_page, etc. And I guess also via sb->s_inodes, where > > drop_pagecache_sb might do stuff to it (although it probably escapes > > harm). But you're right that it isn't the obviously correct fix for > > the problem. > > Yes. We will be accessed via the LRU. Yes I knew that. OK it just didn't sound like it, seeing as you said that's the only way they are accessed. > The truth is > whatever we do we will be visible to the LRU. No. With my patch, nothing in the ramdisk code is visible to the LRU. Which is how it should be. > My preferences run > towards having something that is less of a special case then a new > potentially huge cache that is completely unaccounted for, that we > have to build and maintain all of the infrastructure for > independently. It's not a cache, and it's not unaccounted for. It's specified exactly with the rd sizing parameters. I don't know why you would say your patch is better in this regard. Your ramdisk backing store will be accounted as pagecache, which is completely wrong. > The ramdisk code doesn't seem interesting enough > long term to get people to do independent maintenance. > > With my patch we are on the road to being just like the ramdisk > for maintenance purposes code except having a different GFP mask. You can be using highmem, BTW. And technically it probably isn't correct to use GFP_USER. > > If you think it is a nice way to go, then I think you are > > blind ;) > > Well we each have different tastes. I think my patch > is a sane sensible small incremental step that does just what > is needed to fix the problem. It doesn't drag a lot of other > stuff into the problem like a rewrite would so we can easily verify > it. The small incremental step that fixes the problem is Christian's patch. > > It's horrible. And using truncate_inode_pages / grab_cache_page and > > new_inode is an incredible argument to save a few lines of code. 
You > > obviously didn't realise your so called private pages would get > > accessed via the LRU, for example. > > I did but but that is relatively minor. Using the pagecache this > way for this purpose is a well established idiom in the kernel > so I didn't figure I was introducing anything to hard to maintain. Where else is this an established idiom? > > This is making a relatively > > larger logical change than my patch, because now as well as having > > a separate buffer cache and backing store, you are also making the > > backing store pages visible to the VM. > > I am continuing to have the backing store pages visible to the VM, > and from that perspective it is a smaller change then where we are > today. It is smaller lines of code. It is a larger change. Because what you are doing is 2 things. You are firstly discontinuing the use of the buffer cache for the backing store, and secondly you are introducing a new backing store which registers an inode with the vfs and pages with the pagecache. My patch does the same thing without those two last questionable steps. You now have to treat those backing store pages as pagecache pages, and hope you have set the right flags and registered the right aops. > Not that we can truly hide pages from the VM. Any page you allocate is your private page. The problem is you're just sending them back to the VM again. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-15 22:42 ` [PATCH] rd: Mark ramdisk buffers heads dirty Eric W. Biederman 2007-10-16 7:56 ` Christian Borntraeger @ 2007-10-16 8:19 ` Nick Piggin 2007-10-16 8:48 ` Christian Borntraeger 2007-10-16 19:06 ` Eric W. Biederman 1 sibling, 2 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-16 8:19 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote: > I have not observed this case but it is possible to get a dirty page > cache with clean buffer heads if we get a clean ramdisk page with > buffer heads generated by a filesystem calling __getblk and then write > to that page from user space through the block device. Then we just > need to hit the proper window and try_to_free_buffers() will mark that > page clean and eventually drop it. Ouch! > > To fix this use the generic __set_page_dirty_buffers in the ramdisk > code so that when we mark a page dirty we also mark it's buffer heads > dirty. Hmm, so we can also have some filesystems writing their own buffers out by hand (clear_buffer_dirty, submit buffer for IO). Other places will do similarly dodgy things with filesystem metadata (fsync_buffers_list, for example). So your buffers get cleaned again, then your pages get cleaned. While I said it was a good fix when I saw the patch earlier, I think it's not closing the entire hole, and as such, Christian's patch is probably the way to go for stable. For mainline, *if* we want to keep the old rd.c around at all, I don't see any harm in this patch so long as Christian's is merged as well. Sharing common code is always good. > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > --- > drivers/block/rd.c | 13 +------------ > 1 files changed, 1 insertions(+), 12 deletions(-) > > diff --git a/drivers/block/rd.c b/drivers/block/rd.c > index 701ea77..84163da 100644 > --- a/drivers/block/rd.c > +++ b/drivers/block/rd.c > @@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space > *mapping, return 0; > } > > -/* > - * ramdisk blockdev pages have their own ->set_page_dirty() because we > don't - * want them to contribute to dirty memory accounting. > - */ > -static int ramdisk_set_page_dirty(struct page *page) > -{ > - if (!TestSetPageDirty(page)) > - return 1; > - return 0; > -} > - > static const struct address_space_operations ramdisk_aops = { > .readpage = ramdisk_readpage, > .prepare_write = ramdisk_prepare_write, > .commit_write = ramdisk_commit_write, > .writepage = ramdisk_writepage, > - .set_page_dirty = ramdisk_set_page_dirty, > + .set_page_dirty = __set_page_dirty_buffers, > .writepages = ramdisk_writepages, > }; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
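The window being closed by the patch above, written out as an ordered sequence (a hedged reconstruction of the ordering described in the changelog, not a trace):

/*
 * 1. A filesystem calls __getblk() against /dev/ram0; the backing page
 *    gets buffer heads via init_page_buffers(), and they start clean.
 * 2. Userspace writes the same block through the block device; rd.c
 *    marks only the *page* dirty (ramdisk_set_page_dirty), the buffer
 *    heads stay clean.
 * 3. try_to_free_buffers() later finds a dirty page whose buffers are
 *    all clean, frees the buffers and clears the page's dirty bit.
 * 4. Reclaim drops the now-"clean" page, and the data written in
 *    step 2 is lost.
 *
 * Using __set_page_dirty_buffers() in step 2 dirties the buffer heads
 * together with the page, so step 3 can no longer clean it.
 */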
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-16 8:19 ` [PATCH] rd: Mark ramdisk buffers heads dirty Nick Piggin @ 2007-10-16 8:48 ` Christian Borntraeger 2007-10-16 19:06 ` Eric W. Biederman 1 sibling, 0 replies; 77+ messages in thread
From: Christian Borntraeger @ 2007-10-16 8:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Andrew Morton, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o

On Tuesday, 16 October 2007, Nick Piggin wrote:
> While I said it was a good fix when I saw the patch earlier, I think
> it's not closing the entire hole, and as such, Christian's patch is
> probably the way to go for stable.

That would be my preferred method. Merge Eric's and my fixup for 2.6.24-rc. The only open question is: what was the reiserfs problem? Is it still caused by Eric's patches?

> For mainline, *if* we want to keep the old rd.c around at all, I
> don't see any harm in this patch so long as Christian's is merged
> as well. Sharing common code is always good.

While the merge window is still open, to me the new ramdisk code seems more like a 2.6.25-rc thing. We should actually test the behaviour in low-memory scenarios. What do you think?

Christian
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-16 8:19 ` [PATCH] rd: Mark ramdisk buffers heads dirty Nick Piggin 2007-10-16 8:48 ` Christian Borntraeger @ 2007-10-16 19:06 ` Eric W. Biederman 2007-10-16 22:06 ` Nick Piggin 1 sibling, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 19:06 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote: >> I have not observed this case but it is possible to get a dirty page >> cache with clean buffer heads if we get a clean ramdisk page with >> buffer heads generated by a filesystem calling __getblk and then write >> to that page from user space through the block device. Then we just >> need to hit the proper window and try_to_free_buffers() will mark that >> page clean and eventually drop it. Ouch! >> >> To fix this use the generic __set_page_dirty_buffers in the ramdisk >> code so that when we mark a page dirty we also mark it's buffer heads >> dirty. > > Hmm, so we can also have some filesystems writing their own buffers > out by hand (clear_buffer_dirty, submit buffer for IO). Other places > will do similarly dodgy things with filesystem metadata > (fsync_buffers_list, for example). > > So your buffers get cleaned again, then your pages get cleaned. So I just took a little bit of time to look at and think about the path you are referring to, and I don't see a problem. The rule with the buffer dirty bit is that you first clear it and then you submit the write. When the write request finally makes it's way to rd.c we copy the data if necessary and call set_page_dirty. Which will then mark the page and the buffer dirty again. In essence the ramdisk code just attempts to lock buffers in ram by setting their dirty bit, just like we do for pages in ramfs. The only case where I see the dirty bit getting cleared without submitting I/O is when dirty state doesn't matter, in which case if we get a page full buffers all of whose data we don't care about it is legitimate to drop the page. As for ramdisk_writepage it probably makes sense to remove it, as the generic code paths seem to work as well or better the writepage method is NULL. However if we do keep it we should call set_page_dirty there on general principles. > While I said it was a good fix when I saw the patch earlier, I think > it's not closing the entire hole, and as such, Christian's patch is > probably the way to go for stable. I thought through the logic in try_to_free_buffers and it actually makes sense to me now. mark_buffer_dirty sets the page dirty bit so dirty buffers reside on dirty pages. When we submit I/O we aren't guaranteed to submit all of the dirty buffers for a page at once, so we don't clear the page dirty bit. With the result that we can end up with pages with the dirty bit set but all of their buffers are clean. Since we rather deliberately allow truly clean pages to be dropped from the ramdisk overriding try_to_free_buffer_pages looks wrong because then except for invalidate we can not remove buffers from truly clean pages. > For mainline, *if* we want to keep the old rd.c around at all, I > don't see any harm in this patch so long as Christian's is merged > as well. Sharing common code is always good. 
I do agree that with the amount of code duplication in the buffer cache that locking pages into the buffer cache seems very error prone, and difficult to maintain. So rewriting rd.c to store it's pages elsewhere sounds very promising. While I can see Christian's patch as fixing the symptom. I have a very hard time see it as correct. If we did something more elaborate to replace try_to_free_buffer_pages such that we could drop buffers from clean buffer cache pages when they became such and so were only suppressing the drop the dirty bit logic for pages that contain data we want to keep I would be ok with it. Just totally skipping buffer head freeing just feels wrong to me. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
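A sketch of the ordering described above: clear the buffer dirty bit, submit the write, and have rd.c re-dirty the backing page when the request arrives. The function names are illustrative, error handling is omitted, and this mirrors the generic pattern rather than quoting rd.c:

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Writer side (filesystem or fs-writeback): the rule is clear-then-submit. */
static void write_one_buffer(struct buffer_head *bh)
{
	if (test_clear_buffer_dirty(bh)) {
		get_bh(bh);
		bh->b_end_io = end_buffer_write_sync;
		submit_bh(WRITE, bh);		/* request ends up in rd.c */
	}
}

/* Ramdisk side: rd_blkdev_pagecache_IO() copies the data into the
 * backing page and re-marks it dirty, which (with the
 * __set_page_dirty_buffers change) also re-dirties its buffer heads,
 * pinning the data again. */
static void rd_store(struct page *backing_page, unsigned int offset,
		     const void *src, size_t len)
{
	memcpy(page_address(backing_page) + offset, src, len);
	set_page_dirty(backing_page);
}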
* Re: [PATCH] rd: Mark ramdisk buffers heads dirty 2007-10-16 19:06 ` Eric W. Biederman @ 2007-10-16 22:06 ` Nick Piggin 0 siblings, 0 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-16 22:06 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Wednesday 17 October 2007 05:06, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote: > >> I have not observed this case but it is possible to get a dirty page > >> cache with clean buffer heads if we get a clean ramdisk page with > >> buffer heads generated by a filesystem calling __getblk and then write > >> to that page from user space through the block device. Then we just > >> need to hit the proper window and try_to_free_buffers() will mark that > >> page clean and eventually drop it. Ouch! > >> > >> To fix this use the generic __set_page_dirty_buffers in the ramdisk > >> code so that when we mark a page dirty we also mark it's buffer heads > >> dirty. > > > > Hmm, so we can also have some filesystems writing their own buffers > > out by hand (clear_buffer_dirty, submit buffer for IO). Other places > > will do similarly dodgy things with filesystem metadata > > (fsync_buffers_list, for example). > > > > So your buffers get cleaned again, then your pages get cleaned. > > So I just took a little bit of time to look at and think about > the path you are referring to, and I don't see a problem. > > The rule with the buffer dirty bit is that you first clear it > and then you submit the write. When the write request finally > makes it's way to rd.c we copy the data if necessary and call > set_page_dirty. Which will then mark the page and the buffer > dirty again. Oh, maybe you're right. I didn't see it redirty the page there. > In essence the ramdisk code just attempts to lock buffers in > ram by setting their dirty bit, just like we do for pages > in ramfs. Yeah, which is half the reason why its so complicated. Logically it should just hold another reference on the pages rather than interfere with pagecache state, but it can't do that because it doesn't always know when a new page is inserted. > > While I said it was a good fix when I saw the patch earlier, I think > > it's not closing the entire hole, and as such, Christian's patch is > > probably the way to go for stable. > > I thought through the logic in try_to_free_buffers and it actually > makes sense to me now. mark_buffer_dirty sets the page dirty bit > so dirty buffers reside on dirty pages. When we submit I/O we > aren't guaranteed to submit all of the dirty buffers for a page > at once, so we don't clear the page dirty bit. With the result > that we can end up with pages with the dirty bit set but all of > their buffers are clean. Yep. > Since we rather deliberately allow truly clean pages to be dropped > from the ramdisk overriding try_to_free_buffer_pages looks wrong > because then except for invalidate we can not remove buffers > from truly clean pages. Yeah, if your fix works I guess it is better to use it and converge code rather than diverge it even more. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Preserve the dirty bit in init_page_buffers() 2007-10-15 22:40 ` [PATCH] rd: Preserve the dirty bit in init_page_buffers() Eric W. Biederman 2007-10-15 22:42 ` [PATCH] rd: Mark ramdisk buffers heads dirty Eric W. Biederman @ 2007-10-16 8:12 ` Nick Piggin 2007-10-16 9:35 ` Eric W. Biederman 1 sibling, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-16 8:12 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tuesday 16 October 2007 08:40, Eric W. Biederman wrote: > The problem: When we are trying to free buffers try_to_free_buffers() > will look at ramdisk pages with clean buffer heads and remove the > dirty bit from the page. Resulting in ramdisk pages with data that get > removed from the page cache. Ouch! > > Buffer heads appear on ramdisk pages when a filesystem calls > __getblk() which through a series of function calls eventually calls > init_page_buffers(). > > So to fix the mismatch between buffer head and page state this patch > modifies init_page_buffers() to transfer the dirty bit from the page to > the buffer heads like we currently do for the uptodate bit. > > This patch is safe as only __getblk calls init_page_buffers, and > there are only two implementations of block devices cached in the > page cache. The generic implementation in block_dev.c and the > implementation in rd.c. > > The generic implementation of block devices always does everything > in terms of buffer heads so it always has buffer heads allocated > before a page is marked dirty so this change does not affect it. This is probably a good idea. Was this causing the reiserfs problems? If so, I think we should be concentrating on what the real problem is with reiserfs... (or at least why this so obviously correct looking patch is wrong). Acked-by: Nick Piggin <npiggin@suse.de> > > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> > --- > fs/buffer.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index 75b51df..8b87beb 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct > block_device *bdev, struct buffer_head *head = page_buffers(page); > struct buffer_head *bh = head; > int uptodate = PageUptodate(page); > + int dirty = PageDirty(page); > > do { > if (!buffer_mapped(bh)) { > @@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct > block_device *bdev, bh->b_blocknr = block; > if (uptodate) > set_buffer_uptodate(bh); > + if (dirty) > + set_buffer_dirty(bh); > set_buffer_mapped(bh); > } > block++; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH] rd: Preserve the dirty bit in init_page_buffers() 2007-10-16 8:12 ` [PATCH] rd: Preserve the dirty bit in init_page_buffers() Nick Piggin @ 2007-10-16 9:35 ` Eric W. Biederman 0 siblings, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 9:35 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Tuesday 16 October 2007 08:40, Eric W. Biederman wrote: >> The problem: When we are trying to free buffers try_to_free_buffers() >> will look at ramdisk pages with clean buffer heads and remove the >> dirty bit from the page. Resulting in ramdisk pages with data that get >> removed from the page cache. Ouch! >> >> Buffer heads appear on ramdisk pages when a filesystem calls >> __getblk() which through a series of function calls eventually calls >> init_page_buffers(). >> >> So to fix the mismatch between buffer head and page state this patch >> modifies init_page_buffers() to transfer the dirty bit from the page to >> the buffer heads like we currently do for the uptodate bit. >> >> This patch is safe as only __getblk calls init_page_buffers, and >> there are only two implementations of block devices cached in the >> page cache. The generic implementation in block_dev.c and the >> implementation in rd.c. >> >> The generic implementation of block devices always does everything >> in terms of buffer heads so it always has buffer heads allocated >> before a page is marked dirty so this change does not affect it. > > This is probably a good idea. Was this causing the reiserfs problems? > If so, I think we should be concentrating on what the real problem > is with reiserfs... (or at least why this so obviously correct > looking patch is wrong). I think it was my cleanup patch that was sitting on top of this, That caused the problems. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 14:06 ` Nick Piggin 2007-10-15 9:05 ` Christian Borntraeger @ 2007-10-15 9:16 ` Andrew Morton 2007-10-15 15:23 ` Nick Piggin 1 sibling, 1 reply; 77+ messages in thread From: Andrew Morton @ 2007-10-15 9:16 UTC (permalink / raw) To: Nick Piggin Cc: Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit > > unmaintained, so decided to sent the patch to you :-). > > I have CCed Ted, who did work on the code in the 90s. I found no current > > email address of Chad Page. > > This really needs to be fixed... rd.c is fairly mind-boggling vfs abuse. > I can't make up my mind between the approaches to fixing it. > > On one hand, I would actually prefer to really mark the buffers > dirty (as in: Eric's fix for this problem[*]) than this patch, > and this seems a bit like a bandaid... > > On the other hand, the wound being covered by the bandaid is > actually the code in the buffer layer that does this latent > "cleaning" of the page because it sadly doesn't really keep > track of the pagecache state. But it *still* feels like we > should be marking the rd page's buffers dirty which should > avoid this problem anyway. > > [*] However, hmm, with Eric's patch I guess we'd still have a hole > where filesystems that write their buffers by hand think they are > "cleaning" these things and we're back to square one. That could > be fixed by marking the buffers dirty again? > > Why were Eric's patches dropped, BTW? I don't remember. runtime problems, iirc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 9:16 ` [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure Andrew Morton @ 2007-10-15 15:23 ` Nick Piggin 2007-10-16 3:14 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-15 15:23 UTC (permalink / raw) To: Andrew Morton Cc: Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Monday 15 October 2007 19:16, Andrew Morton wrote: > On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: > > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit > > > unmaintained, so decided to sent the patch to you :-). > > > I have CCed Ted, who did work on the code in the 90s. I found no > > > current email address of Chad Page. > > > > This really needs to be fixed... > > rd.c is fairly mind-boggling vfs abuse. Why do you say that? I guess it is _different_, by necessity(?) Is there anything that is really bad? I guess it's not nice for operating on the pagecache from its request_fn, but the alternative is to duplicate pages for backing store and buffer cache (actually that might not be a bad alternative really). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-15 15:23 ` Nick Piggin @ 2007-10-16 3:14 ` Eric W. Biederman 2007-10-16 6:45 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 3:14 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Monday 15 October 2007 19:16, Andrew Morton wrote: >> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <nickpiggin@yahoo.com.au> > wrote: >> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: >> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit >> > > unmaintained, so decided to sent the patch to you :-). >> > > I have CCed Ted, who did work on the code in the 90s. I found no >> > > current email address of Chad Page. >> > >> > This really needs to be fixed... >> >> rd.c is fairly mind-boggling vfs abuse. > > Why do you say that? I guess it is _different_, by necessity(?) > Is there anything that is really bad? make_page_uptodate() is most hideous part I have run into. It has to know details about other layers to now what not to stomp. I think my incorrect simplification of this is what messed things up, last round. > I guess it's not nice > for operating on the pagecache from its request_fn, but the > alternative is to duplicate pages for backing store and buffer > cache (actually that might not be a bad alternative really). Cool. Triple buffering :) Although I guess that would only apply to metadata these days. Having a separate store would solve some of the problems, and probably remove the need for carefully specifying the ramdisk block size. We would still need the magic restictions on page allocations though and it we would use them more often as the initial write to the ramdisk would not populate the pages we need. A very ugly bit seems to be the fact that we assume we can dereference bh->b_data without any special magic which means the ramdisk must live in low memory on 32bit machines. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
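For context, this is roughly the helper being talked about, reconstructed from the 2.6.23 drivers/block/rd.c (a close paraphrase rather than an exact quote): it has to walk the page's buffer heads so it only zeroes ranges that no buffer already claims as valid, which is exactly the cross-layer knowledge Eric objects to.

static void make_page_uptodate(struct page *page)
{
	if (page_has_buffers(page)) {
		struct buffer_head *bh = page_buffers(page);
		struct buffer_head *head = bh;

		do {
			/* only stomp on regions no buffer holds valid data for */
			if (!buffer_uptodate(bh)) {
				memset(bh->b_data, 0, bh->b_size);
				set_buffer_uptodate(bh);
			}
			bh = bh->b_this_page;
		} while (bh != head);
	} else {
		memset(page_address(page), 0, PAGE_CACHE_SIZE);
	}
	flush_dcache_page(page);
	SetPageUptodate(page);
}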
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-16 3:14 ` Eric W. Biederman @ 2007-10-16 6:45 ` Nick Piggin 2007-10-16 4:57 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-16 6:45 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tuesday 16 October 2007 13:14, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Monday 15 October 2007 19:16, Andrew Morton wrote: > >> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <nickpiggin@yahoo.com.au> > > > > wrote: > >> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote: > >> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit > >> > > unmaintained, so decided to sent the patch to you :-). > >> > > I have CCed Ted, who did work on the code in the 90s. I found no > >> > > current email address of Chad Page. > >> > > >> > This really needs to be fixed... > >> > >> rd.c is fairly mind-boggling vfs abuse. > > > > Why do you say that? I guess it is _different_, by necessity(?) > > Is there anything that is really bad? > > make_page_uptodate() is most hideous part I have run into. > It has to know details about other layers to now what not > to stomp. I think my incorrect simplification of this is what messed > things up, last round. Not really, it's just named funny. That's just a minor utility function that more or less does what it says it should do. The main problem is really that it's implementing a block device who's data comes from its own buffercache :P. I think. > > I guess it's not nice > > for operating on the pagecache from its request_fn, but the > > alternative is to duplicate pages for backing store and buffer > > cache (actually that might not be a bad alternative really). > > Cool. Triple buffering :) Although I guess that would only > apply to metadata these days. Double buffering. You no longer serve data out of your buffer cache. All filesystem data was already double buffered anyway, so we'd be just losing out on one layer of savings for metadata. I think it's worthwhile, given that we'd have a "real" looking block device and minus these bugs. > Having a separate store would > solve some of the problems, and probably remove the need > for carefully specifying the ramdisk block size. We would > still need the magic restictions on page allocations though > and it we would use them more often as the initial write to the > ramdisk would not populate the pages we need. What magic restrictions on page allocations? Actually we have fewer restrictions on page allocations because we can use highmem! And the lowmem buffercache pages that we currently pin (unsuccessfully, in the case of this bug) are now completely reclaimable. And all your buffer heads are now reclaimable. If you mean GFP_NOIO... I don't see any problem. Block device drivers have to allocate memory with GFP_NOIO; this may have been considered magic or deep badness back when the code was written, but it's pretty simple and accepted now. > A very ugly bit seems to be the fact that we assume we can > dereference bh->b_data without any special magic which > means the ramdisk must live in low memory on 32bit machines. Yeah but that's not rd.c. You need to rewrite the buffer layer to fix that (see fsblock ;)). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-16 6:45 ` Nick Piggin @ 2007-10-16 4:57 ` Eric W. Biederman 2007-10-16 8:08 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 4:57 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Nick Piggin <nickpiggin@yahoo.com.au> writes: >> >> make_page_uptodate() is most hideous part I have run into. >> It has to know details about other layers to now what not >> to stomp. I think my incorrect simplification of this is what messed >> things up, last round. > > Not really, it's just named funny. That's just a minor utility > function that more or less does what it says it should do. > > The main problem is really that it's implementing a block device > who's data comes from its own buffercache :P. I think. Well to put it another way, mark_page_uptodate() is the only place where we really need to know about the upper layers. Given that you can kill ramdisks by coding it as: static void make_page_uptodate(struct page *page) { clear_highpage(page); flush_dcache_page(page); SetPageUptodate(page); } Something is seriously non-intuitive about that function if you understand the usual rules for how to use the page cache. The problem is that we support a case in the buffer cache where pages are partially uptodate and only the buffer_heads remember which parts are valid. Assuming we are using them correctly. Having to walk through all of the buffer heads in make_page_uptodate seems to me to be a nasty layering violation in rd.c >> > I guess it's not nice >> > for operating on the pagecache from its request_fn, but the >> > alternative is to duplicate pages for backing store and buffer >> > cache (actually that might not be a bad alternative really). >> >> Cool. Triple buffering :) Although I guess that would only >> apply to metadata these days. > > Double buffering. You no longer serve data out of your buffer > cache. All filesystem data was already double buffered anyway, > so we'd be just losing out on one layer of savings for metadata. Yep we are in agreement there. > I think it's worthwhile, given that we'd have a "real" looking > block device and minus these bugs. For testing purposes I think I can agree with that. >> Having a separate store would >> solve some of the problems, and probably remove the need >> for carefully specifying the ramdisk block size. We would >> still need the magic restictions on page allocations though >> and it we would use them more often as the initial write to the >> ramdisk would not populate the pages we need. > > What magic restrictions on page allocations? Actually we have > fewer restrictions on page allocations because we can use > highmem! With the proposed rewrite yes. > And the lowmem buffercache pages that we currently pin > (unsuccessfully, in the case of this bug) are now completely > reclaimable. And all your buffer heads are now reclaimable. Hmm. Good point. So in net it should save memory even if it consumes a little more in the worst case. > If you mean GFP_NOIO... I don't see any problem. Block device > drivers have to allocate memory with GFP_NOIO; this may have > been considered magic or deep badness back when the code was > written, but it's pretty simple and accepted now. Well I always figured it was a bit rude allocating large amounts of memory GFP_NOIO but whatever. 
>> A very ugly bit seems to be the fact that we assume we can >> dereference bh->b_data without any special magic which >> means the ramdisk must live in low memory on 32bit machines. > > Yeah but that's not rd.c. You need to rewrite the buffer layer > to fix that (see fsblock ;)). I'm not certain which way we should go. Take fsblock and run it in parallel until everything is converted or use fsblock as a prototype and once we have figured out which way we should go convert struct buffer_head into struct fsblock one patch at a time. I'm inclined to think we should evolve the buffer_head. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure 2007-10-16 4:57 ` Eric W. Biederman @ 2007-10-16 8:08 ` Nick Piggin 2007-10-16 7:47 ` [patch][rfc] rewrite ramdisk Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-16 8:08 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > >> make_page_uptodate() is most hideous part I have run into. > >> It has to know details about other layers to now what not > >> to stomp. I think my incorrect simplification of this is what messed > >> things up, last round. > > > > Not really, it's just named funny. That's just a minor utility > > function that more or less does what it says it should do. > > > > The main problem is really that it's implementing a block device > > who's data comes from its own buffercache :P. I think. > > Well to put it another way, mark_page_uptodate() is the only > place where we really need to know about the upper layers. > Given that you can kill ramdisks by coding it as: > > static void make_page_uptodate(struct page *page) > { > clear_highpage(page); > flush_dcache_page(page); > SetPageUptodate(page); > } > > Something is seriously non-intuitive about that function if > you understand the usual rules for how to use the page cache. You're overwriting some buffers that were uptodate and dirty. That would be expected to cause problems. > The problem is that we support a case in the buffer cache > where pages are partially uptodate and only the buffer_heads > remember which parts are valid. Assuming we are using them > correctly. > > Having to walk through all of the buffer heads in make_page_uptodate > seems to me to be a nasty layering violation in rd.c Sure, but it's not just about the buffers. It's the pagecache in general. It is supposed to be invisible to the device driver and sitting above it, and yet it is taking the buffercache and using it to pull its data out of. > > I think it's worthwhile, given that we'd have a "real" looking > > block device and minus these bugs. > > For testing purposes I think I can agree with that. What non-testing uses does it have? > >> Having a separate store would > >> solve some of the problems, and probably remove the need > >> for carefully specifying the ramdisk block size. We would > >> still need the magic restictions on page allocations though > >> and it we would use them more often as the initial write to the > >> ramdisk would not populate the pages we need. > > > > What magic restrictions on page allocations? Actually we have > > fewer restrictions on page allocations because we can use > > highmem! > > With the proposed rewrite yes. > > > And the lowmem buffercache pages that we currently pin > > (unsuccessfully, in the case of this bug) are now completely > > reclaimable. And all your buffer heads are now reclaimable. > > Hmm. Good point. So in net it should save memory even if > it consumes a little more in the worst case. Highmem systems would definitely like it. For others, yes, all the duplicated pages should be able to get reclaimed if memory gets tight, along with the buffer heads, so yeah footprint may be a tad smaller. > > If you mean GFP_NOIO... I don't see any problem. 
Block device > > drivers have to allocate memory with GFP_NOIO; this may have > > been considered magic or deep badness back when the code was > > written, but it's pretty simple and accepted now. > > Well I always figured it was a bit rude allocating large amounts > of memory GFP_NOIO but whatever. You'd rather not, of course, but with dirty data limits now, it doesn't matter much. (and I doubt anybody outside testing is going to be hammering like crazy on rd). Note that the buffercache based ramdisk driver is going to also be allocating with GFP_NOFS if you're talking about a filesystem writing to its metadata. In most systems, GFP_NOFS isn't much different to GFP_NOIO. We could introduce a mode which allocates pages up front quite easily if it were a problem (which I doubt it ever would be). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
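The "allocate pages up front" mode Nick mentions would be a few lines on top of the helpers in the rewrite posted below; a hypothetical sketch (brd_prealloc_pages() does not exist in any posted patch, it just reuses brd_insert_page() and PAGE_SECTORS from the RFC):

/* Hypothetical, not in any posted patch: touch every backing page at
 * setup time so the I/O path never has to allocate under GFP_NOIO. */
static int brd_prealloc_pages(struct brd_device *brd, sector_t nr_sectors)
{
	sector_t sector;

	for (sector = 0; sector < nr_sectors; sector += PAGE_SECTORS) {
		if (!brd_insert_page(brd, sector))
			return -ENOMEM;
	}
	return 0;
}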
* [patch][rfc] rewrite ramdisk 2007-10-16 8:08 ` Nick Piggin @ 2007-10-16 7:47 ` Nick Piggin 2007-10-16 7:52 ` Jan Engelhardt ` (3 more replies) 0 siblings, 4 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-16 7:47 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o [-- Attachment #1: Type: text/plain, Size: 342 bytes --] On Tuesday 16 October 2007 18:08, Nick Piggin wrote: > On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote: > > > What magic restrictions on page allocations? Actually we have > > > fewer restrictions on page allocations because we can use > > > highmem! > > > > With the proposed rewrite yes. Here's a quick first hack... Comments? [-- Attachment #2: rd2.patch --] [-- Type: text/x-diff, Size: 13940 bytes --] This is a rewrite of the ramdisk block device driver. The old one is really difficult because it effectively implements a block device which serves data out of its own buffer cache. This makes it crap. The new one is more like a regular block device driver. It has no idea about vm/vfs stuff. It's backing store is similar to the buffer cache (a simple radix-tree of pages), but it doesn't know anything about page cache (the pages in the radix tree are not pagecache pages). There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. The fact that it now goes through all the regular vm/fs paths makes it much more useful for testing, too. --- Index: linux-2.6/drivers/block/Kconfig =================================================================== --- linux-2.6.orig/drivers/block/Kconfig +++ linux-2.6/drivers/block/Kconfig @@ -329,6 +329,7 @@ config BLK_DEV_UB config BLK_DEV_RAM tristate "RAM disk support" + depends on !BLK_DEV_BRD ---help--- Saying Y here will allow you to use a portion of your RAM memory as a block device, so that you can make file systems on it, read and @@ -346,10 +347,24 @@ config BLK_DEV_RAM Most normal users won't need the RAM disk functionality, and can thus say N here. +config BLK_DEV_BRD + tristate "RAM block device support" + ---help--- + This is a new based block driver that replaces BLK_DEV_RAM. + + Note that the kernel command line option "ramdisk=XX" is now + obsolete. For details, read <file:Documentation/ramdisk.txt>. + + To compile this driver as a module, choose M here: the + module will be called rd. + + Most normal users won't need the RAM disk functionality, and can + thus say N here. + config BLK_DEV_RAM_COUNT int "Default number of RAM disks" default "16" - depends on BLK_DEV_RAM + depends on BLK_DEV_RAM || BLK_DEV_BRD help The default value is 16 RAM disks. Change this if you know what are doing. If you boot from a filesystem that needs to be extracted @@ -357,7 +372,7 @@ config BLK_DEV_RAM_COUNT config BLK_DEV_RAM_SIZE int "Default RAM disk size (kbytes)" - depends on BLK_DEV_RAM + depends on BLK_DEV_RAM || BLK_DEV_BRD default "4096" help The default value is 4096 kilobytes. 
Only change this if you know Index: linux-2.6/drivers/block/Makefile =================================================================== --- linux-2.6.orig/drivers/block/Makefile +++ linux-2.6/drivers/block/Makefile @@ -12,6 +12,7 @@ obj-$(CONFIG_PS3_DISK) += ps3disk.o obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o obj-$(CONFIG_BLK_DEV_RAM) += rd.o +obj-$(CONFIG_BLK_DEV_BRD) += brd.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o obj-$(CONFIG_BLK_DEV_PS2) += ps2esdi.o obj-$(CONFIG_BLK_DEV_XD) += xd.o Index: linux-2.6/drivers/block/brd.c =================================================================== --- /dev/null +++ linux-2.6/drivers/block/brd.c @@ -0,0 +1,467 @@ +/* + * Ram backed block device driver. + * + * Copyright (C) 2007 Nick Piggin + * Copyright (C) 2007 Novell Inc. + * + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/major.h> +#include <linux/blkdev.h> +#include <linux/bio.h> +#include <linux/highmem.h> +#include <linux/gfp.h> +#include <linux/radix-tree.h> + +#include <asm/uaccess.h> + +#define PAGE_SECTORS (1 << (PAGE_SHIFT - 9)) + +struct brd_device { + int brd_number; + int brd_refcnt; + loff_t brd_offset; + loff_t brd_sizelimit; + unsigned brd_blocksize; + + struct request_queue *brd_queue; + struct gendisk *brd_disk; + + spinlock_t brd_lock; + struct radix_tree_root brd_pages; + + struct list_head brd_list; +}; + +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) +{ + unsigned long idx; + struct page *page; + + rcu_read_lock(); + idx = sector >> (PAGE_SHIFT - 9); + page = radix_tree_lookup(&brd->brd_pages, idx); + rcu_read_unlock(); + + BUG_ON(page && page->index != idx); + + return page; +} + +static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) +{ + unsigned long idx; + struct page *page; + + page = brd_lookup_page(brd, sector); + if (page) + return page; + + page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO); + if (!page) + return NULL; + + if (radix_tree_preload(GFP_NOIO)) { + __free_page(page); + return NULL; + } + + spin_lock(&brd->brd_lock); + idx = sector >> (PAGE_SHIFT - 9); + if (radix_tree_insert(&brd->brd_pages, idx, page)) { + __free_page(page); + page = radix_tree_lookup(&brd->brd_pages, idx); + BUG_ON(!page); + BUG_ON(page->index != idx); + } else + page->index = idx; + spin_unlock(&brd->brd_lock); + + radix_tree_preload_end(); + + return page; +} + +#define FREE_BATCH 16 +static void brd_free_pages(struct brd_device *brd) +{ + unsigned long pos = 0; + struct page *pages[FREE_BATCH]; + int nr_pages; + + do { + int i; + + nr_pages = radix_tree_gang_lookup(&brd->brd_pages, + (void **)pages, pos, FREE_BATCH); + + for (i = 0; i < nr_pages; i++) { + BUG_ON(pages[i]->index < pos); + pos = pages[i]->index+1; + __free_page(pages[i]); + } + + } while (nr_pages == FREE_BATCH); +} + +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n) +{ + unsigned int offset = (sector & (PAGE_SECTORS-1)) << 9; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); + if (!brd_insert_page(brd, sector)) + return -ENOMEM; + if (copy < n) { + if (!brd_insert_page(brd, sector)) + return -ENOMEM; + } + return 0; +} + +static void copy_to_brd(struct brd_device *brd, const void *src, sector_t sector, size_t n) +{ + struct page *page; + void *dst; + unsigned int offset = (sector & (PAGE_SECTORS-1)) << 9; + size_t copy; + + copy = min((unsigned 
long)n, PAGE_SIZE - offset); + page = brd_lookup_page(brd, sector); + BUG_ON(!page); + + dst = kmap_atomic(page, KM_USER1); + memcpy(dst + offset, src, copy); + kunmap_atomic(dst, KM_USER1); + + if (copy < n) { + copy = n - copy; + page = brd_lookup_page(brd, sector); + BUG_ON(!page); + + dst = kmap_atomic(page, KM_USER1); + memcpy(dst, src, copy); + kunmap_atomic(dst, KM_USER1); + } +} + +static void copy_from_brd(void *dst, struct brd_device *brd, sector_t sector, size_t n) +{ + struct page *page; + const void *src; + unsigned int offset = (sector & (PAGE_SECTORS-1)) << 9; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); + page = brd_lookup_page(brd, sector); + if (page) { + src = kmap_atomic(page, KM_USER1); + memcpy(dst, src + offset, copy); + kunmap_atomic(src, KM_USER1); + } else + memset(dst, 0, copy); + + if (copy < n) { + copy = n - copy; + page = brd_lookup_page(brd, sector); + if (page) { + src = kmap_atomic(page, KM_USER1); + memcpy(dst, src, copy); + kunmap_atomic(src, KM_USER1); + } else + memset(dst, 0, copy); + } +} + +static int brd_do_bvec(struct brd_device *brd, struct page *page, unsigned int len, unsigned int off, int rw, sector_t sector) +{ + void *mem; + int err = 0; + + if (rw != READ) { + err = copy_to_brd_setup(brd, sector, len); + if (err) + goto out; + } + + mem = kmap_atomic(page, KM_USER0); + if (rw == READ) { + copy_from_brd(mem + off, brd, sector, len); + flush_dcache_page(page); + } else + copy_to_brd(brd, mem + off, sector, len); + kunmap_atomic(mem, KM_USER0); + +out: + return err; +} + +static int brd_make_request(struct request_queue *q, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct brd_device *brd = bdev->bd_disk->private_data; + int rw; + struct bio_vec *bvec; + sector_t sector; + int i; + int err = -EIO; + + sector = bio->bi_sector; + if (sector + (bio->bi_size >> 9) > get_capacity(bdev->bd_disk)) + goto out; + + rw = bio_rw(bio); + if (rw == READA) + rw = READ; + + bio_for_each_segment(bvec, bio, i) { + unsigned int len = bvec->bv_len; + err = brd_do_bvec(brd, bvec->bv_page, len, bvec->bv_offset, rw, sector); + if (err) + break; + sector += len >> 9; + } + +out: + bio_endio(bio, err); + + return 0; +} + +static int brd_ioctl(struct inode *inode, struct file *file, + unsigned int cmd, unsigned long arg) +{ + int error; + struct block_device *bdev = inode->i_bdev; + struct brd_device *brd = bdev->bd_disk->private_data; + + if (cmd != BLKFLSBUF) + return -ENOTTY; + + /* + * ram device BLKFLSBUF has special semantics, we want to actually + * release and destroy the ramdisk data. + */ + mutex_lock(&bdev->bd_mutex); + error = -EBUSY; + if (bdev->bd_openers <= 1) { + brd_free_pages(brd); + error = 0; + } + mutex_unlock(&bdev->bd_mutex); + + return error; +} + +static struct block_device_operations brd_fops = { + .owner = THIS_MODULE, + .ioctl = brd_ioctl, +}; + +/* + * And now the modules code and kernel interface. 
+ */ +static int rd_nr; +static int rd_size = CONFIG_BLK_DEV_RAM_SIZE; +module_param(rd_nr, int, 0); +MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices"); +module_param(rd_size, int, 0); +MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes."); +MODULE_LICENSE("GPL"); +MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); + +/* options - nonmodular */ +#ifndef MODULE +static int __init ramdisk_size(char *str) +{ + rd_size = simple_strtol(str,NULL,0); + return 1; +} +static int __init ramdisk_size2(char *str) +{ + return ramdisk_size(str); +} +static int __init rd_nr(char *str) +{ + rd_nr = simple_strtol(str, NULL, 0); + return 1; +} +__setup("ramdisk=", ramdisk_size); +__setup("ramdisk_size=", ramdisk_size2); +__setup("rd_nr=", rd_nr); +#endif + + +static LIST_HEAD(brd_devices); +static DEFINE_MUTEX(brd_devices_mutex); + +static struct brd_device *brd_alloc(int i) +{ + struct brd_device *brd; + struct gendisk *disk; + + brd = kzalloc(sizeof(*brd), GFP_KERNEL); + if (!brd) + goto out; + brd->brd_number = i; + spin_lock_init(&brd->brd_lock); + INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC); + + brd->brd_queue = blk_alloc_queue(GFP_KERNEL); + if (!brd->brd_queue) + goto out_free_dev; + blk_queue_make_request(brd->brd_queue, brd_make_request); + blk_queue_max_sectors(brd->brd_queue, 1024); + blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY); + + disk = brd->brd_disk = alloc_disk(1); + if (!disk) + goto out_free_queue; + disk->major = RAMDISK_MAJOR; + disk->first_minor = i; + disk->fops = &brd_fops; + disk->private_data = brd; + disk->queue = brd->brd_queue; + sprintf(disk->disk_name, "ram%d", i); + set_capacity(disk, rd_size * 2); + + return brd; + +out_free_queue: + blk_cleanup_queue(brd->brd_queue); +out_free_dev: + kfree(brd); +out: + return NULL; +} + +static void brd_free(struct brd_device *brd) +{ + put_disk(brd->brd_disk); + blk_cleanup_queue(brd->brd_queue); + brd_free_pages(brd); + kfree(brd); +} + +static struct brd_device *brd_init_one(int i) +{ + struct brd_device *brd; + + list_for_each_entry(brd, &brd_devices, brd_list) { + if (brd->brd_number == i) + goto out; + } + + brd = brd_alloc(i); + if (brd) { + add_disk(brd->brd_disk); + list_add_tail(&brd->brd_list, &brd_devices); + } +out: + return brd; +} + +static void brd_del_one(struct brd_device *brd) +{ + list_del(&brd->brd_list); + del_gendisk(brd->brd_disk); + brd_free(brd); +} + +static struct kobject *brd_probe(dev_t dev, int *part, void *data) +{ + struct brd_device *brd; + struct kobject *kobj; + + mutex_lock(&brd_devices_mutex); + brd = brd_init_one(dev & MINORMASK); + kobj = brd ? get_disk(brd->brd_disk) : ERR_PTR(-ENOMEM); + mutex_unlock(&brd_devices_mutex); + + *part = 0; + return kobj; +} + +static int __init brd_init(void) +{ + int i, nr; + unsigned long range; + struct brd_device *brd, *next; + + /* + * brd module now has a feature to instantiate underlying device + * structure on-demand, provided that there is an access dev node. + * However, this will not work well with user space tool that doesn't + * know about such "feature". In order to not break any existing + * tool, we do the following: + * + * (1) if rd_nr is specified, create that many upfront, and this + * also becomes a hard limit. + * (2) if rd_nr is not specified, create 1 rd device on module + * load, user can further extend brd device by create dev node + * themselves and have kernel automatically instantiate actual + * device on-demand. 
+ */ + if (rd_nr > 1UL << MINORBITS) + return -EINVAL; + + if (rd_nr) { + nr = rd_nr; + range = rd_nr; + } else { + nr = CONFIG_BLK_DEV_RAM_COUNT; + range = 1UL << MINORBITS; + } + + if (register_blkdev(RAMDISK_MAJOR, "ramdisk")) + return -EIO; + + for (i = 0; i < nr; i++) { + brd = brd_alloc(i); + if (!brd) + goto out_free; + list_add_tail(&brd->brd_list, &brd_devices); + } + + /* point of no return */ + + list_for_each_entry(brd, &brd_devices, brd_list) + add_disk(brd->brd_disk); + + blk_register_region(MKDEV(RAMDISK_MAJOR, 0), range, + THIS_MODULE, brd_probe, NULL, NULL); + + printk(KERN_INFO "brd: module loaded\n"); + return 0; + +out_free: + list_for_each_entry_safe(brd, next, &brd_devices, brd_list) { + list_del(&brd->brd_list); + brd_free(brd); + } + + unregister_blkdev(RAMDISK_MAJOR, "brd"); + return -ENOMEM; +} + +static void __exit brd_exit(void) +{ + unsigned long range; + struct brd_device *brd, *next; + + range = rd_nr ? rd_nr : 1UL << MINORBITS; + + list_for_each_entry_safe(brd, next, &brd_devices, brd_list) + brd_del_one(brd); + + blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), range); + unregister_blkdev(RAMDISK_MAJOR, "ramdisk"); +} + +module_init(brd_init); +module_exit(brd_exit); + ^ permalink raw reply [flat|nested] 77+ messages in thread
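One nit worth flagging in the RFC above: in copy_to_brd()/copy_from_brd(), when a segment straddles two backing pages, the second copy is issued with the same sector and source pointer as the first. A sketch of the write-side helper with the boundary-crossing branch advanced explicitly (same names as the patch, so read this as a suggested tweak rather than tested code):

static void copy_to_brd(struct brd_device *brd, const void *src,
			sector_t sector, size_t n)
{
	struct page *page;
	void *dst;
	unsigned int offset = (sector & (PAGE_SECTORS-1)) << 9;
	size_t copy;

	copy = min((unsigned long)n, PAGE_SIZE - offset);
	page = brd_lookup_page(brd, sector);
	BUG_ON(!page);

	dst = kmap_atomic(page, KM_USER1);
	memcpy(dst + offset, src, copy);
	kunmap_atomic(dst, KM_USER1);

	if (copy < n) {
		src += copy;		/* continue in the source buffer... */
		sector += copy >> 9;	/* ...and in the next backing page */
		copy = n - copy;

		page = brd_lookup_page(brd, sector);
		BUG_ON(!page);

		dst = kmap_atomic(page, KM_USER1);
		memcpy(dst, src, copy);	/* second page always starts at offset 0 */
		kunmap_atomic(dst, KM_USER1);
	}
}

copy_to_brd_setup() would want the same sector advance before its second brd_insert_page() call, and copy_from_brd() the matching adjustment on the read side.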
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 7:47 ` [patch][rfc] rewrite ramdisk Nick Piggin @ 2007-10-16 7:52 ` Jan Engelhardt 2007-10-16 8:07 ` Nick Piggin 2007-10-16 9:08 ` Eric W. Biederman ` (2 subsequent siblings) 3 siblings, 1 reply; 77+ messages in thread From: Jan Engelhardt @ 2007-10-16 7:52 UTC (permalink / raw) To: Nick Piggin Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Oct 16 2007 17:47, Nick Piggin wrote: > >Here's a quick first hack... Inline patches preferred ;-) >+config BLK_DEV_BRD >+ tristate "RAM block device support" >+ ---help--- >+ This is a new based block driver that replaces BLK_DEV_RAM. based on what? -^ >+ To compile this driver as a module, choose M here: the >+ module will be called rd. called brd.ko. >+/* >+ * And now the modules code and kernel interface. >+ */ >+static int rd_nr; >+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE; Perhaps unsigned? Perhaps even long for rd_size? >+module_param(rd_nr, int, 0); >+MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices"); >+module_param(rd_size, int, 0); >+MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes."); >+MODULE_LICENSE("GPL"); >+MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); >+ >+/* options - nonmodular */ >+#ifndef MODULE >+static int __init ramdisk_size(char *str) >+{ >+ rd_size = simple_strtol(str,NULL,0); >+ return 1; >+} Is this, besides for compatibility, really needed? >+static int __init ramdisk_size2(char *str) >+{ >+ return ramdisk_size(str); >+} >+static int __init rd_nr(char *str) Err! Overlapping symbols! The rd_nr() function collides with the rd_nr variable. It also does not seem needed, since it did not exist before. It should go, you can set the variable with brd.rd_nr=XXX (same goes for ramdisk_size). What's the point of ramdisk_size2()? >+{ >+ rd_nr = simple_strtol(str, NULL, 0); >+ return 1; >+} >+__setup("ramdisk=", ramdisk_size); >+__setup("ramdisk_size=", ramdisk_size2); __setup("ramdisk_size=", ramdisk_size); maybe, or does not that work? >+__setup("rd_nr=", rd_nr); >+#endif >+ >+ >+static struct brd_device *brd_alloc(int i) >+{ >+ struct brd_device *brd; >+ struct gendisk *disk; >+ >+ brd = kzalloc(sizeof(*brd), GFP_KERNEL); >+ if (!brd) >+ goto out; >+ brd->brd_number = i; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 7:52 ` Jan Engelhardt @ 2007-10-16 8:07 ` Nick Piggin 2007-10-16 8:17 ` Jan Engelhardt 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-16 8:07 UTC (permalink / raw) To: Jan Engelhardt Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tuesday 16 October 2007 17:52, Jan Engelhardt wrote: > On Oct 16 2007 17:47, Nick Piggin wrote: > >Here's a quick first hack... > > Inline patches preferred ;-) Thanks for reviewing it anyway ;) > >+config BLK_DEV_BRD > >+ tristate "RAM block device support" > >+ ---help--- > >+ This is a new based block driver that replaces BLK_DEV_RAM. > > based on what? -^ RAM based. > >+ To compile this driver as a module, choose M here: the > >+ module will be called rd. > > called brd.ko. Changed. But it will hopefully just completely replace rd.c, so I will probably just rename it to rd.c at some point (and change .config options to stay compatible). Unless someone sees a problem with that? > >+/* > >+ * And now the modules code and kernel interface. > >+ */ > >+static int rd_nr; > >+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE; > > Perhaps unsigned? > Perhaps even long for rd_size? I've taken most of that stuff out of rd.c in an effort to stay back compatible. I don't know if it really matters to use long? > >+module_param(rd_nr, int, 0); > >+MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices"); > >+module_param(rd_size, int, 0); > >+MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes."); > >+MODULE_LICENSE("GPL"); > >+MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); > >+ > >+/* options - nonmodular */ > >+#ifndef MODULE > >+static int __init ramdisk_size(char *str) > >+{ > >+ rd_size = simple_strtol(str,NULL,0); > >+ return 1; > >+} > > Is this, besides for compatibility, really needed? > > >+static int __init ramdisk_size2(char *str) > >+{ > >+ return ramdisk_size(str); > >+} > >+static int __init rd_nr(char *str) > > Err! Overlapping symbols! The rd_nr() function collides with the rd_nr > variable. Thanks... %s gone awry. Fixed to the expected names. > It also does not seem needed, since it did not exist before. > It should go, you can set the variable with brd.rd_nr=XXX (same > goes for ramdisk_size). But only if it's a module? > What's the point of ramdisk_size2()? Back compat. When rewriting the internals, I want to try avoid changing any externals if possible. Whether this is the Right Way to do it or not, I don't know :P > >+{ > >+ rd_nr = simple_strtol(str, NULL, 0); > >+ return 1; > >+} > >+__setup("ramdisk=", ramdisk_size); > >+__setup("ramdisk_size=", ramdisk_size2); > > __setup("ramdisk_size=", ramdisk_size); maybe, or does not that work? Didn't try it, but the rd.c code does the same thing so I guess it doesn't (or didn't, when it was written). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
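For reference, the collision fix Nick refers to amounts to giving the __setup handler a name distinct from the rd_nr variable; a sketch (the handler name here is illustrative, not necessarily what the respin used):

#ifndef MODULE
static int __init ramdisk_nr_setup(char *str)	/* was rd_nr(), clashing with the variable */
{
	rd_nr = simple_strtol(str, NULL, 0);
	return 1;
}
__setup("rd_nr=", ramdisk_nr_setup);
#endif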
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 8:07 ` Nick Piggin @ 2007-10-16 8:17 ` Jan Engelhardt 2007-10-16 8:26 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Jan Engelhardt @ 2007-10-16 8:17 UTC (permalink / raw) To: Nick Piggin Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Oct 16 2007 18:07, Nick Piggin wrote: >Changed. But it will hopefully just completely replace rd.c, >so I will probably just rename it to rd.c at some point (and >change .config options to stay compatible). Unless someone >sees a problem with that? I do not see a problem with keeping brd either. >> It also does not seem needed, since it did not exist before. >> It should go, you can set the variable with brd.rd_nr=XXX (same >> goes for ramdisk_size). > >But only if it's a module? Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 and you will see. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 8:17 ` Jan Engelhardt @ 2007-10-16 8:26 ` Nick Piggin 2007-10-16 8:53 ` Jan Engelhardt 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-16 8:26 UTC (permalink / raw) To: Jan Engelhardt Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Tuesday 16 October 2007 18:17, Jan Engelhardt wrote: > On Oct 16 2007 18:07, Nick Piggin wrote: > >Changed. But it will hopefully just completely replace rd.c, > >so I will probably just rename it to rd.c at some point (and > >change .config options to stay compatible). Unless someone > >sees a problem with that? > > I do not see a problem with keeping brd either. Just doesn't seem to be any point in making it a new and different module, assuming it can support exactly the same semantics. I'm only doing so in these first diffs so that they are easier to read and also easier to do a side by side comparison / builds with the old rd.c > >> It also does not seem needed, since it did not exist before. > >> It should go, you can set the variable with brd.rd_nr=XXX (same > >> goes for ramdisk_size). > > > >But only if it's a module? > > Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 > and you will see. Ah, nice. (I don't use them much!). Still, backward compat I think is needed if we are to replace rd.c. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 8:26 ` Nick Piggin @ 2007-10-16 8:53 ` Jan Engelhardt 0 siblings, 0 replies; 77+ messages in thread From: Jan Engelhardt @ 2007-10-16 8:53 UTC (permalink / raw) To: Nick Piggin Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Oct 16 2007 18:26, Nick Piggin wrote: >> >> It also does not seem needed, since it did not exist before. >> >> It should go, you can set the variable with brd.rd_nr=XXX (same >> >> goes for ramdisk_size). >> > >> >But only if it's a module? >> >> Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 >> and you will see. > >Ah, nice. (I don't use them much!). Still, backward compat I >think is needed if we are to replace rd.c. > Like I said, I did not see rd_nr in Documentation/kernel-parameters.txt so I thought there was no compat. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 7:47 ` [patch][rfc] rewrite ramdisk Nick Piggin 2007-10-16 7:52 ` Jan Engelhardt @ 2007-10-16 9:08 ` Eric W. Biederman 2007-10-16 21:28 ` Theodore Tso 2007-10-17 10:30 ` Eric W. Biederman 3 siblings, 0 replies; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 9:08 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Well on that same tune. But with a slightly different implementation. It compiles but I need to head to bed so I haven't had a chance to test it yet. Nick it is very similar to yours with the big difference being that I embedded a struct address_space instead of rolled rerolled it by hand, which saves a lot of lines of code. Eric --- drivers/block/rd.c /* * ramdisk.c - Multiple RAM disk driver - gzip-loading version - v. 0.8 beta. * * (C) Chad Page, Theodore Ts'o, et. al, 1995. * * This RAM disk is designed to have filesystems created on it and mounted * just like a regular floppy disk. * * It also does something suggested by Linus: use the buffer cache as the * RAM disk data. This makes it possible to dynamically allocate the RAM disk * buffer - with some consequences I have to deal with as I write this. * * This code is based on the original ramdisk.c, written mostly by * Theodore Ts'o (TYT) in 1991. The code was largely rewritten by * Chad Page to use the buffer cache to store the RAM disk data in * 1995; Theodore then took over the driver again, and cleaned it up * for inclusion in the mainline kernel. * * The original CRAMDISK code was written by Richard Lyons, and * adapted by Chad Page to use the new RAM disk interface. Theodore * Ts'o rewrote it so that both the compressed RAM disk loader and the * kernel decompressor uses the same inflate.c codebase. The RAM disk * loader now also loads into a dynamic (buffer cache based) RAM disk, * not the old static RAM disk. Support for the old static RAM disk has * been completely removed. * * Loadable module support added by Tom Dyas. * * Further cleanups by Chad Page (page0588@sundance.sjsu.edu): * Cosmetic changes in #ifdef MODULE, code movement, etc. * When the RAM disk module is removed, free the protected buffers * Default RAM disk size changed to 2.88 MB * * Added initrd: Werner Almesberger & Hans Lermen, Feb '96 * * 4/25/96 : Made RAM disk size a parameter (default is now 4 MB) * - Chad Page * * Add support for fs images split across >1 disk, Paul Gortmaker, Mar '98 * * Make block size and block size shift for RAM disks a global macro * and set blk_size for -ENOSPC, Werner Fink <werner@suse.de>, Apr '99 */ #include <linux/string.h> #include <linux/slab.h> #include <asm/atomic.h> #include <linux/bio.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/init.h> #include <linux/pagemap.h> #include <linux/blkdev.h> #include <linux/genhd.h> #include <linux/buffer_head.h> /* for invalidate_bdev() */ #include <linux/backing-dev.h> #include <linux/blkpg.h> #include <linux/writeback.h> #include <asm/uaccess.h> #define RAMDISK_MINORS 250 struct ramdisk { struct address_space rd_mapping; struct gendisk *rd_disk; struct request_queue *rd_queue; struct list_head rd_list; }; /* Various static variables go here. Most are used only in the RAM disk code. */ static LIST_HEAD(ramdisks); static DEFINE_MUTEX(ramdisks_mutex); /* * Parameters for the boot-loading of the RAM disk. 
These are set by * init/main.c (from arguments to the kernel command line) or from the * architecture-specific setup routine (from the stored boot sector * information). */ int rd_size = CONFIG_BLK_DEV_RAM_SIZE; /* Size of the RAM disks */ /* * It would be very desirable to have a soft-blocksize (that in the case * of the ramdisk driver is also the hardblocksize ;) of PAGE_SIZE because * doing that we'll achieve a far better MM footprint. Using a rd_blocksize of * BLOCK_SIZE in the worst case we'll make PAGE_SIZE/BLOCK_SIZE buffer-pages * unfreeable. With a rd_blocksize of PAGE_SIZE instead we are sure that only * 1 page will be protected. Depending on the size of the ramdisk you * may want to change the ramdisk blocksize to achieve a better or worse MM * behaviour. The default is still BLOCK_SIZE (needed by rd_load_image that * supposes the filesystem in the image uses a BLOCK_SIZE blocksize). */ static int rd_blocksize = CONFIG_BLK_DEV_RAM_BLOCKSIZE; /* * Copyright (C) 2000 Linus Torvalds. * 2000 Transmeta Corp. * aops copied from ramfs. */ static const struct address_space_operations ramdisk_aops = { .readpage = simple_readpage, .set_page_dirty = __set_page_dirty_no_writeback, }; static struct ramdisk *bdev_ramdisk(struct block_device *bdev) { return bdev->bd_disk->private_data; } static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector, struct address_space *mapping) { pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9); unsigned int vec_offset = vec->bv_offset; int offset = (sector << 9) & ~PAGE_CACHE_MASK; int size = vec->bv_len; int err = 0; do { int count; struct page *page; char *src; char *dst; count = PAGE_CACHE_SIZE - offset; if (count > size) count = size; size -= count; page = grab_cache_page(mapping, index); if (!page) { err = -ENOMEM; goto out; } if (!PageUptodate(page)) { clear_highpage(page); SetPageUptodate(page); } index++; if (rw == READ) { src = kmap_atomic(page, KM_USER0) + offset; dst = kmap_atomic(vec->bv_page, KM_USER1) + vec_offset; } else { src = kmap_atomic(vec->bv_page, KM_USER0) + vec_offset; dst = kmap_atomic(page, KM_USER1) + offset; } offset = 0; vec_offset += count; memcpy(dst, src, count); kunmap_atomic(src, KM_USER0); kunmap_atomic(dst, KM_USER1); if (rw == WRITE) set_page_dirty(page); unlock_page(page); put_page(page); } while (size); out: return err; } /* * Basically, my strategy here is to set up a buffer-head which can't be * deleted, and make that my Ramdisk. If the request is outside of the * allocated size, we must get rid of it... 
* * 19-JAN-1998 Richard Gooch <rgooch@atnf.csiro.au> Added devfs support * */ static int rd_make_request(struct request_queue *q, struct bio *bio) { struct block_device *bdev = bio->bi_bdev; struct ramdisk *rd = bdev_ramdisk(bdev); struct address_space * mapping = &rd->rd_mapping; sector_t sector = bio->bi_sector; unsigned long len = bio->bi_size >> 9; int rw = bio_data_dir(bio); struct bio_vec *bvec; int ret = 0, i; if (sector + len > get_capacity(bdev->bd_disk)) goto fail; if (rw==READA) rw=READ; bio_for_each_segment(bvec, bio, i) { ret |= rd_blkdev_pagecache_IO(rw, bvec, sector, mapping); sector += bvec->bv_len >> 9; } if (ret) goto fail; bio_endio(bio, 0); return 0; fail: bio_io_error(bio); return 0; } static int rd_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg) { int error; struct block_device *bdev = inode->i_bdev; struct ramdisk *rd = bdev_ramdisk(bdev); if (cmd != BLKFLSBUF) return -ENOTTY; /* * special: we want to release the ramdisk memory, it's not like with * the other blockdevices where this ioctl only flushes away the buffer * cache */ error = -EBUSY; mutex_lock(&bdev->bd_mutex); if (bdev->bd_openers <= 1) { truncate_inode_pages(&rd->rd_mapping, 0); error = 0; } mutex_unlock(&bdev->bd_mutex); return error; } /* * This is the backing_dev_info for the blockdev inode itself. It doesn't need * writeback and it does not contribute to dirty memory accounting. */ static struct backing_dev_info rd_backing_dev_info = { .ra_pages = 0, /* No readahead */ .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, .unplug_io_fn = default_unplug_io_fn, }; static struct block_device_operations rd_bd_op = { .owner = THIS_MODULE, .ioctl = rd_ioctl, }; static struct ramdisk *rd_alloc(int i) { struct ramdisk *rd; struct request_queue *queue; struct gendisk *disk; struct address_space *mapping; rd = kzalloc(sizeof(*rd), GFP_KERNEL); if (!rd) goto out; rd->rd_queue = queue = blk_alloc_queue(GFP_KERNEL); if (!rd->rd_queue) goto out_free_dev; disk = rd->rd_disk = alloc_disk(1); if (!disk) goto out_free_queue; mapping = &rd->rd_mapping; mapping->a_ops = &ramdisk_aops; mapping->backing_dev_info = &rd_backing_dev_info; mapping_set_gfp_mask(mapping, __GFP_WAIT | __GFP_HIGH | __GFP_HIGHMEM); blk_queue_make_request(queue, &rd_make_request); blk_queue_hardsect_size(queue, rd_blocksize); /* rd_size is given in kB */ disk->major = RAMDISK_MAJOR; disk->first_minor = i; disk->fops = &rd_bd_op; disk->queue = queue; disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO; sprintf(disk->disk_name, "ram%d", i); set_capacity(disk, rd_size * 2); disk->private_data = rd; add_disk(rd->rd_disk); list_add_tail(&rd->rd_list, &ramdisks); return rd; out_free_queue: blk_cleanup_queue(queue); out_free_dev: kfree(rd); out: return NULL; } static void rd_free(struct ramdisk *rd) { del_gendisk(rd->rd_disk); put_disk(rd->rd_disk); blk_cleanup_queue(rd->rd_queue); truncate_inode_pages(&rd->rd_mapping, 0); list_del(&rd->rd_list); kfree(rd); } static struct kobject *rd_probe(dev_t dev, int *part, void *data) { struct ramdisk *rd; struct kobject *kobj; unsigned int unit; kobj = ERR_PTR(-ENOMEM); unit = MINOR(dev); mutex_lock(&ramdisks_mutex); list_for_each_entry(rd, &ramdisks, rd_list) { if (rd->rd_disk->first_minor == unit) goto found; } rd = rd_alloc(MINOR(dev)); found: if (rd) kobj = get_disk(rd->rd_disk); mutex_unlock(&ramdisks_mutex); *part = 0; return kobj; } /* * Before freeing the module, invalidate all of the protected buffers! 
*/ static void __exit rd_cleanup(void) { struct ramdisk *rd, *next; list_for_each_entry_safe(rd, next, &ramdisks, rd_list) rd_free(rd); blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), RAMDISK_MINORS); unregister_blkdev(RAMDISK_MAJOR, "ramdisk"); } /* * This is the registration and initialization section of the RAM disk driver */ static int __init rd_init(void) { if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 || (rd_blocksize & (rd_blocksize-1))) { printk("RAMDISK: wrong blocksize %d, reverting to defaults\n", rd_blocksize); rd_blocksize = BLOCK_SIZE; } if (register_blkdev(RAMDISK_MAJOR, "ramdisk")) return -EIO; blk_register_region(MKDEV(RAMDISK_MAJOR, 0), RAMDISK_MINORS, THIS_MODULE, rd_probe, NULL, NULL); /* rd_size is given in kB */ printk("RAMDISK driver initialized: " "%d RAM disks of %dK size %d blocksize\n", CONFIG_BLK_DEV_RAM_COUNT, rd_size, rd_blocksize); return 0; } module_init(rd_init); module_exit(rd_cleanup); /* options - nonmodular */ #ifndef MODULE static int __init ramdisk_size(char *str) { rd_size = simple_strtol(str,NULL,0); return 1; } static int __init ramdisk_size2(char *str) /* kludge */ { return ramdisk_size(str); } static int __init ramdisk_blocksize(char *str) { rd_blocksize = simple_strtol(str,NULL,0); return 1; } __setup("ramdisk=", ramdisk_size); __setup("ramdisk_size=", ramdisk_size2); __setup("ramdisk_blocksize=", ramdisk_blocksize); #endif /* options - modular */ module_param(rd_size, int, 0); MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes."); module_param(rd_blocksize, int, 0); MODULE_PARM_DESC(rd_blocksize, "Blocksize of each RAM disk in bytes."); MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); MODULE_LICENSE("GPL"); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 7:47 ` [patch][rfc] rewrite ramdisk Nick Piggin 2007-10-16 7:52 ` Jan Engelhardt 2007-10-16 9:08 ` Eric W. Biederman @ 2007-10-16 21:28 ` Theodore Tso 2007-10-16 22:08 ` Nick Piggin 2007-10-17 10:30 ` Eric W. Biederman 3 siblings, 1 reply; 77+ messages in thread From: Theodore Tso @ 2007-10-16 21:28 UTC (permalink / raw) To: Nick Piggin Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote: > + /* > + * ram device BLKFLSBUF has special semantics, we want to actually > + * release and destroy the ramdisk data. > + */ We won't be able to fix completely this for a while time, but the fact that BLKFLSBUF has special semantics has always been a major wart. Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt there are many tools that actually take advantage of this wierd aspect of ramdisks, so hopefully it's something we could remove in a 18 months or so... - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
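A sketch of what Ted is suggesting, layered on the brd_ioctl() from the RFC; the ioctl name and number are entirely made up here, and the warning text is only illustrative:

#define RAMDISKDESTROY	_IO(0x12, 200)	/* hypothetical number, not allocated anywhere */

static int brd_ioctl(struct inode *inode, struct file *file,
		     unsigned int cmd, unsigned long arg)
{
	struct block_device *bdev = inode->i_bdev;
	struct brd_device *brd = bdev->bd_disk->private_data;
	int error;

	if (cmd != BLKFLSBUF && cmd != RAMDISKDESTROY)
		return -ENOTTY;

	if (cmd == BLKFLSBUF)
		printk(KERN_WARNING "brd: destroying ramdisk data on BLKFLSBUF "
			"is deprecated, use RAMDISKDESTROY instead\n");

	mutex_lock(&bdev->bd_mutex);
	error = -EBUSY;
	if (bdev->bd_openers <= 1) {
		brd_free_pages(brd);
		error = 0;
	}
	mutex_unlock(&bdev->bd_mutex);

	return error;
}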
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 21:28 ` Theodore Tso @ 2007-10-16 22:08 ` Nick Piggin 2007-10-16 23:48 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-16 22:08 UTC (permalink / raw) To: Theodore Tso Cc: Eric W. Biederman, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky On Wednesday 17 October 2007 07:28, Theodore Tso wrote: > On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote: > > + /* > > + * ram device BLKFLSBUF has special semantics, we want to actually > > + * release and destroy the ramdisk data. > > + */ > > We won't be able to fix completely this for a while time, but the fact > that BLKFLSBUF has special semantics has always been a major wart. > Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a > deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt > there are many tools that actually take advantage of this wierd aspect > of ramdisks, so hopefully it's something we could remove in a 18 > months or so... It would be nice to be able to do that, I agree. The new ramdisk code will be able to flush the buffer cache and destroy its data separately, so it can actually be implemented. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 22:08 ` Nick Piggin @ 2007-10-16 23:48 ` Eric W. Biederman 2007-10-17 0:28 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-16 23:48 UTC (permalink / raw) To: Nick Piggin Cc: Theodore Tso, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Wednesday 17 October 2007 07:28, Theodore Tso wrote: >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote: >> > + /* >> > + * ram device BLKFLSBUF has special semantics, we want to actually >> > + * release and destroy the ramdisk data. >> > + */ >> >> We won't be able to fix completely this for a while time, but the fact >> that BLKFLSBUF has special semantics has always been a major wart. >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a >> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt >> there are many tools that actually take advantage of this wierd aspect >> of ramdisks, so hopefully it's something we could remove in a 18 >> months or so... > > It would be nice to be able to do that, I agree. The new ramdisk > code will be able to flush the buffer cache and destroy its data > separately, so it can actually be implemented. So the practical problem are peoples legacy boot setups but those are quickly going away. The sane thing is probably something that can be taken as a low level format command for the block device. Say: dd if=/dev/zero of=/dev/ramX I know rewriting the drive with all zeroes can cause a modern disk to redo it's low level format. And that is something we can definitely implement without any backwards compatibility problems. Hmm. Do we have anything special for punching holes in files? That would be another sane route to take to remove the special case for clearing the memory. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-16 23:48 ` Eric W. Biederman @ 2007-10-17 0:28 ` Nick Piggin 2007-10-17 1:13 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-17 0:28 UTC (permalink / raw) To: Eric W. Biederman Cc: Theodore Tso, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky On Wednesday 17 October 2007 09:48, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Wednesday 17 October 2007 07:28, Theodore Tso wrote: > >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote: > >> > + /* > >> > + * ram device BLKFLSBUF has special semantics, we want to actually > >> > + * release and destroy the ramdisk data. > >> > + */ > >> > >> We won't be able to fix completely this for a while time, but the fact > >> that BLKFLSBUF has special semantics has always been a major wart. > >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a > >> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt > >> there are many tools that actually take advantage of this wierd aspect > >> of ramdisks, so hopefully it's something we could remove in a 18 > >> months or so... > > > > It would be nice to be able to do that, I agree. The new ramdisk > > code will be able to flush the buffer cache and destroy its data > > separately, so it can actually be implemented. > > So the practical problem are peoples legacy boot setups but those > are quickly going away. After that, is the ramdisk useful for anything aside from testing? > The sane thing is probably something that can be taken as a low > level format command for the block device. > > Say: dd if=/dev/zero of=/dev/ramX We have 2 problems. First is that, for testing/consistency, we don't want BLKFLSBUF to throw out the data. Maybe hardly anything uses BLKFLSBUF now, so it could be just a minor problem, but still one to fix. Second is actually throwing out the ramdisk data. dd from /dev/null isn't trivial because it isn't a "command" from the kernel's POV. rd could examine the writes to see if they are zero and page aligned, I suppose... but if you're transitioning everyone over to a new method anyway, might as well make it a nice one ;) > I know rewriting the drive with all zeroes can cause a modern > disk to redo it's low level format. And that is something > we can definitely implement without any backwards compatibility > problems. > > Hmm. Do we have anything special for punching holes in files? > That would be another sane route to take to remove the special > case for clearing the memory. truncate_range, I suppose. A file descriptor syscall based alternative for madvise would be nice though (like fallocate). We could always put something in /sys/block/ram*/xxx -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
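For reference, one hole-punching primitive that did already exist at the time is madvise(MADV_REMOVE) (added in 2.6.16), which frees the backing store of a mapped range on filesystems that support it -- tmpfs, at the time. A small userspace sketch, only to illustrate the semantics being discussed; it assumes /dev/shm is tmpfs:

	#define _GNU_SOURCE		/* MADV_REMOVE on older glibc */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1 << 20;
		int fd = open("/dev/shm/hole-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
		char *p;

		if (fd < 0 || ftruncate(fd, len) < 0)
			return 1;
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		memset(p, 0xaa, len);			/* populate the backing pages */
		if (madvise(p, len, MADV_REMOVE) == 0)	/* punch the hole */
			printf("first byte after punch: %#x\n", p[0]);	/* reads back 0 */
		else
			perror("madvise(MADV_REMOVE)");

		munmap(p, len);
		close(fd);
		unlink("/dev/shm/hole-demo");
		return 0;
	}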
* Re: [patch][rfc] rewrite ramdisk 2007-10-17 0:28 ` Nick Piggin @ 2007-10-17 1:13 ` Eric W. Biederman 2007-10-17 1:47 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 1:13 UTC (permalink / raw) To: Nick Piggin Cc: Theodore Tso, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Wednesday 17 October 2007 09:48, Eric W. Biederman wrote: >> Nick Piggin <nickpiggin@yahoo.com.au> writes: >> > On Wednesday 17 October 2007 07:28, Theodore Tso wrote: >> >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote: >> >> > + /* >> >> > + * ram device BLKFLSBUF has special semantics, we want to actually >> >> > + * release and destroy the ramdisk data. >> >> > + */ >> >> >> >> We won't be able to fix completely this for a while time, but the fact >> >> that BLKFLSBUF has special semantics has always been a major wart. >> >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a >> >> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt >> >> there are many tools that actually take advantage of this wierd aspect >> >> of ramdisks, so hopefully it's something we could remove in a 18 >> >> months or so... >> > >> > It would be nice to be able to do that, I agree. The new ramdisk >> > code will be able to flush the buffer cache and destroy its data >> > separately, so it can actually be implemented. >> >> So the practical problem are peoples legacy boot setups but those >> are quickly going away. > > After that, is the ramdisk useful for anything aside from testing? > > >> The sane thing is probably something that can be taken as a low >> level format command for the block device. >> >> Say: dd if=/dev/zero of=/dev/ramX > > We have 2 problems. First is that, for testing/consistency, we > don't want BLKFLSBUF to throw out the data. Maybe hardly anything > uses BLKFLSBUF now, so it could be just a minor problem, but still > one to fix. Hmm. This is interesting because we won't be doing anything that effects correctness if we don't special case BLKFLSBUF just something that effects efficiency. So I think we can get away with just changing blkflsbuf as long as there is a way to get rid of the data. > Second is actually throwing out the ramdisk data. dd from /dev/null > isn't trivial because it isn't a "command" from the kernel's POV. > rd could examine the writes to see if they are zero and page aligned, > I suppose... but if you're transitioning everyone over to a new > method anyway, might as well make it a nice one ;) Well I was thinking you can examine the page you just wrote to and if it is all zero's you don't need to cache that page anymore. Call it intelligent compression. Further it does make forwards and backwards compatibility simple because all you would have to do to reliably free a ramdisk is: dd if=/dev/zero of=/dev/ramX blockdev --flushbufs /dev/ramX >> I know rewriting the drive with all zeroes can cause a modern >> disk to redo it's low level format. And that is something >> we can definitely implement without any backwards compatibility >> problems. >> >> Hmm. Do we have anything special for punching holes in files? >> That would be another sane route to take to remove the special >> case for clearing the memory. > > truncate_range, I suppose. A file descriptor syscall based > alternative for madvise would be nice though (like fallocate). 
>
> We could always put something in /sys/block/ram*/xxx

I guess when I look at this it looks like an operation that is unique
to a ramdisk. Real hard drives have low-level format operations and
the like. If we can find something standard there that is guaranteed
to trash the data, we can use that and end up with more consistency
rather than less.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 77+ messages in thread
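A rough sketch of the "intelligent compression" idea from the message above, written against 2.6-era kernel interfaces. struct ramdisk_device and ramdisk_drop_page() are invented names, and this is an illustration of the approach rather than code from either patch series:

	#include <linux/mm.h>
	#include <linux/highmem.h>

	/* Return 1 if the backing page contains nothing but zeroes. */
	static int ramdisk_page_all_zero(struct page *page)
	{
		unsigned long *p = kmap_atomic(page, KM_USER0);
		int zero = 1;
		size_t i;

		for (i = 0; i < PAGE_SIZE / sizeof(*p); i++) {
			if (p[i]) {
				zero = 0;
				break;
			}
		}
		kunmap_atomic(p, KM_USER0);
		return zero;
	}

	/*
	 * Called after a write has been copied into the ramdisk's backing
	 * page: if the page is now all zero, forget it, so that
	 *	dd if=/dev/zero of=/dev/ramX
	 * really does give the memory back.
	 */
	static void ramdisk_maybe_drop_page(struct ramdisk_device *rd,
					    sector_t sector, struct page *page)
	{
		if (ramdisk_page_all_zero(page))
			ramdisk_drop_page(rd, sector);	/* hypothetical helper */
	}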
* Re: [patch][rfc] rewrite ramdisk 2007-10-17 1:13 ` Eric W. Biederman @ 2007-10-17 1:47 ` Nick Piggin 0 siblings, 0 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-17 1:47 UTC (permalink / raw) To: Eric W. Biederman Cc: Theodore Tso, Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky On Wednesday 17 October 2007 11:13, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > We have 2 problems. First is that, for testing/consistency, we > > don't want BLKFLSBUF to throw out the data. Maybe hardly anything > > uses BLKFLSBUF now, so it could be just a minor problem, but still > > one to fix. > > Hmm. This is interesting because we won't be doing anything that > effects correctness if we don't special case BLKFLSBUF just something > that effects efficiency. So I think we can get away with just > changing blkflsbuf as long as there is a way to get rid of > the data. Technically, it does change correctness: after BLKFLSBUF, the ramdisk should contain zeroes. I'm assuming it would also cause problems in tight embedded environments if ramdisk ram is supposed to be thrown away but isn't. So maybe not technically a correctness problem, but could be the difference between working and not working. > > Second is actually throwing out the ramdisk data. dd from /dev/null > > isn't trivial because it isn't a "command" from the kernel's POV. > > rd could examine the writes to see if they are zero and page aligned, > > I suppose... but if you're transitioning everyone over to a new > > method anyway, might as well make it a nice one ;) > > Well I was thinking you can examine the page you just wrote to > and if it is all zero's you don't need to cache that page anymore. > Call it intelligent compression. > > Further it does make forwards and backwards compatibility simple > because all you would have to do to reliably free a ramdisk is: > > dd if=/dev/zero of=/dev/ramX > blockdev --flushbufs /dev/ramX Sure, you could do that, but you still presumably need to support the old behaviour. As a test vehicle for filesystems, I'd much rather it didn't do this of course, because subsequent writes would need to reallocate the page again. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk
  2007-10-16 7:47 ` [patch][rfc] rewrite ramdisk Nick Piggin
                  ` (2 preceding siblings ...)
  2007-10-16 21:28 ` Theodore Tso
@ 2007-10-17 10:30 ` Eric W. Biederman
  2007-10-17 12:49   ` Nick Piggin
  3 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2007-10-17 10:30 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel,
	Martin Schwidefsky, Theodore Ts'o

Nick Piggin <nickpiggin@yahoo.com.au> writes:

> On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
>> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
>
>> > > What magic restrictions on page allocations? Actually we have
>> > > fewer restrictions on page allocations because we can use
>> > > highmem!
>> >
>> > With the proposed rewrite yes.
>
> Here's a quick first hack...
>
> Comments?

I have beaten my version of this into working shape, and things
seem ok.

However, I'm beginning to think that the real solution is to remove
the dependence on buffer heads for caching the disk mapping for
data pages, and move the metadata buffer heads off of the block
device page cache pages. Although I am just a touch concerned
there may be an issue with using filesystem tools while the
filesystem is mounted if I move the metadata buffer heads.

If we were to move the metadata buffer heads (assuming I haven't
missed some weird dependency) then I think there are a bunch of
weird corner cases that would be simplified.

I guess that is where I look next.

Oh, for what it is worth, I took a quick look at fsblock and I don't think
struct fsblock makes much sense as a block-mapping translation layer for
the data path, where the current page cache works well. For less
than the cost of 1 fsblock I can cache all of the translations for a
1K filesystem on a 4K page.

I haven't looked to see if fsblock makes sense to use as a buffer head
replacement yet.

Anyway, off to bed with me.

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 77+ messages in thread
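For concreteness, the arithmetic behind that per-page cost claim, as a throwaway userspace program; the 8-byte sector_t is an assumption (a 64-bit sector_t), and the point is only the order of magnitude:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const size_t page_size  = 4096;
		const size_t block_size = 1024;		/* "1K filesystem" */
		const size_t blocks_per_page = page_size / block_size;
		const size_t sector_t_size = sizeof(uint64_t);	/* assumed 64-bit sector_t */

		/* One sector_t per block caches every translation on the page. */
		printf("translations per page : %zu\n", blocks_per_page);
		printf("bytes needed per page : %zu\n",
		       blocks_per_page * sector_t_size);	/* 4 * 8 = 32 bytes */
		return 0;
	}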
* Re: [patch][rfc] rewrite ramdisk 2007-10-17 10:30 ` Eric W. Biederman @ 2007-10-17 12:49 ` Nick Piggin 2007-10-17 18:45 ` Eric W. Biederman 0 siblings, 1 reply; 77+ messages in thread From: Nick Piggin @ 2007-10-17 12:49 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Wednesday 17 October 2007 20:30, Eric W. Biederman wrote: > Nick Piggin <nickpiggin@yahoo.com.au> writes: > > On Tuesday 16 October 2007 18:08, Nick Piggin wrote: > >> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote: > >> > > What magic restrictions on page allocations? Actually we have > >> > > fewer restrictions on page allocations because we can use > >> > > highmem! > >> > > >> > With the proposed rewrite yes. > > > > Here's a quick first hack... > > > > Comments? > > I have beaten my version of this into working shape, and things > seem ok. > > However I'm beginning to think that the real solution is to remove > the dependence on buffer heads for caching the disk mapping for > data pages, and move the metadata buffer heads off of the block > device page cache pages. Although I am just a touch concerned > there may be an issue with using filesystem tools while the > filesystem is mounted if I move the metadata buffer heads. > > If we were to move the metadata buffer heads (assuming I haven't > missed some weird dependency) then I think there a bunch of > weird corner cases that would be simplified. I'd prefer just to go along the lines of what I posted. It's a pure block device driver which knows absolutely nothing about vm or vfs. What are you guys using rd for, and is it really important to have this supposed buffercache optimisation? > I guess that is where I look next. > > Oh for what it is worth I took a quick look at fsblock and I don't think > struct fsblock makes much sense as a block mapping translation layer for > the data path where the current page caches works well. Except the page cache doesn't store any such translation. fsblock works very nicely as a buffer_head and nobh-mode replacement there, but we could probably go one better (for many filesystems) by using a tree. > For less > then the cost of 1 fsblock I can cache all of the translations for a > 1K filesystem on a 4K page. I'm not exactly sure what you mean, unless you're talking about an extent based data structure. fsblock is fairly slim as far as a generic solution goes. You could probably save 4 or 8 bytes in it, but I don't know if it's worth the trouble. > I haven't looked to see if fsblock makes sense to use as a buffer head > replacement yet. Well I don't want to get too far off topic on the subject of fsblock, and block mappings (because I think the ramdisk driver simply should have nothing to do with any of that)... but fsblock is exactly a buffer head replacement (so if it doesn't make sense, then I've screwed something up badly! ;)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread
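To make "a pure block device driver which knows absolutely nothing about vm or vfs" concrete, here is a minimal sketch of the kind of backing store such a rewrite implies: pages owned outright by the driver, indexed by a per-device radix tree keyed on page-sized chunks of the disk. The structure and function names are illustrative guesses, not lifted from the posted patch:

	#include <linux/mm.h>
	#include <linux/gfp.h>
	#include <linux/spinlock.h>
	#include <linux/radix-tree.h>

	struct ramdisk_device {
		spinlock_t		lock;
		struct radix_tree_root	pages;	/* page index -> struct page * */
	};

	static struct page *ramdisk_lookup_page(struct ramdisk_device *rd,
						sector_t sector)
	{
		pgoff_t idx = sector >> (PAGE_SHIFT - 9);

		return radix_tree_lookup(&rd->pages, idx);
	}

	static struct page *ramdisk_insert_page(struct ramdisk_device *rd,
						sector_t sector)
	{
		pgoff_t idx = sector >> (PAGE_SHIFT - 9);
		struct page *page;

		page = ramdisk_lookup_page(rd, sector);
		if (page)
			return page;

		/* Highmem is fine here: the driver only maps pages transiently. */
		page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
		if (!page)
			return NULL;

		if (radix_tree_preload(GFP_NOIO)) {
			__free_page(page);
			return NULL;
		}

		spin_lock(&rd->lock);
		if (radix_tree_insert(&rd->pages, idx, page)) {
			/* Lost a race: reuse whatever is already there. */
			__free_page(page);
			page = radix_tree_lookup(&rd->pages, idx);
		}
		spin_unlock(&rd->lock);
		radix_tree_preload_end();

		return page;
	}

Freeing the whole device (for BLKFLSBUF, or on module unload) is then just a walk over the radix tree, deleting and freeing each page, with no buffer-cache or pagecache interaction at all.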
* Re: [patch][rfc] rewrite ramdisk 2007-10-17 12:49 ` Nick Piggin @ 2007-10-17 18:45 ` Eric W. Biederman 2007-10-18 1:06 ` Nick Piggin 0 siblings, 1 reply; 77+ messages in thread From: Eric W. Biederman @ 2007-10-17 18:45 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o Nick Piggin <nickpiggin@yahoo.com.au> writes: > On Wednesday 17 October 2007 20:30, Eric W. Biederman wrote: >> Nick Piggin <nickpiggin@yahoo.com.au> writes: >> > On Tuesday 16 October 2007 18:08, Nick Piggin wrote: >> >> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote: >> >> > > What magic restrictions on page allocations? Actually we have >> >> > > fewer restrictions on page allocations because we can use >> >> > > highmem! >> >> > >> >> > With the proposed rewrite yes. >> > >> > Here's a quick first hack... >> > >> > Comments? >> >> I have beaten my version of this into working shape, and things >> seem ok. >> >> However I'm beginning to think that the real solution is to remove >> the dependence on buffer heads for caching the disk mapping for >> data pages, and move the metadata buffer heads off of the block >> device page cache pages. Although I am just a touch concerned >> there may be an issue with using filesystem tools while the >> filesystem is mounted if I move the metadata buffer heads. >> >> If we were to move the metadata buffer heads (assuming I haven't >> missed some weird dependency) then I think there a bunch of >> weird corner cases that would be simplified. > > I'd prefer just to go along the lines of what I posted. It's > a pure block device driver which knows absolutely nothing about > vm or vfs. > > What are you guys using rd for, and is it really important to > have this supposed buffercache optimisation? Well what brought this up for me was old user space code using an initial ramdisk. The actual failure that I saw occurred on the read path. And fixing init_page_buffers was the real world fix. At the moment I'm messing with it because it has become the itch I've decided to scratch. So at the moment I'm having fun, learning the block layer, refreshing my VM knowledge and getting my head around this wreck that we call buffer_heads. The high level concept of buffer_heads may be sane but the implementation seems to export a lot of nasty state. At this point my concern is what makes a clean code change in the kernel. Because user space can currently play with buffer_heads by way of the block device and cause lots of havoc (see the recent resierfs bug in this thread) that is why I increasingly think metadata buffer_heads should not share storage with the block device page cache. If that change is made then it happens that the current ramdisk would not need to worry about buffer heads and all of that nastiness and could just lock pages in the page cache. It would not be quite as good for testing filesystems but retaining the existing characteristics would be simple. After having looked a bit deeper the buffer_heads and the block devices don't look as intricately tied up as I had first thought. We still have the nasty case of: if (buffer_new(bh)) unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); That I don't know how it got merged. But otherwise the caches are fully separate. So currently it looks to me like there are two big things that will clean up that part of the code a lot: - moving the metadata buffer_heads to a magic filesystem inode. 
- Using a simpler, non-buffer_head-returning version of get_block
  so we can make simple generic code for generating BIOs.

>> I guess that is where I look next.
>>
>> Oh for what it is worth I took a quick look at fsblock and I don't think
>> struct fsblock makes much sense as a block mapping translation layer for
>> the data path where the current page caches works well.
>
> Except the page cache doesn't store any such translation. fsblock
> works very nicely as a buffer_head and nobh-mode replacement there,
> but we could probably go one better (for many filesystems) by using
> a tree.
>
>> For less
>> then the cost of 1 fsblock I can cache all of the translations for a
>> 1K filesystem on a 4K page.
>
> I'm not exactly sure what you mean, unless you're talking about an
> extent based data structure. fsblock is fairly slim as far as a
> generic solution goes. You could probably save 4 or 8 bytes in it,
> but I don't know if it's worth the trouble.

As a metadata cache manager, perhaps. For a translation cache we need
at most 8 bytes per block. All we need for a generic translation cache
(assuming we still want one) is an array of sector_t per page. So what
we would want is roughly:

	int blkbits_per_page = PAGE_CACHE_SHIFT - inode->i_blkbits;
	if (blkbits_per_page < 0)
		blkbits_per_page = 0;
	sector_t *blocks = kmalloc(sizeof(sector_t) << blkbits_per_page,
				   GFP_KERNEL);

And to remember if we have stored the translation:

	#define UNMAPPED_SECTOR	((sector_t)-1)
	...

The core of all of this being something like:

	#define MAX_BLOCKS_PER_PAGE	(1 << (PAGE_CACHE_SHIFT - 9))

	typedef int (page_blocks_t)(struct page *page,
				    sector_t blocks[MAX_BLOCKS_PER_PAGE],
				    int create);

>> I haven't looked to see if fsblock makes sense to use as a buffer head
>> replacement yet.
>
> Well I don't want to get too far off topic on the subject of fsblock,
> and block mappings (because I think the ramdisk driver simply should
> have nothing to do with any of that)...

Which I can agree with.

> but fsblock is exactly a buffer head replacement (so if it doesn't
> make sense, then I've screwed something up badly! ;))

By definition!

Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 77+ messages in thread
* Re: [patch][rfc] rewrite ramdisk 2007-10-17 18:45 ` Eric W. Biederman @ 2007-10-18 1:06 ` Nick Piggin 0 siblings, 0 replies; 77+ messages in thread From: Nick Piggin @ 2007-10-18 1:06 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Christian Borntraeger, linux-mm, linux-kernel, Martin Schwidefsky, Theodore Ts'o On Thursday 18 October 2007 04:45, Eric W. Biederman wrote: > At this point my concern is what makes a clean code change in the > kernel. Because user space can currently play with buffer_heads > by way of the block device and cause lots of havoc (see the recent Well if userspace is writing to the filesystem metadata via the blockdevice while it is running... that's the definition of havoc, isn't it? ;) Whether or not the writes are going via a unified metadata/blockdev cache or separate ones. You really just have to not do that. The actual reiserfs problem being seen is not because of userspace going silly, but because ramdisk is hijacking the dirty bits. > If that change is made then it happens that the current ramdisk > would not need to worry about buffer heads and all of that > nastiness and could just lock pages in the page cache. It would not > be quite as good for testing filesystems but retaining the existing > characteristics would be simple. No, it wouldn't. Because if you're proposing to split up the buffer cache and the metadata cache, then you're back to a 2 cache solution which is basically has the memory characteristics of my proposal while still being horribly incestuous with the pagecache. > After having looked a bit deeper the buffer_heads and the block > devices don't look as intricately tied up as I had first thought. > We still have the nasty case of: > if (buffer_new(bh)) > unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr); > That I don't know how it got merged. But otherwise the caches > are fully separate. Well its needed because some filesystems forget about their old metadata. It's not really there to solve aliasing with the blockdev pagecache. > So currently it looks to me like there are two big things that will > clean up that part of the code a lot: > - moving the metadata buffer_heads to a magic filesystem inode. > - Using a simpler non-buffer_head returning version of get_block > so we can make simple generic code for generating BIOs. Although this is going off the track of the ramdisk problem. For that we should just do the rewrite. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 77+ messages in thread