* [PATCH 1/5]: ufs: missed brelse and wrong baseblk
@ 2006-06-17 10:14 Evgeniy Dushistov
2006-06-18 16:20 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Evgeniy Dushistov @ 2006-06-17 10:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel
This patch fixes two bugs introduced by previous
patches:
1) a missed "brelse" (sb_getblk takes a reference to the buffer_head,
which must be dropped);
2) "baseblk" may sometimes be wrongly calculated if
i_size is equal to zero, which leads to an infinite cycle
in "mpage_writepages".
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
---
Index: linux-2.6.17-rc6-mm2/fs/ufs/inode.c
===================================================================
--- linux-2.6.17-rc6-mm2.orig/fs/ufs/inode.c
+++ linux-2.6.17-rc6-mm2/fs/ufs/inode.c
@@ -175,6 +175,7 @@ ufs_clear_frags(struct inode *inode, sec
 	for (++beg; beg < end; ++beg) {
 		bh = sb_getblk(inode->i_sb, beg);
 		ufs_clear_frag(inode, bh);
+		brelse(bh);
 	}
 	return res;
 }
Index: linux-2.6.17-rc6-mm2/fs/ufs/balloc.c
===================================================================
--- linux-2.6.17-rc6-mm2.orig/fs/ufs/balloc.c
+++ linux-2.6.17-rc6-mm2/fs/ufs/balloc.c
@@ -269,20 +269,17 @@ out:
  * We can come here from ufs_writepage or ufs_prepare_write,
  * locked_page is argument of these functions, so we already lock it.
  */
-static void ufs_change_blocknr(struct inode *inode, unsigned int count,
-			       unsigned int oldb, unsigned int newb,
-			       struct page *locked_page)
+static void ufs_change_blocknr(struct inode *inode, unsigned int baseblk,
+			       unsigned int count, unsigned int oldb,
+			       unsigned int newb, struct page *locked_page)
 {
 	unsigned int blk_per_page = 1 << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	sector_t baseblk;
 	struct address_space *mapping = inode->i_mapping;
 	pgoff_t index, cur_index = locked_page->index;
 	unsigned int i, j;
 	struct page *page;
 	struct buffer_head *head, *bh;
-	baseblk = ((i_size_read(inode) - 1) >> inode->i_blkbits) + 1 - count;
-
 	UFSD("ENTER, ino %lu, count %u, oldb %u, newb %u\n",
 	     inode->i_ino, count, oldb, newb);
@@ -439,7 +436,8 @@ unsigned ufs_new_fragments(struct inode
 	}
 	result = ufs_alloc_fragments (inode, cgno, goal, request, err);
 	if (result) {
-		ufs_change_blocknr(inode, oldcount, tmp, result, locked_page);
+		ufs_change_blocknr(inode, fragment - oldcount, oldcount, tmp,
+				   result, locked_page);
 		*p = cpu_to_fs32(sb, result);
 		*err = 0;
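
For illustration, a minimal userspace sketch of the arithmetic the removed
baseblk line performed in the i_size == 0 case (types chosen to mirror
loff_t and sector_t; gcc's arithmetic right shift assumed, as in the kernel;
this is not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	long long i_size = 0;		/* the problematic case from the changelog */
	unsigned int blkbits = 10;	/* 1Kb fragments */
	unsigned int count = 8;		/* fragments being relocated */

	/* (i_size - 1) >> blkbits is -1, so the whole expression is -count */
	uint64_t baseblk = ((i_size - 1) >> blkbits) + 1 - count;

	/* prints 18446744073709551608, i.e. (uint64_t)-8: a nonsense base
	 * block, consistent with the infinite cycle in mpage_writepages
	 * mentioned above */
	printf("%llu\n", (unsigned long long)baseblk);
	return 0;
}

Passing "fragment - oldcount" from the caller sidesteps i_size entirely.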
--
/Evgeniy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-17 10:14 [PATCH 1/5]: ufs: missed brelse and wrong baseblk Evgeniy Dushistov
@ 2006-06-18 16:20 ` Al Viro
2006-06-18 17:50 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2006-06-18 16:20 UTC (permalink / raw)
To: Andrew Morton, linux-kernel, linux-fsdevel
Comment more on the entire series than on this patch: scenario that causes
trouble
* foo is a sparse file on ufs with 8Kb block/1Kb fragment
* process opens it writable and mmaps it shared
* it proceeds to dirty the pages
* fork
* parent and child do msync() on pages next to each other
I.e. we try to write adjacent pages that sit in the same block. At the
same time. Each will trigger an allocation and we'd better be very
careful, or we'll end up allocating the same block twice.
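A hedged userspace sketch of that scenario (the mount point, file name and
sizes are assumptions for illustration; error checking omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/ufs/foo", O_RDWR | O_CREAT, 0644);
	char *p;

	ftruncate(fd, 2 * 1024 * 1024);		/* sparse file, nothing allocated yet */
	p = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	memset(p, 'a', 4096);			/* dirty page 0 */
	memset(p + 4096, 'b', 4096);		/* dirty page 1: same 8Kb block */

	if (fork() == 0) {
		msync(p, 4096, MS_SYNC);	/* child writes out one page...  */
		_exit(0);
	}
	msync(p + 4096, 4096, MS_SYNC);		/* ...parent the adjacent one */
	wait(NULL);
	return 0;
}

Both msync() calls force writeback of pages that fall inside the same
not-yet-allocated block, which is exactly the situation described above.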
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-18 16:20 ` Al Viro
@ 2006-06-18 17:50 ` Al Viro
2006-06-19 6:47 ` Evgeniy Dushistov
0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2006-06-18 17:50 UTC (permalink / raw)
To: Andrew Morton, linux-kernel, linux-fsdevel
On Sun, Jun 18, 2006 at 05:20:54PM +0100, Al Viro wrote:
> Comment more on the entire series than on this patch: scenario that causes
> trouble
> * foo is a sparse file on ufs with 8Kb block/1Kb fragment
> * process opens it writable and mmaps it shared
> * it proceeds to dirty the pages
> * fork
> * parent and child do msync() on pages next to each other
>
> I.e. we try to write adjacent pages that sit in the same block. At the
> same time. Each will trigger an allocation and we'd better be very
> careful, or we'll end up allocating the same block twice.
FWIW, for folks who are not familiar with UFS: the important differences
between it and ext2 are
* directory chunk size is fixed to 512, instead of being fs parameter
as in ext2
* names in directory entries are NUL-terminated
* there are two allocation units: fragment and block. Each block
consists of 2^{parameter} fragments. ext2 is what you get when parameter
is 0 (block == fragment). UFS tends to use block:fragment == 8, but
1, 2 and 4 are also allowed.
* equivalent of ext2 free block bitmap has bit per fragment.
* "block" in "direct block", "indirect block", etc. is actually
a group of fragments. The number of first fragment in group is stored
where ext2 would store the block number.
* if there are indirect blocks, all those groups are simply full
blocks; they are aligned to block boundary and consist of block:fragment
ratio fragments.
* if file is shorter than 12 * block size, we have no indirects and
all but the last direct one are full blocks. I.e. the numbers we have
there are multiples of the block:fragment ratio and a full block is allocated
for each. The last one consists of just enough fragments to reach the
end of file and may be not aligned.
IOW, it's _almost_ as if we had ext2 with all block numbers (in inode and
in indirect blocks) multiplied by block:fragment ratio. The only exception
is for the last direct block of small files - these span fewer fragments
and may be unaligned.
The only subtle part is when we extend a small file; the last direct block
needs to be expanded and that may require relocation.
* block may be bigger than page. That can cause all sorts of fun
problems in interaction with our VM, since allocation can affect more than
one page and that has to be taken into account.
* UFS2 supports ext.attributes; it has two fragment numbers in
inode; they refer to up to 2 blocks worth of data. As with the data
of small files, the partial block doesn't have to be aligned.
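A small userspace sketch of the small-file layout rule described above
(the 1Kb fragment / 8:1 ratio values are just the example geometry; it
ignores holes and is not UFS code):

#include <stdio.h>

#define FRAG_SIZE 1024ULL
#define RATIO	  8ULL		/* fragments per block */

/* fragments spanned by direct slot i of a (non-sparse) file of 'size' bytes */
static unsigned long long frags_in_direct(unsigned long long i,
					  unsigned long long size)
{
	unsigned long long total = (size + FRAG_SIZE - 1) / FRAG_SIZE;

	if ((i + 1) * RATIO <= total)
		return RATIO;		/* full, block-aligned group */
	if (i * RATIO >= total)
		return 0;		/* past EOF: slot unused */
	return total - i * RATIO;	/* short, possibly unaligned tail */
}

int main(void)
{
	/* a 20000-byte file: slots 0 and 1 are full 8-fragment blocks,
	 * slot 2 spans only 4 fragments */
	for (unsigned long long i = 0; i < 4; i++)
		printf("slot %llu: %llu fragments\n", i, frags_in_direct(i, 20000));
	return 0;
}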
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-18 17:50 ` Al Viro
@ 2006-06-19 6:47 ` Evgeniy Dushistov
2006-06-19 7:32 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Evgeniy Dushistov @ 2006-06-19 6:47 UTC (permalink / raw)
To: Al Viro; +Cc: Andrew Morton, linux-kernel, linux-fsdevel
On Sun, Jun 18, 2006 at 06:50:45PM +0100, Al Viro wrote:
> * block may be bigger than page. That can cause all sorts of fun
> problems in interaction with our VM, since allocation can affect more than
> one page and that has to be taken into account.
In fact this is not a problem. Blocks in terms of the Linux VFS
are fragments in terms of UFS.
And if the fragment size is >4096 we just don't mount such a file system.
So we can easily support 32K blocks.
--
/Evgeniy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 6:47 ` Evgeniy Dushistov
@ 2006-06-19 7:32 ` Al Viro
2006-06-19 13:17 ` Evgeniy Dushistov
0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2006-06-19 7:32 UTC (permalink / raw)
To: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 10:47:21AM +0400, Evgeniy Dushistov wrote:
> On Sun, Jun 18, 2006 at 06:50:45PM +0100, Al Viro wrote:
> > * block may be bigger than page. That can cause all sorts of fun
> > problems in interaction with our VM, since allocation can affect more than
> > one page and that has to be taken into account.
>
> In fact this is not a problem. Blocks in terms of the Linux VFS
> are fragments in terms of UFS.
And?
> And if the fragment size is >4096 we just don't mount such a file system.
>
> So we can easily support 32K blocks.
... Which means that we have allocation unit (aka UFS block) covering
several in-core pages. Which means that if you have a file with 8Kb
hole in the beginning, mmap it shared and dirty the first couple of
pages, you will get the situation when parallel writeout on those two
pages will cause trouble. Both times you'll allocate full block (file
is couple of megabytes long, so forget about partial blocks, relocations,
etc.) And both will try to put the reference to what they'd allocated
into the same inode as the first direct block. Do you see the problem?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 7:32 ` Al Viro
@ 2006-06-19 13:17 ` Evgeniy Dushistov
2006-06-19 18:28 ` Al Viro
0 siblings, 1 reply; 11+ messages in thread
From: Evgeniy Dushistov @ 2006-06-19 13:17 UTC (permalink / raw)
To: Al Viro; +Cc: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 08:32:29AM +0100, Al Viro wrote:
> On Mon, Jun 19, 2006 at 10:47:21AM +0400, Evgeniy Dushistov wrote:
> > On Sun, Jun 18, 2006 at 06:50:45PM +0100, Al Viro wrote:
> > > * block may be bigger than page. That can cause all sorts of fun
> > > problems in interaction with our VM, since allocation can affect more than
> > > one page and that has to be taken into account.
> >
> > In fact this is not a problem. Blocks in terms of the Linux VFS
> > are fragments in terms of UFS.
>
> And?
>
> > And if the fragment size is >4096 we just don't mount such a file system.
> >
> > So we can easily support 32K blocks.
>
> .... Which means that we have allocation unit (aka UFS block) covering
> several in-core pages. Which means that if you have a file with 8Kb
> hole in the beginning, mmap it shared and dirty the first couple of
> pages, you will get the situation when parallel writeout on those two
> pages will cause trouble. Both times you'll allocate full block (file
> is couple of megabytes long, so forget about partial blocks, relocations,
> etc.) And both will try to put the reference to what they'd allocated
> into the same inode as the first direct block. Do you see the problem?
No, I don't see it; can you explain in detail how this affects the current
implementation?
In the case of 1k fragments, an msync of two pages
causes 8 calls of ufs's get_block_t with create == 1;
they will run one after another because of synchronization.
Only the first one allocates a block;
all the others check whether the reference to the block is non-zero, and just return
the appropriate value. See for example ufs_inode_getblock:
	p = (__fs32 *) bh->b_data + block;
	tmp = fs32_to_cpu(sb, *p);
	if (tmp) {
		*phys = tmp + blockoff;
		goto out;
and that's all, nothing will be allocated.
So is this abstract theory about some abstract implementation of UFS,
or do you see some problems in the code?
--
/Evgeniy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 13:17 ` Evgeniy Dushistov
@ 2006-06-19 18:28 ` Al Viro
2006-06-19 18:58 ` Evgeniy Dushistov
2006-06-20 16:30 ` Evgeniy Dushistov
0 siblings, 2 replies; 11+ messages in thread
From: Al Viro @ 2006-06-19 18:28 UTC (permalink / raw)
To: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 05:17:50PM +0400, Evgeniy Dushistov wrote:
> In the case of 1k fragments, an msync of two pages
> causes 8 calls of ufs's get_block_t with create == 1;
> they will run one after another because of synchronization.
_What_ synchronization?
To simplify the analysis, have one of those do msync() and another - write().
One triggers writeback, leading to ufs_writepage(). Another leads to call
of ufs_prepare_write(). Note that the latter call is process-synchronous,
so no implicit serialization could apply.
Now, which lock would, in your opinion, provide serialization between these
two calls? They apply to different pages, so page locks do not help.
And yes, we do need some serialization between the get_block_t calls here, indeed.
As it is, we don't have any.
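
For readers unfamiliar with these paths, a sketch of the two entry points
(the wrappers follow the standard fs/buffer.c pattern; reconstructed here
for illustration, not quoted verbatim from fs/ufs/inode.c):

static int ufs_writepage(struct page *page, struct writeback_control *wbc)
{
	/* writeback path: only this page is locked */
	return block_write_full_page(page, ufs_getfrag_block, wbc);
}

static int ufs_prepare_write(struct file *file, struct page *page,
			     unsigned from, unsigned to)
{
	/* write(2) path: a different page is locked, so the two
	 * create == 1 calls into ufs_getfrag_block can run in parallel */
	return block_prepare_write(page, from, to, ufs_getfrag_block);
}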
Note that there is another problem with use of fs/buffer.c helpers. They
assume that there are 3 states of buffer_head corresponding to fragment:
* mapped to known disk block
* known to be in hole
* not known
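
A sketch of how those three states map onto the buffer_head flags the
fs/buffer.c helpers look at (my reading of the 2.6-era semantics, not a
quote of the helpers themselves):

#include <linux/buffer_head.h>

enum frag_state { FRAG_MAPPED, FRAG_HOLE, FRAG_UNKNOWN };

static enum frag_state frag_state(struct buffer_head *bh)
{
	if (buffer_mapped(bh))
		return FRAG_MAPPED;	/* b_blocknr points at a real fragment */
	if (buffer_uptodate(bh))
		return FRAG_HOLE;	/* zero-filled, e.g. by block_read_full_page() */
	return FRAG_UNKNOWN;		/* get_block() has not been consulted yet */
}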
And that's where we have a problem. block_read_full_page() on page 0
will get all buffer_head on that page (one for each fragment) to
the second state. block_prepare_write() will get all buffer_head on
page 1 to the first state after the callback allocates the first direct block
of file (they will be mapped to the fragments in that block, at offsets 4Kb
and further).
But now we have the buffer_heads on page 0 in the state inconsistent with
the reality - basically, fs/buffer.c helpers will assume that they are
_still_ in the second state (known to be in hole), while in the reality
they should be either in the first or in the third one (mapped to known
disk block or not known).
It's not a fundamental problem; however, it does mean that using these
helpers means using library functions in situation they'd never been designed
for. IOW, you need very careful analysis of the assumptions made by
the entire bunch and, quite possibly, need versions modified for UFS.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 18:28 ` Al Viro
@ 2006-06-19 18:58 ` Evgeniy Dushistov
2006-06-19 19:13 ` Al Viro
2006-06-20 16:30 ` Evgeniy Dushistov
1 sibling, 1 reply; 11+ messages in thread
From: Evgeniy Dushistov @ 2006-06-19 18:58 UTC (permalink / raw)
To: Al Viro; +Cc: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 07:28:33PM +0100, Al Viro wrote:
> On Mon, Jun 19, 2006 at 05:17:50PM +0400, Evgeniy Dushistov wrote:
> > In the case of 1k fragments, an msync of two pages
> > causes 8 calls of ufs's get_block_t with create == 1;
> > they will run one after another because of synchronization.
>
> _What_ synchronization?
> Now, which lock would, in your opinion, provide serialization between these
> two calls? They apply to different pages, so page locks do not help.
>
you can look at fs/ufs/inode.c: ufs_getfrag_block.
It is ufs's get_block_t,
if create == 1 it uses "[un]lock_kernel".
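Schematically (only the ufs_getfrag_block/lock_kernel part is from the
statement above; the helper names are placeholders, not fs/ufs/inode.c code):

static int ufs_getfrag_block(struct inode *inode, sector_t fragment,
			     struct buffer_head *bh_result, int create)
{
	int err;

	if (!create)
		/* lookup-only path: map the fragment if it exists, no BKL */
		return ufs_map_existing_fragment(inode, fragment, bh_result);

	lock_kernel();
	/* allocation path: walk direct/indirect chains, allocating
	 * fragments as needed, all under the BKL */
	err = ufs_walk_and_allocate(inode, fragment, bh_result);
	unlock_kernel();
	return err;
}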
> To simplify the analysis, have one of those do msync() and another - write().
> One triggers writeback, leading to ufs_writepage(). Another leads to call
> of ufs_prepare_write(). Note that the latter call is process-synchronous,
> so no implicit serialization could apply.
skipped
I'll think about this.
--
/Evgeniy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 18:58 ` Evgeniy Dushistov
@ 2006-06-19 19:13 ` Al Viro
2006-06-20 16:30 ` Evgeniy Dushistov
0 siblings, 1 reply; 11+ messages in thread
From: Al Viro @ 2006-06-19 19:13 UTC (permalink / raw)
To: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 10:58:16PM +0400, Evgeniy Dushistov wrote:
> On Mon, Jun 19, 2006 at 07:28:33PM +0100, Al Viro wrote:
> > On Mon, Jun 19, 2006 at 05:17:50PM +0400, Evgeniy Dushistov wrote:
> > > In the case of 1k fragments, an msync of two pages
> > > causes 8 calls of ufs's get_block_t with create == 1;
> > > they will run one after another because of synchronization.
> >
> > _What_ synchronization?
> > Now, which lock would, in your opinion, provide serialization between these
> > two calls? They apply to different pages, so page locks do not help.
> >
> you can look at fs/ufs/inode.c: ufs_getfrag_block.
> It is ufs's get_block_t,
> if create == 1 it uses "[un]lock_kernel".
Which is fsck-all protection, since then you proceed to do a lot of
blocking operations. Now, lock_super() down in balloc.c _might_ be
enough, but I wouldn't bet on that.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 19:13 ` Al Viro
@ 2006-06-20 16:30 ` Evgeniy Dushistov
0 siblings, 0 replies; 11+ messages in thread
From: Evgeniy Dushistov @ 2006-06-20 16:30 UTC (permalink / raw)
To: Al Viro; +Cc: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 08:13:06PM +0100, Al Viro wrote:
> Which is fsck-all protection, since then you proceed to do a lot of
> blocking operations. Now, lock_super() down in balloc.c _might_ be
> enough, but I wouldn't bet on that.
There is still a lack of a proper locking model for the inode's metadata;
for example, we don't lock/unlock the buffer_head when checking whether
we've already allocated a block or not,
so lock_kernel is still necessary.
--
/Evgeniy
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/5]: ufs: missed brelse and wrong baseblk
2006-06-19 18:28 ` Al Viro
2006-06-19 18:58 ` Evgeniy Dushistov
@ 2006-06-20 16:30 ` Evgeniy Dushistov
1 sibling, 0 replies; 11+ messages in thread
From: Evgeniy Dushistov @ 2006-06-20 16:30 UTC (permalink / raw)
To: Al Viro; +Cc: Andrew Morton, linux-kernel, linux-fsdevel
On Mon, Jun 19, 2006 at 07:28:33PM +0100, Al Viro wrote:
> But now we have the buffer_heads on page 0 in the state inconsistent with
> the reality - basically, fs/buffer.c helpers will assume that they are
> _still_ in the second state (known to be in hole), while in the reality
> they should be either in the first or in the third one (mapped to known
> disk block or not known).
>
> It's not a fundamental problem;
And if after that we write to the 0th page
data of size <= page size, we can get garbage (not zeroes)
on the rest of the page.
Definitely, after block allocation,
we should touch the pages from the inode's page cache
which belong to the block, except the current page.
>however, it does mean that using these
> helpers means using library functions in situation they'd never been designed
> for. IOW, you need very careful analysis of the assumptions made by
> the entire bunch and, quite possibly, need versions modified for UFS.
Maybe there is some misunderstanding here:
this series and all my other patches in -mm related to UFS
do not introduce write support for UFS; they fix
bugs, similar to the ones you point out, in dark corners
of the existing implementation.
Note: almost all such bugs are related to
touching the blockdev's cache instead of the inode's cache, and to working
with the blockdev's buffer cache without taking into consideration
that it is also a page cache.
--
/Evgeniy
^ permalink raw reply [flat|nested] 11+ messages in thread