* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" [not found] ` <20150604050123.GL7232@ZenIV.linux.org.uk> @ 2015-06-04 22:22 ` Al Viro [not found] ` <1122467636.634568.1433521621076.open-xchange@webmail.nmp.proximus.be> 1 sibling, 0 replies; 7+ messages in thread From: Al Viro @ 2015-06-04 22:22 UTC (permalink / raw) To: Andrew Morton Cc: Fabian Frederick, Ian Campbell, Evgeniy Dushistov, Alexey Khoroshilov, Roger Pau Monne, Ian Jackson, xen-devel, Jan Kara, linux-fsdevel On Thu, Jun 04, 2015 at 06:01:23AM +0100, Al Viro wrote: > So we need > * per-page exclusion for reallocation time (normal page locks are > doing that) > * per-fs exclusion for block and fragment allocations (->s_lock?) > * per-fs exclusion for inode allocations (->s_lock?) > * per-inode exclusion for mapping changes (a-la ext2 truncate_mutex) > * per-directory exclusion for contents access (->i_mutex gives that) > > Looks like we ought to add ->truncate_mutex and shove lock_ufs() calls > all way down into balloc.c (and ialloc.c for inode allocations)... After looking through that code, that appears to be feasible (and would simplify the hell out of truncate side of things). However, there's a problem with the way we do ->write_begin() and ->writepage() there - it's quite suboptimal for short files. Suppose we start with an empty file on e.g. a filesystem with 1Kb fragments and 4Kb blocks. We grab page 0 and call ->write_begin() on it, offset range 0 to 4095. OK, that'll be block_write_begin(), with ufs_getfrag_block as callback. And it'll proceed to call it 4 times, one for each kilobyte. In ascending order. If we are lucky, it'll allocate the first fragment of an otherwise empty block and then extend it 3 times until we get the full block. If we are *not* that lucky, and the first fragment is grabbed in already partially used block, we'll end up doing reallocations (and copying, while we are at it, hopefully without bogus uptodate flags set anywhere in process). The same if somebody else grabs a fragment in the middle of block we are growing into. A part of that is the lack of prealloc in fs/ufs; that would certainly improve things, but I really wonder if we are doing the handling of partial blocks in the wrong place. As it is, that happens fairly deep in call chain - ufs_write_begin() -> block_write_begin() -> __block_write_begin() -> ufs_getfrag_block() -> ufs_inode_getfrag() -> ufs_new_fragments() -> ufs_change_blocknr()) and we don't notice the partial blocks until we get to ufs_new_fragments(); in reality, we can tell if there's a partial block from the very beginning, just by looking at inode size and checking if the direct pointer in question is not zero. Do we really need it done that deep in chain? Note that there's another long-standing problem - we map the things fragment-by-fragment, which is seriously redundant; within the same logical block we have either * hole - no fragments present, or * partial block at the end of short file - known number of adjacent fragments present, or * full block - all fragments are present and adjacent Trying to map fragments one by one is completely pointless. Which makes the use fs/buffer.c helpers dubious. Besides, we pay for doing it that deep by having to pass the page all way down into block allocator. AFAICS, writepage cares about the partial blocks only when EOF is right inside the page - pages past EOF shouldn't be faulted in and truncate_setsize() should've killed everything off before lowering ->i_size. And since truncate and ->write_begin are serialized on ->i_mutex, we are down to * write_iter starting past the EOF should start with dummy extending truncate (and truncate back to original size if nothing actually gets written) * extending truncate() of a file with partial block should grab the page containing EOF and expand it (with possible reallocation). * truncate() shrinking a file to something with partial block should do nothing special, other than releasing only part of fragments for that block. * write_begin and writepage should check if they are dealing with a page covering a partial block and start with extending it - they know how far to go and they are serialized by page lock. Does anybody see holes in that? This way we can handle the partial blocks higher in the call chain, with a lot less PITA... And with that done, we can have ufs_block_getfrag() just check if the previous bh in the page falls into the same block and is already mapped and map ours to the next fragment if that's the case - allocation would either happen for entire block or page, whichever is smaller, or be already done by caller (UFS blocks can be bigger than a page). ^ permalink raw reply [flat|nested] 7+ messages in thread
[parent not found: <1122467636.634568.1433521621076.open-xchange@webmail.nmp.proximus.be>]
[parent not found: <20150605185018.GX7232@ZenIV.linux.org.uk>]
* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" [not found] ` <20150605185018.GX7232@ZenIV.linux.org.uk> @ 2015-06-05 22:03 ` Al Viro 2015-06-17 8:57 ` Jan Kara 0 siblings, 1 reply; 7+ messages in thread From: Al Viro @ 2015-06-05 22:03 UTC (permalink / raw) To: Fabian Frederick Cc: Andrew Morton, Alexey Khoroshilov, Ian Campbell, Roger Pau Monne, Ian Jackson, Jan Kara, xen-devel, Evgeniy Dushistov, linux-fsdevel On Fri, Jun 05, 2015 at 07:50:18PM +0100, Al Viro wrote: > Basically, we have > i_mutex: file size changes, contents-affecting syscalls. Per-inode. > truncate_mutex: block pointers changes. Per-inode. > s_lock: block and inode bitmaps changes. Per-filesystem. > > For UFS it's slightly more complicated due to tail packing they are doing for > short files, but most of that complexity is due to having that stuff handled > way too deep in call chain. Oh, lovely... commit 10e5dc Author: Evgeniy Dushistov <dushistov@mail.ru> Date: Sat Jul 1 04:36:24 2006 -0700 [PATCH] ufs: truncate should allocate block for last byte had removed ->truncate() method and missed the fact that vmtrucate() had callers outside of ->setattr(), such as handling of ->prepare_write() partial failures and short copies on write(2) in general. Then we had a long and convoluted series of conversions that ended up with vmtruncate() lifted into ufs_write_failed() and replaced with truncate_pagecache() in there. Through all that, everybody (me included) had not noticed that we *still* do not free blocks allocated by ufs_write_begin() failing halfway through. While we are at it, ufs_write_end() ought to call ufs_write_failed() in case when we'd been called after a short copy (and do the same block freeing). Joy... Folks, is anybody actively maintaining fs/ufs these days? ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" 2015-06-05 22:03 ` Al Viro @ 2015-06-17 8:57 ` Jan Kara 2015-06-17 20:31 ` Al Viro 0 siblings, 1 reply; 7+ messages in thread From: Jan Kara @ 2015-06-17 8:57 UTC (permalink / raw) To: Al Viro Cc: Fabian Frederick, Andrew Morton, Alexey Khoroshilov, Ian Campbell, Roger Pau Monne, Ian Jackson, Jan Kara, xen-devel, Evgeniy Dushistov, linux-fsdevel On Fri 05-06-15 23:03:48, Al Viro wrote: > On Fri, Jun 05, 2015 at 07:50:18PM +0100, Al Viro wrote: > > Basically, we have > > i_mutex: file size changes, contents-affecting syscalls. Per-inode. > > truncate_mutex: block pointers changes. Per-inode. > > s_lock: block and inode bitmaps changes. Per-filesystem. > > > > For UFS it's slightly more complicated due to tail packing they are doing for > > short files, but most of that complexity is due to having that stuff handled > > way too deep in call chain. > > Oh, lovely... commit 10e5dc > Author: Evgeniy Dushistov <dushistov@mail.ru> > Date: Sat Jul 1 04:36:24 2006 -0700 > > [PATCH] ufs: truncate should allocate block for last byte > > had removed ->truncate() method and missed the fact that vmtrucate() had > callers outside of ->setattr(), such as handling of ->prepare_write() partial > failures and short copies on write(2) in general. > > Then we had a long and convoluted series of conversions that ended up with > vmtruncate() lifted into ufs_write_failed() and replaced with > truncate_pagecache() in there. > > Through all that, everybody (me included) had not noticed that we *still* > do not free blocks allocated by ufs_write_begin() failing halfway through. > While we are at it, ufs_write_end() ought to call ufs_write_failed() in > case when we'd been called after a short copy (and do the same block freeing). > > Joy... Folks, is anybody actively maintaining fs/ufs these days? Looking into the changelog there wasn't anyone seriously looking into UFS for at least 5-6 years... Fabian did some cleanups recently but they were mostly cosmetic. So I don't think it's really maintained. OTOH clearly there are some users since occasionally someone comes with a bug report. Welcome to the world where we have ~50 on disk / network filesystems :-| Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" 2015-06-17 8:57 ` Jan Kara @ 2015-06-17 20:31 ` Al Viro 2015-06-19 23:07 ` Al Viro 0 siblings, 1 reply; 7+ messages in thread From: Al Viro @ 2015-06-17 20:31 UTC (permalink / raw) To: Jan Kara Cc: Fabian Frederick, Andrew Morton, Alexey Khoroshilov, Ian Campbell, Roger Pau Monne, Ian Jackson, xen-devel, Evgeniy Dushistov, linux-fsdevel On Wed, Jun 17, 2015 at 10:57:15AM +0200, Jan Kara wrote: > > Joy... Folks, is anybody actively maintaining fs/ufs these days? > > Looking into the changelog there wasn't anyone seriously looking into UFS > for at least 5-6 years... Fabian did some cleanups recently but they were > mostly cosmetic. So I don't think it's really maintained. OTOH clearly > there are some users since occasionally someone comes with a bug report. > Welcome to the world where we have ~50 on disk / network filesystems :-| FWIW, I've a partial rewrite of that area (completely untested, etc.) in vfs.git#ufs; it's still in progress, but hopefully it'll be done by tomorrow. Basically, it switches the sucker to locking scheme similar to ext2 (modulo seqlock instead of rwlock for protection of block pointers, having to be more careful due to 64bit assignments not being atomic and being unable to fall back to ->truncate_mutex in case of race on the read side) and starts cleaning the hell out of the truncate (and eventually create side of get_block as well). As far as I can see, the root of the problems with that sucker is the misplaced handling of tail-packing logics. That had ballooned into a lot of convoluted codepaths ;-/ I'm not blaming Evgeniy - it *is* painful to massage into saner shape and the damn thing kept missing cleanups that were done to similar filesystems due to that... One thing I'm somewhat unhappy about in what Evgeniy had done is dealing with UFS vs. UFS2 differences at such a low level. Sure, the way we did endianness handling in there provoked exactly that, but it might have been better to go the other way round; see e.g. fs/minix/itree*.c for opposite approach. All this stuff is currently completely untested; I'll be using generic parts of xfstests for beating it up, but any assistance would be welcome. Note: all testing I'll be realistically able to do will be for FreeBSD UFS variants - I don't have Solaris left on any boxen, to start with... ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" 2015-06-17 20:31 ` Al Viro @ 2015-06-19 23:07 ` Al Viro 2015-06-23 16:46 ` Jan Kara 0 siblings, 1 reply; 7+ messages in thread From: Al Viro @ 2015-06-19 23:07 UTC (permalink / raw) To: Jan Kara Cc: Fabian Frederick, Andrew Morton, Alexey Khoroshilov, Ian Campbell, Roger Pau Monne, Ian Jackson, xen-devel, Evgeniy Dushistov, linux-fsdevel On Wed, Jun 17, 2015 at 09:31:16PM +0100, Al Viro wrote: > On Wed, Jun 17, 2015 at 10:57:15AM +0200, Jan Kara wrote: > > > Joy... Folks, is anybody actively maintaining fs/ufs these days? > > > > Looking into the changelog there wasn't anyone seriously looking into UFS > > for at least 5-6 years... Fabian did some cleanups recently but they were > > mostly cosmetic. So I don't think it's really maintained. OTOH clearly > > there are some users since occasionally someone comes with a bug report. > > Welcome to the world where we have ~50 on disk / network filesystems :-| > > FWIW, I've a partial rewrite of that area (completely untested, etc.) > in vfs.git#ufs; it's still in progress, but hopefully it'll be done > by tomorrow. Basically, it switches the sucker to locking scheme > similar to ext2 (modulo seqlock instead of rwlock for protection of > block pointers, having to be more careful due to 64bit assignments not > being atomic and being unable to fall back to ->truncate_mutex in > case of race on the read side) and starts cleaning the hell out of > the truncate (and eventually create side of get_block as well). > > As far as I can see, the root of the problems with that sucker is the > misplaced handling of tail-packing logics. That had ballooned into > a lot of convoluted codepaths ;-/ I'm not blaming Evgeniy - it *is* > painful to massage into saner shape and the damn thing kept missing > cleanups that were done to similar filesystems due to that... > > One thing I'm somewhat unhappy about in what Evgeniy had done is dealing with > UFS vs. UFS2 differences at such a low level. Sure, the way we did endianness > handling in there provoked exactly that, but it might have been better to > go the other way round; see e.g. fs/minix/itree*.c for opposite approach. > > All this stuff is currently completely untested; I'll be using generic > parts of xfstests for beating it up, but any assistance would be welcome. > Note: all testing I'll be realistically able to do will be for FreeBSD > UFS variants - I don't have Solaris left on any boxen, to start with... FWIW, I've got to one bloody unpleasant part of UFS we'd never managed to get right and it seems that I have a potential solution. Problem: UFS has larger-than-page allocation units. And when we allocate a block in a hole, we *must* zero the whole thing out. The trouble being, we notice that inside writepage(), where we are already holding a lock on our page and where we can't lock other pages without risking a deadlock. What's more, as soon as we'd inserted a reference to that block into inode, we are open to readpage on another page backed by the same block coming and seeing a normal mapped area. So UFS block allocator ends up zeroing them out via _buffer_ cache, waiting for writes to finish before returning the block to caller. It doesn't wait for metadata blocks (they are accessed via buffer cache), but for data we end up with zeroes being written to disk, no matter what. That works, but it obviously hurts the performance. Badly. Writes into a hole are relatively rare, but the same thing hurts normal sequential writes to the end of file, which is not rare at all. How about the following approach: * Let allocation of data block at the EOF return the sucker without writing zeroes out. * Let ->write_iter() check if it's about to create a hole at the (current) EOF. In such case start with writing zeroes (in normal fashion, via ->begin_write()/memset()/->end_write() through the page cache) from the EOF to block boundary (or beginning of the area we were going to write to, whichever's closer). Then proceed in normal fashion. Incidentally, it will deal with tail unpacking for free. * Do the same upon extending truncate. Does anybody see holes in that? readpage() crossing the EOF isn't a problem - it will not even attempt to map past the current ->i_size and will zero the page tail out just fine. We get junk past the EOF on disk, but we zero it out (or overwrite with our data) before increasing ->i_size. Dealing with sparse files is also possible (taking allocations from ->writepage() to ->page_mkwrite(), where we are not holding a page lock and thus can grab all affected pages), but it would be bigger, trickier and benefit only a relatively rare case. Comments? PS: UFS tail unpacking is broken - in some cases it can be tricked into trying to extend an indirect block, of all things, and that's not the only fishy case in there. I think I've fixed that crap in the local tree... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" 2015-06-19 23:07 ` Al Viro @ 2015-06-23 16:46 ` Jan Kara 2015-06-23 21:56 ` Al Viro 0 siblings, 1 reply; 7+ messages in thread From: Jan Kara @ 2015-06-23 16:46 UTC (permalink / raw) To: Al Viro Cc: Jan Kara, Fabian Frederick, Andrew Morton, Alexey Khoroshilov, Ian Campbell, Roger Pau Monne, Ian Jackson, xen-devel, Evgeniy Dushistov, linux-fsdevel On Sat 20-06-15 00:07:39, Al Viro wrote: > On Wed, Jun 17, 2015 at 09:31:16PM +0100, Al Viro wrote: > > On Wed, Jun 17, 2015 at 10:57:15AM +0200, Jan Kara wrote: > > > > Joy... Folks, is anybody actively maintaining fs/ufs these days? > > > > > > Looking into the changelog there wasn't anyone seriously looking into UFS > > > for at least 5-6 years... Fabian did some cleanups recently but they were > > > mostly cosmetic. So I don't think it's really maintained. OTOH clearly > > > there are some users since occasionally someone comes with a bug report. > > > Welcome to the world where we have ~50 on disk / network filesystems :-| > > > > FWIW, I've a partial rewrite of that area (completely untested, etc.) > > in vfs.git#ufs; it's still in progress, but hopefully it'll be done > > by tomorrow. Basically, it switches the sucker to locking scheme > > similar to ext2 (modulo seqlock instead of rwlock for protection of > > block pointers, having to be more careful due to 64bit assignments not > > being atomic and being unable to fall back to ->truncate_mutex in > > case of race on the read side) and starts cleaning the hell out of > > the truncate (and eventually create side of get_block as well). > > > > As far as I can see, the root of the problems with that sucker is the > > misplaced handling of tail-packing logics. That had ballooned into > > a lot of convoluted codepaths ;-/ I'm not blaming Evgeniy - it *is* > > painful to massage into saner shape and the damn thing kept missing > > cleanups that were done to similar filesystems due to that... > > > > One thing I'm somewhat unhappy about in what Evgeniy had done is dealing with > > UFS vs. UFS2 differences at such a low level. Sure, the way we did endianness > > handling in there provoked exactly that, but it might have been better to > > go the other way round; see e.g. fs/minix/itree*.c for opposite approach. > > > > All this stuff is currently completely untested; I'll be using generic > > parts of xfstests for beating it up, but any assistance would be welcome. > > Note: all testing I'll be realistically able to do will be for FreeBSD > > UFS variants - I don't have Solaris left on any boxen, to start with... > > FWIW, I've got to one bloody unpleasant part of UFS we'd never managed to > get right and it seems that I have a potential solution. > > Problem: UFS has larger-than-page allocation units. And when we allocate > a block in a hole, we *must* zero the whole thing out. The trouble being, > we notice that inside writepage(), where we are already holding a lock on > our page and where we can't lock other pages without risking a deadlock. > > What's more, as soon as we'd inserted a reference to that block into inode, > we are open to readpage on another page backed by the same block coming > and seeing a normal mapped area. So UFS block allocator ends up zeroing > them out via _buffer_ cache, waiting for writes to finish before returning > the block to caller. It doesn't wait for metadata blocks (they are accessed > via buffer cache), but for data we end up with zeroes being written to disk, > no matter what. > > That works, but it obviously hurts the performance. Badly. Writes into > a hole are relatively rare, but the same thing hurts normal sequential > writes to the end of file, which is not rare at all. > > How about the following approach: > * Let allocation of data block at the EOF return the sucker without > writing zeroes out. > * Let ->write_iter() check if it's about to create a hole at the > (current) EOF. In such case start with writing zeroes (in normal fashion, > via ->begin_write()/memset()/->end_write() through the page cache) from the > EOF to block boundary (or beginning of the area we were going to write to, > whichever's closer). Then proceed in normal fashion. Incidentally, it will > deal with tail unpacking for free. > * Do the same upon extending truncate. > > Does anybody see holes in that? readpage() crossing the EOF isn't a problem - > it will not even attempt to map past the current ->i_size and will zero the > page tail out just fine. We get junk past the EOF on disk, but we zero it > out (or overwrite with our data) before increasing ->i_size. > > Dealing with sparse files is also possible (taking allocations from > ->writepage() to ->page_mkwrite(), where we are not holding a page lock > and thus can grab all affected pages), but it would be bigger, trickier > and benefit only a relatively rare case. > > Comments? Looks good to me. BTW also ext4 (with BIGALLOC feature) and OCFS2 can have block allocation unit (called cluster) larger than page size. However the block size of both filesystems is still <= page size. So at least ext4 handles fun with partially initialized clusters by just marking parts of the cluster as uninitialized in the extent tree. But the code is still pretty messy to be honest. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" 2015-06-23 16:46 ` Jan Kara @ 2015-06-23 21:56 ` Al Viro 0 siblings, 0 replies; 7+ messages in thread From: Al Viro @ 2015-06-23 21:56 UTC (permalink / raw) To: Jan Kara Cc: Fabian Frederick, Andrew Morton, Alexey Khoroshilov, Ian Campbell, Roger Pau Monne, Ian Jackson, xen-devel, Evgeniy Dushistov, linux-fsdevel On Tue, Jun 23, 2015 at 06:46:08PM +0200, Jan Kara wrote: > Looks good to me. BTW also ext4 (with BIGALLOC feature) and OCFS2 can have > block allocation unit (called cluster) larger than page size. However the > block size of both filesystems is still <= page size. So at least ext4 > handles fun with partially initialized clusters by just marking parts > of the cluster as uninitialized in the extent tree. But the code is still > pretty messy to be honest. Well, with UFS there's no place on disk to store such "this block is uninitialized" marks - it uses a bog-standard Unix inode structure. There are two units - fragments and blocks. Block is an aligned group of adjacent fragments; normal ratio is 8:1. Block is at least 4Kb (and always a power of two), fragment is at least a one sector and block:fragment ratio is at most 8:1. Inode structure is normal for a Unix filesystem (12 direct + indirect + double indirect + triple indirect). Each reference covers a block worth of file offsets and almost all of them point to full blocks. Indirects are full blocks as well. Reference to a block is represented as the number of the first fragment in it (i.e. with normal parameters bits 0..2 are clear). Block bitmap is actually a fragment bitmap (i.e. bit per fragment). The only situation when a reference is *not* to a full block is the last reference in a file shorter than 12*block size (i.e. not requiring indirects at all). In that case the last direct reference points to less than a full block (unless the size in fragments is a multiple of block:fragment ratio, that is). One unusual thing is that holes can't extend to EOF - the last byte *must* be allocated. (BTW, the only difference between UFS2 and UFS1 in that area is that fragment numbers are 64bit now. There had been talk about turning block:fragment ratio into a per-inode value, but so far nobody has implemented that - ->di_blksize is there, but it's never used). ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-06-23 21:56 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <1432754131-27425-1-git-send-email-fabf@skynet.be> [not found] ` <20150527145735.e3d1913bc66426038d53be32@linux-foundation.org> [not found] ` <20150604050123.GL7232@ZenIV.linux.org.uk> 2015-06-04 22:22 ` [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge" Al Viro [not found] ` <1122467636.634568.1433521621076.open-xchange@webmail.nmp.proximus.be> [not found] ` <20150605185018.GX7232@ZenIV.linux.org.uk> 2015-06-05 22:03 ` Al Viro 2015-06-17 8:57 ` Jan Kara 2015-06-17 20:31 ` Al Viro 2015-06-19 23:07 ` Al Viro 2015-06-23 16:46 ` Jan Kara 2015-06-23 21:56 ` Al Viro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).