* XFS shrink functionality
From: Ruben Porras @ 2007-06-01 16:39 UTC
To: xfs; +Cc: iusty, cw

Hello,

I'm investigating the possibility to write myself the necessary code to
shrink an xfs filesystem (I'd be able to dedicate a day/week). Trying to
know if something is already done I came across the mails of a previous
attempt [1], [2] (I'm cc'ing the people involved).

At a first glance the patch is a little outdated and will no longer apply
(as of linux 2.6.18, which is the last customised kernel that I was
able to run under a XEN environment), because at least the function
xfs_fs_geometry is changed.

I'm really curious about what happened to these patches and why they were
discontinued. The second part never was made public, and there was also
no answer. Was there any flaw in any of the posted code or anything in
XFS that makes it especially hard to shrink [3] that discouraged the
development?

After that, the first questions that arise are,
would there be some assistance/groove in from the developers?
How doable is it?
What are the programmer's requirements from your point of view?

Thank you.

[1] http://oss.sgi.com/archives/xfs/2005-08/msg00142.html
[2] http://oss.sgi.com/archives/xfs/2005-09/msg00038.html
[3] the only limitation that I might think of is not being able to
shrink past the internal journal.
* Re: XFS shrink functionality
From: David Chinner @ 2007-06-04 0:16 UTC
To: Ruben Porras; +Cc: xfs, iusty, cw

On Fri, Jun 01, 2007 at 06:39:34PM +0200, Ruben Porras wrote:
> Hello,
>
> I'm investigating the possibility to write myself the necessary code to
> shrink an xfs filesystem (I'd be able to dedicate a day/week). Trying to
> know if something is already done I came across the mails of a previous
> attempt [1], [2] (I'm cc'ing the people involved).

Oh, thanks for pointing those out - they're before my time ;)

> At a first glance the patch is a little outdated and will no longer apply
> (as of linux 2.6.18, which is the last customised kernel that I was
> able to run under a XEN environment), because at least the function
> xfs_fs_geometry is changed.

Any work for this would need to be done against current mainline
of the xfs-dev tree.

Yes, that patch is out of date, and it also did things that were not
necessary i.e. walk btrees to work out if AGs are empty or not.

> I'm really curious about what happened to these patches and why they were
> discontinued. The second part never was made public, and there was also
> no answer. Was there any flaw in any of the posted code or anything in
> XFS that makes it especially hard to shrink [3] that discouraged the
> development?

The posted code is only a *tiny* part of the shrink problem.

> After that, the first questions that arise are,
> would there be some assistance/groove in from the developers?

Certainly there's help available. ;)

> How doable is it?

It is doable.

> What are the programmer's requirements from your point of view?

Here's the "simple" bits that will allow you to shrink
the filesystem down to the end of the internal log:

0. Check space is available for shrink

1. Mark allocation groups as "don't use - going away soon"
        - so we don't put new stuff in them while we
          are moving all the bits out of them
        - requires hooks in the allocators to prevent
          the AG from being selected for allocations
        - must still allow allocations for the free lists
          so that extent freeing can succeed
        - *new transaction required*.
        - also needs an "undo" (e.g. on partial failure)
          so we need to be able to mark allocation groups
          online again.

2. Move inodes out of offline AGs
        - On Irix, we have a program called 'xfs_reno' which
          converts 64 bit inode filesystems to 32 bit inode
          filesystems. This needs to be:
                - released under the GPL (should not be a problem).
                - ported to linux
                - modified to understand inodes sit in certain
                  AGs and to move them out of those AGs as needed.
                - requires filesystem traversal to find all the
                  inodes to be moved.

                % wc -l xfs_reno.c
                1991 xfs_reno.c

        - even with "-o ikeep", this needs to trigger inode cluster
          deletion in offline AGs (needs hooks in xfs_ifree()).

3. Move data out of offline AGs.
        - this is difficult to do efficiently as we do not have
          a block-to-owner reverse mapping in the filesystem.
          Hence requires a walk of the *entire* filesystem to find
          the owners of data blocks in the AGs being offlined.
        - xfs_db wrapper might be the best way to do this...

<AGs are now empty>

4. Execute shrink
        - new transaction - XFS_TRANS_SHRINKFS
        - check AGs are empty
                - icount == 0
                - freeblks == mp->m_sb.sb_agblocks
                  (will be a little more than this)
        - check shrink won't go past end of internal log
        - free AGs, updating superblock fields
        - update perag structure
                - not a simple realloc() as there may
                  be other threads using the structure at the
                  same time....

Initially, I'd say just support shrinking to whole AGs - you've got to
empty the whole "partial-last-ag" to ensure we can shrink it anyway, so
doing a subsequent grow operation to increase the size afterwards
should be trivial.

Once this all works, we can then tackle the "move the log" problem
which will allow you to shrink to much smaller sizes.

As you can see, doing a shrink properly is not trivial, which is
probably why it hasn't gone anywhere fast....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
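[Editorial sketch] The two checks that bracket the plan above (step 0,
"is there room for the shrink?", and the emptiness test in step 4) can be
illustrated in userspace C. This is only a sketch under stated
assumptions: struct ag_summary, its fields, the overhead parameter and
both function names are hypothetical and are not the kernel's perag or
AGF/AGI layout. The step 0 test follows the reading that the AGs being
kept must have enough free space to absorb what is still in use in the
AGs being removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG summary; field names are illustrative only. */
struct ag_summary {
    uint32_t icount;    /* allocated inodes in this AG */
    uint32_t freeblks;  /* free blocks in this AG */
    uint32_t usedblks;  /* blocks whose contents would have to move */
};

/*
 * Step 0: AGs [0, keep) stay, AGs [keep, agcount) go away.
 * The shrink can only be attempted if the free space in the kept AGs
 * is at least as large as the space in use in the AGs being removed.
 */
bool shrink_space_available(const struct ag_summary *ags,
                            size_t agcount, size_t keep)
{
    uint64_t free_kept = 0;
    uint64_t used_removed = 0;

    assert(keep <= agcount);
    for (size_t i = 0; i < agcount; i++) {
        if (i < keep)
            free_kept += ags[i].freeblks;
        else
            used_removed += ags[i].usedblks;
    }
    return used_removed <= free_kept;
}

/*
 * Step 4 precondition: an AG counts as empty when it holds no inodes
 * and every block except a fixed overhead is free. "overhead" is an
 * assumed stand-in for the blocks the AG headers and btree roots keep.
 */
bool ag_is_empty(const struct ag_summary *ag,
                 uint32_t agblocks, uint32_t overhead)
{
    assert(overhead <= agblocks);
    return ag->icount == 0 && ag->freeblks == agblocks - overhead;
}
```

Keeping the two predicates separate mirrors the plan: the first is a
cheap pre-flight test done outside any transaction, the second has to
be re-evaluated inside the shrink transaction itself.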
* Re: XFS shrink functionality
From: Iustin Pop @ 2007-06-04 8:41 UTC
To: David Chinner; +Cc: Ruben Porras, xfs, cw

Disclaimer: all the below is based on my weak understanding of the code,
I don't claim I'm right below.

On Mon, Jun 04, 2007 at 10:16:32AM +1000, David Chinner wrote:
> Any work for this would need to be done against current mainline
> of the xfs-dev tree.
>
> Yes, that patch is out of date, and it also did things that were not
> necessary i.e. walk btrees to work out if AGs are empty or not.

Well, I did what I could based on my own understanding of the code.
Sorry if it's ugly :)

> > I'm really curious about what happened to these patches and why they were
> > discontinued. The second part never was made public, and there was also
> > no answer. Was there any flaw in any of the posted code or anything in
> > XFS that makes it especially hard to shrink [3] that discouraged the
> > development?
>
> The posted code is only a *tiny* part of the shrink problem.

My idea at that time was to start small and be able to shrink an empty
filesystem (or empty at least regarding the AGs that you want to clear).
The point is that if AGs are lockable outside of a transaction
(something like the freeze/unfreeze functionality at the fs level), then
by simply copying the conflicting files you ensure that they are
allocated on an available AG and when you remove the originals, the
to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
simplest way to do it based on what I knew at the time.

> > After that, the first questions that arise are,
> > would there be some assistance/groove in from the developers?
>
> Certainly there's help available. ;)

Good to know. If there is at least more documentation about the
internals, I could try to find some time to work on this again.

>
> > What are the programmer's requirements from your point of view?
>
> Here's the "simple" bits that will allow you to shrink
> the filesystem down to the end of the internal log:
>
> 0. Check space is available for shrink

Can be done by actually allocating the space to be freed at the
beginning of the transaction. Right? This is actually a bit more than
needed, since when freeing an AG you also free some non-available space,
but it's ok.

> 1. Mark allocation groups as "don't use - going away soon"
>         - so we don't put new stuff in them while we
>           are moving all the bits out of them
>         - requires hooks in the allocators to prevent
>           the AG from being selected for allocations
>         - must still allow allocations for the free lists
>           so that extent freeing can succeed
>         - *new transaction required*.
>         - also needs an "undo" (e.g. on partial failure)
>           so we need to be able to mark allocation groups
>           online again.

So a question: can transactions be nested? Because the offline AG
transaction needs to live until the shrink transaction is done. I was
more thinking that the offline-AG should be a bit on the AG that could
be changed by the admin (like xfs_freeze); this could also help for
other reasons than shrink (when on a big FS some AGs lie on a physical
device and others on a different device, and you would like to restrict
writes to a given AG, as much as possible).

> 2. Move inodes out of offline AGs
>         - On Irix, we have a program called 'xfs_reno' which
>           converts 64 bit inode filesystems to 32 bit inode
>           filesystems. This needs to be:
>                 - released under the GPL (should not be a problem).
>                 - ported to linux
>                 - modified to understand inodes sit in certain
>                   AGs and to move them out of those AGs as needed.
>                 - requires filesystem traversal to find all the
>                   inodes to be moved.

Interesting. I've read on the mail list of this before, but no other
details.

>
>                 % wc -l xfs_reno.c
>                 1991 xfs_reno.c
>
>         - even with "-o ikeep", this needs to trigger inode cluster
>           deletion in offline AGs (needs hooks in xfs_ifree()).

This part (removal of inodes) is not actually needed if the icount ==
ifree (I presume this means that all the existing inodes are free).

> 3. Move data out of offline AGs.
>         - this is difficult to do efficiently as we do not have
>           a block-to-owner reverse mapping in the filesystem.
>           Hence requires a walk of the *entire* filesystem to find
>           the owners of data blocks in the AGs being offlined.
>         - xfs_db wrapper might be the best way to do this...
>
> <AGs are now empty>
>
> 4. Execute shrink
>         - new transaction - XFS_TRANS_SHRINKFS
>         - check AGs are empty
>                 - icount == 0
>                 - freeblks == mp->m_sb.sb_agblocks
>                   (will be a little more than this)
>         - check shrink won't go past end of internal log
>         - free AGs, updating superblock fields
>         - update perag structure
>                 - not a simple realloc() as there may
>                   be other threads using the structure at the
>                   same time....
>

My suggestion would be to start implementing these steps in reverse. 4)
is the most important as it touches the entire FS. If 4) is working
correctly, then 1) would be simpler (I think) and 3) can be implemented
by just running a forced xfs_fsr against the conflicting files. I don't
know about 2).

Sorry if I'm blatantly wrong in my statements. Good to have more
information!

regards,
iustin
* Re: XFS shrink functionality
From: David Chinner @ 2007-06-04 9:21 UTC
To: David Chinner, Ruben Porras, xfs, cw

On Mon, Jun 04, 2007 at 10:41:54AM +0200, Iustin Pop wrote:
> Disclaimer: all the below is based on my weak understanding of the code,
> I don't claim I'm right below.
>
> On Mon, Jun 04, 2007 at 10:16:32AM +1000, David Chinner wrote:
> > Any work for this would need to be done against current mainline
> > of the xfs-dev tree.
> >
> > Yes, that patch is out of date, and it also did things that were not
> > necessary i.e. walk btrees to work out if AGs are empty or not.
>
> Well, I did what I could based on my own understanding of the code.
> Sorry if it's ugly :)
>
> > > I'm really curious about what happened to these patches and why they were
> > > discontinued. The second part never was made public, and there was also
> > > no answer. Was there any flaw in any of the posted code or anything in
> > > XFS that makes it especially hard to shrink [3] that discouraged the
> > > development?
> >
> > The posted code is only a *tiny* part of the shrink problem.
>
> My idea at that time was to start small and be able to shrink an empty
> filesystem (or empty at least regarding the AGs that you want to clear).

Yes, that is one way of looking at it....

> The point is that if AGs are lockable outside of a transaction
> (something like the freeze/unfreeze functionality at the fs level), then
> by simply copying the conflicting files you ensure that they are

Copying is not good enough - attributes must remain unchanged. The only
thing we can't preserve is the inode number....

> allocated on an available AG and when you remove the originals, the
> to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
> simplest way to do it based on what I knew at the time.

Not quite that simple, unfortunately. You can't leave the
AGs locked in the same way we do for a grow because we need
to be able to use the AGs to move stuff about and that
requires locking them. Hence we need a separate mechanism
to prevent allocation in a given AG outside of locking them.

Hence we need:

        - a transaction to mark AGs "no-allocate"
        - a transaction to mark AGs "allocatable"
        - a flag in each AGF/AGI to say the AG is available for
          allocations (persistent over crashes)
        - a flag in the per-ag structure to indicate allocation
          status of the AG.
        - everywhere we select an AG for allocation, we need to
          check this flag and skip the AG if it's not available.

FWIW, the transactions can probably just be an extension of
xfs_alloc_log_agf() and xfs_alloc_log_agi()....

> > > What are the programmer's requirements from your point of view?
> >
> > Here's the "simple" bits that will allow you to shrink
> > the filesystem down to the end of the internal log:
> >
> > 0. Check space is available for shrink
> Can be done by actually allocating the space to be freed at the
> beginning of the transaction. Right?

No, I mean that you need to check that there is sufficient space in
the untouched AGs to move all the data from the AGs to be removed
into the remaining part of the filesystem. This is not part of a
transaction, but still a check that needs to be done before starting....

> > 1. Mark allocation groups as "don't use - going away soon"
> >         - so we don't put new stuff in them while we
> >           are moving all the bits out of them
> >         - requires hooks in the allocators to prevent
> >           the AG from being selected for allocations
> >         - must still allow allocations for the free lists
> >           so that extent freeing can succeed
> >         - *new transaction required*.
> >         - also needs an "undo" (e.g. on partial failure)
> >           so we need to be able to mark allocation groups
> >           online again.
>
> So a question: can transactions be nested?

No.

> Because the offline AG
> transaction needs to live until the shrink transaction is done.

No it doesn't - the *state* needs to remain until we do the shrink,
the transaction only needs to live until it has hit the disk.

> I was
> more thinking that the offline-AG should be a bit on the AG that could
> be changed by the admin (like xfs_freeze); this could also help for
> other reasons than shrink (when on a big FS some AGs lie on a physical
> device and others on a different device, and you would like to restrict
> writes to a given AG, as much as possible).

Yes, that's exactly what I'm talking about ;)

> > 2. Move inodes out of offline AGs
> >         - On Irix, we have a program called 'xfs_reno' which
> >           converts 64 bit inode filesystems to 32 bit inode
> >           filesystems. This needs to be:
> >                 - released under the GPL (should not be a problem).
> >                 - ported to linux
> >                 - modified to understand inodes sit in certain
> >                   AGs and to move them out of those AGs as needed.
> >                 - requires filesystem traversal to find all the
> >                   inodes to be moved.
> Interesting. I've read on the mail list of this before, but no other
> details.
>
> >
> >                 % wc -l xfs_reno.c
> >                 1991 xfs_reno.c
> >
> >         - even with "-o ikeep", this needs to trigger inode cluster
> >           deletion in offline AGs (needs hooks in xfs_ifree()).
> This part (removal of inodes) is not actually needed if the icount ==
> ifree (I presume this means that all the existing inodes are free).

Yes, I guess that could be done - it means extra stuffing about when
doing the final shrink transaction, though. e.g. making sure that
free block counts update correctly given that the AGI btrees will be
consuming blocks - easier just to free the clusters as they get
emptied, I think....

> > 3. Move data out of offline AGs.
> >         - this is difficult to do efficiently as we do not have
> >           a block-to-owner reverse mapping in the filesystem.
> >           Hence requires a walk of the *entire* filesystem to find
> >           the owners of data blocks in the AGs being offlined.
> >         - xfs_db wrapper might be the best way to do this...
> >
> > <AGs are now empty>
> >
> > 4. Execute shrink
> >         - new transaction - XFS_TRANS_SHRINKFS
> >         - check AGs are empty
> >                 - icount == 0
> >                 - freeblks == mp->m_sb.sb_agblocks
> >                   (will be a little more than this)
> >         - check shrink won't go past end of internal log
> >         - free AGs, updating superblock fields
> >         - update perag structure
> >                 - not a simple realloc() as there may
> >                   be other threads using the structure at the
> >                   same time....
> >
>
> My suggestion would be to start implementing these steps in reverse. 4)
> is the most important as it touches the entire FS. If 4) is working
> correctly, then 1) would be simpler (I think) and 3) can be implemented
> by just running a forced xfs_fsr against the conflicting files. I don't
> know about 2).

Yeah, 1) and 4) are separable parts of the problem and can be done
in any order. 2) can be implemented relatively easily as stated
above.

3) is the hard one - we need to find the owner of each block
(metadata and data) remaining in the AGs to be removed. This may be
a directory btree block, an inode extent btree block, a data block,
an extended attr block, etc. Moving the data blocks is easy to
do (swap extents), but moving the metadata blocks is a major PITA
as it will need to be done transactionally and that will require
a bunch of new (complex) code to be written, I think. It will be
of equivalent complexity to defragmenting metadata....

If we ignore the metadata block problem then finding and moving the
data blocks should not be a problem - swap extents can be used for
that as well - but it will be extremely time consuming and won't
scale to large filesystem sizes....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
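[Editorial sketch] The "no-allocate" mechanism described above (a flag in
the per-AG structure that every AG selection path checks, while free-list
allocations stay possible so that extent freeing still works) might look
roughly like the following userspace C. Everything here is an assumption
made for illustration: struct perag_state, AG_NONE, select_ag() and
may_allocate() are invented names, and the real allocator's AG selection
is considerably more involved than a simple rotor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AG_NONE UINT32_MAX   /* hypothetical "no suitable AG" marker */

/* Hypothetical in-memory per-AG state mirroring an on-disk flag. */
struct perag_state {
    bool     no_alloc;   /* AG is marked "don't use - going away soon" */
    uint32_t freeblks;   /* free blocks currently in the AG */
};

/*
 * Rotor-style AG selection: start at start_ag, wrap around once, and
 * skip every AG that is flagged no_alloc. Returns AG_NONE when no
 * allocatable AG has at least minblks free blocks.
 */
uint32_t select_ag(const struct perag_state *pag, uint32_t agcount,
                   uint32_t start_ag, uint32_t minblks)
{
    assert(agcount > 0 && start_ag < agcount);
    for (uint32_t n = 0; n < agcount; n++) {
        uint32_t agno = (start_ag + n) % agcount;

        if (pag[agno].no_alloc)
            continue;               /* being emptied: never pick it */
        if (pag[agno].freeblks >= minblks)
            return agno;
    }
    return AG_NONE;
}

/*
 * Allocations that refill the AG's own free list must still succeed
 * in a flagged AG, otherwise freeing extents out of it could fail.
 */
bool may_allocate(const struct perag_state *pag, bool for_freelist)
{
    return for_freelist || !pag->no_alloc;
}
```

The flag is state rather than a lock, which matches the point made in
the reply: the AG stays fully usable for moving blocks out of it, and
only new placements are steered elsewhere.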
* Re: XFS shrink functionality
From: Iustin Pop @ 2007-06-05 8:00 UTC
To: David Chinner; +Cc: Ruben Porras, xfs, cw

On Mon, Jun 04, 2007 at 07:21:15PM +1000, David Chinner wrote:
> > allocated on an available AG and when you remove the originals, the
> > to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
> > simplest way to do it based on what I knew at the time.
>
> Not quite that simple, unfortunately. You can't leave the
> AGs locked in the same way we do for a grow because we need
> to be able to use the AGs to move stuff about and that
> requires locking them. Hence we need a separate mechanism
> to prevent allocation in a given AG outside of locking them.
>
> Hence we need:
>
>         - a transaction to mark AGs "no-allocate"
>         - a transaction to mark AGs "allocatable"
>         - a flag in each AGF/AGI to say the AG is available for
>           allocations (persistent over crashes)
>         - a flag in the per-ag structure to indicate allocation
>           status of the AG.
>         - everywhere we select an AG for allocation, we need to
>           check this flag and skip the AG if it's not available.
>
> FWIW, the transactions can probably just be an extension of
> xfs_alloc_log_agf() and xfs_alloc_log_agi()....

A question: do you think that the cost of having this in the code
(especially the last part, check that flag in every allocation function)
is acceptable? I mean, let's say one would write the patch to implement
all this. Does it have a chance to be accepted? Or will people say it's
only bloat?

...

> > I was
> > more thinking that the offline-AG should be a bit on the AG that could
> > be changed by the admin (like xfs_freeze); this could also help for
> > other reasons than shrink (when on a big FS some AGs lie on a physical
> > device and others on a different device, and you would like to restrict
> > writes to a given AG, as much as possible).
>
> Yes, that's exactly what I'm talking about ;)

Ah, I see now what you meant by having a transaction for
locking/unlocking AGs for allocation.

> Yeah, 1) and 4) are separable parts of the problem and can be done
> in any order. 2) can be implemented relatively easily as stated
> above.
>
> 3) is the hard one - we need to find the owner of each block
> (metadata and data) remaining in the AGs to be removed. This may be
> a directory btree block, an inode extent btree block, a data block,
> an extended attr block, etc. Moving the data blocks is easy to
> do (swap extents), but moving the metadata blocks is a major PITA
> as it will need to be done transactionally and that will require
> a bunch of new (complex) code to be written, I think. It will be
> of equivalent complexity to defragmenting metadata....
>
> If we ignore the metadata block problem then finding and moving the
> data blocks should not be a problem - swap extents can be used for
> that as well - but it will be extremely time consuming and won't
> scale to large filesystem sizes....

So given these caveats, is there a chance that a) this will be actually
useful and b) will this be accepted?

The last time I tried to work on this there has been no real feedback
and I'm thinking that maybe the code will be too intrusive and will give
too little gain to be accepted.

Thanks for your comments,
iustin
* Re: XFS shrink functionality
From: Nathan Scott @ 2007-06-06 1:50 UTC
To: Iustin Pop; +Cc: David Chinner, Ruben Porras, xfs, cw

On Tue, 2007-06-05 at 10:00 +0200, Iustin Pop wrote:
>
>
> So given these caveats, is there a chance that a) this will be
> actually
> useful and b) will this be accepted?

There's no doubt that it's useful, it's probably the most frequently
requested feature for XFS from the community. I'd imagine its
acceptance will depend on code quality, testing, etc, etc.

> The last time I tried to work on this there has been no real feedback
> and I'm thinking that maybe the code will be too intrusive and will
> give
> too little gain to be accepted.

IIRC, most people missed the patch last time cos it got bounced by
the list (can't remember why) - that was why I missed it for a long
time, anyway.

cheers.

-- 
Nathan
* Re: XFS shrink functionality
From: David Chinner @ 2007-06-07 8:18 UTC
To: David Chinner, Ruben Porras, xfs, cw

On Tue, Jun 05, 2007 at 10:00:12AM +0200, Iustin Pop wrote:
> On Mon, Jun 04, 2007 at 07:21:15PM +1000, David Chinner wrote:
> > > allocated on an available AG and when you remove the originals, the
> > > to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
> > > simplest way to do it based on what I knew at the time.
> >
> > Not quite that simple, unfortunately. You can't leave the
> > AGs locked in the same way we do for a grow because we need
> > to be able to use the AGs to move stuff about and that
> > requires locking them. Hence we need a separate mechanism
> > to prevent allocation in a given AG outside of locking them.
> >
> > Hence we need:
> >
> >         - a transaction to mark AGs "no-allocate"
> >         - a transaction to mark AGs "allocatable"
> >         - a flag in each AGF/AGI to say the AG is available for
> >           allocations (persistent over crashes)
> >         - a flag in the per-ag structure to indicate allocation
> >           status of the AG.
> >         - everywhere we select an AG for allocation, we need to
> >           check this flag and skip the AG if it's not available.
> >
> > FWIW, the transactions can probably just be an extension of
> > xfs_alloc_log_agf() and xfs_alloc_log_agi()....
>
> A question: do you think that the cost of having this in the code
> (especially the last part, check that flag in every allocation function)
> is acceptable? I mean, let's say one would write the patch to implement
> all this. Does it have a chance to be accepted? Or will people say it's
> only bloat?

...

Lots of ppl ask for shrink capability on XFS, so if it's implemented and
reviewed and passes QA tests, then I see no reason why it wouldn't be
accepted...

> > Yeah, 1) and 4) are separable parts of the problem and can be done
> > in any order. 2) can be implemented relatively easily as stated
> > above.
> >
> > 3) is the hard one - we need to find the owner of each block
> > (metadata and data) remaining in the AGs to be removed. This may be
> > a directory btree block, a inode extent btree block, a data block,
> > and extended attr block, etc. Moving the data blocks is easy to
> > do (swap extents), but moving the metadata blocks is a major PITA
> > as it will need to be done transactionally and that will require
> > a bunch of new (complex) code to be written, I think. It will be
> > of equivalent complexity to defragmenting metadata....
> >
> > If we ignore the metadata block problem then finding and moving the
> > data blocks should not be a problem - swap extents can be used for
> > that as well - but it will be extremely time consuming and won't
> > scale to large filesystem sizes....
>
> So given these caveats, is there a chance that a) this will be actually
> useful and b) will this be accepted?

Look at it this way - if we get to the point where 3 is a problem, then
we've got most of a useful shrinker. That's way ahead of what we have now
and in a lot of cases it will just work. The corner cases are the hard
bit, but we can work on them incrementally once the rest is done, and in
doing so we'll also be introducing the means by which to defragment
metadata. IOWs, we kill two birds with one stone at that point in time.

Likewise for the shrink case that needs to move the log - we've got hooks
for userspace tools to move the log, just no implementation. Implementing
log moving for shrink will also enable us to do online log resize and
internal/external log switching. Once again, two birds with one stone.

Hence I don't see these issues as showstoppers at all - getting to the
point of a full shrink implementation will give us other features that
we need to have anyway....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: XFS shrink functionality
From: Ruben Porras @ 2007-06-08 8:23 UTC
To: Iustin Pop; +Cc: David Chinner, xfs, cw

On Monday, 04.06.2007, at 10:41 +0200, Iustin Pop wrote:
> Good to know. If there is at least more documentation about the
> internals, I could try to find some time to work on this again.

There is now a document explaining the XFS on disk format [0] and some
presentations for training courses, I think none of these were available
at the time you made the first try. Although they are not enough for our
purpose.

> My suggestion would be to start implementing these steps in reverse. 4)
> is the most important as it touches the entire FS. If 4) is working
> correctly, then 1) would be simpler (I think)

Why do you think that 1) would be simpler after 4)? From what I
understand, they are independent.

3) worries me, if walking the entire filesystem is needed, it won't
scale...

Since I don't know yet the xfs code I would like to begin with 1), I see
it independent from the other parts, and I can then learn more about the
transactions, allocators, and walking through the xfs structures. As you
did 4) one time, maybe you could try with this part of the problem if
you find the needed time, taking David's suggestions into account.

[0] http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf

Cheers
-- 
Ruben Porras
LinWorks GmbH
* Re: XFS shrink functionality
From: Iustin Pop @ 2007-06-08 10:15 UTC
To: Ruben Porras; +Cc: David Chinner, xfs, cw

On Fri, Jun 08, 2007 at 10:23:53AM +0200, Ruben Porras wrote:
> On Monday, 04.06.2007, at 10:41 +0200, Iustin Pop wrote:
> > Good to know. If there is at least more documentation about the
> > internals, I could try to find some time to work on this again.
>
> There is now a document explaining the XFS on disk format [0] and some
> presentations for training courses, I think none of these were available
> at the time you made the first try. Although they are not enough for our
> purpose.
>

Yes, just yesterday I found the document and it helps.

> > My suggestion would be to start implementing these steps in reverse. 4)
> > is the most important as it touches the entire FS. If 4) is working
> > correctly, then 1) would be simpler (I think)
>
> Why do you think that 1) would be simpler after 4)? From what I
> understand, they are independent.

Not after that in the chronological sense, but in the importance part.
Yes, it was a bad choice of words.

> 3) worries me, if walking the entire filesystem is needed, it won't
> scale...
>
> Since I don't know yet the xfs code I would like to begin with 1), I see
> it independent from the other parts, and I can then learn more about the
> transactions, allocators, and walking through the xfs structures. As you
> did 4) one time, maybe you could try with this part of the problem if
> you find the needed time, taking David's suggestions into account.

I took a look at both items since this discussion started. And honestly,
I think 1) is harder than 4), so you're welcome to work on it :) The
points that make it harder are that, per David's suggestion, we need
to:
  - define two new transaction types
  - define two new ioctls
  - update the ondisk-format (!), if we want persistence of these flags;
    luckily, there are two spare fields in the AGF structure.
  - check the list of allocation functions that allocate space from the
    AG
I did some preliminary work on this but just a little.

I think that after the weekend I'll send an updated patch of 4). I have
one working now with the current CVS tree, just that it's still ugly and
needs polishing.

Open questions (re. point 4):
  - the filesystem document says the agf->agf_btreeblks is held only in
    case we have an extended flag active for the filesystem
    (XFS_SB_VERSION2_LAZYSBCOUNTBIT); is this true? without this, I'm not
    sure how to calculate this number of blocks nicely
  - or can I assume that an empty AG will *always* have agf_levels = 1
    for both Btrees, so there are no extra blocks actually used for the
    btrees (except for the two reserved ones at the beginning of the AG)
  - can I assume that an AG with agi->icount == agi->ifree == 0 will have
    no blocks used for the inode btrees (logically yes, but I'm not sure)

thanks,
iustin
* Re: XFS shrink functionality
  2007-06-08 10:15 ` Iustin Pop
@ 2007-06-08 15:12 ` David Chinner
  2007-06-08 16:03   ` Iustin Pop
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: David Chinner @ 2007-06-08 15:12 UTC (permalink / raw)
To: Ruben Porras, David Chinner, xfs, cw

On Fri, Jun 08, 2007 at 12:15:32PM +0200, Iustin Pop wrote:
> On Fri, Jun 08, 2007 at 10:23:53AM +0200, Ruben Porras wrote:
> > On Monday, 04.06.2007, at 10:41 +0200, Iustin Pop wrote:
> > > Good to know. If there is at least more documentation about the
> > > internals, I could try to find some time to work on this again.
> >
> > there is now a document explaining the XFS on disk format [0] and some
> > presentations for training courses, I think none of these were available
> > at the time you made the first try. Although they are not enough for our
> > purpose.
> >
>
> Yes, just yesterday I found the document and it helps.
>
> > > My suggestion would be to start implementing these steps in reverse. 4)
> > > is the most important as it touches the entire FS. If 4) is working
> > > correctly, then 1) would be simpler (I think)
> >
> > Why do you think that 1) would be simpler after 4)? From what I
> > understand, they are independent.
> Not after that in the chronological sense, but in the importance part.
> Yes, it was a bad choice of words.
>
> > 3) worries me, if walking the entire filesystem is needed, it won't
> > scale...
> >
> > Since I don't know yet the xfs code I would like to begin with 1), I see
> > it independent from the other parts, and I can then learn more about the
> > transactions, allocators, and walking through the xfs structures. As you
> > did 4) one time, maybe you could try with this part of the problem if
> > you find the needed time, taking David's suggestions into account.
>
> I took a look at both items since this discussion started. And honestly,
> I think 1) is harder than 4), so you're welcome to work on it :) The
> points that make it harder are that, per David's suggestion, there needs
> to be:
>   - define two new transaction types

one new transaction type:

	XFS_TRANS_AGF_FLAGS

and an extension to xfs_alloc_log_agf(). That is about all that is
needed there.

See the patch here:

http://oss.sgi.com/archives/xfs/2007-04/msg00103.html

For an example of a very similar transaction to what is needed
(look at xfs_log_sbcount()) and very similar addition to
the AGF (xfs_btreeblks).

>   - define two new ioctls

	XFS_IOC_ALLOC_ALLOW_AG, parameter xfs_agnumber_t.
	XFS_IOC_ALLOC_DENY_AG, parameter xfs_agnumber_t.

>   - update the ondisk-format (!), if we want persistence of these flags;
>     luckily, there are two spare fields in the AGF structure.

Better to expand, I think. The AGF is a sector in length - we can
expand the structure as we need to this size without fear, esp. as
the part of the sector outside the structure is guaranteed to be
zero. i.e. we can add a flags field to the end of the AGF
structure - old filesystems simply read as "no flags set" and
old kernels never look at those bits....

>   - check the list of allocation functions that allocate space from the
>     AG
> I did some preliminary work on this but just a little.
>
> I think that after the weekend I'll send an updated patch of 4). I have
> one working now with the current CVS tree, just that it's still ugly and
> needs polishing.
>
> Open questions (re. point 4):
>   - the filesystem document says the agf->agf_btreeblks is held only in
>     case we have an extended flag active for the filesystem
>     (XFS_SB_VERSION2_LAZYSBCOUNTBIT); is this true? without this, I'm not
>     sure how to calculate this number of blocks nicely

Yes, that is true. There's a pre-req for shrinking for the moment :/

>   - or can I assume that an empty AG will *always* have agf_levels = 1
>     for both Btrees, so there are no extra blocks actually used for the
>     btrees (except for the two reserved ones at the beginning of the AG)

Yes, that is a valid assumption.

>   - can I assume that an AG with agi->icount == agi->ifree == 0 will have
>     no blocks used for the inode btrees (logically yes, but I'm not sure)

yes.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
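A small user-space sketch may help illustrate the point made above about growing the AGF inside its sector. The two structs below are simplified, hypothetical stand-ins (they are not the real xfs_agf_t layout, and on-disk endianness is ignored); the sector size and helper names are assumptions for illustration only. The sketch shows just one thing: a header written under the old layout into a zero-filled sector reads back as "no flags set" when interpreted with a layout that has a flags word appended.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512 /* assumed sector size for this sketch */

/* Hypothetical, simplified "old" header; NOT the real xfs_agf_t. */
struct agf_old {
    uint32_t magicnum;
    uint32_t freeblks;
    uint32_t btreeblks;
};

/* The same header with a flags word appended at the end. */
struct agf_new {
    uint32_t magicnum;
    uint32_t freeblks;
    uint32_t btreeblks;
    uint32_t flags;
};

/* Write an old-layout header into a sector whose remainder is zero. */
static void write_old_agf(unsigned char sector[SECTOR_SIZE], uint32_t freeblks)
{
    struct agf_old old = { 0x58414746u /* illustrative magic */, freeblks, 0 };

    memset(sector, 0, SECTOR_SIZE);
    memcpy(sector, &old, sizeof(old));
}

/* Interpret the sector with the extended layout and return the flags word. */
static uint32_t read_new_flags(const unsigned char sector[SECTOR_SIZE])
{
    struct agf_new cur;

    assert(sizeof(cur) <= SECTOR_SIZE); /* the header must still fit in one sector */
    memcpy(&cur, sector, sizeof(cur));
    return cur.flags;
}
```

Because the bytes past the old structure are zero, the new field needs no conversion step for existing filesystems in this model; whether that holds for every real on-disk AGF is the guarantee David refers to, not something this sketch proves.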
* Re: XFS shrink functionality
  2007-06-08 15:12 ` David Chinner
@ 2007-06-08 16:03 ` Iustin Pop
  2007-06-09 2:15   ` David Chinner
  2007-06-08 19:47 ` Ruben Porras
  2007-06-14 8:35 ` Ruben Porras
  2 siblings, 1 reply; 21+ messages in thread
From: Iustin Pop @ 2007-06-08 16:03 UTC (permalink / raw)
To: David Chinner; +Cc: Ruben Porras, xfs, cw

On Sat, Jun 09, 2007 at 01:12:23AM +1000, David Chinner wrote:
> > I took a look at both items since this discussion started. And honestly,
> > I think 1) is harder than 4), so you're welcome to work on it :) The
> > points that make it harder are that, per David's suggestion, there needs
> > to be:
> >   - define two new transaction types
>
> one new transaction type:
>
> 	XFS_TRANS_AGF_FLAGS
>
> and an extension to xfs_alloc_log_agf(). That is about all that is
> needed there.
>
> See the patch here:
>
> http://oss.sgi.com/archives/xfs/2007-04/msg00103.html

Ah, I see now. I was wondering how one can enable the new bits (CVS
xfs_db shows the btreeblks but the 'version' cmd doesn't allow changing
them), it seems that manual xfs_db work + xfs_repair allows them.

> For an example of a very similar transaction to what is needed
> (look at xfs_log_sbcount()) and very similar addition to
> the AGF (xfs_btreeblks).
Just a question: why do you think this per-AG bit needs to be persistent?
I'm just curious. When I first thought about this, I was thinking more like
this should be an in-core flag only, like the freeze flag is for the
filesystem. The idea being that you don't need to recover this state
after a crash - there is no actual state, just restart the shrink
operation if you want. And no actual filesystem state (e.g. space
allocation or such) is happening when you toggle the AGs not
allocatable. This would allow a much simpler implementation of the
'no-alloc' part.

> >   - update the ondisk-format (!), if we want persistence of these flags;
> >     luckily, there are two spare fields in the AGF structure.
>
> Better to expand, I think. The AGF is a sector in length - we can
> expand the structure as we need to this size without fear, esp. as
> the part of the sector outside the structure is guaranteed to be
> zero. i.e. we can add a flags field to the end of the AGF
> structure - old filesystems simply read as "no flags set" and
> old kernels never look at those bits....
Yes, makes sense. Just to make sure: the xfs_agf_t, xfs_agi_t and
xfs_sb_t structures as defined in xfs_sb.h and xfs_ag.h are what
actually is on-disk, right? Adding to them, defining the new bits i.e.
XFS_AGF_FLAGS and bumping up XFS_AGF_ALL_BITS should take care of the
on-disk part?

> > Open questions (re. point 4):
> >   - the filesystem document says the agf->agf_btreeblks is held only in
> >     case we have an extended flag active for the filesystem
> >     (XFS_SB_VERSION2_LAZYSBCOUNTBIT); is this true? without this, I'm not
> >     sure how to calculate this number of blocks nicely
>
> Yes, that is true. There's a pre-req for shrinking for the moment :/
>
> >   - or can I assume that an empty AG will *always* have agf_levels = 1
> >     for both Btrees, so there are no extra blocks actually used for the
> >     btrees (except for the two reserved ones at the beginning of the AG)
>
> Yes, that is a valid assumption.
Ok, perfect. This then eliminates the need for LAZYSBCOUNTBIT.

Just one more question: can I *read* from the mp->m_perag structure or
do I need a lock (even for read), i.e. down_read, read the fields,
up_read? (As you can see, I don't have much experience w.r.t. kernel
programming).

> >   - can I assume that an AG with agi->icount == agi->ifree == 0 will have
> >     no blocks used for the inode btrees (logically yes, but I'm not sure)
>
> yes.
Good.

Thanks for your explanations. The patch for shrinking when the AGs are
empty will then be simpler and nicer than what I have now.

iustin

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink functionality
  2007-06-08 16:03 ` Iustin Pop
@ 2007-06-09 2:15 ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-09 2:15 UTC (permalink / raw)
To: David Chinner, Ruben Porras, xfs, cw

On Fri, Jun 08, 2007 at 06:03:18PM +0200, Iustin Pop wrote:
> On Sat, Jun 09, 2007 at 01:12:23AM +1000, David Chinner wrote:
> > > I took a look at both items since this discussion started. And honestly,
> > > I think 1) is harder than 4), so you're welcome to work on it :) The
> > > points that make it harder are that, per David's suggestion, there needs
> > > to be:
> > >   - define two new transaction types
> >
> > one new transaction type:
> >
> > 	XFS_TRANS_AGF_FLAGS
> >
> > and an extension to xfs_alloc_log_agf(). That is about all that is
> > needed there.
> >
> > See the patch here:
> >
> > http://oss.sgi.com/archives/xfs/2007-04/msg00103.html
>
> Ah, I see now. I was wondering how one can enable the new bits (CVS
> xfs_db shows the btreeblks but the 'version' cmd doesn't allow changing
> them), it seems that manual xfs_db work + xfs_repair allows them.

The xfs_db work needs to be wrapped up in xfs_admin. That's relatively
simple to do, but the repair stage is needed to count the btree
blocks and update the counter in each AGF. That could probably also
be wrapped up in an xfs_db script so conversion wouldn't
require you to run repair....

> > For an example of a very similar transaction to what is needed
> > (look at xfs_log_sbcount()) and very similar addition to
> > the AGF (xfs_btreeblks).
> Just a question: why do you think this per-AG bit needs to be persistent?

Shrinking is not the only reason why you might want to prevent
allocation within an AG. While we might be able to get away with a totally
in memory flag for a shrink, I really don't want to have multiple
mechanisms for doing roughly the same thing.

e.g. Think of fault tolerance - you detect a free space btree
corruption, so you prevent allocation and freeing in that AG (by
setting the relevant bits) until you can come along and repair it.
If you want to do online repair of this sort of corruption, then
you need to be able to stop the trees from being used between
the time that the corruption is detected and the time it is repaired.
That may be longer than the filesystem is currently mounted...

> I'm
> just curious. When I first thought about this, I was thinking more like
> this should be an in-core flag only, like the freeze flag is for the
> filesystem. The idea being that you don't need to recover this state
> after a crash

But a freeze is different - it's not modifying the filesystem, just
bringing it down into a consistent state. A shrink is a modification
operation, and so if it crashes half way through, we need to ensure that
recovery doesn't do silly things. Hence it is best to have all the state
associated with the shrink journalled and recoverable. i.e. persistent.

> - there is no actual state, just restart the shrink
> operation if you want. And no actual filesystem state (e.g. space
> allocation or such) is happening when you toggle the AGs not
> allocatable. This would allow a much simpler implementation of the
> 'no-alloc' part.

True, but it would be much more limited in its potential use.

> > >   - update the ondisk-format (!), if we want persistence of these flags;
> > >     luckily, there are two spare fields in the AGF structure.
> >
> > Better to expand, I think. The AGF is a sector in length - we can
> > expand the structure as we need to this size without fear, esp. as
> > the part of the sector outside the structure is guaranteed to be
> > zero. i.e. we can add a flags field to the end of the AGF
> > structure - old filesystems simply read as "no flags set" and
> > old kernels never look at those bits....
> Yes, makes sense. Just to make sure: the xfs_agf_t, xfs_agi_t and
> xfs_sb_t structures as defined in xfs_sb.h and xfs_ag.h are what
> actually is on-disk, right? Adding to them, defining the new bits i.e.
> XFS_AGF_FLAGS and bumping up XFS_AGF_ALL_BITS should take care of the
> on-disk part?

Don't forget to modify xfs_alloc_log_agf() as well ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink functionality
  2007-06-08 15:12 ` David Chinner
  2007-06-08 16:03   ` Iustin Pop
@ 2007-06-08 19:47 ` Ruben Porras
  2007-06-14 8:35 ` Ruben Porras
  2 siblings, 0 replies; 21+ messages in thread
From: Ruben Porras @ 2007-06-08 19:47 UTC (permalink / raw)
To: David Chinner; +Cc: xfs, cw

On Saturday, 09.06.2007, at 01:12 +1000, David Chinner wrote:

Thank you, this last mail explains the pieces I should do pretty well :)

Cheers

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink functionality
  2007-06-08 15:12 ` David Chinner
  2007-06-08 16:03   ` Iustin Pop
  2007-06-08 19:47 ` Ruben Porras
@ 2007-06-14 8:35 ` Ruben Porras
  2007-06-14 9:14   ` David Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-14 8:35 UTC (permalink / raw)
To: David Chinner; +Cc: xfs, cw, iusty

[-- Attachment #1: Type: text/plain, Size: 1824 bytes --]

> > I took a look at both items since this discussion started. And honestly,
> > I think 1) is harder than 4), so you're welcome to work on it :) The
> > points that make it harder are that, per David's suggestion, there needs
> > to be:
> >   - define two new transaction types
>
> one new transaction type:
>
> 	XFS_TRANS_AGF_FLAGS

done

> and an extension to xfs_alloc_log_agf(). That is about all that is
> needed there.

still to do. Will come after the ioctls.

> See the patch here:
>
> http://oss.sgi.com/archives/xfs/2007-04/msg00103.html
>
> For an example of a very similar transaction to what is needed
> (look at xfs_log_sbcount()) and very similar addition to
> the AGF (xfs_btreeblks).
>
> >   - define two new ioctls
>
> 	XFS_IOC_ALLOC_ALLOW_AG, parameter xfs_agnumber_t.
> 	XFS_IOC_ALLOC_DENY_AG, parameter xfs_agnumber_t.

almost done. How should I obtain a pointer to an xfs_agf_t from
inside the ioctls?

I guess that the first step is to get a *bp with xfs_getsb and then an *sbp,
but, which function/macro gives me the xfs_agf_t pointer? Sorry, I can't
find the way grepping through the code.

> >   - update the ondisk-format (!), if we want persistence of these flags;
> >     luckily, there are two spare fields in the AGF structure.
>
> Better to expand, I think. The AGF is a sector in length - we can
> expand the structure as we need to this size without fear, esp. as
> the part of the sector outside the structure is guaranteed to be
> zero. i.e. we can add a flags field to the end of the AGF
> structure - old filesystems simply read as "no flags set" and
> old kernels never look at those bits....

done.

> >   - check the list of allocation functions that allocate space from the
> >     AG

still to be done.

Thanks again for the help.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink functionality
  2007-06-14 8:35 ` Ruben Porras
@ 2007-06-14 9:14 ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-14 9:14 UTC (permalink / raw)
To: Ruben Porras; +Cc: David Chinner, xfs, cw, iusty

On Thu, Jun 14, 2007 at 10:35:27AM +0200, Ruben Porras wrote:
> > > I took a look at both items since this discussion started. And honestly,
> > > I think 1) is harder than 4), so you're welcome to work on it :) The
> > > points that make it harder are that, per David's suggestion, there needs
> > > to be:
> > >   - define two new transaction types
> >
> > one new transaction type:
> >
> > 	XFS_TRANS_AGF_FLAGS
>
> done
>
> > and an extension to xfs_alloc_log_agf(). That is about all that is
> > needed there.
>
> still to do. Will come after the ioctls.
>
> > See the patch here:
> >
> > http://oss.sgi.com/archives/xfs/2007-04/msg00103.html
> >
> > For an example of a very similar transaction to what is needed
> > (look at xfs_log_sbcount()) and very similar addition to
> > the AGF (xfs_btreeblks).
> >
> > >   - define two new ioctls
> >
> > 	XFS_IOC_ALLOC_ALLOW_AG, parameter xfs_agnumber_t.
> > 	XFS_IOC_ALLOC_DENY_AG, parameter xfs_agnumber_t.
>
> almost done.

FWIW, I've had second thoughts on this ioctl interface. It's
horribly specific, considering all we are doing is setting or
clearing a flag in an AG.

Perhaps a better interface is:

	XFS_IOC_GET_AGF_FLAGS
	XFS_IOC_SET_AGF_FLAGS

with:

	struct xfs_ioc_agflags {
		xfs_agnumber_t	ag;
		__u32		flags;
	}

As the parameter structure and:

#define XFS_AGF_FLAGS_ALLOC_DENY	(1<<0)

> How should I obtain a pointer to an xfs_agf_t from
> inside the ioctls?
>
> I guess that the first step is to get a *bp with xfs_getsb and then an *sbp,
> but, which function/macro gives me the xfs_agf_t pointer? Sorry, I can't
> find the way grepping through the code.

I've attached the quick hack I did when thinking this through
initially. It'll give you an idea of how to do this and a bit more.
FWIW, it was this hack that made me think the above interface
is a better way to go....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_ioctl.c |   27 +++++++++++
 fs/xfs/xfs_ag.h              |    7 ++
 fs/xfs/xfs_alloc.c           |  103 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fs.h              |    2
 fs/xfs/xfs_trans.h           |    3 -
 5 files changed, 140 insertions(+), 2 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ioctl.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_ioctl.c	2007-06-08 21:34:37.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ioctl.c	2007-06-08 22:22:59.305412098 +1000
@@ -899,6 +899,33 @@ xfs_ioctl(
 		return -error;
 	}
 
+	case XFS_IOC_ALLOC_DENY_AG: {
+		xfs_agnumber_t in;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		if (copy_from_user(&in, arg, sizeof(in)))
+			return -XFS_ERROR(EFAULT);
+
+		error = xfs_alloc_deny_ag(mp, &in);
+		return -error;
+
+	}
+	case XFS_IOC_ALLOC_ALLOW_AG: {
+		xfs_agnumber_t in;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		if (copy_from_user(&in, arg, sizeof(in)))
+			return -XFS_ERROR(EFAULT);
+
+		error = xfs_alloc_allow_ag(mp, &in);
+		return -error;
+
+	}
+
 	case XFS_IOC_FREEZE:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h	2007-06-08 21:46:28.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h	2007-06-08 22:09:18.323606142 +1000
@@ -69,6 +69,7 @@ typedef struct xfs_agf {
 	__be32		agf_freeblks;	/* total free blocks */
 	__be32		agf_longest;	/* longest free space */
 	__be32		agf_btreeblks;	/* # of blocks held in AGF btrees */
+	__be32		agf_flags;	/* status flags */
 } xfs_agf_t;
 
 #define	XFS_AGF_MAGICNUM	0x00000001
@@ -83,9 +84,12 @@ typedef struct xfs_agf {
 #define	XFS_AGF_FREEBLKS	0x00000200
 #define	XFS_AGF_LONGEST		0x00000400
 #define	XFS_AGF_BTREEBLKS	0x00000800
-#define	XFS_AGF_NUM_BITS	12
+#define	XFS_AGF_FLAGS		0x00001000
+#define	XFS_AGF_NUM_BITS	13
 #define	XFS_AGF_ALL_BITS	((1 << XFS_AGF_NUM_BITS) - 1)
+
+
 
 /* disk block (xfs_daddr_t) in the AG */
 #define XFS_AGF_DADDR(mp)	((xfs_daddr_t)(1 << (mp)->m_sectbb_log))
 #define	XFS_AGF_BLOCK(mp)	XFS_HDR_BLOCK(mp, XFS_AGF_DADDR(mp))
@@ -189,6 +193,7 @@ typedef struct xfs_perag
 	xfs_extlen_t	pagf_freeblks;	/* total free blocks */
 	xfs_extlen_t	pagf_longest;	/* longest free space */
 	__uint32_t	pagf_btreeblks;	/* # of blocks held in AGF btrees */
+	__uint32_t	pagf_flags;	/* status flags for AG */
 	xfs_agino_t	pagi_freecount;	/* number of free inodes */
 	xfs_agino_t	pagi_count;	/* number of allocated inodes */
 	int		pagb_count;	/* pagb slots in use */
Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.c	2007-06-05 22:12:50.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.c	2007-06-08 23:12:51.256348632 +1000
@@ -2085,6 +2085,7 @@ xfs_alloc_log_agf(
 		offsetof(xfs_agf_t, agf_freeblks),
 		offsetof(xfs_agf_t, agf_longest),
 		offsetof(xfs_agf_t, agf_btreeblks),
+		offsetof(xfs_agf_t, agf_flags),
 		sizeof(xfs_agf_t)
 	};
 
@@ -2112,6 +2113,107 @@ xfs_alloc_pagf_init(
 	return 0;
 }
 
+#define XFS_AGFLAG_ALLOC_DENY	1
+STATIC void
+xfs_alloc_set_flag_ag(
+	xfs_trans_t	*tp,
+	xfs_buf_t	*agbp,	/* buffer for a.g. freelist header */
+	xfs_perag_t	*pag,
+	int		flag)
+{
+	xfs_agf_t	*agf;	/* a.g. freespace structure */
+
+	agf = XFS_BUF_TO_AGF(agbp);
+	pag->pagf_flags |= flag;
+	agf->agf_flags = cpu_to_be32(pag->pagf_flags);
+
+	xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLAGS);
+}
+
+STATIC void
+xfs_alloc_clear_flag_ag(
+	xfs_trans_t	*tp,
+	xfs_buf_t	*agbp,	/* buffer for a.g. freelist header */
+	xfs_perag_t	*pag,
+	int		flag)
+{
+	xfs_agf_t	*agf;	/* a.g. freespace structure */
+
+	agf = XFS_BUF_TO_AGF(agbp);
+	pag->pagf_flags &= ~flag;
+	agf->agf_flags = cpu_to_be32(pag->pagf_flags);
+
+	xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLAGS);
+}
+
+int
+xfs_alloc_allow_ag(
+	xfs_mount_t	*mp,
+	xfs_agnumber_t	agno)
+{
+	xfs_perag_t	*pag;
+	xfs_buf_t	*bp;
+	int		error;
+	xfs_trans_t	*tp;
+
+	if (agno >= mp->m_sb.sb_agcount)
+		return -EINVAL;
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_ALLOC_FLAGS);
+	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
+					XFS_DEFAULT_LOG_COUNT);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
+	if (error)
+		return error;
+
+	pag = &mp->m_perag[agno];
+	xfs_alloc_clear_flag_ag(tp, bp, pag, XFS_AGFLAG_ALLOC_DENY);
+
+	xfs_trans_set_sync(tp);
+	xfs_trans_commit(tp, 0);
+
+	return 0;
+
+}
+
+int
+xfs_alloc_deny_ag(
+	xfs_mount_t	*mp,
+	xfs_agnumber_t	agno)
+{
+	xfs_perag_t	*pag;
+	xfs_buf_t	*bp;
+	int		error;
+	xfs_trans_t	*tp;
+
+	if (agno >= mp->m_sb.sb_agcount)
+		return -EINVAL;
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_ALLOC_FLAGS);
+	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
+					XFS_DEFAULT_LOG_COUNT);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
+	if (error)
+		return error;
+
+	pag = &mp->m_perag[agno];
+	xfs_alloc_set_flag_ag(tp, bp, pag, XFS_AGFLAG_ALLOC_DENY);
+
+	xfs_trans_set_sync(tp);
+	xfs_trans_commit(tp, 0);
+
+	return 0;
+
+}
+
 /*
  * Put the block on the freelist for the allocation group.
  */
@@ -2226,6 +2328,7 @@ xfs_alloc_read_agf(
 		pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
 		pag->pagf_flcount = be32_to_cpu(agf->agf_flcount);
 		pag->pagf_longest = be32_to_cpu(agf->agf_longest);
+		pag->pagf_flags = be32_to_cpu(agf->agf_flags);
 		pag->pagf_levels[XFS_BTNUM_BNOi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
 		pag->pagf_levels[XFS_BTNUM_CNTi] =
Index: 2.6.x-xfs-new/fs/xfs/xfs_trans.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans.h	2007-06-08 21:41:32.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_trans.h	2007-06-08 22:50:46.449405162 +1000
@@ -95,7 +95,8 @@ typedef struct xfs_trans_header {
 #define	XFS_TRANS_GROWFSRT_FREE		39
 #define	XFS_TRANS_SWAPEXT		40
 #define	XFS_TRANS_SB_COUNT		41
-#define	XFS_TRANS_TYPE_MAX		41
+#define	XFS_TRANS_ALLOC_FLAGS		42
+#define	XFS_TRANS_TYPE_MAX		42
 /* new transaction types need to be reflected in xfs_logprint(8) */
 
 
Index: 2.6.x-xfs-new/fs/xfs/xfs_fs.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_fs.h	2007-06-08 21:46:29.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_fs.h	2007-06-08 23:15:31.755284394 +1000
@@ -493,6 +493,8 @@ typedef struct xfs_handle {
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_ALLOC_DENY_AG	     _IOR ('X', 126, __uint32_t)
+#define XFS_IOC_ALLOC_ALLOW_AG	     _IOR ('X', 127, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 

^ permalink raw reply [flat|nested] 21+ messages in thread
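The generic GET/SET flags interface proposed in the message above can be modelled in plain user-space C to show why it subsumes the two specific allow/deny calls. This is only a sketch under stated assumptions: `agflags_req` mirrors the proposed `struct xfs_ioc_agflags`, while the `model_*` functions, the fixed AG count and the in-memory flags array are hypothetical stand-ins for the real ioctl handlers, the transaction and the AGF buffer, none of which are modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define AGF_FLAGS_ALLOC_DENY (1u << 0) /* mirrors the proposed XFS_AGF_FLAGS_ALLOC_DENY */
#define MODEL_AGCOUNT 4                /* hypothetical AG count for the model */

/* Parameter block modelled on the proposed struct xfs_ioc_agflags. */
struct agflags_req {
    uint32_t ag;
    uint32_t flags;
};

/* In-memory stand-in for the per-AG flags word (pagf_flags in the patch). */
static uint32_t model_pagf_flags[MODEL_AGCOUNT];

/* Model of a "get flags" call: fills req->flags, returns 0 or EINVAL. */
static int model_get_agf_flags(struct agflags_req *req)
{
    assert(req != NULL);
    if (req->ag >= MODEL_AGCOUNT)
        return EINVAL;
    req->flags = model_pagf_flags[req->ag];
    return 0;
}

/* Model of a "set flags" call: replaces the whole flags word for one AG. */
static int model_set_agf_flags(const struct agflags_req *req)
{
    assert(req != NULL);
    if (req->ag >= MODEL_AGCOUNT)
        return EINVAL;
    model_pagf_flags[req->ag] = req->flags;
    return 0;
}

/* Deny or allow allocation in one AG using only get followed by set. */
static int model_set_alloc_deny(uint32_t ag, int deny)
{
    struct agflags_req req = { ag, 0 };
    int error = model_get_agf_flags(&req);

    if (error)
        return error;
    if (deny)
        req.flags |= AGF_FLAGS_ALLOC_DENY;
    else
        req.flags &= ~AGF_FLAGS_ALLOC_DENY;
    return model_set_agf_flags(&req);
}
```

With this shape, adding another per-AG flag later needs a new bit definition rather than a new pair of ioctls. Note that a read-modify-write done by the caller is not atomic; how the real interface would handle concurrent updates is not addressed in the thread.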
* Re: XFS shrink functionality
  2007-06-08 8:23 ` Ruben Porras
  2007-06-08 10:15   ` Iustin Pop
@ 2007-06-08 14:44 ` David Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-08 14:44 UTC (permalink / raw)
To: Ruben Porras; +Cc: Iustin Pop, David Chinner, xfs, cw

On Fri, Jun 08, 2007 at 10:23:53AM +0200, Ruben Porras wrote:
> On Monday, 04.06.2007, at 10:41 +0200, Iustin Pop wrote:
> > Good to know. If there is at least more documentation about the
> > internals, I could try to find some time to work on this again.
>
> there is now a document explaining the XFS on disk format [0] and some
> presentations for training courses, I think none of these were available
> at the time you made the first try. Although they are not enough for our
> purpose.

There's thousands of lines of code documenting that format as well ;)

> > My suggestion would be to start implementing these steps in reverse. 4)
> > is the most important as it touches the entire FS. If 4) is working
> > correctly, then 1) would be simpler (I think)
>
> Why do you think that 1) would be simpler after 4)? From what I
> understand, they are independent.
>
> 3) worries me, if walking the entire filesystem is needed, it won't
> scale...

I think walking the filesystem can be avoided effectively by
introducing a reverse map that points to the owner of the block.
(i.e. another btree). Reverse mapping provides other benefits as
well e.g. somewhere to put block checksums and more information for
repair and scrubbing.

The hard part is the moving of metadata. I haven't really thought
deeply on the best method for this; there's lots of options and I
don't know what is the best way to proceed there yet. That's not
something I need to think about right now, though ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink (step 0)
  2007-06-04 0:16 ` David Chinner
  2007-06-04 8:41   ` Iustin Pop
@ 2007-06-19 22:22 ` Ruben Porras
  2007-06-19 23:42   ` David Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-19 22:22 UTC (permalink / raw)
To: David Chinner; +Cc: xfs, iusty

On Monday, 04.06.2007, at 10:16 +1000, David Chinner wrote:

> Here's the "simple" bits that will allow you to shrink
> the filesystem down to the end of the internal log:
>
> 	0. Check space is available for shrink

Now that I'm almost* finished with point 1), is there any place in the
xfs code where a similar task is done? This way I would have a basis to
start off.

Cheers.

* I only need to fix the indentation, and change the ioctl interface as
David suggested in another mail in this thread, so that the
implementation is not so specific.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink (step 0)
  2007-06-19 22:22 ` XFS shrink (step 0) Ruben Porras
@ 2007-06-19 23:42 ` David Chinner
  2007-06-28 10:38   ` Ruben Porras
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-06-19 23:42 UTC (permalink / raw)
To: Ruben Porras; +Cc: David Chinner, xfs, iusty

On Wed, Jun 20, 2007 at 12:22:31AM +0200, Ruben Porras wrote:
> On Monday, 04.06.2007, at 10:16 +1000, David Chinner wrote:
>
> > Here's the "simple" bits that will allow you to shrink
> > the filesystem down to the end of the internal log:
> >
> > 	0. Check space is available for shrink
>
> Now that I'm almost* finished with point 1),

Cool ;)

> is there any place in the
> xfs code where a similar task is done? This way I would have a basis to
> start off.

No, there isn't anything currently in existence to do this.

It's not difficult, though. What you need to do is count the number of
used blocks in the AGs that will be truncated off, and check whether
there is enough free space in the remaining AGs to hold all the
blocks that we are going to move.

I think this could be done with a single loop across the perag
array or with a simple xfs_db wrapper and some shell/awk/perl
magic.

e.g: Here's the basis:

budgie:~ # for i in `seq 0 1 7`; do
> xfs_db -r -c "agf $i" -c "p freeblks" -c "p btreeblks" /dev/sdb8
> done
freeblks = 32779
btreeblks = 0
freeblks = 63003
btreeblks = 0
freeblks = 124423
btreeblks = 0
freeblks = 114516
btreeblks = 0
freeblks = 126602
btreeblks = 0
freeblks = 125905
btreeblks = 0
freeblks = 127886
btreeblks = 0
freeblks = 125445
btreeblks = 0

Now all you need to extract is the size of each AG from the superblock,
determine which AGs are going to be freed, and do some math ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
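The check described in the message above (count the used blocks in the AGs to be cut off, then compare against the free space in the AGs that remain) can be sketched as a single loop in user-space C. The `ag_summary` struct and `shrink_has_space()` are hypothetical names introduced for illustration; the inputs are assumed to be the per-AG length and the `freeblks` value that xfs_db prints. This is a deliberately rough model: it treats everything that is not free in a removed AG as data to move (which over-counts, since AG headers do not move) and it ignores fragmentation and any reserved blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG summary, filled from values such as those xfs_db prints. */
struct ag_summary {
    uint64_t length;   /* AG size in filesystem blocks */
    uint64_t freeblks; /* free blocks reported by the AGF */
};

/*
 * Return 1 if the used blocks in AGs [first_removed, agcount) fit into the
 * free space of AGs [0, first_removed), and 0 otherwise.
 */
static int shrink_has_space(const struct ag_summary *ags, size_t agcount,
                            size_t first_removed)
{
    uint64_t used_to_move = 0;
    uint64_t free_remaining = 0;
    size_t i;

    assert(first_removed <= agcount);
    for (i = 0; i < agcount; i++) {
        if (i < first_removed)
            free_remaining += ags[i].freeblks;
        else
            used_to_move += ags[i].length - ags[i].freeblks;
    }
    return used_to_move <= free_remaining;
}
```

For example, with three AGs of 1000 blocks holding 400, 300 and 500 free blocks, removing only the last AG needs 500 blocks moved into 700 free ones and passes, while removing the last two needs 1200 blocks moved into 400 and fails.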
* Re: XFS shrink (step 0)
  2007-06-19 23:42 ` David Chinner
@ 2007-06-28 10:38 ` Ruben Porras
  2007-06-29 6:55   ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-28 10:38 UTC (permalink / raw)
Cc: xfs, iusty

David Chinner wrote:
> No, there isn't anything currently in existence to do this.
>
> It's not difficult, though. What you need to do is count the number of
> used blocks in the AGs that will be truncated off, and check whether
> there is enough free space in the remaining AGs to hold all the
> blocks that we are going to move.
>
> I think this could be done with a single loop across the perag
> array or with a simple xfs_db wrapper and some shell/awk/perl
> magic.
>
Do you mean that it is ok to depend on shell/awk/perl?

I'll do it in C looping through the perag array.

-- 
Rubén Porras
LinWorks GmbH

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink (step 0)
  2007-06-28 10:38 ` Ruben Porras
@ 2007-06-29 6:55 ` David Chinner
  2007-07-30 17:30   ` Ruben Porras
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-06-29 6:55 UTC (permalink / raw)
To: Ruben Porras; +Cc: xfs, iusty

On Thu, Jun 28, 2007 at 12:38:44PM +0200, Ruben Porras wrote:
> David Chinner wrote:
> >No, there isn't anything currently in existence to do this.
> >
> >It's not difficult, though. What you need to do is count the number of
> >used blocks in the AGs that will be truncated off, and check whether
> >there is enough free space in the remaining AGs to hold all the
> >blocks that we are going to move.
> >
> >I think this could be done with a single loop across the perag
> >array or with a simple xfs_db wrapper and some shell/awk/perl
> >magic.
> >
> Do you mean that it is ok to depend on shell/awk/perl?

Sure. We have a few programs that are just shell wrappers of
other xfs programs e.g:

	xfs_bmap: shell script that calls xfs_io
	xfs_check: shell script that calls xfs_db
	xfs_info: shell script that calls xfs_growfs

> I'll do it in C looping through the perag array.

For something like this it's probably easier to do with shell/perl/awk.

e.g. in shell, the number of ags in the filesystem:

iterate all ags:

numags=`xfs_db -r -c "sb 0" -c "p agcount" /dev/sdb8 | sed -e 's/.* = //'`
lastag=`expr $numags - 1`
for ags in `seq 0 1 $lastag`; do
	....
done

Free space in AG 0:

xfs_db -r -c "freesp -s -a 0" /dev/sdb8 | awk '/total free blocks/ {print $4}'

And so on. You can peek into pretty much any structure on disk with
xfs_db and you can do it online so it's pretty much perfect for this
sort of checking. I'd start with something like this, and if it
gets too complex then we need to look at integrating it into xfs_db
(i.e. writing it in C)....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: XFS shrink (step 0)
  2007-06-29 6:55 ` David Chinner
@ 2007-07-30 17:30 ` Ruben Porras
  0 siblings, 0 replies; 21+ messages in thread
From: Ruben Porras @ 2007-07-30 17:30 UTC (permalink / raw)
To: David Chinner; +Cc: xfs, iusty

[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]

On Friday, 29.06.2007, at 16:55 +1000, David Chinner wrote:
> On Thu, Jun 28, 2007 at 12:38:44PM +0200, Ruben Porras wrote:

> For something like this it's probably easier to do with shell/perl/awk.
>
> e.g. in shell, the number of ags in the filesystem:
>
> iterate all ags:
>
> numags=`xfs_db -r -c "sb 0" -c "p agcount" /dev/sdb8 | sed -e 's/.* = //'`
> lastag=`expr $numags - 1`
> for ags in `seq 0 1 $lastag`; do
> 	....
> done
>
> Free space in AG 0:
>
> xfs_db -r -c "freesp -s -a 0" /dev/sdb8 | awk '/total free blocks/ {print $4}'

I decided to calculate the free space in an AG directly as the space in
"freeblks" - "btreeblks".

Attached is a perl script that calculates the free space of a whole
filesystem. It's easy to modify it to get the free space from a range of
AGs, so unless there are errors, I'm leaving it on the mailing list as an
example, and I'll adapt it later as needed.

I would like to start with step number 2. What is the state of the
program xfs_reno.c? Can it be released in the near future as GPL, or
should I rather go ahead for now with point number 3 (that is, move data
out of offline AGs)?

[-- Attachment #2: freecount.pl --]
[-- Type: application/x-perl, Size: 1076 bytes --]

^ permalink raw reply [flat|nested] 21+ messages in thread