XFS shrink functionality

public inbox for linux-xfs@vger.kernel.org

* XFS shrink functionality
@ 2007-06-01 16:39 Ruben Porras
  2007-06-04  0:16 ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-01 16:39 UTC (permalink / raw)
  To: xfs; +Cc: iusty, cw

Hello, 

I'm investigating the possibility of writing the necessary code to
shrink an XFS filesystem myself (I'd be able to dedicate a day/week). While
checking whether something had already been done, I came across the mails
of a previous attempt [1], [2] (I'm cc'ing the people involved).

At first glance the patch is a little outdated and will no longer apply
(as of Linux 2.6.18, which is the last customised kernel that I was
able to run under a Xen environment), because at least the function
xfs_fs_geometry has changed.

I'm really curious about what happened to these patches and why they were
discontinued. The second part was never made public, and there was also
no answer. Was there any flaw in the posted code, or anything in
XFS that makes it especially hard to shrink [3], that discouraged the
development?

After that, the first questions that arise are:
would there be some assistance/groove in from the developers? 
How doable is it?
What are the programmers requirements from your point of view?

Thank you.

[1] http://oss.sgi.com/archives/xfs/2005-08/msg00142.html
[2] http://oss.sgi.com/archives/xfs/2005-09/msg00038.html
[3] the only limitation that I might think of is not being able to
shrink past the internal journal.


* Re: XFS shrink functionality
  2007-06-01 16:39 XFS shrink functionality Ruben Porras
@ 2007-06-04  0:16 ` David Chinner
  2007-06-04  8:41   ` Iustin Pop
  2007-06-19 22:22   ` XFS shrink (step 0) Ruben Porras
  0 siblings, 2 replies; 21+ messages in thread
From: David Chinner @ 2007-06-04  0:16 UTC (permalink / raw)
  To: Ruben Porras; +Cc: xfs, iusty, cw

On Fri, Jun 01, 2007 at 06:39:34PM +0200, Ruben Porras wrote:
> Hello, 
> 
> I'm investigating the possibility of writing the necessary code to
> shrink an XFS filesystem myself (I'd be able to dedicate a day/week). While
> checking whether something had already been done, I came across the mails
> of a previous attempt [1], [2] (I'm cc'ing the people involved).

Oh, thanks for pointing those out - they're before my time ;)

> At first glance the patch is a little outdated and will no longer apply
> (as of Linux 2.6.18, which is the last customised kernel that I was
> able to run under a Xen environment), because at least the function
> xfs_fs_geometry has changed.

Any work for this would need to be done against current mainline
of the xfs-dev tree.

Yes, that patch is out of date, and it also did things that were not
necessary i.e. walk btrees to work out if AGs are empty or not.

> I'm really curious about what happened to these patches and why they were
> discontinued. The second part was never made public, and there was also
> no answer. Was there any flaw in the posted code, or anything in
> XFS that makes it especially hard to shrink [3], that discouraged the
> development?

The posted code is only a *tiny* part of the shrink problem.

> After that, the first questions that arise are:
> would there be some assistance/groove in from the developers? 

Certainly there's help available. ;)

> How doable is it?

It is doable.

> What are the programmers requirements from your point of view?

Here's the "simple" bits that will allow you to shrink
the filesystem down to the end of the internal log:

	0. Check space is available for shrink

	1. Mark allocation groups as "don't use - going away soon"
		- so we don't put new stuff in them while we
		  are moving all the bits out of them
		- requires hooks in the allocators to prevent
		  the AG from being selected for allocations
		- must still allow allocations for the free lists
		  so that extent freeing can succeed
		- *new transaction required*.
		- also needs an "undo" (e.g. on partial failure)
		  so we need to be able to mark allocation groups
		  online again.

	2. Move inodes out of offline AGs
		- On Irix, we have a program called 'xfs_reno' which
		  converts 64 bit inode filesystems to 32 bit inode
		  filesystems. This needs to be:
		  	- released under the GPL (should not be a problem).
			- ported to linux
			- modified to understand inodes sit in certain
			  AGs and to move them out of those AGs as needed.
			- requires filesystem traversal to find all the
			  inodes to be moved.

		  % wc -l xfs_reno.c
		  1991 xfs_reno.c

		- even with "-o ikeep", this needs to trigger inode cluster
		  deletion in offline AGs (needs hooks in xfs_ifree()).
	
	3. Move data out of offline AGs.
		- this is difficult to do efficiently as we do not have
		  a block-to-owner reverse mapping in the filesystem.
		  Hence requires a walk of the *entire* filesystem to find
		  the owners of data blocks in the AGs being offlined.
		- xfs_db wrapper might be the best way to do this...

	<AGs are now empty>

	4. Execute shrink
		- new transaction - XFS_TRANS_SHRINKFS
		- check AGs are empty
		  	- icount == 0
			- freeblks == mp->m_sb.sb_agblocks
			  (will be a little more than this)
		- check shrink won't go past end of internal log
		- free AGs, updating superblock fields
		- update perag structure
		  	- not a simple realloc() as there may
			  be other threads using the structure at the
			  same time....
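
As a rough illustration of the emptiness check in step 4, here is a minimal
C sketch. The struct and the overhead_blocks parameter are hypothetical
stand-ins, not actual XFS structures; a real check would read the AGI/AGF
counters and account for whatever fixed per-AG blocks never show up as free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-AG counters; stand-ins for values read from the AGI/AGF. */
struct ag_counters_sketch {
	uint32_t icount;	/* allocated inodes in this AG */
	uint32_t freeblks;	/* free blocks in this AG */
};

/*
 * Return true if the AG looks empty enough to be removed.
 * agblocks:        size of the AG in blocks (sb_agblocks above)
 * overhead_blocks: assumed number of blocks that are never free
 *                  (AG headers and the like); an assumption of this sketch
 */
static bool ag_is_empty(const struct ag_counters_sketch *ag,
			uint32_t agblocks, uint32_t overhead_blocks)
{
	assert(overhead_blocks <= agblocks);
	return ag->icount == 0 &&
	       ag->freeblks == agblocks - overhead_blocks;
}
```

The shrink transaction would run a check like this for every AG it is about
to free, and abort if any of them fails.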


Initially, I'd say just support shrinking to whole AGs - you've got to empty
the whole "partial-last-ag" to ensure we can shrink it anyway, so doing
a subsequent grow operation to increase the size afterwards should be trivial.
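
The whole-AG rule can be sketched in C as follows. This is only an
illustration under stated assumptions: shrink_agcount() and log_agno (the AG
assumed to hold the internal log) are hypothetical names, not existing XFS
code. The requested size is rounded down to whole AGs, and the request is
refused if it would remove the AG containing the log.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the AG count for a whole-AG shrink.
 * Returns 0 and stores the new count, or -1 if the request is not a
 * shrink or would remove the AG holding the internal log.
 * log_agno is assumed to be the AG that contains the internal log.
 */
static int shrink_agcount(uint64_t target_blocks, uint32_t agblocks,
			  uint32_t cur_agcount, uint32_t log_agno,
			  uint32_t *new_agcount)
{
	uint64_t n;

	assert(new_agcount);
	if (agblocks == 0)
		return -1;
	n = target_blocks / agblocks;	/* round down to whole AGs */
	if (n >= cur_agcount)
		return -1;		/* not a shrink */
	if (n <= log_agno)
		return -1;		/* would cut off the internal log */
	*new_agcount = (uint32_t)n;
	return 0;
}
```

After shrinking to the returned AG count, a normal grow would bring the
filesystem back up to the exact requested size.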

Once this all works, we can then tackle the "move the log" problem which will
allow you to shrink to much smaller sizes.

As you can see, doing a shrink properly is not trivial, which is probably
why it hasn't gone anywhere fast....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: XFS shrink functionality
  2007-06-04  0:16 ` David Chinner
@ 2007-06-04  8:41   ` Iustin Pop
  2007-06-04  9:21     ` David Chinner
  2007-06-08  8:23     ` Ruben Porras
  2007-06-19 22:22   ` XFS shrink (step 0) Ruben Porras
  1 sibling, 2 replies; 21+ messages in thread
From: Iustin Pop @ 2007-06-04  8:41 UTC (permalink / raw)
  To: David Chinner; +Cc: Ruben Porras, xfs, cw

Disclaimer: all the below is based on my weak understanding of the code,
I don't claim I'm right below. 

On Mon, Jun 04, 2007 at 10:16:32AM +1000, David Chinner wrote:
> Any work for this would need to be done against current mainline
> of the xfs-dev tree.
> 
> Yes, that patch is out of date, and it also did things that were not
> necessary i.e. walk btrees to work out if AGs are empty or not.

Well, I did what I could based on my own understanding of the code.
Sorry if it's ugly :)

> > I'm really curious about what happened to these patches and why they were
> > discontinued. The second part was never made public, and there was also
> > no answer. Was there any flaw in the posted code, or anything in
> > XFS that makes it especially hard to shrink [3], that discouraged the
> > development?
> 
> The posted code is only a *tiny* part of the shrink problem.

My idea at that time was to start small and be able to shrink an empty
filesystem (or empty at least regarding the AGs that you want to clear).

The point is that if AGs are lockable outside of a transaction
(something like the freeze/unfreeze functionality at the fs level), then
by simply copying the conflicting files you ensure that they are
allocated on an available AG and when you remove the originals, the
to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
simplest way to do it based on what I knew at the time.

> > After that, the first questions that arise are:
> > would there be some assistance/groove in from the developers? 
> 
> Certainly there's help available. ;)
Good to know. If there is at least more documentation about the
internals, I could try to find some time to work on this again.
> 
> > What are the programmers requirements from your point of view?
> 
> Here's the "simple" bits that will allow you to shrink
> the filesystem down to the end of the internal log:
> 
> 	0. Check space is available for shrink
Can be done by actually allocating the space to be freed at the
beginning of the transaction. Right? This is actually a bit more than
needed, since when freeing an AG you also free some non-available space,
but it's ok.

> 	1. Mark allocation groups as "don't use - going away soon"
> 		- so we don't put new stuff in them while we
> 		  are moving all the bits out of them
> 		- requires hooks in the allocators to prevent
> 		  the AG from being selected for allocations
> 		- must still allow allocations for the free lists
> 		  so that extent freeing can succeed
> 		- *new transaction required*.
> 		- also needs an "undo" (e.g. on partial failure)
> 		  so we need to be able to mark allocation groups
> 		  online again.

So a question: can transactions be nested? Because the offline AG
transaction needs to live until the shrink transaction is done. I was
more thinking that the offline-AG should be a bit on the AG that could
be changed by the admin (like xfs_freeze); this could also help for
other reasons than shrink (when on a big FS some AGs lie on a physical
device and others on a different device, and you would like to restrict
writes to a given AG, as much as possible).

> 	2. Move inodes out of offline AGs
> 		- On Irix, we have a program called 'xfs_reno' which
> 		  converts 64 bit inode filesystems to 32 bit inode
> 		  filesystems. This needs to be:
> 		  	- released under the GPL (should not be a problem).
> 			- ported to linux
> 			- modified to understand inodes sit in certain
> 			  AGs and to move them out of those AGs as needed.
> 			- requires filesystem traversal to find all the
> 			  inodes to be moved.
Interesting. I've read about this on the mailing list before, but without
any other details.

> 
> 		  % wc -l xfs_reno.c
> 		  1991 xfs_reno.c
> 
> 		- even with "-o ikeep", this needs to trigger inode cluster
> 		  deletion in offline AGs (needs hooks in xfs_ifree()).
This part (removal of inodes) is not actually needed if the icount ==
ifree (I presume this means that all the existing inodes are free).

> 	3. Move data out of offline AGs.
> 		- this is difficult to do efficiently as we do not have
> 		  a block-to-owner reverse mapping in the filesystem.
> 		  Hence requires a walk of the *entire* filesystem to find
> 		  the owners of data blocks in the AGs being offlined.
> 		- xfs_db wrapper might be the best way to do this...
> 
> 	<AGs are now empty>
> 
> 	4. Execute shrink
> 		- new transaction - XFS_TRANS_SHRINKFS
> 		- check AGs are empty
> 		  	- icount == 0
> 			- freeblks == mp->m_sb.sb_agblocks
> 			  (will be a little more than this)
> 		- check shrink won't go past end of internal log
> 		- free AGs, updating superblock fields
> 		- update perag structure
> 		  	- not a simple realloc() as there may
> 			  be other threads using the structure at the
> 			  same time....
> 

My suggestion would be to start implementing these steps in reverse. 4)
is the most important as it touches the entire FS. If 4) is working
correctly, then 1) would be simpler (I think) and 3) can be implemented
by just running a forced xfs_fsr against the conflicting files. I don't
know about 2).

Sorry if I'm blatantly wrong in my statements. Good to have more
information!

regards,
iustin

* Re: XFS shrink functionality
  2007-06-04  8:41   ` Iustin Pop
@ 2007-06-04  9:21     ` David Chinner
  2007-06-05  8:00       ` Iustin Pop
  2007-06-08  8:23     ` Ruben Porras
  1 sibling, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-06-04  9:21 UTC (permalink / raw)
  To: David Chinner, Ruben Porras, xfs, cw

On Mon, Jun 04, 2007 at 10:41:54AM +0200, Iustin Pop wrote:
> Disclaimer: all the below is based on my weak understanding of the code,
> I don't claim I'm right below. 
> 
> On Mon, Jun 04, 2007 at 10:16:32AM +1000, David Chinner wrote:
> > Any work for this would need to be done against current mainline
> > of the xfs-dev tree.
> > 
> > Yes, that patch is out of date, and it also did things that were not
> > necessary i.e. walk btrees to work out if AGs are empty or not.
> 
> Well, I did what I could based on my own understanding of the code.
> Sorry if it's ugly :)
> 
> > > I'm really curious about what happened to these patches and why they were
> > > discontinued. The second part was never made public, and there was also
> > > no answer. Was there any flaw in the posted code, or anything in
> > > XFS that makes it especially hard to shrink [3], that discouraged the
> > > development?
> > 
> > The posted code is only a *tiny* part of the shrink problem.
> 
> My idea at that time was to start small and be able to shrink an empty
> filesystem (or empty at least regarding the AGs that you want to clear).

Yes, that is one way of looking at it....

> The point is that if AGs are lockable outside of a transaction
> (something like the freeze/unfreeze functionality at the fs level), then
> by simply copying the conflicting files you ensure that they are

Copying is not good enough - attributes must remain unchanged.
The only thing we can't preserve is the inode number....

> allocated on an available AG and when you remove the originals, the
> to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
> simplest way to do it based on what I knew at the time.

Not quite that simple, unfortunately. You can't leave the
AGs locked in the same way we do for a grow because we need
to be able to use the AGs to move stuff about and that
requires locking them. Hence we need a separate mechanism
to prevent allocation in a given AG outside of locking them.

Hence we need:

	- a transaction to mark AGs "no-allocate"
	- a transaction to mark AGs "allocatable"
	- a flag in each AGF/AGI to say the AG is available for
	  allocations (persistent over crashes)
	- a flag in the per-ag structure to indicate allocation
	  status of the AG.
	- everywhere we select an AG for allocation, we need to
	  check this flag and skip the AG if it's not available.

FWIW, the transactions can probably just be an extension of
xfs_alloc_log_agf() and xfs_alloc_log_agi()....
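
To make the last two points concrete, here is a minimal C sketch of an AG
selection loop that honours such a flag. The struct perag_sketch, its
no_alloc field and pick_ag() are hypothetical names for illustration only;
they are not the existing per-ag structure or allocator code, and the
free-list refill case (which must still be allowed in a "no-allocate" AG) is
not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory per-AG state; not the real per-ag structure. */
struct perag_sketch {
	bool	 no_alloc;	/* AG is marked "don't use - going away soon" */
	uint32_t freeblks;	/* free blocks currently in this AG */
};

/*
 * Pick an AG for a new allocation of at least minblocks, rotating from
 * start_agno and skipping AGs that are marked no-allocate.
 * Returns the AG number, or -1 if no usable AG has enough space.
 */
static int pick_ag(const struct perag_sketch *pag, uint32_t agcount,
		   uint32_t start_agno, uint32_t minblocks)
{
	assert(pag || agcount == 0);
	for (uint32_t i = 0; i < agcount; i++) {
		uint32_t agno = (start_agno + i) % agcount;

		if (pag[agno].no_alloc)
			continue;	/* never place new data here */
		if (pag[agno].freeblks >= minblocks)
			return (int)agno;
	}
	return -1;
}
```

The point of the sketch is only that the check sits in the selection path:
an AG with plenty of free space is still passed over once it is marked.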


> > > What are the programmers requirements from your point of view?
> > 
> > Here's the "simple" bits that will allow you to shrink
> > the filesystem down to the end of the internal log:
> > 
> > 	0. Check space is available for shrink
> Can be done by actually allocating the space to be freed at the
> beginning of the transaction. Right?

No, I mean that you need to check that there is sufficient space in
the untouched AGs to move all the data from the AGs to be removed
into the remaining part of the filesystem. This is not part of a
transaction, but still a check that needs to be done before
starting....
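
A minimal C sketch of that pre-check, under simplifying assumptions: every
AG is taken to be agblocks in size (the last AG may be shorter in practice),
and freeblks[] is a hypothetical array of per-AG free block counts rather
than anything read from real AGF headers. Counting every non-free block as
"to move" overestimates slightly, which errs on the safe side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the blocks in use in the AGs being removed
 * (agno >= new_agcount) fit into the free space of the AGs that remain.
 */
static bool shrink_has_space(const uint32_t *freeblks, uint32_t agcount,
			     uint32_t new_agcount, uint32_t agblocks)
{
	uint64_t to_move = 0;	/* used blocks in the AGs being removed */
	uint64_t room = 0;	/* free blocks in the AGs that stay */

	assert(new_agcount <= agcount);
	for (uint32_t agno = 0; agno < agcount; agno++) {
		assert(freeblks[agno] <= agblocks);
		if (agno < new_agcount)
			room += freeblks[agno];
		else
			to_move += agblocks - freeblks[agno];
	}
	return to_move <= room;
}
```

Since this runs outside any transaction, the answer can go stale; it is a
sanity check before starting, not a guarantee.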

> > 	1. Mark allocation groups as "don't use - going away soon"
> > 		- so we don't put new stuff in them while we
> > 		  are moving all the bits out of them
> > 		- requires hooks in the allocators to prevent
> > 		  the AG from being selected for allocations
> > 		- must still allow allocations for the free lists
> > 		  so that extent freeing can succeed
> > 		- *new transaction required*.
> > 		- also needs an "undo" (e.g. on partial failure)
> > 		  so we need to be able to mark allocation groups
> > 		  online again.
> 
> So a question: can transactions be nested?

No.

> Because the offline AG
> transaction needs to live until the shrink transaction is done.

No it doesn't - the *state* needs to remain until we do the shrink,
the transaction only needs to live until it has hit the disk.

> I was
> more thinking that the offline-AG should be a bit on the AG that could
> be changed by the admin (like xfs_freeze); this could also help for
> other reasons than shrink (when on a big FS some AGs lie on a physical
> device and others on a different device, and you would like to restrict
> writes to a given AG, as much as possible).

Yes, that's exactly what I'm talking about ;)

> > 	2. Move inodes out of offline AGs
> > 		- On Irix, we have a program called 'xfs_reno' which
> > 		  converts 64 bit inode filesystems to 32 bit inode
> > 		  filesystems. This needs to be:
> > 		  	- released under the GPL (should not be a problem).
> > 			- ported to linux
> > 			- modified to understand inodes sit in certain
> > 			  AGs and to move them out of those AGs as needed.
> > 			- requires filesystem traversal to find all the
> > 			  inodes to be moved.
> Interesting. I've read about this on the mailing list before, but without
> any other details.
> 
> > 
> > 		  % wc -l xfs_reno.c
> > 		  1991 xfs_reno.c
> > 
> > 		- even with "-o ikeep", this needs to trigger inode cluster
> > 		  deletion in offline AGs (needs hooks in xfs_ifree()).
> This part (removal of inodes) is not actually needed if the icount ==
> ifree (I presume this means that all the existing inodes are free).

Yes, I guess that could be done - it means extra stuffing about when
doing the final shrink transaction, though. e.g. making sure that
free block counts update correctly given that the AGI btrees will
be consuming blocks - easier just to free the clusters as they
get emptied, I think....

> > 	3. Move data out of offline AGs.
> > 		- this is difficult to do efficiently as we do not have
> > 		  a block-to-owner reverse mapping in the filesystem.
> > 		  Hence requires a walk of the *entire* filesystem to find
> > 		  the owners of data blocks in the AGs being offlined.
> > 		- xfs_db wrapper might be the best way to do this...
> > 
> > 	<AGs are now empty>
> > 
> > 	4. Execute shrink
> > 		- new transaction - XFS_TRANS_SHRINKFS
> > 		- check AGs are empty
> > 		  	- icount == 0
> > 			- freeblks == mp->m_sb.sb_agblocks
> > 			  (will be a little more than this)
> > 		- check shrink won't go past end of internal log
> > 		- free AGs, updating superblock fields
> > 		- update perag structure
> > 		  	- not a simple realloc() as there may
> > 			  be other threads using the structure at the
> > 			  same time....
> > 
> 
> My suggestion would be to start implementing these steps in reverse. 4)
> is the most important as it touches the entire FS. If 4) is working
> correctly, then 1) would be simpler (I think) and 3) can be implemented
> by just running a forced xfs_fsr against the conflicting files. I don't
> know about 2).

Yeah, 1) and 4) are separable parts of the problem and can be done
in any order. 2) can be implemented relatively easily as stated
above.

3) is the hard one - we need to find the owner of each block
(metadata and data) remaining in the AGs to be removed. This may be
a directory btree block, an inode extent btree block, a data block,
an extended attr block, etc. Moving the data blocks is easy to
do (swap extents), but moving the metadata blocks is a major PITA
as it will need to be done transactionally and that will require
a bunch of new (complex) code to be written, I think. It will be
of equivalent complexity to defragmenting metadata....

If we ignore the metadata block problem then finding and moving the
data blocks should not be a problem - swap extents can be used for
that as well - but it will be extremely time consuming and won't
scale to large filesystem sizes....
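
For the data-block half, the per-file test is cheap even though the walk is
not. A minimal C sketch, assuming the usual XFS encoding where the AG number
sits in the bits of a filesystem block number above sb_agblklog, and assuming
an extent never crosses an AG boundary. The struct and function names are
hypothetical, for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one file extent. */
struct extent_sketch {
	uint64_t startblock;	/* filesystem block number of first block */
	uint32_t blockcount;	/* length of the extent in blocks */
};

/* AG number is carried in the bits above agblklog. */
static uint32_t fsb_to_agno(uint64_t fsbno, uint8_t agblklog)
{
	return (uint32_t)(fsbno >> agblklog);
}

/*
 * Count the extents of one file that live in AGs being removed
 * (agno >= first_removed_agno) and so need to be moved.
 */
static size_t count_extents_to_move(const struct extent_sketch *ext,
				    size_t n, uint8_t agblklog,
				    uint32_t first_removed_agno)
{
	size_t hits = 0;

	assert(ext || n == 0);
	for (size_t i = 0; i < n; i++)
		if (fsb_to_agno(ext[i].startblock, agblklog) >=
		    first_removed_agno)
			hits++;
	return hits;
}
```

The cost is in having to run this over the extent list of every inode in
the filesystem, which is the scaling problem described above.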

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: XFS shrink functionality
  2007-06-04  9:21     ` David Chinner
@ 2007-06-05  8:00       ` Iustin Pop
  2007-06-06  1:50         ` Nathan Scott
  2007-06-07  8:18         ` David Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Iustin Pop @ 2007-06-05  8:00 UTC (permalink / raw)
  To: David Chinner; +Cc: Ruben Porras, xfs, cw

On Mon, Jun 04, 2007 at 07:21:15PM +1000, David Chinner wrote:
> > allocated on an available AG and when you remove the originals, the
> > to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
> > simplest way to do it based on what I knew at the time.
> 
> Not quite that simple, unfortunately. You can't leave the
> AGs locked in the same way we do for a grow because we need
> to be able to use the AGs to move stuff about and that
> requires locking them. Hence we need a separate mechanism
> to prevent allocation in a given AG outside of locking them.
> 
> Hence we need:
> 
> 	- a transaction to mark AGs "no-allocate"
> 	- a transaction to mark AGs "allocatable"
> 	- a flag in each AGF/AGI to say the AG is available for
> 	  allocations (persistent over crashes)
> 	- a flag in the per-ag structure to indicate allocation
> 	  status of the AG.
> 	- everywhere we select an AG for allocation, we need to
> 	  check this flag and skip the AG if it's not available.
> 
> FWIW, the transactions can probably just be an extension of
> xfs_alloc_log_agf() and xfs_alloc_log_agi()....

A question: do you think that the cost of having this in the code
(especially the last part, check that flag in every allocation function)
is acceptable? I mean, let's say one would write the patch to implement
all this. Does it have a chance to be accepted? Or will people say it's
only bloat? ...


> > I was
> > more thinking that the offline-AG should be a bit on the AG that could
> > be changed by the admin (like xfs_freeze); this could also help for
> > other reasons than shrink (when on a big FS some AGs lie on a physical
> > device and others on a different device, and you would like to restrict
> > writes to a given AG, as much as possible).
> 
> Yes, that's exactly what I'm talking about ;)
Ah, I see now what you meant by having a transaction for
locking/unlocking AGs for allocation.

> Yeah, 1) and 4) are separable parts of the problem and can be done
> in any order. 2) can be implemented relatively easily as stated
> above.
> 
> 3) is the hard one - we need to find the owner of each block
> (metadata and data) remaining in the AGs to be removed. This may be
> a directory btree block, an inode extent btree block, a data block,
> an extended attr block, etc. Moving the data blocks is easy to
> do (swap extents), but moving the metadata blocks is a major PITA
> as it will need to be done transactionally and that will require
> a bunch of new (complex) code to be written, I think. It will be
> of equivalent complexity to defragmenting metadata....
> 
> If we ignore the metadata block problem then finding and moving the
> data blocks should not be a problem - swap extents can be used for
> that as well - but it will be extremely time consuming and won't
> scale to large filesystem sizes....

So given these caveats, is there a chance that a) this will be actually
useful and b) will this be accepted?

The last time I tried to work on this there was no real feedback,
and I'm thinking that maybe the code will be too intrusive and will give
too little gain to be accepted.

Thanks for your comments,
iustin

* Re: XFS shrink functionality
  2007-06-05  8:00       ` Iustin Pop
@ 2007-06-06  1:50         ` Nathan Scott
  2007-06-07  8:18         ` David Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Nathan Scott @ 2007-06-06  1:50 UTC (permalink / raw)
  To: Iustin Pop; +Cc: David Chinner, Ruben Porras, xfs, cw

On Tue, 2007-06-05 at 10:00 +0200, Iustin Pop wrote:
> So given these caveats, is there a chance that a) this will be actually
> useful and b) will this be accepted?

There's no doubt that it's useful; it's probably the most frequently
requested feature for XFS from the community.

I'd imagine its acceptance will depend on code quality, testing,
etc, etc.

> The last time I tried to work on this there was no real feedback,
> and I'm thinking that maybe the code will be too intrusive and will give
> too little gain to be accepted.

IIRC, most people missed the patch last time because it got bounced by the
list (can't remember why) - that was why I missed it for a long time,
anyway.

cheers.

--
Nathan

* Re: XFS shrink functionality
  2007-06-05  8:00       ` Iustin Pop
  2007-06-06  1:50         ` Nathan Scott
@ 2007-06-07  8:18         ` David Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-07  8:18 UTC (permalink / raw)
  To: David Chinner, Ruben Porras, xfs, cw

On Tue, Jun 05, 2007 at 10:00:12AM +0200, Iustin Pop wrote:
> On Mon, Jun 04, 2007 at 07:21:15PM +1000, David Chinner wrote:
> > > allocated on an available AG and when you remove the originals, the
> > > to-be-shrinked AGs become free. Yes, utterly non-optimal, but it was the
> > > simplest way to do it based on what I knew at the time.
> > 
> > Not quite that simple, unfortunately. You can't leave the
> > AGs locked in the same way we do for a grow because we need
> > to be able to use the AGs to move stuff about and that
> > requires locking them. Hence we need a separate mechanism
> > to prevent allocation in a given AG outside of locking them.
> > 
> > Hence we need:
> > 
> > 	- a transaction to mark AGs "no-allocate"
> > 	- a transaction to mark AGs "allocatable"
> > 	- a flag in each AGF/AGI to say the AG is available for
> > 	  allocations (persistent over crashes)
> > 	- a flag in the per-ag structure to indicate allocation
> > 	  status of the AG.
> > 	- everywhere we select an AG for allocation, we need to
> > 	  check this flag and skip the AG if it's not available.
> > 
> > FWIW, the transactions can probably just be an extension of
> > xfs_alloc_log_agf() and xfs_alloc_log_agi()....
> 
> A question: do you think that the cost of having this in the code
> (especially the last part, check that flag in every allocation function)
> is acceptable? I mean, let's say one would write the patch to implement
> all this. Does it have a chance to be accepted? Or will people say it's
> only bloat? ...

Lots of ppl ask for shrink capability on XFS, so if it's implemented
and reviewed and passes QA tests, then I see no reason why it wouldn't
be accepted...

> > Yeah, 1) and 4) are separable parts of the problem and can be done
> > in any order. 2) can be implemented relatively easily as stated
> > above.
> > 
> > 3) is the hard one - we need to find the owner of each block
> > (metadata and data) remaining in the AGs to be removed. This may be
> > a directory btree block, an inode extent btree block, a data block,
> > an extended attr block, etc. Moving the data blocks is easy to
> > do (swap extents), but moving the metadata blocks is a major PITA
> > as it will need to be done transactionally and that will require
> > a bunch of new (complex) code to be written, I think. It will be
> > of equivalent complexity to defragmenting metadata....
> > 
> > If we ignore the metadata block problem then finding and moving the
> > data blocks should not be a problem - swap extents can be used for
> > that as well - but it will be extremely time consuming and won't
> > scale to large filesystem sizes....
> 
> So given these caveats, is there a chance that a) this will be actually
> useful and b) will this be accepted?

Look at it this way - if we get to the point where 3 is a problem, then
we've got most of a useful shrinker. That's way ahead of what we
have now and in a lot of cases it will just work.

The corner cases are the hard bit, but we can work on them incrementally
once the rest is done, and in doing so we'll also be introducing the
means by which to defragment metadata. IOWs, we kill two birds with
one stone at that point in time.

Likewise for the shrink case that needs to move the log - we've got hooks for
userspace tools to move the log, just no implementation. Implementing log moving
for shrink will also enable us to do online log resize and internal/external
log switching. Once again, two birds with one stone.

Hence I don't see these issues as showstoppers at all - getting to
the point of a full shrink implementation will give us other features
that we need to have anyway....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: XFS shrink functionality
  2007-06-04  8:41   ` Iustin Pop
  2007-06-04  9:21     ` David Chinner
@ 2007-06-08  8:23     ` Ruben Porras
  2007-06-08 10:15       ` Iustin Pop
  2007-06-08 14:44       ` David Chinner
  1 sibling, 2 replies; 21+ messages in thread
From: Ruben Porras @ 2007-06-08  8:23 UTC (permalink / raw)
  To: Iustin Pop; +Cc: David Chinner, xfs, cw

On Monday, 04.06.2007 at 10:41 +0200, Iustin Pop wrote:
> Good to know. If there is at least more documentation about the
> internals, I could try to find some time to work on this again.

There is now a document explaining the XFS on-disk format [0] and some
presentations for training courses; I think none of these were available
at the time you made the first try. They are not enough for our
purpose, though.

> My suggestion would be to start implementing these steps in reverse. 4)
> is the most important as it touches the entire FS. If 4) is working
> correctly, then 1) would be simpler (I think)

Why do you think that 1) would be simpler after 4)? From what I
understand, they are independent.

3) worries me: if walking the entire filesystem is needed, it won't
scale...

Since I don't know the XFS code yet, I would like to begin with 1). I see
it as independent from the other parts, and I can then learn more about the
transactions, allocators, and walking through the XFS structures. As you
did 4) once, maybe you could try that part of the problem if
you find the needed time, taking David's suggestions into account.

[0] http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf

Cheers

--
Ruben Porras
LinWorks GmbH


* Re: XFS shrink functionality
  2007-06-08  8:23     ` Ruben Porras
@ 2007-06-08 10:15       ` Iustin Pop
  2007-06-08 15:12         ` David Chinner
  2007-06-08 14:44       ` David Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Iustin Pop @ 2007-06-08 10:15 UTC (permalink / raw)
  To: Ruben Porras; +Cc: David Chinner, xfs, cw

On Fri, Jun 08, 2007 at 10:23:53AM +0200, Ruben Porras wrote:
> On Monday, 04.06.2007 at 10:41 +0200, Iustin Pop wrote:
> > Good to know. If there is at least more documentation about the
> > internals, I could try to find some time to work on this again.
> 
> There is now a document explaining the XFS on-disk format [0] and some
> presentations for training courses; I think none of these were available
> at the time you made the first try. They are not enough for our
> purpose, though.
> 

Yes, just yesterday I found the document and it helps.

> > My suggestion would be to start implementing these steps in reverse. 4)
> > is the most important as it touches the entire FS. If 4) is working
> > correctly, then 1) would be simpler (I think)
> 
> Why do you think that 1) would be simpler after 4)? From what I
> understand, they are independent.
Not after in the chronological sense, but in terms of importance.
Yes, it was a bad choice of words.

> 3) worries me: if walking the entire filesystem is needed, it won't
> scale...
>   
> Since I don't know the XFS code yet, I would like to begin with 1). I see
> it as independent from the other parts, and I can then learn more about the
> transactions, allocators, and walking through the XFS structures. As you
> did 4) once, maybe you could try that part of the problem if
> you find the needed time, taking David's suggestions into account.

I took a look at both items since this discussion started. And honestly,
I think 1) is harder than 4), so you're welcome to work on it :) What
makes it harder is that, per David's suggestion, we need to:
 - define two new transaction types
 - define two new ioctls
 - update the ondisk-format (!), if we want persistence of these flags;
   luckily, there are two spare fields in the AGF structure.
 - check the list of allocation functions that allocate space from the
   AG

I did some preliminary work on this but just a little.

I think that after the weekend I'll send an updated patch of 4). I have
one working now with the current CVS tree, just that it's still ugly and
needs polishing.

Open questions (re. point 4):
 - the filesystem document says the agf->agf_btreeblks is held only in
   case we have an extended flag active for the filesystem
   (XFS_SB_VERSION2_LAZYSBCOUNTBIT); is this true? without this, I'm not
   sure how to calculate this number of blocks nicely
 - or can I assume that an empty AG will *always* have agf_levels = 1
   for both Btrees, so there are no extra blocks actually used for the
   btrees (except for the two reserved ones at the beginning of the AG)
 - can I assume that an AG with agi->icount == agi->ifree == 0 will have
   no blocks used for the inode btrees (logically yes, but I'm not sure)


thanks,
iustin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-08  8:23     ` Ruben Porras
  2007-06-08 10:15       ` Iustin Pop
@ 2007-06-08 14:44       ` David Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-08 14:44 UTC (permalink / raw)
  To: Ruben Porras; +Cc: Iustin Pop, David Chinner, xfs, cw

On Fri, Jun 08, 2007 at 10:23:53AM +0200, Ruben Porras wrote:
> On Monday, 04.06.2007 at 10:41 +0200, Iustin Pop wrote:
> > Good to know. If there is at least more documentation about the
> > internals, I could try to find some time to work on this again.
> 
> there is now a document explaining the XFS on disk format [0] and some
> presentations for training courses, I think none of this were available
> at the time you made the first try. Although they are not enough for our
> purpose. 

There's thousands of lines of code documenting that format as well ;)

> > My suggestion would be to start implementing these steps in reverse. 4)
> > is the most important as it touches the entire FS. If 4) is working
> > correctly, then 1) would be simpler (I think)
> 
> Why do you think that 1) would be simpler after 4)? For what I
> understand, they are independent.
> 
> 3) worries me, if walking the entire filesystem is needed, it want
> scale...

I think walking the filesystem can be avoided effectively by
introducing a reverse map that points to the owner of the block.
(i.e. another btree). Reverse mapping provides other benefits as
well e.g. somewhere to put block checksums and more information
for repair and scrubbing.
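
To make the reverse-map idea a bit more concrete, here is a minimal
userspace sketch. It is not XFS code: the record layout (rmap_rec) and
the function rmap_lookup() are hypothetical names, and a sorted array
with binary search stands in for the btree. It only shows the lookup
"which owner holds this block?" that would replace a full filesystem
walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-map record: an extent and the object owning it. */
struct rmap_rec {
	uint32_t	start;	/* first block of the extent */
	uint32_t	len;	/* number of blocks in the extent */
	uint64_t	owner;	/* e.g. an inode number (assumption) */
};

/*
 * Find the record covering 'block' in an array sorted by start block
 * with non-overlapping extents. Returns NULL if no extent covers it.
 */
static const struct rmap_rec *
rmap_lookup(const struct rmap_rec *recs, size_t n, uint32_t block)
{
	size_t lo = 0, hi = n;

	assert(recs != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (block < recs[mid].start)
			hi = mid;		/* look in the lower half */
		else if (block - recs[mid].start >= recs[mid].len)
			lo = mid + 1;		/* look in the upper half */
		else
			return &recs[mid];	/* block is inside this extent */
	}
	return NULL;
}
```

With such a map, finding the owners of all blocks in an AG that is
about to be removed becomes a scan of that AG's records instead of a
walk over every inode in the filesystem.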

The hard part is the moving of metadata.  I haven't really thought
deeply on the best method for this; there's lots of options and I
don't know what is the best way to proceed there yet. That's not
something I need to think about right now, though ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-08 10:15       ` Iustin Pop
@ 2007-06-08 15:12         ` David Chinner
  2007-06-08 16:03           ` Iustin Pop
                             ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: David Chinner @ 2007-06-08 15:12 UTC (permalink / raw)
  To: Ruben Porras, David Chinner, xfs, cw

On Fri, Jun 08, 2007 at 12:15:32PM +0200, Iustin Pop wrote:
> On Fri, Jun 08, 2007 at 10:23:53AM +0200, Ruben Porras wrote:
> > On Monday, 04.06.2007 at 10:41 +0200, Iustin Pop wrote:
> > > Good to know. If there is at least more documentation about the
> > > internals, I could try to find some time to work on this again.
> > 
> > there is now a document explaining the XFS on disk format [0] and some
> > presentations for training courses, I think none of this were available
> > at the time you made the first try. Although they are not enough for our
> > purpose. 
> > 
> 
> Yes, just yesterday I found the document and it helps.
> 
> > > My suggestion would be to start implementing these steps in reverse. 4)
> > > is the most important as it touches the entire FS. If 4) is working
> > > correctly, then 1) would be simpler (I think)
> > 
> > Why do you think that 1) would be simpler after 4)? For what I
> > understand, they are independent.
> Not after that in the cronological sense, but in the importance part.
> Yes, it was a bad choice of words.
> 
> > 3) worries me, if walking the entire filesystem is needed, it want
> > scale...
> >   
> > Since I don't know yet the xfs code I would like to begin with 1), I see
> > it independent from the other parts, and I can then learn more about the
> > transactions, allocators, and walking through the xfs structures. As you
> > did 4) one time, maybe you could try with this part of the problem if
> > you find the needed time, taking David's suggestions into account.
> 
> I took a look at both items since this discussion started. And honestly,
> I think 1) is harder that 4), so you're welcome to work on it :) The
> points that make it harder is that, per David's suggestion, there needs
> to be:
>  - define two new transaction types

one new transaction type:

XFS_TRANS_AGF_FLAGS

and an extension to xfs_alloc_log_agf() is about all that is
needed there.

See the patch here:

http://oss.sgi.com/archives/xfs/2007-04/msg00103.html

For an example of a very similar transaction to what is needed
(look at xfs_log_sbcount()) and a very similar addition to
the AGF (agf_btreeblks).

>  - define two new ioctls

XFS_IOC_ALLOC_ALLOW_AG, parameter xfs_agnumber_t.
XFS_IOC_ALLOC_DENY_AG, parameter xfs_agnumber_t.

>  - update the ondisk-format (!), if we want persistence of these flags;
>    luckily, there are two spare fields in the AGF structure.

Better to expand, I think. The AGF is a sector in length - we can
expand the structure as we need to this size without fear, esp. as
the part of the sector outside the structure is guaranteed to be
zero.  i.e. we can add a flags field to the end of the AGF
structure - old filesystems simply read as "no flags set" and
old kernels never look at those bits....
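
The compatibility argument above can be illustrated with a small
userspace sketch. The structures here are simplified stand-ins, not the
real xfs_agf_t, the 512-byte sector size is an assumption, and
endianness conversion is left out. The only point is that a header
written in the old layout into a zeroed sector reads back with the new
trailing field equal to zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE	512	/* assumed sector size for this sketch */

/* Hypothetical "old" header layout: original fields only. */
struct old_agf {
	uint32_t	magicnum;
	uint32_t	freeblks;
	uint32_t	btreeblks;
};

/* Hypothetical "new" layout: same prefix plus a flags word at the end. */
struct new_agf {
	uint32_t	magicnum;
	uint32_t	freeblks;
	uint32_t	btreeblks;
	uint32_t	flags;
};

/* Write an old-format header into a fully zeroed sector. */
static void write_old_agf(unsigned char *sector, const struct old_agf *agf)
{
	memset(sector, 0, SECTOR_SIZE);
	memcpy(sector, agf, sizeof(*agf));
}

/* Read the same sector back through the new, larger layout. */
static struct new_agf read_new_agf(const unsigned char *sector)
{
	struct new_agf agf;

	assert(sizeof(agf) <= SECTOR_SIZE);	/* must still fit in one sector */
	memcpy(&agf, sector, sizeof(agf));
	return agf;
}
```

This relies on the rest of the sector really being zero on disk, which
is the guarantee stated above.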

>  - check the list of allocation functions that allocate space from the
>    AG

> I did some preliminary work on this but just a little.
> 
> I think that after the weekend I'll send an updated patch of 4). I have
> one working now with the current CVS tree, just that it's still ugly and
> needs polishing.
> 
> Open questions (re. point 4):
>  - the filesystem document says the agf->agf_btreeblks is held only in
>    case we have an extended flag active for the filesystem
>    (XFS_SB_VERSION2_LAZYSBCOUNTBIT); is this true? without this, I'm not
>    sure how to calculate this number of blocks nicely

Yes, that is true. There's a pre-req for shrinking for the moment :/

>  - or can I assume that an empty AG will *always* have agf_levels = 1
>    for both Btrees, so there are no extra blocks actually used for the
>    btrees (except for the two reserved ones at the beggining of the AG

Yes, that is a valid assumption.

>  - can I assume that an AG with agi->icount == agi->ifree == 0 will have
>    no blocks used for the inode btrees (logically yes, but I'm not sure)

yes.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-08 15:12         ` David Chinner
@ 2007-06-08 16:03           ` Iustin Pop
  2007-06-09  2:15             ` David Chinner
  2007-06-08 19:47           ` Ruben Porras
  2007-06-14  8:35           ` Ruben Porras
  2 siblings, 1 reply; 21+ messages in thread
From: Iustin Pop @ 2007-06-08 16:03 UTC (permalink / raw)
  To: David Chinner; +Cc: Ruben Porras, xfs, cw

On Sat, Jun 09, 2007 at 01:12:23AM +1000, David Chinner wrote:
> > I took a look at both items since this discussion started. And honestly,
> > I think 1) is harder that 4), so you're welcome to work on it :) The
> > points that make it harder is that, per David's suggestion, there needs
> > to be:
> >  - define two new transaction types
> 
> one new transaction type:
> 
> XFS_TRANS_AGF_FLAGS
> 
> and and extension to xfs_alloc_log_agf(). Is about all that is
> needed there.
> 
> See the patch here:
> 
> http://oss.sgi.com/archives/xfs/2007-04/msg00103.html

Ah, I see now. I was wondering how one can enable the new bits (CVS
xfs_db shows the btreeblks but the 'version' cmd doesn't allow changing
them); it seems that manual xfs_db work + xfs_repair enables them.

> For an example of a very simlar transaction to what is needed
> (look at xfs_log_sbcount()) and very similar addition to
> the AGF (xfs_btreeblks).
Just a question: why do you think this per-AG bit should be persistent?
I'm just curious. When I first thought about this, I was thinking that
this should be an in-core flag only, like the freeze flag is for the
filesystem. The idea being that you don't need to recover this state
after a crash - there is no actual state, just restart the shrink
operation if you want. And no actual change of filesystem state (e.g.
space allocation or such) is happening when you mark the AGs as not
allocatable. This would allow a much simpler implementation of the
'no-alloc' part.

> >  - update the ondisk-format (!), if we want persistence of these flags;
> >    luckily, there are two spare fields in the AGF structure.
> 
> Better to expand, I think. The AGF is a sector in length - we can
> expand the structure as we need to this size without fear, esp. as
> the part of the sector outside the structure is guaranteed to be
> zero.  i.e. we can add a fields flag to the end of the AGF
> structure - old filesystems simple read as "no flags set" and
> old kernels never look at those bits....
Yes, makes sense. Just to make sure: the xfs_agf_t, xfs_agi_t and
xfs_sb_t structures as defined in xfs_sb.h and xfs_ag.h are what
is actually on disk, right? Adding to them, defining the new bits, i.e.
XFS_AGF_FLAGS and bumping up XFS_AGF_ALL_BITS should take care of the
on-disk part?

> > Open questions (re. point 4):
> >  - the filesystem document says the agf->agf_btreeblks is held only in
> >    case we have an extended flag active for the filesystem
> >    (XFS_SB_VERSION2_LAZYSBCOUNTBIT); is this true? without this, I'm not
> >    sure how to calculate this number of blocks nicely
> 
> Yes, that is true. There's a pre-req for shrinking for the moment :/
> 
> >  - or can I assume that an empty AG will *always* have agf_levels = 1
> >    for both Btrees, so there are no extra blocks actually used for the
> >    btrees (except for the two reserved ones at the beggining of the AG
> 
> Yes, that is a valid assumption.
Ok, perfect. This then eliminates the need for LAZYSBCOUNTBIT. Just one
more question: can I *read* from the mp->m_perag structure or do I need
a lock (even for read), i.e. down_read, read the fields, up_read? (As
you can see, I don't have much experience w.r.t. kernel programming).

> >  - can I assume that an AG with agi->icount == agi->ifree == 0 will have
> >    no blocks used for the inode btrees (logically yes, but I'm not sure)
> 
> yes.

Good.

Thanks for your explanations. The patch for shrinking when the AGs are
empty will then be simpler and nicer than what I have now.

iustin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-08 15:12         ` David Chinner
  2007-06-08 16:03           ` Iustin Pop
@ 2007-06-08 19:47           ` Ruben Porras
  2007-06-14  8:35           ` Ruben Porras
  2 siblings, 0 replies; 21+ messages in thread
From: Ruben Porras @ 2007-06-08 19:47 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, cw

On Saturday, 09.06.2007 at 01:12 +1000, David Chinner wrote:

Thank you, this last mail explains the pieces I should do pretty
well :)

Cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-08 16:03           ` Iustin Pop
@ 2007-06-09  2:15             ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-09  2:15 UTC (permalink / raw)
  To: David Chinner, Ruben Porras, xfs, cw

On Fri, Jun 08, 2007 at 06:03:18PM +0200, Iustin Pop wrote:
> On Sat, Jun 09, 2007 at 01:12:23AM +1000, David Chinner wrote:
> > > I took a look at both items since this discussion started. And honestly,
> > > I think 1) is harder that 4), so you're welcome to work on it :) The
> > > points that make it harder is that, per David's suggestion, there needs
> > > to be:
> > >  - define two new transaction types
> > 
> > one new transaction type:
> > 
> > XFS_TRANS_AGF_FLAGS
> > 
> > and and extension to xfs_alloc_log_agf(). Is about all that is
> > needed there.
> > 
> > See the patch here:
> > 
> > http://oss.sgi.com/archives/xfs/2007-04/msg00103.html
> 
> Ah, I see now. I was wondering how one can enable the new bits (CVS
> xfs_db shows the btreeblks but 'version' cmd doesn't allow to change
> them), it seems that manual xfs_db work + xfs_repair allows them.

The xfs_db work needs to be wrapped up in xfs_admin. That's relatively
simple to do, but the repair stage is needed to count the btree blocks
and update the counter in each AGF. That could probably also be wrapped
up in an xfs_db script so conversion wouldn't require you to run
repair....

> > For an example of a very simlar transaction to what is needed
> > (look at xfs_log_sbcount()) and very similar addition to
> > the AGF (xfs_btreeblks).
> Just a question: why do you think this per-ag-bit to be persistent?

Shrinking is not the only reason why you might want to prevent
allocation within an AG. While we might be able to get away with a
totally in memory flag for a shrink, I really don't want to have
multiple mechanisms for doing roughly the same thing.

e.g. Think of fault tolerance - you detect a free space btree
corruption, so you prevent allocation and freeing in that AG (by
setting the relevant bits) until you can come along and repair it.
If you want to do online repair of this sort of corruption, then you
need to be able to stop the trees from being used between the time
that the corruption is detected and the time it is repaired. That may
be longer than the filesystem is currently mounted...

> I'm
> just curious. When I first thought about this, I was thinking more like
> this should be an in-core flag only, like the freeze flag is for the
> filesystem. The idea being that you don't need to recover this state
> after a crash

But a freeze is different - it's not modifying the filesystem,
just bringing it down into a consistent state. A shrink is a
modification operation, and so if it crashes halfway through,
we need to ensure that recovery doesn't do silly things. Hence
it is best to have all the state associated with the shrink
journalled and recoverable. i.e. persistent.

> - there is no actual state, just restart the shrink
> operation if you want. And no actual filesystem state (e.g. space
> allocation or such) is happenning when you toggle the AGs not
> allocatable. This would allow a much simpler implementation of the
> 'no-alloc' part.

True, but it would be much more limited in its potential use.

> > >  - update the ondisk-format (!), if we want persistence of these flags;
> > >    luckily, there are two spare fields in the AGF structure.
> > 
> > Better to expand, I think. The AGF is a sector in length - we can
> > expand the structure as we need to this size without fear, esp. as
> > the part of the sector outside the structure is guaranteed to be
> > zero.  i.e. we can add a fields flag to the end of the AGF
> > structure - old filesystems simple read as "no flags set" and
> > old kernels never look at those bits....
> Yes, makes sense. Just to make sure: the xfs_agf_t, xfs_agi_t and
> xfs_sb_t structures as defined in xfs_sb.h and xfs_ag.h are what
> actually is on-disk, right? Adding to them, defining the new bits i.e.
> XFS_AGF_FLAGS and bumping up XFS_AGF_ALL_BITS should take care of the
> on-disk part?

Don't forget to modify xfs_alloc_log_agf() as well ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-08 15:12         ` David Chinner
  2007-06-08 16:03           ` Iustin Pop
  2007-06-08 19:47           ` Ruben Porras
@ 2007-06-14  8:35           ` Ruben Porras
  2007-06-14  9:14             ` David Chinner
  2 siblings, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-14  8:35 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, cw, iusty

[-- Attachment #1: Type: text/plain, Size: 1824 bytes --]

> > I took a look at both items since this discussion started. And honestly,
> > I think 1) is harder that 4), so you're welcome to work on it :) The
> > points that make it harder is that, per David's suggestion, there needs
> > to be:
> >  - define two new transaction types
> 
> one new transaction type:
> 
> XFS_TRANS_AGF_FLAGS

done

> and and extension to xfs_alloc_log_agf(). Is about all that is
> needed there.

still to do. Will come after the ioctls.

> See the patch here:
> 
> http://oss.sgi.com/archives/xfs/2007-04/msg00103.html
> 
> For an example of a very simlar transaction to what is needed
> (look at xfs_log_sbcount()) and very similar addition to
> the AGF (xfs_btreeblks).
> 
> >  - define two new ioctls
> 
> XFS_IOC_ALLOC_ALLOW_AG, parameter xfsagnumber_t.
> XFS_IOC_ALLOC_DENY_AG, parameter xfsagnumber_t.

almost done. How should I obtain a pointer to an xfs_agf_t from
inside the ioctls?

I guess that the first step is to get a *bp with xfs_getsb and then an *sbp,
but which function/macro gives me the xfs_agf_t pointer? Sorry, I can't
find the way by grepping through the code.

> >  - update the ondisk-format (!), if we want persistence of these flags;
> >    luckily, there are two spare fields in the AGF structure.
> 
> Better to expand, I think. The AGF is a sector in length - we can
> expand the structure as we need to this size without fear, esp. as
> the part of the sector outside the structure is guaranteed to be
> zero.  i.e. we can add a fields flag to the end of the AGF
> structure - old filesystems simple read as "no flags set" and
> old kernels never look at those bits....

done.

> >  - check the list of allocation functions that allocate space from the
> >    AG

still to be done.

Thanks again for the help.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink functionality
  2007-06-14  8:35           ` Ruben Porras
@ 2007-06-14  9:14             ` David Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: David Chinner @ 2007-06-14  9:14 UTC (permalink / raw)
  To: Ruben Porras; +Cc: David Chinner, xfs, cw, iusty

On Thu, Jun 14, 2007 at 10:35:27AM +0200, Ruben Porras wrote:
> > > I took a look at both items since this discussion started. And honestly,
> > > I think 1) is harder that 4), so you're welcome to work on it :) The
> > > points that make it harder is that, per David's suggestion, there needs
> > > to be:
> > >  - define two new transaction types
> > 
> > one new transaction type:
> > 
> > XFS_TRANS_AGF_FLAGS
> 
> done
> 
> > and and extension to xfs_alloc_log_agf(). Is about all that is
> > needed there.
> 
> still to do. Will come after the ioctls.
> 
> > See the patch here:
> > 
> > http://oss.sgi.com/archives/xfs/2007-04/msg00103.html
> > 
> > For an example of a very simlar transaction to what is needed
> > (look at xfs_log_sbcount()) and very similar addition to
> > the AGF (xfs_btreeblks).
> > 
> > >  - define two new ioctls
> > 
> > XFS_IOC_ALLOC_ALLOW_AG, parameter xfsagnumber_t.
> > XFS_IOC_ALLOC_DENY_AG, parameter xfsagnumber_t.
> 
> almost done.

FWIW, I've had second thoughts on this ioctl interface. It's
horribly specific, considering all we are doing is setting
or clearing a flag in an AG.

Perhaps a better interface is:

XFS_IOC_GET_AGF_FLAGS
XFS_IOC_SET_AGF_FLAGS

with:

struct xfs_ioc_agflags {
	xfs_agnumber_t	ag;
	__u32		flags;
};

as the parameter structure, and:

#define XFS_AGF_FLAGS_ALLOC_DENY	(1<<0)
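
As a rough illustration of how such a get/set interface could behave,
here is a userspace model. It is only a sketch under assumptions:
struct fs_model, the model_* functions and MODEL_MAX_AGS are
hypothetical, there is no transaction or on-disk update, and the flag
bit simply mirrors the one proposed above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_MAX_AGS		8		/* arbitrary limit for this sketch */
#define AGF_FLAGS_ALLOC_DENY	(1u << 0)	/* mirrors the proposed flag bit */
#define AGF_FLAGS_ALL		(AGF_FLAGS_ALLOC_DENY)

/* Parameter structure, modelled on the proposed xfs_ioc_agflags. */
struct ioc_agflags {
	uint32_t	ag;
	uint32_t	flags;
};

/* Hypothetical stand-in for the mount: AG count plus per-AG flags. */
struct fs_model {
	uint32_t	agcount;
	uint32_t	ag_flags[MODEL_MAX_AGS];
};

/* SET: replace the flags of one AG; reject bad AG numbers and unknown bits. */
static int model_set_agf_flags(struct fs_model *fs, const struct ioc_agflags *in)
{
	assert(fs->agcount <= MODEL_MAX_AGS);
	if (in->ag >= fs->agcount || (in->flags & ~AGF_FLAGS_ALL))
		return EINVAL;
	fs->ag_flags[in->ag] = in->flags;
	return 0;
}

/* GET: report the current flags of one AG. */
static int model_get_agf_flags(const struct fs_model *fs, struct ioc_agflags *io)
{
	assert(fs->agcount <= MODEL_MAX_AGS);
	if (io->ag >= fs->agcount)
		return EINVAL;
	io->flags = fs->ag_flags[io->ag];
	return 0;
}

/* What an allocator would ask before picking this AG. */
static int model_alloc_allowed(const struct fs_model *fs, uint32_t ag)
{
	return ag < fs->agcount && !(fs->ag_flags[ag] & AGF_FLAGS_ALLOC_DENY);
}
```

One interface then covers both "deny" and "allow", and further flag
bits can be added later without new ioctl numbers.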


> How I'm should I obtain a pointer to an xfs_agf_t from
> inside the ioctls?
> 
> I guess that the first step is to get a *bp with xfs_getsb and then an *sbp,
> but, which function/macro gives me the xfs_agf_t pointer? Sorry, I can't
> find the way greeping through the code.

I've attached the quick hack I did when thinking this through
initially. It'll give you an idea of how to do this and a bit more.
FWIW, it was this hack that made me think the above interface
is a better way to go....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_ioctl.c |   27 +++++++++++
 fs/xfs/xfs_ag.h              |    7 ++
 fs/xfs/xfs_alloc.c           |  103 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fs.h              |    2 
 fs/xfs/xfs_trans.h           |    3 -
 5 files changed, 140 insertions(+), 2 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ioctl.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_ioctl.c	2007-06-08 21:34:37.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_ioctl.c	2007-06-08 22:22:59.305412098 +1000
@@ -899,6 +899,33 @@ xfs_ioctl(
 		return -error;
 	}
 
+	case XFS_IOC_ALLOC_DENY_AG: {
+		xfs_agnumber_t in;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		if (copy_from_user(&in, arg, sizeof(in)))
+			return -XFS_ERROR(EFAULT);
+
+		error = xfs_alloc_deny_ag(mp, in);
+		return -error;
+
+	}
+	case XFS_IOC_ALLOC_ALLOW_AG: {
+		xfs_agnumber_t in;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		if (copy_from_user(&in, arg, sizeof(in)))
+			return -XFS_ERROR(EFAULT);
+
+		error = xfs_alloc_allow_ag(mp, in);
+		return -error;
+
+	}
+
 	case XFS_IOC_FREEZE:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
Index: 2.6.x-xfs-new/fs/xfs/xfs_ag.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_ag.h	2007-06-08 21:46:28.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_ag.h	2007-06-08 22:09:18.323606142 +1000
@@ -69,6 +69,7 @@ typedef struct xfs_agf {
 	__be32		agf_freeblks;	/* total free blocks */
 	__be32		agf_longest;	/* longest free space */
 	__be32		agf_btreeblks;	/* # of blocks held in AGF btrees */
+	__be32		agf_flags;	/* status flags */
 } xfs_agf_t;
 
 #define	XFS_AGF_MAGICNUM	0x00000001
@@ -83,9 +84,12 @@ typedef struct xfs_agf {
 #define	XFS_AGF_FREEBLKS	0x00000200
 #define	XFS_AGF_LONGEST		0x00000400
 #define	XFS_AGF_BTREEBLKS	0x00000800
-#define	XFS_AGF_NUM_BITS	12
+#define	XFS_AGF_FLAGS		0x00001000
+#define	XFS_AGF_NUM_BITS	13
 #define	XFS_AGF_ALL_BITS	((1 << XFS_AGF_NUM_BITS) - 1)
 
+
+
 /* disk block (xfs_daddr_t) in the AG */
 #define XFS_AGF_DADDR(mp)	((xfs_daddr_t)(1 << (mp)->m_sectbb_log))
 #define	XFS_AGF_BLOCK(mp)	XFS_HDR_BLOCK(mp, XFS_AGF_DADDR(mp))
@@ -189,6 +193,7 @@ typedef struct xfs_perag
 	xfs_extlen_t	pagf_freeblks;	/* total free blocks */
 	xfs_extlen_t	pagf_longest;	/* longest free space */
 	__uint32_t	pagf_btreeblks;	/* # of blocks held in AGF btrees */
+	__uint32_t	pagf_flags;	/* status flags for AG */
 	xfs_agino_t	pagi_freecount;	/* number of free inodes */
 	xfs_agino_t	pagi_count;	/* number of allocated inodes */
 	int		pagb_count;	/* pagb slots in use */
Index: 2.6.x-xfs-new/fs/xfs/xfs_alloc.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_alloc.c	2007-06-05 22:12:50.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_alloc.c	2007-06-08 23:12:51.256348632 +1000
@@ -2085,6 +2085,7 @@ xfs_alloc_log_agf(
 		offsetof(xfs_agf_t, agf_freeblks),
 		offsetof(xfs_agf_t, agf_longest),
 		offsetof(xfs_agf_t, agf_btreeblks),
+		offsetof(xfs_agf_t, agf_flags),
 		sizeof(xfs_agf_t)
 	};
 
@@ -2112,6 +2113,107 @@ xfs_alloc_pagf_init(
 	return 0;
 }
 
+#define XFS_AGFLAG_ALLOC_DENY	1
+STATIC void
+xfs_alloc_set_flag_ag(
+	xfs_trans_t	*tp,
+	xfs_buf_t	*agbp,	/* buffer for a.g. freelist header */
+	xfs_perag_t	*pag,
+	int		flag)
+{
+	xfs_agf_t		*agf;	/* a.g. freespace structure */
+
+	agf = XFS_BUF_TO_AGF(agbp);
+	pag->pagf_flags |= flag;
+	agf->agf_flags = cpu_to_be32(pag->pagf_flags);
+
+	xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLAGS);
+}
+
+STATIC void
+xfs_alloc_clear_flag_ag(
+	xfs_trans_t	*tp,
+	xfs_buf_t	*agbp,	/* buffer for a.g. freelist header */
+	xfs_perag_t	*pag,
+	int		flag)
+{
+	xfs_agf_t		*agf;	/* a.g. freespace structure */
+
+	agf = XFS_BUF_TO_AGF(agbp);
+	pag->pagf_flags &= ~flag;
+	agf->agf_flags = cpu_to_be32(pag->pagf_flags);
+
+	xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLAGS);
+}
+
+int
+xfs_alloc_allow_ag(
+	xfs_mount_t	*mp,
+	xfs_agnumber_t	agno)
+{
+	xfs_perag_t	*pag;
+	xfs_buf_t	*bp;
+	int		error;
+	xfs_trans_t	*tp;
+
+	if (agno >= mp->m_sb.sb_agcount)
+		return XFS_ERROR(EINVAL);
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_ALLOC_FLAGS);
+	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
+					XFS_DEFAULT_LOG_COUNT);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
+	if (error) {
+		xfs_trans_cancel(tp, 0);	/* don't leak the transaction */
+		return error;
+	}
+	pag = &mp->m_perag[agno];
+	xfs_alloc_clear_flag_ag(tp, bp, pag, XFS_AGFLAG_ALLOC_DENY);
+
+	xfs_trans_set_sync(tp);
+	xfs_trans_commit(tp, 0);
+
+	return 0;
+}
+
+int
+xfs_alloc_deny_ag(
+	xfs_mount_t	*mp,
+	xfs_agnumber_t	agno)
+{
+	xfs_perag_t	*pag;
+	xfs_buf_t	*bp;
+	int		error;
+	xfs_trans_t	*tp;
+
+	if (agno >= mp->m_sb.sb_agcount)
+		return XFS_ERROR(EINVAL);
+
+	tp = xfs_trans_alloc(mp, XFS_TRANS_ALLOC_FLAGS);
+	error = xfs_trans_reserve(tp, 0, mp->m_sb.sb_sectsize + 128, 0, 0,
+					XFS_DEFAULT_LOG_COUNT);
+	if (error) {
+		xfs_trans_cancel(tp, 0);
+		return error;
+	}
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
+	if (error) {
+		xfs_trans_cancel(tp, 0);	/* don't leak the transaction */
+		return error;
+	}
+	pag = &mp->m_perag[agno];
+	xfs_alloc_set_flag_ag(tp, bp, pag, XFS_AGFLAG_ALLOC_DENY);
+
+	xfs_trans_set_sync(tp);
+	xfs_trans_commit(tp, 0);
+
+	return 0;
+}
+
 /*
  * Put the block on the freelist for the allocation group.
  */
@@ -2226,6 +2328,7 @@ xfs_alloc_read_agf(
 		pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
 		pag->pagf_flcount = be32_to_cpu(agf->agf_flcount);
 		pag->pagf_longest = be32_to_cpu(agf->agf_longest);
+		pag->pagf_flags = be32_to_cpu(agf->agf_flags);
 		pag->pagf_levels[XFS_BTNUM_BNOi] =
 			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
 		pag->pagf_levels[XFS_BTNUM_CNTi] =
Index: 2.6.x-xfs-new/fs/xfs/xfs_trans.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans.h	2007-06-08 21:41:32.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_trans.h	2007-06-08 22:50:46.449405162 +1000
@@ -95,7 +95,8 @@ typedef struct xfs_trans_header {
 #define	XFS_TRANS_GROWFSRT_FREE		39
 #define	XFS_TRANS_SWAPEXT		40
 #define	XFS_TRANS_SB_COUNT		41
-#define	XFS_TRANS_TYPE_MAX		41
+#define	XFS_TRANS_ALLOC_FLAGS		42
+#define	XFS_TRANS_TYPE_MAX		42
 /* new transaction types need to be reflected in xfs_logprint(8) */
 
 
Index: 2.6.x-xfs-new/fs/xfs/xfs_fs.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_fs.h	2007-06-08 21:46:29.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_fs.h	2007-06-08 23:15:31.755284394 +1000
@@ -493,6 +493,8 @@ typedef struct xfs_handle {
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_ALLOC_DENY_AG	     _IOR ('X', 126, __uint32_t)
+#define XFS_IOC_ALLOC_ALLOW_AG	     _IOR ('X', 127, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink (step 0)
  2007-06-04  0:16 ` David Chinner
  2007-06-04  8:41   ` Iustin Pop
@ 2007-06-19 22:22   ` Ruben Porras
  2007-06-19 23:42     ` David Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-19 22:22 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, iusty

On Monday, 04.06.2007 at 10:16 +1000, David Chinner wrote:

> Here's the "simple" bits that will allow you to shrink
> the filesystem down to the end of the internal log:
> 
> 	0. Check space is available for shrink

Now that I'm almost* finished with point 1), is there any place in the
XFS code where a similar task is done? That would give me a basis to
start from.

Cheers.

* I need only to fix the indentation, and change the ioctl interface as
David suggested in another mail in this thread, so that the
implementation is not so specific.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink (step 0)
  2007-06-19 22:22   ` XFS shrink (step 0) Ruben Porras
@ 2007-06-19 23:42     ` David Chinner
  2007-06-28 10:38       ` Ruben Porras
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-06-19 23:42 UTC (permalink / raw)
  To: Ruben Porras; +Cc: David Chinner, xfs, iusty

On Wed, Jun 20, 2007 at 12:22:31AM +0200, Ruben Porras wrote:
> On Monday, 04.06.2007 at 10:16 +1000, David Chinner wrote:
> 
> > Here's the "simple" bits that will allow you to shrink
> > the filesystem down to the end of the internal log:
> > 
> > 	0. Check space is available for shrink
> 
> Now that I'm almost* finish with the point 1),

Cool ;)

> is there any place in the
> xfs_code where a similar task is done? This way I would have a basis to
> start off.

No, there isn't anything currently in existence to do this.

It's not difficult, though. What you need to do is count the number of
used blocks in the AGs that will be truncated off, and check whether
there is enough free space in the remaining AGs to hold all the
blocks that we are going to move.

I think this could be done with a single loop across the perag
array or with a simple xfs_db wrapper and some shell/awk/perl
magic.

e.g: Here's the basis:

budgie:~ # for i in `seq 0 1 7`; do
> xfs_db -r -c "agf $i" -c "p freeblks" -c "p btreeblks" /dev/sdb8
> done
freeblks = 32779
btreeblks = 0
freeblks = 63003
btreeblks = 0
freeblks = 124423
btreeblks = 0
freeblks = 114516
btreeblks = 0
freeblks = 126602
btreeblks = 0
freeblks = 125905
btreeblks = 0
freeblks = 127886
btreeblks = 0
freeblks = 125445
btreeblks = 0

Now all you need to extract is the size of each AG from the superblock,
determine which AGs are going to be freed, and do some math ;)
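
The "math" could look roughly like the following userspace sketch. The
names (struct ag_usage, shrink_space_ok()) are hypothetical, and the
check is deliberately coarse: it counts every non-free block in the AGs
to be removed as needing to move (AG headers included, so it
overestimates) and it ignores fragmentation of the free space in the
remaining AGs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-AG numbers as they could be gathered from the AGF/superblock. */
struct ag_usage {
	uint64_t	size_blocks;	/* total blocks in this AG */
	uint64_t	free_blocks;	/* free blocks in this AG */
};

/*
 * Return 1 if the used blocks of AGs [new_agcount, agcount) fit into
 * the free space of AGs [0, new_agcount), and 0 otherwise.
 */
static int shrink_space_ok(const struct ag_usage *ags, size_t agcount,
			   size_t new_agcount)
{
	uint64_t to_move = 0;	/* used blocks in AGs that go away */
	uint64_t room = 0;	/* free blocks in AGs that stay */
	size_t i;

	assert(new_agcount <= agcount);
	for (i = 0; i < agcount; i++) {
		assert(ags[i].free_blocks <= ags[i].size_blocks);
		if (i < new_agcount)
			room += ags[i].free_blocks;
		else
			to_move += ags[i].size_blocks - ags[i].free_blocks;
	}
	return to_move <= room;
}
```

In the kernel the same loop would run over the perag array; from
userspace the inputs would come from xfs_db output like the listing
above.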

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: XFS shrink (step 0)
  2007-06-19 23:42     ` David Chinner
@ 2007-06-28 10:38       ` Ruben Porras
  2007-06-29  6:55         ` David Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Ruben Porras @ 2007-06-28 10:38 UTC (permalink / raw)
  Cc: xfs, iusty

David Chinner wrote:
> No, there isn't anything currently in existence to do this.
>
> It's not difficult, though. What you need to do is count the number of
> used blocks in the AGs that will be truncated off, and check whether
> there is enough free space in the remaining AGs to hold all the
> blocks that we are going to move.
>
> I think this could be done with a single loop across the perag
> array or with a simple xfs_db wrapper and some shell/awk/perl
> magic.
>   
Do you think it is OK to depend on shell/awk/perl? I'll do it in C
looping through the perag array.
 

--
Rubén Porras
LinWorks GmbH


* Re: XFS shrink (step 0)
  2007-06-28 10:38       ` Ruben Porras
@ 2007-06-29  6:55         ` David Chinner
  2007-07-30 17:30           ` Ruben Porras
  0 siblings, 1 reply; 21+ messages in thread
From: David Chinner @ 2007-06-29  6:55 UTC (permalink / raw)
  To: Ruben Porras; +Cc: xfs, iusty

On Thu, Jun 28, 2007 at 12:38:44PM +0200, Ruben Porras wrote:
> David Chinner wrote:
> >No, there isn't anything currently in existence to do this.
> >
> >It's not difficult, though. What you need to do is count the number of
> >used blocks in the AGs that will be truncated off, and check whether
> >there is enough free space in the remaining AGs to hold all the
> >blocks that we are going to move.
> >
> >I think this could be done with a single loop across the perag
> >array or with a simple xfs_db wrapper and some shell/awk/perl
> >magic.
> >  
> Do you think it is OK to depend on shell/awk/perl?

Sure. We have a few programs that are just shell wrappers
of other xfs programs

e.g:
	xfs_bmap: shell script that calls xfs_io
	xfs_check: shell script that calls xfs_db
	xfs_info: shell script that calls xfs_growfs

> I'll do it in C looping through the perag array.

For something like this it's probably easier to do with shell/perl/awk.

e.g. in shell, get the number of AGs in the filesystem and
iterate over all of them:

numags=`xfs_db -r -c "sb 0" -c "p agcount" /dev/sdb8 | sed -e 's/.* = //'`
lastag=`expr $numags - 1`
for ags in `seq 0 1 $lastag`; do
	....
done

Free space in AG 0:

xfs_db -r -c "freesp -s -a 0" /dev/sdb8 | awk '/total free blocks/ {print $4}'

And so on. You can peek into pretty much any structure on disk with xfs_db
and you can do it online so it's pretty much perfect for this sort of
checking. I'd start with something like this, and if it gets too complex
then we need to look at integrating it into xfs_db (i.e. writing it in C)....
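
Putting the two fragments above together might look like the sketch
below. total_free is a hypothetical function name; it relies only on
the xfs_db commands and output formats shown above and does no error
handling:

```shell
#!/bin/sh
# total_free DEVICE   (hypothetical helper)
# Sum the "total free blocks" figure that xfs_db's freesp command
# reports for each AG of the filesystem on DEVICE.
total_free() {
	dev=$1
	# number of AGs, taken from the primary superblock
	numags=$(xfs_db -r -c "sb 0" -c "p agcount" "$dev" | sed -e 's/.* = //')
	lastag=$((numags - 1))
	total=0
	for ag in $(seq 0 1 "$lastag"); do
		# free blocks in this AG, from the freesp summary
		free=$(xfs_db -r -c "freesp -s -a $ag" "$dev" |
			awk '/total free blocks/ {print $4}')
		total=$((total + free))
	done
	echo "$total"
}
```

Usage would be something like "total_free /dev/sdb8". Limiting the
loop to a range of AGs gives the two numbers the shrink check needs.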

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: XFS shrink (step 0)
  2007-06-29  6:55         ` David Chinner
@ 2007-07-30 17:30           ` Ruben Porras
  0 siblings, 0 replies; 21+ messages in thread
From: Ruben Porras @ 2007-07-30 17:30 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, iusty

[-- Attachment #1: Type: text/plain, Size: 1128 bytes --]

On Friday, 29.06.2007 at 16:55 +1000, David Chinner wrote:
> On Thu, Jun 28, 2007 at 12:38:44PM +0200, Ruben Porras wrote:

> For something like this it's probably easier to do with shell/perl/awk.
> 
> e.g. in shell, the number of ags in the filesystem:
> 
> iterate all ags:
> 
> numags=`xfs_db -r -c "sb 0" -c "p agcount" /dev/sdb8 | sed -e 's/.* = //'`
> lastag=`expr $numags - 1`
> for ags in `seq 0 1 $lastag`; do
> 	....
> done
> 
> Free space in an AG 0:
> 
> xfs_db -r -c "freesp -s -a 0" /dev/sdb8 | awk '/total free blocks/ {print $4}'

I decided to calculate the free space in an AG directly as
"freeblks" - "btreeblks".

Attached is a perl script that calculates the free space of a whole
filesystem. It's easy to modify it to get the free space from a range of
AGs, so unless there are errors, I'm posting it to the mailing list as an
example, and I'll adapt it later as needed. I would like to start with
step number 2.
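
For readers without the attachment, the range variant could be sketched
in shell/awk roughly as follows. This is not the attached freecount.pl;
free_in_range is a hypothetical name, and it simply applies the
"freeblks" - "btreeblks" rule described above to the output of the
xfs_db loop quoted earlier in the thread (one freeblks/btreeblks pair
per AG, AG 0 first, on stdin):

```shell
#!/bin/sh
# free_in_range FIRST LAST   (hypothetical helper)
# stdin: "freeblks = N" / "btreeblks = N" pairs as printed by
#        xfs_db -c "agf $i" -c "p freeblks" -c "p btreeblks"
# Prints the sum of freeblks - btreeblks for AGs FIRST..LAST inclusive.
free_in_range() {
	awk -v first="$1" -v last="$2" '
		$1 == "freeblks"  { free = $3 }
		$1 == "btreeblks" {
			# btreeblks closes the pair for the current AG
			if (ag >= first && ag <= last)
				total += free - $3
			ag++
		}
		END { print total + 0 }'
}
```

Piping the xfs_db loop into "free_in_range 6 7" would then give the
free space of the last two AGs of an 8-AG filesystem.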

What is the state of the program xfs_reno.c? Can it be released in the
near future as GPL, or should I rather go ahead for now with point number 3
(that is, move data out of offline AGs)?

[-- Attachment #2: freecount.pl --]
[-- Type: application/x-perl, Size: 1076 bytes --]


Thread overview: 21+ messages
2007-06-01 16:39 XFS shrink functionality Ruben Porras
2007-06-04  0:16 ` David Chinner
2007-06-04  8:41   ` Iustin Pop
2007-06-04  9:21     ` David Chinner
2007-06-05  8:00       ` Iustin Pop
2007-06-06  1:50         ` Nathan Scott
2007-06-07  8:18         ` David Chinner
2007-06-08  8:23     ` Ruben Porras
2007-06-08 10:15       ` Iustin Pop
2007-06-08 15:12         ` David Chinner
2007-06-08 16:03           ` Iustin Pop
2007-06-09  2:15             ` David Chinner
2007-06-08 19:47           ` Ruben Porras
2007-06-14  8:35           ` Ruben Porras
2007-06-14  9:14             ` David Chinner
2007-06-08 14:44       ` David Chinner
2007-06-19 22:22   ` XFS shrink (step 0) Ruben Porras
2007-06-19 23:42     ` David Chinner
2007-06-28 10:38       ` Ruben Porras
2007-06-29  6:55         ` David Chinner
2007-07-30 17:30           ` Ruben Porras
