public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* df bigger than ls?
@ 2012-03-07 15:54 Brian Candler
  2012-03-07 17:16 ` Brian Candler
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Candler @ 2012-03-07 15:54 UTC (permalink / raw)
  To: xfs

I have created some files spread across 12 XFS filesystems, using a
glusterfs "striped" volume.  This writes files with lots of holes(*) - so I
would expect the space reported by 'du' to be less than the space reported
by 'ls'.

However it's the other way round - du is reporting more space used than ls! 
Here's what I mean:  I'm looking at the files directly on the underlying
disk mount point, not via glusterfs at all.

$ ls -lh /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
-rw-rw-r-- 1 brian brian 1.1G 2012-03-07 15:19 /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
$ ls -lk /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
-rw-rw-r-- 1 brian brian 1059968 2012-03-07 15:19 /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
$ du -h /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
$ du -k /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2096392	/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff

And here's what xfs_bmap reports:

root@storage1:~# xfs_bmap -vp /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff:
 EXT: FILE-OFFSET         BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
   0: [0..255]:           2933325744..2933325999  2 (3059152..3059407)       256 00000
   1: [256..3071]:        hole                                              2816
   2: [3072..3327]:       2933326000..2933326255  2 (3059408..3059663)       256 00000
   3: [3328..6143]:       hole                                              2816
   4: [6144..8191]:       2933326472..2933328519  2 (3059880..3061927)      2048 00000
   5: [8192..9215]:       hole                                              1024
   6: [9216..13311]:      2933369480..2933373575  2 (3102888..3106983)      4096 00000
   7: [13312..15359]:     hole                                              2048
   8: [15360..23551]:     2933375624..2933383815  2 (3109032..3117223)      8192 00000
   9: [23552..24575]:     hole                                              1024
  10: [24576..40959]:     2933587168..2933603551  2 (3320576..3336959)     16384 00000
  11: [40960..43007]:     hole                                              2048
  12: [43008..75775]:     2933623008..2933655775  2 (3356416..3389183)     32768 00000
  13: [75776..76799]:     hole                                              1024
  14: [76800..142335]:    2933656800..2933722335  2 (3390208..3455743)     65536 00000
  15: [142336..144383]:   hole                                              2048
  16: [144384..275455]:   2933724384..2933855455  2 (3457792..3588863)    131072 00000
  17: [275456..276479]:   hole                                              1024
  18: [276480..538623]:   2935019808..2935281951  2 (4753216..5015359)    262144 00000
  19: [538624..540671]:   hole                                              2048
  20: [540672..1064959]:  2935284000..2935808287  2 (5017408..5541695)    524288 00000
  21: [1064960..1065983]: hole                                              1024
  22: [1065984..2114559]: 2935809312..2936857887  2 (5542720..6591295)   1048576 00000
  23: [2114560..2116607]: hole                                              2048
  24: [2116608..2119935]: 2943037984..2943041311  2 (12771392..12774719)    3328 00000
root@storage1:~# 

Given that these values are all in 512-byte disk blocks, the total file size
is (2119935 + 1) * 512 which agrees with ls. And some proportion of it
is holes, so du should report less than this, shouldn't it?

(Aside: it starts off being 11/12th holes as expected, but after a while
isn't.  This may be a different bug, possibly in glusterfs itself)

I guess 'du' gets its info from stat(), and stat() also says the file is
using 4192784 blocks which is 2096392 KB:

$ stat /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
  File: `/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff'
  Size: 1085407232	Blocks: 4192784    IO Block: 4096   regular file
Device: 810h/2064d	Inode: 8595657903  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/   brian)   Gid: ( 1000/   brian)
Access: 2012-03-07 15:20:36.044365215 +0000
Modify: 2012-03-07 15:19:33.640364277 +0000
Change: 2012-03-07 15:19:33.640364277 +0000
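
For comparison, this is the behaviour one would normally expect from a sparse file, with st_blocks well below st_size. A minimal, filesystem-agnostic sketch; the exact block count depends on the filesystem:

```python
import os
import tempfile

# Write 4 KiB at the start and 4 KiB at the end of a 1 MiB file,
# leaving the middle as a hole on filesystems that support holes.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)
    os.lseek(fd, 1024 * 1024 - 4096, os.SEEK_SET)
    os.write(fd, b"x" * 4096)
    os.close(fd)
    st = os.stat(path)
    print(st.st_size)          # 1048576: what ls -l shows
    print(st.st_blocks * 512)  # usually far less: what du shows
finally:
    os.unlink(path)
```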

Finally, here is xfs_db dump of the inode:

# ls -i /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
8595657903 /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
...
xfs_db> inode 8595657903
xfs_db> p
core.magic = 0x494e
core.mode = 0100664
core.version = 2
core.format = 3 (btree)
core.nlinkv2 = 1
core.onlink = 0
core.projid_lo = 0
core.projid_hi = 0
core.uid = 1000
core.gid = 1000
core.flushiter = 231
core.atime.sec = Wed Mar  7 15:20:36 2012
core.atime.nsec = 044365215
core.mtime.sec = Wed Mar  7 15:19:33 2012
core.mtime.nsec = 640364277
core.ctime.sec = Wed Mar  7 15:19:33 2012
core.ctime.nsec = 640364277
core.size = 1085407232
core.nblocks = 262370
core.extsize = 0
core.nextents = 13
core.naextents = 1
core.forkoff = 15
core.aformat = 2 (extents)
core.dmevmask = 0
core.dmstate = 0
core.newrtbm = 0
core.prealloc = 0
core.realtime = 0
core.immutable = 0
core.append = 0
core.sync = 0
core.noatime = 0
core.nodump = 0
core.rtinherit = 0
core.projinherit = 0
core.nosymlinks = 0
core.extsz = 0
core.extszinherit = 0
core.nodefrag = 0
core.filestream = 0
core.gen = 1875303423
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:537294687
a.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,537253238,1,0]
xfs_db> 

Platform: ubuntu 11.10
Linux 3.0.0-15-server #26-Ubuntu SMP Fri Jan 20 19:07:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Have I missed something obvious, or is there a bug of some sort here?

Regards,

Brian.

(*) http://gluster.org/community/documentation//index.php/GlusterFS_Technical_FAQ#Stripe_behavior_not_working_as_expected

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-07 15:54 df bigger than ls? Brian Candler
@ 2012-03-07 17:16 ` Brian Candler
  2012-03-07 18:04   ` Eric Sandeen
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Candler @ 2012-03-07 17:16 UTC (permalink / raw)
  To: xfs

On Wed, Mar 07, 2012 at 03:54:39PM +0000, Brian Candler wrote:
> core.size = 1085407232
> core.nblocks = 262370

core.nblocks is correct here: space used = 262370 * 4 = 1049480 KB

(If I add up all the non-hole extents I get 2098944 blocks = 1049472 KB
so there are two extra blocks of something)

This begs the question of where stat() is getting its info from?

Ah... but I've found that after unmounting and remounting the filesystem
(which I had to do for xfs_db), du and stat report the correct info.

In fact, dropping the inode caches is sufficient to fix the problem:

root@storage1:~# du -h /disk*/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk10/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk11/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk12/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk2/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk3/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk4/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk5/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk6/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk7/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk8/scratch2/work/PRSRA1/PRSRA1.1.0.bff
2.0G	/disk9/scratch2/work/PRSRA1/PRSRA1.1.0.bff
root@storage1:~# echo 3 >/proc/sys/vm/drop_caches 
root@storage1:~# du -h /disk*/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk10/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk11/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk12/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk2/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk3/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk4/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk5/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk6/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk7/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk8/scratch2/work/PRSRA1/PRSRA1.1.0.bff
1.1G	/disk9/scratch2/work/PRSRA1/PRSRA1.1.0.bff
root@storage1:~# 

Very odd, but not really a major problem other than the confusion it causes.

Regards,

Brian.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-07 17:16 ` Brian Candler
@ 2012-03-07 18:04   ` Eric Sandeen
  2012-03-08  2:10     ` Dave Chinner
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Eric Sandeen @ 2012-03-07 18:04 UTC (permalink / raw)
  To: Brian Candler; +Cc: xfs

On 3/7/12 11:16 AM, Brian Candler wrote:
> On Wed, Mar 07, 2012 at 03:54:39PM +0000, Brian Candler wrote:
>> core.size = 1085407232
>> core.nblocks = 262370
> 
> core.nblocks is correct here: space used = 262370 * 4 = 1049480 KB
> 
> (If I add up all the non-hole extents I get 2098944 blocks = 1049472 KB
> so there are two extra blocks of something)
> 
> This begs the question of where stat() is getting its info from?
> 
> Ah... but I've found that after unmounting and remounting the filesystem
> (which I had to do for xfs_db), du and stat report the correct info.
> 
> In fact, dropping the inode caches is sufficient to fix the problem:

Yep.

XFS speculatively preallocates space off the end of a file.  The amount of
space allocated depends on the present size of the file and the amount of
available free space.  This can be overridden with "mount -o allocsize=64k"
(or another size).

$ git log --pretty=oneline fs/xfs | grep specul
b8fc82630ae289bb4e661567808afc59e3298dce xfs: speculative delayed allocation uses rounddown_power_of_2 badly
055388a3188f56676c21e92962fc366ac8b5cb72 xfs: dynamic speculative EOF preallocation

so:

# dd if=/dev/zero of=bigfile bs=1M count=1100 &>/dev/null
# ls -lh bigfile
-rw-r--r--. 1 root root 1.1G Mar  7 11:47 bigfile
# du -h bigfile
1.1G	bigfile

but:

# rm -f bigfile
# for I in `seq 1 1100`; do dd if=/dev/zero of=bigfile conv=notrunc bs=1M seek=$I count=1 &>/dev/null; done
# ls -lh bigfile
-rw-r--r--. 1 root root 1.1G Mar  7 11:49 bigfile
# du -h bigfile
2.0G	bigfile

This should get freed when the inode is dropped from the cache; hence your cache drop bringing it back to size.

But there does seem to be an issue here; if I make a 4G filesystem and repeat the above test 3 times, the 3rd run gets ENOSPC, and the last file written comes up short, while the first one retains all its extra preallocated space:

# du -hc bigfile*
2.0G	bigfile1
1.1G	bigfile2
907M	bigfile3

Dave, is this working as intended?  I know the speculative preallocation amount for new files is supposed to go down as the fs fills, but is there no way to discard prealloc space to avoid ENOSPC on other files?

-Eric


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-07 18:04   ` Eric Sandeen
@ 2012-03-08  2:10     ` Dave Chinner
  2012-03-08  2:17       ` Eric Sandeen
  2012-03-08 16:23       ` Ben Myers
  2012-03-08  8:04     ` Arkadiusz Miśkiewicz
  2012-03-08  8:50     ` Brian Candler
  2 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2012-03-08  2:10 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs, Brian Candler

On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> On 3/7/12 11:16 AM, Brian Candler wrote:
> > On Wed, Mar 07, 2012 at 03:54:39PM +0000, Brian Candler wrote:
> >> core.size = 1085407232
> >> core.nblocks = 262370
> > 
> > core.nblocks is correct here: space used = 262370 * 4 = 1049480 KB
> > 
> > (If I add up all the non-hole extents I get 2098944 blocks = 1049472 KB
> > so there are two extra blocks of something)
> > 
> > This begs the question of where stat() is getting its info from?

stat(2) was also reporting delayed allocation reservations that are
only kept in memory.

....

> so:
> 
> # dd if=/dev/zero of=bigfile bs=1M count=1100 &>/dev/null
> # ls -lh bigfile
> -rw-r--r--. 1 root root 1.1G Mar  7 11:47 bigfile
> # du -h bigfile
> 1.1G	bigfile
> 
> but:
> 
> # rm -f bigfile
> # for I in `seq 1 1100`; do dd if=/dev/zero of=bigfile conv=notrunc bs=1M seek=$I count=1 &>/dev/null; done
> # ls -lh bigfile
> -rw-r--r--. 1 root root 1.1G Mar  7 11:49 bigfile
> # du -h bigfile
> 2.0G	bigfile

This is tripping the NFS server write pattern heuristic, i.e. it is
detecting repeated open/write-at-EOF/close patterns and so is not
truncating away the speculative EOF reservation on close().  This
is what prevents fragmentation of files being written concurrently
with this pattern.

> This should get freed when the inode is dropped from the cache;
> hence your cache drop bringing it back to size.

Right. It assumes that once you've triggered that heuristic, the
preallocation needs to last for as long as the inode is in the
working set. The inode cache tracks the current working set, so the
preallocation release is tied to cache eviction.

> But there does seem to be an issue here; if I make a 4G filesystem
> and repeat the above test 3 times, the 3rd run gets ENOSPC, and
> the last file written comes up short, while the first one retains
> all it's extra preallocated space:
> 
> # du -hc bigfile*
> 2.0G	bigfile1
> 1.1G	bigfile2
> 907M	bigfile3
> 
> Dave, is this working as intended?

Yes. Your problem is that you have a very small filesystem, which is
not the case that we optimise XFS for. :/

> I know the speculative
> preallocation amount for new files is supposed to go down as the
> fs fills, but is there no way to discard prealloc space to avoid
> ENOSPC on other files?

We don't track which files have currently active preallocations; we
only reduce the preallocation size as the filesystem nears ENOSPC.
This generally works just fine in situations where the filesystem
size is significantly greater than the maximum extent size, i.e. the
common case.

The problem you are tripping over here is that the maximum extent
size is greater than the filesystem size, so the preallocation size
is also greater than the filesystem size and hence can contribute
significantly to premature ENOSPC. I see two possible ways to
minimise this problem:

	1. reduce the maximum speculative preallocation size based
	on filesystem size at mount time.

	2. track inodes with active speculative preallocation and
	have an enospc based trigger that can find them and truncate
	away excess idle speculative preallocation.

The first is relatively easy to do, but will only reduce the
incidence of your problem - we still need to allow significant
preallocation sizes (e.g. 64MB) to avoid the fragmentation problems.

The second is needed to reclaim the space we've already preallocated
but is not being used. That's more complex to do - probably a radix
tree bit and a periodic background scan to reduce the time window
the preallocation sits around from cache lifetime to "idle for some
time" along with an on-demand, synchronous ENOSPC scan. This will
need some more thought as to how to do it effectively, but isn't
impossible to do....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-08  2:10     ` Dave Chinner
@ 2012-03-08  2:17       ` Eric Sandeen
  2012-03-08  9:10         ` Brian Candler
  2012-03-08 16:23       ` Ben Myers
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Sandeen @ 2012-03-08  2:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, Brian Candler

On 3/7/12 8:10 PM, Dave Chinner wrote:
> On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:

...

>> # du -hc bigfile*
>> 2.0G	bigfile1
>> 1.1G	bigfile2
>> 907M	bigfile3
>>
>> Dave, is this working as intended?
> 
> Yes. Your problem is that you have a very small filesystem, which is
> not the case that we optimise XFS for. :/

Well, sure ;)

>> I know the speculative
>> preallocation amount for new files is supposed to go down as the
>> fs fills, but is there no way to discard prealloc space to avoid
>> ENOSPC on other files?
> 
> We don't track which files have currently active preallocations; we
> only reduce the preallocation size as the filesystem nears ENOSPC.
> This generally works just fine in situations where the filesystem
> size is significantly greater than the maximum extent size, i.e. the
> common case.
> 
> The problem you are tripping over here is that the maximum extent
> size is greater than the filesystem size, so the preallocation size
> is also greater than the filesystem size and hence can contribute
> significantly to premature ENOSPC. I see two possible ways to
> minimise this problem:

FWIW I'm not overly worried about my 4G filesystem, that was just
an obviously extreme case to test the behavior.

I'd be more interested to know if the behavior was causing any
issues for Brian's case, or if it was just confusing.  :)

> 	1. reduce the maximum speculative preallocation size based
> 	on filesystem size at mount time.
> 
> 	2. track inodes with active speculative preallocation and
> 	have an enospc based trigger that can find them and truncate
> 	away excess idle speculative preallocation.
> 
> The first is relatively easy to do, but will only reduce the
> incidence of your problem - we still need to allow significant
> preallocation sizes (e.g. 64MB) to avoid the fragmentation problems.

Might be worth it; even though a 4G fs is obviously not a design point,
it's good not to have diabolical behavior.  Although I suppose if xfstests
doesn't hit it in everything it does, it's probably not a big deal.
ISTR you did have to fix one thing up, though.

> The second is needed to reclaim the space we've already preallocated
> but is not being used. That's more complex to do - probably a radix
> tree bit and a periodic background scan to reduce the time window
> the preallocation sits around from cache lifetime to "idle for some
> time" along with an on-demand, synchronous ENOSPC scan. This will
> need some more thought as to how to do it effectively, but isn't
> impossible to do....

It seems worth thinking about.  I guess I'm still a little concerned
about the ENOSPC case; it could lead to some confusion - I could imagine
several hundreds of gigs under preallocation, with a reasonable-sized
filesystem returning ENOSPC quite early.

-Eric



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-07 18:04   ` Eric Sandeen
  2012-03-08  2:10     ` Dave Chinner
@ 2012-03-08  8:04     ` Arkadiusz Miśkiewicz
  2012-03-08 10:03       ` Dave Chinner
  2012-03-08  8:50     ` Brian Candler
  2 siblings, 1 reply; 16+ messages in thread
From: Arkadiusz Miśkiewicz @ 2012-03-08  8:04 UTC (permalink / raw)
  To: xfs

On Wednesday 07 of March 2012, Eric Sandeen wrote:

> XFS speculatively preallocates space off the end of a file.  The amount of
> space allocated depends on the present size of the file and the amount of
> available free space.  This can be overridden with "mount -o allocsize=64k"
> (or another size).

What was the default before speculative preallocation was added (or changed
somewhere in the 2.6.3x line, AFAIK)?

-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-07 18:04   ` Eric Sandeen
  2012-03-08  2:10     ` Dave Chinner
  2012-03-08  8:04     ` Arkadiusz Miśkiewicz
@ 2012-03-08  8:50     ` Brian Candler
  2012-03-08  9:59       ` Brian Candler
  2 siblings, 1 reply; 16+ messages in thread
From: Brian Candler @ 2012-03-08  8:50 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> XFS speculatively preallocates space off the end of a file.  The amount of
> space allocated depends on the present size of the file and the amount of
> available free space.  This can be overridden with "mount -o allocsize=64k"
> (or another size).

Aha.  This may well be what is screwing up gluster's disk usage on a striped
volume - I believe XFS is preallocating space which is actually going to end
up being a hole!

Here are the extent maps for two of the twelve files in my stripe:

root@storage1:~# xfs_bmap /disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff 
/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff:
	0: [0..255]: 2933325744..2933325999
	1: [256..3071]: hole
	2: [3072..3327]: 2933326000..2933326255
	3: [3328..6143]: hole
	4: [6144..8191]: 2933326472..2933328519
	5: [8192..9215]: hole
	6: [9216..13311]: 2933369480..2933373575
	7: [13312..15359]: hole
	8: [15360..23551]: 2933375624..2933383815
	9: [23552..24575]: hole
	10: [24576..40959]: 2933587168..2933603551
	11: [40960..43007]: hole
	12: [43008..75775]: 2933623008..2933655775
	13: [75776..76799]: hole
	14: [76800..142335]: 2933656800..2933722335
	15: [142336..144383]: hole
	16: [144384..275455]: 2933724384..2933855455
	17: [275456..276479]: hole
	18: [276480..538623]: 2935019808..2935281951
	19: [538624..540671]: hole
	20: [540672..1064959]: 2935284000..2935808287
	21: [1064960..1065983]: hole
	22: [1065984..2114559]: 2935809312..2936857887
	23: [2114560..2116607]: hole
	24: [2116608..2119935]: 2943037984..2943041311
root@storage1:~# xfs_bmap /disk2/scratch2/work/PRSRA1/PRSRA1.1.0.bff 
/disk2/scratch2/work/PRSRA1/PRSRA1.1.0.bff:
	0: [0..255]: hole
	1: [256..511]: 2933194944..2933195199
	2: [512..3327]: hole
	3: [3328..3839]: 2933195200..2933195711
	4: [3840..6399]: hole
	5: [6400..8447]: 2933204416..2933206463
	6: [8448..9471]: hole
	7: [9472..13567]: 2933328792..2933332887
	8: [13568..15615]: hole
	9: [15616..23807]: 2933334936..2933343127
	10: [23808..24831]: hole
	11: [24832..41215]: 2933344152..2933360535
	12: [41216..43263]: hole
	13: [43264..76031]: 2934672032..2934704799
	14: [76032..77055]: hole
	15: [77056..142591]: 2934705824..2934771359
	16: [142592..144639]: hole
	17: [144640..275711]: 2934773408..2934904479
	18: [275712..276735]: hole
	19: [276736..538879]: 2934343328..2934605471
	20: [538880..540927]: hole
	21: [540928..1065215]: 2935498152..2936022439
	22: [1065216..1066239]: hole
	23: [1066240..2114815]: 2936023464..2937072039
	24: [2114816..2116863]: hole
	25: [2116864..2120191]: 2937074088..2937077415

You can see that at the start it works fine. There is a stripe size of
256 blocks, so:

* disk 1:    data for 1 x 256 blocks     <-- stripe 0, chunk 0
             hole for 11 x 256 blocks
             data for 1 x 256 block      <-- stripe 0, chunk 1
             ...

* disk 2:    hole for 1 x 256 blocks
             data for 1 x 256 blocks     <-- stripe 1, chunk 0
             hole for 11 x 256 blocks
             data for 1 x 256 blocks     <-- stripe 1, chunk 1
             ...

But after four chunks it gets screwed up.  By the end the files are mostly
extents with hardly any holes.  The extent sizes increase roughly in powers
of two, which seems to match the speculative preallocation algorithm.
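
The rough doubling is consistent with the dynamic EOF preallocation mentioned earlier (the "dynamic speculative EOF preallocation" commit): each extension speculatively preallocates on the order of the current file size, rounded to a power of two. A toy model (not the actual kernel code) reproduces the observed data-extent lengths:

```python
def rounddown_power_of_2(n):
    # Largest power of two <= n; toy stand-in for the kernel helper.
    return 1 << (n.bit_length() - 1)

# Toy model: each append preallocates about the current file size,
# rounded down to a power of two. Start from the 2048-block extent
# seen in the maps above.
size = 2048
lengths = []
while size <= 1048576:
    step = rounddown_power_of_2(size)
    lengths.append(step)
    size += step

print(lengths)
# [2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576]
```

Those are exactly the data-extent lengths seen from extent 4 onwards in the maps above.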

I think this ought to be fixable. For example, if you seek *into* the
preallocated area and start writing, could you change the preallocation to
start at this location with a hole before?

(Would that mess up 'seeky' workloads like databases?  They would
have ended up creating holes on filesystems which don't have
preallocation, so I doubt they do this.)

Or for a more sledgehammer approach: if a file already contains any holes
then you could just disable preallocation completely.

Regards,

Brian.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-08  2:17       ` Eric Sandeen
@ 2012-03-08  9:10         ` Brian Candler
  2012-03-08  9:28           ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Candler @ 2012-03-08  9:10 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Wed, Mar 07, 2012 at 08:17:58PM -0600, Eric Sandeen wrote:
> It seems worth thinking about.  I guess I'm still a little concerned
> about the ENOSPC case; it could lead to some confusion - I could imagine
> several hundreds of gigs under preallocation, with a reasonable-sized
> filesystem returning ENOSPC quite early.

And presumably df on the filesystem would also show it approaching 100%
utilisation?

I'm used to this where a large file has been unlinked but is still open. 
The preallocation case is a new one to me though.

How about if the total of all preallocations were limited to some small
percentage of the total filesystem size?  If you reach this limit and want
to preallocate some space for another file you'd have to either drop or
shrink an older preallocation.
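
A hypothetical sketch of that idea; every name here is invented and nothing like this exists in XFS:

```python
class PreallocBudget:
    """Toy model: cap total speculative preallocation at a fraction of the
    filesystem size, reclaiming the oldest preallocations when full."""

    def __init__(self, fs_blocks, max_fraction=0.05):
        self.limit = int(fs_blocks * max_fraction)
        self.used = 0
        self.per_file = {}  # inode -> preallocated blocks, insertion-ordered

    def request(self, inode, blocks):
        # Drop the oldest preallocations until the new request fits
        # under the global limit.
        while self.used + blocks > self.limit and self.per_file:
            oldest = next(iter(self.per_file))
            self.used -= self.per_file.pop(oldest)
        granted = min(blocks, self.limit - self.used)
        if granted > 0:
            self.per_file[inode] = self.per_file.get(inode, 0) + granted
            self.used += granted
        return granted

budget = PreallocBudget(fs_blocks=1_000_000)  # 5% cap = 50000 blocks
print(budget.request("ino1", 40000))  # 40000
print(budget.request("ino2", 40000))  # ino1's preallocation is dropped first
```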

Regards,

Brian.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: df bigger than ls?
  2012-03-08  9:10         ` Brian Candler
@ 2012-03-08  9:28           ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2012-03-08  9:28 UTC (permalink / raw)
  To: Brian Candler; +Cc: Eric Sandeen, xfs

On Thu, Mar 08, 2012 at 09:10:33AM +0000, Brian Candler wrote:
> On Wed, Mar 07, 2012 at 08:17:58PM -0600, Eric Sandeen wrote:
> > It seems worth thinking about.  I guess I'm still a little concerned
> > about the ENOSPC case; it could lead to some confusion - I could imagine
> > several hundreds of gigs under preallocation, with a reasonable-sized
> > filesystem returning ENOSPC quite early.
> 
> And presumably df on the filesystem would also show it approaching 100%
> utilisation?

*nod*

> I'm used to this where a large file has been unlinked but is still open. 
> The preallocation case is a new one to me though.
> 
> How about if the total of all preallocations were limited to some small
> percentage of the total filesystem size?  If you reach this limit and want
> to preallocate some space for another file you'd have to either drop or
> shrink an older preallocation.

There is no separate accounting for preallocation - it is considered
used space so this currently can't be done even if there was some
method for tracking and trimming speculatively preallocated space.

Realistically, if you aren't running out of space there is no reason
to limit speculative preallocation. Indeed, if we didn't add
delalloc blocks to the block count in stat(2) output so they showed
up in du, almost no-one would even know that this is done....
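
The distinction being drawn here is visible from userspace with plain stat(2): du(1) reports allocated blocks (st_blocks), ls -l reports nominal length (st_size). A small Python sketch (a sparse file is used just to make the two numbers differ; path is a placeholder):

```python
import os
import tempfile

def du_bytes(path):
    # What du(1) reports: st_blocks is in 512-byte sectors and counts
    # blocks actually allocated - including, on XFS, delalloc and
    # speculative preallocation.  ls -l shows st_size instead.
    return os.stat(path).st_blocks * 512

# A sparse file makes the two numbers diverge the other way:
path = os.path.join(tempfile.mkdtemp(), "sparse")
with open(path, "wb") as f:
    f.truncate(1 << 20)      # 1 MiB hole, no blocks written

print(os.path.getsize(path), du_bytes(path))   # e.g. 1048576 0
```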

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: df bigger than ls?
  2012-03-08  8:50     ` Brian Candler
@ 2012-03-08  9:59       ` Brian Candler
  2012-03-08 10:22         ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Candler @ 2012-03-08  9:59 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Thu, Mar 08, 2012 at 08:50:35AM +0000, Brian Candler wrote:
> Aha.  This may well be what is screwing up gluster's disk usage on a striped
> volume - I believe XFS is preallocating space which is actually going to end
> up being a hole!

Here is a standalone testcase.

$ for i in {0..19}; do dd if=/dev/zero of=testfile bs=128k count=1 seek=$[$i * 12]; done
$ xfs_bmap testfile
testfile:
	0: [0..255]: 1465133392..1465133647
	1: [256..3071]: hole
	2: [3072..5119]: 1465136464..1465138511
	3: [5120..6143]: hole
	4: [6144..10239]: 1465139536..1465143631
	5: [10240..12287]: hole
	6: [12288..20479]: 1465145680..1465153871
	7: [20480..21503]: hole
	8: [21504..37887]: 1465154896..1465171279
	9: [37888..39935]: hole
	10: [39936..58623]: 1465173328..1465192015

I expected to see: 20 extents of 256 blocks and 19 holes of 2816 blocks.
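
The same write pattern can be reproduced without dd; a rough Python equivalent of the loop above (file location is a placeholder, sizes as in the test case):

```python
import os
import tempfile

# Rough Python equivalent of the dd loop: twenty 128 KiB writes,
# spaced 12 * 128 KiB apart, leaving 1408 KiB gaps between them.
BS = 128 * 1024
path = os.path.join(tempfile.mkdtemp(), "testfile")
with open(path, "wb") as f:
    for i in range(20):
        f.seek(i * 12 * BS)
        f.write(b"\0" * BS)

st = os.stat(path)
# st_size is the nominal length; st_blocks * 512 is what du sees,
# and on XFS may include speculative preallocation past each write.
print(st.st_size, st.st_blocks * 512)
```

On a filesystem doing speculative EOF preallocation, the allocated-block figure can exceed what the twenty 128 KiB regions alone would need.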

Regards,

Brian.


* Re: df bigger than ls?
  2012-03-08  8:04     ` Arkadiusz Miśkiewicz
@ 2012-03-08 10:03       ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2012-03-08 10:03 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: xfs

On Thu, Mar 08, 2012 at 09:04:12AM +0100, Arkadiusz Miśkiewicz wrote:
> On Wednesday 07 of March 2012, Eric Sandeen wrote:
> 
> > XFS speculatively preallocates space off the end of a file.  The amount of
> > space allocated depends on the present size of the file, and the amount of
> > available free space.  This can be overridden
> > with mount -o allocsize=64k (or other size for example)
> 
> What was the default before speculative preallocation was added (or changed 
> somewhere in 2.6.3x line afaik) ?

It's changed many times over the life of XFS.

Originally it was quite a large number and could happen anywhere in
the file - I can't remember exactly what it was.  Some time around
1997 on Irix it got brought back to 16 blocks per written region
because of issues with Irix NFS servers keeping tens of thousands of
files open (and hence never having speculative preallocation
removed), with multiple preallocations all through sparse files, and
hence running filesystems and user quotas out of space.

This 16 block size was the default when XFS was ported to Linux, but
it would also round up to the stripe unit size if the file was
larger than 512KB.  In 2004 the speculative preallocation was
reduced to EOF preallocation only instead of anywhere in a hole in a
file.  Somewhere along the line the default size got changed to
PAGE_SIZE rather than 16 blocks. Then the allocsize mount option was
introduced in 2005 to enable it to be increased for those that
needed large preallocation (typically PVR users), and the maximum
increased to 1GB.

It got broken by changes in 2008, fixed in 2009, and then made
dynamic in 2.6.38 and now it behaves more like it did in 1995 for
files being extended.

So there's a long history of changing the speculative preallocation
behaviour of XFS to adapt to changing storage characteristics,
workloads and capacities.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: df bigger than ls?
  2012-03-08  9:59       ` Brian Candler
@ 2012-03-08 10:22         ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2012-03-08 10:22 UTC (permalink / raw)
  To: Brian Candler; +Cc: Eric Sandeen, xfs

On Thu, Mar 08, 2012 at 09:59:32AM +0000, Brian Candler wrote:
> On Thu, Mar 08, 2012 at 08:50:35AM +0000, Brian Candler wrote:
> > Aha.  This may well be what is screwing up gluster's disk usage on a striped
> > volume - I believe XFS is preallocating space which is actually going to end
> > up being a hole!

How is the filesystem supposed to know that? All it sees is
extending writes, which is what triggers speculative preallocation.

> Here is a standalone testcase.
> 
> $ for i in {0..19}; do dd if=/dev/zero of=testfile bs=128k count=1 seek=$[$i * 12]; done

Yup, that's behaving exactly as expected there. When you seek past
the existing EOF, and there is a speculative preallocation between
the old EOF and the new EOF, it writes zeros to that range because
the assumption is that you are going to fill it with data.

There are applications that do this to trigger that exact
preallocation - Samba is a classic case because windows clients will
write one byte 128k beyond the current EOF to get NTFS to trigger
large preallocation, then send back and write the real data to the
server. In cases like these, you want the hole allocated and filled
with zeros before the real writes come in just in case the server
crashes between the single byte write and the real data being
written....
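
That client behaviour is easy to mimic; a minimal sketch (path and offset are illustrative):

```python
import os
import tempfile

# Mimic the client trick described above: a single byte written
# 128 KiB past EOF extends the file, and the filesystem must deal
# with the gap (XFS zero-fills any speculative preallocation that
# falls inside it).
path = os.path.join(tempfile.mkdtemp(), "smbfile")
with open(path, "wb") as f:
    f.seek(128 * 1024)   # well beyond the current (zero) EOF
    f.write(b"x")        # one byte is enough to trigger the allocation

print(os.path.getsize(path))   # 131073
```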

FWIW, XFS on Irix had special code to detect out of order NFS
writes and do a similar hole filling trick to avoid fragmentation.

There's no one correct behaviour when dealing with writes of this
sort. In some cases the current behaviour is perfect (and samba on
XFS is widely used), in other cases it won't be exactly what you
want.

Indeed, you can avoid this problem by using ftruncate() to extend
the file before writing, writing the regions in reverse order, using
fallocate() to allocate the exact blocks you want before writing, or
using the allocsize mount option to turn off the dynamic behaviour.
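
Of those options, fallocate() gives the most direct control. A sketch of the earlier 20-write test case using posix_fallocate (via Python's os module, Linux-only; path and sizes are illustrative) to claim each region's blocks up front:

```python
import os
import tempfile

# The fallocate() route: claim exactly the blocks each region needs
# before writing it.  posix_fallocate() extends EOF first, so the
# subsequent writes never extend the file and no speculative
# preallocation is left behind past EOF.
BS = 128 * 1024
path = os.path.join(tempfile.mkdtemp(), "prealloc")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    for i in range(20):
        off = i * 12 * BS
        os.posix_fallocate(fd, off, BS)     # allocate (and extend EOF)
        os.pwrite(fd, b"\0" * BS, off)      # write within allocated space
finally:
    os.close(fd)

print(os.path.getsize(path))
```

The gaps between regions are never allocated, so the file keeps its intended holes.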

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: df bigger than ls?
  2012-03-08  2:10     ` Dave Chinner
  2012-03-08  2:17       ` Eric Sandeen
@ 2012-03-08 16:23       ` Ben Myers
  2012-03-09  0:17         ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Myers @ 2012-03-08 16:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Candler, Eric Sandeen, xfs

On Thu, Mar 08, 2012 at 01:10:54PM +1100, Dave Chinner wrote:
> On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> > But there does seem to be an issue here; if I make a 4G filesystem
> > and repeat the above test 3 times, the 3rd run gets ENOSPC, and
> > the last file written comes up short, while the first one retains
> > all its extra preallocated space:
> > 
> > # du -hc bigfile*
> > 2.0G	bigfile1
> > 1.1G	bigfile2
> > 907M	bigfile3
> > 
> > Dave, is this working as intended?
> 
> Yes. Your problem is that you have a very small filesystem, which is
> not the case that we optimise XFS for. :/

I have seen a similar problem on some very large filesystems too.  This
is not just dependent upon the size of the filesystem, but also the
workload.  I think it is also a big problem for folks using quotas.

> > I know the speculative
> > preallocation amount for new files is supposed to go down as the
> > fs fills, but is there no way to discard prealloc space to avoid
> > ENOSPC on other files?
>
> I see two possible ways to
> minimise this problem:
>
> 	1. reduce the maximum speculative preallocation size based
> 	on filesystem size at mount time.
> 
> 	2. track inodes with active speculative preallocation and
> 	have an enospc based trigger that can find them and truncate
> 	away excess idle speculative preallocation.
>
> 
> The first is relatively easy to do, but will only reduce the
> incidence of your problem - we still need to allow significant
> preallocation sizes (e.g. 64MB) to avoid the fragmentation problems.
> 
> The second is needed to reclaim the space we've already preallocated
> but is not being used. That's more complex to do - probably a radix
> tree bit and a periodic background scan to reduce the time window
> the preallocation sits around from cache lifetime to "idle for some
> time" along with a on-demand, synchronous ENOSPC scan. This will
> need some more thought as to how to do it effectively, but isn't
> impossible to do....

Alex and I discussed this problem briefly awhile ago.  What is the best
way to lose when you hit ENOSPC (project quotas) or EDQUOT in
xfs_iomap_write_delay?  You want to be fair; one user hitting his quota
shouldn't be able to steal some other user's block reservations unless
you really are near ENOSPC for the entire filesystem.  

I suggested something like... track inodes with preallocated block
reservations in LRU order and by dquot, so that the poor fella who is at
EDQUOT will first clean up the preallocations that resulted in quota
being enforced, try again, and then work on preallocations of other
users only if it can help in his situation.  IIRC Alex shut me down when
he heard LRU.  ;)

Now that block reservations count toward quotas the symptom will
probably be a little different.

-Ben


* Re: df bigger than ls?
  2012-03-08 16:23       ` Ben Myers
@ 2012-03-09  0:17         ` Dave Chinner
  2012-03-09  1:56           ` Ben Myers
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2012-03-09  0:17 UTC (permalink / raw)
  To: Ben Myers; +Cc: Brian Candler, Eric Sandeen, xfs

On Thu, Mar 08, 2012 at 10:23:48AM -0600, Ben Myers wrote:
> On Thu, Mar 08, 2012 at 01:10:54PM +1100, Dave Chinner wrote:
> > On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> > > and repeat the above test 3 times, the 3rd run gets ENOSPC, and
> > > the last file written comes up short, while the first one retains
> > > all its extra preallocated space:
> > > 
> > > # du -hc bigfile*
> > > 2.0G	bigfile1
> > > 1.1G	bigfile2
> > > 907M	bigfile3
> > > 
> > > Dave, is this working as intended?
> > 
> > Yes. Your problem is that you have a very small filesystem, which is
> > not the case that we optimise XFS for. :/
> 
> I have seen a similar problem on some very large filesystems too.

So report the problems to the list, don't keep them to yourself. If
we don't hear about problems, then as far as we are concerned they
don't exist.

> This
> is not just dependent upon the size of the filesystem, but also the
> workload.  I think it is also a big problem for folks using quotas.

Nobody has reported significant quota problems, either....

> Alex and I discussed this problem briefly awhile ago.  What is the best
> way to lose when you hit ENOSPC (project quotas) or EDQUOT in
> xfs_iomap_write_delay?  You want to be fair; one user hitting his quota
> shouldn't be able to steal some other user's block reservations unless
> you really are near ENOSPC for the entire filesystem.  
> 
> I suggested something like... track inodes with preallocated block
> reservations in LRU order and by dquot, so that the poor fella who is at
> EDQUOT will first clean up the preallocations that resulted in quota
> being enforced, try again, and then work on preallocations of other
> users only if it can help in his situation.  IIRC Alex shut me down when
> he heard LRU.  ;)

And I agree with Alex. Nothing additional needs to be tracked on top
of inodes with speculative prealloc. Just the search filter would
need to be different (i.e. only select inodes with a specific dquot
attached).

> Now that block reservations count toward quotas the symptom will
> probably be a little different.

Block reservations have always counted towards quotas, it's just
that they were never reported.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: df bigger than ls?
  2012-03-09  0:17         ` Dave Chinner
@ 2012-03-09  1:56           ` Ben Myers
  2012-03-09  2:57             ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Myers @ 2012-03-09  1:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Candler, Eric Sandeen, xfs

On Fri, Mar 09, 2012 at 11:17:10AM +1100, Dave Chinner wrote:
> On Thu, Mar 08, 2012 at 10:23:48AM -0600, Ben Myers wrote:
> > On Thu, Mar 08, 2012 at 01:10:54PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> > Alex and I discussed this problem briefly awhile ago.  What is the best
> > way to lose when you hit ENOSPC (project quotas) or EDQUOT in
> > xfs_iomap_write_delay?  You want to be fair; one user hitting his quota
> > shouldn't be able to steal some other user's block reservations unless
> > you really are near ENOSPC for the entire filesystem.  
> > 
> > I suggested something like... track inodes with preallocated block
> > reservations in LRU order and by dquot, so that the poor fella who is at
> > EDQUOT will first clean up the preallocations that resulted in quota
> > being enforced, try again, and then work on preallocations of other
> > users only if it can help in his situation.  IIRC Alex shut me down when
> > he heard LRU.  ;)
> 
> And I agree with Alex. Nothing additional needs to be tracked on top
> of inodes with speculative prealloc. Just the search filter would
> need to be different (i.e. only select inodes with a specific dquot
> attached).

Yeah, maybe not.  A single list of inodes with speculative prealloc does
seem a good place to start.  Later maybe you would not want to
scan/filter through all of them and we could add additional lists for
dquots.  My suggestion of using LRU was because I think that the oldest
unused speculative preallocs should be the first to go.  One chronic
over-quota user shouldn't be able to punish everyone else.

> > Now that block reservations count toward quotas the symptom will
> > probably be a little different.
> 
> Block reservations have always counted towards quotas, it's just
> that they were never reported.

Aw.  My mistake.  ;)

-Ben


* Re: df bigger than ls?
  2012-03-09  1:56           ` Ben Myers
@ 2012-03-09  2:57             ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2012-03-09  2:57 UTC (permalink / raw)
  To: Ben Myers; +Cc: Brian Candler, Eric Sandeen, xfs

On Thu, Mar 08, 2012 at 07:56:03PM -0600, Ben Myers wrote:
> On Fri, Mar 09, 2012 at 11:17:10AM +1100, Dave Chinner wrote:
> > On Thu, Mar 08, 2012 at 10:23:48AM -0600, Ben Myers wrote:
> > > On Thu, Mar 08, 2012 at 01:10:54PM +1100, Dave Chinner wrote:
> > > > On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> > > Alex and I discussed this problem briefly awhile ago.  What is the best
> > > way to lose when you hit ENOSPC (project quotas) or EDQUOT in
> > > xfs_iomap_write_delay?  You want to be fair; one user hitting his quota
> > > shouldn't be able to steal some other user's block reservations unless
> > > you really are near ENOSPC for the entire filesystem.  
> > > 
> > > I suggested something like... track inodes with preallocated block
> > > reservations in LRU order and by dquot, so that the poor fella who is at
> > > EDQUOT will first clean up the preallocations that resulted in quota
> > > being enforced, try again, and then work on preallocations of other
> > > users only if it can help in his situation.  IIRC Alex shut me down when
> > > he heard LRU.  ;)
> > 
> > And I agree with Alex. Nothing additional needs to be tracked on top
> > of inodes with speculative prealloc. Just the search filter would
> > need to be different (i.e. only select inodes with a specific dquot
> > attached).
> 
> Yeah, maybe not.  A single list of inodes with speculative prealloc does
> seem a good place to start.

Except it involves growing the inode by a struct list_head. Keeping
per-inode memory usage down is extremely important, and growing it
by 16 bytes for a rare corner case is not a particularly good
tradeoff.  Especially as using a radix tree tag for tracking doesn't
involve any increase in memory usage and tag based scans are
lock-free and quite efficient.

> Later maybe you would not want to
> scan/filter through all of them and we could add additional lists for
> dquots.  My suggestion of using LRU was because I think that the oldest
> unused speculative preallocs should be the first to go.  One chronic
> over-quota user shouldn't be able to punish everyone else.

Sure. But we only need to check if the inode with prealloc
belongs to the user/group/project that hit EDQUOT before taking
action. We don't need an LRU for that.

Indeed, I don't believe that removing the entire preallocation on
each inode is the right thing to do, either. Chopping it in half is
probably the right thing to do, so that we free up lots of space in
the case of excessive (large) preallocation but don't cause
significant extra fragmentation for the work that currently requires
it.

If we still get EDQUOT after trimming speculative preallocations, we
can try a second time. If reducing speculative preallocations by 75%
doesn't avoid the EDQUOT (or ENOSPC), then I think we can consider it
a real error....
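
As a toy model (invented names, nothing resembling the kernel code), the halve-and-retry policy described above reads roughly as:

```python
# On each failed attempt, halve every idle speculative preallocation
# and return the reclaimed space to the free pool; after `rounds`
# halvings (two, i.e. a 75% reduction) a persistent failure is
# treated as a real ENOSPC/EDQUOT.
def alloc_with_trim(need, free, preallocs, rounds=2):
    for attempt in range(rounds + 1):
        if need <= free:
            return free - need, preallocs
        if attempt == rounds:
            break
        reclaimed = sum(p - p // 2 for p in preallocs)
        preallocs = [p // 2 for p in preallocs]
        free += reclaimed
    raise OSError("ENOSPC")
```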

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2012-03-09  2:57 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-07 15:54 df bigger than ls? Brian Candler
2012-03-07 17:16 ` Brian Candler
2012-03-07 18:04   ` Eric Sandeen
2012-03-08  2:10     ` Dave Chinner
2012-03-08  2:17       ` Eric Sandeen
2012-03-08  9:10         ` Brian Candler
2012-03-08  9:28           ` Dave Chinner
2012-03-08 16:23       ` Ben Myers
2012-03-09  0:17         ` Dave Chinner
2012-03-09  1:56           ` Ben Myers
2012-03-09  2:57             ` Dave Chinner
2012-03-08  8:04     ` Arkadiusz Miśkiewicz
2012-03-08 10:03       ` Dave Chinner
2012-03-08  8:50     ` Brian Candler
2012-03-08  9:59       ` Brian Candler
2012-03-08 10:22         ` Dave Chinner
