public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* posix_fallocate
@ 2010-05-07  8:22 Krzysztof Błaszkowski
  2010-05-07  9:23 ` posix_fallocate Stan Hoeppner
  2010-05-07 16:26 ` posix_fallocate Eric Sandeen
  0 siblings, 2 replies; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-07  8:22 UTC (permalink / raw)
  To: xfs

Hello,

I use posix_fallocate() to preallocate large amounts of space, but I have found 
an issue. posix_fallocate() works fine with sizes like 100G, 1T and even 10T on 
some boxes (on some others it can fail above e.g. a 7T threshold), but if I try 
e.g. 16T, the user-space process stays "R"unning forever and is not 
interruptible. Furthermore, some other unrelated processes such as sshd and 
bash enter the D state. There is nothing in the kernel log.
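For reference, the failing pattern boils down to a single posix_fallocate() call with a huge length at offset 0. A minimal sketch in Python (os.posix_fallocate wraps the same libc call; a small size is used here so the sketch is safe to run anywhere, while the failing case used 16T, i.e. 16 << 40 bytes):

```python
import os
import tempfile

# Preallocate space with a single posix_fallocate() call, the same
# pattern that hangs at ~16T on the affected boxes. A small size is
# used here so the sketch is safe to run.
size = 1 << 20  # 1 MiB for the demo; the failing case was 16 << 40 (16T)

fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, size)  # offset 0, length `size`
    st = os.fstat(fd)
    print(st.st_size)  # file size is now at least `size`
finally:
    os.close(fd)
    os.unlink(path)
```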

So far I have made a few logs with the ftrace facility for 1G, 100G, 1T and 10T 
sizes. I noticed that for the first three sizes the log is about 1.5 MB long 
(2 MB peak), while 10T generates a 94 MB log. I couldn't retrieve a log for the 
17T case because "cat /sys ... /trace" enters the D state.

I would appreciate any help, because I have given up on analysing the ftrace 
logs. xfs_vn_fallocate is covered by about 11k lines in the 1.5 MB log case, 
while there are about 163k lines in the 94 MB log. All I could see is possibly 
some relationship between the time spent in xfs_vn_fallocate's subfunctions and 
the requested space.

Box details:
16 Hitachi 2TB drives (backplane connected), dm, one 25T LVM LUN,
kernel 2.6.31.5; more recent kernels and XFS versions have not been tested.


Regards,
Krzysztof Blaszkowski

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: posix_fallocate
  2010-05-07  8:22 posix_fallocate Krzysztof Błaszkowski
@ 2010-05-07  9:23 ` Stan Hoeppner
  2010-05-07  9:48   ` posix_fallocate Krzysztof Błaszkowski
  2010-05-07 10:07   ` posix_fallocate Krzysztof Błaszkowski
  2010-05-07 16:26 ` posix_fallocate Eric Sandeen
  1 sibling, 2 replies; 15+ messages in thread
From: Stan Hoeppner @ 2010-05-07  9:23 UTC (permalink / raw)
  To: xfs

Krzysztof Błaszkowski put forth on 5/7/2010 3:22 AM:
> Hello,
> 
> I use this to preallocate large space but found an issue. Posix_fallocate 
> works right with sizes like 100G, 1T and even 10T on some boxes (on some 
> other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user 
> space process would be "R"unning forever and it is not interruptible. 
> Furthermore some other not related processes like sshd, bash enter D state. 
> There is nothing in kernel log.
> 
> I made so far a few logs with ftrace facility for 1G, 100G, 1T and 10T sizes. 
> I noticed that for 1st three sizes the log is as long as abt 1.5M (2M peak) 
> while 10T generates 94M long log. I couldn't retrieve a log for 17T case 
> because "cat /sys ... /trace" enters D.
> 
> I would appreciate any help because i gave up with ftrace logs analysis. The 
> xfs_vn_fallocate is covered in abt 11k lines for a 1.5M log case while there 
> are abt 163k lines in 94M log. And all i could see is poss some relationship 
> between time spent in xfs_vn_fallocate subfunctions vs requested space.
> 
> Box details:
> 16 Hitachi 2TB drives (backplane connected), dm, 1 lvm lun of 25T size,
> kernel 2.6.31.5, more recent kernels neither xfs were not tested.

32 or 64 bit kernel?  What is the size of the XFS filesystem on the 25TB LVM
LUN against which you're running posix_fallocate?  The reason I ask is that
XFS has a 16TB per filesystem limitation on 32 bit kernels.  I can only
assume that your XFS filesystem is larger than 16TB since you're attempting
to posix_fallocate 16TB.  But, it's best to ask for confirmation rather than
assume, especially given that your problem is appearing near that magical
16TB boundary.
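(As an aside, the 16TB figure falls out of simple arithmetic: on a 32-bit kernel the page cache index is a 32-bit value, and with the usual 4KiB pages that caps a file at 2^32 pages. A quick sanity check:)

```python
# 32-bit page cache index times 4 KiB page size = the 16 TiB limit
# that XFS hits on 32-bit kernels.
PAGE_SIZE = 4096        # 4 KiB, the usual x86 page size
MAX_PAGES = 1 << 32     # a 32-bit page index wraps past this
limit = MAX_PAGES * PAGE_SIZE
print(limit == 16 << 40)  # True: exactly 16 TiB
```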

-- 
Stan


* Re: posix_fallocate
  2010-05-07  9:23 ` posix_fallocate Stan Hoeppner
@ 2010-05-07  9:48   ` Krzysztof Błaszkowski
  2010-05-07 10:07   ` posix_fallocate Krzysztof Błaszkowski
  1 sibling, 0 replies; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-07  9:48 UTC (permalink / raw)
  To: xfs, Stan Hoeppner

On Friday 07 May 2010 11:23, Stan Hoeppner wrote:
> Krzysztof Błaszkowski put forth on 5/7/2010 3:22 AM:
> > Hello,
> >
> > I use this to preallocate large space but found an issue. Posix_fallocate
> > works right with sizes like 100G, 1T and even 10T on some boxes (on some
> > other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user
> > space process would be "R"unning forever and it is not interruptible.
> > Furthermore some other not related processes like sshd, bash enter D
> > state. There is nothing in kernel log.
> >
> > I made so far a few logs with ftrace facility for 1G, 100G, 1T and 10T
> > sizes. I noticed that for 1st three sizes the log is as long as abt 1.5M
> > (2M peak) while 10T generates 94M long log. I couldn't retrieve a log for
> > 17T case because "cat /sys ... /trace" enters D.
> >
> > I would appreciate any help because i gave up with ftrace logs analysis.
> > The xfs_vn_fallocate is covered in abt 11k lines for a 1.5M log case
> > while there are abt 163k lines in 94M log. And all i could see is poss
> > some relationship between time spent in xfs_vn_fallocate subfunctions vs
> > requested space.
> >
> > Box details:
> > 16 Hitachi 2TB drives (backplane connected), dm, 1 lvm lun of 25T size,
> > kernel 2.6.31.5, more recent kernels neither xfs were not tested.
>
> 32 or 64 bit kernel? 

Sorry, I meant 64-bit.

> What is the size of the XFS filesystem on the 25TB 
> LVM LUN against which you're running posix_fallocate?

XFS occupies the whole LUN (i.e. 25TB).

> The reason I ask is 
> that XFS has a 16TB per filesystem limitation on 32 bit kernels.  I can
> only assume that your XFS filesystem is larger than 16TB since you're
> attempting to posix_fallocate 16TB.  But, it's best to ask for confirmation
> rather than assume, especially given that your problem is appearing near
> that magical 16TB boundary.

Sure, I see. I use 64-bit by default.

Regards,
Krzysztof


* Re: posix_fallocate
  2010-05-07  9:23 ` posix_fallocate Stan Hoeppner
  2010-05-07  9:48   ` posix_fallocate Krzysztof Błaszkowski
@ 2010-05-07 10:07   ` Krzysztof Błaszkowski
  2010-05-07 10:42     ` posix_fallocate Stan Hoeppner
  1 sibling, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-07 10:07 UTC (permalink / raw)
  To: xfs; +Cc: Stan Hoeppner

Hello Stan,

It seems that your mail server blocks traffic from ".pl" domains, so don't be 
offended if you do not see my reply sent directly to your address.

(Remote host said: 550 5.7.1 <v007470.home.net.pl[212.85.125.104]>: Client 
host rejected: We do not accept mail from .pl domains)

Krzysztof

On Friday 07 May 2010 11:23, Stan Hoeppner wrote:
> Krzysztof Błaszkowski put forth on 5/7/2010 3:22 AM:
> > Hello,
> >
> > I use this to preallocate large space but found an issue. Posix_fallocate
> > works right with sizes like 100G, 1T and even 10T on some boxes (on some
> > other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user
> > space process would be "R"unning forever and it is not interruptible.
> > Furthermore some other not related processes like sshd, bash enter D
> > state. There is nothing in kernel log.
> >
> > I made so far a few logs with ftrace facility for 1G, 100G, 1T and 10T
> > sizes. I noticed that for 1st three sizes the log is as long as abt 1.5M
> > (2M peak) while 10T generates 94M long log. I couldn't retrieve a log for
> > 17T case because "cat /sys ... /trace" enters D.
> >
> > I would appreciate any help because i gave up with ftrace logs analysis.
> > The xfs_vn_fallocate is covered in abt 11k lines for a 1.5M log case
> > while there are abt 163k lines in 94M log. And all i could see is poss
> > some relationship between time spent in xfs_vn_fallocate subfunctions vs
> > requested space.
> >
> > Box details:
> > 16 Hitachi 2TB drives (backplane connected), dm, 1 lvm lun of 25T size,
> > kernel 2.6.31.5, more recent kernels neither xfs were not tested.
>
> 32 or 64 bit kernel?  What is the size of the XFS filesystem on the 25TB
> LVM LUN against which you're running posix_fallocate?  The reason I ask is
> that XFS has a 16TB per filesystem limitation on 32 bit kernels.  I can
> only assume that your XFS filesystem is larger than 16TB since you're
> attempting to posix_fallocate 16TB.  But, it's best to ask for confirmation
> rather than assume, especially given that your problem is appearing near
> that magical 16TB boundary.


* Re: posix_fallocate
  2010-05-07 10:07   ` posix_fallocate Krzysztof Błaszkowski
@ 2010-05-07 10:42     ` Stan Hoeppner
  2010-05-07 10:56       ` posix_fallocate Krzysztof Błaszkowski
  0 siblings, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2010-05-07 10:42 UTC (permalink / raw)
  To: xfs

Krzysztof Błaszkowski put forth on 5/7/2010 5:07 AM:
> Hello Stan,
> 
> It seems that your mail server blocks traffic from ".pl" domains so don't be 
> offended if you will not see my reply sent to your address.
> 
> (Remote host said: 550 5.7.1 <v007470.home.net.pl[212.85.125.104]>: Client 
> host rejected: We do not accept mail from .pl domains)

Sorry about that.  Due to the constant battle against spam I've implemented
some pretty draconian countermeasures over time, including some ccTLD
blocking of SMTP.  All my "overseas" contacts are via public mailing lists,
as in this case.  Very rarely does a conversation need to go "off list", so
I've not had much of a problem with this setup.  If you'd like I can
whitelist your address.

-- 
Stan



* Re: posix_fallocate
  2010-05-07 10:42     ` posix_fallocate Stan Hoeppner
@ 2010-05-07 10:56       ` Krzysztof Błaszkowski
  0 siblings, 0 replies; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-07 10:56 UTC (permalink / raw)
  To: xfs; +Cc: Stan Hoeppner

On Friday 07 May 2010 12:42, Stan Hoeppner wrote:
> Krzysztof Błaszkowski put forth on 5/7/2010 5:07 AM:
> > Hello Stan,
> >
> > It seems that your mail server blocks traffic from ".pl" domains so don't
> > be offended if you will not see my reply sent to your address.
> >
> > (Remote host said: 550 5.7.1 <v007470.home.net.pl[212.85.125.104]>:
> > Client host rejected: We do not accept mail from .pl domains)
>
> Sorry about that.  Due to the constant battle against spam I've implemented
> some pretty draconian countermeasures over time, including some ccTLD
> blocking of SMTP.  All my "overseas" contacts are via public mailing lists,
> as in this case.  Very rarely does a conversation need to go "off list", so
> I've not had much of a problem with this setup.  If you'd like I can
> whitelist your address.

You're welcome to. If it is not a big hassle, then go ahead.

thanks,
Krzysztof


* Re: posix_fallocate
  2010-05-07  8:22 posix_fallocate Krzysztof Błaszkowski
  2010-05-07  9:23 ` posix_fallocate Stan Hoeppner
@ 2010-05-07 16:26 ` Eric Sandeen
  2010-05-07 16:53   ` posix_fallocate Eric Sandeen
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2010-05-07 16:26 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: xfs

Krzysztof Błaszkowski wrote:
> Hello,
> 
> I use this to preallocate large space but found an issue. Posix_fallocate 
> works right with sizes like 100G, 1T and even 10T on some boxes (on some 
> other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user 
> space process would be "R"unning forever and it is not interruptible. 
> Furthermore some other not related processes like sshd, bash enter D state. 
> There is nothing in kernel log.
> 
> I made so far a few logs with ftrace facility for 1G, 100G, 1T and 10T sizes. 
> I noticed that for 1st three sizes the log is as long as abt 1.5M (2M peak) 
> while 10T generates 94M long log. I couldn't retrieve a log for 17T case 
> because "cat /sys ... /trace" enters D.
> 
> I would appreciate any help because i gave up with ftrace logs analysis. The 
> xfs_vn_fallocate is covered in abt 11k lines for a 1.5M log case while there 
> are abt 163k lines in 94M log. And all i could see is poss some relationship 
> between time spent in xfs_vn_fallocate subfunctions vs requested space.
> 
> Box details:
> 16 Hitachi 2TB drives (backplane connected), dm, 1 lvm lun of 25T size,
> kernel 2.6.31.5, more recent kernels neither xfs were not tested.

It'd be great if you could test a more recent kernel.

sysrq-t would give us all the backtraces, except I suppose not for the
running process...

I can try to scrape up >16T to test on at some point ...

-Eric


* Re: posix_fallocate
  2010-05-07 16:26 ` posix_fallocate Eric Sandeen
@ 2010-05-07 16:53   ` Eric Sandeen
  2010-05-07 22:16     ` posix_fallocate Dave Chinner
  2010-05-10  7:11     ` posix_fallocate Krzysztof Błaszkowski
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Sandeen @ 2010-05-07 16:53 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: xfs

Eric Sandeen wrote:
> Krzysztof Błaszkowski wrote:
>> Hello,
>>
>> I use this to preallocate large space but found an issue. Posix_fallocate 
>> works right with sizes like 100G, 1T and even 10T on some boxes (on some 
>> other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user 
>> space process would be "R"unning forever and it is not interruptible. 
>> Furthermore some other not related processes like sshd, bash enter D state. 
>> There is nothing in kernel log.

Oh, one thing you should know is that depending on your version of glibc,
posix_fallocate may be writing 0s and not using preallocation calls.

Do you know which yours is using?  strace should tell you on a small
file test.
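To illustrate the difference: a plain sparse extension (ftruncate) allocates no blocks, while posix_fallocate backs the range with real blocks whether it goes through the syscall or glibc's zero-writing fallback, so strace remains the way to see which path glibc actually took. A rough sketch (Python's os.posix_fallocate wraps the same libc call):

```python
import os
import tempfile

size = 1 << 20  # 1 MiB

# Sparse file: only the size is extended, no blocks are allocated.
fd1, p1 = tempfile.mkstemp()
os.ftruncate(fd1, size)
sparse_blocks = os.fstat(fd1).st_blocks

# Preallocated file: the whole range is backed by real blocks.
fd2, p2 = tempfile.mkstemp()
os.posix_fallocate(fd2, 0, size)
prealloc_blocks = os.fstat(fd2).st_blocks

print(sparse_blocks, prealloc_blocks)  # prealloc_blocks should be larger
for fd, p in ((fd1, p1), (fd2, p2)):
    os.close(fd)
    os.unlink(p)
```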

Anyway, I am seeing things get stuck around 8T it seems...

# touch /mnt/test/bigfile
# xfs_io -c "resvsp 0 16t" /mnt/test/bigfile 

... wait ... in other window ...

# du -hc /mnt/test/bigfile 
8.0G	/mnt/test/bigfile
8.0G	total

# echo t > /proc/sysrq-trigger 
# dmesg | grep -A20 xfs_io
xfs_io        R  running task     3576 29444  29362 0x00000006
 ffff8809cfbb4920 ffffffff81478d9f ffffffffa032d3c5 0000000000000246
 ffff8809cfbb4920 ffffffff814788bc 0000000000000000 ffffffff81ba3510
 ffff8809d3429a68 ffffffffa032b60f ffff8809d3429aa8 000000000000001e
Call Trace:
 [<ffffffff81478d9f>] ? __mutex_lock_common+0x36d/0x392
 [<ffffffffa032d3c5>] ? xfs_icsb_modify_counters+0x17f/0x1ac [xfs]
 [<ffffffffa032b60f>] ? xfs_icsb_unlock_all_counters+0x4d/0x60 [xfs]
 [<ffffffffa032b8bf>] ? xfs_icsb_disable_counter+0x8c/0x95 [xfs]
 [<ffffffff81478e88>] ? mutex_lock_nested+0x3e/0x43
 [<ffffffffa032d3d3>] ? xfs_icsb_modify_counters+0x18d/0x1ac [xfs]
 [<ffffffffa032d536>] ? xfs_mod_incore_sb+0x29/0x6e [xfs]
 [<ffffffffa033052c>] ? _xfs_trans_alloc+0x27/0x61 [xfs]
 [<ffffffffa03303d3>] ? xfs_trans_reserve+0x6c/0x19e [xfs]
 [<ffffffff8106fb45>] ? up_write+0x2b/0x32
 [<ffffffffa0335e55>] ? xfs_alloc_file_space+0x163/0x306 [xfs]
 [<ffffffff8107120a>] ? sched_clock_cpu+0xc3/0xce
 [<ffffffffa0336122>] ? xfs_change_file_space+0x12a/0x2b8 [xfs]
 [<ffffffff8106f9bf>] ? down_write_nested+0x80/0x8b
 [<ffffffffa031b8ce>] ? xfs_ilock+0x30/0xb4 [xfs]
 [<ffffffffa033e0e4>] ? xfs_vn_fallocate+0x80/0xf4 [xfs]
--
R         xfs_io 29444  86014624.786617       162   120  86014624.786617    137655.161327       408.979977 /

# uname -r
2.6.34-0.4.rc0.git2.fc14.x86_64

I'll look into it.

-Eric


* Re: posix_fallocate
  2010-05-07 16:53   ` posix_fallocate Eric Sandeen
@ 2010-05-07 22:16     ` Dave Chinner
  2010-05-10  7:11     ` posix_fallocate Krzysztof Błaszkowski
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2010-05-07 22:16 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs, Krzysztof Błaszkowski

On Fri, May 07, 2010 at 11:53:27AM -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > Krzysztof Błaszkowski wrote:
> >> Hello,
> >>
> >> I use this to preallocate large space but found an issue. Posix_fallocate 
> >> works right with sizes like 100G, 1T and even 10T on some boxes (on some 
> >> other can fail after e.g. 7T threshold) but if i tried e.g. 16T the user 
> >> space process would be "R"unning forever and it is not interruptible. 
> >> Furthermore some other not related processes like sshd, bash enter D state. 
> >> There is nothing in kernel log.
> 
> Oh, one thing you should know is that depending on your version of glibc,
> posix_fallocate may be writing 0s and not using preallocation calls.
> 
> Do you know which yours is using?  strace should tell you on a small
> file test.
> 
> Anyway, I am seeing things get stuck around 8T it seems...
> 
> # touch /mnt/test/bigfile
> # xfs_io -c "resvsp 0 16t" /mnt/test/bigfile 
> 
> ... wait ... in other window ...
> 
> # du -hc /mnt/test/bigfile 
> 8.0G	/mnt/test/bigfile
> 8.0G	total
> 
> # echo t > /proc/sysrq-trigger 
> # dmesg | grep -A20 xfs_io
> xfs_io        R  running task     3576 29444  29362 0x00000006
>  ffff8809cfbb4920 ffffffff81478d9f ffffffffa032d3c5 0000000000000246
>  ffff8809cfbb4920 ffffffff814788bc 0000000000000000 ffffffff81ba3510
>  ffff8809d3429a68 ffffffffa032b60f ffff8809d3429aa8 000000000000001e
> Call Trace:
>  [<ffffffff81478d9f>] ? __mutex_lock_common+0x36d/0x392
>  [<ffffffffa032d3c5>] ? xfs_icsb_modify_counters+0x17f/0x1ac [xfs]
>  [<ffffffffa032b60f>] ? xfs_icsb_unlock_all_counters+0x4d/0x60 [xfs]
>  [<ffffffffa032b8bf>] ? xfs_icsb_disable_counter+0x8c/0x95 [xfs]
>  [<ffffffff81478e88>] ? mutex_lock_nested+0x3e/0x43
>  [<ffffffffa032d3d3>] ? xfs_icsb_modify_counters+0x18d/0x1ac [xfs]
>  [<ffffffffa032d536>] ? xfs_mod_incore_sb+0x29/0x6e [xfs]
>  [<ffffffffa033052c>] ? _xfs_trans_alloc+0x27/0x61 [xfs]
>  [<ffffffffa03303d3>] ? xfs_trans_reserve+0x6c/0x19e [xfs]
>  [<ffffffff8106fb45>] ? up_write+0x2b/0x32
>  [<ffffffffa0335e55>] ? xfs_alloc_file_space+0x163/0x306 [xfs]
>  [<ffffffff8107120a>] ? sched_clock_cpu+0xc3/0xce
>  [<ffffffffa0336122>] ? xfs_change_file_space+0x12a/0x2b8 [xfs]
>  [<ffffffff8106f9bf>] ? down_write_nested+0x80/0x8b
>  [<ffffffffa031b8ce>] ? xfs_ilock+0x30/0xb4 [xfs]
>  [<ffffffffa033e0e4>] ? xfs_vn_fallocate+0x80/0xf4 [xfs]
> --
> R         xfs_io 29444  86014624.786617       162   120  86014624.786617    137655.161327       408.979977 /
> 
> # uname -r
> 2.6.34-0.4.rc0.git2.fc14.x86_64
> 
> I'll look into  it.

On my current delayed-logging branch on a 30GB filesystem:

# xfs_io -f -c "resvsp 0 16t" /mnt/scratch/bigfile 

And in dmesg:

[60173.119760] Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 475
[60173.121263] ------------[ cut here ]------------
[60173.121771] kernel BUG at fs/xfs/support/debug.c:109!
[60173.121771] invalid opcode: 0000 [#1] SMP
[60173.121771] last sysfs file: /sys/devices/virtio-pci/virtio2/block/vdb/removable
[60173.121771] CPU 7
[60173.121771] Modules linked in: [last unloaded: scsi_wait_scan]
[60173.121771]
[60173.121771] Pid: 3596, comm: xfs_io Not tainted 2.6.34-rc1-dgc #138 /Bochs
[60173.121771] RIP: 0010:[<ffffffff8135db5f>]  [<ffffffff8135db5f>] assfail+0x1f/0x30
[60173.121771] RSP: 0018:ffff880112043808  EFLAGS: 00010292
[60173.121771] RAX: 000000000000006d RBX: ffff880105038da0 RCX: 0000000000000000
[60173.121771] RDX: ffff880003600000 RSI: 0000000000000000 RDI: 0000000000000246
[60173.121771] RBP: ffff880112043808 R08: 0000000000000002 R09: 0000000000000000
[60173.121771] R10: ffffffff81a70bb8 R11: 0000000000000000 R12: ffffffffffe20082
[60173.121771] R13: ffff88011cea5000 R14: 0000000000000001 R15: 0000000000000000
[60173.121771] FS:  00007f0311cda6f0(0000) GS:ffff880003600000(0000) knlGS:0000000000000000
[60173.121771] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60173.121771] CR2: 00007f031164d750 CR3: 000000011bc59000 CR4: 00000000000006e0
[60173.121771] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[60173.121771] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[60173.121771] Process xfs_io (pid: 3596, threadinfo ffff880112042000, task ffff88011629a740)
[60173.121771] Stack:
[60173.121771]  ffff880112043838 ffffffff81342415 00000000001dff7e ffff880112043928
[60173.121771] <0> 00000000001dff7e 0000000000000004 ffff880112043858 ffffffff812e82ce
[60173.121771] <0> ffff880112043928 ffff88011cea5000 ffff8801120438c8 ffffffff812e9008
[60173.121771] Call Trace:
[60173.121771]  [<ffffffff81342415>] xfs_trans_mod_sb+0x2f5/0x330
[60173.121771]  [<ffffffff812e82ce>] xfs_alloc_ag_vextent+0x18e/0x2b0
[60173.121771]  [<ffffffff812e9008>] xfs_alloc_vextent+0x598/0x870
[60173.121771]  [<ffffffff812f9c0f>] xfs_bmap_btalloc+0x29f/0x7b0
[60173.121771]  [<ffffffff812f4af1>] ? xfs_bmap_search_multi_extents+0x71/0x110
[60173.121771]  [<ffffffff812fa141>] xfs_bmap_alloc+0x21/0x40
[60173.121771]  [<ffffffff812fd43c>] xfs_bmapi+0xf2c/0x1a90
[60173.121771]  [<ffffffff81333f55>] ? xlog_grant_log_space+0x35/0x640
[60173.121771]  [<ffffffff8132293b>] ? xfs_ilock+0x10b/0x190
[60173.121771]  [<ffffffff81349660>] xfs_alloc_file_space+0x190/0x440
[60173.121771]  [<ffffffff810b41cd>] ? trace_hardirqs_on+0xd/0x10
[60173.121771]  [<ffffffff8134c7a4>] xfs_change_file_space+0x2d4/0x380
[60173.121771]  [<ffffffff810a291e>] ? down_write_nested+0x9e/0xb0
[60173.121771]  [<ffffffff81322918>] ? xfs_ilock+0xe8/0x190
[60173.121771]  [<ffffffff81358a87>] xfs_vn_fallocate+0x87/0x110
[60173.121771]  [<ffffffff8112661c>] ? __do_fault+0x12c/0x450
[60173.121771]  [<ffffffff81124ebc>] ? might_fault+0x5c/0xb0
[60173.121771]  [<ffffffff81126889>] ? __do_fault+0x399/0x450
[60173.121771]  [<ffffffff8114cfb3>] do_fallocate+0x103/0x110
[60173.121771]  [<ffffffff8115dd8c>] ioctl_preallocate+0x8c/0xb0
[60173.121771]  [<ffffffff8115e1c5>] do_vfs_ioctl+0x415/0x5b0
[60173.121771]  [<ffffffff810a2a43>] ? up_read+0x23/0x40
[60173.121771]  [<ffffffff8115e3e1>] sys_ioctl+0x81/0xa0
[60173.121771]  [<ffffffff81036032>] system_call_fastpath+0x16/0x1b

So there's been a transaction overrun, which tends to imply we're
allocating too much in a single transaction. I'd say there's an
overflow happening somewhere in this path.
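The numbers do sit right on a 32-bit boundary, which makes an overflow plausible (illustrative arithmetic only, not a diagnosis of the actual bug): a 16T reservation expressed in 4KiB filesystem blocks is exactly 2^32 blocks, which no longer fits in a 32-bit counter.

```python
# A 16T reservation expressed in 4 KiB filesystem blocks lands
# exactly on the 32-bit boundary - illustrative arithmetic only.
BLOCK_SIZE = 4096             # a common XFS data block size
size = 16 << 40               # 16 TiB
nblocks = size // BLOCK_SIZE
print(nblocks == 1 << 32)           # True: one past what a u32 can hold
print(nblocks & 0xFFFFFFFF == 0)    # True: truncated to 32 bits it becomes 0
```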

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: posix_fallocate
  2010-05-07 16:53   ` posix_fallocate Eric Sandeen
  2010-05-07 22:16     ` posix_fallocate Dave Chinner
@ 2010-05-10  7:11     ` Krzysztof Błaszkowski
  2010-05-10 14:39       ` posix_fallocate Eric Sandeen
  1 sibling, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-10  7:11 UTC (permalink / raw)
  To: xfs; +Cc: Eric Sandeen

On Friday 07 May 2010 18:53, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > Krzysztof Błaszkowski wrote:
> >> Hello,
> >>
> >> I use this to preallocate large space but found an issue.
> >> Posix_fallocate works right with sizes like 100G, 1T and even 10T on
> >> some boxes (on some other can fail after e.g. 7T threshold) but if i
> >> tried e.g. 16T the user space process would be "R"unning forever and it
> >> is not interruptible. Furthermore some other not related processes like
> >> sshd, bash enter D state. There is nothing in kernel log.
>
> Oh, one thing you should know is that depending on your version of glibc,
> posix_fallocate may be writing 0s and not using preallocation calls.

I am absolutely sure that recent libc does not emulate this syscall.


>
> Do you know which yours is using?

The syscall (libc 2.9).

> strace should tell you on a small 
> file test.
>
> Anyway, I am seeing things get stuck around 8T it seems...

Yes, I noticed that sometimes the threshold point is higher.

>
> # touch /mnt/test/bigfile
> # xfs_io -c "resvsp 0 16t" /mnt/test/bigfile
>
> ... wait ... in other window ...
>
> # du -hc /mnt/test/bigfile
> 8.0G	/mnt/test/bigfile
> 8.0G	total
>
> # echo t > /proc/sysrq-trigger

It was a good idea to use sysrq. I didn't think of that, but rather focused on 
ftrace and how to analyse those megabytes of data.


> # dmesg | grep -A20 xfs_io
> xfs_io        R  running task     3576 29444  29362 0x00000006
>  ffff8809cfbb4920 ffffffff81478d9f ffffffffa032d3c5 0000000000000246
>  ffff8809cfbb4920 ffffffff814788bc 0000000000000000 ffffffff81ba3510
>  ffff8809d3429a68 ffffffffa032b60f ffff8809d3429aa8 000000000000001e
> Call Trace:
>  [<ffffffff81478d9f>] ? __mutex_lock_common+0x36d/0x392
>  [<ffffffffa032d3c5>] ? xfs_icsb_modify_counters+0x17f/0x1ac [xfs]
>  [<ffffffffa032b60f>] ? xfs_icsb_unlock_all_counters+0x4d/0x60 [xfs]
>  [<ffffffffa032b8bf>] ? xfs_icsb_disable_counter+0x8c/0x95 [xfs]
>  [<ffffffff81478e88>] ? mutex_lock_nested+0x3e/0x43
>  [<ffffffffa032d3d3>] ? xfs_icsb_modify_counters+0x18d/0x1ac [xfs]
>  [<ffffffffa032d536>] ? xfs_mod_incore_sb+0x29/0x6e [xfs]
>  [<ffffffffa033052c>] ? _xfs_trans_alloc+0x27/0x61 [xfs]
>  [<ffffffffa03303d3>] ? xfs_trans_reserve+0x6c/0x19e [xfs]
>  [<ffffffff8106fb45>] ? up_write+0x2b/0x32
>  [<ffffffffa0335e55>] ? xfs_alloc_file_space+0x163/0x306 [xfs]
>  [<ffffffff8107120a>] ? sched_clock_cpu+0xc3/0xce
>  [<ffffffffa0336122>] ? xfs_change_file_space+0x12a/0x2b8 [xfs]
>  [<ffffffff8106f9bf>] ? down_write_nested+0x80/0x8b
>  [<ffffffffa031b8ce>] ? xfs_ilock+0x30/0xb4 [xfs]
>  [<ffffffffa033e0e4>] ? xfs_vn_fallocate+0x80/0xf4 [xfs]
> --
> R         xfs_io 29444  86014624.786617       162   120  86014624.786617   
> 137655.161327       408.979977 /
>
> # uname -r
> 2.6.34-0.4.rc0.git2.fc14.x86_64
>
> I'll look into  it.

We stick with 2.6.31.5, which seems to be good for us. We do not change kernels 
lightly as soon as a higher revision arrives, because that doesn't make sense 
from a stability point of view. We have seen regression bugs too many times, so 
if we are confident in a given revision there is no point in changing it.

Krzysztof Błaszkowski


>
> -Eric
>


* Re: posix_fallocate
  2010-05-10  7:11     ` posix_fallocate Krzysztof Błaszkowski
@ 2010-05-10 14:39       ` Eric Sandeen
  2010-05-10 18:17         ` posix_fallocate Krzysztof Błaszkowski
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2010-05-10 14:39 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: xfs

Krzysztof Błaszkowski wrote:
> On Friday 07 May 2010 18:53, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Krzysztof Błaszkowski wrote:
>>>> Hello,
>>>>
>>>> I use this to preallocate large space but found an issue.
>>>> Posix_fallocate works right with sizes like 100G, 1T and even 10T on
>>>> some boxes (on some other can fail after e.g. 7T threshold) but if i
>>>> tried e.g. 16T the user space process would be "R"unning forever and it
>>>> is not interruptible. Furthermore some other not related processes like
>>>> sshd, bash enter D state. There is nothing in kernel log.
>> Oh, one thing you should know is that depending on your version of glibc,
>> posix_fallocate may be writing 0s and not using preallocation calls.
> 
> I am absolutely sure that recent libc doesn't emulate this syscall 

right, recent glibc does not (unless the underlying fs doesn't support it)

...

> We stick with 2.6.31.5 which seems to be good for us. We do not change kernels 
> easily, as soon as higher revision arrives because it doesn't make sense from 
> stability point of view. We have seen too many times regression bugs so if we 
> are confident with some revision then there is no point to change this.

It was just a testing suggestion, but I have already tested upstream and the 
problem persists; now I just need to find the time to dig into it.

-Eric


* Re: posix_fallocate
  2010-05-10 14:39       ` posix_fallocate Eric Sandeen
@ 2010-05-10 18:17         ` Krzysztof Błaszkowski
  2010-05-10 18:45           ` posix_fallocate Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-10 18:17 UTC (permalink / raw)
  To: xfs; +Cc: Eric Sandeen

On Monday 10 May 2010 16:39, Eric Sandeen wrote:
> Krzysztof Błaszkowski wrote:
> > On Friday 07 May 2010 18:53, Eric Sandeen wrote:
> >> Eric Sandeen wrote:
> >>> Krzysztof Błaszkowski wrote:
> >>>> Hello,
> >>>>
> >>>> I use this to preallocate large space but found an issue.
> >>>> Posix_fallocate works right with sizes like 100G, 1T and even 10T on
> >>>> some boxes (on some other can fail after e.g. 7T threshold) but if i
> >>>> tried e.g. 16T the user space process would be "R"unning forever and
> >>>> it is not interruptible. Furthermore some other not related processes
> >>>> like sshd, bash enter D state. There is nothing in kernel log.
> >>
> >> Oh, one thing you should know is that depending on your version of
> >> glibc, posix_fallocate may be writing 0s and not using preallocation
> >> calls.
> >
> > I am absolutely sure that recent libc doesn't emulate this syscall
>
> right, recent glibc does not (unless the underlying fs doesn't support it)
>
> ...
>
> > We stick with 2.6.31.5 which seems to be good for us. We do not change
> > kernels easily, as soon as higher revision arrives because it doesn't
> > make sense from stability point of view. We have seen too many times
> > regression bugs so if we are confident with some revision then there is
> > no point to change this.
>
> It was just a testing suggestion, but I already tested upstream and the
> problem persists, now just need to find the time to dig into it.

I see, and I am glad you confirmed this. Do you think that fallocate called 
many times with a fixed size and an increasing offset will work better than a 
single call with a huge size at offset 0?
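The chunked approach asked about here can be sketched as follows: issue many posix_fallocate calls with a fixed chunk size and an increasing offset instead of one huge reservation (the chunk size is illustrative):

```python
import os
import tempfile

def fallocate_chunked(fd, total, chunk=1 << 30):
    """Preallocate `total` bytes in `chunk`-sized pieces with
    increasing offsets, instead of a single huge reservation."""
    offset = 0
    while offset < total:
        length = min(chunk, total - offset)
        os.posix_fallocate(fd, offset, length)
        offset += length

fd, path = tempfile.mkstemp()
try:
    fallocate_chunked(fd, 4 << 20, chunk=1 << 20)  # 4 MiB in 1 MiB chunks
    print(os.fstat(fd).st_size)  # 4194304
finally:
    os.close(fd)
    os.unlink(path)
```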

Krzysztof
>
> -Eric
>


* Re: posix_fallocate
  2010-05-10 18:17         ` posix_fallocate Krzysztof Błaszkowski
@ 2010-05-10 18:45           ` Eric Sandeen
  2010-05-11 14:20             ` posix_fallocate Krzysztof Błaszkowski
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sandeen @ 2010-05-10 18:45 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: xfs

Krzysztof Błaszkowski wrote:
> On Monday 10 May 2010 16:39, Eric Sandeen wrote:
>> Krzysztof Błaszkowski wrote:

...

>>> We stick with 2.6.31.5 which seems to be good for us. We do not change
>>> kernels easily, as soon as higher revision arrives because it doesn't
>>> make sense from stability point of view. We have seen too many times
>>> regression bugs so if we are confident with some revision then there is
>>> no point to change this.
>> It was just a testing suggestion, but I already tested upstream and the
>> problem persists, now just need to find the time to dig into it.
> 
> I see, and I am glad you confirmed this. Do you think that fallocate called 
> many times with a fixed size and increasing offset will work better than a 
> single call with a huge size at offset 0?

I'd expect that to work; it's certainly worth a test, and please send your
results back to the list ;)

thanks,
-Eric

> Krzysztof
>> -Eric


* Re: posix_fallocate
  2010-05-10 18:45           ` posix_fallocate Eric Sandeen
@ 2010-05-11 14:20             ` Krzysztof Błaszkowski
  2010-05-11 14:54               ` posix_fallocate Eric Sandeen
  0 siblings, 1 reply; 15+ messages in thread
From: Krzysztof Błaszkowski @ 2010-05-11 14:20 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On Monday 10 May 2010 20:45, Eric Sandeen wrote:
> Krzysztof Błaszkowski wrote:
> > On Monday 10 May 2010 16:39, Eric Sandeen wrote:
> >> Krzysztof Błaszkowski wrote:
>
> ...
>
> >>> We stick with 2.6.31.5 which seems to be good for us. We do not change
> >>> kernels easily, as soon as higher revision arrives because it doesn't
> >>> make sense from stability point of view. We have seen too many times
> >>> regression bugs so if we are confident with some revision then there is
> >>> no point to change this.
> >>
> >> It was just a testing suggestion, but I already tested upstream and the
> >> problem persists, now just need to find the time to dig into it.
> >
> > I see, and I am glad you confirmed this. Do you think that fallocate
> > called many times with a fixed size and increasing offset will work better
> > than a single call with a huge size at offset 0?
>
> I'd expect that to work; it's certainly worth a test

agreed.

> , and please send your 
> results back to the list ;)

okay, will do this tomorrow.


BUT let's think about the possible results:
- the test fails: nothing to comment on.

- the test passes: this is the interesting case.

Does a passed test prove anything?
It may, but this is not obvious.
If the fault was caused by some algorithmic mistake (e.g. a table or buffer 
size derived from the input size), then the test result could be a proof.

But if it fails due to a missing spinlock/mutex elsewhere, then we are talking 
about a probability of failure that depends on the requested size.

The bad news is that this failure happens at various sizes depending on the 
hardware configuration. On some boxes the threshold point is about 7T, while 
other boxes fail only after e.g. 15T.
I am not sure of any relationship between these boxes in installed memory, 
number of logical cores and their frequency (or current workload).

In this latter case the test will prove nothing. If I run it 5 times and it 
passes 5 times, it would only mean that I was lucky.

As long as we don't know the exact nature of this fault, we can't consider 
such a test a reliable fix.


Krzysztof

PS Of course I will try this tomorrow afternoon (PL time) just to satisfy 
curiosity, as all the high-capacity storage has already been shipped to 
customers.

>
> thanks,
> -Eric
>
> > Krzysztof
> >
> >> -Eric


* Re: posix_fallocate
  2010-05-11 14:20             ` posix_fallocate Krzysztof Błaszkowski
@ 2010-05-11 14:54               ` Eric Sandeen
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Sandeen @ 2010-05-11 14:54 UTC (permalink / raw)
  To: Krzysztof Błaszkowski; +Cc: xfs

Krzysztof Błaszkowski wrote:
...

> As long as we don't know the exact nature of this fault, we can't consider 
> such a test a reliable fix.

No, it's just a datapoint, do as you wish.  :)

I'll try to find time to dig into this soon if nobody beats me to it.

-Eric


end of thread, other threads:[~2010-05-11 14:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-07  8:22 posix_fallocate Krzysztof Błaszkowski
2010-05-07  9:23 ` posix_fallocate Stan Hoeppner
2010-05-07  9:48   ` posix_fallocate Krzysztof Błaszkowski
2010-05-07 10:07   ` posix_fallocate Krzysztof Błaszkowski
2010-05-07 10:42     ` posix_fallocate Stan Hoeppner
2010-05-07 10:56       ` posix_fallocate Krzysztof Błaszkowski
2010-05-07 16:26 ` posix_fallocate Eric Sandeen
2010-05-07 16:53   ` posix_fallocate Eric Sandeen
2010-05-07 22:16     ` posix_fallocate Dave Chinner
2010-05-10  7:11     ` posix_fallocate Krzysztof Błaszkowski
2010-05-10 14:39       ` posix_fallocate Eric Sandeen
2010-05-10 18:17         ` posix_fallocate Krzysztof Błaszkowski
2010-05-10 18:45           ` posix_fallocate Eric Sandeen
2010-05-11 14:20             ` posix_fallocate Krzysztof Błaszkowski
2010-05-11 14:54               ` posix_fallocate Eric Sandeen
