* frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
@ 2011-08-06 12:25 Marc Lehmann
2011-08-06 14:20 ` Dave Chinner
0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-06 12:25 UTC (permalink / raw)
To: xfs
I get frequent (for servers) lockups and crashes when using 2.6.39. I saw the
same problems using 3.0.0-rc4, rc5 and rc6, and I think also 2.6.38. I don't
see these lockups on 2.6.30 or 2.6.26 (all the respective latest Debian
kernels).
The symptom differs slightly - sometimes I get thousands of backtraces
before the machine locks up, usually I get only one, and either the
machine locks up completely, or only the processes using the filesystem in
question (presumably) lock - all unkillable.
The backtraces look all very similar:
http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt
this is from a desktop system - it tends to be harder to get these from
servers.
All the backtraces show a null pointer dereference in xfs_iget or in
xfs_trans_log_inode, and always for the xfs_fsr process.
I haven't seen a crash without xfs_fsr.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / schmorp@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
From: Dave Chinner @ 2011-08-06 14:20 UTC (permalink / raw)
To: Marc Lehmann; +Cc: xfs

On Sat, Aug 06, 2011 at 02:25:56PM +0200, Marc Lehmann wrote:
> I get frequent (for servers) lockups and crashes when using 2.6.39. I saw the
> same problems using 3.0.0-rc4, rc5 and rc6, and I think also 2.6.38. I don't
> see these lockups on 2.6.30 or 2.6.26 (all the respective latest Debian
> kernels).
>
> The symptom differs slightly - sometimes I get thousands of backtraces
> before the machine locks up, usually I get only one, and either the
> machine locks up completely, or only the processes using the filesystem in
> question (presumably) lock - all unkillable.
>
> The backtraces all look very similar:
>
> http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt

Tainted kernel. Please reproduce without the NVidia binary drivers.

> this is from a desktop system - it tends to be harder to get these from
> servers.
>
> all the backtraces crash with a null pointer dereference in xfs_iget, or
> in xfs_trans_log_inode, and always for process xfs_fsr.

And when you do, please record an event trace of the xfs_swap_extent*
trace points while xfs_fsr is running and triggers a crash. That will
tell me whether xfs_fsr is corrupting inodes.

> I haven't seen a crash without xfs_fsr.

Then don't use xfs_fsr until we know if it is the cause of the problem
(except to reproduce the problem).

And as I always ask - why do you need to run xfs_fsr so often? Do you
really have filesystems that get quickly fragmented (or are you just
running it from a cron job because having on-line defragmentation is
what all the cool kids do ;)? If you are getting fragmentation, what is
the workload that is causing it?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
From: Marc Lehmann @ 2011-08-07 1:42 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Sun, Aug 07, 2011 at 12:20:05AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > The backtraces look all very similar:
> >
> > http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt
>
> Tainted kernel. Please reproduce without the NVidia binary drivers.

This is just because it is from my desktop system. None of my other
machines have a tainted kernel, but getting backtraces from them is much
harder.

> > all the backtraces crash with a null pointer dereference in xfs_iget, or
> > in xfs_trans_log_inode, and always for process xfs_fsr.
>
> and when you do, please record an event trace of the
> xfs_swap_extent* trace points while xfs_fsr is running and triggers
> a crash. That will tell me if xfs_fsr is corrupting inodes,

Ah - how do I do that?

> > I haven't seen a crash without xfs_fsr.
>
> Then don't use xfs_fsr until we know if it is the cause of the
> problem (except to reproduce the problem).

Why so defensive? xfs_fsr is an advertised feature and should just work
(and does so with older kernels).

> And as I always ask - why do you need to run xfs_fsr so often? Do

Did I say I am running it often? It typically runs once a day for an hour.

> you really have filesystems that get quickly fragmented (or are you

Yes, fragmentation with xfs is enormous - I have yet to see whether the
changes in recent kernels make a big difference, but for log files,
reading through a log file with 60000 fragments tends to be much slower
than reading through one with just a few fragments (or just one...).

Freenet and other daemons also create enormous fragmentation. As such,
xfs is much, much worse at fragmentation than ext4, but at least it has
xfs_fsr, which at least reduces file fragmentation.

> just running it from a cron-job because having on-line
> defragmentation is what all the cool kids do ;)?

Didn't know that, maybe I should run it more often then... Or maybe not,
now that you tell me I shouldn't, because xfs implementation quality is
so much lower than for other filesystems?

> If you are getting fragmentation, what is the workload that is causing
> it?

Basically, anything but the OS itself. Copying large video files while
the disk is busy with other things causes lots of fragmentation (usually
30 fragments for a 100MB file), which in turn slows things down
enormously once the disk reaches 95% full.

Freenet is also a good test case. As are logfiles. Or a news spool. Or
database files for databases that grow files (such as mysql MyISAM) -
fortunately I could move all of those to SSDs this year. Or simply
unpacking an archive.

Simple example - the www.deliantra.net gameserver writes logs to a
logfile and stdout, which is redirected to another logfile in the same
directory (which gets truncated on each restart).

Today I had to reboot the server because of buggy xfs (which prompted
the bug report, as I have been seeing this bug for a while now, but so
far didn't want to exclude e.g. bad RAM or simply a corrupt filesystem),
and in the 4 hours of uptime, I got a 4MB logfile with 8 fragments.

This is clearly an improvement over the 2.6.26 kernel I used before on
that machine. But over a few months this still leads to thousands of
fragments, and scanning through a few gigabytes of log file that has
60000 fragments on a disk that isn't completely idle is not exactly
fast.

The webserver access log on that machine, which is a file on its own in
its own directory, is 15MB big (it was restarted at the beginning of
last month) and has 1043 fragments (it doesn't get defragmented by
xfs_fsr because it is in use).
OTOH, that filesystem isn't used much and has 300GB free out of 500, so
it is surprising that I still get so many fragments (the files are only
closed when running xfs_fsr on them, which is once every few weeks).

Freenet fares much worse. The persistent blob has 1757 fragments for
13GB (not that bad), and the download database has 22756 fragments for
600MB (that sucks).

On my TV, the recorded video files that haven't been defragmented yet
have between 11 and 63 fragments (all smaller than 2GB), which is almost
acceptable, but I do not think that without a regular xfs_fsr the fs
would be in that good shape after one or two years of usage.

The cool thing about xfs_fsr is not that the cool kids run it, but that,
unlike other filesystems that also fragment a lot (ext3 is absolutely
horrible, for example), it can mostly be fixed.

Given that xfs is clearly the lowest quality of the common filesystems
on linux (which I take to be reiserfs and ext2/3/4 - and before you ask,
literally each time I run a file system check, xfs_repair crashes or
hangs, and the filesystems have some issues, on all my numerous
machines, and the number of bugs I have hit with xfs is easily twice the
number of bugs I hit with reiserfs and extX together, and I was an early
adopter of reiserfs, before it even had a fsck), it is important to have
some features left that cancel this general lack of quality.

Right now, these features for me are the very tunable nature of xfs (for
example, 512b block size for news spools), the very fast xfs_repair and
the long-term maintainability of the filesystem - a heavily used ext3
filesystem basically becomes unusable after a year.

Another feature was the very good feedback I got from this list in the
past w.r.t. bugs and fixes (while nowadays I have to listen to "xfs is
optimised for nfs, not for your use" or "then don't use it" replies to
bug reports).
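[Editor's note: fragment counts like those quoted above can be gathered with xfs_bmap, which prints one line per extent. A minimal sketch of counting extents from saved xfs_bmap output - the helper name and the sample line format are illustrative, not part of the original mails:]

```shell
#!/bin/sh
# Count extents in saved `xfs_bmap <file>` output.
# Each extent is printed as one "N: [startoff..endoff]: startblock..endblock"
# line, so counting lines that start with an extent index gives the
# fragment count.
count_extents() {
    grep -c '^[[:space:]]*[0-9][0-9]*:' "$1"
}
```

Running this over output captured for each logfile would reproduce numbers like "1043 fragments" above.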
All that, and the fact that I haven't lost a single important file, and
the steady improvements to performance in XFS make xfs currently my
filesystem of choice, especially for heavy-duty applications.

PS: I run xfs on a total of about 40TB of filesystems at the moment.

PPS: sorry for being so forcefully truthful about xfs above, but you
really need an attitude change - don't tell people not to use a feature,
or tell them they probably just want to be cool kids - the
implementation quality of xfs is far from that of reiserfs or ext3 (not
sure about ext4 yet, but I do expect e2fsck not to let me down as often
as xfs_repair), there are things to do, and I contribute what little I
can by testing xfs with actual workloads.
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
From: Dave Chinner @ 2011-08-07 10:26 UTC (permalink / raw)
To: Marc Lehmann; +Cc: xfs

On Sun, Aug 07, 2011 at 03:42:38AM +0200, Marc Lehmann wrote:
> On Sun, Aug 07, 2011 at 12:20:05AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > The backtraces look all very similar:
> > >
> > > http://ue.tst.eu/85b9c9f66e36dda81be46892661c5bd0.txt
> >
> > Tainted kernel. Please reproduce without the NVidia binary drivers.
>
> This is just because it is from my desktop system. None of my other
> machines have a tainted kernel, but getting backtraces from them is much
> harder.
>
> > > all the backtraces crash with a null pointer dereference in xfs_iget, or
> > > in xfs_trans_log_inode, and always for process xfs_fsr.
> >
> > and when you do, please record an event trace of the
> > xfs_swap_extent* trace points while xfs_fsr is running and triggers
> > a crash. That will tell me if xfs_fsr is corrupting inodes,
>
> Ah - how do I do that?

Use trace-cmd or do it manually via:

# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_before/enable
# echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_after/enable
# cat /sys/kernel/debug/tracing/trace_pipe > trace.out

> > > I haven't seen a crash without xfs_fsr.
> >
> > Then don't use xfs_fsr until we know if it is the cause of the
> > problem (except to reproduce the problem).
>
> Why so defensive? xfs_fsr is an advertised feature and should just work

Defensive? Sure - to protect -your systems- from further corruption
problems until we know what the problem is.

To use a car analogy: I know the brakes on your car have a fault that
could cause a catastrophic failure, and I know you are taking a drive
over a mountain.
Don't you think I should tell you not to drive your car over the
mountain, but to get the brakes looked at first? But it's your data, so
if you want to risk catastrophic corruption by continuing to run xfs_fsr
then that's your choice.

> (and does so with older kernels).

On older kernels (2.6.34 and earlier) I can corrupt filesystems using
xfs_fsr just by crafting a file with a specific layout. It's easy and
doesn't require any special privileges to do. IOWs, xfs_fsr on old
kernels is actually dangerous and should not be used if you have
anything that stores information in attributes (like selinux).

We made quite a lot of fixes to the swap extent code to fix those
problems, along with regression tests so the problem doesn't arise
again. It's entirely possible that a problem was introduced by these
fixes. Perhaps there's a case that I didn't fully understand and fix
properly, or there's some other as yet unknown problem. Until I know
what it is, the safest thing is not to run xfs_fsr. Indeed, if you get
new corruptions showing up without running xfs_fsr, then that's also
something worth knowing.

> > And as I always ask - why do you need to run xfs_fsr so often? Do
>
> Did I say I am running it often? It typically runs once a day for an hour.

Yes, that is often. I don't run xfs_fsr at all on any of my machines
(except for the test VMs when testing xfs_fsr).

The problem with running xfs_fsr is that while it defragments files, it
fragments free space, i.e. xfs_fsr turns large contiguous free space
ranges into smaller, non-contiguous free space ranges. IOWs, using
xfs_fsr accelerates filesystem aging effects, meaning that new files are
much more likely to get fragmented as they grow because they cannot be
located in large contiguous free space extents. Then you run xfs_fsr to
reduce the number of fragments in the file, thereby converting free
space into more smaller, less contiguous extents. It's a downward
spiral....
That's why running xfs_fsr regularly out of a cron job is not advisable.
This lesson was learned on Irix more than 10 years ago, when it
defaulted to running once a week for two hours on Sunday night. Running
it more frequently, as is happening on your systems, will only make
things worse.

FWIW, this comes up often enough that I think I need to add a FAQ entry
for it.

> > you really have filesystems that get quickly fragmented (or are you
>
> Yes, fragmentation with xfs is enormous - I have yet to see whether
> the changes in recent kernels make a big difference, but for log files,
> reading through a log file with 60000 fragments tends to be much slower
> than reading through one with just a few fragments (or just one...).

So you've got a problem with append-only workloads. 2.6.38 and more
recent kernels should be much more resistant to fragmentation under such
conditions thanks to the dynamic speculative allocation changes that
went into 2.6.38.

Alternatively, you can use the allocsize mount option, or set the
append-only inode flag, or set the preallocated flag on the inode so
that truncation of speculative allocation beyond EOF doesn't occur every
time the file is closed.

.....

> > If you are getting fragmentation, what is the workload that is causing
> > it?
>
> Basically, anything but the OS itself. Copying large video files while the
> disk is busy with other things causes lots of fragmentation (usually 30
> fragments for a 100MB file), which in turn slows things down enormously
> once the disk reaches 95% full.

Another oft-repeated rule of thumb - filling XFS filesystems over 85-90%
full causes increased fragmentation because of the lack of large
contiguous free space extents. That's exactly the same problem that
excessive use of xfs_fsr causes.....

> Freenet is also a good test case.

Not for a filesystem developer. Running internet-facing, anonymous,
encrypted peer-to-peer file storage servers anywhere is not something
I'll ever do on my network.
If you think it's a good workload that we should use, then capture a
typical directory profile and the IO/filesystem operations made on a
busy server for an hour or so. Then write a script to reproduce that
directory structure and IO pattern.....

> As are logfiles.
>
> Or a news spool.

Append-only workloads.

> Or database files for databases that grow files (such as mysql myisam) -
> fortunately I could move all of those to SSDs this year.

I thought mysql was capable of preallocating regions when files grow.
Perhaps it isn't configured to do so?

> Or simply unpacking an archive.

That should not cause fragmentation unless you have already fragmented
free space... Use xfs_db -r -c "freesp -s" <dev> to get an idea of what
your freespace situation looks like.

> Simple example - the www.deliantra.net gameserver writes logs to a logfile
> and stdout, which is redirected to another logfile in the same directory
> (which gets truncated on each restart).
>
> Today I had to reboot the server because of buggy xfs (which prompted the
> bug report, as I have been seeing this bug for a while now, but so far
> didn't want to exclude e.g. bad RAM or simply a corrupt filesystem), and
> in the 4 hours of uptime, I got a 4MB logfile with 8 fragments.

What kernel, and what is the xfs_bmap -vp output for the file?

> This is clearly an improvement over the 2.6.26 kernel I used before on
> that machine. But over a few months this still leads to thousands of
> fragments,

Have you seen this, or are you extrapolating from the 4MB file you've
seen above?

....

> Freenet fares much worse. The persistent blob has 1757 fragments for 13GB
> (not that bad), and the download database has 22756 fragments for 600MB
> (that sucks).

You're still talking about how 2.6.26 kernels behave, right?
> On my TV, the recorded video files that haven't been defragmented yet
> have between 11 and 63 fragments (all smaller than 2GB), which is almost
> acceptable, but I do not think that without a regular xfs_fsr the fs would
> be in that good shape after one or two years of usage.

For old kernels, allocsize should have mostly solved that problem. For
current kernels it shouldn't even be necessary.

> The cool thing about xfs_fsr is not that the cool kids run it, but that,
> unlike other filesystems that also fragment a lot (ext3 is absolutely
> horrible, for example), it can mostly be fixed.

"fixed" is not really true - all it has done is trade file fragmentation
for freespace fragmentation. That bites you eventually.

> Given that xfs is clearly the lowest quality of the common filesystems
> on linux (which I take to be reiserfs and ext2/3/4 - and before you ask,
> literally each time I run a file system check, xfs_repair crashes or
> hangs, and the filesystems have some issues, on all my numerous machines,
> and the number of bugs I have hit with xfs is easily twice the number of
> bugs I hit with reiserfs and extX together, and I was an early adopter
> of reiserfs, before it even had a fsck), it is important to have some
> features left that cancel this general lack of quality.

Quality will only improve if you report bugs and help trace their root
cause. Then we can fix them. If you don't, we don't know about them,
can't find them and hence can't fix them.

> Right now, these features for me are the very tunable nature of xfs (for
> example, 512b block size for news spools), the very fast xfs_repair and
> the long-term maintainability of the filesystem - a heavily used ext3
> filesystem basically becomes unusable after a year.
>
> Another feature was the very good feedback I got from this list in the
> past w.r.t. bugs and fixes (while nowadays I have to listen to "xfs is
> optimised for nfs, not for your use" or "then don't use it" replies to
> bug reports).
<sigh>

Ok, now I remember you. I hope this time you'll provide me with the
information I ask you for to triage your problem....
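[Editor's note: Dave's manual tracing recipe earlier in this mail can be wrapped in a small script. The paths assume debugfs is mounted at /sys/kernel/debug and that the kernel exposes the xfs event group; the filter helper at the end is a hypothetical convenience for inspecting a saved trace, not something from the original thread:]

```shell
#!/bin/sh
# Capture the xfs_swap_extent_* trace points while xfs_fsr runs.
# Requires root and a kernel with XFS event tracing compiled in.
EV=/sys/kernel/debug/tracing/events/xfs

capture_swap_trace() {
    echo 1 > "$EV/xfs_swap_extent_before/enable"
    echo 1 > "$EV/xfs_swap_extent_after/enable"
    # Blocks until interrupted; run xfs_fsr in another shell meanwhile.
    cat /sys/kernel/debug/tracing/trace_pipe > trace.out
}

# Pull only the swap-extent events out of a saved trace file.
filter_swap_events() {
    grep 'xfs_swap_extent' "$1"
}
```

trace-cmd (`trace-cmd record -e 'xfs:xfs_swap_extent*'`) achieves the same without touching the sysfs files directly.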
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
From: Marc Lehmann @ 2011-08-08 19:02 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Sun, Aug 07, 2011 at 08:26:25PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> Use trace-cmd or do it manually via:
>
> # echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_before/enable
> # echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_swap_extent_after/enable
> # cat /sys/kernel/debug/tracing/trace_pipe > trace.out

Thanks, I'll have a look at enabling this with a regular xfs_fsr on a
few machines.

> To use a car analogy: I know the brakes on your car have a fault
> that could cause a catastrophic failure, and I know you are taking a
> drive over a mountain. Don't you think I should tell you not to
> drive your car over the mountain, but to get the brakes looked at
> first?

To take your car analogy - if I went to my car dealer and told him my
brakes had just malfunctioned, but fortunately it was uphill and I could
safely stop with my handbrake, he would most decisively not reply with
"then don't use your car". No, he would presumably offer to take back
the car and replace the brakes, for free. I am not sure what you want to
say with your analogy, but it doesn't seem to be sensible.

> > (and does so with older kernels).
>
> On older kernels (2.6.34 and earlier) I can corrupt filesystems
> using xfs_fsr just by crafting a file with a specific layout.

Wow, and it's not mentioned anywhere in the status updates, unlike all
those nice performance upgrades, especially those dirty NFS hacks. Yes,
I am a bit sarcastic, but this corruption bug is either pretty harmless
or the xfs team is really somewhat irresponsible in not giving out
information about this harmful bug.

> easy and doesn't require any special privileges to do.
Wow, so any kernel before 2.6.34 can have its xfs corrupted by an
untrusted user? Seriously, shouldn't this be mentioned at least in the
FAQ or somewhere else?

> IOWs, xfs_fsr on old kernels is actually dangerous and should not be
> used if you

Logic error - if I can corrupt an XFS without special privileges then
this is not a problem with xfs_fsr, but simply a kernel bug in the xfs
code. And a rather big one, one step below a remote exploit.

> The problem with running xfs_fsr is that while it defragments files,
> it fragments free space, i.e. xfs_fsr turns large contiguous free

While that is true in *some* cases, it can also be countered in
userspace, and will not happen if files get removed regularly, e.g. for
a cache partition.

However, if you have those famous append-style loads, and this causes
files to have thousands of fragments, these are most likely interleaved
with other files. xfs_fsr can, if it manages to defragment the file
completely (which is the norm in my case), introduce at most one
fragment, while, in the case of non-static files, it will likely remove
thousands of small free space fragments.

Sure, xfs_fsr can be detrimental, but so can doing nothing, letting your
disk get full accidentally, and many other actions. There is definitely
no clear-cut "xfs_fsr causes your fs to deteriorate", and as always, you
have to know what you are doing.

> That's why running xfs_fsr regularly out of a cron job is not
> advisable. This lesson was learned on Irix more than 10 years ago when
> it defaulted to running once a week for two hours on Sunday
> night. Running it more frequently like is happening on your systems
> will only make things worse.

Yes, I remember that change - however, running it once a week versus
daily is not a big difference. Quite obviously, the difference in
workloads can and will easily dominate any difference in effects. And to
me, it doesn't make a difference whether xfs_fsr causes a crash every
week or every other month.
> FWIW, this comes up often enough that I think I need to add a FAQ
> entry for it.

Yes, that's a good idea in any case.

> > > you really have filesystems that get quickly fragmented (or are you
> >
> > Yes, fragmentation with xfs is enormous - I have yet to see whether
> > the changes in recent kernels make a big difference, but for log files,
> > reading through a log file with 60000 fragments tends to be much slower
> > than reading through one with just a few fragments (or just one...).
>
> So you've got a problem with append only workloads.

Basically everything is append-only on unix, because preallocating
files isn't done except by special tools really, and the only way to
create file contents is to append (well, you can do random writes, as
e.g. vmware does, which causes havoc with XFS, but that's just a stupid
way to create files...).

> 2.6.38 and more recent kernels should be much more resistant to
> fragmentation under such conditions thanks to the dynamic
> speculative allocation changes that went into 2.6.38.

I would tend to agree.

> Alternatively, you can use the allocsize mount option, or set the

Well, not long ago somebody (you) told me that the allocsize option is
designed to eat all diskspace on servers with long uptimes because of an
NFS optimisation hack that didn't go into the nfs server but the
filesystem. Has this been redesigned (I would say, fixed)?

> append-only inode flag, or set the preallocated flag on the inode
> so that truncation of speculative allocation beyond EOF doesn't
> occur every time the file is closed.

Or use ext4, which fares much better without having to patch programs.

> > Basically, anything but the OS itself. Copying large video files while the
> > disk is busy with other things causes lots of fragmentation (usually 30
> > fragments for a 100MB file), which in turn slows things down enormously
> > once the disk reaches 95% full.
> Another oft-repeated rule of thumb - filling XFS filesystems over
> 85-90% full causes increased fragmentation because of the lack of
> large contiguous free space extents. That's exactly the same problem
> that excessive use of xfs_fsr causes.....

On a 39% full disk (my examples)?

> > Freenet is also a good test case.
>
> Not for a filesystem developer. Running internet facing, anonymous,
> encrypted peer-to-peer file storage servers anywhere is not
> something I'll ever do on my network.

You are entitled to your political opinions, but why poison a purely
technical discussion with them? Based on technical merits, freenet is a
very good test case, because it causes all kinds of I/O patterns. Your
personal opinions on politics or laws or whatever don't make it a bad
testcase, just something _you_ don't want to use yourself (which is ok).
Claiming it is a bad testcase based on your political views is just
unprofessional.

> If you think it's a good workload that we should use, then capture a
> typical directory profile and the IO/filesystem operations made on a
> busy server for an hour or so. Then write a script to reproduce that
> directory structure and IO pattern.....

I'll consider it, but that is a major commitment of worktime I might not
be able to make.

> > Or a news spool.
>
> append only workloads.

Or anything else that creates files, i.e. *everything*. A news spool is
extremely different from logfiles - files are static and never appended
to after they have been created. They do get deleted in irregular order,
and can cause lots of free space fragmentation.

Calling everything an "append only" workload is not very useful. If XFS
is bad at append-only workloads, which is *the* most common type of
workload, then XFS fails to be very relevant for the real world.

> > Or database files for databases that grow files (such as mysql myisam) -
> > fortunately I could move all of those to SSDs this year.
> I thought mysql was capable of preallocating regions when files grow.

It's not. Maybe the effect isn't so bad on most filesystems (it
certainly isn't so bad on ext4):

-rw-rw---- 1 mysql mysql 3665891328 Aug  8 20:00 art.MYI
-rw------- 1 mysql mysql 2328898560 Aug  8 17:45 file.MYI
-rw-rw---- 1 mysql mysql 1098302464 Aug  8 17:45 image.MYI

art.MYI: 38 extents found
file.MYI: 20 extents found
image.MYI: 10 extents found

That's after about 12 months of usage, during which time the file sizes
grew by about 50%.

> > Or simply unpacking an archive.
>
> That should not cause fragmentation unless you have already
> fragmented free space...

I even get multiple fragments for lots of files when unpacking a big
(>> memory) tar on a freshly mkfs'ed filesystem. It's mostly 2-3
fragments, affects maybe 5% of the files, and might not be a real issue,
but fragmentation it is.

> Use xfs_db -r -c "freesp -s" <dev> to get an idea of what your
> freespace situation looks like.

FWIW, this is on the disk with the 22k-fragment 650MB freenet database:

http://ue.tst.eu/edc5324f68b98076c9419ab0267ad9d6.txt

> > Today I had to reboot the server because of buggy xfs (which prompted the
> > bug report, as I have been seeing this bug for a while now, but so far
> > didn't want to exclude e.g. bad RAM or simply a corrupt filesystem), and
> > in the 4 hours of uptime, I got a 4MB logfile with 8 fragments.
>
> What kernel, and what is the xfs_bmap -vp output for the file?

2.6.39-2, and the crash took it with it :/

> > This is clearly an improvement over the 2.6.26 kernel I used before on
> > that machine. But over a few months this still leads to thousands of
> > fragments,
>
> Have you seen this, or are you extrapolating from the 4MB file
> you've seen above?

These logfiles in particular had over 60000 fragments each (60k, not 6k)
before I started to regularly xfs_fsr them. Grepping through them took
almost an hour; now it takes less than a minute.

> > Freenet fares much worse.
> > The persistent blob has 1757 fragments for 13GB
> > (not that bad), and the download database has 22756 fragments for 600MB
> > (that sucks).
>
> You're still talking about how 2.6.26 kernels behave, right?

No, that's with either 3.0.0-rc4/5/6 or 2.6.39-2. I am running 3.0.0-1
now for other reasons.

> > On my TV, the recorded video files that haven't been defragmented yet
> > have between 11 and 63 fragments (all smaller than 2GB), which is almost
> > acceptable, but I do not think that without a regular xfs_fsr the fs
> > would be in that good shape after one or two years of usage.
>
> For old kernels, allocsize should have mostly solved that problem.
> For current kernels that shouldn't even be necessary.

Yeah, I used allocsize=64m on all those storage filesystems. It
certainly helped the video fragmentation.

> > The cool thing about xfs_fsr is not that the cool kids run it, but that,
> > unlike other filesystems that also fragment a lot (ext3 is absolutely
> > horrible, for example), it can mostly be fixed.
>
> "fixed" is not really true - all it has done is trade file
> fragmentation for freespace fragmentation. That bites you
> eventually.

No, it might bite me, but that very much depends on the type of files. A
news spool mostly has two sizes of files, for example, so it would be
surprising if that would bite me.

> Quality will only improve if you report bugs and help trace their
> root cause. Then we can fix them. If you don't, we don't know about
> them, can't find them and hence can't fix them.

You are preaching to the wrong person, and this is not very encouraging.
In the past, I often sought the wisdom of this list, and got good
replies (and bugfixes). It would have helped tremendously if the
obfuscation option actually worked - which is the main reason why I
sometimes can't provide metadumps. In this case, I can, because there is
nothing problematic on those filesystems.

> Ok, now I remember you.
> I hope this time you'll provide me with the
> information I ask you for to triage your problem....

Sorry, but this is not the way you get people to help. I *always* provided all information that I could provide and was asked for. You are now pretending that I didn't do that in the past.

That's both insulting and frustrating - to me, it means I can just stop interacting with you - quite obviously, you are asking for the impossible.

I can understand if you dislike negative but true comments about XFS, but that's not a reason to misrepresent my contributions to track down problems.

Or to put it differently, instead of making vague accusations, what exactly did you ask for that I could provide, but didn't? Can you back up your statement?

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / schmorp@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
2011-08-08 19:02 ` Marc Lehmann
@ 2011-08-09 10:10 ` Michael Monnerie
2011-08-09 11:15 ` Marc Lehmann
0 siblings, 1 reply; 18+ messages in thread
From: Michael Monnerie @ 2011-08-09 10:10 UTC (permalink / raw)
To: xfs; +Cc: Marc Lehmann

[-- Attachment #1.1: Type: Text/Plain, Size: 2848 bytes --]

On Monday, 8 August 2011 Marc Lehmann wrote:

First of all, please calm down. Getting personal is not bringing us anywhere.

> > On older kernels (2.6.34 and earlier) I can corrupt filesystems
> > using xfs-fsr just by crafting a file with a specific layout. [snip]
> > IOWs, xfs_fsr on old kernels is actually dangerous and should not
> > be used if you
>
> Logic error - if I can corrupt an XFS without special privileges then
> this is not a problem with xfs_fsr, but simply a kernel bug in the
> xfs code. And a rather big one, one step below a remote exploit.

No, it's not a kernel bug because as long as you don't use xfs_fsr, nothing will ever happen.

And the rest of the mail goes into lots of details which look very strange to me. I've double checked with our servers, which generally have these xfs mount options:

(rw,nodiratime,relatime,logbufs=8,logbsize=256k,attr2,barrier,largeio,swalloc)

and sometimes also

,allocsize=64m

and I can't find evidence for fragmentation that would be harmful. Yes, they are fragmented, of course. When you write to ~500 log files at a time via syslog (as we do on some servers), there must be some fragmentation. The allocsize option helps a lot there. I looked at one webserver access log, it has 640MB with 99 fragments, but that's not a lot. On our Spamgate I see 250MB logs with 374 fragments. That's a bit more, but we don't use the allocsize option there, which I changed now that I looked at it ;-)

But your words

> If XFS is bad at append-only workloads, which is the most common type
> of workload, then XFS fails to be very relevant for the real world.

may be valid for your world, not mine.
We have webservers, fileservers and database servers, all of which are not really append style, but more delete-and-recreate. Well, db-servers are rather exceptional here. Append style is mostly for log files, at least on our servers. But if the numbers for fragmentation on your servers are true, you must have a very good test case for fragmentation prevention. Therefore it could be really interesting if you could grab what Dave Chinner asked for: > If you think it's a good workload that we should use, then capture a > typical directory profile and the IO/filesystem operations made on a > busy server for an hour or so. Then write a script to reproduce that > directory structure and IO pattern..... And maybe he could use it for optimizations. Is there any tool on Linux to record such I/O patterns? Would need to keep all metadata and data operations for a partition to be interesting. -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
2011-08-09 10:10 ` Michael Monnerie
@ 2011-08-09 11:15 ` Marc Lehmann
2011-08-10 6:59 ` Michael Monnerie
2011-08-10 14:16 ` Dave Chinner
0 siblings, 2 replies; 18+ messages in thread
From: Marc Lehmann @ 2011-08-09 11:15 UTC (permalink / raw)
To: Michael Monnerie; +Cc: xfs

On Tue, Aug 09, 2011 at 12:10:48PM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> First of all, please calm down. Getting personal is not bringing us
> anywhere.

Well, it's not me who's getting personal, so...?

> > Logic error - if I can corrupt an XFS without special privileges then
> > this is not a problem with xfs_fsr, but simply a kernel bug in the
> > xfs code. And a rather big one, one step below a remote exploit.
>
> No, it's not a kernel bug because as long as you don't use xfs_fsr,
> nothing will ever happen.

"As long as you don't boot, it will not crash". xfs_fsr uses syscalls, just like other applications. According to your (wrong) logic, if an application uses chown and this causes a kernel oops, this is also not a kernel bug. That's of course wrong - it's the kernel that crashes when an application does certain access patterns.

> (rw,nodiratime,relatime,logbufs=8,logbsize=256k,attr2,barrier,largeio,swalloc)
> and sometimes also
> ,allocsize=64m

As has been reported on this list, this option is really harmful on current xfs - in my case, it led to xfs causing ENOSPC even when the disk was 40% empty (~188gb).

> and I can't find evidence for fragmentation that would be harmful. Yes

Well, define "harmful" - slow logfile reads aren't what I consider "harmful" either. It's just very very slow.

> The allocsize option helps a lot there. I looked at one webserver access
> log, it has 640MB with 99 fragments, but that's not a lot. On our
> Spamgate I see 250MB logs with 374 fragments.

Well, if it were one fragment, you could read that in 4-5 seconds, at 374 fragments, it's probably around 6-7 seconds.
That's not harmful, but if you extrapolate this to a few gigabytes and a lot of files, it becomes quite the overhead.

> don't use the allocsize option there, which I changed now that I looked

That allocsize option is no longer reasonable with newer kernels, as the kernel will reserve 64m diskspace even for 1kb files indefinitely.

> > If XFS is bad at append-only workloads, which is the most common type
> > of workload, then XFS fails to be very relevant for the real world.
>
> may be valid for your world, not mine. We have webservers, fileservers
> and database servers, all of which are not really append style, but more
> delete-and-recreate.

If you find a way of recreating files without appending to them, let me know. The problem with fragmentation is that it happens even for a few writers for "create file" workloads (which do append...).

You probably make a distinction between "writing a file fast" and "writing a file slow", but the distinction is not a qualitative difference. On busy servers that create a lot of files, you get fragmentation the same way as on less busy servers that write files slower. There is little to no difference in the resulting patterns.

> Well, db-servers are rather exceptional here.

Yes, append style is what makes up for the vast majority of disk writes on a normal system, db-servers excepted indeed.

> But if the numbers for fragmentation on your servers are true, you must
> have a very good test case for fragmentation prevention. Therefore it
> could be really interesting if you could grab what Dave Chinner asked
> for:

I'll keep it in mind.

> And maybe he could use it for optimizations. Is there any tool on Linux
> to record such I/O patterns?

I presume strace would do, but that's where the "lot of work" comes in. If there is a ready-to-use tool, that would of course make it easy.
-- The choice of a Deliantra, the free code+content MORPG -----==- _GNU_ http://www.deliantra.net ----==-- _ generation ---==---(_)__ __ ____ __ Marc Lehmann --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de -=====/_/_//_/\_,_/ /_/\_\ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 18+ messages in thread
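The back-of-envelope numbers in the mail above (a contiguous 250MB log read in 4-5 seconds vs. roughly 6-7 seconds at 374 fragments) fall out of simple throughput-plus-seek arithmetic. A sketch, with assumed round numbers for a 2011-era disk (~60 MB/s sequential throughput, ~8 ms per seek - these parameters are illustrative, not measurements from the thread):

```python
def read_time_seconds(size_bytes: float, extents: int,
                      throughput: float = 60e6, seek: float = 0.008) -> float:
    # Sequential transfer time, plus one head seek per extent.
    return size_bytes / throughput + extents * seek

# 250MB log file, contiguous vs. fragmented into 374 extents:
contiguous = read_time_seconds(250e6, 1)    # roughly 4.2 s
fragmented = read_time_seconds(250e6, 374)  # roughly 7.2 s
```

The absolute numbers depend entirely on the assumed drive; the point is that per-extent seek cost, not transfer time, comes to dominate at the tens of thousands of fragments mentioned elsewhere in the thread.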
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr 2011-08-09 11:15 ` Marc Lehmann @ 2011-08-10 6:59 ` Michael Monnerie 2011-08-11 22:04 ` Marc Lehmann 2011-08-10 14:16 ` Dave Chinner 1 sibling, 1 reply; 18+ messages in thread From: Michael Monnerie @ 2011-08-10 6:59 UTC (permalink / raw) To: xfs; +Cc: Marc Lehmann [-- Attachment #1.1: Type: Text/Plain, Size: 2332 bytes --] On Dienstag, 9. August 2011 Marc Lehmann wrote: > On Tue, Aug 09, 2011 at 12:10:48PM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote: > > First of all, please calm down. Getting personal is not bringing us > > anywhere. > > Well, it's not me who's getting personal, so...? A single rant from a dev shouldn't hurt one too much. He might have been sitting in front of some code during 72 hours, his eyes already being in 16:9 format staring at a weird bug... It's OK to strike back once, but then be cool again and work at the problem. > As has been reported on this list, this option is really harmful on > current xfs - in my case, it lead to xfs causing ENOSPC even when the > disk was 40% empty (~188gb). Was this the "NFS optimization" stuff? I don't like that either. > Well, if it were one fragment, you could read that in 4-5 seconds, at > 374 fragments, it's probably around 6-7 seconds. Thats not harmful, > but if you extrapolate this to a few gigabytes and a lot of files, > it becomes quite the overhead. True, if you have to read tons of log files all day. That's not my normal use case, so I didn't bother about that until now. > That allocsize option is no longer reasonable with newer kernels, as > the kernel will reserve 64m diskspace even for 1kb files > indefinitely. Just "as long as the inode is cached" or something, I remember that "echo 3 >drop_caches" cleans that up. Still ugly, I'd say. > If you find a way of recreating files without appending to them, let > me know. Seems we have a different meaning of "append". 
For me, append is when an existing file is re-opened, and data added just to the end of it.

> > And maybe he could use it for optimizations. Is there any tool on
> > Linux to record such I/O patterns?
>
> I presume strace would do, but that's where the "lot of work" comes
> in. If there is a ready-to-use tool, that would of course make it
> easy.

It's a pity that such a generic tool doesn't exist. I can't believe that. Doesn't anybody have such a tool at hand?

--
with kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531

// House for sale: http://zmi.at/langegg/

[-- Attachment #1.2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 18+ messages in thread
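The two competing definitions of "append" in this exchange can be made concrete. A sketch (hypothetical helper names; both variants produce identical file contents and differ only in the open/close pattern the filesystem gets to observe):

```python
def reopen_per_write(path: str, lines) -> None:
    # Monnerie's definition: the existing file is re-opened for each
    # write and data is added at the end, so the filesystem sees a
    # close() after every append.
    for line in lines:
        with open(path, "a") as f:
            f.write(line + "\n")

def open_once_slow_append(path: str, lines) -> None:
    # Lehmann's definition (the syslog pattern): open once, keep the
    # descriptor, and write slowly over the file's lifetime, so the
    # filesystem sees no close() until the writer exits.
    with open(path, "a") as f:
        for line in lines:
            f.write(line + "\n")
            f.flush()
```

The dispute in the thread is precisely whether heuristics keyed to close() (or to the inode staying cached) can distinguish these two writers.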
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
2011-08-10 6:59 ` Michael Monnerie
@ 2011-08-11 22:04 ` Marc Lehmann
2011-08-12 4:05 ` Dave Chinner
0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-11 22:04 UTC (permalink / raw)
To: Michael Monnerie; +Cc: xfs

On Wed, Aug 10, 2011 at 08:59:26AM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> > current xfs - in my case, it led to xfs causing ENOSPC even when the
> > disk was 40% empty (~188gb).
>
> Was this the "NFS optimization" stuff? I don't like that either.

The NFS server apparently opens and closes files very often (probably on every read/write or so, I don't know the details), so XFS was benchmark-improved by keeping the preallocation as long as the inode is in memory.

Practical example: on my box (8GB ram), I upgraded the kernel and started a buildroot build. When I came back 8 hours later the disk was full (some hundreds of gigabytes), even though df showed 300gb or so of free space. That was caused by me setting allocsize=64m and this causing every 3kb object file to use 64m of diskspace (which du showed, but df didn't).

To me, that's an obvious bug, and a dirty hack (you shouldn't fix the NFS server by hacking some band-aid into XFS), but to my surprise I was told on this list that this is important for performance, and my use case isn't what XFS is designed for, but that XFS is designed for good NFS server performance.

> > Well, if it were one fragment, you could read that in 4-5 seconds, at
> > 374 fragments, it's probably around 6-7 seconds. That's not harmful,
> > but if you extrapolate this to a few gigabytes and a lot of files,
> > it becomes quite the overhead.
>
> True, if you have to read tons of log files all day. That's not my
> normal use case, so I didn't bother about that until now.

I am well aware that there are lots of different use cases.
I see that myself because I have so diverse usages on my disks and servers (desktop, media server, news server, web server, game server... all quite different). It's clear that XFS can't handle all this magically, and that this is not a problem in XFS itself; what I do find a bit scary is this "XFS is not made for you" attitude that I was recently confronted with.

> Just "as long as the inode is cached" or something, I remember that
> "echo 3 >drop_caches" cleans that up. Still ugly, I'd say.

Yeah, the more ram you have, the more diskspace is lost.

> > If you find a way of recreating files without appending to them, let
> > me know.
>
> Seems we have a different meaning of "append". For me, append is when an
> existing file is re-opened, and data added just to the end of it.

That rules out many, if not most, log file write patterns, which are classical examples of "append workloads" - most apps do not reopen log files, they create/open them once and then write them often, but always relatively slowly. Syslog is a good example of something that wouldn't be an "append" according to your definition, but typically is seen as such.

Speed is really the only differentiating factor between "append" and "create only", and in practice a filesystem can only catch this by seeing if something is still in ram ("recent use, fast writes") or not, or keeping this information on-disk (which can be a dangerous trade-off).

And yes, your definition is valid - I don't think there is an obvious consensus on which is used, but I think my definition (which includes log files) is more common.

> > I presume strace would do, but that's where the "lot of work" comes
> > in. If there is a ready-to-use tool, that would of course make it
> > easy.
>
> It's a pity that such a generic tool doesn't exist. I can't believe
> that. Doesn't anybody have such a tool at hand?
Yeah, I'm listening :) I hope it doesn't boil down to an instrumented kernel :( -- The choice of a Deliantra, the free code+content MORPG -----==- _GNU_ http://www.deliantra.net ----==-- _ generation ---==---(_)__ __ ____ __ Marc Lehmann --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de -=====/_/_//_/\_,_/ /_/\_\ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 18+ messages in thread
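Per file, the speculative-preallocation overhead being argued about above is visible as the gap between a file's apparent size and its allocated blocks. A minimal sketch (whether any gap actually shows up depends on the filesystem, mount options, and kernel; `st_blocks` is counted in 512-byte units regardless of filesystem block size):

```python
import os

def on_disk_overhead(path: str) -> int:
    """Bytes allocated on disk beyond the file's apparent size.

    Speculative preallocation (e.g. the allocsize=64m behaviour
    complained about in this thread) shows up as allocated blocks
    well past st_size until the kernel drops the preallocation."""
    st = os.stat(path)
    return st.st_blocks * 512 - st.st_size
```

Summing `max(0, on_disk_overhead(p))` over a directory tree approximates the du-vs-expected-size gap described in the buildroot example.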
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
2011-08-11 22:04 ` Marc Lehmann
@ 2011-08-12 4:05 ` Dave Chinner
2011-08-26 8:08 ` Marc Lehmann
0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2011-08-12 4:05 UTC (permalink / raw)
To: Marc Lehmann; +Cc: Michael Monnerie, xfs

On Fri, Aug 12, 2011 at 12:04:19AM +0200, Marc Lehmann wrote:
> On Wed, Aug 10, 2011 at 08:59:26AM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> > > current xfs - in my case, it led to xfs causing ENOSPC even when the
> > > disk was 40% empty (~188gb).
> >
> > Was this the "NFS optimization" stuff? I don't like that either.
>
> The NFS server apparently opens and closes files very often (probably on
> every read/write or so, I don't know the details), so XFS was
> benchmark-improved by keeping the preallocation as long as the inode is in
> memory.

It only does that if the pattern of writes is such that keeping the preallocation around for longer periods of time will reduce potential fragmentation.

Indeed, it's not a NFS specific optimisation, but it is one that directly benefits NFS server IO patterns. e.g. it can also help reduce fragmentation on slow append-only workloads if the necessary conditions are triggered by the log writers (which is the other problem you are complaining noisily about). Given that inodes for log files will almost always remain in memory as they are regularly referenced, it seems like the right solution to that problem, too...

FWIW, you make it sound like "benchmark-improved" is a bad thing. However, I don't hear you complaining about the delayed logging optimisations at all. I'll let you in on a dirty little secret: I tested delayed logging on nothing but benchmarks - it is -entirely- a "benchmark-improved" class optimisation.

But despite how delayed logging was developed and optimised, it has significant real-world impact on performance under many different workloads.
That's because the benchmarks I use accurately model the workloads that cause the problem that needs to be solved.

Similarly, the "NFS optimisation" results in a significant and measurable reduction in fragmentation on NFS-exported XFS filesystems across a wide range of workloads. It's a major win in the real world - I just wish I had thought of it 4 or 5 years ago back when I was at SGI when we first started seeing serious NFS related fragmentation problems at customer sites.

Yes, there have been regressions caused by both changes (though delayed logging had far more serious ones) - that's a fact of life in software development. However, the existence of regressions does not take anything away from the significant real-world improvements that are the result of the changes.

> > > I presume strace would do, but that's where the "lot of work" comes
> > > in. If there is a ready-to-use tool, that would of course make it
> > > easy.
> >
> > It's a pity that such a generic tool doesn't exist. I can't believe
> > that. Doesn't anybody have such a tool at hand?
>
> Yeah, I'm listening :) I hope it doesn't boil down to an instrumented
> kernel :(

GFGI.

http://code.google.com/p/ioapps/wiki/ioreplay

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
2011-08-12 4:05 ` Dave Chinner
@ 2011-08-26 8:08 ` Marc Lehmann
2011-08-31 12:45 ` Dave Chinner
0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-26 8:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Fri, Aug 12, 2011 at 02:05:30PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> It only does that if the pattern of writes is such that keeping the
> preallocation around for longer periods of time will reduce
> potential fragmentation.

That can only be false. Here is an example that I saw *just now*:

I have a process that takes a directory with jpg files (in this case, all around 64kb in size) and losslessly recompresses them. This works by reading a file, writing it under another name (single write() call) and using rename to replace the original file *iff* it got smaller. The typical reduction is 5%. No allocsize option is used. Kernel used was 2.6.39.

This workload would obviously benefit most by having no preallocation anywhere, i.e. have all files tightly packed.

Here is a "du" on a big directory where this process is running, every few minutes:

6439892 .
6439888 .
6620168 .
6633156 .
6697588 .
6729092 .
6755808 .
6852192 .
6816632 .
6250824 .

Instead of decreasing, the size increased, until just before the last du. That's where I did echo 3 >drop_caches, which presumably cleared all those inodes that have not been used for an hour and would never have been used again for writing.

Since XFS obviously keeps quite a bit of preallocation here (or some other magic, but what?), and this workload definitely does not benefit from any preallocation (because xfs has perfect knowledge about the file size at every point in time), what you say is simply not true: The files will not be touched anymore, neither read, nor written, so preallocation is just bad.
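The recompress-and-replace workflow described above - write the new version under a temporary name in a single write(), then rename over the original only if it got smaller - can be sketched as follows (the recompression step is a placeholder; the actual process used a lossless JPEG recompressor):

```python
import os

def recompress(data: bytes) -> bytes:
    # Placeholder for the lossless JPEG recompression step; the real
    # tool typically shrinks the file by about 5%.
    return data

def rewrite_if_smaller(path: str) -> bool:
    """Write a recompressed copy under a temporary name, then rename
    it over the original only if it got smaller; otherwise discard
    the copy. Returns True if the original was replaced."""
    with open(path, "rb") as f:
        original = f.read()
    smaller = recompress(original)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(smaller)          # single write(), size known up front
    if len(smaller) < len(original):
        os.rename(tmp, path)      # atomic replace
        return True
    os.unlink(tmp)
    return False
```

The point of the complaint: the writer's final size is known at the moment of the single write(), so any preallocation kept past the close() is pure waste for this workload.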
Also, bickering about extra fragmentation caused by xfs_fsr when running it daily instead of weekly is weird - the amount of external fragmentation caused by preallocation must be overwhelming with large amounts of ram.

> Indeed, it's not a NFS specific optimisation, but it is one that
> directly benefits NFS server IO patterns.

I'd say it's a grotesque deoptimisation, and definitely doesn't work the way you describe it. In fact, it can't work the way you describe it, because XFS would have to be clairvoyant to make it work. How else would it know that keeping preallocation indefinitely will be useful?

In any case, XFS detects a typical "open file, write file, close file, never touch it again" pattern as something that somehow needs preallocation. I can see how that helps NFS, but in all other cases, this is simply a bug.

> about). Given that inodes for log files will almost always remain in
> memory as they are regularly referenced, it seems like the right
> solution to that problem, too...

Given that, with enough ram, everything stays in ram, most of which is not log files, this behaviour is simply broken.

> FWIW, you make it sound like "benchmark-improved" is a bad thing.

If it costs regular performance or eats diskspace like mad, it's clearly a bad thing, yes. Benchmark performance is irrelevant, what counts is actual performance. If the two coincide, that's great. This is clearly not the case here, of course.

> However, I don't hear you complaining about the delayed logging
> optimisations at all.

I wouldn't be surprised if the new xfs_fsr crashes are caused by these changes, actually. But yes, otherwise they are great - I do keep external journals for most of my filesystems, and the write load for these has decreased by a factor of 10-100 in some metadata-heavy cases (such as lots of renames). Of course, XFS is still way behind other filesystems in managing journal devices.
> I'll let you in on a dirty little secret: I tested delayed logging on
> nothing but benchmarks - it is -entirely- a "benchmark-improved" class
> optimisation.

As a good engineer one would expect you to actually think about whether this optimisation is useful outside of some benchmark setup, too. I am sure you did that, how else would you have come up with the idea in the first place?

> But despite how delayed logging was developed and optimised, it

The difference to the new preallocation is that it's not obviously a bad algorithm. However, the preallocation strategy of wasting some diskspace for every file that has been opened in the last 24 hours or so (depending on ram) is *obviously* wrong, regardless of what your microbenchmarks say. What it does is basically introduce big-cluster allocation, just like with good old FAT, except that people with more RAM get punished more.

> different workloads. That's because the benchmarks I use accurately
> model the workloads that cause the problem that needs to be solved.

That means you will optimise a single problem at the expense of any other workload. This indeed seems to be the case here. Good engineering would make sure that typical use cases that were not the "problem" before wouldn't get unduly affected.

Apart from potentially helping with NFS in your benchmarks, I cannot see any positive aspect of this change. However, I keep hitting the bad aspects of it. It seems that with this change, XFS will degrade much faster due to the insane amounts of useless preallocation tied to files that have been closed and will never be written again, which is by far *most* files. In the example above, roughly 32kb (+-50%) overallocation is associated with each file. FAT, here we come :(

Don't get me wrong, it is great that XFS is now optimised for slow log writing over NFS, and this surely is important for some people, but this comes at an enormous cost to every other workload.
A benchmark that measures additional fragmentation introduced by all those 32kb blocks over some months would be nice.

> Similarly, the "NFS optimisation" results in a significant and measurable
> reduction in fragmentation on NFS-exported XFS filesystems across a

It's the dirtiest hack I have seen in a filesystem. Making an optimisation that only helps with the extremely bad access patterns of NFS (and only sometimes) and forcing this on even for non-NFS filesystems where it only causes negative effects. It's a typical case of "a is broken, so apply some hack to b", while good engineering dictates "a is broken, let's fix a".

Again: Your rationale is that NFS doesn't give you enough information about whether a file is in use, because it doesn't keep it open. This leads you to consider all files whose inode is cached in memory as being "in use" for unlimited amounts of time. Sure, those idiot applications such as cp or mv cannot be trusted. Surely, when mv'ing a file, this means the file will be appended later. Because if not, XFS wouldn't keep the preallocation.

> Yes, there have been regressions caused by both changes (though

The whole thing is a regression - slow appender processes that close a file after each write basically don't exist - close is an extremely good hint that a file has been finalised, and because NFS doesn't give the notion of close (nfsv4 has it, to some extent), suddenly it's ignored for all applications. This is simply a completely, utterly, totally broken algorithm.

> regressions does not take anything away from the significant
> real-world improvements that are the result of the changes.

I gave plenty of real-world examples where these changes are nothing but bad. I have yet to see a *single* real-world example where this isn't the case. All you achieved is that now every workload works as badly as NFS, lots and lots of disk space is wasted, and an enormous amount of external fragmentation is introduced. And that's just with an 8GB box.
I can only imagine how many months files will be considered "in use" just because the box has enough ram to cache their inodes.

> http://code.google.com/p/ioapps/wiki/ioreplay

Since "cp" and "mv" already cause problems in current versions of XFS, I guess we are far from needing those. It seems XFS has been so fundamentally deoptimised w.r.t. preallocation now that there are much bigger fish to catch than freenet. Basically anything that creates files, even when it's just a single open/write/close, is now affected.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / schmorp@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
2011-08-26 8:08 ` Marc Lehmann
@ 2011-08-31 12:45 ` Dave Chinner
0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2011-08-31 12:45 UTC (permalink / raw)
To: Marc Lehmann; +Cc: xfs

On Fri, Aug 26, 2011 at 10:08:41AM +0200, Marc Lehmann wrote:
> On Fri, Aug 12, 2011 at 02:05:30PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > It only does that if the pattern of writes is such that keeping the
> > preallocation around for longer periods of time will reduce
> > potential fragmentation.
>
> That can only be false. Here is an example that I saw *just now*:
>
> I have a process that takes a directory with jpg files (in this case,
> all around 64kb in size) and losslessly recompresses them. This works
> by reading a file, writing it under another name (single write() call)
> and using rename to replace the original file *iff* it got smaller. The
> typical reduction is 5%. No allocsize option is used. Kernel used was
> 2.6.39.
>
> This workload would obviously benefit most by having no preallocation
> anywhere, i.e. have all files tightly packed.
>
> Here is a "du" on a big directory where this process is running, every few
> minutes:
>
> 6439892 .
> 6439888 .
> 6620168 .
> 6633156 .
> 6697588 .
> 6729092 .
> 6755808 .
> 6852192 .
> 6816632 .
> 6250824 .
>
> Instead of decreasing, the size increased, until just before the last
> du. That's where I did echo 3 >drop_caches, which presumably cleared all
> those inodes that have not been used for an hour and would never have been
> used again for writing.

That's the case of the unlinked inode being reused immediately and not having all its state cleared correctly when recycled. That's the problem that was diagnosed and fixed when you reported the first problem.

Can you tell me if your kernel has the bug fix or not, and if not, does applying the fix make the problem go away?

Cheers,

Dave.
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr 2011-08-09 11:15 ` Marc Lehmann 2011-08-10 6:59 ` Michael Monnerie @ 2011-08-10 14:16 ` Dave Chinner 2011-08-11 22:07 ` Marc Lehmann 1 sibling, 1 reply; 18+ messages in thread From: Dave Chinner @ 2011-08-10 14:16 UTC (permalink / raw) To: Marc Lehmann; +Cc: Michael Monnerie, xfs On Tue, Aug 09, 2011 at 01:15:27PM +0200, Marc Lehmann wrote: > On Tue, Aug 09, 2011 at 12:10:48PM +0200, Michael Monnerie <michael.monnerie@is.it-management.at> wrote: > > (rw,nodiratime,relatime,logbufs=8,logbsize=256k,attr2,barrier,largeio,swalloc) > > and sometimes also > > ,allocsize=64m > > As has been reported on this list, this option is really harmful on > current xfs - in my case, it lead to xfs causing ENOSPC even when the disk > was 40% empty (~188gb). Seeing you keep stating this is a problem, I'll ask again whether commit 778e24b ("xfs: reset inode per-lifetime state when recycling it") fixed this problem for you? Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr 2011-08-10 14:16 ` Dave Chinner @ 2011-08-11 22:07 ` Marc Lehmann 0 siblings, 0 replies; 18+ messages in thread From: Marc Lehmann @ 2011-08-11 22:07 UTC (permalink / raw) To: Dave Chinner; +Cc: Michael Monnerie, xfs On Thu, Aug 11, 2011 at 12:16:19AM +1000, Dave Chinner <david@fromorbit.com> wrote: > > As has been reported on this list, this option is really harmful on > > current xfs - in my case, it lead to xfs causing ENOSPC even when the disk > > was 40% empty (~188gb). > > Seeing you keep stating this is a problem, I can only go by what _you_ told me earlier, namely that this works as designed and no change is needed. If you changed your mind without telling me, how should I find out? If you say one thing and do another, you shouldn't be surprised when people trust you and go by what you say. > commit 778e24b ("xfs: reset inode per-lifetime state when recycling > it") fixed this problem for you? If you tell me in which kernel version this is included, I can find out easily. -- The choice of a Deliantra, the free code+content MORPG -----==- _GNU_ http://www.deliantra.net ----==-- _ generation ---==---(_)__ __ ____ __ Marc Lehmann --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de -=====/_/_//_/\_,_/ /_/\_\ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-07 10:26 ` Dave Chinner
  2011-08-08 19:02 ` Marc Lehmann
@ 2011-08-09  9:16 ` Marc Lehmann
  2011-08-09 11:35 ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-09 9:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

This just in, this was on screen, xfs_fsr was active at the time, kernel
is tainted:

[248359.646330] CPU 1
[248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
[248359.646323] Oops: 0000 [#1] SMP
[248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
[248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
[248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09  9:16 ` Marc Lehmann
@ 2011-08-09 11:35 ` Dave Chinner
  2011-08-09 16:35 ` Marc Lehmann
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2011-08-09 11:35 UTC (permalink / raw)
To: Marc Lehmann; +Cc: xfs

On Tue, Aug 09, 2011 at 11:16:43AM +0200, Marc Lehmann wrote:
> This just in, this was on screen, xfs_fsr was active at the time, kernel
> is tainted:
>
> [248359.646330] CPU 1
> [248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
> [248359.646323] Oops: 0000 [#1] SMP
> [248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
> [248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
> [248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

And the event trace to go along with the xfs_fsr run? I don't need the
dmesg output - I need the information in the event trace from the
xfs_fsr run when the problem occurs....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
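For readers facing the same request: one way to capture those trace points around an xfs_fsr run is with trace-cmd, or with the raw ftrace interface. This is a sketch, not Dave's exact procedure; it assumes a kernel with XFS tracepoints compiled in, debugfs mounted at /sys/kernel/debug, root privileges, and an example target of /dev/sda1. The event names shown (xfs_swap_extent_before/after) may differ across kernel versions, so check `events/xfs/` on your system.

```shell
# With trace-cmd: record only the xfs_swap_extent* events while the
# defragmenter runs, then turn the binary log into text.
trace-cmd record -e 'xfs:xfs_swap_extent*' xfs_fsr -v /dev/sda1
trace-cmd report > fsr-trace.txt

# Without trace-cmd, via the ftrace tracefs files directly:
cd /sys/kernel/debug/tracing
echo 1 > events/xfs/xfs_swap_extent_before/enable
echo 1 > events/xfs/xfs_swap_extent_after/enable
echo > trace                      # clear the ring buffer first
xfs_fsr -v /dev/sda1
cat trace > fsr-trace.txt
echo 0 > events/xfs/xfs_swap_extent_before/enable
echo 0 > events/xfs/xfs_swap_extent_after/enable
```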
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 11:35 ` Dave Chinner
@ 2011-08-09 16:35 ` Marc Lehmann
  2011-08-09 22:31 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 2011-08-09 16:35 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

> > [248359.646330] CPU 1
> > [248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
> > [248359.646323] Oops: 0000 [#1] SMP
> > [248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
> > [248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
> > [248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>
> And the event trace to go along with the xfs_fsr run?

It wasn't enabled yet; I didn't expect it to lock up so soon. Even if it
had been, we would have to wait for those rare occurrences where the
kernel oopses without the box locking up (which can take months).

> I don't need to know the dmesg output - I need the information in
> the event trace from the xfs_fsr run when the problem occurs....

And I need an XFS that doesn't oops and take the box with it to deliver
that :)

In any case, I am confident it will happen sooner or later.

I will then not send any kernel oopses, although I had hoped that
NULL-pointer dereferences in a specific part of a function could have
been a good hint.
* Re: frequent kernel BUG and lockups - 2.6.39 + xfs_fsr
  2011-08-09 16:35 ` Marc Lehmann
@ 2011-08-09 22:31 ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2011-08-09 22:31 UTC (permalink / raw)
To: Marc Lehmann; +Cc: xfs

On Tue, Aug 09, 2011 at 06:35:25PM +0200, Marc Lehmann wrote:
> > > [248359.646330] CPU 1
> > > [248359.646326] last sysfs file: /sys/devices/virtual/net/lo/operstate
> > > [248359.646323] Oops: 0000 [#1] SMP
> > > [248359.646319] PGD 8b43067 PUD 1bc63067 PMD 0
> > > [248359.646292] IP: [<ffffffffa13371ab>] xfs_trans_log_inode+0xb/0x2f [xfs]
> > > [248359.646285] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> >
> > And the event trace to go along with the xfs_fsr run?
>
> It wasn't enabled yet; I didn't expect it to lock up so soon. Even if it
> had been, we would have to wait for those rare occurrences where the
> kernel oopses without the box locking up (which can take months).
>
> > I don't need to know the dmesg output - I need the information in
> > the event trace from the xfs_fsr run when the problem occurs....
>
> And I need an XFS that doesn't oops and take the box with it to deliver
> that :)
>
> In any case, I am confident it will happen sooner or later.
>
> I will then not send any kernel oopses, although I had hoped that
> NULL-pointer dereferences in a specific part of a function could have
> been a good hint.

They tell me where the crash occurred - they don't tell me the root
cause of the problem. Understanding the root cause and fixing that is
more important than putting a bandaid over the resultant panic (which
I'll probably do anyway at the same time).

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2011-08-31 12:46 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-06 12:25 frequent kernel BUG and lockups - 2.6.39 + xfs_fsr Marc Lehmann
2011-08-06 14:20 ` Dave Chinner
2011-08-07  1:42 ` Marc Lehmann
2011-08-07 10:26 ` Dave Chinner
2011-08-08 19:02 ` Marc Lehmann
2011-08-09 10:10 ` Michael Monnerie
2011-08-09 11:15 ` Marc Lehmann
2011-08-10  6:59 ` Michael Monnerie
2011-08-11 22:04 ` Marc Lehmann
2011-08-12  4:05 ` Dave Chinner
2011-08-26  8:08 ` Marc Lehmann
2011-08-31 12:45 ` Dave Chinner
2011-08-10 14:16 ` Dave Chinner
2011-08-11 22:07 ` Marc Lehmann
2011-08-09  9:16 ` Marc Lehmann
2011-08-09 11:35 ` Dave Chinner
2011-08-09 16:35 ` Marc Lehmann
2011-08-09 22:31 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox