* Problem about very high Average Read/Write Request Time
@ 2014-10-18  9:26 quanjun hu
  2014-10-18 12:38 ` Emmanuel Florac
  2014-10-19 21:16 ` Stan Hoeppner

From: quanjun hu @ 2014-10-18 9:26 UTC (permalink / raw)
To: xfs

Hi,

I am using XFS on a RAID 5 array (~100 TB) with the log on an external
SSD device. The mount information is:

  /dev/sdc on /data/fhgfs/fhgfs_storage type xfs
  (rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota)

When doing only reading or only writing the speed is very fast
(~1.5 GB/s), but when doing both the speed is very slow (~100 MB/s),
with high r_await (160) and w_await (200000).

1. How can I reduce the average request time?
2. Can I use an SSD as a read/write cache for XFS?

Best regards,
Quanjun

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Problem about very high Average Read/Write Request Time

From: Emmanuel Florac @ 2014-10-18 12:38 UTC (permalink / raw)
To: quanjun hu; +Cc: xfs

On Sat, 18 Oct 2014 17:26:40 +0800, you wrote:

> Hi,
> I am using XFS on a RAID 5 array (~100 TB) with the log on an external
> SSD device. The mount information is:
>   /dev/sdc on /data/fhgfs/fhgfs_storage type xfs
>   (rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota)
> When doing only reading or only writing the speed is very fast
> (~1.5 GB/s), but when doing both the speed is very slow (~100 MB/s),
> with high r_await (160) and w_await (200000).

What are your kernel version, mount options and xfs_info output?

> 1. How can I reduce the average request time?
> 2. Can I use an SSD as a read/write cache for XFS?

Sure, using bcache and other similar tools.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
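[Editor's note: the bcache route mentioned above can be sketched roughly as
below. This is a hedged illustration, not tested against the OP's setup: the
device names (an SSD partition /dev/sdb2 caching the RAID array /dev/sdc)
are assumptions, and make-bcache reformats the devices it is given,
destroying any existing data.]

```shell
# Hypothetical bcache setup: an SSD partition caching the RAID5 array.
# Device names are assumptions; adapt to the real hardware.
# DESTRUCTIVE: make-bcache formats the devices it is given.
make-bcache -C /dev/sdb2        # format the SSD partition as a cache device
make-bcache -B /dev/sdc         # format the RAID array as a backing device

# Register both with the kernel (udev usually does this automatically):
echo /dev/sdb2 > /sys/fs/bcache/register
echo /dev/sdc  > /sys/fs/bcache/register

# Attach the backing device to the cache set, using the cache set UUID
# printed by 'make-bcache -C' (or found under /sys/fs/bcache/):
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# Optionally cache writes as well (the default mode is writethrough):
echo writeback > /sys/block/bcache0/bcache/cache_mode

# Then create the filesystem on the cached device and mount as usual:
mkfs.xfs /dev/bcache0
```

The `<cache-set-uuid>` placeholder must be replaced with the real UUID; note
that in writeback mode a cache-device failure can lose data, so writethrough
is the safer default for a read cache.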
* Re: Problem about very high Average Read/Write Request Time

From: Peter Grandi @ 2014-10-19 10:10 UTC (permalink / raw)
To: Linux fs XFS

>> I am using XFS on a RAID 5 array (~100 TB) with the log on an
>> external SSD device. The mount information is: /dev/sdc on
>> /data/fhgfs/fhgfs_storage type xfs
>> (rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
>> When doing only reading or only writing the speed is very fast
>> (~1.5 GB/s), but when doing both the speed is very slow
>> (~100 MB/s), with high r_await (160) and w_await (200000).

> What are your kernel version, mount options and xfs_info output?

Those are usually important details, but in this case the
information that matters is already present.

There is a ratio of 31 (thirty one) between 'swidth' and 'sunit',
and assuming that this reflects the geometry of the RAID5 set, and
given commonly available disk sizes, it can be guessed that with
amazing "bravery" someone has configured a RAID5 out of 32 (thirty
two) high capacity/low IOPS 3TB drives, or something similar.

It is even "braver" than that: if the mount point
"/data/fhgfs/fhgfs_storage" is descriptive, this "brave" RAID5 set
is supposed to hold the object storage layer of a BeeGFS highly
parallel filesystem, and therefore will likely see mostly-random
accesses.

This issue should be moved to the 'linux-raid' mailing list, as
from the reported information it has nothing to do with XFS.

BTW, the 100 MB/s aggregate over 31 drives means around 3 MB/s per
drive, which seems pretty good for a read/write workload with
mostly-random accesses and high RMW correlation.

It is notable but not surprising that XFS works well even with
such a "brave" choice of block storage layer, untainted by any
"cowardly" consideration of the effects of RMW and using drives
designed for capacity rather than IOPS.
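[Editor's note: the drive-count deduction above follows directly from the
mount options; 'sunit' and 'swidth' are reported there in 512-byte sectors,
and the ~100 TB size is from the original post. A sketch of the arithmetic,
not a statement about the actual hardware:]

```python
# Reconstructing the geometry guess from the reported mount options.
# sunit/swidth in mount output are counted in 512-byte sectors.
SECTOR = 512
sunit_sectors = 512      # per-drive chunk: 512 sectors = 256 KiB
swidth_sectors = 15872   # full data stripe across all data drives

chunk_kib = sunit_sectors * SECTOR // 1024    # 256 KiB
stripe_kib = swidth_sectors * SECTOR // 1024  # 7936 KiB (~7.75 MiB)

data_drives = swidth_sectors // sunit_sectors  # 31 data drives
total_drives = data_drives + 1                 # plus one parity drive (RAID5)

per_drive_tb = 100 / data_drives               # ~3.2, suggesting 3 TB drives
print(chunk_kib, stripe_kib, data_drives, total_drives, round(per_drive_tb, 1))
# prints: 256 7936 31 32 3.2
```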
* Re: Problem about very high Average Read/Write Request Time

From: Bernd Schubert @ 2014-10-20 8:00 UTC (permalink / raw)
To: Peter Grandi, Linux fs XFS

On 10/19/2014 12:10 PM, Peter Grandi wrote:
> [ ... ]
> There is a ratio of 31 (thirty one) between 'swidth' and
> 'sunit', and assuming that this reflects the geometry of the
> RAID5 set, and given commonly available disk sizes, it can be
> guessed that with amazing "bravery" someone has configured a
> RAID5 out of 32 (thirty two) high capacity/low IOPS 3TB drives,
> or something similar.
>
> It is even "braver" than that: if the mount point
> "/data/fhgfs/fhgfs_storage" is descriptive, this "brave" RAID5
> set is supposed to hold the object storage layer of a BeeGFS
> highly parallel filesystem, and therefore will likely see
> mostly-random accesses.

Where do you get the assumption that FhGFS/BeeGFS is going to do
random reads/writes, or that the application on top of it is going
to do that?

Bernd
* Re: Problem about very high Average Read/Write Request Time

From: Peter Grandi @ 2014-10-21 18:27 UTC (permalink / raw)
To: Linux fs XFS

>> [ ... ] supposed to hold the object storage layer of a BeeGFS
>> highly parallel filesystem, and therefore will likely have
>> mostly-random accesses.

> Where do you get the assumption that FhGFS/BeeGFS is going to
> do random reads/writes, or that the application on top of it is
> going to do that?

In this specific case it is not an assumption, thanks to the
prominent fact that the original poster was testing (locally I
guess) and complaining about concurrent read/writes, which result
in random-like arm movement even if each of the read and write
streams is entirely sequential. I even pointed this out, probably
not explicitly enough:

>>> When doing only reading or only writing the speed is very
>>> fast (~1.5 GB/s), but when doing both the speed is very slow
>>> (~100 MB/s), with high r_await (160) and w_await (200000).

  BTW, the 100 MB/s aggregate over 31 drives means around 3 MB/s
  per drive, which seems pretty good for a read/write workload
  with mostly-random accesses and high RMW correlation.

Also, if this testing was appropriate, then it was because the
intended workload was indeed concurrent reads and writes to the
object store.

It is not a mere assumption in the general case either; it is both
commonly observed and a simple deduction, because of the nature of
distributed filesystems, in particular parallel HPC ones like
Lustre or BeeGFS, but also AFS and even NFS.

* Clients have caches. Therefore most of the locality in the
  (read) access patterns will hopefully be filtered out by the
  client cache. This applies (ideally) to any distributed
  filesystem.

* HPC/parallel servers tend to have many clients (e.g. it could
  be 10,000 clients and 500 object storage servers), and
  hopefully each client works on a different subset of the data
  tree, with the distribution of data objects onto servers
  hopefully random.

  Therefore it is likely that many clients will read and write
  many different files concurrently on the same server, resulting
  in many random "hotspots" in each server's load.

  Note that each client could be doing entirely sequential IO to
  each file it accesses, but the concurrent accesses to possibly
  widely scattered files will turn that into random IO at the
  server level.

Just about the only case where sequential client workloads don't
become random workloads at the server is when the client workload
is such that only one file is "hot" per server.

There is an additional issue favouring random access patterns:

* Typically large fileservers are set up with a lot of storage
  because of anticipated lifetime usage, so they start mostly
  empty.

* Most filesystems then allocate new data in regular patterns,
  often starting from the beginning of available storage, usually
  in an attempt to minimize arm travel time (XFS uses various
  heuristics, which differ somewhat depending on whether the
  'inode64' option is specified).

* Unfortunately, as the filetree becomes larger, new allocations
  have to be made farther away, resulting in longer travel times
  and more apparent randomness at the storage server level.

* Eventually, if the object server reaches a steady state where
  roughly as much data is deleted as created, the free storage
  areas will become widely scattered, leading to essentially
  random allocation; the more capacity used, the more random.

Leaving a significant percentage of capacity free, like at least
10% and more like 20%, greatly increases the chance of finding
free space to put new data near existing "related" data. This
increases locality, but only at the single-stream level;
therefore it usually does not help widely shared distributed
servers that much, and in particular does not apply that much to
object stores, because they usually obscure which data object is
related to which.

The above issues are pretty much "network and distributed
filesystems for beginners" notes, but in significant part they
also apply to widely shared non-network, non-distributed servers
on which XFS is often used, so they may be usefully mentioned in
this list.
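[Editor's note: the point above, that concurrent sequential streams turn
into random-like arm movement at the server, can be illustrated with a toy
head-travel model. The assumptions are purely illustrative: one reader and
one writer at distant, disjoint LBA ranges, served strictly in arrival
order; real disks and schedulers reorder requests, so this only shows the
trend.]

```python
# Toy model: total head travel for an in-order request stream, in LBA units.
def total_seek_distance(requests):
    head, travel = 0, 0
    for lba in requests:
        travel += abs(lba - head)
        head = lba
    return travel

N = 1000
reads  = list(range(N))                      # sequential stream near LBA 0
writes = [10_000_000 + i for i in range(N)]  # sequential stream far away

# Each stream served alone: almost no head movement beyond one long seek.
alone = total_seek_distance(reads) + total_seek_distance(writes)

# The same requests interleaved: every request pays a full-distance seek.
interleaved = [lba for pair in zip(reads, writes) for lba in pair]
mixed = total_seek_distance(interleaved)

print(mixed // alone)   # on the order of 2000x more head travel here
```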
* Re: Problem about very high Average Read/Write Request Time

From: Bernd Schubert @ 2014-10-23 16:20 UTC (permalink / raw)
To: Peter Grandi, Linux fs XFS, quanjun hu

On 10/21/2014 08:27 PM, Peter Grandi wrote:
>>> [ ... ] supposed to hold the object storage layer of a BeeGFS
>>> highly parallel filesystem, and therefore will likely have
>>> mostly-random accesses.
>
>> Where do you get the assumption that FhGFS/BeeGFS is going to
>> do random reads/writes, or that the application on top of it
>> is going to do that?
>
> In this specific case it is not an assumption, thanks to the
> prominent fact that the original poster was testing (locally I
> guess) and complaining about concurrent read/writes, which
> result in random-like arm movement even if each of the read and
> write streams is entirely sequential. I even pointed this out,
> probably not explicitly enough:
>
> >>> When doing only reading or only writing the speed is very
> >>> fast (~1.5 GB/s), but when doing both the speed is very
> >>> slow (~100 MB/s), with high r_await (160) and
> >>> w_await (200000).

The OP is trying to figure out what is going on. Low speed and
high latencies are not sufficient information to speculate about
the cause.

> BTW, the 100 MB/s aggregate over 31 drives means around 3 MB/s
> per drive, which seems pretty good for a read/write workload
> with mostly-random accesses and high RMW correlation.

The OP did not provide sufficient information about the IO pattern
to know whether there is RMW or random access involved.

> Also, if this testing was appropriate, then it was because the
> intended workload was indeed concurrent reads and writes to the
> object store.
>
> It is not a mere assumption in the general case either; [ ... ]
>
> * Clients have caches. Therefore most of the locality in the

Correct is: clients *might* have caches. Apart from application
direct IO, for BeeGFS the cache type is a configuration option.

> (read) access patterns will hopefully be filtered out by the
> client cache. This applies (ideally) to any distributed
> filesystem.

You cannot filter out everything, e.g. random reads of a large
file. Local or remote file system does not matter here.

> * HPC/parallel servers tend to have many clients (e.g. it could
>   be 10,000 clients and 500 object storage servers), and
>   hopefully each client works on a different subset of the data
>   tree, with the distribution of data objects onto servers
>   hopefully random.
>
>   Therefore it is likely that many clients will read and write
>   many different files concurrently on the same server,
>   resulting in many random "hotspots" in each server's load.

If that were important here, there would be no difference between
a single write and parallel read/write. So irrelevant.

>   Note that each client could be doing entirely sequential IO to
>   each file it accesses, but the concurrent accesses to possibly
>   widely scattered files will turn that into random IO at the
>   server level.

How does this matter if the OP is comparing a 1-thread write vs.
2-thread read/write?

> Just about the only case where sequential client workloads don't
> become random workloads at the server is when the client
> workload is such that only one file is "hot" per server.
>
> There is an additional issue favouring random access patterns:
>
> * Typically large fileservers are set up with a lot of storage
>   because of anticipated lifetime usage, so they start mostly
>   empty.
> * Most filesystems then allocate new data in regular patterns,
>   [ ... ]
> * Unfortunately, as the filetree becomes larger, new allocations
>   have to be made farther away, [ ... ]
> * Eventually, if the object server reaches a steady state where
>   roughly as much data is deleted as created, the free storage
>   areas will become widely scattered, leading to essentially
>   random allocation, [ ... ]

All of that is irrelevant if a single write is fast and a parallel
read/write is slow.

> Leaving a significant percentage of capacity free, [ ... ]
>
> The above issues are pretty much "network and distributed
> filesystems for beginners" notes, [ ... ]

It is a lot of text and does not help the OP at all. And the
claim/speculation that the parallel file system would introduce
random access is also wrong.

Before anyone can even start to speculate, the OP first needs to
provide the exact IO pattern and information about /dev/sdc.
* Re: Problem about very high Average Read/Write Request Time

From: Peter Grandi @ 2014-10-23 20:09 UTC (permalink / raw)
To: Linux fs XFS

[ ... ]

>>>>> There is a ratio of 31 (thirty one) between 'swidth' and
>>>>> 'sunit', and assuming that this reflects the geometry of the
>>>>> RAID5 set, and given commonly available disk sizes, it can
>>>>> be guessed that with amazing "bravery" someone has
>>>>> configured a RAID5 out of 32 (thirty two) high capacity/low
>>>>> IOPS 3TB drives, or something similar. [ ... ]

>>>>> if the mount point "/data/fhgfs/fhgfs_storage" is
>>>>> descriptive, this "brave" RAID5 set is supposed to hold the
>>>>> object storage layer of a BeeGFS highly parallel filesystem,
>>>>> and therefore will likely have mostly-random accesses.
>>>>> [ ... ]

>>>>> It is notable but not surprising that XFS works well even
>>>>> with such a "brave" choice of block storage layer, untainted
>>>>> by any "cowardly" consideration of the effects of RMW and
>>>>> using drives designed for capacity rather than IOPS.

>>>> Also, if this testing was appropriate, then it was because
>>>> the intended workload was indeed concurrent reads and writes
>>>> to the object store.

>>> Where do you get the assumption that FhGFS/BeeGFS is going to
>>> do random reads/writes, or that the application on top of it
>>> is going to do that?

>> In this specific case it is not an assumption, thanks to the
>> prominent fact that the original poster was testing (locally I
>> guess) and complaining about concurrent read/writes, which
>> result in random-like arm movement even if each of the read and
>> write streams is entirely sequential. [ ... ]

> Low speed and high latencies are not sufficient information to
> speculate about the cause.

It is pleasing that you seem to know at least that, by themselves,
«low speed and high latencies» are indeed not sufficient.

But in «the specific case» what is sufficient to make a good guess
is what I wrote, which you seem to have been unable to notice or
understand.

>> BTW, the 100 MB/s aggregate over 31 drives means around 3 MB/s
>> per drive, which seems pretty good for a read/write workload
>> with mostly-random accesses and high RMW correlation.

> The OP did not provide sufficient information about the IO
> pattern to know whether there is RMW or random access involved.

The OP of «the specific case» reported that the XFS filesystem is
configured for a 32-wide RAID5 set and that:

> when doing only reading / only writing, the speed is very
> fast (~1.5G), but when doing both the speed is very slow

and perhaps you did not notice that, or did not notice or
understand what I wrote subsequently, as you seemed to be
requesting a detailed explanation of my conclusion, that:

>> [ ... ] concurrent read/writes, which result in random-like
>> arm movement even if each of the read and write streams is
>> entirely sequential. [ ... ]

Because then there are at least two hotspots, the read one and the
write one, except in the very special case where an application is
reading and writing the same block each time.

Even worse, since in «the specific case» we have an "imaginative"
32-wide RAID5, unless the writes are exactly aligned with the
large stripes there is going to be a lot of RMW, resulting in the
arms going back and forth (and even if aligned, many RAID
implementations still end up doing a fair bit of RMW).
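[Editor's note: the RMW exposure described above follows from the reported
geometry. A hedged sketch: the 256 KiB chunk and ~7.75 MiB stripe are
derived from the mount options, and the alignment rule is the generic RAID5
full-stripe-write condition, not a claim about this particular controller.]

```python
# RAID5 writes avoid read-modify-write only when they cover whole,
# aligned stripes, so parity can be computed from the new data alone.
SECTOR = 512
chunk = 512 * SECTOR           # sunit: 256 KiB per-drive chunk
full_stripe = 15872 * SECTOR   # swidth: ~7.75 MiB of data per stripe

def is_full_stripe_write(offset_bytes, length_bytes):
    return offset_bytes % full_stripe == 0 and length_bytes % full_stripe == 0

# Even a generous 1 MiB write covers only a fraction of this stripe,
# forcing a parity read-modify-write (extra reads, extra seeks):
print(is_full_stripe_write(0, 1 << 20))   # False
print(full_stripe / (1 << 20))            # 7.75 (MiB needed per RMW-free write)
```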
Knowing that, and that it is a 32-wide RAID5 with 3 TB disks (low
IOPS per GB), and that the reported rate is poor by
single-threaded standards but reasonable for a double-threaded
(concurrent read/write) workload, and that XFS in general behaves
pretty well, should be sufficient to give a reasonable guess:

>>>>> This issue should be moved to the 'linux-raid' mailing list,
>>>>> as from the reported information it has nothing to do with
>>>>> XFS.

But I am just repeating what you seem to have been unable to read
or understand...

PS: for people following this discussion, there can be many
reasons why that 32-wide RAID5, which is such a very "brave"
setup, is behaving like that on the random-ish access patterns
arising from concurrent read/write: an initial sync still going
on, not so good default settings, or the scheduling of the
hardware RAID HBA (as suggested by the device name being 'sdc'
instead of 'md$N'), etc., and some of these interact with how XFS
operates; but it is indeed a discussion for the Linux RAID list,
at least first.
* Re: Problem about very high Average Read/Write Request Time

From: Dave Chinner @ 2014-10-24 21:45 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux fs XFS

On Thu, Oct 23, 2014 at 09:09:47PM +0100, Peter Grandi wrote:
> >>> Where do you get the assumption that FhGFS/BeeGFS is going
> >>> to do random reads/writes, or that the application on top of
> >>> it is going to do that?
>
> >> In this specific case it is not an assumption, thanks to the
> >> prominent fact that the original poster was testing (locally
> >> I guess) and complaining about concurrent read/writes, which
> >> result in random-like arm movement even if each of the read
> >> and write streams is entirely sequential.
>
> [ ... ]
>
> > Low speed and high latencies are not sufficient information to
> > speculate about the cause.
>
> It is pleasing that you seem to know at least that, by
> themselves, «low speed and high latencies» are indeed not
> sufficient.
>
> But in «the specific case» what is sufficient to make a good
> guess is what I wrote, which you seem to have been unable to
> notice or understand.

Peter, I really don't care if you are right or wrong; your
response is entirely inappropriate for this forum. Wheaton's Law:
"Don't Be a Dick."

Bernd is entitled to point out how tenuous your thread of logic is
- if he didn't, I was going to say exactly the same thing - as it
is based entirely on a house of assumptions you haven't actually
verified. An appropriate response would be to ask the OP to
describe their workload and storage in more detail so you can
verify which of your assumptions were correct and which weren't,
and take the discussion from there.

But instead of taking the evidence-based verification path, you've
resorted to personal attacks to defend your tenuous logic. That is
out of line and not acceptable behaviour.

Knowledge is not a cudgel to beat people down with. Nobody really
cares how much you know, nor do they need you to try to prove you
know more than they do. If you succeed in proving how much of an
Expert(tm) you are, then the only thing that people will remember
about you is "what a dick that guy is".

Unfortunately, Peter, you've made a habit of this behaviour. Every
discussion thread you enter ends up with you abusing someone
because they dared to either question your assertions or didn't
understand what you said precisely. Indeed, I've come to associate
your name with such behaviour over the past couple of years, to
the point where I see your name in a thread and wonder what will
trigger you to abuse someone before I've even read the email.

With this email, you've finally reached my Intolerable Dickhead On
The Internet Threshold. Given that this is on the XFS mailing
list, and I'm the XFS maintainer, it falls to me to draw a line in
the sand: such behaviour is not acceptable in this forum.

In future, Peter, please do not post to the list if you can't be
nice or stay on topic. We don't need you to "help" by abusing
people; we get along and solve problems just fine without you.
Hence if you can't play nicely with others then please go away and
don't come back.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: Problem about very high Average Read/Write Request Time

From: Peter Grandi @ 2014-10-25 11:00 UTC (permalink / raw)
To: Linux fs XFS

> [ ... ] entitled to point out how tenuous your thread of logic
> is - if he didn't I was going to say exactly the same thing

You are both entitled to your opinions, but not to have them
unchallenged, especially when they are bare statements.

> based entirely on a house of assumptions you haven't actually
> verified.

That seems highly imaginative, as my conclusion that:

  http://oss.sgi.com/archives/xfs/2014-10/msg00335.html
  > This issue should be moved to the 'linux-raid' mailing list
  > as from the reported information it has nothing to do with
  > XFS.

was factually based:

  http://oss.sgi.com/archives/xfs/2014-10/msg00335.html
  > There is a ratio of 31 (thirty one) between 'swidth' and
  > 'sunit' and assuming that this reflects the geometry of the
  > RAID5 set and given commonly available disk sizes it can be
  > guessed that with amazing "bravery" someone has configured a
  > RAID5 out of 32 (thirty two) high capacity/low IOPS 3TB
  > drives, or something similar.

That there is a ratio of 31 is a verified fact, and so is the
reported size of the block device being 100 TB. Much of the rest
is arithmetic, and I indicated that there was some guesswork
involved, mostly in assuming that those reported facts were
descriptive of the actual configuration.

Simply for brevity I did not also point out specifically that the
reported facts of «high r_await(160) and w_await(200000)» and the
"Subject:" of «very high Average Read/Write Request Time»
contributed to indicating a (big) issue with the storage layer,
and that the presumed width of the array, 32, is congruent with
typical enclosure capacities.

Another poster went far further in guesswork, and stated what I
was describing as guesses instead as obvious facts:

  http://oss.sgi.com/archives/xfs/2014-10/msg00337.html
  > As others mentioned this isn't an XFS problem. The problem is
  > that your RAID geometry doesn't match your workload. Your
  > very wide parity stripe is apparently causing excessive
  > seeking with your read+write workload due to
  > read-modify-write operations.

and went on to a whole discussion wholly unrelated to XFS based on
that:

  > To mitigate this, and to increase resiliency, you should
  > switch to RAID6 with a smaller chunk. If you need maximum
  > capacity make a single RAID6 array with 16 KiB chunk size.
  > This will yield a 496 KiB stripe width, increasing the odds
  > that all writes are a full stripe, and hopefully eliminating
  > much of the RMW problem.

  > A better option might be making three 10 drive RAID6 arrays
  > (two spares) with 32 KiB chunk, 256 KiB stripe width, and
  > concatenating the 3 arrays with mdadm --linear.

The above assumptions and offtopic suggestions have gone
unquestioned; by myself too, even though I disagree with some of
the recommendations, also because I think them premature: we
don't know what the requirements really are beyond what can be
guessed from «the reported information». That's also why I
suggested continuing the discussion on the Linux RAID list.

The guess that the filesystem was meant to be an object store is
also based on a verified fact:

  > if the mount point "/data/fhgfs/fhgfs_storage" is
  > descriptive, this "brave" RAID5 set is supposed to hold the
  > object storage layer of a BeeGFS

Also, Bernd did not initially question my analysis of the 100 TB
filesystem case, but asked a wholly separate question about this
aside:

  > the object storage layer of a BeeGFS highly parallel
  > filesystem, and therefore will likely have mostly-random
  > accesses.

To that question I provided a reasonable and detailed *technical*
explanation, both as to the specific case and in general, linking
it to both the original question by QH and to the list topic,
which is XFS.

As a reminder, this thread seems to me to contain 3 distinct even
if connected *technical* topics:

* Whether the report about the 100 TB RAID-based XFS filesystem
  contained evidence indicating an XFS issue or a RAID issue;
  this was introduced by QH.

* Whether concurrent random-ish read/writes tend to be the
  workload observed by object stores in large parallel HPC
  systems; this was introduced by Bernd.

* Whether concurrent random-ish read/write would happen in the
  use of that specific filesystem as an object store; this was
  introduced by myself to link QH's original question to Bernd's
  new question, because strictly speaking Bernd's question seemed
  to me offtopic on the XFS mailing list.

Then Bernd seemed to switch topics again by mentioning 1- and
2-threaded read/write in the context of the general issue of the
access patterns of large parallel HPC filesystem object stores,
and that seemed strange to me, as I commented, so I ignored it.

> appropriate response would be to ask the OP to describe their
> workload and storage in more detail

Indeed, and I suggested moving the discussion to the Linux RAID
mailing list for that purpose, because the evidence quoted above
seemed to indicate that a 32-wide RAID5 was involved, as in:

  > This issue should be moved to the 'linux-raid' mailing list
  > as from the reported information it has nothing to do with
  > XFS.

This left QH free to report more information, as asked, to
indicate that the issue was more relevant to the XFS list than to
the Linux RAID list, or to move to the Linux RAID list with more
details.

Again, the suggestion to continue the discussion in another list
that seemed more useful to QH was based on simple inferences from
3 reported facts: the ratio of 31, the 100 TB size, and the "fast"
single-threaded speed vs. the slow concurrent read/write speed
(plus the concurrently high wait times).

You and Bernd are entitled to think those are not good guesses
(just as Stan instead took them as good ones), and it would be
interesting if you provided substantive reasons why the suggestion
to continue the discussion on the Linux RAID list was
inappropriate; but you haven't contributed any other than your
say-so.

Also, while suggestions have been made to QH by different people
to provide more details and/or move the discussion to the Linux
RAID list, notably this has not happened yet.
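[Editor's note: the RAID6 geometries quoted above from msg00337 are at
least internally consistent; a quick check of the arithmetic. The drive
counts below are inferred from the quoted stripe widths, assuming the
standard two parity drives per RAID6 array; the 33-drive figure for the
single-array option is this editor's inference, not stated in the post.]

```python
# Data stripe width of a RAID6 array: (drives - 2 parity) * chunk size.
def raid6_stripe_kib(total_drives, chunk_kib):
    return (total_drives - 2) * chunk_kib

# "16 KiB chunk ... 496 KiB stripe width": implies 31 data drives (33 total).
print(raid6_stripe_kib(33, 16))   # 496
# "three 10 drive RAID6 arrays ... 32 KiB chunk, 256 KiB stripe width":
print(raid6_stripe_kib(10, 32))   # 256
```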
* Re: Problem about very high Average Read/Write Request Time 2014-10-25 11:00 ` Peter Grandi @ 2014-10-25 19:31 ` Stan Hoeppner 0 siblings, 0 replies; 13+ messages in thread From: Stan Hoeppner @ 2014-10-25 19:31 UTC (permalink / raw) To: Peter Grandi, Linux fs XFS On 10/25/2014 06:00 AM, Peter Grandi wrote: ... > Another poster went far further in guesswork, and stated what I > was describing as guesses instead as obvious facts: > > http://oss.sgi.com/archives/xfs/2014-10/msg00337.html > > As others mentioned this isn't an XFS problem. The problem is that > > your RAID geometry doesn't match your workload. Your very wide > > parity stripe is apparently causing excessive seeking with your > > read+write workload due to read-modify-write operations. When a parity array's throughput drops 2 orders of magnitude, from ~1.5 GB/s to 100 MB/s, RMW is historically the most likely cause, especially with such a wide stripe. So yes, this is a guess, but an educated one. > and went on to make a whole discussion wholly unrelated to XFS > based on that: > > > To mitigate this, and to increase resiliency, you should > > switch to RAID6 with a smaller chunk. If you need maximum > > capacity make a single RAID6 array with 16 KiB chunk size. > > This will yield a 496 KiB stripe width, increasing the odds > > that all writes are a full stripe, and hopefully eliminating > > much of the RMW problem. > > > A better option might be making three 10 drive RAID6 arrays > > (two spares) with 32 KiB chunk, 256 KiB stripe width, and > > concatenating the 3 arrays with mdadm --linear. XFS is a layer of the Linux IO stack, and none of these layers exist in isolation. If someone using XFS has a problem and it may not be XFS specific, we're still going to lend assistance where we can. 
> The above assumptions and offtopic suggestions have been
> unquestioned; by myself too, even if I disagree with some of the
> recommendations, also as I think them premature because we don't
> know what the requirements really are beyond what can be guessed
> from «the reported information». That's also why I suggested to
> continue the discussion on the Linux RAID list.

If you haven't noticed, Peter, the Chinese guys seem to post once and
never come back. I don't know if this is a cultural thing or something
else, but that's the way they seem to operate. There is rarely
interaction with them, no follow ups, no additional information
provided. So I tend to give them many ideas on the obvious path to
work with in my reply, after asking for additional information, which
will likely never arrive. Moving the thread to linux-raid wouldn't
help.

And I'm sure you know Dave didn't come down on you due to the
guesswork in your posts, but because of your delivery style and your
attitude and behavior towards others. It seems the latter prompted his
critique of the former.

Cheers,
Stan
* Re: Problem about very high Average Read/Write Request Time
  2014-10-24 21:45     ` Dave Chinner
  2014-10-25 11:00       ` Peter Grandi
@ 2014-10-25 12:36       ` Peter Grandi
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Grandi @ 2014-10-25 12:36 UTC (permalink / raw)
  To: Linux fs XFS

Stan Hoeppner:

«Doing this via subterfuge simply reduces people's level of respect
for you»

«In your 10,000th attempt to generate self gratification by
demonstrating your superior knowledge (actually lack thereof) on this
list, all you could have possibly achieved here is confusing the OP
even further. I find it sad that you've decided to prey on the young,
the inexperienced, after realizing the educated began ignoring you
long ago.»

Christoph Hellwig:

«he's made himself a name as not only beeing technically incompetent
but also extremly abrasive.»

«But please stop giving advise taken out of the thin air to people on
the lists that might actually believe whatever madness you just
dreamed up.»

Dave Chinner:

«Just ignore the troll, Stan.»

«But instead of taking the evidence-based verification path, you've
resorted to personal attacks to defend your tenuous logic.»

«you've made a habit of this behaviour. Every discussion thread you
enter ends up with you abusing someone because they dared to either
question your assertions or didn't understand what you said
precisely.»

«you've finally reached my Intolerable Dickhead On The Internet
Threshold. Given that this is on the XFS mailing list, and I'm the XFS
Maintainer it falls to me to draw a line in the sand: such behaviour
is not acceptible in this forum.»

According to Russell Cattelan (the owner of the XFS mailing list, at
least in 2012) and "other prominent members of the XFS team", the
statements I quoted above are «acceptible in this forum»:

http://oss.sgi.com/archives/xfs/2012-04/threads.html#00051

«[ ... ] turning the XFS mailing list in a disreputable vehicle for
offtopic and offensive flaming.
This to me looks like mobbing, because coordinated personal attacks
have been done by a small group of people [ ... ]»

Russell Cattelan:

«You must be joking! I'm not about to ban one of the most
knowledgeable and productive xfs developers from the email list.»

«Well, if one of these guys posts offtopic and malicious rants, I
guess that his being an XFS developer should not matter. With being an
XFS developer comes also some responsibility to maintain a technical
and professional tone in the XFS mailing list, and not to abuse his
position. Same for the others.»

Russell Cattelan:

«I see nothing particularly offensive about these posts.»

«They are personal attacks about competence and character. They are
pure flames, and because they are coordinated they seem to be mobbing.
To me, you seem to be explicitly endorsing the use of the XFS mailing
list to publish "ad hominem" attacks and mobbing.»

Russell Cattelan:

«I can understand you may be upset about what Christoph said; he has
every right to state his opinions on things no matter how much you
disagree.»

«But his opinions are on offtopic "things": the topic of the XFS
mailing list is XFS, not people's competence or character, or
rants/attacks on people. Some of the links I have sent you contain no
technical content, purely personal attacks. How can this be legitimate
XFS content?»

Russell Cattelan:

«Well I'm afraid I'm going to respectfully disagree. I don't have the
full story of everything that happened but none of the posts you
pointed out are totally off topic. There seems to be some question
about what you have said previously about XFS, which to me would seem
on topic. I have forwarded your email to Christoph and the other
prominent members of the XFS team and if they feel further action is
warranted we can revisit the issue.
Based on what you have sent I do not feel there is grounds for
action.»
* Re: Problem about very high Average Read/Write Request Time
  2014-10-23 16:20   ` Bernd Schubert
  2014-10-23 20:09     ` Peter Grandi
@ 2014-10-23 23:01     ` Peter Grandi
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Grandi @ 2014-10-23 23:01 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

>>>> [ ... ] if the device name "/data/fhgfs/fhgfs_storage" is
>>>> descriptive, this "brave" RAID5 set is supposed to hold the
>>>> object storage layer of a BeeGFS highly parallel filesystem,
>>>> and therefore will likely have mostly-random accesses. [ ... ]

>>> Where do you get the assumption from that FhGFS/BeeGFS is
>>> going to do random reads/writes or the application on top of
>>> it is going to do that?

>> It is not a mere assumption in the general case either; it is
>> both commonly observed and a simple deduction, because of the
>> nature of distributed filesystems and in particular parallel
>> HPC ones like Lustre or BeeGFS, but also AFS and even NFS ones.

[ ... ]

>> * Clients have caches.

> Correct is: Clients *might* have caches. Besides application
> directio, for BeeGFS the cache type is a configuration option.

Perhaps you have missed the explicit qualification «in the general
case» of «distributed filesystems and in particular parallel HPC
ones», or perhaps you lack familiarity with «Lustre or BeeGFS, but
also AFS and even NFS ones», most of which have client caches, usually
enabled, and that might justify your inability to consider «the
general case».

>> Therefore most of the locality in the (read) access patterns
>> will hopefully be filtered out by the client cache. This
>> applies (ideally) to any distributed filesystem.

> You cannot filter out everything, e.g. random reads of a large
> file.
It is good but somewhat pointless that you can understand the meaning
of «most of the locality in the (read) access patterns will hopefully
be filtered out by the client cache», agree with it, and supply an
example, but unfortunately you seem to have the naive expectation
that:

> Local or remote file system does not matter here.

It can matter as:

* In the local case there is a single cache for all concurrent
  applications, while in the distributed case there is hopefully a
  separate cache per node, which segments the references (as well as
  hopefully providing a lot more cache space).

* In the purely local case there is usually just one level of
  caching; in the distributed case there are usually two levels,
  often resulting in rather different access patterns to the object
  stores in the server.

So the degree of filtering can be and often is quite different, which
is usually quite important because network transfers add a cost.

As to these three comments I am perplexed:

>> Therefore it is likely that many clients will access with
>> concurrent read and write many different files on the same
>> server, resulting in many random "hotspots" in each server's
>> load.

> If that would be important here there would be no difference
> between single write and parallel read/write. [ ... ]

>> each client could be doing entirely sequential IO to each file
>> they access, but the concurrent accesses to possibly widely
>> scattered files will turn that into random IO at the server
>> level. [ ... ]

> How does this matter if the op is comparing 1-thread write
> vs. 2-thread read/write?

>> * Eventually if the object server reaches a steady state where
>> roughly as much data is deleted as created, the free storage
>> areas will become widely scattered, leading to essentially
>> random allocation, the more random the more capacity used.

> All of that is irrelevant if a single write is fast and a
> parallel read/write is slow.
Because you seem rather confused: my explanation was the answer to
this question you asked:

>>> Where do you get the assumption from that FhGFS/BeeGFS is
>>> going to do random reads/writes or the application on top of
>>> it is going to do that?

and in it you mention no special case like «1-thread write» or
«2-thread read/write». Also such simple special cases don't happen
much in «the object storage layer» of any realistic «highly parallel
filesystem», which are often large with vast and varied workloads, as
I tried to remind you:

>> HPC/parallel servers tend to have many clients (e.g. it could
>> be 10,000 clients and 500 object storage servers) and
>> hopefully each client works on a different subset of the data
>> tree, with the distribution of data objects onto servers
>> hopefully random.

Therefore there are likely to be many dozens or even hundreds of
threads accessing objects per object store, with every pattern of read
and write, and to rather unrelated objects, not just 1 or 2 threads
and a single write or read/write.

That's one reason why XFS is so often used for those object stores: it
is particularly well suited to highly multithreaded access patterns to
many files, as XFS has benefited from quite a bit of effort in finer
grained locking, and XFS uses some mostly effective heuristics to
distribute files across the storage it uses in hopefully "best" ways.

>> The above issues are pretty much "network and distributed
>> filesystems for beginners" notes,

> It is lots of text

In my original reply I was terse and did not explain every reason why
«the object storage layer of a BeeGFS highly parallel filesystem» is
«likely to have mostly-random accesses», because I assumed it is
common knowledge among somewhat skilled readers; but to a point I am
also patient with beginners, even those who seem to become confused
about which question they themselves asked.
Also I am trying to quote context because you seem confused as to what
the content of even your own questions is.

> and does not help the op at all.

That seems unfortunately right, as to me you still seem very confused
as to the workloads likely experienced by object stores for highly
parallel filesystems, despite my efforts in trying to answer in detail
the question you asked:

>>> Where do you get the assumption from that FhGFS/BeeGFS is
>>> going to do random reads/writes or the application on top of
>>> it is going to do that?

At least, as I already pointed out, my answer to your question is at
least somewhat topical for the XFS list, for example by hinting about
using less "brave" configurations than 32-disk RAID5 sets.

> And the claim/speculation that the parallel file system would
> introduce random access is also wrong.

As far as I can see it was only you who mentioned that, because I
discussed just the consequences of the likely access patterns of the
«application on top of it» part of your question. It seemed strange to
me that you would ask why «FhGFS/BeeGFS is going to do random
reads/writes», because filesystems typically don't do reads/writes
except as a consequence of application requests, so I ignored that
other part of your question.
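The claim argued above — that per-client sequential streams become random
IO at the object server — can be illustrated with a toy model. This is
entirely illustrative: the client count, request size, and round-robin
arrival order are assumptions for the sketch, not measurements from this
thread:

```python
def server_offsets(n_clients=8, requests_each=100, req_size=4096,
                   file_gap=10**9):
    """Each client reads its own file strictly sequentially; the server
    sees the streams interleaved (modeled here as round-robin arrival)."""
    streams = [[c * file_gap + i * req_size for i in range(requests_each)]
               for c in range(n_clients)]
    order = []
    for batch in zip(*streams):   # round-robin interleaving of the streams
        order.extend(batch)
    return order

offs = server_offsets()
# count back-to-back requests that are NOT sequential on the backing store
jumps = sum(1 for a, b in zip(offs, offs[1:]) if b != a + 4096)
print(f"{jumps}/{len(offs) - 1} adjacent request pairs require a seek")
```

With even this small degree of interleaving, every adjacent pair of
requests at the server lands in a different file, even though each client
individually is perfectly sequential.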
* Re: Problem about very high Average Read/Write Request Time
  2014-10-18  9:26 Problem about very high Average Read/Write Request Time quanjun hu
  2014-10-18 12:38 ` Emmanuel Florac
@ 2014-10-19 21:16 ` Stan Hoeppner
  1 sibling, 0 replies; 13+ messages in thread
From: Stan Hoeppner @ 2014-10-19 21:16 UTC (permalink / raw)
  To: quanjun hu, xfs

On 10/18/2014 04:26 AM, quanjun hu wrote:
> Hi,
> I am using xfs on a raid 5 (~100TB) and put log on external ssd
> device, the mount information is:
> /dev/sdc on /data/fhgfs/fhgfs_storage type xfs
> (rw,relatime,attr2,delaylog,logdev=/dev/sdb1,sunit=512,swidth=15872,noquota).
> when doing only reading / only writing, the speed is very fast
> (~1.5G), but when doing both the speed is very slow (100M), with
> high r_await(160) and w_await(200000).
> 1. how can I reduce average request time?
> 2. can I use ssd as write/read cache for xfs?

You apparently have 31 effective SATA 7.2k RPM spindles with 256 KiB
chunk, 7.75 MiB stripe width, in RAID5. That should yield 3-4.6 GiB/s
of streaming throughput assuming no cable, expander, nor HBA
limitations. You're achieving only 1/3rd to 1/2 of this. Which
hardware RAID controller is this? What are the specs? Cache RAM, host
and back end cable count and type?

When you say read or write is fast individually, but read+write is
slow, what types of files are you reading and writing, and how many in
parallel? This combined pattern is likely the cause of the slowdown
due to excessive seeking in the drives.

As others mentioned this isn't an XFS problem. The problem is that
your RAID geometry doesn't match your workload. Your very wide parity
stripe is apparently causing excessive seeking with your read+write
workload due to read-modify-write operations.

To mitigate this, and to increase resiliency, you should switch to
RAID6 with a smaller chunk. If you need maximum capacity make a single
RAID6 array with 16 KiB chunk size.
This will yield a 496 KiB stripe width, increasing the odds that all
writes are a full stripe, and hopefully eliminating much of the RMW
problem.

A better option might be making three 10 drive RAID6 arrays (two
spares) with 32 KiB chunk, 256 KiB stripe width, and concatenating the
3 arrays with mdadm --linear. You'd have 24 spindles of capacity and
throughput instead of 31, but no more RMW operations, or at least very
few. You'd format the linear md device with

# mkfs.xfs -d su=32k,sw=8 /dev/mdX

As long as your file accesses are spread fairly evenly across at least
3 directories you should achieve excellent parallel throughput, though
single file streaming throughput will peak at 800-1200 MiB/s, that of
8 drives.

With a little understanding of how this setup works, you can write two
streaming files and read a third without any of the 3 competing with
one another for disk seeks/bandwidth--which is your current problem.
Or you could do one read and one write to each of 3 directories, and
no pair of two would interfere with the other pairs. Scale up from
here.

Basically what we're doing is isolating each RAID LUN into a set of
directories. When you write to one of those directories the file goes
into only one of the 3 RAID arrays. Doing this isolates RMWs for a
given write to only a subset of your disks, and minimizes the amount
of seeks generated by parallel accesses.

Cheers,
Stan
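The geometry figures quoted in this exchange can be cross-checked from the
OP's mount options (XFS sunit/swidth are counted in 512-byte sectors) and
from the proposed layouts. Note the single-RAID6 line below assumes 31 data
drives, since that is what the quoted 496 KiB figure implies:

```python
SECTOR = 512  # bytes; XFS sunit/swidth mount options count 512-byte sectors

def geometry_from_mount(sunit, swidth):
    """Derive chunk size (KiB), full-stripe width (KiB), and data-disk
    count from the sunit/swidth values shown in a mount line."""
    chunk_kib = sunit * SECTOR // 1024
    stripe_kib = swidth * SECTOR // 1024
    return chunk_kib, stripe_kib, swidth // sunit

# OP's mount line: sunit=512, swidth=15872
print(geometry_from_mount(512, 15872))  # (256, 7936, 31): 256 KiB chunk,
                                        # 7.75 MiB stripe, 31 data disks

# Proposed layouts: stripe width = data_disks * chunk_kib
print(31 * 16)  # single RAID6, 16 KiB chunk -> 496 KiB stripe
print(8 * 32)   # 10-drive RAID6 (8 data), 32 KiB chunk -> 256 KiB stripe
```

The 256 KiB stripe of the three-array layout is what makes the quoted
`mkfs.xfs -d su=32k,sw=8` line consistent: 8 data disks times a 32 KiB
stripe unit.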
end of thread, other threads: [~2014-10-25 19:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)

2014-10-18  9:26 Problem about very high Average Read/Write Request Time quanjun hu
2014-10-18 12:38 ` Emmanuel Florac
2014-10-19 10:10   ` Peter Grandi
2014-10-20  8:00     ` Bernd Schubert
2014-10-21 18:27       ` Peter Grandi
2014-10-23 16:20         ` Bernd Schubert
2014-10-23 20:09           ` Peter Grandi
2014-10-24 21:45             ` Dave Chinner
2014-10-25 11:00               ` Peter Grandi
2014-10-25 19:31                 ` Stan Hoeppner
2014-10-25 12:36               ` Peter Grandi
2014-10-23 23:01           ` Peter Grandi
2014-10-19 21:16 ` Stan Hoeppner