* XFS errors on large Infiniband fileserver setup
@ 2010-09-23 7:22 Christian Herzog
2010-09-23 10:01 ` Christian Herzog
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Christian Herzog @ 2010-09-23 7:22 UTC (permalink / raw)
To: xfs, isg
Dear all,
we (Physics Dept. at ETH Zurich) are trying to set up a large file
server combo (two disk backends connected to a frontend by Infiniband,
all running Ubuntu 10.04) and keep getting XFS internal error
xfs_da_do_buf(2) messages when copying large amounts of data, resulting
in 'structure needs cleaning' warnings. We have tried a lot of
different kernels, iSCSI implementations, LVM configurations, whatnot,
but these errors persist. The setup right now looks like this:
2 disk backends, each: Quad-Xeon X5550, 12G of RAM, 28T HW SATA-RAID6
sliced into 2T chunks by LVM2 and exported via tgt 1.0.0-2, Ubuntu
10.04 LTS, connected via Mellanox MHRH19B-XTR Infiniband + ISER to
1 frontend Octo-Xeon E5520, 12G of RAM, open-iscsi 2.0.871 initiator,
Ubuntu 10.04 LTS. LVM2 stitches together the 2T iSCSI LUNs and provides
a 10T test XFS filesystem
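[Editor's note: for readers unfamiliar with this kind of setup, the stitching step might look roughly like the sketch below on the frontend. Device paths, volume names, and sizes are illustrative assumptions, not the actual configuration; the commands are destructive and assume the iSCSI sessions are already logged in.]

```shell
# Assumed device names for the iSCSI-backed 2T LUNs as seen by open-iscsi.
# Initialize each LUN as an LVM physical volume:
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Group them into one volume group:
vgcreate vgastro /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Carve out a 10T logical volume and put XFS on it:
lvcreate -L 10T -n lvastro vgastro
mkfs.xfs /dev/vgastro/lvastro
mount /dev/vgastro/lvastro /export/astrodata
```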
right now we're performing stress tests and when copying large amounts
of data to the XFS filesystem, at some point we get
Filesystem "dm-3": XFS internal error xfs_da_do_buf(2) at line 2113 of
file /home/kernel-ppa/mainline/build/fs/xfs/xfs_da_btree.c. Caller
0xffffffffa0299a1a
This can be provoked by running a 'du' or 'find' while writing the
data.
on the frontend and XFS reports 'structure needs cleaning'. The
following modifications have been suggested and we're working on them
right now:
- try w/o ISER (direct IB over TCP)
- try an XFS filesystem < 2T
- try RHEL or SLES (will take more time)
We already had to change the I/O scheduler from Deadline to CFQ in
order to get it up and running at all, and we also tried changing the
kernel from the stock LTS kernel to 2.6.34-020634-generic, but we
still get the FS errors.
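[Editor's note: when XFS reports 'structure needs cleaning', a read-only consistency check shows how much damage was actually done before attempting a repair. A minimal sketch, using the device and mount point from this setup; it must run on an unmounted filesystem:]

```shell
# Unmount first; xfs_repair refuses to run on a mounted filesystem.
umount /export/astrodata

# -n = no-modify mode: report inconsistencies without writing anything.
xfs_repair -n /dev/mapper/vgastro-lvastro

# Only after the transport problem is fixed should a real repair pass
# be run (the same command without -n).
```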
root@phd-san-gw1:~# xfs_info /export/astrodata
meta-data=/dev/mapper/vgastro-lvastro isize=256    agcount=10, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2684354550, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
we're slowly but surely running out of ideas here. Needless to say the
system should have been deployed quite some time ago. Any help would be
greatly appreciated. We're also happy to provide any further
information that might be useful.
thanks a lot and kind regards,
-Christian
--
Dr. Christian Herzog <herzog@phys.ethz.ch> support: +41 44 633 26 68
IT Services Group, HPT D 17 voice: +41 44 633 39 50
Department of Physics, ETH Zurich
8093 Zurich, Switzerland http://nic.phys.ethz.ch
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 7:22 XFS errors on large Infiniband fileserver setup Christian Herzog
@ 2010-09-23 10:01 ` Christian Herzog
2010-09-23 10:38 ` Christoph Hellwig
2010-09-23 23:53 ` Dave Chinner
2010-09-27 5:49 ` Christian Herzog
2 siblings, 1 reply; 12+ messages in thread
From: Christian Herzog @ 2010-09-23 10:01 UTC (permalink / raw)
To: xfs
Dear list,
my colleague has been running tests without iSER all morning, and so
far we haven't encountered an error, even though we have copied more
than twice as much data as in the earlier tests. Is it possible that a
problem in the transport layer remains undetected and only manifests
itself in the filesystem?
thanks again,
-Christian
> following modifications have been suggested and we're working on them
> right now:
>
> - try w/o ISER (direct IB over TCP)
> - try an XFS filesystem < 2T
> - try RHEL or SLES (will take more time)
>
> We already had to change the I/O scheduler from Deadline to CFQ in
> order to get it up and running at all and also tried to change the
> kernel from stock LTS to 2.6.34-020634-generic, but we still get the
> FS errors.
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 10:01 ` Christian Herzog
@ 2010-09-23 10:38 ` Christoph Hellwig
2010-09-23 11:52 ` Christian Herzog
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2010-09-23 10:38 UTC (permalink / raw)
To: Christian Herzog; +Cc: xfs
On Thu, Sep 23, 2010 at 12:01:55PM +0200, Christian Herzog wrote:
>
> Dear list,
>
> my colleague has been running tests without ISER all morning, and so
> far we haven't encountered an error, even though we have copied > twice
> as much data as for the other tests. Is it possible that a problem in
> the transport layer remains undetected and only manifests itself in the
> filesystem?
Yes, that could very well be possible. While I'm not an expert on
iSER, it looks like it doesn't use the software crc32c checksums that
iSCSI over TCP uses, but instead relies on the RDMA protocols to
guarantee data integrity.
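[Editor's note: when falling back to plain iSCSI over TCP, the crc32c digests mentioned here can be enabled explicitly on the open-iscsi initiator so transport corruption is caught at the iSCSI layer. A sketch; the target IQN and portal address are placeholders:]

```shell
# Enable CRC32C header and data digests on an existing node record.
# (IQN and portal are placeholders; substitute your own target.)
iscsiadm -m node -T iqn.2010-09.ch.ethz.phys:astro0 -p 192.168.1.10 \
         -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C
iscsiadm -m node -T iqn.2010-09.ch.ethz.phys:astro0 -p 192.168.1.10 \
         -o update -n node.conn[0].iscsi.DataDigest -v CRC32C

# Log out and back in for the new settings to take effect:
iscsiadm -m node -T iqn.2010-09.ch.ethz.phys:astro0 -p 192.168.1.10 --logout
iscsiadm -m node -T iqn.2010-09.ch.ethz.phys:astro0 -p 192.168.1.10 --login
```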
Btw, what target do you use? According to
http://www.spinics.net/lists/linux-stgt/msg02038.html it seems like the
stgt iser target has some issues that look quite similar to yours.
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 10:38 ` Christoph Hellwig
@ 2010-09-23 11:52 ` Christian Herzog
2010-09-23 13:54 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: Christian Herzog @ 2010-09-23 11:52 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
> Yes, that could very well be possible. While I'm not an expert on
> iSER, it looks like it doesn't use the software crc32c checksums that
> iSCSI over TCP uses, but instead relies on the RDMA protocols to
> guarantee data integrity.
that would explain quite a lot actually. Thanks for the hint.
> Btw, what target do you use? According to
> http://www.spinics.net/lists/linux-stgt/msg02038.html it seems like the
> stgt iser target has some issues that look quite similar to yours.
we're using tgt which might have a similar problem.
As even IB over TCP is several times faster than our disks, we will now
proceed w/o iSER to have the system running and then experiment with
iSER on a test system.
thanks,
-Christian
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 11:52 ` Christian Herzog
@ 2010-09-23 13:54 ` Christoph Hellwig
2010-09-23 14:10 ` Christian Herzog
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2010-09-23 13:54 UTC (permalink / raw)
To: Christian Herzog; +Cc: xfs
On Thu, Sep 23, 2010 at 01:52:46PM +0200, Christian Herzog wrote:
> > Btw, what target do you use? According to
> > http://www.spinics.net/lists/linux-stgt/msg02038.html it seems like the
> > stgt iser target has some issues that look quite similar to yours.
> we're using tgt which might have a similar problem.
tgt and stgt are actually one and the same - the name is used a little
inconsistently.
> As even IB over TCP is several times faster than our disks, we will now
> proceed w/o iSER to have the system running and then experiment with
> iSER on a test system.
At least for NFS using RDMA natively mostly affects CPU usage and not
throughput, so if you have some CPU cycles to spare it might not make
much of a difference.
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 13:54 ` Christoph Hellwig
@ 2010-09-23 14:10 ` Christian Herzog
0 siblings, 0 replies; 12+ messages in thread
From: Christian Herzog @ 2010-09-23 14:10 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
> On Thu, Sep 23, 2010 at 01:52:46PM +0200, Christian Herzog wrote:
>> > Btw, what target do you use? According to
>> > http://www.spinics.net/lists/linux-stgt/msg02038.html it seems like the
>> > stgt iser target has some issues that look quite similar to yours.
>> we're using tgt which might have a similar problem.
>
> tgt and stgt are actually one and the same - the name is used a
> little inconsistently.
oh *blush*
well, I guess this might be our issue then...
>> As even IB over TCP is several times faster than our disks, we will now
>> proceed w/o iSER to have the system running and then experiment with
>> iSER on a test system.
>
> At least for NFS using RDMA natively mostly affects CPU usage and not
> throughput, so if you have some CPU cycles to spare it might not make
> much of a difference.
even in our hardcore torture tests the system load is pretty
reasonable, so I guess we're good.
thanks,
-Christian
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 7:22 XFS errors on large Infiniband fileserver setup Christian Herzog
2010-09-23 10:01 ` Christian Herzog
@ 2010-09-23 23:53 ` Dave Chinner
2010-09-24 5:41 ` Christian Herzog
2010-09-24 13:19 ` Emmanuel Florac
2010-09-27 5:49 ` Christian Herzog
2 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2010-09-23 23:53 UTC (permalink / raw)
To: Christian Herzog; +Cc: isg, xfs
On Thu, Sep 23, 2010 at 09:22:29AM +0200, Christian Herzog wrote:
>
> Dear all,
>
> we (Physics Dept. at ETH Zurich) are trying to set up a large file
> server combo (two disk backends connected to a frontend by
> Infiniband, all running Ubuntu 10.04) and keep getting XFS internal
> error xfs_da_do_buf(2) messages when copying large amounts of data,
> resulting in 'structure needs cleaning' warnings. We have tried a
> lot of different kernels, iSCSI implementations, LVM configurations,
> whatnot, but these errors persist. The setup right now looks like
> this:
>
> 2 disk backends, each: Quad-Xeon X5550, 12G of RAM, 28T HW
> SATA-RAID6 sliced into 2T chunks by LVM2 and exported via tgt
> 1.0.0-2, Ubuntu 10.04 LTS, connected via Mellanox MHRH19B-XTR
> Infiniband + ISER to
>
> 1 frontend Octo-Xeon E5520, 12G of RAM, open-iscsi 2.0.871
> initiator, Ubuntu 10.04 LTS. LVM2 stitches together the
> 2T iSCSI LUNs and provides a 10T test XFS filesystem
Out of curiosity, why are you using such a complex storage
configuration?
IMO, it is unnecessarily complex - you could easily do this (~30
drives) with a single server with a couple of external SAS JBOD
arrays and SAS RAID controllers. That would give you the same
performance (or better), with many fewer points of failure (both
hardware and software), use less rack space, and probably be
significantly cheaper....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 23:53 ` Dave Chinner
@ 2010-09-24 5:41 ` Christian Herzog
2010-09-24 8:56 ` Stan Hoeppner
2010-09-24 13:19 ` Emmanuel Florac
2010-09-24 13:19 ` Emmanuel Florac
1 sibling, 2 replies; 12+ messages in thread
From: Christian Herzog @ 2010-09-24 5:41 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Hi Dave,
thanks for your feedback.
>> 2 disk backends, each: Quad-Xeon X5550, 12G of RAM, 28T HW
>> SATA-RAID6 sliced into 2T chunks by LVM2 and exported via tgt
>> 1.0.0-2, Ubuntu 10.04 LTS, connected via Mellanox MHRH19B-XTR
>> Infiniband + ISER to
>>
>> 1 frontend Octo-Xeon E5520, 12G of RAM, open-iscsi 2.0.871
>> initiator, Ubuntu 10.04 LTS. LVM2 stitches together the
>> 2T iSCSI LUNs and provides a 10T test XFS filesystem
>
> Out of curiosity, why are you using such a complex storage
> configuration?
>
> IMO, it is unnecessarily complex - you could easily do this (~30
> drives) with a single server with a couple of external SAS JBOD
> arrays and SAS RAID controllers. That would give you the same
> performance (or better), with many fewer points of failure (both
> hardware and software), use less rack space, and probably be
> significantly cheaper....
basically, our situation is this: we have to supply our astrophysicists
(not just them, but they consume 95%) with large and ever-increasing
amounts of disk space. Up to now we bought individual file servers
whenever space was needed, which is an administrative nightmare as you
can imagine. Hence we decided to come up with a more scalable solution
that would grow with the space needed - and grow it will. We start off
with 52T and can easily add additional disk units to the Infiniband
switch.
It is quite possible that we have overlooked an easier/cheaper
solution, but what we have now is very flexible and has emerged from
discussions we've had with several 'storage experts'. Do you have any
particular/typical device in mind? I'd like to check it out
nonetheless.
thanks,
-Christian
* Re: XFS errors on large Infiniband fileserver setup
2010-09-24 5:41 ` Christian Herzog
@ 2010-09-24 8:56 ` Stan Hoeppner
2010-09-24 13:19 ` Emmanuel Florac
1 sibling, 0 replies; 12+ messages in thread
From: Stan Hoeppner @ 2010-09-24 8:56 UTC (permalink / raw)
To: xfs
Christian Herzog put forth on 9/24/2010 12:41 AM:
> Do you have any particular/typical
> device in mind? I'd like to check it out nonetheless.
Almost totally ignoring your current hardware investment and Infiniband
back end...
I recommend the following for performance, storage density and total
storage, ease of configuration and management, reliability, and cost:
http://www.nexstor.co.uk/products/3/13/29/526/Disk_Storage/Nexsan/Nexsan_Storage/Nexsan_SATABeast
http://www.nexstor.co.uk/products/3/13/29/3537/Disk_Storage/Nexsan/Nexsan_Storage/Nexsan_60_Disks_in_4U_-_Beast_Expansion_Unit
Using 2TB drives, the Nexsan SATABeast with two dual port 8Gbit FC
controllers combined with the NXS-B60E expansion chassis offers a total
of 204TB in only 8U of rack space with an advertised sustained host data
rate of 1.2GB/s using both controllers.
If your bandwidth needs outweigh your capacity needs, and 1.2GB/s is too
low for a total storage back end, simply acquire multiple SATABeasts and
forgo the NXS-B60E expansion box. Using 2 Qlogic QLE2564 x8 PCIe Quad
port 8Gbit FC HBAs in your front end server would allow multipath
redundant connection to one FC port on each controller of 4 SATABeast
units. This would yield an advertised sustained aggregate data rate of
4.8GB/s between the front end server and 336TB of storage across 168
disks in 16U total rack space.
If you have an FC convergence card in your Mellanox IB switch with 4-8
FC ports, you could forgo the HBAs in the front end server and simply
jack the SATABeast(s) directly into the IB fabric. This would
definitely increase configuration complexity. I've never done it so I'd
be of no help. However, it would allow you to assign LUNs on the
SATABeasts directly to any hosts on the IB network, assuming all the
necessary software is installed and configurable on said hosts enabling
their IB HBAs to present the SATABeast LUNs as SCSI devices to Linux.
As far as configuring FC zones within an IB fabric to make the LUNs
visible to the HBAs, I'll leave that to you, as I've never done that
either. Zero IB experience here, only FC. ;)
I'm making a somewhat educated guess that a fully configured SATABeast
with dual controllers and 42x2TB disks should be attainable for around
$50K USD today. If 1.2GB/s sustained is enough performance, from a cost
and rack footprint perspective, the 8Gbit SATABeast with the NXS-B60E 60
drive expansion box is really hard to beat--204TB in only 8U, in the
ballpark of $80K USD. If my math is correct, that's around $400 USD per
terabyte. I'm guessing 1TB of similar performance EMC storage is
probably at least 4 times that.
Disclaimer: I don't work for Nexsan, Qlogic, nor any reseller. I'm
simply a satisfied customer of both.
--
Stan
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 23:53 ` Dave Chinner
2010-09-24 5:41 ` Christian Herzog
@ 2010-09-24 13:19 ` Emmanuel Florac
1 sibling, 0 replies; 12+ messages in thread
From: Emmanuel Florac @ 2010-09-24 13:19 UTC (permalink / raw)
Cc: Christian Herzog, isg, xfs
On Fri, 24 Sep 2010 09:53:55 +1000,
Dave Chinner <david@fromorbit.com> wrote:
> IMO, it is unnecessarily complex - you could easily do this (~30
> drives) with a single server with a couple of external SAS JBOD
> arrays and SAS RAID controllers. That would give you the same
> performance (or better), with many fewer points of failure (both
> hardware and software), use less rack space, and probably be
> significantly cheaper....
And it will use less power, too. Actually there are Supermicro chassis
with up to 40 drives in 4U, so you don't even need an external disk
enclosure.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS errors on large Infiniband fileserver setup
2010-09-24 5:41 ` Christian Herzog
2010-09-24 8:56 ` Stan Hoeppner
@ 2010-09-24 13:19 ` Emmanuel Florac
1 sibling, 0 replies; 12+ messages in thread
From: Emmanuel Florac @ 2010-09-24 13:19 UTC (permalink / raw)
To: Christian Herzog; +Cc: xfs
On Fri, 24 Sep 2010 07:41:53 +0200,
Christian Herzog <horeizo@phys.ethz.ch> wrote:
> We start off
> with 52T and can easily add additional disk units to the Infiniband
> switch.
I don't know whether iSCSI over Infiniband is optimal here. The
problem is that you plan to expand your existing XFS filesystem by
leaps and bounds up to some extremely large size through a simple
RAID-0-like aggregation, and that's really fragile.
It's a configuration that's supposed to work well with something like
ZFS, though in real-life setups (no, I won't tell who recently sent
back an entire 2 PB cluster to Sun, but it happened in 2010 :) large
raid-z iSCSI clusters aren't so great :)
For similar setups, I have used PVFS2, aggregating 40 TB nodes. PVFS2
is known to scale up to petabytes and (contrary to XFS over RAID-0) is
extremely tolerant of node failure (though it is not redundant): if a
node crashes, cluster I/O may freeze (though write activity can
usually go on), but it restarts instantly when the failed node is
revived.
However, PVFS isn't made for general-purpose file sharing (though it
works with both Samba and NFS); it really flies when applications are
properly set up for it (MPI-IO). It's tailored for scientific work
and heavy computation clusters.
In contrast with Lustre, PVFS2 is very easy to set up, and very easy
to extend if you planned it from the start (Lustre is a fantastic PITA
to set up and administer, and don't even talk about NFS sharing).
So I would set up a storage cluster this way: each storage node is a
PVFS server, and the PVFS data resides on an XFS filesystem
(officially recommended by the PVFS developers anyway).
You can expand the PVFS filesystem either by enlarging the XFS
filesystems on the storage nodes or by adding new independent storage
nodes.
Each storage node can be a PVFS client too, and use its computing power
to crunch data.
As I said, the main question is how you plan to make the space
available to client systems. You can use NFS/CIFS for ease of use,
desktop access, etc., but performance will be low. Native PVFS
performance over Infiniband, however, can be huge (in the
several-GB/s range), and the more storage nodes you add, the more
performance you get.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: XFS errors on large Infiniband fileserver setup
2010-09-23 7:22 XFS errors on large Infiniband fileserver setup Christian Herzog
2010-09-23 10:01 ` Christian Herzog
2010-09-23 23:53 ` Dave Chinner
@ 2010-09-27 5:49 ` Christian Herzog
2 siblings, 0 replies; 12+ messages in thread
From: Christian Herzog @ 2010-09-27 5:49 UTC (permalink / raw)
To: xfs
Dear list,
thanks a lot for the many helpful comments. We will now reassess our
setup and also talk to the supplier. We really should have thought of
joining this list before buying the hardware, but hey.
thanks and kind regards,
-Christian