* Not being able to recover a RAID 5 20 Tb partition, help needed
@ 2014-01-29 13:39 Juan A. Sillero
2014-01-29 14:00 ` Eric Sandeen
2014-01-29 15:24 ` Roger Willcocks
0 siblings, 2 replies; 11+ messages in thread
From: Juan A. Sillero @ 2014-01-29 13:39 UTC (permalink / raw)
To: xfs
Hello,
We are a pioneering group in turbulence research at the Polytechnic
University of Madrid (torroja.dmt.upm.es), running simulations at
supercomputing centers around the world and hosting massive, publicly
accessible datasets in our data center.
Apparently we have lost one of our 20 TB partitions because of an
under-voltage error in a power supply combined with a disk failure. We
are trying to fix it, but it does not look good so far.
The system is set up as follows:
XFS 6.1
RAID 5 (12 x 2 disks of 2 TB each)
Dual disk controllers managed by device-mapper.
What we know at this point is the following:
1) The topology of the disk is lost: the master boot record (MBR) and
the GPT are corrupted.
2) After running the TestDisk utility we find 9 partitions instead of 1.
3) With gdisk we have tried to create a new master boot record and GPT,
but it has not worked.
4) We know that the block size is 4096 bytes, and the current capacity
of the RAID is under 20 TB, so we suspect that even though the disk
manager says the RAID is OK, it has not rebuilt the array after the
disk failure.
We are stuck at this point, and any help bringing the partition back up
would be really appreciated. We will be pleased to acknowledge the
group in upcoming publications.
Thanks again, Juan A. Sillero
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 13:39 Not being able to recover a RAID 5 20 Tb partition, help needed Juan A. Sillero
@ 2014-01-29 14:00 ` Eric Sandeen
2014-01-29 14:07 ` Roger Willcocks
2014-01-29 15:24 ` Roger Willcocks
1 sibling, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2014-01-29 14:00 UTC (permalink / raw)
To: sillero, xfs
On 1/29/14, 7:39 AM, Juan A. Sillero wrote:
> Hello,
>
> We are a pioneering group in turbulence research at the Polytechnic
> University of Madrid (torroja.dmt.upm.es), running simulations at
> supercomputing centers around the world and hosting massive, publicly
> accessible datasets in our data center.
>
> Apparently we have lost one of our 20 TB partitions because of an
> under-voltage error in a power supply combined with a disk failure. We
> are trying to fix it, but it does not look good so far.
You lost only 1 disk?
So what exactly happened - what have you encountered, and what have
you done to try to fix it so far?
General rule - do NOT run xfs_repair in modification mode (i.e. the
default, without "-n"), or perform any other writes to the storage,
until you know the storage is properly re-assembled.
> The system is set up as follows:
> XFS 6.1
Is this some variant of RHEL6.1? XFS doesn't have version numbers
like that. (Probably not IRIX 6.1?) :)
> RAID 5 (12 x 2 disks of 2 TB each)
> Dual disk controllers managed by device-mapper.
>
> What we know at this point is the following:
>
> 1) The topology of the disk is lost: the master boot record (MBR) and
> the GPT are corrupted.
Which makes me think that the raid is possibly not in good shape.
> 2) After running the TestDisk utility we find 9 partitions instead of 1.
> 3) With gdisk we have tried to create a new master boot record and
> GPT, but it has not worked.
Darn, so it sounds like you've already written to the storage.
> 4) We know that the block size is 4096 bytes, and the current
> capacity of the RAID is under 20 TB, so we suspect that even though
> the disk manager says the RAID is OK, it has not rebuilt the array
> after the disk failure.
I think you are right.
> We are stuck at this point, and any help bringing the partition back
> up would be really appreciated. We will be pleased to acknowledge the
> group in upcoming publications.
Unfortunately it doesn't sound like an XFS problem at this point,
but rather a storage problem. It's probably worth reaching out to
the device mapper people first, perhaps they can help make sure
the raid is properly reassembled. Then we can see about picking up
the XFS pieces if any are left.
-Eric
> Thanks again, Juan A. Sillero
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 14:00 ` Eric Sandeen
@ 2014-01-29 14:07 ` Roger Willcocks
2014-01-29 14:18 ` Eric Sandeen
0 siblings, 1 reply; 11+ messages in thread
From: Roger Willcocks @ 2014-01-29 14:07 UTC (permalink / raw)
To: Eric Sandeen; +Cc: sillero, xfs
On Wed, 2014-01-29 at 08:00 -0600, Eric Sandeen wrote:
> > The system is set up as follows:
> > XFS 6.1
>
> Is this some variant of RHEL6.1? XFS doesn't have version numbers
> like that. (Probably not IRIX 6.1?) :)
See xfs_sb.h -

#define XFS_SB_MAGIC     0x58465342  /* 'XFSB' */
#define XFS_SB_VERSION_1 1  /* 5.3, 6.0.1, 6.1 */
#define XFS_SB_VERSION_2 2  /* 6.2 - attributes */
#define XFS_SB_VERSION_3 3  /* 6.2 - new inode version */
#define XFS_SB_VERSION_4 4  /* 6.2+ - bitmask version */
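For illustration, those version bits can be read straight off a raw
superblock dump. A minimal Python sketch of my own (not an XFS tool),
assuming the classic on-disk layout where sb_magicnum is a big-endian
u32 at offset 0 and sb_versionnum a big-endian u16 at offset 100, with
the version number proper in the low 4 bits:

```python
import struct

XFS_SB_MAGIC = 0x58465342  # 'XFSB', as in xfs_sb.h

def decode_sb_version(sb_bytes):
    """Return (magic_ok, version) for a raw XFS superblock buffer."""
    magic, = struct.unpack_from(">I", sb_bytes, 0)         # sb_magicnum
    versionnum, = struct.unpack_from(">H", sb_bytes, 100)  # sb_versionnum
    return magic == XFS_SB_MAGIC, versionnum & 0x000F
```

Run against the first 512 bytes of the volume (e.g. saved with dd),
this would tell you whether a primary superblock survived and which
version it claims to be.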
>
--
Roger Willcocks <roger@filmlight.ltd.uk>
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 14:07 ` Roger Willcocks
@ 2014-01-29 14:18 ` Eric Sandeen
0 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2014-01-29 14:18 UTC (permalink / raw)
To: Roger Willcocks; +Cc: sillero@torroja.dmt.upm.es, xfs@oss.sgi.com
Those are ancient comments from IRIX versions. If he's using device mapper I suppose it is Linux after all. :)
Eric
> On Jan 29, 2014, at 8:07 AM, Roger Willcocks <roger@filmlight.ltd.uk> wrote:
>
>
> On Wed, 2014-01-29 at 08:00 -0600, Eric Sandeen wrote:
>
>>> The system is set up as follows:
>>> XFS 6.1
>>
>> Is this some variant of RHEL6.1? XFS doesn't have version numbers
>> like that. (Probably not IRIX 6.1?) :)
>
> See xfs_sb.h -
>
> #define XFS_SB_MAGIC     0x58465342  /* 'XFSB' */
> #define XFS_SB_VERSION_1 1  /* 5.3, 6.0.1, 6.1 */
> #define XFS_SB_VERSION_2 2  /* 6.2 - attributes */
> #define XFS_SB_VERSION_3 3  /* 6.2 - new inode version */
> #define XFS_SB_VERSION_4 4  /* 6.2+ - bitmask version */
>
>
> --
> Roger Willcocks <roger@filmlight.ltd.uk>
>
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 13:39 Not being able to recover a RAID 5 20 Tb partition, help needed Juan A. Sillero
2014-01-29 14:00 ` Eric Sandeen
@ 2014-01-29 15:24 ` Roger Willcocks
2014-01-29 16:54 ` Juan A. Sillero
1 sibling, 1 reply; 11+ messages in thread
From: Roger Willcocks @ 2014-01-29 15:24 UTC (permalink / raw)
To: sillero; +Cc: xfs
On Wed, 2014-01-29 at 14:39 +0100, Juan A. Sillero wrote:
> Hello,
>
> We are a pioneering group in turbulence research at the Polytechnic
> University of Madrid (torroja.dmt.upm.es), running simulations at
> supercomputing centers around the world and hosting massive, publicly
> accessible datasets in our data center.
>
> Apparently we have lost one of our 20 TB partitions because of an
> under-voltage error in a power supply combined with a disk failure. We
> are trying to fix it, but it does not look good so far.
>
> The system is set up as follows:
> XFS 6.1
> RAID 5 (12 x 2 disks of 2 TB each)
> Dual disk controllers managed by device-mapper.
>
> What we know at this point is the following:
>
> 1) The topology of the disk is lost: the master boot record (MBR)
> and the GPT are corrupted.
Are you sure the array had a master boot record and GPT? What makes
you think they are corrupted?
> 2) After running the TestDisk utility we find 9 partitions instead of 1.
There is a good chance that the discovered partitions are backup XFS
superblocks; their location and content may allow you to figure out the
array topology.
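Locating those candidates deliberately is straightforward. A minimal
Python sketch of my own (hypothetical helper; run it only against a
read-only image, never the live device) that records block-aligned
offsets carrying the XFS magic:

```python
XFS_SB_MAGIC = b"XFSB"  # 0x58465342, the XFS superblock magic

def find_superblocks(path, block_size=4096):
    """Return byte offsets of blocks that start with the XFS magic."""
    hits = []
    with open(path, "rb") as img:
        offset = 0
        while True:
            block = img.read(block_size)
            if len(block) < 4:
                break
            if block[:4] == XFS_SB_MAGIC:
                hits.append(offset)
            offset += block_size
    return hits
```

On an intact filesystem the hits fall at regular multiples of the
allocation-group size; irregular spacing here would itself suggest the
stripes were reassembled in the wrong order.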
> 3) With gdisk we have tried to create a new master boot record and
> GPT, but it has not worked.
This was a bad idea. Do you have a copy of the original data for these
sectors?
> 4) We know that the block size is 4096 bytes, and the current
> capacity of the RAID is under 20 TB, so we suspect that even though
> the disk manager says the RAID is OK, it has not rebuilt the array
> after the disk failure.
>
> We are stuck at this point, and any help bringing the partition back
> up would be really appreciated. We will be pleased to acknowledge the
> group in upcoming publications.
>
> Thanks again, Juan A. Sillero
>
--
Roger Willcocks <roger@filmlight.ltd.uk>
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 15:24 ` Roger Willcocks
@ 2014-01-29 16:54 ` Juan A. Sillero
2014-01-29 17:42 ` Emmanuel Florac
0 siblings, 1 reply; 11+ messages in thread
From: Juan A. Sillero @ 2014-01-29 16:54 UTC (permalink / raw)
To: Roger Willcocks, guillem; +Cc: xfs
Thanks for your comments and ideas.
I'd like to add some more information about our crash to improve the
discussion.
XFS is not to blame; it is a hardware problem. The RAID controller
thinks it has recovered the volume, but it has not.
We used gdisk to check whether there were any partitions at all; it
reported that the MBR was blank and the GPT was corrupt. gdisk detected
a layout exactly like the one on the other volumes in the same storage
array (we have eight other, almost identical volumes there). gdisk
rewrote the MBR and GPT, and we had a partition back again, but no XFS
identifier at all. We backed up the original state of the disk before
gdisk applied the changes.
We later used TestDisk, which reports 11 different partitions at
sectors that make no sense to us. It also says the volume is smaller
than it should be.
We are now dd'ing the complete volume to larger storage in case we need
to get more serious with data recovery.
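A side note on the imaging step: plain dd aborts on the first
unreadable sector, so GNU ddrescue is the usual choice for suspect
media. The idea it implements can be sketched roughly in Python (a toy
illustration of my own, not a replacement for ddrescue):

```python
def image_volume(src, dst, chunk=1 << 20, fill=b"\x00"):
    """Copy src to dst chunk by chunk; pad unreadable chunks with
    fill bytes instead of aborting, and report their offsets."""
    bad = []
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        offset = 0
        while True:
            fin.seek(offset)
            try:
                data = fin.read(chunk)
            except OSError:        # unreadable region: pad and move on
                bad.append(offset)
                data = fill * chunk
            if not data:
                break
            fout.write(data)
            offset += len(data)
    return bad                     # offsets worth retrying later
```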
Our conclusion at this point is that the RAID 5 is probably scrambled
because the controller did not actually recover the volume. Of course,
this has destroyed the file system.
We'd like to know your opinion about what to do next. The data is still
probably on the disks, but the RAID topology is gone. We'd also like to
know if someone has experienced a similar hardware problem that could
give us some advice.
Thanks.
PS: Please keep Guillem in copy.
On Wed, 2014-01-29 at 15:24 +0000, Roger Willcocks wrote:
> On Wed, 2014-01-29 at 14:39 +0100, Juan A. Sillero wrote:
> > Hello,
> >
> > We are a pioneering group in turbulence research at the Polytechnic
> > University of Madrid (torroja.dmt.upm.es), running simulations at
> > supercomputing centers around the world and hosting massive,
> > publicly accessible datasets in our data center.
> >
> > Apparently we have lost one of our 20 TB partitions because of an
> > under-voltage error in a power supply combined with a disk failure.
> > We are trying to fix it, but it does not look good so far.
> >
> > The system is set up as follows:
> > XFS 6.1
> > RAID 5 (12 x 2 disks of 2 TB each)
> > Dual disk controllers managed by device-mapper.
> >
> > What we know at this point is the following:
> >
> > 1) The topology of the disk is lost: the master boot record (MBR)
> > and the GPT are corrupted.
>
>
> Are you sure the array had a master boot record and GPT? What makes
> you think they are corrupted?
>
> > 2) After running the TestDisk utility we find 9 partitions instead of 1.
>
> There is a good chance that the discovered partitions are backup XFS
> superblocks; their location and content may allow you to figure out the
> array topology.
>
>
> > 3) With gdisk we have tried to create a new master boot record and
> > GPT, but it has not worked.
>
> This was a bad idea. Do you have a copy of the original data for
> these sectors?
>
> > 4) We know that the block size is 4096 bytes, and the current
> > capacity of the RAID is under 20 TB, so we suspect that even though
> > the disk manager says the RAID is OK, it has not rebuilt the array
> > after the disk failure.
> >
> > We are stuck at this point, and any help bringing the partition
> > back up would be really appreciated. We will be pleased to
> > acknowledge the group in upcoming publications.
> >
> > Thanks again, Juan A. Sillero
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 16:54 ` Juan A. Sillero
@ 2014-01-29 17:42 ` Emmanuel Florac
2014-01-31 19:31 ` Juan A. Sillero
0 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Florac @ 2014-01-29 17:42 UTC (permalink / raw)
To: sillero; +Cc: xfs, Roger Willcocks, guillem
On Wed, 29 Jan 2014 17:54:08 +0100,
"Juan A. Sillero" <sillero@torroja.dmt.upm.es> wrote:
> We'd like to know your opinion about what to do next. The data is
> still probably on the disks, but the RAID topology is gone. We'd
> also like to know if someone has experienced a similar hardware
> problem that could give us some advice.
I thought it was a dmraid RAID 5 device? If it's a hardware RAID
controller, what brand/model is it?
At this point you could give UFS Explorer a try.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-29 17:42 ` Emmanuel Florac
@ 2014-01-31 19:31 ` Juan A. Sillero
2014-01-31 21:37 ` Emmanuel Florac
0 siblings, 1 reply; 11+ messages in thread
From: Juan A. Sillero @ 2014-01-31 19:31 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs, Roger Willcocks, Guillem Borrell i Nogueras
Hello Emmanuel,
The disk controllers are QLogic:
04:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
04:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
05:00.0 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 03)
05:00.1 Fibre Channel: QLogic Corp. ISP2432-based 4Gb Fibre Channel to PCI Express HBA (rev 03)
06:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
Thanks again,
Juan
On Jan 29, 2014, at 6:42 PM, Emmanuel Florac <eflorac@intellique.com> wrote:
> On Wed, 29 Jan 2014 17:54:08 +0100,
> "Juan A. Sillero" <sillero@torroja.dmt.upm.es> wrote:
>
>> We'd like to know your opinion about what to do next. The data is
>> still probably on the disks, but the RAID topology is gone. We'd
>> also like to know if someone has experienced a similar hardware
>> problem that could give us some advice.
>
> I thought it was a dmraid RAID 5 device? If it's a hardware RAID
> controller, what brand/model is it?
>
> At this point you could give UFS Explorer a try.
>
> --
> ------------------------------------------------------------------------
> Emmanuel Florac | Direction technique
> | Intellique
> | <eflorac@intellique.com>
> | +33 1 78 94 84 02
> ------------------------------------------------------------------------
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-31 19:31 ` Juan A. Sillero
@ 2014-01-31 21:37 ` Emmanuel Florac
2014-01-31 21:50 ` Juan Antonio Sillero Sepulveda
0 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Florac @ 2014-01-31 21:37 UTC (permalink / raw)
To: Juan A. Sillero; +Cc: xfs, Roger Willcocks, Guillem Borrell i Nogueras
On Fri, 31 Jan 2014 20:31:25 +0100, you wrote:
> The disk controllers are QLogic:
>
> 04:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel
> to PCI Express HBA (rev 02)
So it's just an FC HBA, not a RAID controller, so there isn't much to
do there. As I said previously, your best bet at this point is to try
UFS Explorer first.
Is your data made of standard file formats (e.g. JPEG images)? Or is
it something more specialized?
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-31 21:37 ` Emmanuel Florac
@ 2014-01-31 21:50 ` Juan Antonio Sillero Sepulveda
2014-02-01 9:56 ` Emmanuel Florac
0 siblings, 1 reply; 11+ messages in thread
From: Juan Antonio Sillero Sepulveda @ 2014-01-31 21:50 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs, Roger Willcocks, Guillem Borrell i Nogueras
Sorry,
Guillem, my colleague and expert, corrected me. This is the hardware:
Disk /dev/sdq - 20 TB / 18 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdr - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sds - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdt - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdu - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdv - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdw - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdx - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdy - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
Disk /dev/sdz - 20 TB / 18 TiB - DotHill R/Evo 5730-2R
We store everything, basically: images, Matlab files, txt, dat, PDF,
binary files, HDF5 files, .mat, Fortran files, etc.
On Jan 31, 2014, at 10:37 PM, Emmanuel Florac <eflorac@intellique.com> wrote:
> On Fri, 31 Jan 2014 20:31:25 +0100, you wrote:
>
>> The disk controllers are QLogic:
>>
>> 04:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel
>> to PCI Express HBA (rev 02)
>
>
> So it's just an FC HBA, not a RAID controller, so there isn't much to
> do there. As I said previously, your best bet at this point is to try
> UFS Explorer first.
>
> Is your data made of standard file formats (e.g. JPEG images)? Or is
> it something more specialized?
>
> --
> ------------------------------------------------------------------------
> Emmanuel Florac | Direction technique
> | Intellique
> | <eflorac@intellique.com>
> | +33 1 78 94 84 02
> ------------------------------------------------------------------------
* Re: Not being able to recover a RAID 5 20 Tb partition, help needed
2014-01-31 21:50 ` Juan Antonio Sillero Sepulveda
@ 2014-02-01 9:56 ` Emmanuel Florac
0 siblings, 0 replies; 11+ messages in thread
From: Emmanuel Florac @ 2014-02-01 9:56 UTC (permalink / raw)
To: Juan Antonio Sillero Sepulveda
Cc: xfs, Roger Willcocks, Guillem Borrell i Nogueras
On Fri, 31 Jan 2014 22:50:51 +0100, you wrote:
> Guillem, my colleague and expert, corrected me. This is the hardware:
>
> Disk /dev/sdq - 20 TB / 18 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdr - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sds - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdt - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdu - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdv - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdw - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdx - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdy - 22 TB / 20 TiB - DotHill R/Evo 5730-2R
> Disk /dev/sdz - 20 TB / 18 TiB - DotHill R/Evo 5730-2R
This is fairly recent hardware, so it's probably still supported and
maintained. You should definitely check with DotHill support.
> We store everything, basically: images, Matlab files, txt, dat, PDF,
> binary files, HDF5 files, .mat, Fortran files, etc.
PhotoRec can recognize many common file formats, but not all. I've
successfully restored several terabytes with it; however, you won't
get any metadata (file names, directory structure) back.
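The core of signature carving is easy to sketch. A hypothetical Python
fragment of my own (a handful of signatures only, where PhotoRec knows
hundreds of formats and also tracks footers) that reports candidate
start offsets for a few of the formats mentioned above:

```python
# Magic-number signatures (leading bytes); a tiny illustrative subset.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
    "hdf5": b"\x89HDF\r\n\x1a\n",
}

def carve_offsets(path):
    """Map each known signature name to its byte offsets in the image."""
    with open(path, "rb") as img:
        data = img.read()  # fine for a sketch; stream in chunks for 20 TB
    hits = {}
    for name, sig in SIGNATURES.items():
        found, start = [], 0
        while (pos := data.find(sig, start)) != -1:
            found.append(pos)
            start = pos + 1
        hits[name] = found
    return hits
```

As noted above, this recovers content but not names, paths, or
timestamps, because that metadata lived in the now-damaged filesystem
structures.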
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
end of thread, other threads:[~2014-02-01 9:56 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-29 13:39 Not being able to recover a RAID 5 20 Tb partition, help needed Juan A. Sillero
2014-01-29 14:00 ` Eric Sandeen
2014-01-29 14:07 ` Roger Willcocks
2014-01-29 14:18 ` Eric Sandeen
2014-01-29 15:24 ` Roger Willcocks
2014-01-29 16:54 ` Juan A. Sillero
2014-01-29 17:42 ` Emmanuel Florac
2014-01-31 19:31 ` Juan A. Sillero
2014-01-31 21:37 ` Emmanuel Florac
2014-01-31 21:50 ` Juan Antonio Sillero Sepulveda
2014-02-01 9:56 ` Emmanuel Florac