linux-raid.vger.kernel.org archive mirror
* Growing RAID10 with active XFS filesystem
@ 2018-01-08 19:06 mdraid.pkoch
  0 siblings, 0 replies; 34+ messages in thread
From: mdraid.pkoch @ 2018-01-08 19:06 UTC (permalink / raw)
  To: linux-raid

Dear Linux-Raid and Linux-XFS experts:

I'm posting this on both the linux-raid and linux-xfs
mailing lists as it's not clear at this point whether
this is an MD or XFS problem.

I have described my problem in a recent posting on
linux-raid and Wol's conclusion was:

 > In other words, one or more of the following three are true :-
 > 1) The OP has been caught by some random act of God
 > 2) There's a serious flaw in "mdadm --grow"
 > 3) There's a serious flaw in xfs
 >
 > Cheers,
 > Wol

There's very important data on our RAID10 device, but I doubt
it's important enough for God to take a hand in our storage.

But let me first summarize what happened and why I believe that
this is an XFS-problem:

Machine running Linux 3.14.69 with no kernel-patches.

The XFS filesystem was created with xfsprogs 3.1.11.
I did a fresh compile of xfsprogs-4.9.0 yesterday when
I realized that the 3.1.11 xfs_repair did not help.

mdadm is V3.3

/dev/md5 is a RAID10 device that was created in Feb 2013
with 10 2TB disks and an ext3 filesystem on it. Once in a
while I added two more 2TB disks. Reshaping was done
while the ext3 filesystem was mounted. Then the ext3
filesystem was unmounted, resized, and mounted again. That
worked until I grew the RAID10 from 16 to 20 disks and
realized that ext3 does not support filesystems >16TB.

I switched to XFS and created a 20TB filesystem. Here are
the details:

# xfs_info /dev/md5
meta-data=/dev/md5               isize=256    agcount=32, agsize=152608128 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=4883457280, imaxpct=5
         =                       sunit=128    swidth=1280 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Please notice: this XFS filesystem has a size of
4883457280 * 4K = 19,533,829,120K.
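A quick back-of-the-envelope check of that figure, using only the numbers from the xfs_info output above:

```shell
# Filesystem size = blocks * bsize, expressed here in KiB.
fs_blocks=4883457280   # from xfs_info: data blocks
fs_bsize=4096          # from xfs_info: bsize
echo "$(( fs_blocks * fs_bsize / 1024 ))K"   # -> 19533829120K
```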

On Saturday I tried to add two more 2TB disks to the RAID10
while the XFS filesystem was mounted (and in medium use).
Commands were:

# mdadm /dev/md5 --add /dev/sdo
# mdadm --grow /dev/md5 --raid-devices=21

# mdadm -D /dev/md5
/dev/md5:
          Version : 1.2
    Creation Time : Sun Feb 10 16:58:10 2013
       Raid Level : raid10
       Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
    Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
     Raid Devices : 21
    Total Devices : 21
      Persistence : Superblock is persistent

      Update Time : Sat Jan  6 15:08:37 2018
            State : clean, reshaping
   Active Devices : 21
  Working Devices : 21
   Failed Devices : 0
    Spare Devices : 0

           Layout : near=2
       Chunk Size : 512K

   Reshape Status : 1% complete
    Delta Devices : 1, (20->21)

             Name : backup:5  (local to host backup)
             UUID : 9030ff07:6a292a3c:26589a26:8c92a488
           Events : 86002

      Number   Major   Minor   RaidDevice State
         0       8       16        0      active sync   /dev/sdb
         1      65       48        1      active sync   /dev/sdt
         2       8       64        2      active sync   /dev/sde
         3      65       96        3      active sync   /dev/sdw
         4       8      112        4      active sync   /dev/sdh
         5      65      144        5      active sync   /dev/sdz
         6       8      160        6      active sync   /dev/sdk
         7      65      192        7      active sync   /dev/sdac
         8       8      208        8      active sync   /dev/sdn
         9      65      240        9      active sync   /dev/sdaf
        10      65        0       10      active sync   /dev/sdq
        11      66       32       11      active sync   /dev/sdai
        12       8       32       12      active sync   /dev/sdc
        13      65       64       13      active sync   /dev/sdu
        14       8       80       14      active sync   /dev/sdf
        15      65      112       15      active sync   /dev/sdx
        16       8      128       16      active sync   /dev/sdi
        17      65      160       17      active sync   /dev/sdaa
        18       8      176       18      active sync   /dev/sdl
        19      65      208       19      active sync   /dev/sdad
        20       8      224       20      active sync   /dev/sdo

Please notice: this RAID10 device has a size of 19,533,829,120K,
exactly the same size as the contained XFS filesystem.

Immediately after the RAID10 reshape operation started, the
XFS filesystem reported I/O errors and was severely damaged.
I waited for the reshape operation to finish and tried to repair
the filesystem with xfs_repair (version 3.1.11), but xfs_repair
crashed, so I tried the 4.9.0 version of xfs_repair, with no luck
either.

/dev/md5 is now mounted ro,norecovery with an overlay filesystem
on top of it (thanks very much to Andreas for that idea) and I have
set up a new server today. Rsyncing the data to the new server will
take a while and I'm sure I will stumble on lots of corrupted files.
I proceeded from XFS to ZFS (skipped YFS) so lengthy reshape
operations won't happen anymore in the future.

Here are the relevant log messages:

 > Jan  6 14:45:00 backup kernel: md: reshape of RAID array md5
 > Jan  6 14:45:00 backup kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
 > Jan  6 14:45:00 backup kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
 > Jan  6 14:45:00 backup kernel: md: using 128k window, over a total of 19533829120k.
 > Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
 > Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
 > Jan  6 14:45:00 backup kernel: XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
 > Jan  6 14:45:00 backup kernel: XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
 > ... hundreds of the above XFS-messages deleted
 > Jan  6 14:45:00 backup kernel: XFS (md5): Log I/O Error Detected.  Shutting down filesystem
 > Jan  6 14:45:00 backup kernel: XFS (md5): Please umount the filesystem and rectify the problem(s)

Please notice: there are no error messages about hardware problems.
All 21 disks are fine, and the next messages from the
md driver were:

 > Jan  7 02:28:02 backup kernel: md: md5: reshape done.
 > Jan  7 02:28:03 backup kernel: md5: detected capacity change from 20002641018880 to 21002772807680

I'm wondering about one thing: the first XFS message is about a
metadata I/O error on block 0x12c08f360. Since the XFS filesystem
has a blocksize of 4K, this block is located at position
20,135,005,568K, which is beyond the end of the RAID10 device.
No wonder the xfs driver receives an I/O error. And also no wonder
that the filesystem is severely corrupted right now.
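For what it's worth, that arithmetic can be checked in the shell. Note the assumption, same as above, that the block number reported by XFS is in 4K filesystem blocks:

```shell
bad_block=$(( 0x12c08f360 ))      # block number from the first XFS error
offset_kib=$(( bad_block * 4 ))   # position in KiB, assuming 4K blocks
array_kib=19533829120             # array size in KiB, from mdadm -D
echo "$offset_kib"                # -> 20135005568
[ "$offset_kib" -gt "$array_kib" ] && echo "beyond end of array"
```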

Question 1: How did the xfs driver know on Jan 6 that the RAID10
device was about to be increased from 20TB to 21TB on Jan 7?

Question 2: Why did the xfs driver start to use the additional
space that was not yet there, without me executing xfs_growfs?

This looks like a severe XFS-problem to me.

But my hope is that all the data that was within the filesystem
before Jan 6 14:45 is not involved in the corruption. If xfs
started to use space beyond the end of the underlying raid
device, this should have affected only data that was created,
modified or deleted after Jan 6 14:45.

If that is true we could clearly distinguish between data
that we must dump and data that we can keep. The machine is
our backup system (as you may have guessed from its name)
and I would like to keep the old backup files.

I remember that mkfs.xfs is clever enough to adapt the
filesystem parameters to the underlying hardware of the
block device that the xfs filesystem is created on. Hence,
from the xfs driver's point of view, the underlying block
device is not just a sequence of data blocks; the xfs
driver knows something about the layout of the underlying
hardware.

If that is true - how does the xfs driver react if that
information about the layout of the underlying hardware
changes while the xfs filesystem is mounted?

Seems to be an interesting problem.

Kind regards

Peter Koch


^ permalink raw reply	[flat|nested] 34+ messages in thread
* Growing RAID10 with active XFS filesystem
@ 2018-01-06 15:44 mdraid.pkoch
  2018-01-07 19:33 ` John Stoffel
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: mdraid.pkoch @ 2018-01-06 15:44 UTC (permalink / raw)
  To: linux-raid

Dear MD-experts:

I was under the impression that growing a RAID10 device could be done
with an active filesystem running on the device.

I did this a couple of times when I added additional 2TB disks to our
production RAID10 running an ext3 filesystem. That was a very
time-consuming process and we had to use the filesystem during the reshape.

When I increased the size of the RAID10 from 16 to 20 2TB disks I could
not use ext3 anymore due to the 16TB maximum size limitation of ext3,
so I replaced the ext3 filesystem with XFS.
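That 16TB ceiling follows from ext3's 32-bit block numbers combined with our 4K block size; a quick check:

```shell
# ext3 addresses blocks with 32-bit block numbers; at a 4K block size
# the maximum filesystem size is 2^32 * 4K = 16 TiB.
echo "$(( 2**32 * 4096 / 1024**4 ))TiB"   # -> 16TiB
```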

Today I increased the RAID10 again from 20 to 21 disks with the
following commands:

mdadm /dev/md5 --add /dev/sdo
mdadm --grow /dev/md5 --raid-devices=21

My plan was to add another disk after that and then grow
the XFS filesystem. I do not add multiple disks at once since
it's hard to predict which disk will end up in which disk set.

Here's mdadm -D /dev/md5 output:
/dev/md5:
         Version : 1.2
   Creation Time : Sun Feb 10 16:58:10 2013
      Raid Level : raid10
      Array Size : 19533829120 (18628.91 GiB 20002.64 GB)
   Used Dev Size : 1953382912 (1862.89 GiB 2000.26 GB)
    Raid Devices : 21
   Total Devices : 21
     Persistence : Superblock is persistent

     Update Time : Sat Jan  6 15:08:37 2018
           State : clean, reshaping
  Active Devices : 21
 Working Devices : 21
  Failed Devices : 0
   Spare Devices : 0

          Layout : near=2
      Chunk Size : 512K

  Reshape Status : 1% complete
   Delta Devices : 1, (20->21)

            Name : backup:5  (local to host backup)
            UUID : 9030ff07:6a292a3c:26589a26:8c92a488
          Events : 86002

     Number   Major   Minor   RaidDevice State
        0       8       16        0      active sync   /dev/sdb
        1      65       48        1      active sync   /dev/sdt
        2       8       64        2      active sync   /dev/sde
        3      65       96        3      active sync   /dev/sdw
        4       8      112        4      active sync   /dev/sdh
        5      65      144        5      active sync   /dev/sdz
        6       8      160        6      active sync   /dev/sdk
        7      65      192        7      active sync   /dev/sdac
        8       8      208        8      active sync   /dev/sdn
        9      65      240        9      active sync   /dev/sdaf
       10      65        0       10      active sync   /dev/sdq
       11      66       32       11      active sync   /dev/sdai
       12       8       32       12      active sync   /dev/sdc
       13      65       64       13      active sync   /dev/sdu
       14       8       80       14      active sync   /dev/sdf
       15      65      112       15      active sync   /dev/sdx
       16       8      128       16      active sync   /dev/sdi
       17      65      160       17      active sync   /dev/sdaa
       18       8      176       18      active sync   /dev/sdl
       19      65      208       19      active sync   /dev/sdad
       20       8      224       20      active sync   /dev/sdo


As you can see the array-size is still 20TB.

Just one second after starting the reshape operation
XFS failed with the following messages:

# dmesg
...
RAID10 conf printout:
  --- wd:21 rd:21
  disk 0, wo:0, o:1, dev:sdb
  disk 1, wo:0, o:1, dev:sdt
  disk 2, wo:0, o:1, dev:sde
  disk 3, wo:0, o:1, dev:sdw
  disk 4, wo:0, o:1, dev:sdh
  disk 5, wo:0, o:1, dev:sdz
  disk 6, wo:0, o:1, dev:sdk
  disk 7, wo:0, o:1, dev:sdac
  disk 8, wo:0, o:1, dev:sdn
  disk 9, wo:0, o:1, dev:sdaf
  disk 10, wo:0, o:1, dev:sdq
  disk 11, wo:0, o:1, dev:sdai
  disk 12, wo:0, o:1, dev:sdc
  disk 13, wo:0, o:1, dev:sdu
  disk 14, wo:0, o:1, dev:sdf
  disk 15, wo:0, o:1, dev:sdx
  disk 16, wo:0, o:1, dev:sdi
  disk 17, wo:0, o:1, dev:sdaa
  disk 18, wo:0, o:1, dev:sdl
  disk 19, wo:0, o:1, dev:sdad
  disk 20, wo:1, o:1, dev:sdo
md: reshape of RAID array md5
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
md: using 128k window, over a total of 19533829120k.
XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (md5): metadata I/O error: block 0x12c08f360 ("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
XFS (md5): metadata I/O error: block 0xebb62c00 ("xfs_trans_read_buf_map") error 5 numblks 16
XFS (md5): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.
...
... lots of the above messages deleted
...
XFS (md5): xfs_do_force_shutdown(0x1) called from line 138 of file fs/xfs/xfs_bmap_util.c.  Return address = 0xffffffff8113908f
XFS (md5): metadata I/O error: block 0x48c710b00 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): Log I/O Error Detected.  Shutting down filesystem
XFS (md5): Please umount the filesystem and rectify the problem(s)
XFS (md5): metadata I/O error: block 0x48c710b40 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710b80 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710bc0 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c00 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c40 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710c80 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): metadata I/O error: block 0x48c710cc0 ("xlog_iodone") error 5 numblks 64
XFS (md5): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8117cdf4
XFS (md5): I/O Error Detected. Shutting down filesystem

I did an "umount /dev/md5" and now I'm wondering what my options are:

Should I wait until the reshape has finished? I assume yes, since
stopping that operation will most likely make things worse.
Unfortunately, reshaping a 20TB RAID10 to 21TB will take about
10 hours, but it's Saturday and I have approx. 40 hours to fix the
problem before Monday morning.

Should I reduce array-size back to 20 disks?

My plans are to run xfs_check first, maybe followed by xfs_repair and
see what happens.

Any other suggestions?

Do you have an explanation why reshaping a RAID10 with a running
ext3 filesystem works while a running XFS filesystem fails during
a reshape?

How did the XFS filesystem notice that a reshape was running? I was
sure that during the reshape operation every single block of the RAID10
device could be read or written, no matter whether it belongs to the part
of the RAID that was already reshaped or not. Obviously that works
in theory only - or with ext3 filesystems only.

Or was I totally wrong with my assumption?

Much thanks in advance for any assistance.

Peter Koch



end of thread, other threads:[~2018-01-15 17:08 UTC | newest]

Thread overview: 34+ messages
     [not found] <f289da8f-96ec-7db4-abb1-b151d553c088@gmail.com>
     [not found] ` <20180108192607.GS5602@magnolia>
2018-01-08 22:01   ` Growing RAID10 with active XFS filesystem Dave Chinner
2018-01-08 23:44     ` mdraid.pkoch
2018-01-09  9:36     ` Wols Lists
2018-01-09 21:47       ` IMAP-FCC:Sent
2018-01-09 22:25       ` Dave Chinner
2018-01-09 22:32         ` Reindl Harald
2018-01-10  6:17         ` Wols Lists
2018-01-11  2:14           ` Dave Chinner
2018-01-12  2:16             ` Guoqing Jiang
2018-01-10 14:10         ` Phil Turmel
2018-01-10 21:57           ` Wols Lists
2018-01-11  3:07           ` Dave Chinner
2018-01-12 13:32             ` Wols Lists
2018-01-12 14:25               ` Emmanuel Florac
2018-01-12 17:52                 ` Wols Lists
2018-01-12 18:37                   ` Emmanuel Florac
2018-01-12 19:35                     ` Wol's lists
2018-01-13 12:30                       ` Brad Campbell
2018-01-13 13:18                         ` Wols Lists
2018-01-13  0:20                   ` Stan Hoeppner
2018-01-13 19:29                     ` Wol's lists
2018-01-13 22:40                       ` Dave Chinner
2018-01-13 23:04                         ` Wols Lists
2018-01-14 21:33                 ` Wol's lists
2018-01-15 17:08                   ` Emmanuel Florac
2018-01-08 19:06 mdraid.pkoch
  -- strict thread matches above, loose matches on Subject: below --
2018-01-06 15:44 mdraid.pkoch
2018-01-07 19:33 ` John Stoffel
2018-01-07 20:16 ` Andreas Klauer
2018-01-08  7:31 ` Guoqing Jiang
2018-01-08 15:16   ` Wols Lists
2018-01-08 15:34     ` Reindl Harald
2018-01-08 16:24     ` Wolfgang Denk
2018-01-10  1:57     ` Guoqing Jiang
