From: Boaz Harrosh <bharrosh@panasas.com>
To: Trond Myklebust <Trond.Myklebust@netapp.com>,
NFS list <linux-nfs@vger.kernel.org>,
open-osd <osd-dev@open-osd.org>
Cc: Benny Halevy <bhalevy@tonian.com>, Peng Tao <bergwolf@gmail.com>
Subject: [PATCH 1/6] ore: Fix NFS crash by supporting any unaligned RAID IO
Date: Fri, 8 Jun 2012 15:06:05 +0300 [thread overview]
Message-ID: <4FD1EAAD.5010505@panasas.com> (raw)
In-Reply-To: <4FD1CEA8.3090105@panasas.com>
In RAID1/4/5/6/ We used to not permit an IO that it's end
byte is not stripe_size aligned and spans more than one stripe.
.i.e the caller must check if after submission the actual
transferred bytes was shorter, he would need to resubmit
an IO with the remainder.
Exofs supports this, and NFS was supposed to support this
as well with it's short write mechanism. But late testing has
exposed a CRASH when this is used with none-RPC layout-drivers.
The change at NFS is deep and risky, in it's place the fix
at ORE to lift the limitation is actually clean and simple.
So here it is below.
The principal here is that in the case of unaligned IO on
both ends, beginning and end, we will send two read requests
one like old code, before the calculation of the first stripe,
and also a new site, before the calculation of the last stripe.
If any "boundary" is aligned or the complete IO is within a single
stripe we do a single read like before.
The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1. _read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute
And calling 1+3 at the same place as before. 2+3 before last
stripe, and in the case of all in a single stripe then 1+2+3
is preformed additively.
Why did I not think of it before. Well I had a strike of
genius because I have stared at this code for 2 years, and did
not find this simple solution, til today. Not that I did not try.
This solution is much better for NFS than the previous supposedly
solution because the short write was dealt with out-of-band after
IO_done, which would cause for a seeky IO pattern where as in here
we execute in order. At both solutions we do 2 separate reads, only
here we do it within a single IO request. (And actually combine two
writes into a single submission)
NFS/exofs code need not change since the ORE API communicates the new
shorter length on return, what will happen is that this case would not
occur anymore.
hurray!!
[Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
fs/exofs/ore_raid.c | 61 ++++++++++++++++++++++++++---------------------------
1 file changed, 30 insertions(+), 31 deletions(-)
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index d222c77..80ae08e 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -461,16 +461,12 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
* ios->sp2d[p][*], xor is calculated the same way. These pages are
* allocated/freed and don't go through cache
*/
-static int _read_4_write(struct ore_io_state *ios)
+static int _read_4_write_first_stripe(struct ore_io_state *ios)
{
- struct ore_io_state *ios_read;
struct ore_striping_info read_si;
struct __stripe_pages_2d *sp2d = ios->sp2d;
u64 offset = ios->si.first_stripe_start;
- u64 last_stripe_end;
- unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
- unsigned i, c, p, min_p = sp2d->pages_in_unit, max_p = -1;
- int ret;
+ unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
if (offset == ios->offset) /* Go to start collect $200 */
goto read_last_stripe;
@@ -512,6 +508,18 @@ static int _read_4_write(struct ore_io_state *ios)
}
read_last_stripe:
+ return 0;
+}
+
+static int _read_4_write_last_stripe(struct ore_io_state *ios)
+{
+ struct ore_striping_info read_si;
+ struct __stripe_pages_2d *sp2d = ios->sp2d;
+ u64 offset;
+ u64 last_stripe_end;
+ unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
+ unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
+
offset = ios->offset + ios->length;
if (offset % PAGE_SIZE)
_add_to_r4w_last_page(ios, &offset);
@@ -527,9 +535,6 @@ static int _read_4_write(struct ore_io_state *ios)
c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
ios->layout->mirrors_p1, read_si.par_dev, read_si.dev);
- BUG_ON(ios->si.first_stripe_start + bytes_in_stripe != last_stripe_end);
- /* unaligned IO must be within a single stripe */
-
if (min_p == sp2d->pages_in_unit) {
/* Didn't do it yet */
min_p = _sp2d_min_pg(sp2d);
@@ -568,6 +573,15 @@ static int _read_4_write(struct ore_io_state *ios)
}
read_it:
+ return 0;
+}
+
+static int _read_4_write_execute(struct ore_io_state *ios)
+{
+ struct ore_io_state *ios_read;
+ unsigned i;
+ int ret;
+
ios_read = ios->ios_read_4_write;
if (!ios_read)
return 0;
@@ -591,6 +605,8 @@ static int _read_4_write(struct ore_io_state *ios)
}
_mark_read4write_pages_uptodate(ios_read, ret);
+ ore_put_io_state(ios_read);
+ ios->ios_read_4_write = NULL; /* Might need a reuse at last stripe */
return 0;
}
@@ -626,8 +642,11 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
/* If first stripe, Read in all read4write pages
* (if needed) before we calculate the first parity.
*/
- _read_4_write(ios);
+ _read_4_write_first_stripe(ios);
}
+ if (!cur_len) /* If last stripe r4w pages of last stripe */
+ _read_4_write_last_stripe(ios);
+ _read_4_write_execute(ios);
for (i = 0; i < num_pages; i++) {
pages[i] = _raid_page_alloc();
@@ -654,34 +673,14 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
{
- struct ore_layout *layout = ios->layout;
-
if (ios->parity_pages) {
+ struct ore_layout *layout = ios->layout;
unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
- unsigned stripe_size = ios->si.bytes_in_stripe;
- u64 last_stripe, first_stripe;
if (_sp2d_alloc(pages_in_unit, layout->group_width,
layout->parity, &ios->sp2d)) {
return -ENOMEM;
}
-
- /* Round io down to last full strip */
- first_stripe = div_u64(ios->offset, stripe_size);
- last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-
- /* If an IO spans more then a single stripe it must end at
- * a stripe boundary. The reminder at the end is pushed into the
- * next IO.
- */
- if (last_stripe != first_stripe) {
- ios->length = last_stripe * stripe_size - ios->offset;
-
- BUG_ON(!ios->length);
- ios->nr_pages = (ios->length + PAGE_SIZE - 1) /
- PAGE_SIZE;
- ios->si.length = ios->length; /*make it consistent */
- }
}
return 0;
}
--
1.7.10.2.677.gb6bc67f
next prev parent reply other threads:[~2012-06-08 12:06 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-06-08 10:06 [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare() Boaz Harrosh
2012-06-08 12:06 ` Boaz Harrosh [this message]
2012-06-08 12:09 ` [PATCH 2/6] ore: Remove support of partial IO request (NFS crash) Boaz Harrosh
2012-06-08 12:13 ` [PATCH 3/6] pnfs-obj: Fix __r4w_get_page when offset is beyond i_size Boaz Harrosh
2012-06-13 13:53 ` Peng Tao
2012-06-08 12:14 ` [PATCH 4/6] pnfs-obj: don't leak objio_state if ore_write/read fails Boaz Harrosh
2012-06-08 12:15 ` [PATCH 5/6] NFS41: add pg_layout_private to nfs_pageio_descriptor Boaz Harrosh
2012-06-08 12:21 ` [PATCH 6/6] pnfs-obj: Better IO pattern in case of unaligned offset Boaz Harrosh
2012-06-13 13:57 ` [RFC 0/5] CRASHFIX: pnfs-obj: NONE-RPC LD must not call rpc_restart_call_prepare() Peng Tao
2012-07-12 14:11 ` Boaz Harrosh
2012-07-18 23:07 ` Myklebust, Trond
2012-07-19 9:48 ` Boaz Harrosh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FD1EAAD.5010505@panasas.com \
--to=bharrosh@panasas.com \
--cc=Trond.Myklebust@netapp.com \
--cc=bergwolf@gmail.com \
--cc=bhalevy@tonian.com \
--cc=linux-nfs@vger.kernel.org \
--cc=osd-dev@open-osd.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).