From: Ben Hutchings <ben@decadent.org.uk>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org,
alan@lxorguk.ukuu.org.uk, Boaz Harrosh <bharrosh@panasas.com>
Subject: [ 068/108] ore: Fix NFS crash by supporting any unaligned RAID IO
Date: Mon, 23 Jul 2012 02:07:59 +0100 [thread overview]
Message-ID: <20120723010701.746601273@decadent.org.uk> (raw)
In-Reply-To: <20120723010651.408577075@decadent.org.uk>
3.2-stable review patch. If anyone has any objections, please let me know.
------------------
From: Boaz Harrosh <bharrosh@panasas.com>
commit 9ff19309a9623f2963ac5a136782ea4d8b5d67fb upstream.
In RAID_5/6 We used to not permit an IO that it's end
byte is not stripe_size aligned and spans more than one stripe.
.i.e the caller must check if after submission the actual
transferred bytes is shorter, and would need to resubmit
a new IO with the remainder.
Exofs supports this, and NFS was supposed to support this
as well with it's short write mechanism. But late testing has
exposed a CRASH when this is used with none-RPC layout-drivers.
The change at NFS is deep and risky, in it's place the fix
at ORE to lift the limitation is actually clean and simple.
So here it is below.
The principal here is that in the case of unaligned IO on
both ends, beginning and end, we will send two read requests
one like old code, before the calculation of the first stripe,
and also a new site, before the calculation of the last stripe.
If any "boundary" is aligned or the complete IO is within a single
stripe. we do a single read like before.
The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1._read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute
And calling 1+3 at the same place as before. 2+3 before last
stripe, and in the case of all in a single stripe then 1+2+3
is preformed additively.
Why did I not think of it before. Well I had a strike of
genius because I have stared at this code for 2 years, and did
not find this simple solution, til today. Not that I did not try.
This solution is much better for NFS than the previous supposedly
solution because the short write was dealt with out-of-band after
IO_done, which would cause for a seeky IO pattern where as in here
we execute in order. At both solutions we do 2 separate reads, only
here we do it within a single IO request. (And actually combine two
writes into a single submission)
NFS/exofs code need not change since the ORE API communicates the new
shorter length on return, what will happen is that this case would not
occur anymore.
hurray!!
[Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
fs/exofs/ore_raid.c | 67 +++++++++++++++++++++++++++------------------------
1 file changed, 36 insertions(+), 31 deletions(-)
diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
index d222c77..fff2070 100644
--- a/fs/exofs/ore_raid.c
+++ b/fs/exofs/ore_raid.c
@@ -461,16 +461,12 @@ static void _mark_read4write_pages_uptodate(struct ore_io_state *ios, int ret)
* ios->sp2d[p][*], xor is calculated the same way. These pages are
* allocated/freed and don't go through cache
*/
-static int _read_4_write(struct ore_io_state *ios)
+static int _read_4_write_first_stripe(struct ore_io_state *ios)
{
- struct ore_io_state *ios_read;
struct ore_striping_info read_si;
struct __stripe_pages_2d *sp2d = ios->sp2d;
u64 offset = ios->si.first_stripe_start;
- u64 last_stripe_end;
- unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
- unsigned i, c, p, min_p = sp2d->pages_in_unit, max_p = -1;
- int ret;
+ unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
if (offset == ios->offset) /* Go to start collect $200 */
goto read_last_stripe;
@@ -478,6 +474,9 @@ static int _read_4_write(struct ore_io_state *ios)
min_p = _sp2d_min_pg(sp2d);
max_p = _sp2d_max_pg(sp2d);
+ ORE_DBGMSG("stripe_start=0x%llx ios->offset=0x%llx min_p=%d max_p=%d\n",
+ offset, ios->offset, min_p, max_p);
+
for (c = 0; ; c++) {
ore_calc_stripe_info(ios->layout, offset, 0, &read_si);
read_si.obj_offset += min_p * PAGE_SIZE;
@@ -512,6 +511,18 @@ static int _read_4_write(struct ore_io_state *ios)
}
read_last_stripe:
+ return 0;
+}
+
+static int _read_4_write_last_stripe(struct ore_io_state *ios)
+{
+ struct ore_striping_info read_si;
+ struct __stripe_pages_2d *sp2d = ios->sp2d;
+ u64 offset;
+ u64 last_stripe_end;
+ unsigned bytes_in_stripe = ios->si.bytes_in_stripe;
+ unsigned c, p, min_p = sp2d->pages_in_unit, max_p = -1;
+
offset = ios->offset + ios->length;
if (offset % PAGE_SIZE)
_add_to_r4w_last_page(ios, &offset);
@@ -527,15 +538,15 @@ read_last_stripe:
c = _dev_order(ios->layout->group_width * ios->layout->mirrors_p1,
ios->layout->mirrors_p1, read_si.par_dev, read_si.dev);
- BUG_ON(ios->si.first_stripe_start + bytes_in_stripe != last_stripe_end);
- /* unaligned IO must be within a single stripe */
-
if (min_p == sp2d->pages_in_unit) {
/* Didn't do it yet */
min_p = _sp2d_min_pg(sp2d);
max_p = _sp2d_max_pg(sp2d);
}
+ ORE_DBGMSG("offset=0x%llx stripe_end=0x%llx min_p=%d max_p=%d\n",
+ offset, last_stripe_end, min_p, max_p);
+
while (offset < last_stripe_end) {
struct __1_page_stripe *_1ps = &sp2d->_1p_stripes[p];
@@ -568,6 +579,15 @@ read_last_stripe:
}
read_it:
+ return 0;
+}
+
+static int _read_4_write_execute(struct ore_io_state *ios)
+{
+ struct ore_io_state *ios_read;
+ unsigned i;
+ int ret;
+
ios_read = ios->ios_read_4_write;
if (!ios_read)
return 0;
@@ -591,6 +611,8 @@ read_it:
}
_mark_read4write_pages_uptodate(ios_read, ret);
+ ore_put_io_state(ios_read);
+ ios->ios_read_4_write = NULL; /* Might need a reuse at last stripe */
return 0;
}
@@ -626,8 +648,11 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
/* If first stripe, Read in all read4write pages
* (if needed) before we calculate the first parity.
*/
- _read_4_write(ios);
+ _read_4_write_first_stripe(ios);
}
+ if (!cur_len) /* If last stripe r4w pages of last stripe */
+ _read_4_write_last_stripe(ios);
+ _read_4_write_execute(ios);
for (i = 0; i < num_pages; i++) {
pages[i] = _raid_page_alloc();
@@ -654,34 +679,14 @@ int _ore_add_parity_unit(struct ore_io_state *ios,
int _ore_post_alloc_raid_stuff(struct ore_io_state *ios)
{
- struct ore_layout *layout = ios->layout;
-
if (ios->parity_pages) {
+ struct ore_layout *layout = ios->layout;
unsigned pages_in_unit = layout->stripe_unit / PAGE_SIZE;
- unsigned stripe_size = ios->si.bytes_in_stripe;
- u64 last_stripe, first_stripe;
if (_sp2d_alloc(pages_in_unit, layout->group_width,
layout->parity, &ios->sp2d)) {
return -ENOMEM;
}
-
- /* Round io down to last full strip */
- first_stripe = div_u64(ios->offset, stripe_size);
- last_stripe = div_u64(ios->offset + ios->length, stripe_size);
-
- /* If an IO spans more then a single stripe it must end at
- * a stripe boundary. The reminder at the end is pushed into the
- * next IO.
- */
- if (last_stripe != first_stripe) {
- ios->length = last_stripe * stripe_size - ios->offset;
-
- BUG_ON(!ios->length);
- ios->nr_pages = (ios->length + PAGE_SIZE - 1) /
- PAGE_SIZE;
- ios->si.length = ios->length; /*make it consistent */
- }
}
return 0;
}
next prev parent reply other threads:[~2012-07-23 2:07 UTC|newest]
Thread overview: 116+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-23 1:06 [ 000/108] 3.2.24-stable review Ben Hutchings
2012-07-23 1:06 ` [ 001/108] samsung-laptop: make the dmi check less strict Ben Hutchings
2012-07-23 1:06 ` [ 002/108] raid5: delayed stripe fix Ben Hutchings
2012-07-23 1:06 ` [ 003/108] tcp: drop SYN+FIN messages Ben Hutchings
2012-07-23 1:06 ` [ 004/108] tg3: Apply short DMA frag workaround to 5906 Ben Hutchings
2012-07-23 1:06 ` [ 005/108] rtl8187: ->brightness_set can not sleep Ben Hutchings
2012-07-23 1:06 ` [ 006/108] net/wireless: ipw2x00: add supported cipher suites to wiphy initialization Ben Hutchings
2012-07-23 1:06 ` [ 007/108] drm/i915: do not enable RC6p on Sandy Bridge Ben Hutchings
2012-07-23 1:06 ` [ 008/108] drm/i915: fix operator precedence when enabling RC6p Ben Hutchings
2012-07-23 1:07 ` [ 009/108] kbuild: do not check for ancient modutils tools Ben Hutchings
2012-07-23 1:07 ` [ 010/108] brcmsmac: "INTERMEDIATE but not AMPDU" only when tracing Ben Hutchings
2012-07-23 1:07 ` [ 011/108] NFSv4: Rate limit the state manager for lock reclaim warning messages Ben Hutchings
2012-07-23 1:07 ` [ 012/108] ext4: Report max_batch_time option correctly Ben Hutchings
2012-07-23 1:07 ` [ 013/108] hugepages: fix use after free bug in "quota" handling Ben Hutchings
2012-07-23 1:07 ` [ 014/108] NFSv4: Reduce the footprint of the idmapper Ben Hutchings
2012-07-23 1:07 ` [ 015/108] NFSv4: Further reduce " Ben Hutchings
2012-07-23 1:07 ` [ 016/108] macvtap: zerocopy: fix offset calculation when building skb Ben Hutchings
2012-07-23 1:07 ` [ 017/108] macvtap: zerocopy: fix truesize underestimation Ben Hutchings
2012-07-23 1:07 ` [ 018/108] macvtap: zerocopy: put page when fail to get all requested user pages Ben Hutchings
2012-07-23 1:07 ` [ 019/108] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully Ben Hutchings
2012-07-23 1:07 ` [ 020/108] macvtap: zerocopy: validate vectors before building skb Ben Hutchings
2012-07-23 1:07 ` [ 021/108] KVM: Fix buffer overflow in kvm_set_irq() Ben Hutchings
2012-07-23 1:07 ` [ 022/108] scsi: Silence unnecessary warnings about ioctl to partition Ben Hutchings
2012-07-23 7:31 ` Paolo Bonzini
2012-07-23 1:07 ` [ 023/108] epoll: clear the tfile_check_list on -ELOOP Ben Hutchings
2012-07-23 1:07 ` [ 024/108] iommu/amd: Fix missing iommu_shutdown initialization in passthrough mode Ben Hutchings
2012-07-23 1:07 ` [ 025/108] iommu/amd: Initialize dma_ops for hotplug and sriov devices Ben Hutchings
2012-07-23 1:07 ` [ 026/108] usb: Add support for root hub port status CAS Ben Hutchings
2012-07-23 1:07 ` [ 027/108] gpiolib: wm8994: Pay attention to the value set when enabling as output Ben Hutchings
2012-07-23 1:07 ` [ 028/108] sched/nohz: Rewrite and fix load-avg computation -- again Ben Hutchings
2012-07-24 14:06 ` Ben Hutchings
2012-07-26 21:25 ` Peter Zijlstra
2012-07-26 22:01 ` Ben Hutchings
2012-07-26 22:02 ` Peter Zijlstra
2012-07-23 1:07 ` [ 029/108] USB: option: add ZTE MF60 Ben Hutchings
2012-07-23 1:07 ` [ 030/108] USB: option: Add MEDIATEK product ids Ben Hutchings
2012-07-23 1:07 ` [ 031/108] USB: cdc-wdm: fix lockup on error in wdm_read Ben Hutchings
2012-07-23 1:07 ` [ 032/108] mtd: nandsim: dont open code a do_div helper Ben Hutchings
2012-07-23 1:07 ` [ 033/108] [media] dvb-core: Release semaphore on error path dvb_register_device() Ben Hutchings
2012-07-23 1:07 ` [ 034/108] hwspinlock/core: use global ID to register hwspinlocks on multiple devices Ben Hutchings
2012-07-23 1:07 ` [ 035/108] [SCSI] libsas: fix taskfile corruption in sas_ata_qc_fill_rtf Ben Hutchings
2012-07-23 1:07 ` [ 036/108] md/raid1: fix use-after-free bug in RAID1 data-check code Ben Hutchings
2012-07-23 1:07 ` [ 037/108] PCI: EHCI: fix crash during suspend on ASUS computers Ben Hutchings
2012-07-23 1:07 ` [ 038/108] memory hotplug: fix invalid memory access caused by stale kswapd pointer Ben Hutchings
2012-07-23 1:07 ` [ 039/108] ocfs2: fix NULL pointer dereference in __ocfs2_change_file_space() Ben Hutchings
2012-07-23 1:07 ` [ 040/108] mm, thp: abort compaction if migration page cannot be charged to memcg Ben Hutchings
2012-07-23 1:07 ` [ 041/108] drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning Ben Hutchings
2012-07-23 1:07 ` [ 042/108] fs: ramfs: file-nommu: add SetPageUptodate() Ben Hutchings
2012-07-23 1:07 ` [ 043/108] cpufreq / ACPI: Fix not loading acpi-cpufreq driver regression Ben Hutchings
2012-07-23 1:07 ` [ 044/108] hwmon: (it87) Preserve configuration register bits on init Ben Hutchings
2012-07-23 1:07 ` [ 045/108] ARM: SAMSUNG: fix race in s3c_adc_start for ADC Ben Hutchings
2012-07-23 1:07 ` [ 046/108] block: fix infinite loop in __getblk_slow Ben Hutchings
2012-07-23 1:07 ` [ 047/108] Remove easily user-triggerable BUG from generic_setlease Ben Hutchings
2012-07-23 1:07 ` [ 048/108] NFC: Export nfc.h to userland Ben Hutchings
2012-07-23 1:07 ` [ 049/108] PM / Hibernate: Hibernate/thaw fixes/improvements Ben Hutchings
2012-07-23 1:07 ` [ 050/108] cfg80211: check iface combinations only when iface is running Ben Hutchings
2012-07-23 1:07 ` [ 051/108] intel_ips: blacklist HP ProBook laptops Ben Hutchings
2012-07-23 1:07 ` [ 052/108] atl1c: fix issue of transmit queue 0 timed out Ben Hutchings
2012-07-23 1:07 ` [ 053/108] rt2x00usb: fix indexes ordering on RX queue kick Ben Hutchings
2012-07-23 1:07 ` [ 054/108] iwlegacy: always monitor for stuck queues Ben Hutchings
2012-07-23 1:07 ` [ 055/108] iwlegacy: dont mess up the SCD when removing a key Ben Hutchings
2012-07-23 1:07 ` [ 056/108] e1000e: Correct link check logic for 82571 serdes Ben Hutchings
2012-07-23 1:07 ` [ 057/108] tcm_fc: Fix crash seen with aborts and large reads Ben Hutchings
2012-07-23 1:07 ` [ 058/108] fifo: Do not restart open() if it already found a partner Ben Hutchings
2012-07-23 1:07 ` [ 059/108] target: Clean up returning errors in PR handling code Ben Hutchings
2012-07-23 1:07 ` [ 060/108] target: Fix range calculation in WRITE SAME emulation when num blocks == 0 Ben Hutchings
2012-07-23 1:07 ` [ 061/108] cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap space Ben Hutchings
2012-07-23 1:07 ` [ 062/108] cifs: always update the inode cache with the results from a FIND_* Ben Hutchings
2012-07-23 1:07 ` [ 063/108] mm: fix lost kswapd wakeup in kswapd_stop() Ben Hutchings
2012-07-23 1:07 ` [ 064/108] md: avoid crash when stopping md array races with closing other open fds Ben Hutchings
2012-07-23 1:07 ` [ 065/108] md/raid1: close some possible races on write errors during resync Ben Hutchings
2012-07-23 1:07 ` [ 066/108] MIPS: Properly align the .data..init_task section Ben Hutchings
2012-07-23 1:07 ` [ 067/108] UBIFS: fix a bug in empty space fix-up Ben Hutchings
2012-07-23 1:07 ` Ben Hutchings [this message]
2012-07-23 1:08 ` [ 069/108] ore: Remove support of partial IO request (NFS crash) Ben Hutchings
2012-07-23 1:08 ` [ 070/108] pnfs-obj: dont leak objio_state if ore_write/read fails Ben Hutchings
2012-07-23 1:08 ` [ 071/108] pnfs-obj: Fix __r4w_get_page when offset is beyond i_size Ben Hutchings
2012-07-23 1:08 ` [ 072/108] dm raid1: fix crash with mirror recovery and discard Ben Hutchings
2012-07-23 1:08 ` [ 073/108] dm raid1: set discard_zeroes_data_unsupported Ben Hutchings
2012-07-23 1:08 ` [ 074/108] ntp: Fix leap-second hrtimer livelock Ben Hutchings
2012-07-23 1:08 ` [ 075/108] ntp: Correct TAI offset during leap second Ben Hutchings
2012-07-23 1:08 ` [ 076/108] timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond Ben Hutchings
2012-07-23 1:08 ` [ 077/108] time: Move common updates to a function Ben Hutchings
2012-07-23 1:08 ` [ 078/108] hrtimer: Provide clock_was_set_delayed() Ben Hutchings
2012-07-23 1:08 ` [ 079/108] timekeeping: Fix leapsecond triggered load spike issue Ben Hutchings
2012-07-23 1:08 ` [ 080/108] timekeeping: Maintain ktime_t based offsets for hrtimers Ben Hutchings
2012-07-23 1:08 ` [ 081/108] hrtimers: Move lock held region in hrtimer_interrupt() Ben Hutchings
2012-07-23 1:08 ` [ 082/108] timekeeping: Provide hrtimer update function Ben Hutchings
2012-07-23 1:08 ` [ 083/108] hrtimer: Update hrtimer base offsets each hrtimer_interrupt Ben Hutchings
2012-07-23 1:08 ` [ 084/108] timekeeping: Add missing update call in timekeeping_resume() Ben Hutchings
2012-07-23 1:08 ` [ 085/108] powerpc: Fix wrong divisor in usecs_to_cputime Ben Hutchings
2012-07-23 1:08 ` [ 086/108] vhost: dont forget to schedule() Ben Hutchings
2012-07-23 1:08 ` [ 087/108] r8169: call netif_napi_del at errpaths and at driver unload Ben Hutchings
2012-07-23 1:08 ` [ 088/108] bnx2x: fix checksum validation Ben Hutchings
2012-07-23 1:08 ` [ 089/108] bnx2x: fix panic when TX ring is full Ben Hutchings
2012-07-23 1:08 ` [ 090/108] net: remove skb_orphan_try() Ben Hutchings
2012-07-23 1:08 ` [ 091/108] ACPI: Make acpi_skip_timer_override cover all source_irq==0 cases Ben Hutchings
2012-07-23 1:08 ` [ 092/108] ACPI: Remove one board specific WARN when ignoring timer overriding Ben Hutchings
2012-07-23 1:08 ` [ 093/108] ACPI: Add a quirk for "AMILO PRO V2030" to ignore the " Ben Hutchings
2012-07-23 1:08 ` [ 094/108] ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI Ben Hutchings
2012-07-23 1:08 ` [ 095/108] ACPI sysfs.c strlen fix Ben Hutchings
2012-07-23 1:08 ` [ 096/108] eCryptfs: Gracefully refuse miscdev file ops on inherited/passed files Ben Hutchings
2012-07-23 1:08 ` [ 097/108] eCryptfs: Fix lockdep warning in miscdev operations Ben Hutchings
2012-07-23 1:08 ` [ 098/108] eCryptfs: Properly check for O_RDONLY flag before doing privileged open Ben Hutchings
2012-07-23 1:08 ` [ 099/108] ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification Ben Hutchings
2012-07-23 1:08 ` [ 100/108] ipheth: add support for iPad Ben Hutchings
2012-07-23 1:08 ` [ 101/108] stmmac: Fix for nfs hang on multiple reboot Ben Hutchings
2012-07-23 1:08 ` [ 102/108] bonding: debugfs and network namespaces are incompatible Ben Hutchings
2012-07-23 1:08 ` [ 103/108] bonding: Manage /proc/net/bonding/ entries from the netdev events Ben Hutchings
2012-07-23 1:08 ` [ 104/108] Input: bcm5974 - Add support for 2012 MacBook Pro Retina Ben Hutchings
2012-07-23 1:08 ` [ 105/108] Input: xpad - handle all variations of Mad Catz Beat Pad Ben Hutchings
2012-07-23 1:08 ` [ 106/108] Input: xpad - add signature for Razer Onza Tournament Edition Ben Hutchings
2012-07-23 1:08 ` [ 107/108] Input: xpad - add Andamiro Pump It Up pad Ben Hutchings
2012-07-23 1:08 ` [ 108/108] HID: add support for 2012 MacBook Pro Retina Ben Hutchings
2012-07-23 1:51 ` [ 000/108] 3.2.24-stable review Ben Hutchings
[not found] ` <CAD9gYJKwrcovmGcDJoCMAzQF=zfT2jnk9ghctejWAX1R5ifB=w@mail.gmail.com>
2012-07-30 1:37 ` Ben Hutchings
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120723010701.746601273@decadent.org.uk \
--to=ben@decadent.org.uk \
--cc=akpm@linux-foundation.org \
--cc=alan@lxorguk.ukuu.org.uk \
--cc=bharrosh@panasas.com \
--cc=linux-kernel@vger.kernel.org \
--cc=stable@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox