Linux RAID subsystem development
 help / color / mirror / Atom feed
* Fwd: (user) Help needed: mdadm seems to constantly touch my disks
From: Jure Erznožnik @ 2016-12-15  7:01 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <CAJ=9zieRuTNiEGuB_RouqbdLGoxNkn09yiogR6rND84LtMdbxA@mail.gmail.com>

Thanks for helping Neil. I have run the suggested utilities and here
are my findings:

It is always [kworker/x:yy] (x:yy changes somewhat) or [0].
A few lines from one of the outputs:

  9,0    3        0     0.061577998     0  m   N raid5 rcw 3758609392 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336
[kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336
[kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336
[kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336
[kworker/3:38]
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
  9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
  9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1

LP,
Jure

On Wed, Dec 14, 2016 at 2:15 AM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Dec 13 2016, Jure Erznožnik wrote:
>
>> First of all, I apologise if this mail list is not intended for layman
>> help, but this is what I am and I couldn't get an explanation
>> elsewhere.
>>
>> My problem is that (as it seems) mdadm is touching HDD superblocks
>> once per second, once at address 8 (sectors), next at address 16.
>> Total traffic is kilobytes per second, writes only, no other
>> detectable traffic.
>>
>> I have detailed the problem here:
>> http://unix.stackexchange.com/questions/329477/
>>
>> Shortened:
>> kubuntu 16.10 4.8.0-30-generic #32, mdadm v3.4 2016-01-28
>> My configuration: 4 spinning platters (/dev/sd[cdef]) assembled into a
>> raid5 array, then bcache set to cache (hopefully) everything
>> (cache_mode = writeback, sequential_cutoff = 0). On top of bcache
>> volume I have set up lvm.
>>
>> * iostat shows traffic on sd[cdef] and md0
>> * iotop shows no traffic
>> * iosnoop shows COMM=[idle, md0_raid5, kworker] as processes working
>> on the disk. Blocks reported are 8, 16 (data size a few KB) and
>> 18446744073709500000 (data size 0). That last one must be some virtual
>> thingie as the disks are nowhere near that large.
>> * enabling block_dump shows md0_raid5 process writing to block 8 (1
>> sectors) and 16 (8 sectors)
>>
>> This touching is caused by any write into the array and goes on for
>> quite a while after the write has been done (a couple of hours for
>> 60GB of writes). When services actually work with the array, this
>> becomes pretty much constant.
>>
>> What am I observing and is there any way of stopping it?
>
> Start with the uppermost layer which has I/O that you cannot explain.
> Presumably that is md0.
> Run 'blktrace' on that device for a little while, then 'blkparse' to
> look at the results.
>
>  blktrace -w 10 md0
>  blkparse *blktrace*
>
> It will give the name of the process that initiated the request in [] at
> the end of some lines.
>
> NeilBrown

^ permalink raw reply

* [PATCH 5/8] linux: drop __bitwise__ everywhere
From: Michael S. Tsirkin @ 2016-12-15  5:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Emmanuel Grumbach, Mike Snitzer, virtualization, linux-mm,
	dm-devel, target-devel, Luca Coelho, Alasdair Kergon,
	linux-samsung-soc, James E.J. Bottomley, linux-scsi,
	Stefan Schmidt, Russell King, Krzysztof Kozlowski,
	Javier Martinez Canillas, Kukjin Kim, linux-arm-kernel,
	Jiri Slaby, open-iscsi, Shaohua Li, Johannes Berg,
	Intel Linux Wireless, Alexander Aring, linux-raid,
	Kalle Valo <kvalo>
In-Reply-To: <1481778865-27667-1-git-send-email-mst@redhat.com>

__bitwise__ used to mean "yes, please enable sparse checks
unconditionally", but now that we dropped __CHECK_ENDIAN__
__bitwise is exactly the same.
There aren't many users, replace it by __bitwise everywhere.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/arm/plat-samsung/include/plat/gpio-cfg.h    | 2 +-
 drivers/md/dm-cache-block-types.h                | 6 +++---
 drivers/net/ethernet/sun/sunhme.h                | 2 +-
 drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h | 4 ++--
 include/linux/mmzone.h                           | 2 +-
 include/linux/serial_core.h                      | 4 ++--
 include/linux/types.h                            | 4 ++--
 include/scsi/iscsi_proto.h                       | 2 +-
 include/target/target_core_base.h                | 2 +-
 include/uapi/linux/virtio_types.h                | 6 +++---
 net/ieee802154/6lowpan/6lowpan_i.h               | 2 +-
 net/mac80211/ieee80211_i.h                       | 4 ++--
 12 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/arm/plat-samsung/include/plat/gpio-cfg.h b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
index 21391fa..e55d1f5 100644
--- a/arch/arm/plat-samsung/include/plat/gpio-cfg.h
+++ b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
@@ -26,7 +26,7 @@
 
 #include <linux/types.h>
 
-typedef unsigned int __bitwise__ samsung_gpio_pull_t;
+typedef unsigned int __bitwise samsung_gpio_pull_t;
 
 /* forward declaration if gpio-core.h hasn't been included */
 struct samsung_gpio_chip;
diff --git a/drivers/md/dm-cache-block-types.h b/drivers/md/dm-cache-block-types.h
index bed4ad4..389c9e8 100644
--- a/drivers/md/dm-cache-block-types.h
+++ b/drivers/md/dm-cache-block-types.h
@@ -17,9 +17,9 @@
  * discard bitset.
  */
 
-typedef dm_block_t __bitwise__ dm_oblock_t;
-typedef uint32_t __bitwise__ dm_cblock_t;
-typedef dm_block_t __bitwise__ dm_dblock_t;
+typedef dm_block_t __bitwise dm_oblock_t;
+typedef uint32_t __bitwise dm_cblock_t;
+typedef dm_block_t __bitwise dm_dblock_t;
 
 static inline dm_oblock_t to_oblock(dm_block_t b)
 {
diff --git a/drivers/net/ethernet/sun/sunhme.h b/drivers/net/ethernet/sun/sunhme.h
index f430765..4a8d5b1 100644
--- a/drivers/net/ethernet/sun/sunhme.h
+++ b/drivers/net/ethernet/sun/sunhme.h
@@ -302,7 +302,7 @@
  * Always write the address first before setting the ownership
  * bits to avoid races with the hardware scanning the ring.
  */
-typedef u32 __bitwise__ hme32;
+typedef u32 __bitwise hme32;
 
 struct happy_meal_rxd {
 	hme32 rx_flags;
diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h b/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
index 1ad0ec1..84813b5 100644
--- a/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
+++ b/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
@@ -228,7 +228,7 @@ enum iwl_ucode_tlv_flag {
 	IWL_UCODE_TLV_FLAGS_BCAST_FILTERING	= BIT(29),
 };
 
-typedef unsigned int __bitwise__ iwl_ucode_tlv_api_t;
+typedef unsigned int __bitwise iwl_ucode_tlv_api_t;
 
 /**
  * enum iwl_ucode_tlv_api - ucode api
@@ -258,7 +258,7 @@ enum iwl_ucode_tlv_api {
 #endif
 };
 
-typedef unsigned int __bitwise__ iwl_ucode_tlv_capa_t;
+typedef unsigned int __bitwise iwl_ucode_tlv_capa_t;
 
 /**
  * enum iwl_ucode_tlv_capa - ucode capabilities
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3..36d9896 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -246,7 +246,7 @@ struct lruvec {
 #define ISOLATE_UNEVICTABLE	((__force isolate_mode_t)0x8)
 
 /* LRU Isolation modes. */
-typedef unsigned __bitwise__ isolate_mode_t;
+typedef unsigned __bitwise isolate_mode_t;
 
 enum zone_watermarks {
 	WMARK_MIN,
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index 5d49488..5def8e8 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -111,8 +111,8 @@ struct uart_icount {
 	__u32	buf_overrun;
 };
 
-typedef unsigned int __bitwise__ upf_t;
-typedef unsigned int __bitwise__ upstat_t;
+typedef unsigned int __bitwise upf_t;
+typedef unsigned int __bitwise upstat_t;
 
 struct uart_port {
 	spinlock_t		lock;			/* port lock */
diff --git a/include/linux/types.h b/include/linux/types.h
index baf7183..d501ad3 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -154,8 +154,8 @@ typedef u64 dma_addr_t;
 typedef u32 dma_addr_t;
 #endif
 
-typedef unsigned __bitwise__ gfp_t;
-typedef unsigned __bitwise__ fmode_t;
+typedef unsigned __bitwise gfp_t;
+typedef unsigned __bitwise fmode_t;
 
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 typedef u64 phys_addr_t;
diff --git a/include/scsi/iscsi_proto.h b/include/scsi/iscsi_proto.h
index c1260d8..df156f1 100644
--- a/include/scsi/iscsi_proto.h
+++ b/include/scsi/iscsi_proto.h
@@ -74,7 +74,7 @@ static inline int iscsi_sna_gte(u32 n1, u32 n2)
 #define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
 
 /* initiator tags; opaque for target */
-typedef uint32_t __bitwise__ itt_t;
+typedef uint32_t __bitwise itt_t;
 /* below makes sense only for initiator that created this tag */
 #define build_itt(itt, age) ((__force itt_t)\
 	((itt) | ((age) << ISCSI_AGE_SHIFT)))
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index c211900..0055828 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -149,7 +149,7 @@ enum se_cmd_flags_table {
  * Used by transport_send_check_condition_and_sense()
  * to signal which ASC/ASCQ sense payload should be built.
  */
-typedef unsigned __bitwise__ sense_reason_t;
+typedef unsigned __bitwise sense_reason_t;
 
 enum tcm_sense_reason_table {
 #define R(x)	(__force sense_reason_t )(x)
diff --git a/include/uapi/linux/virtio_types.h b/include/uapi/linux/virtio_types.h
index e845e8c..55c3b73 100644
--- a/include/uapi/linux/virtio_types.h
+++ b/include/uapi/linux/virtio_types.h
@@ -39,8 +39,8 @@
  * - __le{16,32,64} for standard-compliant virtio devices
  */
 
-typedef __u16 __bitwise__ __virtio16;
-typedef __u32 __bitwise__ __virtio32;
-typedef __u64 __bitwise__ __virtio64;
+typedef __u16 __bitwise __virtio16;
+typedef __u32 __bitwise __virtio32;
+typedef __u64 __bitwise __virtio64;
 
 #endif /* _UAPI_LINUX_VIRTIO_TYPES_H */
diff --git a/net/ieee802154/6lowpan/6lowpan_i.h b/net/ieee802154/6lowpan/6lowpan_i.h
index 5ac7789..ac7c96b 100644
--- a/net/ieee802154/6lowpan/6lowpan_i.h
+++ b/net/ieee802154/6lowpan/6lowpan_i.h
@@ -7,7 +7,7 @@
 #include <net/inet_frag.h>
 #include <net/6lowpan.h>
 
-typedef unsigned __bitwise__ lowpan_rx_result;
+typedef unsigned __bitwise lowpan_rx_result;
 #define RX_CONTINUE		((__force lowpan_rx_result) 0u)
 #define RX_DROP_UNUSABLE	((__force lowpan_rx_result) 1u)
 #define RX_DROP			((__force lowpan_rx_result) 2u)
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index d37a577..b2069fb 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -159,7 +159,7 @@ enum ieee80211_bss_valid_data_flags {
 	IEEE80211_BSS_VALID_ERP			= BIT(3)
 };
 
-typedef unsigned __bitwise__ ieee80211_tx_result;
+typedef unsigned __bitwise ieee80211_tx_result;
 #define TX_CONTINUE	((__force ieee80211_tx_result) 0u)
 #define TX_DROP		((__force ieee80211_tx_result) 1u)
 #define TX_QUEUED	((__force ieee80211_tx_result) 2u)
@@ -180,7 +180,7 @@ struct ieee80211_tx_data {
 };
 
 
-typedef unsigned __bitwise__ ieee80211_rx_result;
+typedef unsigned __bitwise ieee80211_rx_result;
 #define RX_CONTINUE		((__force ieee80211_rx_result) 0u)
 #define RX_DROP_UNUSABLE	((__force ieee80211_rx_result) 1u)
 #define RX_DROP_MONITOR		((__force ieee80211_rx_result) 2u)
-- 
MST

^ permalink raw reply related

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-15  3:20 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffEnvn4YQ5ADFXrOm2vOeEwApo38V84wB7VF3D1eZihF-Zw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2300 bytes --]


> Hi Neil,
>
> I found a old mail thread below
> http://www.spinics.net/lists/raid/msg52792.html
>
> Likely Alex is trying to fix same bug, right?
> in one reply you suggested to modify the call in make_request
>
> @@ -1207,7 +1207,8 @@ read_again:
>                                 sectors_handled;
>                         goto read_again;
>                 } else
> -                       generic_make_request(read_bio);
> +                       reschedule_retry(r1_bio);
>                 return;
>         }
>
>
> I append above change, it looks fix the bug, I've run same tests over
> one hour,  no hung task anymore.
>
> Do you think this is right fix? Do we still need the change you
> suggested with punt_bios_to_rescuer?

I don't really like that fix.  I suspect it would probably hurt
performance.

I'd prefer to fix generic_make_request() to process queued requests in a
more sensible order.
Can you please try the following (with all other patches removed)?
Thanks,
NeilBrown

diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c0740dc0..3436b6fc3ef8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list hold;
+			struct bio_list lower, same;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
+
 			bio = bio_list_pop(current->bio_list);
 		} else {
 			struct bio *bio_next = bio_list_pop(current->bio_list);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Doug Anderson @ 2016-12-15  0:55 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Alasdair Kergon, Shaohua Li, Dmitry Torokhov,
	linux-kernel@vger.kernel.org, linux-raid, dm-devel,
	David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <CAD=FV=WUEkuDNktZ1ShYZnbsCMBwjkTBAEX7UJU3UML6Nvy_+Q@mail.gmail.com>

Hi,

On Wed, Dec 14, 2016 at 4:53 PM, Doug Anderson <dianders@chromium.org> wrote:
> Hi,
>
> On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>> Hi
>>
>> The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
>> allocation doesn't. Your patch would incorrectly reuse buffers in a
>> situation when the memory is filled with clean cached pages.
>>
>> Here I'm proposing an alternate patch that first tries GFP_NOWAIT
>> allocation, then drops the lock and tries GFP_NOIO allocation.
>>
>> Note that the root cause why you are seeing this stacktrace is, that your
>> block device is congested - i.e. there are too many requests in the
>> device's queue - and note that fixing this wait won't fix the root cause
>> (congested device).
>>
>> The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
>> 13/16 size of the nr_requests value.
>>
>> If you don't want your device to report the congested status, you can
>> increase /sys/block/<device>/queue/nr_requests - you should test if your
>> chromebook is faster of slower with this setting increased. But note that
>> this setting won't increase the IO-per-second of the device.
>>
>> Mikulas
>>
>>
>> On Thu, 17 Nov 2016, Douglas Anderson wrote:
>>
>>> We've seen in-field reports showing _lots_ (18 in one case, 41 in
>>> another) of tasks all sitting there blocked on:
>>>
>>>   mutex_lock+0x4c/0x68
>>>   dm_bufio_shrink_count+0x38/0x78
>>>   shrink_slab.part.54.constprop.65+0x100/0x464
>>>   shrink_zone+0xa8/0x198
>>>
>>> In the two cases analyzed, we see one task that looks like this:
>>>
>>>   Workqueue: kverityd verity_prefetch_io
>>>
>>>   __switch_to+0x9c/0xa8
>>>   __schedule+0x440/0x6d8
>>>   schedule+0x94/0xb4
>>>   schedule_timeout+0x204/0x27c
>>>   schedule_timeout_uninterruptible+0x44/0x50
>>>   wait_iff_congested+0x9c/0x1f0
>>>   shrink_inactive_list+0x3a0/0x4cc
>>>   shrink_lruvec+0x418/0x5cc
>>>   shrink_zone+0x88/0x198
>>>   try_to_free_pages+0x51c/0x588
>>>   __alloc_pages_nodemask+0x648/0xa88
>>>   __get_free_pages+0x34/0x7c
>>>   alloc_buffer+0xa4/0x144
>>>   __bufio_new+0x84/0x278
>>>   dm_bufio_prefetch+0x9c/0x154
>>>   verity_prefetch_io+0xe8/0x10c
>>>   process_one_work+0x240/0x424
>>>   worker_thread+0x2fc/0x424
>>>   kthread+0x10c/0x114
>>>
>>> ...and that looks to be the one holding the mutex.
>>>
>>> The problem has been reproduced on fairly easily:
>>> 0. Be running Chrome OS w/ verity enabled on the root filesystem
>>> 1. Pick test patch: http://crosreview.com/412360
>>> 2. Install launchBalloons.sh and balloon.arm from
>>>      http://crbug.com/468342
>>>    ...that's just a memory stress test app.
>>> 3. On a 4GB rk3399 machine, run
>>>      nice ./launchBalloons.sh 4 900 100000
>>>    ...that tries to eat 4 * 900 MB of memory and keep accessing.
>>> 4. Login to the Chrome web browser and restore many tabs
>>>
>>> With that, I've seen printouts like:
>>>   DOUG: long bufio 90758 ms
>>> ...and stack trace always show's we're in dm_bufio_prefetch().
>>>
>>> The problem is that we try to allocate memory with GFP_NOIO while
>>> we're holding the dm_bufio lock.  Instead we should be using
>>> GFP_NOWAIT.  Using GFP_NOIO can cause us to sleep while holding the
>>> lock and that causes the above problems.
>>>
>>> The current behavior explained by David Rientjes:
>>>
>>>   It will still try reclaim initially because __GFP_WAIT (or
>>>   __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO.  This is the cause of
>>>   contention on dm_bufio_lock() that the thread holds.  You want to
>>>   pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
>>>   mutex that can be contended by a concurrent slab shrinker (if
>>>   count_objects didn't use a trylock, this pattern would trivially
>>>   deadlock).
>>>
>>> Suggested-by: David Rientjes <rientjes@google.com>
>>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>>> ---
>>> Note that this change was developed and tested against the Chrome OS
>>> 4.4 kernel tree, not mainline.  Due to slight differences in verity
>>> between mainline and Chrome OS it became too difficult to reproduce my
>>> testing setup on mainline.  This patch still seems correct and
>>> relevant to upstream, so I'm posting it.  If this is not acceptible to
>>> you then please ignore this patch.
>>>
>>> Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
>>> reproduce the long delays described in the patch.  Presumably
>>> something changed in either the kernel config or the memory management
>>> code between the two kernel versions that made this crop up.  In a
>>> similar vein, it is possible that problems described in this patch are
>>> no longer reproducible upstream.  However, the arguments made in this
>>> patch (that we don't want to block while holding the mutex) still
>>> apply so I think the patch may still have merit.
>>>
>>>  drivers/md/dm-bufio.c | 6 ++++--
>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>>> index b3ba142e59a4..3c767399cc59 100644
>>> --- a/drivers/md/dm-bufio.c
>>> +++ b/drivers/md/dm-bufio.c
>>> @@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>>        * dm-bufio is resistant to allocation failures (it just keeps
>>>        * one buffer reserved in cases all the allocations fail).
>>>        * So set flags to not try too hard:
>>> -      *      GFP_NOIO: don't recurse into the I/O layer
>>> +      *      GFP_NOWAIT: don't wait; if we need to sleep we'll release our
>>> +      *                  mutex and wait ourselves.
>>>        *      __GFP_NORETRY: don't retry and rather return failure
>>>        *      __GFP_NOMEMALLOC: don't use emergency reserves
>>>        *      __GFP_NOWARN: don't print a warning in case of failure
>>> @@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>>        */
>>>       while (1) {
>>>               if (dm_bufio_cache_size_latch != 1) {
>>> -                     b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>> +                     b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
>>> +                                      __GFP_NOMEMALLOC | __GFP_NOWARN);
>>>                       if (b)
>>>                               return b;
>>>               }
>>> --
>>> 2.8.0.rc3.226.g39d4020
>>>
>>> --
>>> dm-devel mailing list
>>> dm-devel@redhat.com
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>>
>>
>> From: Mikulas Patocka <mpatocka@redhat.com>
>>
>> Subject: dm-bufio: drop the lock when doing GFP_NOIO alloaction
>>
>> Drop the lock when doing GFP_NOIO alloaction beacuse the allocation can
>> take some time.
>>
>> Note that we won't do GFP_NOIO allocation when we loop for the second
>> time, because the lock shouldn't be dropped between __wait_for_free_buffer
>> and __get_unclaimed_buffer.
>>
>> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>>
>> ---
>>  drivers/md/dm-bufio.c |   13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6/drivers/md/dm-bufio.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/md/dm-bufio.c
>> +++ linux-2.6/drivers/md/dm-bufio.c
>> @@ -822,11 +822,13 @@ enum new_flag {
>>  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
>>  {
>>         struct dm_buffer *b;
>> +       bool tried_noio_alloc = false;
>>
>>         /*
>>          * dm-bufio is resistant to allocation failures (it just keeps
>>          * one buffer reserved in cases all the allocations fail).
>>          * So set flags to not try too hard:
>> +        *      GFP_NOWAIT: don't sleep and don't release cache
>>          *      GFP_NOIO: don't recurse into the I/O layer
>>          *      __GFP_NORETRY: don't retry and rather return failure
>>          *      __GFP_NOMEMALLOC: don't use emergency reserves
>> @@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
>>          */
>>         while (1) {
>>                 if (dm_bufio_cache_size_latch != 1) {
>> -                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> +                       b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>                         if (b)
>>                                 return b;
>>                 }
>> @@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
>>                 if (nf == NF_PREFETCH)
>>                         return NULL;
>>
>> +               if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
>> +                       dm_bufio_unlock(c);
>> +                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> +                       dm_bufio_lock(c);
>> +                       if (b)
>> +                               return b;
>> +                       tried_noio_alloc = true;
>> +               }
>> +
>>                 if (!list_empty(&c->reserved_buffers)) {
>>                         b = list_entry(c->reserved_buffers.next,
>>                                        struct dm_buffer, lru_list);
>
> I saw a git pull go by today from Mike Snitzer with my version of the
> patch in it.  I think this is fine because I think my version of the
> patch works all right, but I think Mikulas's version of the patch (see
> above) is even better.
>
> Since the "git pull" was to Linus and I believe that my version of the
> patch is functional (even if it's not optimal), maybe the right thing
> to do is to send a new patch with Mikulas's changes atop mine.
> Mikulas: does that sound good to you?  Do you want to send it?

Dang, or I could read the full list of patches in the pull and realize
you guys already did that.  Sigh.  Sorry for the noise.

-Doug

^ permalink raw reply

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Doug Anderson @ 2016-12-15  0:53 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Alasdair Kergon, Shaohua Li, Dmitry Torokhov,
	linux-kernel@vger.kernel.org, linux-raid, dm-devel,
	David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <alpine.LRH.2.02.1611231543020.31481@file01.intranet.prod.int.rdu2.redhat.com>

Hi,

On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@redhat.com> wrote:
> Hi
>
> The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
> allocation doesn't. Your patch would incorrectly reuse buffers in a
> situation when the memory is filled with clean cached pages.
>
> Here I'm proposing an alternate patch that first tries GFP_NOWAIT
> allocation, then drops the lock and tries GFP_NOIO allocation.
>
> Note that the root cause why you are seeing this stacktrace is, that your
> block device is congested - i.e. there are too many requests in the
> device's queue - and note that fixing this wait won't fix the root cause
> (congested device).
>
> The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
> 13/16 size of the nr_requests value.
>
> If you don't want your device to report the congested status, you can
> increase /sys/block/<device>/queue/nr_requests - you should test if your
> chromebook is faster of slower with this setting increased. But note that
> this setting won't increase the IO-per-second of the device.
>
> Mikulas
>
>
> On Thu, 17 Nov 2016, Douglas Anderson wrote:
>
>> We've seen in-field reports showing _lots_ (18 in one case, 41 in
>> another) of tasks all sitting there blocked on:
>>
>>   mutex_lock+0x4c/0x68
>>   dm_bufio_shrink_count+0x38/0x78
>>   shrink_slab.part.54.constprop.65+0x100/0x464
>>   shrink_zone+0xa8/0x198
>>
>> In the two cases analyzed, we see one task that looks like this:
>>
>>   Workqueue: kverityd verity_prefetch_io
>>
>>   __switch_to+0x9c/0xa8
>>   __schedule+0x440/0x6d8
>>   schedule+0x94/0xb4
>>   schedule_timeout+0x204/0x27c
>>   schedule_timeout_uninterruptible+0x44/0x50
>>   wait_iff_congested+0x9c/0x1f0
>>   shrink_inactive_list+0x3a0/0x4cc
>>   shrink_lruvec+0x418/0x5cc
>>   shrink_zone+0x88/0x198
>>   try_to_free_pages+0x51c/0x588
>>   __alloc_pages_nodemask+0x648/0xa88
>>   __get_free_pages+0x34/0x7c
>>   alloc_buffer+0xa4/0x144
>>   __bufio_new+0x84/0x278
>>   dm_bufio_prefetch+0x9c/0x154
>>   verity_prefetch_io+0xe8/0x10c
>>   process_one_work+0x240/0x424
>>   worker_thread+0x2fc/0x424
>>   kthread+0x10c/0x114
>>
>> ...and that looks to be the one holding the mutex.
>>
>> The problem has been reproduced on fairly easily:
>> 0. Be running Chrome OS w/ verity enabled on the root filesystem
>> 1. Pick test patch: http://crosreview.com/412360
>> 2. Install launchBalloons.sh and balloon.arm from
>>      http://crbug.com/468342
>>    ...that's just a memory stress test app.
>> 3. On a 4GB rk3399 machine, run
>>      nice ./launchBalloons.sh 4 900 100000
>>    ...that tries to eat 4 * 900 MB of memory and keep accessing.
>> 4. Login to the Chrome web browser and restore many tabs
>>
>> With that, I've seen printouts like:
>>   DOUG: long bufio 90758 ms
>> ...and stack trace always show's we're in dm_bufio_prefetch().
>>
>> The problem is that we try to allocate memory with GFP_NOIO while
>> we're holding the dm_bufio lock.  Instead we should be using
>> GFP_NOWAIT.  Using GFP_NOIO can cause us to sleep while holding the
>> lock and that causes the above problems.
>>
>> The current behavior explained by David Rientjes:
>>
>>   It will still try reclaim initially because __GFP_WAIT (or
>>   __GFP_KSWAPD_RECLAIM) is set by GFP_NOIO.  This is the cause of
>>   contention on dm_bufio_lock() that the thread holds.  You want to
>>   pass GFP_NOWAIT instead of GFP_NOIO to alloc_buffer() when holding a
>>   mutex that can be contended by a concurrent slab shrinker (if
>>   count_objects didn't use a trylock, this pattern would trivially
>>   deadlock).
>>
>> Suggested-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Douglas Anderson <dianders@chromium.org>
>> ---
>> Note that this change was developed and tested against the Chrome OS
>> 4.4 kernel tree, not mainline.  Due to slight differences in verity
>> between mainline and Chrome OS it became too difficult to reproduce my
>> testing setup on mainline.  This patch still seems correct and
>> relevant to upstream, so I'm posting it.  If this is not acceptible to
>> you then please ignore this patch.
>>
>> Also note that when I tested the Chrome OS 3.14 kernel tree I couldn't
>> reproduce the long delays described in the patch.  Presumably
>> something changed in either the kernel config or the memory management
>> code between the two kernel versions that made this crop up.  In a
>> similar vein, it is possible that problems described in this patch are
>> no longer reproducible upstream.  However, the arguments made in this
>> patch (that we don't want to block while holding the mutex) still
>> apply so I think the patch may still have merit.
>>
>>  drivers/md/dm-bufio.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>> index b3ba142e59a4..3c767399cc59 100644
>> --- a/drivers/md/dm-bufio.c
>> +++ b/drivers/md/dm-bufio.c
>> @@ -827,7 +827,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>        * dm-bufio is resistant to allocation failures (it just keeps
>>        * one buffer reserved in cases all the allocations fail).
>>        * So set flags to not try too hard:
>> -      *      GFP_NOIO: don't recurse into the I/O layer
>> +      *      GFP_NOWAIT: don't wait; if we need to sleep we'll release our
>> +      *                  mutex and wait ourselves.
>>        *      __GFP_NORETRY: don't retry and rather return failure
>>        *      __GFP_NOMEMALLOC: don't use emergency reserves
>>        *      __GFP_NOWARN: don't print a warning in case of failure
>> @@ -837,7 +838,8 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>>        */
>>       while (1) {
>>               if (dm_bufio_cache_size_latch != 1) {
>> -                     b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>> +                     b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY |
>> +                                      __GFP_NOMEMALLOC | __GFP_NOWARN);
>>                       if (b)
>>                               return b;
>>               }
>> --
>> 2.8.0.rc3.226.g39d4020
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> Subject: dm-bufio: drop the lock when doing GFP_NOIO alloaction
>
> Drop the lock when doing GFP_NOIO alloaction beacuse the allocation can
> take some time.
>
> Note that we won't do GFP_NOIO allocation when we loop for the second
> time, because the lock shouldn't be dropped between __wait_for_free_buffer
> and __get_unclaimed_buffer.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  drivers/md/dm-bufio.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/drivers/md/dm-bufio.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-bufio.c
> +++ linux-2.6/drivers/md/dm-bufio.c
> @@ -822,11 +822,13 @@ enum new_flag {
>  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
>  {
>         struct dm_buffer *b;
> +       bool tried_noio_alloc = false;
>
>         /*
>          * dm-bufio is resistant to allocation failures (it just keeps
>          * one buffer reserved in cases all the allocations fail).
>          * So set flags to not try too hard:
> +        *      GFP_NOWAIT: don't sleep and don't release cache
>          *      GFP_NOIO: don't recurse into the I/O layer
>          *      __GFP_NORETRY: don't retry and rather return failure
>          *      __GFP_NOMEMALLOC: don't use emergency reserves
> @@ -837,7 +839,7 @@ static struct dm_buffer *__alloc_buffer_
>          */
>         while (1) {
>                 if (dm_bufio_cache_size_latch != 1) {
> -                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +                       b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
>                         if (b)
>                                 return b;
>                 }
> @@ -845,6 +847,15 @@ static struct dm_buffer *__alloc_buffer_
>                 if (nf == NF_PREFETCH)
>                         return NULL;
>
> +               if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> +                       dm_bufio_unlock(c);
> +                       b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +                       dm_bufio_lock(c);
> +                       if (b)
> +                               return b;
> +                       tried_noio_alloc = true;
> +               }
> +
>                 if (!list_empty(&c->reserved_buffers)) {
>                         b = list_entry(c->reserved_buffers.next,
>                                        struct dm_buffer, lru_list);

I saw a git pull go by today from Mike Snitzer with my version of the
patch in it.  I think this is fine because I think my version of the
patch works all right, but I think Mikulas's version of the patch (see
above) is even better.

Since the "git pull" was to Linus and I believe that my version of the
patch is functional (even if it's not optimal), maybe the right thing
to do is to send a new patch with Mikulas's changes atop mine.
Mikulas: does that sound good to you?  Do you want to send it?

Thanks!

-Doug

^ permalink raw reply

* [PATCH 6/6] md/r5cache: shift complex rmw from read path to write path
From: Song Liu @ 2016-12-14 23:38 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161214233806.4137850-1-songliubraving@fb.com>

Write back cache requires a complex RMW mechanism, where old data is
read into dev->orig_page for prexor, and then xor is done with
dev->page. This logic is already implemented in the write path.

However, current read path is not awared of this requirement. When
the array is optimal, the RMW is not required, as the data are
read from raid disks. However, when the target stripe is degraded,
complex RMW is required to generate right data.

To keep read path as clean as possible, we handle read path by
flushing degraded, in-journal stripes before processing reads to
missing dev.

Specifically, when there is read requests to a degraded stripe
with data in journal, handle_stripe_fill() calls
r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
will do the complex RMW and flush the stripe to RAID disks. After
that, read requests are handled.

There is one more corner case when there is non-overwrite bio for
the missing (or out of sync) dev. handle_stripe_dirtying() will not
be able to process the non-overwrite bios without constructing the
data in handle_stripe_fill(). This is fixed by delaying non-overwrite
bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
these bios after the stripe is flushed to raid disks.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0f25d84..6d5391f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2894,6 +2894,30 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
 	return r_sector;
 }
 
+/*
+ * There are cases where we want handle_stripe_dirtying() and
+ * schedule_reconstruction() to delay towrite to some dev of a stripe.
+ *
+ * This function checks whether we want to delay the towrite. Specifically,
+ * we delay the towrite when:
+ *   1. degraded stripe has a non-overwrite to the missing dev, AND this
+ *      stripe has data in journal (for other devices).
+ *
+ *      In this case, when reading data for the non-overwrite dev, it is
+ *      necessary to handle complex rmw of write back cache (prexor with
+ *      orig_page, and xor with page). To keep read path simple, we would
+ *      like to flush data in journal to RAID disks first, so complex rmw
+ *      is handled in the write patch (handle_stripe_dirtying).
+ *
+ *   2. to be added
+ */
+static inline bool delay_towrite(struct r5dev *dev,
+				   struct stripe_head_state *s)
+{
+	return dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
+		!test_bit(R5_Insync, &dev->flags) && s->injournal;
+}
+
 static void
 schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 			 int rcw, int expand)
@@ -2914,7 +2938,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 
-			if (dev->towrite) {
+			if (dev->towrite && !delay_towrite(dev, s)) {
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantdrain, &dev->flags);
 				if (!expand)
@@ -3491,10 +3515,25 @@ static void handle_stripe_fill(struct stripe_head *sh,
 	 * midst of changing due to a write
 	 */
 	if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
-	    !sh->reconstruct_state)
+	    !sh->reconstruct_state) {
+
+		/* for degraded stripe with data in journal, do not handle
+		 * read requests yet, instead, flush the stripe to raid
+		 * disks first, this avoids handling complex rmw of write
+		 * back cache (prexor with orig_page, and then xor with
+		 * page) in the read path
+		 */
+		if (s->injournal && s->failed) {
+			if (test_bit(STRIPE_R5C_CACHING, &sh->state))
+				r5c_make_stripe_write_out(sh);
+			goto out;
+		}
+
 		for (i = disks; i--; )
 			if (fetch_block(sh, s, i, disks))
 				break;
+	}
+out:
 	set_bit(STRIPE_HANDLE, &sh->state);
 }
 
@@ -3650,7 +3689,8 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
+		if (((dev->towrite && !delay_towrite(dev, s)) ||
+		     i == sh->pd_idx || i == sh->qd_idx ||
 		     test_bit(R5_InJournal, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(uptodate_for_rmw(dev) ||
@@ -3714,7 +3754,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite ||
+			if (((dev->towrite && !delay_towrite(dev, s)) ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-- 
2.9.3


^ permalink raw reply related

* [PATCH 5/6] md/raid5: move comment of fetch_block to right location
From: Song Liu @ 2016-12-14 23:38 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161214233806.4137850-1-songliubraving@fb.com>

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 760af2e..0f25d84 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3309,13 +3309,6 @@ static int want_replace(struct stripe_head *sh, int disk_idx)
 	return rv;
 }
 
-/* fetch_block - checks the given member device to see if its data needs
- * to be read or computed to satisfy a request.
- *
- * Returns 1 when no more member devices need to be checked, otherwise returns
- * 0 to tell the loop in handle_stripe_fill to continue
- */
-
 static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 			   int disk_idx, int disks)
 {
@@ -3406,6 +3399,12 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
 	return 0;
 }
 
+/* fetch_block - checks the given member device to see if its data needs
+ * to be read or computed to satisfy a request.
+ *
+ * Returns 1 when no more member devices need to be checked, otherwise returns
+ * 0 to tell the loop in handle_stripe_fill to continue
+ */
 static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
 		       int disk_idx, int disks)
 {
-- 
2.9.3


^ permalink raw reply related

* [PATCH 4/6] md/r5cache: read data into orig_page for prexor of cached data
From: Song Liu @ 2016-12-14 23:38 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161214233806.4137850-1-songliubraving@fb.com>

With write back cache, we use orig_page to do prexor. This patch
makes sure we read data into orig_page for it.

Flag R5_OrigPageUPTDODATE is added to show whether orig_page
has the latest data from raid disk.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c |  2 ++
 drivers/md/raid5.c       | 37 ++++++++++++++++++++++++++++---------
 drivers/md/raid5.h       |  5 +++++
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index cded2b1..e65de05 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2377,6 +2377,8 @@ void r5c_release_extra_page(struct stripe_head *sh)
 			struct page *p = sh->dev[i].orig_page;
 
 			sh->dev[i].orig_page = sh->dev[i].page;
+			clear_bit(R5_OrigPageUPTDODATE, &sh->dev[i].flags);
+
 			if (!using_disk_info_extra_page)
 				put_page(p);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b98dfb2..760af2e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1019,7 +1019,13 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 			if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
 				WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
-			sh->dev[i].vec.bv_page = sh->dev[i].page;
+
+			if (!op_is_write(op) &&
+			    test_bit(R5_InJournal, &sh->dev[i].flags) &&
+			    sh->dev[i].page != sh->dev[i].orig_page)
+				sh->dev[i].vec.bv_page = sh->dev[i].orig_page;
+			else
+				sh->dev[i].vec.bv_page = sh->dev[i].page;
 			bi->bi_vcnt = 1;
 			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
 			bi->bi_io_vec[0].bv_offset = 0;
@@ -2384,6 +2390,10 @@ static void raid5_end_read_request(struct bio * bi)
 		} else if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
 			clear_bit(R5_ReadNoMerge, &sh->dev[i].flags);
 
+		if (test_bit(R5_InJournal, &sh->dev[i].flags) &&
+		    sh->dev[i].page != sh->dev[i].orig_page)
+			set_bit(R5_OrigPageUPTDODATE, &sh->dev[i].flags);
+
 		if (atomic_read(&rdev->read_errors))
 			atomic_set(&rdev->read_errors, 0);
 	} else {
@@ -3598,6 +3608,21 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 		break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
 }
 
+/*
+ * For RMW in write back cache, we need extra page in prexor to store the
+ * old data. This page is stored in dev->orig_page.
+ *
+ * This function checks whether we have data for prexor. The exact logic
+ * is:
+ *       R5_UPTODATE && (!R5_InJournal || R5_OrigPageUPTDODATE)
+ */
+static inline bool uptodate_for_rmw(struct r5dev *dev)
+{
+	return (test_bit(R5_UPTODATE, &dev->flags)) &&
+		(!test_bit(R5_InJournal, &dev->flags) ||
+		 test_bit(R5_OrigPageUPTDODATE, &dev->flags));
+}
+
 static int handle_stripe_dirtying(struct r5conf *conf,
 				  struct stripe_head *sh,
 				  struct stripe_head_state *s,
@@ -3629,9 +3654,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
 		     test_bit(R5_InJournal, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !((test_bit(R5_UPTODATE, &dev->flags) &&
-		       (!test_bit(R5_InJournal, &dev->flags) ||
-			dev->page != dev->orig_page)) ||
+		    !(uptodate_for_rmw(dev) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
 			if (test_bit(R5_Insync, &dev->flags))
 				rmw++;
@@ -3643,7 +3666,6 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 		    i != sh->pd_idx && i != sh->qd_idx &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		      test_bit(R5_InJournal, &dev->flags) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
 			if (test_bit(R5_Insync, &dev->flags))
 				rcw++;
@@ -3697,9 +3719,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-			    !((test_bit(R5_UPTODATE, &dev->flags) &&
-			       (!test_bit(R5_InJournal, &dev->flags) ||
-				dev->page != dev->orig_page)) ||
+			    !(uptodate_for_rmw(dev) ||
 			      test_bit(R5_Wantcompute, &dev->flags)) &&
 			    test_bit(R5_Insync, &dev->flags)) {
 				if (test_bit(STRIPE_PREREAD_ACTIVE,
@@ -3726,7 +3746,6 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 			    i != sh->pd_idx && i != sh->qd_idx &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
-			      test_bit(R5_InJournal, &dev->flags) ||
 			      test_bit(R5_Wantcompute, &dev->flags))) {
 				rcw++;
 				if (test_bit(R5_Insync, &dev->flags) &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index b39fe46..6cc8d4c 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -322,6 +322,11 @@ enum r5dev_flags {
 			 * data and parity being written are in the journal
 			 * device
 			 */
+	R5_OrigPageUPTDODATE,	/* with write back cache, we read old data into
+				 * dev->orig_page for prexor. When this flag is
+				 * set, orig_page contains latest data in the
+				 * raid disk.
+				 */
 };
 
 /*
-- 
2.9.3


^ permalink raw reply related

* [PATCH 3/6] md/r5cache: flush data only stripes in r5l_recovery_log()
From: Song Liu @ 2016-12-14 23:38 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161214233806.4137850-1-songliubraving@fb.com>

When there is data only stripes in the journal, we flush them out in
r5l_recovery_log(). Ths logic is implemented in a new function:
r5c_recovery_flush_data_only_stripes();

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 58 +++++++++++++++++++++++++++++++++++-------------
 drivers/md/raid5.c       |  9 +++++++-
 drivers/md/raid5.h       |  4 ++++
 3 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index b4f2b82..cded2b1 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2062,7 +2062,7 @@ static int
 r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 				       struct r5l_recovery_ctx *ctx)
 {
-	struct stripe_head *sh, *next;
+	struct stripe_head *sh;
 	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
 	sector_t next_checkpoint = MaxSector;
@@ -2076,7 +2076,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 
 	WARN_ON(list_empty(&ctx->cached_list));
 
-	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+	list_for_each_entry(sh, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
 		int offset;
@@ -2126,14 +2126,41 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		ctx->pos = write_pos;
 		ctx->seq += 1;
 		next_checkpoint = sh->log_start;
-		list_del_init(&sh->lru);
-		raid5_release_stripe(sh);
 	}
 	log->next_checkpoint = next_checkpoint;
 	__free_page(page);
 	return 0;
 }
 
+static void r5c_recovery_flush_data_only_stripes(struct r5l_log *log,
+						 struct r5l_recovery_ctx *ctx)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct stripe_head *sh, *next;
+
+	if (ctx->data_only_stripes == 0)
+		return;
+
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_BACK;
+	set_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state);
+
+	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+		r5c_make_stripe_write_out(sh);
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
+	}
+
+	md_wakeup_thread(conf->mddev->thread);
+	wait_event(conf->wait_for_r5c_pre_init_flush,
+		   atomic_read(&conf->active_stripes) == 0 &&
+		   atomic_read(&conf->r5c_cached_full_stripes) == 0 &&
+		   atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+
+	clear_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state);
+	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+}
+
 static int r5l_recovery_log(struct r5l_log *log)
 {
 	struct mddev *mddev = log->rdev->mddev;
@@ -2160,32 +2187,31 @@ static int r5l_recovery_log(struct r5l_log *log)
 	pos = ctx.pos;
 	ctx.seq += 10000;
 
-	if (ctx.data_only_stripes == 0) {
-		log->next_checkpoint = ctx.pos;
-		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
-		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	}
 
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
 		pr_debug("md/raid:%s: starting from clean shutdown\n",
 			 mdname(mddev));
-	else {
+	else
 		pr_debug("md/raid:%s: recoverying %d data-only stripes and %d data-parity stripes\n",
 			 mdname(mddev), ctx.data_only_stripes,
 			 ctx.data_parity_stripes);
 
-		if (ctx.data_only_stripes > 0)
-			if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
-				pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
-				       mdname(mddev));
-				return -EIO;
-			}
+	if (ctx.data_only_stripes == 0) {
+		log->next_checkpoint = ctx.pos;
+		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
+		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
+	} else if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+		pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
+		       mdname(mddev));
+		return -EIO;
 	}
 
 	log->log_start = ctx.pos;
 	log->seq = ctx.seq;
 	log->last_checkpoint = pos;
 	r5l_write_super(log, pos);
+
+	r5c_recovery_flush_data_only_stripes(log, &ctx);
 	return 0;
 }
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 06d7279..b98dfb2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -232,7 +232,9 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 	 * When quiesce in r5c write back, set STRIPE_HANDLE for stripes with
 	 * data in journal, so they are not released to cached lists
 	 */
-	if (conf->quiesce && r5c_is_writeback(conf->log) &&
+	if ((conf->quiesce ||
+	     test_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state)) &&
+	    r5c_is_writeback(conf->log) &&
 	    !test_bit(STRIPE_HANDLE, &sh->state) && injournal != 0) {
 		if (test_bit(STRIPE_R5C_CACHING, &sh->state))
 			r5c_make_stripe_write_out(sh);
@@ -264,6 +266,10 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			    < IO_THRESHOLD)
 				md_wakeup_thread(conf->mddev->thread);
 		atomic_dec(&conf->active_stripes);
+		if (test_bit(R5C_PRE_INIT_FLUSH, &conf->cache_state) &&
+		    atomic_read(&conf->active_stripes) == 0)
+			wake_up(&sh->raid_conf->wait_for_r5c_pre_init_flush);
+
 		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
 			if (!r5c_is_writeback(conf->log))
 				list_add_tail(&sh->lru, temp_inactive_list);
@@ -6633,6 +6639,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	init_waitqueue_head(&conf->wait_for_quiescent);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
+	init_waitqueue_head(&conf->wait_for_r5c_pre_init_flush);
 	INIT_LIST_HEAD(&conf->handle_list);
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index ed8e136..b39fe46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -564,6 +564,9 @@ enum r5_cache_state {
 	R5C_EXTRA_PAGE_IN_USE,	/* a stripe is using disk_info.extra_page
 				 * for prexor
 				 */
+	R5C_PRE_INIT_FLUSH,	/* flushing data only stripes recovered from
+				 * the journal
+				 */
 };
 
 struct r5conf {
@@ -679,6 +682,7 @@ struct r5conf {
 	int			group_cnt;
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
+	wait_queue_head_t	wait_for_r5c_pre_init_flush;
 };
 
 
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/6] md/r5cache: assign conf->log before r5l_load_log()
From: Song Liu @ 2016-12-14 23:38 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu
In-Reply-To: <20161214233806.4137850-1-songliubraving@fb.com>

r5l_load_log() calls functions that requires a proper conf->log,
for example, r5c_is_writeback(). Therefore, we should set
conf->log before calling r5l_load_log(). If r5l_load_log() fails,
conf->log is set back to NULL.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 0f5e6a2..b4f2b82 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2636,14 +2636,16 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	spin_lock_init(&log->stripe_in_journal_lock);
 	atomic_set(&log->stripe_in_journal_count, 0);
 
+	rcu_assign_pointer(conf->log, log);
+
 	if (r5l_load_log(log))
 		goto error;
 
-	rcu_assign_pointer(conf->log, log);
 	set_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
 	return 0;
 
 error:
+	rcu_assign_pointer(conf->log, NULL);
 	md_unregister_thread(&log->reclaim_thread);
 reclaim_thread:
 	mempool_destroy(log->meta_pool);
-- 
2.9.3


^ permalink raw reply related

* [PATCH 1/6] md/r5cache: simplify handling of sh->log_start in recovery
From: Song Liu @ 2016-12-14 23:38 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu

We only need to update sh->log_start at the end of recovery,
which is r5c_recovery_rewrite_data_only_stripes(), so it is not
necessary to set it before that. In this patch, log_start is
removed from r5c_recovery_alloc_stripe().

After updating all sh->log_start, rewrite_data_only_stripes()
also updates log->next_checkpoints to the last sh->log_start.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index d7bfb6f..0f5e6a2 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1682,8 +1682,7 @@ r5l_recovery_replay_one_stripe(struct r5conf *conf,
 
 static struct stripe_head *
 r5c_recovery_alloc_stripe(struct r5conf *conf,
-			  sector_t stripe_sect,
-			  sector_t log_start)
+			  sector_t stripe_sect)
 {
 	struct stripe_head *sh;
 
@@ -1692,7 +1691,6 @@ r5c_recovery_alloc_stripe(struct r5conf *conf,
 		return NULL;  /* no more stripe available */
 
 	r5l_recovery_reset_stripe(sh);
-	sh->log_start = log_start;
 
 	return sh;
 }
@@ -1862,7 +1860,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 						stripe_sect);
 
 		if (!sh) {
-			sh = r5c_recovery_alloc_stripe(conf, stripe_sect, ctx->pos);
+			sh = r5c_recovery_alloc_stripe(conf, stripe_sect);
 			/*
 			 * cannot get stripe from raid5_get_active_stripe
 			 * try replay some stripes
@@ -1871,7 +1869,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 				r5c_recovery_replay_stripes(
 					cached_stripe_list, ctx);
 				sh = r5c_recovery_alloc_stripe(
-					conf, stripe_sect, ctx->pos);
+					conf, stripe_sect);
 			}
 			if (!sh) {
 				pr_debug("md/raid:%s: Increasing stripe cache size to %d to recovery data on journal.\n",
@@ -1879,8 +1877,8 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 					conf->min_nr_stripes * 2);
 				raid5_set_cache_size(mddev,
 						     conf->min_nr_stripes * 2);
-				sh = r5c_recovery_alloc_stripe(
-					conf, stripe_sect, ctx->pos);
+				sh = r5c_recovery_alloc_stripe(conf,
+							       stripe_sect);
 			}
 			if (!sh) {
 				pr_err("md/raid:%s: Cannot get enough stripes due to memory pressure. Recovery failed.\n",
@@ -1894,7 +1892,6 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 			if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
 			    test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
 				r5l_recovery_replay_one_stripe(conf, sh, ctx);
-				sh->log_start = ctx->pos;
 				list_move_tail(&sh->lru, cached_stripe_list);
 			}
 			r5l_recovery_load_data(log, sh, ctx, payload,
@@ -1933,8 +1930,6 @@ static void r5c_recovery_load_one_stripe(struct r5l_log *log,
 			set_bit(R5_UPTODATE, &dev->flags);
 		}
 	}
-	list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
-	atomic_inc(&log->stripe_in_journal_count);
 }
 
 /*
@@ -2070,6 +2065,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 	struct stripe_head *sh, *next;
 	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
+	sector_t next_checkpoint = MaxSector;
 
 	page = alloc_page(GFP_KERNEL);
 	if (!page) {
@@ -2078,6 +2074,8 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		return -ENOMEM;
 	}
 
+	WARN_ON(list_empty(&ctx->cached_list));
+
 	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
 		struct r5l_meta_block *mb;
 		int i;
@@ -2123,12 +2121,15 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 		sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
 			     REQ_OP_WRITE, REQ_FUA, false);
 		sh->log_start = ctx->pos;
+		list_add_tail(&sh->r5c, &log->stripe_in_journal_list);
+		atomic_inc(&log->stripe_in_journal_count);
 		ctx->pos = write_pos;
 		ctx->seq += 1;
-
+		next_checkpoint = sh->log_start;
 		list_del_init(&sh->lru);
 		raid5_release_stripe(sh);
 	}
+	log->next_checkpoint = next_checkpoint;
 	__free_page(page);
 	return 0;
 }
@@ -2139,7 +2140,6 @@ static int r5l_recovery_log(struct r5l_log *log)
 	struct r5l_recovery_ctx ctx;
 	int ret;
 	sector_t pos;
-	struct stripe_head *sh;
 
 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
@@ -2164,9 +2164,6 @@ static int r5l_recovery_log(struct r5l_log *log)
 		log->next_checkpoint = ctx.pos;
 		r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq++);
 		ctx.pos = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-	} else {
-		sh = list_last_entry(&ctx.cached_list, struct stripe_head, lru);
-		log->next_checkpoint = sh->log_start;
 	}
 
 	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
-- 
2.9.3


^ permalink raw reply related

* Re: What is the solution for USB HDs
From: Thomas Fjellstrom @ 2016-12-14 20:59 UTC (permalink / raw)
  To: Juan Carlos Carvajal B.; +Cc: linux-raid
In-Reply-To: <CAMQE2N47aUPkW4s8juNWLm7TmRhkk6B4UOo9TbGm1ZoujgokCQ@mail.gmail.com>

On Wednesday, December 14, 2016 12:08:47 PM MST you wrote:
> Hi Phil,
> 
> usbcore.autosuspend=-1
> 
> This might be the solution! I'll test it and let you know.
> 
> And regarding power issues:
> 
> - this was the first thing I notices and I bought a powered usb-hub to
> provide enough current to the HDs.
> 
> Using USB drives is kind of a home user scenario. I understand that
> MDADM might be more focused on large data centers, but would be great
> to have such a great tool also available for the average joe.

I don't know if I'd call linux md-raid a "home user" senario. home user is 
more just not having any kind of backup or at most a single usb stick or usb 
hdd. 

Now, I use md-raid but I am not typical. I have a little self built NAS box, 
with 5x5 TB disks in raid 5. (I had thought I did raid 6, but I think I 
skipped it cause I have a full backup of the array on another box using a 
bunch of my older drives). Nice litle box using the Lian-Li PC Q25 case. 
Though I'd like to upgrade to the PC Q26, but its hard to find in canada for a 
decent price :(

Most of my friends use synology or drobo NAS boxes so they don't have to care 
about all the inner-workings most of the time.

I find USB disks too finicky for any real use. Also many are rather slow. I do 
however use them on occasion, especially usb3/3.1 ssds for quick backup or re-
image work when a usb stick won't cut it.

> Best!
> Juan Carlos Carvajal Bermúdez
> +43 650 477 0005
> juca@juan-carlos.info
> Beingasse 17/2/3
> AT 1150, Wien
> www.juan-carlos.info
> 
> On 12 December 2016 at 04:17, Phil Turmel <philip@turmel.org> wrote:
> > On 12/11/2016 05:14 PM, Adam Goryachev wrote:
> >> On 12/12/16 04:14, juca@juan-carlos.info wrote:
> >> If you are still having problems after that, then please try to post
> >> more details on what happens when the drives vanish (eg, kernel logs,
> >> system logs, etc).
> >> 
> >> Also, you might consider an alternative SBC, some have SATA ports
> >> already available (RPi are the only SBC's I've used, but there are many
> >> other options/variations).
> > 
> > There's also a kernel boot parameter that can cut off all USB
> > auto-suspend, which I suspect is part of the problem.  It has been known
> > for a long time that USB is not robust enough to be trusted in MD arrays.
> > 
> > { looking around..... }
> > 
> > usbcore.autosuspend=-1
> > 
> > is what you need in your bootloader or builtin to the kernel.
> > 
> > Phil
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

^ permalink raw reply

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Shaohua Li @ 2016-12-14 19:47 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Artur Paszkiewicz, NeilBrown, linux-raid
In-Reply-To: <wrfjy3zjq2b3.fsf@redhat.com>

On Tue, Dec 13, 2016 at 10:25:04AM -0500, Jes Sorensen wrote:
> Shaohua Li <shli@kernel.org> writes:
> > On Wed, Dec 07, 2016 at 03:36:01PM +0100, Artur Paszkiewicz wrote:
> >> On 12/07/2016 01:32 AM, NeilBrown wrote:
> >> > 
> >> > I would expect to see as description of what a PPL actually is and how
> >> > it works here... but there is none.
> >> > 
> >> > The change-log for patch 06 has a tiny bit more information which is
> >> > just enough to be able to start trying to understand the code, but it
> >> > isn't much.
> >> > And none of this description gets into the code, or into the
> >> > Documentation/.  This makes it hard to review and hard to maintain.
> >> > 
> >> > Remember: if you want people to review you code, it is in your interest
> >> > to make it easy.  That means give lots of details.
> >> 
> >> Hi Neil,
> >> 
> >> Thank you for taking the time to look at this and for your feedback. I
> >> didn't try to make it hard to review... Sometimes it's easy to forget
> >> how non-obvious things are after looking at them for too long :) I will
> >> improve the descriptions and address the issues that you found in the
> >> next version of the patches.
> >
> > Havn't looked at the patches yet, being busy recently, sorry! When you repost
> > these, I'd like to know why we need another log for raid5 considering we
> > already had one to fix similar issue. What's the good/bad side of this new log?
> > There is such feature in Intel RSTe doesn't sound like a technical reason we
> > should support this.
> 
> Shaohua,
> 
> Any further thought on these patches? I am considering doing a release
> of mdadm early in the new year. it would be nice to include these
> patches if the feature is going in.
> 
> As for supporting it, if IMSM supports it and it is used in the field,
> then it seems legitimate for Linux to support it too. Just like we
> support so many other obscure pieces of hardware :)

Sure, I don't object to support it. Just need to understand how it works. Had a
brief review. The ondisk format looks good. That probably is related to mdadm
mostly. The disk format has alignment issue as Neil noted, which would be
unfriendly for non-x86 arch. Will we stick to this disk format or change it?
We'd make a decision.

For the implementation, I don't understand how the ppl works much, there aren't
many details there. Two things I noted:

- The code skips the log for full stripe write. This isn't good. It would means
  after a unclean shutdown/recovery, one disk has arbitrary data, not the old
  data and new data. This breaks an assumption in filesystem, after a failed
  write to a sector, the sector has either old or new data. Thinking about a
  write to superblock. The data could be old or new superblock, but it's still a
  superblock, not something random.

- From the patch 6 & 10, looks PPL only help recover unwritten disks. If one
  disk of a stripe is dirty (eg it's written before unclean shutdown), and it's
  lost in recovery, what will happen? Seems the data of lost disk will be read as
  0? It will break the assumption above too. If I understand the code clearly
  (maybe not, need clarification), this is a design flaw.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] Use disk sector size value to set offset for reading GPT
From: Wols Lists @ 2016-12-14 18:38 UTC (permalink / raw)
  To: Mariusz Dabrowski, linux-raid; +Cc: jes.sorensen
In-Reply-To: <cd102c85-a9d5-25a9-bf2c-706310b53cc6@intel.com>

On 14/12/16 11:04, Mariusz Dabrowski wrote:
> Hi,
> this patch affects only reading GPT from disk. It fixes minor issue when
> there
> were no warning from mdadm when user was trying to create 1-disk RAID 0
> with
> external metadata on disk with some partitions and given array size was
> smaller
> than total size of all partitions on this disk (mdadm failed to read GPT
> because it was using wrong offset).
> 
In both instances I'm thinking of, iirc, a user pulled a working array
from one machine, tried to assemble it in another, then went back to the
original machine and the GPT had disappeared.

I'm clutching a bit at straws, but when the same bug bites more than one
user, you think something must be behind it. And as you can tell from
the scenario I describe, nothing *should* have been writing to disk, yet
the disk got corrupted.

> I couldn't reproduce those incidents so I can't tell that this patch
> fixes the
> root cause of this issue.

It'll have to stay on the "mysterious" list then :-)
> 
> Regards,
> Mariusz

Cheers,
Wol


^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-14 14:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffE=KoVdoYRzkHdRMuCopjmUdcrP9-woFFr-4-VszGsSHRQ@mail.gmail.com>

On Wed, Dec 14, 2016 at 1:13 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang
> <jinpu.wang@profitbricks.com> wrote:
>> Thanks Neil,
>>
>> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote:
>>> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>>
>>>>
>>>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD.
>>>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1.
>>>>
>>>
>>> Thanks.
>>>
>>> I have an hypothesis.
>>>
>>> md_make_request() calls blk_queue_split().
>>> If that does split the request it will call generic_make_request()
>>> on the first half. That will call back into md_make_request() and
>>> raid1_make_request() which will submit requests to the underlying
>>> devices.  These will get caught on the bio_list_on_stack queue in
>>> generic_make_request().
>>> This is a queue which is not accounted in nr_queued.
>>>
>>> When blk_queue_split() completes, 'bio' will be the second half of the
>>> bio.
>>> This enters raid1_make_request() and by this time the array have been
>>> frozen.
>>> So wait_barrier() has to wait for pending requests to complete, and that
>>> includes the one that it stuck in bio_list_on_stack, which will never
>>> complete now.
>>>
>>> To see if this might be happening, please change the
>>>
>>>         blk_queue_split(q, &bio, q->bio_split);
>>>
>>> call in md_make_request() to
>>>
>>>         struct bio *tmp = bio;
>>>         blk_queue_split(q, &bio, q->bio_split);
>>>         WARN_ON_ONCE(bio != tmp);
>>>
>>> If that ever triggers, then the above is a real possibility.
>>
>> I triggered the warning as you expected, we can confirm the bug was
>> caused by your above hypothesis.
>> [  429.282235] ------------[ cut here ]------------
>> [  429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262
>> md_set_array_sectors+0xac0/0xc30 [md_mod]()
>> [  429.285288] Modules linked in: raid1 ibnbd_client(O)
>> ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm
>> iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c
>> ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core
>> ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd
>> edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm
>> _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor
>> button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci
>> libahci libata scsi_mod crc32c_intel r8169 psmo
>> use xhci_pci xhci_hcd [last unloaded: mlx_compat]
>> [  429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G           O    4.4.36-1-pse
>> rver #1
>> [  429.288825] Hardware name: To be filled by O.E.M. To be filled by
>> O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
>> [  429.289113]  0000000000000000 ffff8801f64ff8f0 ffffffff81424486
>> 0000000000000000
>> [  429.289538]  ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60
>> ffff8800b8f3e000
>> [  429.290157]  0000000000000000 ffff8800b51f4100 ffff880234f9a700
>> ffff880234f9a700
>> [  429.290594] Call Trace:
>> [  429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
>> [  429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
>> [  429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
>> [  429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
>> [  429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
>> [  429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
>> [  429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
>> [  429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
>> [  429.292011]  [<ffffffffa0834f64>] ?
>> ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
>> [  429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
>> [  429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
>> [  429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
>> [  429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
>> [  429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
>> [  429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
>> [  429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
>> [  429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
>> [  429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
>> [  429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
>> [  429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
>> [  429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
>> [  429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [  429.294610] ---[ end trace 25d1cece0e01494b ]---
>>
>> I double checked the nr_pending on heathy leg is still 1 as before.
>>
>>>
>>> Fixing the problem isn't very easy...
>>>
>>> You could try:
>>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>>>   (which you will need to export from block/bio.c),
>>>   passing mddev->queue->bio_split as the bio_set.
>>>
>>> 1/ change the wait_event_lock_irq() call in wait_barrier() to
>>>    wait_event_lock_irq_cmd(), and pass the new function as the command.
>>>
>>> That way, if wait_barrier() ever blocks, all the requests in
>>> bio_list_on_stack will be handled by a separate thread.
>>>
>>> NeilBrown
>>
>> I will try your sugested way to see if it fix the bug, will report back soon.
>>

Hi Neil,

I found a old mail thread below
http://www.spinics.net/lists/raid/msg52792.html

Likely Alex is trying to fix same bug, right?
in one reply you suggested to modify the call in make_request

@@ -1207,7 +1207,8 @@ read_again:
                                sectors_handled;
                        goto read_again;
                } else
-                       generic_make_request(read_bio);
+                       reschedule_retry(r1_bio);
                return;
        }


I append above change, it looks fix the bug, I've run same tests over
one hour,  no hung task anymore.

Do you think this is right fix? Do we still need the change you
suggested with punt_bios_to_rescuer?

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-14 12:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffEnCesgUp4gBsPN2L9qg3WSxNXsCcYEPWH-BaeEEktaqcw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8255 bytes --]

On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang
<jinpu.wang@profitbricks.com> wrote:
> Thanks Neil,
>
> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote:
>> On Wed, Dec 14 2016, Jinpu Wang wrote:
>>
>>>
>>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD.
>>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1.
>>>
>>
>> Thanks.
>>
>> I have an hypothesis.
>>
>> md_make_request() calls blk_queue_split().
>> If that does split the request it will call generic_make_request()
>> on the first half. That will call back into md_make_request() and
>> raid1_make_request() which will submit requests to the underlying
>> devices.  These will get caught on the bio_list_on_stack queue in
>> generic_make_request().
>> This is a queue which is not accounted in nr_queued.
>>
>> When blk_queue_split() completes, 'bio' will be the second half of the
>> bio.
>> This enters raid1_make_request() and by this time the array have been
>> frozen.
>> So wait_barrier() has to wait for pending requests to complete, and that
>> includes the one that it stuck in bio_list_on_stack, which will never
>> complete now.
>>
>> To see if this might be happening, please change the
>>
>>         blk_queue_split(q, &bio, q->bio_split);
>>
>> call in md_make_request() to
>>
>>         struct bio *tmp = bio;
>>         blk_queue_split(q, &bio, q->bio_split);
>>         WARN_ON_ONCE(bio != tmp);
>>
>> If that ever triggers, then the above is a real possibility.
>
> I triggered the warning as you expected, we can confirm the bug was
> caused by your above hypothesis.
> [  429.282235] ------------[ cut here ]------------
> [  429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262
> md_set_array_sectors+0xac0/0xc30 [md_mod]()
> [  429.285288] Modules linked in: raid1 ibnbd_client(O)
> ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm
> iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c
> ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core
> ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd
> edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm
> _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor
> button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci
> libahci libata scsi_mod crc32c_intel r8169 psmo
> use xhci_pci xhci_hcd [last unloaded: mlx_compat]
> [  429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G           O    4.4.36-1-pse
> rver #1
> [  429.288825] Hardware name: To be filled by O.E.M. To be filled by
> O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
> [  429.289113]  0000000000000000 ffff8801f64ff8f0 ffffffff81424486
> 0000000000000000
> [  429.289538]  ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60
> ffff8800b8f3e000
> [  429.290157]  0000000000000000 ffff8800b51f4100 ffff880234f9a700
> ffff880234f9a700
> [  429.290594] Call Trace:
> [  429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
> [  429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
> [  429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
> [  429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
> [  429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
> [  429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
> [  429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
> [  429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
> [  429.292011]  [<ffffffffa0834f64>] ?
> ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
> [  429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
> [  429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
> [  429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
> [  429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
> [  429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
> [  429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
> [  429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
> [  429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
> [  429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
> [  429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
> [  429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
> [  429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
> [  429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [  429.294610] ---[ end trace 25d1cece0e01494b ]---
>
> I double checked the nr_pending on heathy leg is still 1 as before.
>
>>
>> Fixing the problem isn't very easy...
>>
>> You could try:
>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>>   (which you will need to export from block/bio.c),
>>   passing mddev->queue->bio_split as the bio_set.
>>
>> 1/ change the wait_event_lock_irq() call in wait_barrier() to
>>    wait_event_lock_irq_cmd(), and pass the new function as the command.
>>
>> That way, if wait_barrier() ever blocks, all the requests in
>> bio_list_on_stack will be handled by a separate thread.
>>
>> NeilBrown
>
> I will try your sugested way to see if it fix the bug, will report back soon.
>
Hi Neil,

Sorry, bad news, with the 2 patch attached, I can still reproduce the same bug.
nr_pending on healthy leg is still 1, as before.
crash> struct r1conf 0xffff8800b7176100
struct r1conf {
  mddev = 0xffff8800b59b0000,
  mirrors = 0xffff88022bab7900,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  retry_list = {
    next = 0xffff880211b2ec40,
    prev = 0xffff88022819ad40
  },
  bio_end_io_list = {
    next = 0xffff880227e9a9c0,
    prev = 0xffff8802119c6140
  },
  pending_bio_list = {
    head = 0x0,
    tail = 0x0
  },
  pending_count = 0,
  wait_barrier = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8800adf3b818,
      prev = 0xffff88021180f7a8
    }
  },
  resync_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  nr_pending = 1675,
  nr_waiting = 100,
  nr_queued = 1673,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022c80f640,
  r1bio_pool = 0xffff88022b8b6a20,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008a90c80,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

 kobj = {
    name = 0xffff88022b7194a0 "dev-loop1",
    entry = {
      next = 0xffff880231495280,
      prev = 0xffff880231495280
    },
    parent = 0xffff8800b59b0050,
    kset = 0x0,
    ktype = 0xffffffffa0564060 <rdev_ktype>,
    sd = 0xffff8800b6510960,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8802314952c8,
      prev = 0xffff8802314952c8
    }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 0,
    journal_tail = 0
  },
  nr_pending = {
    counter = 1
  },


-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

[-- Attachment #2: 0001-block-export-punt_bios_to_rescuer.patch --]
[-- Type: text/x-patch, Size: 1566 bytes --]

From e7adbbb1a8d542ea68ada5996e0f9ffe87c479b6 Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@profitbricks.com>
Date: Wed, 14 Dec 2016 11:26:23 +0100
Subject: [PATCH 1/2] block: export punt_bios_to_rescuer

We need it later in raid1

Suggested-by: Neil Brown <neil.brown@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/bio.c         | 3 ++-
 include/linux/bio.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 46e2cc1..f6a250d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,7 +354,7 @@ static void bio_alloc_rescue(struct work_struct *work)
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+void punt_bios_to_rescuer(struct bio_set *bs)
 {
 	struct bio_list punt, nopunt;
 	struct bio *bio;
@@ -384,6 +384,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 
 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
 }
+EXPORT_SYMBOL(punt_bios_to_rescuer);
 
 /**
  * bio_alloc_bioset - allocate a bio for I/O
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 42e4e3c..6256ba7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -479,6 +479,7 @@ extern void bio_advance(struct bio *, unsigned);
 extern void bio_init(struct bio *);
 extern void bio_reset(struct bio *);
 void bio_chain(struct bio *, struct bio *);
+void punt_bios_to_rescuer(struct bio_set *);
 
 extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
-- 
2.7.4


[-- Attachment #3: 0002-raid1-fix-deadlock.patch --]
[-- Type: text/x-patch, Size: 1420 bytes --]

From 2ad4cc5e8b5d7ec9db7a6fffaa2fdcd5f20419bf Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@profitbricks.com>
Date: Wed, 14 Dec 2016 11:35:52 +0100
Subject: [PATCH 2/2] raid1: fix deadlock

Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 drivers/md/raid1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 478223c..61dafb1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -190,6 +190,11 @@ static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio)
 	}
 }
 
+static void raid1_punt_bios_to_rescuer(struct r1conf *conf)
+{
+	punt_bios_to_rescuer(conf->mddev->queue->bio_split);
+}
+
 static void free_r1bio(struct r1bio *r1_bio)
 {
 	struct r1conf *conf = r1_bio->mddev->private;
@@ -871,14 +876,15 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 		 * that queue to allow conf->start_next_window
 		 * to increase.
 		 */
-		wait_event_lock_irq(conf->wait_barrier,
+		wait_event_lock_irq_cmd(conf->wait_barrier,
 				    !conf->array_frozen &&
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
 				      current->bio_list &&
 				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
+				    conf->resync_lock,
+				    raid1_punt_bios_to_rescuer(conf));
 		conf->nr_waiting--;
 	}
 
-- 
2.7.4


^ permalink raw reply related

* Re: What is the solution for USB HDs
From: Juan Carlos Carvajal B. @ 2016-12-14 11:08 UTC (permalink / raw)
  Cc: linux-raid
In-Reply-To: <9481982b-bfda-644e-e160-74b94bbbec0c@turmel.org>

Hi Phil,

usbcore.autosuspend=-1

This might be the solution! I'll test it and let you know.

And regarding power issues:

- this was the first thing I notices and I bought a powered usb-hub to
provide enough current to the HDs.

Using USB drives is kind of a home user scenario. I understand that
MDADM might be more focused on large data centers, but would be great
to have such a great tool also available for the average joe.

Best!
Juan Carlos Carvajal Bermúdez
+43 650 477 0005
juca@juan-carlos.info
Beingasse 17/2/3
AT 1150, Wien
www.juan-carlos.info


On 12 December 2016 at 04:17, Phil Turmel <philip@turmel.org> wrote:
> On 12/11/2016 05:14 PM, Adam Goryachev wrote:
>> On 12/12/16 04:14, juca@juan-carlos.info wrote:
>> If you are still having problems after that, then please try to post
>> more details on what happens when the drives vanish (eg, kernel logs,
>> system logs, etc).
>>
>> Also, you might consider an alternative SBC, some have SATA ports
>> already available (RPi are the only SBC's I've used, but there are many
>> other options/variations).
>
> There's also a kernel boot parameter that can cut off all USB
> auto-suspend, which I suspect is part of the problem.  It has been known
> for a long time that USB is not robust enough to be trusted in MD arrays.
>
> { looking around..... }
>
> usbcore.autosuspend=-1
>
> is what you need in your bootloader or builtin to the kernel.
>
> Phil

^ permalink raw reply

* Re: What is the solution for USB HDs
From: Juan Carlos Carvajal B. @ 2016-12-14 11:07 UTC (permalink / raw)
  Cc: linux-raid
In-Reply-To: <171542c9-1ab8-0993-733d-fbb95467f512@websitemanagers.com.au>

Hi Adam,

thanks for the suggestions. Looking at the syslog I managed to see
what is happening. Apparently there is a bug in the Raspberry that
resets USB devices. See here:
https://www.raspberrypi.org/forums/viewtopic.php?f=91&t=47239

Apparently is the kernel itself that disconnects the devices.

Dec 13 07:52:52 fileserver kernel: [121050.860021] usb 1-1.3.2: USB
disconnect, device number 20
Dec 13 07:52:52 fileserver kernel: [121050.861653] sd 15:0:0:0: [sdc]
Synchronizing SCSI cache
Dec 13 07:52:52 fileserver kernel: [121050.861828] sd 15:0:0:0: [sdc]
Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Dec 13 07:52:52 fileserver udisksd[1963]: Unable to resolve
/sys/devices/virtual/block/md0/md/dev-sdb/block symlink
Dec 13 07:52:52 fileserver kernel: [121051.195770] usb 1-1.3.2: new
high-speed USB device number 21 using dwc_otg

The full dump is here:
http://pastebin.com/F18GHLjV

I'll post this also in the raspberry forums to see what they say.
thanks!

Best regards!

Juan Carlos Carvajal Bermúdez
+43 650 477 0005
juca@juan-carlos.info
Beingasse 17/2/3
AT 1150, Wien
www.juan-carlos.info

On 11 December 2016 at 23:14, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
>
> On 12/12/16 04:14, juca@juan-carlos.info wrote:
>>
>> Dear Linux-RAID Developers,
>>
>> I always wanted to have a small home server with a RAID-array to store pesonal and work files.
>>
>> So I decided to create a fileserver with a raspberry pi and a RAID-1 Array. For this I bought two USB 1TB HDs.
>>
>> After a while I noticed that Mdadm was marking my drives as failed or even worse as faulty. This was strange since both HDs are new.
>>
>> So I tested them. I deleted the RAID partitions completely with WIPEFS. Then I created Ext4 partitions and run a benchmark of the partitions in ubuntu. Everything is fine.
>>
>> I created then again my RAID array. To my surprise one drive was again faulty. I googled a little  and I found that my USB drives went to sleep mode and when they were woken up, one of them changed name /dev/sdb/ -> /dev/sdc. Mdadm was unable to find the "new" drive and marked it as faulty. Re-add didn't work. Adding it again went into an unnecessary and long sync of 1TB.
>>
>> I googled the problem and learnt about UDEV. It turns out that Linux gracefully creates symlinks to the disks-partitions under /dev/disks/by-id/... I thought that I could use this symlinks to fix the problem with Mdadm, but mdadm just followed the links to find again the changing /dev/sd* names.
>>
>> I continued my journey and found more about the UDEV rules. I created a couple of them, using the serial number of the disks. I turns out that changing the name is not anymore allowed. (NAME="sdb" is ignored) There is even a warning about this. See here: http://unix.stackexchange.com/questions/119593/is-there-a-way-to-change-device-names-in-dev-directory
>> "NAME="pendrak" ignored, kernel device nodes can not be renamed; please fix it in /etc/udev/rules.d/99-local.rules:1"
>>
>> My last desperate idea was to create a cron job that reads directly from the disks, to keep them awake and avoid a name change. Something like this:
>>
>> dd if=/dev/disk/by-id/usb-TOSHIBA_External_USB_3.0_20151214504D3-0:0 bs=1024 count=$(($[1<<10]*10)) skip=$(($SKIP)) ibs=1024 of=/dev/null iflag=direct status=none 2>&1
>>
>> But even with that workaround the name of a disk just changed and Mdadm marked it as faulty. --- I'm out of ideas.
>>
>> So I'm stuck with this wonderful situation:
>>
>> - MDADM will follow symlinks and ignore the /dev/disk/by-id/... identification.
>> - UDEV does not allow to change the name of a device.
>> - The names will apparently change no matter what I do.
>>
>> I'm starting to regret my decision to create this array. I expected some challenges but it is turning to be ridiculously difficult. Almost unusable.
>>
>> Maybe you have a solution for this.
>>
>> In any case it would be great if you consider some long-term out-of-the-box solution for this new reality of USB HDs. My ideas would be:
>>
>> -- Allow to create arrays with symlinks ("/dev/disk/by-id/usb-TOSHIBA_External_USB_3.0_20151214504D3-0:0")
>>
>> -- Include a keep-alive functionality that avoids names to change randomly.
>>
>> -- Store HD ids to recognize disks.
>>
>
> Hi,
> I think you are going down the wrong path. It isn't for Linux MD to magically "keep drives awake" or to detect which drive is the one you want to use.
>
> Also, you haven't provided enough information/detail on *why* the drives are vanishing. For a drive to enter sleep mode, and then wake up again, it shouldn't change device name. The only time it will change device name (in my experience) is if you unplug it, and then re-plug it.
> Given you are using a Raspberry Pi, and I've quite some experience with them, my initial question would be to confirm that you are providing sufficient power to the RPi, and to the 2 x USB HDD's, if there is not enough power, then one drive could easily "die" and then wake up again when the other drive reduces its power demand. One of the best ways to solve this is to:
> a) Use desktop drives (ie, drives that have their own power supply, and don't just draw power from USB
> b) Use a powered USB hub (ie, a USB hub that has it's own power supply)
> c) Use USB flash drives, or SSD's, I suspect both will have a lower power demand
>
> If you are still having problems after that, then please try to post more details on what happens when the drives vanish (eg, kernel logs, system logs, etc).
>
> Also, you might consider an alternative SBC, some have SATA ports already available (RPi are the only SBC's I've used, but there are many other options/variations).
>
> Regards,
> Adam
>
>
>
> --
> Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply

* Re: [PATCH] Use disk sector size value to set offset for reading GPT
From: Mariusz Dabrowski @ 2016-12-14 11:04 UTC (permalink / raw)
  To: Wols Lists, linux-raid; +Cc: jes.sorensen
In-Reply-To: <5850241B.8030007@youngman.org.uk>

Hi,
this patch affects only reading GPT from disk. It fixes minor issue when 
there
were no warning from mdadm when user was trying to create 1-disk RAID 0 with
external metadata on disk with some partitions and given array size was 
smaller
than total size of all partitions on this disk (mdadm failed to read GPT
because it was using wrong offset).

I couldn't reproduce those incidents so I can't tell that this patch 
fixes the
root cause of this issue.

Regards,
Mariusz

On 12/13/2016 05:38 PM, Wols Lists wrote:
> On 08/12/16 11:13, Mariusz Dabrowski wrote:
>> mdadm is using invalid byte-offset while reading GPT header to get
>> partition info (size, first sector, last sector etc.). Now this offset
>> is hardcoded to 512 bytes and it is not valid for disks with sector
>> size different than 512 bytes because MBR and GPT headers are aligned
>> to LBA, so valid offset for 4k drives is 4096 bytes.
>>
>> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> Could this be behind the couple of incidents recently, where an array
> has been moved from one machine to another, and the GPT has disappeared?
>
> I know I've been following the threads, and I've been puzzled in that
> I've thought "nothing should be writing there!"
>
> If it is, what mdadm/kernels are affected, and I can put it on the wiki,
> writing the page up about corrupted disks is next on my list.
>
> Cheers,
> Wol
>


^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-14 10:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <877f73zd5d.fsf@notabene.neil.brown.name>

Thanks Neil,

On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Dec 14 2016, Jinpu Wang wrote:
>
>>
>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD.
>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1.
>>
>
> Thanks.
>
> I have an hypothesis.
>
> md_make_request() calls blk_queue_split().
> If that does split the request it will call generic_make_request()
> on the first half. That will call back into md_make_request() and
> raid1_make_request() which will submit requests to the underlying
> devices.  These will get caught on the bio_list_on_stack queue in
> generic_make_request().
> This is a queue which is not accounted in nr_queued.
>
> When blk_queue_split() completes, 'bio' will be the second half of the
> bio.
> This enters raid1_make_request() and by this time the array have been
> frozen.
> So wait_barrier() has to wait for pending requests to complete, and that
> includes the one that it stuck in bio_list_on_stack, which will never
> complete now.
>
> To see if this might be happening, please change the
>
>         blk_queue_split(q, &bio, q->bio_split);
>
> call in md_make_request() to
>
>         struct bio *tmp = bio;
>         blk_queue_split(q, &bio, q->bio_split);
>         WARN_ON_ONCE(bio != tmp);
>
> If that ever triggers, then the above is a real possibility.

I triggered the warning as you expected, we can confirm the bug was
caused by your above hypothesis.
[  429.282235] ------------[ cut here ]------------
[  429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262
md_set_array_sectors+0xac0/0xc30 [md_mod]()
[  429.285288] Modules linked in: raid1 ibnbd_client(O)
ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm
iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c
ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core
ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd
edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm
_infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor
button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci
libahci libata scsi_mod crc32c_intel r8169 psmo
use xhci_pci xhci_hcd [last unloaded: mlx_compat]
[  429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G           O    4.4.36-1-pse
rver #1
[  429.288825] Hardware name: To be filled by O.E.M. To be filled by
O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014
[  429.289113]  0000000000000000 ffff8801f64ff8f0 ffffffff81424486
0000000000000000
[  429.289538]  ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60
ffff8800b8f3e000
[  429.290157]  0000000000000000 ffff8800b51f4100 ffff880234f9a700
ffff880234f9a700
[  429.290594] Call Trace:
[  429.290743]  [<ffffffff81424486>] dump_stack+0x4d/0x67
[  429.290893]  [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0
[  429.291046]  [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20
[  429.291202]  [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod]
[  429.291358]  [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0
[  429.291540]  [<ffffffff813fd522>] submit_bio+0x62/0x150
[  429.291693]  [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60
[  429.291847]  [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0
[  429.292011]  [<ffffffffa0834f64>] ?
ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client]
[  429.292271]  [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10
[  429.292417]  [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40
[  429.292566]  [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50
[  429.292746]  [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580
[  429.292894]  [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110
[  429.293073]  [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40
[  429.293284]  [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0
[  429.293492]  [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200
[  429.293703]  [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180
[  429.293901]  [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0
[  429.294047]  [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0
[  429.294194]  [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10
[  429.294375]  [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a
[  429.294610] ---[ end trace 25d1cece0e01494b ]---

I double checked the nr_pending on heathy leg is still 1 as before.

>
> Fixing the problem isn't very easy...
>
> You could try:
> 1/ write a function in raid1.c which calls punt_bios_to_rescuer()
>   (which you will need to export from block/bio.c),
>   passing mddev->queue->bio_split as the bio_set.
>
> 1/ change the wait_event_lock_irq() call in wait_barrier() to
>    wait_event_lock_irq_cmd(), and pass the new function as the command.
>
> That way, if wait_barrier() ever blocks, all the requests in
> bio_list_on_stack will be handled by a separate thread.
>
> NeilBrown

I will try your sugested way to see if it fix the bug, will report back soon.

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply

* [GIT PULL] MD update for 4.10
From: Shaohua Li @ 2016-12-14  6:23 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, linux-raid, neilb

Hi Linus,

Please pull MD changes for 4.10. This update includes:
- A raid5 writeback cache feature. The goal is to aggregate writes to make full
  stripe write and reduce read-modify-write. It's helpful for workload which
  does sequential write and follows fsync for example. This feature is
  experimental and off by default right now.
- FAILFAST support. This fails IOs to broken raid disks quickly, so can improve
  latency. It's mainly for DASD storage, but some patches help normal raid
  array too.
- Support bad block for raid array with external metadata
- AVX2 instruction support for raid6 parity calculation
- Normalize MD info output
- Add missing blktrace
- Other bug fixes

Thanks,
Shaohua

The following changes since commit b78b499a67c3f77aeb6cd0b54724bc38b141255d:

  Merge tag 'char-misc-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc (2016-12-13 12:11:01 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-linus

for you to fetch changes up to 20737738d397dfadbca1ea50dcc00d7259f500cf:

  Merge branch 'md-next' into md-linus (2016-12-13 12:40:15 -0800)

----------------------------------------------------------------
Dan Carpenter (1):
      md/r5cache: enable IRQs on error path

Gayatri Kammela (1):
      lib/raid6: Add AVX2 optimized xor_syndrome functions

Guoqing Jiang (1):
      md/bitmap: call bitmap_file_unmap once bitmap_storage_alloc returns -ENOMEM

JackieLiu (8):
      raid5-cache: restrict the use area of the log_offset variable
      md/raid5-cache: remove unnecessary function parameters
      md/raid5-cache: use ring add to prevent overflow
      md/raid5-cache: release the stripe_head at the appropriate location
      md/raid5-cache: remove the unnecessary next_cp_seq field from the r5l_log
      md/raid5-cache: do not need to set STRIPE_PREREAD_ACTIVE repeatedly
      md/raid5-cache: adjust the write position of the empty block if no data blocks
      md/raid5-cache: no recovery is required when create super-block

Konstantin Khlebnikov (1):
      md/raid5: limit request size according to implementation limits

NeilBrown (26):
      md: fix some issues with alloc_disk_sb()
      md: change all printk() to pr_err() or pr_warn() etc.
      md/bitmap: change all printk() to pr_*()
      md/linear: replace printk() with pr_*()
      md/multipath: replace printk() with pr_*()
      md/raid0: replace printk() with pr_*()
      md/raid1: change printk() to pr_*()
      md/raid10: change printk() to pr_*()
      md/raid5: change printk() to pr_*()
      md: perform async updates for metadata where possible.
      md/raid1: abort delayed writes when device fails.
      md/raid10: abort delayed writes when device fails.
      md/bitmap: Don't write bitmap while earlier writes might be in-flight
      md/raid1: fix: IO can block resync indefinitely
      md: define mddev flags, recovery flags and r1bio state bits using enums
      md: remove md_super_wait() call after bitmap_flush()
      md: add block tracing for bio_remapping
      md/bitmap: add blktrace event for writes to the bitmap
      md/raid1, raid10: add blktrace records when IO is delayed
      md/failfast: add failfast flag for md to be used by some personalities.
      md: Use REQ_FAILFAST_* on metadata writes where appropriate
      md/raid1: add failfast handling for reads.
      md/raid1: add failfast handling for writes.
      md/raid10: add failfast handling for reads.
      md/raid10: add failfast handling for writes.
      md: fix refcount problem on mddev when stopping array.

Shaohua Li (8):
      raid5-cache: fix lockdep warning
      md: add blktrace event for writes to superblock
      raid5-cache: suspend reclaim thread instead of shutdown
      md: stop write should stop journal reclaim
      md: takeover should clear unrelated bits
      md: MD_RECOVERY_NEEDED is set for mddev->recovery
      md: separate flags for superblock changes
      Merge branch 'md-next' into md-linus

Song Liu (14):
      md/r5cache: Check array size in r5l_init_log
      md/r5cache: move some code to raid5.h
      md/r5cache: State machine for raid5-cache write back mode
      md/r5cache: caching phase of r5cache
      md/r5cache: write-out phase and reclaim support
      md/r5cache: sysfs entry journal_mode
      md/r5cache: refactoring journal recovery code
      md/r5cache: r5cache recovery: part 1
      md/r5cache: r5cache recovery: part 2
      md/r5cache: handle FLUSH and FUA
      md/r5cache: handle alloc_page failure
      md/r5cache: run_no_space_stripes() when R5C_LOG_CRITICAL == 0
      md/raid5-cache: fix crc in rewrite_data_only_stripes()
      md/r5cache: after recovery, increase journal seq by 10000

Tomasz Majchrzak (4):
      md: add bad block support for external metadata
      md: don't fail an array if there are unacknowledged bad blocks
      md: wake up personality thread after array state update
      raid5: revert commit 11367799f3d1

Zhengyuan Liu (3):
      raid5-cache: add another check conditon before replaying one stripe
      raid5-cache: don't set STRIPE_R5C_PARTIAL_STRIPE flag while load stripe into cache
      md/r5cache: do r5c_update_log_state after log recovery

 drivers/md/bitmap.c            |  166 ++--
 drivers/md/dm-raid.c           |    4 +-
 drivers/md/linear.c            |   31 +-
 drivers/md/md.c                |  701 ++++++++-------
 drivers/md/md.h                |  108 ++-
 drivers/md/multipath.c         |   92 +-
 drivers/md/raid0.c             |  107 ++-
 drivers/md/raid1.c             |  247 ++++--
 drivers/md/raid1.h             |   19 +-
 drivers/md/raid10.c            |  295 ++++---
 drivers/md/raid10.h            |    2 +
 drivers/md/raid5-cache.c       | 1885 +++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.c             |  623 +++++++------
 drivers/md/raid5.h             |  172 +++-
 include/uapi/linux/raid/md_p.h |    7 +-
 lib/raid6/avx2.c               |  232 ++++-
 16 files changed, 3429 insertions(+), 1262 deletions(-)

^ permalink raw reply

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-12-14  6:09 UTC (permalink / raw)
  To: Milan Broz
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <d6d92865-98fa-4d02-035f-9080bc265c35@gmail.com>

Hi Milan,

Thank you for the reply.

On 13 December 2016 at 15:31, Milan Broz <gmazyland@gmail.com> wrote:

> I really do not think the disk encryption key management should be moved
> outside of dm-crypt. We cannot then change key structure later easily.

Yes, I agree. but the key selection based on sector number restricts the
option of having a larger block size used for encryption.

>> +     unsigned int key_size;
>> +     unsigned int key_extra_size;
>> +     unsigned int key_parts;      /* independent parts in key buffer */
>
> ^^^ these key sizes you probably mean by key management.

Yes, I mean splitting the keys into subkeys based on the keycount
parameter (as mentioned below) to the dm-crypt.

cipher[:keycount]-mode-iv:ivopts
aes:2-cbc-essiv:sha256

> It is based on way how the key is currently sent into kernel
> (one hexa string in ioctl that needs to be split) and have to be changed in future.

-Binoy

^ permalink raw reply

* Re: (user) Help needed: mdadm seems to constantly touch my disks
From: NeilBrown @ 2016-12-14  1:15 UTC (permalink / raw)
  To: Jure Erznožnik, linux-raid
In-Reply-To: <CAJ=9zicFM+wOUhpdX1XEVm+y2gYoW3r2aTRj6gAcEkypxJULmA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

On Tue, Dec 13 2016, Jure Erznožnik wrote:

> First of all, I apologise if this mail list is not intended for layman
> help, but this is what I am and I couldn't get an explanation
> elsewhere.
>
> My problem is that (as it seems) mdadm is touching HDD superblocks
> once per second, once at address 8 (sectors), next at address 16.
> Total traffic is kilobytes per second, writes only, no other
> detectable traffic.
>
> I have detailed the problem here:
> http://unix.stackexchange.com/questions/329477/
>
> Shortened:
> kubuntu 16.10 4.8.0-30-generic #32, mdadm v3.4 2016-01-28
> My configuration: 4 spinning platters (/dev/sd[cdef]) assembled into a
> raid5 array, then bcache set to cache (hopefully) everything
> (cache_mode = writeback, sequential_cutoff = 0). On top of bcache
> volume I have set up lvm.
>
> * iostat shows traffic on sd[cdef] and md0
> * iotop shows no traffic
> * iosnoop shows COMM=[idle, md0_raid5, kworker] as processes working
> on the disk. Blocks reported are 8, 16 (data size a few KB) and
> 18446744073709500000 (data size 0). That last one must be some virtual
> thingie as the disks are nowhere near that large.
> * enabling block_dump shows md0_raid5 process writing to block 8 (1
> sectors) and 16 (8 sectors)
>
> This touching is caused by any write into the array and goes on for
> quite a while after the write has been done (a couple of hours for
> 60GB of writes). When services actually work with the array, this
> becomes pretty much constant.
>
> What am I observing and is there any way of stopping it?

Start with the uppermost layer which has I/O that you cannot explain.
Presumably that is md0.
Run 'blktrace' on that device for a little while, then 'blkparse' to
look at the results.

 blktrace -w 10 md0
 blkparse *blktrace*

It will give the name of the process that initiated the request in [] at
the end of some lines.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-13 22:18 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <CAMGffEm4BDhKJSiDt=2WmxrpRutg862at7JP221gMi6_SpPomQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]

On Wed, Dec 14 2016, Jinpu Wang wrote:

>
> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD.
> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1.
>

Thanks.

I have an hypothesis.

md_make_request() calls blk_queue_split().
If that does split the request it will call generic_make_request()
on the first half. That will call back into md_make_request() and
raid1_make_request() which will submit requests to the underlying
devices.  These will get caught on the bio_list_on_stack queue in
generic_make_request().
This is a queue which is not accounted in nr_queued.

When blk_queue_split() completes, 'bio' will be the second half of the
bio.
This enters raid1_make_request() and by this time the array have been
frozen.
So wait_barrier() has to wait for pending requests to complete, and that
includes the one that it stuck in bio_list_on_stack, which will never
complete now.

To see if this might be happening, please change the

	blk_queue_split(q, &bio, q->bio_split);

call in md_make_request() to

	struct bio *tmp = bio;
	blk_queue_split(q, &bio, q->bio_split);
	WARN_ON_ONCE(bio != tmp);

If that ever triggers, then the above is a real possibility.

Fixing the problem isn't very easy...

You could try:
1/ write a function in raid1.c which calls punt_bios_to_rescuer()
  (which you will need to export from block/bio.c),
  passing mddev->queue->bio_split as the bio_set.

1/ change the wait_event_lock_irq() call in wait_barrier() to
   wait_event_lock_irq_cmd(), and pass the new function as the command.

That way, if wait_barrier() ever blocks, all the requests in
bio_list_on_stack will be handled by a separate thread.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock
From: Doug Anderson @ 2016-12-13 22:01 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alasdair Kergon, Mike Snitzer, Shaohua Li, Dmitry Torokhov,
	linux-kernel@vger.kernel.org, linux-raid, dm-devel,
	David Rientjes, Sonny Rao, Guenter Roeck
In-Reply-To: <CAD=FV=U+1VA23s1WMn4srH5U3Cjuz+0OLhq1qD5qw0oTE34xFQ@mail.gmail.com>

Hi,

On Mon, Dec 12, 2016 at 4:08 PM, Doug Anderson <dianders@chromium.org> wrote:
> OK, so I just put a printk in wait_iff_congested() and it didn't show
> me waiting for the timeout (!).  I know that I saw
> wait_iff_congested() in the originally reproduction of this problem,
> but it appears that in my little "balloon" reproduction it's not
> actually involved...
>
>
> ...I dug further and it appears that __alloc_pages_direct_reclaim() is
> actually what's slow.  Specifically it looks as if shrink_zone() can
> actually take quite a while.  As I've said, I'm not an expert on the
> memory manager but I'm not convinced that it's wrong for the direct
> reclaim path to be pretty slow at times, especially when I'm putting
> an abnormally high amount of stress on it.
>
> I'm going to take this as further evidence that the patch being
> discussed in this thread is a good one (AKA don't hold the dm bufio
> lock while allocating memory).  :)  If it's unexpected that
> shrink_zone() might take several seconds when under extreme memory
> pressure then I can do some additional digging.  Do note that I am
> running with "zram" and remember that I'm on an ancient 4.4-based
> kernel, so perhaps one of those two factors causes problems.

Sadly, I couldn't get this go as just "the way things were" in case
there was some major speedup to be had here.  :-P

I tracked this down to shrink_list() taking 1 ms per call (perhaps
because I have HZ=1000?) and in shrink_lruvec() the outer loop ran
many thousands of times.  Thus the total time taken by shrink_lruvec()
could easily be many seconds.

Wow, interesting, when I change HZ to 100 instead of 1000 then the
behavior changes quite a bit.  I can still get my bufio lock warning
easily, but all of a sudden shrink_lruvec() isn't slow.  :-P

OK, really truly going to stop digging further now...  ;)  Presumably
reporting weird behaviors with old kernels doesn't help anyone in
mainline, and I can buy the whole "memory accesses are slow when you
start thrashing the system" argument.

-Doug

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox