Linux RAID subsystem development


* Fwd: (user) Help needed: mdadm seems to constantly touch my disks
From: Jure Erznožnik @ 2016-12-15  7:01 UTC
  To: NeilBrown, linux-raid
In-Reply-To: <CAJ=9zieRuTNiEGuB_RouqbdLGoxNkn09yiogR6rND84LtMdbxA@mail.gmail.com>

Thanks for helping, Neil. I have run the suggested utilities and here
are my findings:

The process name at the end of each line is always [kworker/x:yy] (x:yy
changes somewhat) or [0].
A few lines from one of the outputs:

  9,0    3        0     0.061577998     0  m   N raid5 rcw 3758609392 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    0        1     0.065333879   283  C   W 11275825480 [0]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    2        1     2.086924691   283  C   W 11275861232 [0]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    0        4     3.102340646   283  C   W 11275861272 [0]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
  9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
  9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
  9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
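
The per-process attribution in a trace like the one above can be pulled
out mechanically. What follows is only a minimal sketch in C, assuming
the default blkparse text format in which the issuing process name is
printed in square brackets at the end of a line. The function names
trailing_comm and count_lines_from are made up for this illustration
and are not part of blktrace.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the process name that blkparse prints in [] at the end of a line
 * into `out` (at most outsz - 1 characters). Returns 1 if a trailing
 * [name] was found, 0 otherwise (e.g. for "m N raid5 rcw" messages).
 */
int trailing_comm(const char *line, char *out, size_t outsz)
{
	const char *open = strrchr(line, '[');
	const char *close = strrchr(line, ']');
	size_t len;

	if (!open || !close || close < open || outsz == 0)
		return 0;
	/* Only accept a bracket pair that ends the line (ignoring whitespace). */
	if (close[1 + strspn(close + 1, " \t\r\n")] != '\0')
		return 0;
	len = (size_t)(close - open - 1);
	if (len >= outsz)
		len = outsz - 1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 1;
}

/* Count the lines read from `in` whose trailing name equals `comm`. */
unsigned long count_lines_from(FILE *in, const char *comm)
{
	char line[512], name[64];
	unsigned long n = 0;

	while (fgets(line, sizeof(line), in))
		if (trailing_comm(line, name, sizeof(name)) &&
		    strcmp(name, comm) == 0)
			n++;
	return n;
}
```

On the lines above, trailing_comm would yield "kworker/3:38" for the Q
lines and "0" for the C lines; the "m N raid5 rcw" messages carry no
process name and are skipped.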

LP,
Jure

On Wed, Dec 14, 2016 at 2:15 AM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Dec 13 2016, Jure Erznožnik wrote:
>
>> First of all, I apologise if this mail list is not intended for layman
>> help, but this is what I am and I couldn't get an explanation
>> elsewhere.
>>
>> My problem is that (as it seems) mdadm is touching HDD superblocks
>> once per second, once at address 8 (sectors), next at address 16.
>> Total traffic is kilobytes per second, writes only, no other
>> detectable traffic.
>>
>> I have detailed the problem here:
>> http://unix.stackexchange.com/questions/329477/
>>
>> Shortened:
>> kubuntu 16.10 4.8.0-30-generic #32, mdadm v3.4 2016-01-28
>> My configuration: 4 spinning platters (/dev/sd[cdef]) assembled into a
>> raid5 array, then bcache set to cache (hopefully) everything
>> (cache_mode = writeback, sequential_cutoff = 0). On top of bcache
>> volume I have set up lvm.
>>
>> * iostat shows traffic on sd[cdef] and md0
>> * iotop shows no traffic
>> * iosnoop shows COMM=[idle, md0_raid5, kworker] as processes working
>> on the disk. Blocks reported are 8, 16 (data size a few KB) and
>> 18446744073709500000 (data size 0). That last one must be some virtual
>> thingie as the disks are nowhere near that large.
>> * enabling block_dump shows md0_raid5 process writing to block 8 (1
>> sectors) and 16 (8 sectors)
>>
>> This touching is caused by any write into the array and goes on for
>> quite a while after the write has been done (a couple of hours for
>> 60GB of writes). When services actually work with the array, this
>> becomes pretty much constant.
>>
>> What am I observing and is there any way of stopping it?
>
> Start with the uppermost layer which has I/O that you cannot explain.
> Presumably that is md0.
> Run 'blktrace' on that device for a little while, then 'blkparse' to
> look at the results.
>
>  blktrace -w 10 md0
>  blkparse *blktrace*
>
> It will give the name of the process that initiated the request in [] at
> the end of some lines.
>
> NeilBrown


* Re: [PATCH 5/8] linux: drop __bitwise__ everywhere
From: Stefan Schmidt @ 2016-12-15  9:04 UTC
  To: Michael S. Tsirkin, linux-kernel
  Cc: Kukjin Kim, Krzysztof Kozlowski, Javier Martinez Canillas,
	Russell King, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, Johannes Berg,
	Emmanuel Grumbach, Luca Coelho, Intel Linux Wireless, Kalle Valo,
	Greg Kroah-Hartman, Jiri Slaby, Lee Duncan, Chris Leech,
	James E.J. Bottomley, Martin K. Petersen, Nicholas A. Bellinger,
	Jason Wang, Alexander Aring, David S. Miller
In-Reply-To: <1481778865-27667-6-git-send-email-mst@redhat.com>

Hello.

On 15/12/16 06:15, Michael S. Tsirkin wrote:
> __bitwise__ used to mean "yes, please enable sparse checks
> unconditionally", but now that we dropped __CHECK_ENDIAN__
> __bitwise is exactly the same.
> There aren't many users, replace it by __bitwise everywhere.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  arch/arm/plat-samsung/include/plat/gpio-cfg.h    | 2 +-
>  drivers/md/dm-cache-block-types.h                | 6 +++---
>  drivers/net/ethernet/sun/sunhme.h                | 2 +-
>  drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h | 4 ++--
>  include/linux/mmzone.h                           | 2 +-
>  include/linux/serial_core.h                      | 4 ++--
>  include/linux/types.h                            | 4 ++--
>  include/scsi/iscsi_proto.h                       | 2 +-
>  include/target/target_core_base.h                | 2 +-
>  include/uapi/linux/virtio_types.h                | 6 +++---
>  net/ieee802154/6lowpan/6lowpan_i.h               | 2 +-
>  net/mac80211/ieee80211_i.h                       | 4 ++--
>  12 files changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/arch/arm/plat-samsung/include/plat/gpio-cfg.h b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> index 21391fa..e55d1f5 100644
> --- a/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> +++ b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> @@ -26,7 +26,7 @@
>
>  #include <linux/types.h>
>
> -typedef unsigned int __bitwise__ samsung_gpio_pull_t;
> +typedef unsigned int __bitwise samsung_gpio_pull_t;
>
>  /* forward declaration if gpio-core.h hasn't been included */
>  struct samsung_gpio_chip;
> diff --git a/drivers/md/dm-cache-block-types.h b/drivers/md/dm-cache-block-types.h
> index bed4ad4..389c9e8 100644
> --- a/drivers/md/dm-cache-block-types.h
> +++ b/drivers/md/dm-cache-block-types.h
> @@ -17,9 +17,9 @@
>   * discard bitset.
>   */
>
> -typedef dm_block_t __bitwise__ dm_oblock_t;
> -typedef uint32_t __bitwise__ dm_cblock_t;
> -typedef dm_block_t __bitwise__ dm_dblock_t;
> +typedef dm_block_t __bitwise dm_oblock_t;
> +typedef uint32_t __bitwise dm_cblock_t;
> +typedef dm_block_t __bitwise dm_dblock_t;
>
>  static inline dm_oblock_t to_oblock(dm_block_t b)
>  {
> diff --git a/drivers/net/ethernet/sun/sunhme.h b/drivers/net/ethernet/sun/sunhme.h
> index f430765..4a8d5b1 100644
> --- a/drivers/net/ethernet/sun/sunhme.h
> +++ b/drivers/net/ethernet/sun/sunhme.h
> @@ -302,7 +302,7 @@
>   * Always write the address first before setting the ownership
>   * bits to avoid races with the hardware scanning the ring.
>   */
> -typedef u32 __bitwise__ hme32;
> +typedef u32 __bitwise hme32;
>
>  struct happy_meal_rxd {
>  	hme32 rx_flags;
> diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h b/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
> index 1ad0ec1..84813b5 100644
> --- a/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
> +++ b/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
> @@ -228,7 +228,7 @@ enum iwl_ucode_tlv_flag {
>  	IWL_UCODE_TLV_FLAGS_BCAST_FILTERING	= BIT(29),
>  };
>
> -typedef unsigned int __bitwise__ iwl_ucode_tlv_api_t;
> +typedef unsigned int __bitwise iwl_ucode_tlv_api_t;
>
>  /**
>   * enum iwl_ucode_tlv_api - ucode api
> @@ -258,7 +258,7 @@ enum iwl_ucode_tlv_api {
>  #endif
>  };
>
> -typedef unsigned int __bitwise__ iwl_ucode_tlv_capa_t;
> +typedef unsigned int __bitwise iwl_ucode_tlv_capa_t;
>
>  /**
>   * enum iwl_ucode_tlv_capa - ucode capabilities
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0f088f3..36d9896 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -246,7 +246,7 @@ struct lruvec {
>  #define ISOLATE_UNEVICTABLE	((__force isolate_mode_t)0x8)
>
>  /* LRU Isolation modes. */
> -typedef unsigned __bitwise__ isolate_mode_t;
> +typedef unsigned __bitwise isolate_mode_t;
>
>  enum zone_watermarks {
>  	WMARK_MIN,
> diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
> index 5d49488..5def8e8 100644
> --- a/include/linux/serial_core.h
> +++ b/include/linux/serial_core.h
> @@ -111,8 +111,8 @@ struct uart_icount {
>  	__u32	buf_overrun;
>  };
>
> -typedef unsigned int __bitwise__ upf_t;
> -typedef unsigned int __bitwise__ upstat_t;
> +typedef unsigned int __bitwise upf_t;
> +typedef unsigned int __bitwise upstat_t;
>
>  struct uart_port {
>  	spinlock_t		lock;			/* port lock */
> diff --git a/include/linux/types.h b/include/linux/types.h
> index baf7183..d501ad3 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -154,8 +154,8 @@ typedef u64 dma_addr_t;
>  typedef u32 dma_addr_t;
>  #endif
>
> -typedef unsigned __bitwise__ gfp_t;
> -typedef unsigned __bitwise__ fmode_t;
> +typedef unsigned __bitwise gfp_t;
> +typedef unsigned __bitwise fmode_t;
>
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> diff --git a/include/scsi/iscsi_proto.h b/include/scsi/iscsi_proto.h
> index c1260d8..df156f1 100644
> --- a/include/scsi/iscsi_proto.h
> +++ b/include/scsi/iscsi_proto.h
> @@ -74,7 +74,7 @@ static inline int iscsi_sna_gte(u32 n1, u32 n2)
>  #define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
>
>  /* initiator tags; opaque for target */
> -typedef uint32_t __bitwise__ itt_t;
> +typedef uint32_t __bitwise itt_t;
>  /* below makes sense only for initiator that created this tag */
>  #define build_itt(itt, age) ((__force itt_t)\
>  	((itt) | ((age) << ISCSI_AGE_SHIFT)))
> diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
> index c211900..0055828 100644
> --- a/include/target/target_core_base.h
> +++ b/include/target/target_core_base.h
> @@ -149,7 +149,7 @@ enum se_cmd_flags_table {
>   * Used by transport_send_check_condition_and_sense()
>   * to signal which ASC/ASCQ sense payload should be built.
>   */
> -typedef unsigned __bitwise__ sense_reason_t;
> +typedef unsigned __bitwise sense_reason_t;
>
>  enum tcm_sense_reason_table {
>  #define R(x)	(__force sense_reason_t )(x)
> diff --git a/include/uapi/linux/virtio_types.h b/include/uapi/linux/virtio_types.h
> index e845e8c..55c3b73 100644
> --- a/include/uapi/linux/virtio_types.h
> +++ b/include/uapi/linux/virtio_types.h
> @@ -39,8 +39,8 @@
>   * - __le{16,32,64} for standard-compliant virtio devices
>   */
>
> -typedef __u16 __bitwise__ __virtio16;
> -typedef __u32 __bitwise__ __virtio32;
> -typedef __u64 __bitwise__ __virtio64;
> +typedef __u16 __bitwise __virtio16;
> +typedef __u32 __bitwise __virtio32;
> +typedef __u64 __bitwise __virtio64;
>
>  #endif /* _UAPI_LINUX_VIRTIO_TYPES_H */
> diff --git a/net/ieee802154/6lowpan/6lowpan_i.h b/net/ieee802154/6lowpan/6lowpan_i.h
> index 5ac7789..ac7c96b 100644
> --- a/net/ieee802154/6lowpan/6lowpan_i.h
> +++ b/net/ieee802154/6lowpan/6lowpan_i.h
> @@ -7,7 +7,7 @@
>  #include <net/inet_frag.h>
>  #include <net/6lowpan.h>
>
> -typedef unsigned __bitwise__ lowpan_rx_result;
> +typedef unsigned __bitwise lowpan_rx_result;
>  #define RX_CONTINUE		((__force lowpan_rx_result) 0u)
>  #define RX_DROP_UNUSABLE	((__force lowpan_rx_result) 1u)
>  #define RX_DROP			((__force lowpan_rx_result) 2u)
> diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
> index d37a577..b2069fb 100644
> --- a/net/mac80211/ieee80211_i.h
> +++ b/net/mac80211/ieee80211_i.h
> @@ -159,7 +159,7 @@ enum ieee80211_bss_valid_data_flags {
>  	IEEE80211_BSS_VALID_ERP			= BIT(3)
>  };
>
> -typedef unsigned __bitwise__ ieee80211_tx_result;
> +typedef unsigned __bitwise ieee80211_tx_result;
>  #define TX_CONTINUE	((__force ieee80211_tx_result) 0u)
>  #define TX_DROP		((__force ieee80211_tx_result) 1u)
>  #define TX_QUEUED	((__force ieee80211_tx_result) 2u)
> @@ -180,7 +180,7 @@ struct ieee80211_tx_data {
>  };
>
>
> -typedef unsigned __bitwise__ ieee80211_rx_result;
> +typedef unsigned __bitwise ieee80211_rx_result;
>  #define RX_CONTINUE		((__force ieee80211_rx_result) 0u)
>  #define RX_DROP_UNUSABLE	((__force ieee80211_rx_result) 1u)
>  #define RX_DROP_MONITOR		((__force ieee80211_rx_result) 2u)
>

For net/ieee802154/6lowpan/6lowpan_i.h

Acked-by: Stefan Schmidt <stefan-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>

regards
Stefan Schmidt
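
For readers unfamiliar with the annotation being renamed in the patch
quoted above, here is a small user-space sketch of how a __bitwise
typedef is meant to be used. It is an illustration under assumptions,
not kernel code: the macro definitions are simplified stand-ins for
the kernel's, and the helpers are modelled on the to_oblock() style
seen in the dm-cache hunk.

```c
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel annotations. sparse defines
 * __CHECKER__ while it runs; a normal compiler sees empty macros,
 * so the annotations cost nothing in the generated code.
 */
#ifdef __CHECKER__
#define __bitwise	__attribute__((bitwise))
#define __force		__attribute__((force))
#else
#define __bitwise
#define __force
#endif

typedef uint64_t dm_block_t;
typedef dm_block_t __bitwise dm_oblock_t;	/* index into the origin device */
typedef uint32_t __bitwise dm_cblock_t;		/* index into the cache device */

/* Conversions are spelled out, so the checker can flag accidental mixing. */
dm_oblock_t to_oblock(dm_block_t b)
{
	return (__force dm_oblock_t)b;
}

dm_block_t from_oblock(dm_oblock_t b)
{
	return (__force dm_block_t)b;
}

dm_cblock_t to_cblock(uint32_t b)
{
	return (__force dm_cblock_t)b;
}

uint32_t from_cblock(dm_cblock_t b)
{
	return (__force uint32_t)b;
}
```

With a regular compiler all of these are plain integer types. Under
sparse, passing a dm_cblock_t where a dm_oblock_t is expected without a
__force cast should produce a warning, which is the point of the
annotation.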


* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-15  9:24 UTC
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey
In-Reply-To: <87fulpyj33.fsf@notabene.neil.brown.name>

On Thu, Dec 15, 2016 at 4:20 AM, NeilBrown <neilb@suse.com> wrote:
>
>> Hi Neil,
>>
>> I found an old mail thread below:
>> http://www.spinics.net/lists/raid/msg52792.html
>>
>> Alex is probably trying to fix the same bug, right?
>> In one reply you suggested modifying the call in make_request:
>>
>> @@ -1207,7 +1207,8 @@ read_again:
>>                                 sectors_handled;
>>                         goto read_again;
>>                 } else
>> -                       generic_make_request(read_bio);
>> +                       reschedule_retry(r1_bio);
>>                 return;
>>         }
>>
>>
>> I applied the above change and it appears to fix the bug: I've run the
>> same tests for over an hour with no hung tasks.
>>
>> Do you think this is the right fix? Do we still need the change you
>> suggested with punt_bios_to_rescuer?
>
> I don't really like that fix.  I suspect it would probably hurt
> performance.
>
> I'd prefer to fix generic_make_request() to process queued requests in a
> more sensible order.
> Can you please try the following (with all other patches removed)?
> Thanks,
> NeilBrown

Thanks Neil, I will try it when I'm back in the office; I'm at home sick.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 14d7c0740dc0..3436b6fc3ef8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio)
>                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
>                 if (likely(blk_queue_enter(q, false) == 0)) {
> +                       struct bio_list hold;
> +                       struct bio_list lower, same;
> +
> +                       /* Create a fresh bio_list for all subordinate requests */
> +                       bio_list_merge(&hold, &bio_list_on_stack);
> +                       bio_list_init(&bio_list_on_stack);
>                         ret = q->make_request_fn(q, bio);
>
>                         blk_queue_exit(q);
>
> +                       /* sort new bios into those for a lower level
> +                        * and those for the same level
> +                        */
> +                       bio_list_init(&lower);
> +                       bio_list_init(&same);
> +                       while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +                               if (q == bdev_get_queue(bio->bi_bdev))
> +                                       bio_list_add(&same, bio);
> +                               else
> +                                       bio_list_add(&lower, bio);
> +                       /* now assemble so we handle the lowest level first */
> +                       bio_list_merge(&bio_list_on_stack, &lower);
> +                       bio_list_merge(&bio_list_on_stack, &same);
> +                       bio_list_merge(&bio_list_on_stack, &hold);
> +
>                         bio = bio_list_pop(current->bio_list);
>                 } else {
>                         struct bio *bio_next = bio_list_pop(current->bio_list);
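
The reordering done by the hunk above can be tried out in user space.
The following is a hedged sketch only: struct bio, struct bio_list and
the bio_list_*() helpers are toy stand-ins that mimic the kernel ones,
an integer `queue` replaces the request_queue pointer comparison, and
reorder_after_make_request() is a name invented here. The sketch
assumes every list has been initialised before use; the posted hunk
merges into `hold` without an explicit bio_list_init().

```c
#include <stddef.h>

/* Toy stand-ins for the kernel's struct bio and struct bio_list. */
struct bio {
	struct bio *next;
	int queue;	/* which device queue the bio targets */
	int id;		/* label used only to observe the order */
};

struct bio_list {
	struct bio *head, *tail;
};

void bio_list_init(struct bio_list *bl)
{
	bl->head = bl->tail = NULL;
}

void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->next = NULL;
	if (bl->tail)
		bl->tail->next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

/* Append all of src to dst; src itself is left untouched. */
void bio_list_merge(struct bio_list *dst, const struct bio_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
}

struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->next;
		if (!bl->head)
			bl->tail = NULL;
		bio->next = NULL;
	}
	return bio;
}

/*
 * `fresh` holds the bios queued by the make_request call that just
 * returned for queue `cur`; `hold` holds whatever was pending before.
 * Afterwards `fresh` is ordered: lower-level bios, same-level bios, hold.
 */
void reorder_after_make_request(struct bio_list *fresh,
				const struct bio_list *hold, int cur)
{
	struct bio_list lower, same;
	struct bio *bio;

	bio_list_init(&lower);
	bio_list_init(&same);
	while ((bio = bio_list_pop(fresh)) != NULL)
		bio_list_add(bio->queue == cur ? &same : &lower, bio);

	/* bio_list_pop() emptied `fresh`, so the merges rebuild it. */
	bio_list_merge(fresh, &lower);
	bio_list_merge(fresh, &same);
	bio_list_merge(fresh, hold);
}
```

The effect is that requests aimed at a lower-level device are handled
before further requests for the same device, and both come before
anything that was already queued.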



-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Artur Paszkiewicz @ 2016-12-15 11:44 UTC
  To: Shaohua Li, Jes Sorensen; +Cc: NeilBrown, linux-raid
In-Reply-To: <20161214194726.4to3rnqyrqlqlx7t@kernel.org>

On 12/14/2016 08:47 PM, Shaohua Li wrote:
> On Tue, Dec 13, 2016 at 10:25:04AM -0500, Jes Sorensen wrote:
>> Shaohua Li <shli@kernel.org> writes:
>>> On Wed, Dec 07, 2016 at 03:36:01PM +0100, Artur Paszkiewicz wrote:
>>>> On 12/07/2016 01:32 AM, NeilBrown wrote:
>>>>>
>>>>> I would expect to see a description of what a PPL actually is and how
>>>>> it works here... but there is none.
>>>>>
>>>>> The change-log for patch 06 has a tiny bit more information which is
>>>>> just enough to be able to start trying to understand the code, but it
>>>>> isn't much.
>>>>> And none of this description gets into the code, or into the
>>>>> Documentation/.  This makes it hard to review and hard to maintain.
>>>>>
>>>>> Remember: if you want people to review your code, it is in your interest
>>>>> to make it easy.  That means giving lots of details.
>>>>
>>>> Hi Neil,
>>>>
>>>> Thank you for taking the time to look at this and for your feedback. I
>>>> didn't try to make it hard to review... Sometimes it's easy to forget
>>>> how non-obvious things are after looking at them for too long :) I will
>>>> improve the descriptions and address the issues that you found in the
>>>> next version of the patches.
>>>
>>> Haven't looked at the patches yet; I've been busy recently, sorry! When you
>>> repost these, I'd like to know why we need another log for raid5, considering
>>> we already have one that fixes a similar issue. What are the good and bad sides
>>> of this new log? That such a feature exists in Intel RSTe doesn't sound like a
>>> technical reason for us to support this.
>>
>> Shaohua,
>>
>> Any further thoughts on these patches? I am considering doing a release
>> of mdadm early in the new year. It would be nice to include these
>> patches if the feature is going in.
>>
>> As for supporting it, if IMSM supports it and it is used in the field,
>> then it seems legitimate for Linux to support it too. Just like we
>> support so many other obscure pieces of hardware :)
> 
> Sure, I don't object to supporting it; I just need to understand how it works.
> I had a brief look. The on-disk format looks good; that is probably mostly a
> matter for mdadm. The disk format has an alignment issue, as Neil noted, which
> would be unfriendly for non-x86 architectures. Will we stick to this disk
> format or change it? We should make a decision.

This alignment issue will be fixed by extending the 'parity_disk' field
to 4 bytes. The 'checksum' field will then be properly aligned and the
size of the structure will be 24 bytes, also fixing the array alignment.
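
To make the alignment point concrete, here is a hedged sketch. The
email does not show the structure, so everything except the
'parity_disk' and 'checksum' names is an assumption: the other fields
and the struct names are hypothetical, chosen only so that the widened
layout comes out at the 24 bytes mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed layout with a 1-byte parity_disk. */
struct ppl_entry_old {
	uint64_t data_sector;	/* assumed field */
	uint32_t pp_size;	/* assumed field */
	uint32_t data_size;	/* assumed field */
	uint8_t  parity_disk;
	uint32_t checksum;	/* lands on an odd offset */
} __attribute__((packed));

/* Same hypothetical layout with parity_disk widened to 4 bytes. */
struct ppl_entry_new {
	uint64_t data_sector;	/* assumed field */
	uint32_t pp_size;	/* assumed field */
	uint32_t data_size;	/* assumed field */
	uint32_t parity_disk;
	uint32_t checksum;	/* now 4-byte aligned */
} __attribute__((packed));

/* 1-byte field: checksum at offset 17, entries 21 bytes apart. */
_Static_assert(offsetof(struct ppl_entry_old, checksum) == 17,
	       "old checksum offset");
_Static_assert(sizeof(struct ppl_entry_old) == 21, "old entry size");

/* 4-byte field: checksum aligned, and arrays of entries stay aligned. */
_Static_assert(offsetof(struct ppl_entry_new, checksum) % 4 == 0,
	       "new checksum is aligned");
_Static_assert(sizeof(struct ppl_entry_new) == 24, "new entry size");
```

Because 24 is a multiple of 8, every element of an array of such
entries starts on an 8-byte boundary when the array itself does, which
is what matters on architectures that penalise unaligned access.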

> For the implementation, I don't understand much of how the PPL works; there
> aren't many details there. Two things I noted:
> 
> - The code skips the log for a full stripe write. This isn't good. It would
>   mean that after an unclean shutdown/recovery, one disk has arbitrary data,
>   neither the old data nor the new data. This breaks an assumption in
>   filesystems: after a failed write to a sector, the sector has either old or
>   new data. Think about a write to a superblock. The data could be the old or
>   the new superblock, but it's still a superblock, not something random.
> 
> - From patches 6 & 10, it looks like PPL only helps recover unwritten disks. If
>   one disk of a stripe is dirty (e.g. it was written before the unclean
>   shutdown) and it's lost in recovery, what will happen? It seems the data of
>   the lost disk will be read as 0? That would break the assumption above too.
>   If I understand the code correctly (maybe not, I need clarification), this
>   is a design flaw.

PPL is only used to update the parity for a stripe; data chunks are not
modified at all during PPL recovery. The assumption was that it would
protect only from silent data corruption, to eliminate the cases where
data that was not touched by a write request could change. So if a dirty
disk is lost, no recovery is performed for this stripe (parity is not
updated). For a full stripe write we only recalculate the parity after a
dirty shutdown if all disks are available (like resync). So you are
right that it is still possible to have arbitrary data in the written
part of a stripe if that disk is lost. In such a case the behavior is the
same as in plain raid5.
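
As a rough illustration of "only the parity is updated", here is a toy
sketch in C. It is not the code from the patch series. It assumes the
usual meaning of partial parity, namely the XOR of the data chunks in a
stripe that a write leaves untouched, and it models a stripe as one
flat buffer of `ndata` chunks of `len` bytes each. All function names
are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR `src` into `dst`, byte by byte. */
void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * Partial parity for a write: XOR of the data chunks that the write
 * does NOT modify. modified[d] is non-zero for chunks being written.
 */
void partial_parity(uint8_t *pp, const uint8_t *stripe, const int *modified,
		    size_t ndata, size_t len)
{
	memset(pp, 0, len);
	for (size_t d = 0; d < ndata; d++)
		if (!modified[d])
			xor_into(pp, stripe + d * len, len);
}

/*
 * Recovery after an unclean shutdown: rebuild parity from the logged
 * partial parity and whatever is currently on the modified chunks.
 * The data chunks are only read, never rewritten.
 */
void recover_parity(uint8_t *parity, const uint8_t *pp, const uint8_t *stripe,
		    const int *modified, size_t ndata, size_t len)
{
	memcpy(parity, pp, len);
	for (size_t d = 0; d < ndata; d++)
		if (modified[d])
			xor_into(parity, stripe + d * len, len);
}
```

Whatever ended up on the modified chunks, the recomputed parity is
consistent with the data that is now on disk, so chunks that the write
never touched can still be reconstructed correctly. If a modified chunk
itself is lost there is nothing to XOR with, which matches the "no
recovery is performed for this stripe" case described above.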

Thanks,
Artur


* Re: [PATCH 5/8] linux: drop __bitwise__ everywhere
From: Greg Kroah-Hartman @ 2016-12-15 11:51 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Kukjin Kim, Krzysztof Kozlowski,
	Javier Martinez Canillas, Russell King, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, Johannes Berg,
	Emmanuel Grumbach, Luca Coelho, Intel Linux Wireless, Kalle Valo,
	Jiri Slaby, Lee Duncan, Chris Leech, James E.J. Bottomley,
	Martin K. Petersen, Nicholas A. Bellinger, Jason Wang,
	Alexander Aring, Stefan Schmidt, Davi
In-Reply-To: <1481778865-27667-6-git-send-email-mst@redhat.com>

On Thu, Dec 15, 2016 at 07:15:20AM +0200, Michael S. Tsirkin wrote:
> __bitwise__ used to mean "yes, please enable sparse checks
> unconditionally", but now that we dropped __CHECK_ENDIAN__
> __bitwise is exactly the same.
> There aren't many users, replace it by __bitwise everywhere.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  arch/arm/plat-samsung/include/plat/gpio-cfg.h    | 2 +-
>  drivers/md/dm-cache-block-types.h                | 6 +++---
>  drivers/net/ethernet/sun/sunhme.h                | 2 +-
>  drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h | 4 ++--
>  include/linux/mmzone.h                           | 2 +-
>  include/linux/serial_core.h                      | 4 ++--
>  include/linux/types.h                            | 4 ++--
>  include/scsi/iscsi_proto.h                       | 2 +-
>  include/target/target_core_base.h                | 2 +-
>  include/uapi/linux/virtio_types.h                | 6 +++---
>  net/ieee802154/6lowpan/6lowpan_i.h               | 2 +-
>  net/mac80211/ieee80211_i.h                       | 4 ++--
>  12 files changed, 20 insertions(+), 20 deletions(-)

for include/linux/serial_core.h:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [PATCH 5/8] linux: drop __bitwise__ everywhere
From: Krzysztof Kozlowski @ 2016-12-15 17:28 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Kukjin Kim, Krzysztof Kozlowski,
	Javier Martinez Canillas, Russell King, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, Johannes Berg,
	Emmanuel Grumbach, Luca Coelho, Intel Linux Wireless, Kalle Valo,
	Greg Kroah-Hartman, Jiri Slaby, Lee Duncan, Chris Leech,
	James E.J. Bottomley, Martin K. Petersen, Nicholas A. Bellinger,
	Jason Wang, Alexander Aring
In-Reply-To: <1481778865-27667-6-git-send-email-mst@redhat.com>

On Thu, Dec 15, 2016 at 07:15:20AM +0200, Michael S. Tsirkin wrote:
> __bitwise__ used to mean "yes, please enable sparse checks
> unconditionally", but now that we dropped __CHECK_ENDIAN__
> __bitwise is exactly the same.
> There aren't many users, replace it by __bitwise everywhere.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  arch/arm/plat-samsung/include/plat/gpio-cfg.h    | 2 +-
>  drivers/md/dm-cache-block-types.h                | 6 +++---
>  drivers/net/ethernet/sun/sunhme.h                | 2 +-
>  drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h | 4 ++--
>  include/linux/mmzone.h                           | 2 +-
>  include/linux/serial_core.h                      | 4 ++--
>  include/linux/types.h                            | 4 ++--
>  include/scsi/iscsi_proto.h                       | 2 +-
>  include/target/target_core_base.h                | 2 +-
>  include/uapi/linux/virtio_types.h                | 6 +++---
>  net/ieee802154/6lowpan/6lowpan_i.h               | 2 +-
>  net/mac80211/ieee80211_i.h                       | 4 ++--
>  12 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm/plat-samsung/include/plat/gpio-cfg.h b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> index 21391fa..e55d1f5 100644
> --- a/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> +++ b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> @@ -26,7 +26,7 @@
>  
>  #include <linux/types.h>
>  
> -typedef unsigned int __bitwise__ samsung_gpio_pull_t;
> +typedef unsigned int __bitwise samsung_gpio_pull_t;
>  
>  /* forward declaration if gpio-core.h hasn't been included */
>  struct samsung_gpio_chip;

For plat-samsung:
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>

Best regards,
Krzysztof


* Re: [PATCH 5/8] linux: drop __bitwise__ everywhere
From: Lee Duncan @ 2016-12-15 19:44 UTC
  To: Michael S. Tsirkin, linux-kernel
  Cc: Kukjin Kim, Krzysztof Kozlowski, Javier Martinez Canillas,
	Russell King, Alasdair Kergon, Mike Snitzer, dm-devel, Shaohua Li,
	Johannes Berg, Emmanuel Grumbach, Luca Coelho,
	Intel Linux Wireless, Kalle Valo, Greg Kroah-Hartman, Jiri Slaby,
	Chris Leech, James E.J. Bottomley, Martin K. Petersen,
	Nicholas A. Bellinger, Jason Wang, Alexander Aring,
	Stefan Schmidt
In-Reply-To: <1481778865-27667-6-git-send-email-mst@redhat.com>

On 12/14/2016 09:15 PM, Michael S. Tsirkin wrote:
> __bitwise__ used to mean "yes, please enable sparse checks
> unconditionally", but now that we dropped __CHECK_ENDIAN__
> __bitwise is exactly the same.
> There aren't many users, replace it by __bitwise everywhere.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  arch/arm/plat-samsung/include/plat/gpio-cfg.h    | 2 +-
>  drivers/md/dm-cache-block-types.h                | 6 +++---
>  drivers/net/ethernet/sun/sunhme.h                | 2 +-
>  drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h | 4 ++--
>  include/linux/mmzone.h                           | 2 +-
>  include/linux/serial_core.h                      | 4 ++--
>  include/linux/types.h                            | 4 ++--
>  include/scsi/iscsi_proto.h                       | 2 +-
>  include/target/target_core_base.h                | 2 +-
>  include/uapi/linux/virtio_types.h                | 6 +++---
>  net/ieee802154/6lowpan/6lowpan_i.h               | 2 +-
>  net/mac80211/ieee80211_i.h                       | 4 ++--
>  12 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm/plat-samsung/include/plat/gpio-cfg.h b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> index 21391fa..e55d1f5 100644
> --- a/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> +++ b/arch/arm/plat-samsung/include/plat/gpio-cfg.h
> @@ -26,7 +26,7 @@
>  
>  #include <linux/types.h>
>  
> -typedef unsigned int __bitwise__ samsung_gpio_pull_t;
> +typedef unsigned int __bitwise samsung_gpio_pull_t;
>  
>  /* forward declaration if gpio-core.h hasn't been included */
>  struct samsung_gpio_chip;
> diff --git a/drivers/md/dm-cache-block-types.h b/drivers/md/dm-cache-block-types.h
> index bed4ad4..389c9e8 100644
> --- a/drivers/md/dm-cache-block-types.h
> +++ b/drivers/md/dm-cache-block-types.h
> @@ -17,9 +17,9 @@
>   * discard bitset.
>   */
>  
> -typedef dm_block_t __bitwise__ dm_oblock_t;
> -typedef uint32_t __bitwise__ dm_cblock_t;
> -typedef dm_block_t __bitwise__ dm_dblock_t;
> +typedef dm_block_t __bitwise dm_oblock_t;
> +typedef uint32_t __bitwise dm_cblock_t;
> +typedef dm_block_t __bitwise dm_dblock_t;
>  
>  static inline dm_oblock_t to_oblock(dm_block_t b)
>  {
> diff --git a/drivers/net/ethernet/sun/sunhme.h b/drivers/net/ethernet/sun/sunhme.h
> index f430765..4a8d5b1 100644
> --- a/drivers/net/ethernet/sun/sunhme.h
> +++ b/drivers/net/ethernet/sun/sunhme.h
> @@ -302,7 +302,7 @@
>   * Always write the address first before setting the ownership
>   * bits to avoid races with the hardware scanning the ring.
>   */
> -typedef u32 __bitwise__ hme32;
> +typedef u32 __bitwise hme32;
>  
>  struct happy_meal_rxd {
>  	hme32 rx_flags;
> diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h b/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
> index 1ad0ec1..84813b5 100644
> --- a/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
> +++ b/drivers/net/wireless/intel/iwlwifi/iwl-fw-file.h
> @@ -228,7 +228,7 @@ enum iwl_ucode_tlv_flag {
>  	IWL_UCODE_TLV_FLAGS_BCAST_FILTERING	= BIT(29),
>  };
>  
> -typedef unsigned int __bitwise__ iwl_ucode_tlv_api_t;
> +typedef unsigned int __bitwise iwl_ucode_tlv_api_t;
>  
>  /**
>   * enum iwl_ucode_tlv_api - ucode api
> @@ -258,7 +258,7 @@ enum iwl_ucode_tlv_api {
>  #endif
>  };
>  
> -typedef unsigned int __bitwise__ iwl_ucode_tlv_capa_t;
> +typedef unsigned int __bitwise iwl_ucode_tlv_capa_t;
>  
>  /**
>   * enum iwl_ucode_tlv_capa - ucode capabilities
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0f088f3..36d9896 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -246,7 +246,7 @@ struct lruvec {
>  #define ISOLATE_UNEVICTABLE	((__force isolate_mode_t)0x8)
>  
>  /* LRU Isolation modes. */
> -typedef unsigned __bitwise__ isolate_mode_t;
> +typedef unsigned __bitwise isolate_mode_t;
>  
>  enum zone_watermarks {
>  	WMARK_MIN,
> diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
> index 5d49488..5def8e8 100644
> --- a/include/linux/serial_core.h
> +++ b/include/linux/serial_core.h
> @@ -111,8 +111,8 @@ struct uart_icount {
>  	__u32	buf_overrun;
>  };
>  
> -typedef unsigned int __bitwise__ upf_t;
> -typedef unsigned int __bitwise__ upstat_t;
> +typedef unsigned int __bitwise upf_t;
> +typedef unsigned int __bitwise upstat_t;
>  
>  struct uart_port {
>  	spinlock_t		lock;			/* port lock */
> diff --git a/include/linux/types.h b/include/linux/types.h
> index baf7183..d501ad3 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -154,8 +154,8 @@ typedef u64 dma_addr_t;
>  typedef u32 dma_addr_t;
>  #endif
>  
> -typedef unsigned __bitwise__ gfp_t;
> -typedef unsigned __bitwise__ fmode_t;
> +typedef unsigned __bitwise gfp_t;
> +typedef unsigned __bitwise fmode_t;
>  
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> diff --git a/include/scsi/iscsi_proto.h b/include/scsi/iscsi_proto.h
> index c1260d8..df156f1 100644
> --- a/include/scsi/iscsi_proto.h
> +++ b/include/scsi/iscsi_proto.h
> @@ -74,7 +74,7 @@ static inline int iscsi_sna_gte(u32 n1, u32 n2)
>  #define zero_data(p) {p[0]=0;p[1]=0;p[2]=0;}
>  
>  /* initiator tags; opaque for target */
> -typedef uint32_t __bitwise__ itt_t;
> +typedef uint32_t __bitwise itt_t;
>  /* below makes sense only for initiator that created this tag */
>  #define build_itt(itt, age) ((__force itt_t)\
>  	((itt) | ((age) << ISCSI_AGE_SHIFT)))
> diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
> index c211900..0055828 100644
> --- a/include/target/target_core_base.h
> +++ b/include/target/target_core_base.h
> @@ -149,7 +149,7 @@ enum se_cmd_flags_table {
>   * Used by transport_send_check_condition_and_sense()
>   * to signal which ASC/ASCQ sense payload should be built.
>   */
> -typedef unsigned __bitwise__ sense_reason_t;
> +typedef unsigned __bitwise sense_reason_t;
>  
>  enum tcm_sense_reason_table {
>  #define R(x)	(__force sense_reason_t )(x)
> diff --git a/include/uapi/linux/virtio_types.h b/include/uapi/linux/virtio_types.h
> index e845e8c..55c3b73 100644
> --- a/include/uapi/linux/virtio_types.h
> +++ b/include/uapi/linux/virtio_types.h
> @@ -39,8 +39,8 @@
>   * - __le{16,32,64} for standard-compliant virtio devices
>   */
>  
> -typedef __u16 __bitwise__ __virtio16;
> -typedef __u32 __bitwise__ __virtio32;
> -typedef __u64 __bitwise__ __virtio64;
> +typedef __u16 __bitwise __virtio16;
> +typedef __u32 __bitwise __virtio32;
> +typedef __u64 __bitwise __virtio64;
>  
>  #endif /* _UAPI_LINUX_VIRTIO_TYPES_H */
> diff --git a/net/ieee802154/6lowpan/6lowpan_i.h b/net/ieee802154/6lowpan/6lowpan_i.h
> index 5ac7789..ac7c96b 100644
> --- a/net/ieee802154/6lowpan/6lowpan_i.h
> +++ b/net/ieee802154/6lowpan/6lowpan_i.h
> @@ -7,7 +7,7 @@
>  #include <net/inet_frag.h>
>  #include <net/6lowpan.h>
>  
> -typedef unsigned __bitwise__ lowpan_rx_result;
> +typedef unsigned __bitwise lowpan_rx_result;
>  #define RX_CONTINUE		((__force lowpan_rx_result) 0u)
>  #define RX_DROP_UNUSABLE	((__force lowpan_rx_result) 1u)
>  #define RX_DROP			((__force lowpan_rx_result) 2u)
> diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
> index d37a577..b2069fb 100644
> --- a/net/mac80211/ieee80211_i.h
> +++ b/net/mac80211/ieee80211_i.h
> @@ -159,7 +159,7 @@ enum ieee80211_bss_valid_data_flags {
>  	IEEE80211_BSS_VALID_ERP			= BIT(3)
>  };
>  
> -typedef unsigned __bitwise__ ieee80211_tx_result;
> +typedef unsigned __bitwise ieee80211_tx_result;
>  #define TX_CONTINUE	((__force ieee80211_tx_result) 0u)
>  #define TX_DROP		((__force ieee80211_tx_result) 1u)
>  #define TX_QUEUED	((__force ieee80211_tx_result) 2u)
> @@ -180,7 +180,7 @@ struct ieee80211_tx_data {
>  };
>  
>  
> -typedef unsigned __bitwise__ ieee80211_rx_result;
> +typedef unsigned __bitwise ieee80211_rx_result;
>  #define RX_CONTINUE		((__force ieee80211_rx_result) 0u)
>  #define RX_DROP_UNUSABLE	((__force ieee80211_rx_result) 1u)
>  #define RX_DROP_MONITOR		((__force ieee80211_rx_result) 2u)
>

For iscsi initiator, looks good.

Acked-by: Lee Duncan <lduncan@suse.com>

-- 
Lee Duncan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply

* [PATCH] mdadm: add test case for raid5 write back cache
From: Song Liu @ 2016-12-16  0:00 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

This test case checks data integrity of raid5 write back cache
under various scenarios:

degraded mode, non-overwrite, raid-5/6.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tests/21raid5cache | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 tests/21raid5cache

diff --git a/tests/21raid5cache b/tests/21raid5cache
new file mode 100644
index 0000000..0dd97bf
--- /dev/null
+++ b/tests/21raid5cache
@@ -0,0 +1,87 @@
+# check data integrity with raid5 write back cache
+
+# create a 4kB random file and 4 files each with a 1kB chunk of the random file:
+#    randfile: ABCD   randchunk[0-3]:  A  B  C  D
+#
+# then create another random 1kB chunk E, and a new random page with A, B, E, D:
+#    randchunk4: E    newrandfile:   ABED
+create_random_data() {
+    dd if=/dev/urandom of=/tmp/randfile bs=4k count=1
+    for x in {0..3}
+    do
+        dd if=/tmp/randfile of=/tmp/randchunk$x bs=1k skip=$x count=1
+    done
+
+    dd if=/dev/urandom of=/tmp/randchunk4 bs=1k count=1
+
+    rm -f /tmp/newrandfile
+    for x in 0 1 4 3
+    do
+        cat /tmp/randchunk$x >> /tmp/newrandfile
+    done
+}
+
+# create array, $1 could be 5 for raid5 and 6 for raid6
+create_array() {
+    if [ $1 -lt 5 -o $1 -gt 6 ]
+    then
+        echo wrong array type $1
+        exit 2
+    fi
+
+    mdadm -CR $md0 -c4 -l$1 -n10 $dev0 $dev1 $dev2 $dev3 $dev4 $dev5 $dev6 $dev11 $dev8 $dev9 --write-journal $dev10
+    check wait
+    echo write-back > /sys/block/md0/md/journal_mode
+}
+
+restart_array_write_back() {
+    mdadm -S $md0
+    mdadm -A $md0 $dev0 $dev1 $dev2 $dev3 $dev4 $dev5 $dev6 $dev11 $dev8 $dev9 $dev10
+    echo write-back > /sys/block/md0/md/journal_mode
+}
+
+# compare the first page of md0 with file in $1
+cmp_first_page() {
+    cmp  -n 4096 $1 $md0 || { echo cmp failed ; exit 2 ; }
+}
+
+# write 3 pages after the first page of md0
+write_three_pages() {
+    for x in {1..3}
+    do
+        dd if=/dev/urandom of=$md0 bs=4k seek=$x count=1
+    done
+}
+
+# run_test <array_type:5/6> <degraded_or_not:yes/no>
+run_test() {
+    create_random_data
+    create_array $1
+
+    if [ $2 == yes ]
+    then
+        mdadm --fail $md0 $dev0
+    fi
+
+    dd if=/tmp/randfile of=$md0 bs=4k count=1
+    restart_array_write_back
+    cmp_first_page /tmp/randfile
+    restart_array_write_back
+    write_three_pages
+    cmp_first_page /tmp/randfile
+
+
+    dd if=/tmp/randchunk4 of=$md0 bs=1k count=1 seek=2
+    restart_array_write_back
+    cmp_first_page /tmp/newrandfile
+    restart_array_write_back
+    write_three_pages
+    cmp_first_page /tmp/newrandfile
+
+    mdadm -S $md0
+}
+
+run_test 5 no
+run_test 5 yes
+run_test 6 no
+run_test 6 yes
-- 
2.9.3

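A minimal sketch of the data layout the test above relies on, assuming a
regular file can stand in for $md0. It only checks that overwriting the third
1kB chunk of ABCD with E (dd seek=2) yields the same image as concatenating
A, B, E, D; it does not exercise the raid5 cache or any md device. The names
work, chunkE, expected, device and layout_ok are illustrative and not part of
the patch.

```shell
#!/bin/sh
# Sketch only: a regular file replaces $md0, so this exercises the
# ABCD -> ABED data layout of the test, not the write back cache.
work=$(mktemp -d)

# randfile = ABCD (4kB); chunk E is a separate random 1kB block.
dd if=/dev/urandom of="$work/randfile" bs=4k count=1 2>/dev/null
dd if=/dev/urandom of="$work/chunkE" bs=1k count=1 2>/dev/null

# Expected image ABED: chunks A and B, then E, then chunk D.
{
    dd if="$work/randfile" bs=1k count=2 2>/dev/null
    cat "$work/chunkE"
    dd if="$work/randfile" bs=1k skip=3 count=1 2>/dev/null
} > "$work/expected"

# "Device" image: write ABCD, then overwrite the third 1kB chunk with E.
# conv=notrunc is needed for a regular file; a block device is never truncated.
cp "$work/randfile" "$work/device"
dd if="$work/chunkE" of="$work/device" bs=1k seek=2 count=1 conv=notrunc 2>/dev/null

# Compare the two 4kB images byte for byte.
if cmp -s "$work/expected" "$work/device"; then
    layout_ok=yes
else
    layout_ok=no
fi
echo "layout match: $layout_ok"

rm -rf "$work"
```

In the real test the same comparison is done by cmp_first_page against $md0
after each array restart, which is what actually checks the cache.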

^ permalink raw reply related

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-12-16  5:55 UTC (permalink / raw)
  To: Milan Broz
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <d6d92865-98fa-4d02-035f-9080bc265c35@gmail.com>

Hi Milan,

On 13 December 2016 at 15:31, Milan Broz <gmazyland@gmail.com> wrote:

> I think that IV generators should not modify or read encrypted data directly,
> it should only generate IV.

I was trying to find more information about what you said and how an
IV generator should be written. I also saw two examples of IV generators
used with AEAD ciphers (crypto/seqiv.c and crypto/echainiv.c).

Excerpt from crypto api doc:
http://www.chronox.de/crypto-API/crypto/architecture.html#crypto-api-cipher-references-and-priority

2. Now, SEQIV uses the AEAD API function calls to invoke the associated
AEAD cipher. In our case, during the instantiation of SEQIV, the cipher
handle for GCM is provided to SEQIV. This means that SEQIV invokes
AEAD cipher operations with the GCM cipher handle.

Here, it says seqiv invokes cipher operations. However, the code in
crypto/seqiv.c does not look similar to how the modes are implemented, which
is confusing. I was looking for an example of an IV generator used with a
regular block cipher and not an AEAD cipher. Could you point me to one?

Thanks,
Binoy
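
For readers following the thread, here is a minimal sketch of what a
sector-based IV generator computes, assuming the dm-crypt "plain64" convention
(the IV is the sector number as a 64-bit little-endian value, zero-padded to
the IV size). This is illustrative shell, not kernel code, and it is not the
crypto API template discussed above. The function name plain64_iv and its
arguments are made up for this example.

```shell
#!/bin/sh
# plain64_iv SECTOR [IV_BYTES]
# Print the IV for SECTOR as a hex string: the first 8 bytes hold the
# sector number in little-endian order, any remaining bytes are zero.
plain64_iv() {
    sector=$1
    iv_bytes=${2:-16}
    iv=""
    i=0
    while [ "$i" -lt "$iv_bytes" ]; do
        if [ "$i" -lt 8 ]; then
            # Extract byte i of the sector number, least significant first.
            byte=$(( (sector >> (8 * i)) & 0xff ))
        else
            # Padding beyond the 64-bit sector number.
            byte=0
        fi
        iv="$iv$(printf '%02x' "$byte")"
        i=$((i + 1))
    done
    printf '%s\n' "$iv"
}

plain64_iv 1
plain64_iv 258 8
```

The point of the sketch is that the generator only derives the IV from the
sector number and never reads or modifies the data being encrypted.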

^ permalink raw reply

* Re: [PATCH v3 2/2] md/raid10: Refactor raid10_make_request
From: Shaohua Li @ 2016-12-16 19:59 UTC (permalink / raw)
  To: Robert LeBlanc; +Cc: linux-raid
In-Reply-To: <20161205200258.6653-3-robert@leblancnet.us>

On Mon, Dec 05, 2016 at 01:02:58PM -0700, Robert LeBlanc wrote:
> Refactor raid10_make_request into separate read and write functions to
> clean up the code.

Merged the two patches, thanks!

For this one, you deleted the recovery check in the read path; I added it back.
Please double check that the merged patch is good. The cleanup is not supposed
to change behavior, so next time you change something (like the recovery
check), please mention it.

Thanks,
Shaohua

 
> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
> ---
>  drivers/md/raid10.c | 215 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 111 insertions(+), 104 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 525ca99..8698e00 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1087,23 +1087,97 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	kfree(plug);
>  }
>  
> -static void __make_request(struct mddev *mddev, struct bio *bio)
> +static void raid10_read_request(struct mddev *mddev, struct bio *bio,
> +				struct r10bio *r10_bio)
>  {
>  	struct r10conf *conf = mddev->private;
> -	struct r10bio *r10_bio;
>  	struct bio *read_bio;
> +	const int op = bio_op(bio);
> +	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
> +	int sectors_handled;
> +	int max_sectors;
> +	struct md_rdev *rdev;
> +	int slot;
> +
> +	wait_barrier(conf);
> +
> +read_again:
> +	rdev = read_balance(conf, r10_bio, &max_sectors);
> +	if (!rdev) {
> +		raid_end_bio_io(r10_bio);
> +		return;
> +	}
> +	slot = r10_bio->read_slot;
> +
> +	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> +	bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
> +		 max_sectors);
> +
> +	r10_bio->devs[slot].bio = read_bio;
> +	r10_bio->devs[slot].rdev = rdev;
> +
> +	read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
> +		choose_data_offset(r10_bio, rdev);
> +	read_bio->bi_bdev = rdev->bdev;
> +	read_bio->bi_end_io = raid10_end_read_request;
> +	bio_set_op_attrs(read_bio, op, do_sync);
> +	if (test_bit(FailFast, &rdev->flags) &&
> +	    test_bit(R10BIO_FailFast, &r10_bio->state))
> +	        read_bio->bi_opf |= MD_FAILFAST;
> +	read_bio->bi_private = r10_bio;
> +
> +	if (mddev->gendisk)
> +	        trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
> +	                              read_bio, disk_devt(mddev->gendisk),
> +	                              r10_bio->sector);
> +	if (max_sectors < r10_bio->sectors) {
> +		/* Could not read all from this device, so we will
> +		 * need another r10_bio.
> +		 */
> +		sectors_handled = (r10_bio->sector + max_sectors
> +				   - bio->bi_iter.bi_sector);
> +		r10_bio->sectors = max_sectors;
> +		spin_lock_irq(&conf->device_lock);
> +		if (bio->bi_phys_segments == 0)
> +			bio->bi_phys_segments = 2;
> +		else
> +			bio->bi_phys_segments++;
> +		spin_unlock_irq(&conf->device_lock);
> +		/* Cannot call generic_make_request directly
> +		 * as that will be queued in __generic_make_request
> +		 * and subsequent mempool_alloc might block
> +		 * waiting for it.  so hand bio over to raid10d.
> +		 */
> +		reschedule_retry(r10_bio);
> +
> +		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
> +
> +		r10_bio->master_bio = bio;
> +		r10_bio->sectors = bio_sectors(bio) - sectors_handled;
> +		r10_bio->state = 0;
> +		r10_bio->mddev = mddev;
> +		r10_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
> +		goto read_again;
> +	} else
> +		generic_make_request(read_bio);
> +	return;
> +}
> +
> +static void raid10_write_request(struct mddev *mddev, struct bio *bio,
> +				 struct r10bio *r10_bio)
> +{
> +	struct r10conf *conf = mddev->private;
>  	int i;
>  	const int op = bio_op(bio);
> -	const int rw = bio_data_dir(bio);
>  	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
>  	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
>  	unsigned long flags;
>  	struct md_rdev *blocked_rdev;
>  	struct blk_plug_cb *cb;
>  	struct raid10_plug_cb *plug = NULL;
> +	sector_t sectors;
>  	int sectors_handled;
>  	int max_sectors;
> -	int sectors;
>  
>  	md_write_start(mddev, bio);
>  
> @@ -1130,7 +1204,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  		wait_barrier(conf);
>  	}
>  	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
> -	    bio_data_dir(bio) == WRITE &&
>  	    (mddev->reshape_backwards
>  	     ? (bio->bi_iter.bi_sector < conf->reshape_safe &&
>  		bio->bi_iter.bi_sector + sectors > conf->reshape_progress)
> @@ -1147,99 +1220,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  
>  		conf->reshape_safe = mddev->reshape_position;
>  	}
> -
> -	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
> -
> -	r10_bio->master_bio = bio;
> -	r10_bio->sectors = sectors;
> -
> -	r10_bio->mddev = mddev;
> -	r10_bio->sector = bio->bi_iter.bi_sector;
> -	r10_bio->state = 0;
> -
> -	/* We might need to issue multiple reads to different
> -	 * devices if there are bad blocks around, so we keep
> -	 * track of the number of reads in bio->bi_phys_segments.
> -	 * If this is 0, there is only one r10_bio and no locking
> -	 * will be needed when the request completes.  If it is
> -	 * non-zero, then it is the number of not-completed requests.
> -	 */
> -	bio->bi_phys_segments = 0;
> -	bio_clear_flag(bio, BIO_SEG_VALID);
> -
> -	if (rw == READ) {
> -		/*
> -		 * read balancing logic:
> -		 */
> -		struct md_rdev *rdev;
> -		int slot;
> -
> -read_again:
> -		rdev = read_balance(conf, r10_bio, &max_sectors);
> -		if (!rdev) {
> -			raid_end_bio_io(r10_bio);
> -			return;
> -		}
> -		slot = r10_bio->read_slot;
> -
> -		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> -		bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
> -			 max_sectors);
> -
> -		r10_bio->devs[slot].bio = read_bio;
> -		r10_bio->devs[slot].rdev = rdev;
> -
> -		read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
> -			choose_data_offset(r10_bio, rdev);
> -		read_bio->bi_bdev = rdev->bdev;
> -		read_bio->bi_end_io = raid10_end_read_request;
> -		bio_set_op_attrs(read_bio, op, do_sync);
> -		if (test_bit(FailFast, &rdev->flags) &&
> -		    test_bit(R10BIO_FailFast, &r10_bio->state))
> -			read_bio->bi_opf |= MD_FAILFAST;
> -		read_bio->bi_private = r10_bio;
> -
> -		if (mddev->gendisk)
> -			trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
> -					      read_bio, disk_devt(mddev->gendisk),
> -					      r10_bio->sector);
> -		if (max_sectors < r10_bio->sectors) {
> -			/* Could not read all from this device, so we will
> -			 * need another r10_bio.
> -			 */
> -			sectors_handled = (r10_bio->sector + max_sectors
> -					   - bio->bi_iter.bi_sector);
> -			r10_bio->sectors = max_sectors;
> -			spin_lock_irq(&conf->device_lock);
> -			if (bio->bi_phys_segments == 0)
> -				bio->bi_phys_segments = 2;
> -			else
> -				bio->bi_phys_segments++;
> -			spin_unlock_irq(&conf->device_lock);
> -			/* Cannot call generic_make_request directly
> -			 * as that will be queued in __generic_make_request
> -			 * and subsequent mempool_alloc might block
> -			 * waiting for it.  so hand bio over to raid10d.
> -			 */
> -			reschedule_retry(r10_bio);
> -
> -			r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
> -
> -			r10_bio->master_bio = bio;
> -			r10_bio->sectors = bio_sectors(bio) - sectors_handled;
> -			r10_bio->state = 0;
> -			r10_bio->mddev = mddev;
> -			r10_bio->sector = bio->bi_iter.bi_sector +
> -				sectors_handled;
> -			goto read_again;
> -		} else
> -			generic_make_request(read_bio);
> -		return;
> -	}
> -
> -	/*
> -	 * WRITE:
> -	 */
>  	if (conf->pending_count >= max_queued_requests) {
>  		md_wakeup_thread(mddev->thread);
>  		raid10_log(mddev, "wait queued");
> @@ -1300,8 +1280,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  			int bad_sectors;
>  			int is_bad;
>  
> -			is_bad = is_badblock(rdev, dev_sector,
> -					     max_sectors,
> +			is_bad = is_badblock(rdev, dev_sector, max_sectors,
>  					     &first_bad, &bad_sectors);
>  			if (is_bad < 0) {
>  				/* Mustn't write here until the bad block
> @@ -1405,8 +1384,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  			r10_bio->devs[i].bio = mbio;
>  
>  			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr+
> -					   choose_data_offset(r10_bio,
> -							      rdev));
> +					   choose_data_offset(r10_bio, rdev));
>  			mbio->bi_bdev = rdev->bdev;
>  			mbio->bi_end_io	= raid10_end_write_request;
>  			bio_set_op_attrs(mbio, op, do_sync | do_fua);
> @@ -1457,8 +1435,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  			r10_bio->devs[i].repl_bio = mbio;
>  
>  			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr +
> -					   choose_data_offset(
> -						   r10_bio, rdev));
> +					   choose_data_offset(r10_bio, rdev));
>  			mbio->bi_bdev = rdev->bdev;
>  			mbio->bi_end_io	= raid10_end_write_request;
>  			bio_set_op_attrs(mbio, op, do_sync | do_fua);
> @@ -1503,6 +1480,36 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  	one_write_done(r10_bio);
>  }
>  
> +static void __make_request(struct mddev *mddev, struct bio *bio)
> +{
> +	struct r10conf *conf = mddev->private;
> +	struct r10bio *r10_bio;
> +
> +	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
> +
> +	r10_bio->master_bio = bio;
> +	r10_bio->sectors = bio_sectors(bio);
> +
> +	r10_bio->mddev = mddev;
> +	r10_bio->sector = bio->bi_iter.bi_sector;
> +	r10_bio->state = 0;
> +
> +	/* We might need to issue multiple reads to different
> +	 * devices if there are bad blocks around, so we keep
> +	 * track of the number of reads in bio->bi_phys_segments.
> +	 * If this is 0, there is only one r10_bio and no locking
> +	 * will be needed when the request completes.  If it is
> +	 * non-zero, then it is the number of not-completed requests.
> +	 */
> +	bio->bi_phys_segments = 0;
> +	bio_clear_flag(bio, BIO_SEG_VALID);
> +
> +	if (bio_data_dir(bio) == READ)
> +		raid10_read_request(mddev, bio, r10_bio);
> +	else
> +		raid10_write_request(mddev, bio, r10_bio);
> +}
> +
>  static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>  {
>  	struct r10conf *conf = mddev->private;
> -- 
> 2.10.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] md/raid5-cache: removes unnecessary write-through mode judgments
From: Shaohua Li @ 2016-12-16 20:12 UTC (permalink / raw)
  To: JackieLiu; +Cc: songliubraving, linux-raid
In-Reply-To: <20161213055527.3306-1-liuyun01@kylinos.cn>

On Tue, Dec 13, 2016 at 01:55:27PM +0800, JackieLiu wrote:
> The function already returns early in write-through mode, so there is
> no need to check for it again.
applied, thanks!
 
> Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
> ---
>  drivers/md/raid5-cache.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 6d1a150..4dd8e4e 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2418,9 +2418,6 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
>  	if (do_wakeup)
>  		wake_up(&conf->wait_for_overlap);
>  
> -	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
> -		return;
> -
>  	spin_lock_irq(&conf->log->stripe_in_journal_lock);
>  	list_del_init(&sh->r5c);
>  	spin_unlock_irq(&conf->log->stripe_in_journal_lock);
> -- 
> 2.10.2
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 2/2] md/raid10: Refactor raid10_make_request
From: Robert LeBlanc @ 2016-12-16 20:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161216195951.hprhmmruxactaxlw@kernel.org>

Thank you for the feedback. It looked like read_balance already
checked for recovery and so I removed it from the read path here. I
didn't think it needed a comment. I'll be more complete next time.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Dec 16, 2016 at 12:59 PM, Shaohua Li <shli@kernel.org> wrote:
> On Mon, Dec 05, 2016 at 01:02:58PM -0700, Robert LeBlanc wrote:
>> Refactor raid10_make_request into separate read and write functions to
>> clean up the code.
>
> Merged the two patches, thanks!
>
> For this one, you deleted the recovery check in the read path; I added it back.
> Please double check that the merged patch is good. The cleanup is not supposed
> to change behavior, so next time you change something (like the recovery
> check), please mention it.
>
> Thanks,
> Shaohua
>
>
>> Signed-off-by: Robert LeBlanc <robert@leblancnet.us>
>> ---
>>  drivers/md/raid10.c | 215 +++++++++++++++++++++++++++-------------------------
>>  1 file changed, 111 insertions(+), 104 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 525ca99..8698e00 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1087,23 +1087,97 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>>       kfree(plug);
>>  }
>>
>> -static void __make_request(struct mddev *mddev, struct bio *bio)
>> +static void raid10_read_request(struct mddev *mddev, struct bio *bio,
>> +                             struct r10bio *r10_bio)
>>  {
>>       struct r10conf *conf = mddev->private;
>> -     struct r10bio *r10_bio;
>>       struct bio *read_bio;
>> +     const int op = bio_op(bio);
>> +     const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
>> +     int sectors_handled;
>> +     int max_sectors;
>> +     struct md_rdev *rdev;
>> +     int slot;
>> +
>> +     wait_barrier(conf);
>> +
>> +read_again:
>> +     rdev = read_balance(conf, r10_bio, &max_sectors);
>> +     if (!rdev) {
>> +             raid_end_bio_io(r10_bio);
>> +             return;
>> +     }
>> +     slot = r10_bio->read_slot;
>> +
>> +     read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>> +     bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
>> +              max_sectors);
>> +
>> +     r10_bio->devs[slot].bio = read_bio;
>> +     r10_bio->devs[slot].rdev = rdev;
>> +
>> +     read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
>> +             choose_data_offset(r10_bio, rdev);
>> +     read_bio->bi_bdev = rdev->bdev;
>> +     read_bio->bi_end_io = raid10_end_read_request;
>> +     bio_set_op_attrs(read_bio, op, do_sync);
>> +     if (test_bit(FailFast, &rdev->flags) &&
>> +         test_bit(R10BIO_FailFast, &r10_bio->state))
>> +             read_bio->bi_opf |= MD_FAILFAST;
>> +     read_bio->bi_private = r10_bio;
>> +
>> +     if (mddev->gendisk)
>> +             trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
>> +                                   read_bio, disk_devt(mddev->gendisk),
>> +                                   r10_bio->sector);
>> +     if (max_sectors < r10_bio->sectors) {
>> +             /* Could not read all from this device, so we will
>> +              * need another r10_bio.
>> +              */
>> +             sectors_handled = (r10_bio->sector + max_sectors
>> +                                - bio->bi_iter.bi_sector);
>> +             r10_bio->sectors = max_sectors;
>> +             spin_lock_irq(&conf->device_lock);
>> +             if (bio->bi_phys_segments == 0)
>> +                     bio->bi_phys_segments = 2;
>> +             else
>> +                     bio->bi_phys_segments++;
>> +             spin_unlock_irq(&conf->device_lock);
>> +             /* Cannot call generic_make_request directly
>> +              * as that will be queued in __generic_make_request
>> +              * and subsequent mempool_alloc might block
>> +              * waiting for it.  so hand bio over to raid10d.
>> +              */
>> +             reschedule_retry(r10_bio);
>> +
>> +             r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
>> +
>> +             r10_bio->master_bio = bio;
>> +             r10_bio->sectors = bio_sectors(bio) - sectors_handled;
>> +             r10_bio->state = 0;
>> +             r10_bio->mddev = mddev;
>> +             r10_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
>> +             goto read_again;
>> +     } else
>> +             generic_make_request(read_bio);
>> +     return;
>> +}
>> +
>> +static void raid10_write_request(struct mddev *mddev, struct bio *bio,
>> +                              struct r10bio *r10_bio)
>> +{
>> +     struct r10conf *conf = mddev->private;
>>       int i;
>>       const int op = bio_op(bio);
>> -     const int rw = bio_data_dir(bio);
>>       const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
>>       const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
>>       unsigned long flags;
>>       struct md_rdev *blocked_rdev;
>>       struct blk_plug_cb *cb;
>>       struct raid10_plug_cb *plug = NULL;
>> +     sector_t sectors;
>>       int sectors_handled;
>>       int max_sectors;
>> -     int sectors;
>>
>>       md_write_start(mddev, bio);
>>
>> @@ -1130,7 +1204,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>               wait_barrier(conf);
>>       }
>>       if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
>> -         bio_data_dir(bio) == WRITE &&
>>           (mddev->reshape_backwards
>>            ? (bio->bi_iter.bi_sector < conf->reshape_safe &&
>>               bio->bi_iter.bi_sector + sectors > conf->reshape_progress)
>> @@ -1147,99 +1220,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>
>>               conf->reshape_safe = mddev->reshape_position;
>>       }
>> -
>> -     r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
>> -
>> -     r10_bio->master_bio = bio;
>> -     r10_bio->sectors = sectors;
>> -
>> -     r10_bio->mddev = mddev;
>> -     r10_bio->sector = bio->bi_iter.bi_sector;
>> -     r10_bio->state = 0;
>> -
>> -     /* We might need to issue multiple reads to different
>> -      * devices if there are bad blocks around, so we keep
>> -      * track of the number of reads in bio->bi_phys_segments.
>> -      * If this is 0, there is only one r10_bio and no locking
>> -      * will be needed when the request completes.  If it is
>> -      * non-zero, then it is the number of not-completed requests.
>> -      */
>> -     bio->bi_phys_segments = 0;
>> -     bio_clear_flag(bio, BIO_SEG_VALID);
>> -
>> -     if (rw == READ) {
>> -             /*
>> -              * read balancing logic:
>> -              */
>> -             struct md_rdev *rdev;
>> -             int slot;
>> -
>> -read_again:
>> -             rdev = read_balance(conf, r10_bio, &max_sectors);
>> -             if (!rdev) {
>> -                     raid_end_bio_io(r10_bio);
>> -                     return;
>> -             }
>> -             slot = r10_bio->read_slot;
>> -
>> -             read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>> -             bio_trim(read_bio, r10_bio->sector - bio->bi_iter.bi_sector,
>> -                      max_sectors);
>> -
>> -             r10_bio->devs[slot].bio = read_bio;
>> -             r10_bio->devs[slot].rdev = rdev;
>> -
>> -             read_bio->bi_iter.bi_sector = r10_bio->devs[slot].addr +
>> -                     choose_data_offset(r10_bio, rdev);
>> -             read_bio->bi_bdev = rdev->bdev;
>> -             read_bio->bi_end_io = raid10_end_read_request;
>> -             bio_set_op_attrs(read_bio, op, do_sync);
>> -             if (test_bit(FailFast, &rdev->flags) &&
>> -                 test_bit(R10BIO_FailFast, &r10_bio->state))
>> -                     read_bio->bi_opf |= MD_FAILFAST;
>> -             read_bio->bi_private = r10_bio;
>> -
>> -             if (mddev->gendisk)
>> -                     trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
>> -                                           read_bio, disk_devt(mddev->gendisk),
>> -                                           r10_bio->sector);
>> -             if (max_sectors < r10_bio->sectors) {
>> -                     /* Could not read all from this device, so we will
>> -                      * need another r10_bio.
>> -                      */
>> -                     sectors_handled = (r10_bio->sector + max_sectors
>> -                                        - bio->bi_iter.bi_sector);
>> -                     r10_bio->sectors = max_sectors;
>> -                     spin_lock_irq(&conf->device_lock);
>> -                     if (bio->bi_phys_segments == 0)
>> -                             bio->bi_phys_segments = 2;
>> -                     else
>> -                             bio->bi_phys_segments++;
>> -                     spin_unlock_irq(&conf->device_lock);
>> -                     /* Cannot call generic_make_request directly
>> -                      * as that will be queued in __generic_make_request
>> -                      * and subsequent mempool_alloc might block
>> -                      * waiting for it.  so hand bio over to raid10d.
>> -                      */
>> -                     reschedule_retry(r10_bio);
>> -
>> -                     r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
>> -
>> -                     r10_bio->master_bio = bio;
>> -                     r10_bio->sectors = bio_sectors(bio) - sectors_handled;
>> -                     r10_bio->state = 0;
>> -                     r10_bio->mddev = mddev;
>> -                     r10_bio->sector = bio->bi_iter.bi_sector +
>> -                             sectors_handled;
>> -                     goto read_again;
>> -             } else
>> -                     generic_make_request(read_bio);
>> -             return;
>> -     }
>> -
>> -     /*
>> -      * WRITE:
>> -      */
>>       if (conf->pending_count >= max_queued_requests) {
>>               md_wakeup_thread(mddev->thread);
>>               raid10_log(mddev, "wait queued");
>> @@ -1300,8 +1280,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>                       int bad_sectors;
>>                       int is_bad;
>>
>> -                     is_bad = is_badblock(rdev, dev_sector,
>> -                                          max_sectors,
>> +                     is_bad = is_badblock(rdev, dev_sector, max_sectors,
>>                                            &first_bad, &bad_sectors);
>>                       if (is_bad < 0) {
>>                               /* Mustn't write here until the bad block
>> @@ -1405,8 +1384,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>                       r10_bio->devs[i].bio = mbio;
>>
>>                       mbio->bi_iter.bi_sector = (r10_bio->devs[i].addr+
>> -                                        choose_data_offset(r10_bio,
>> -                                                           rdev));
>> +                                        choose_data_offset(r10_bio, rdev));
>>                       mbio->bi_bdev = rdev->bdev;
>>                       mbio->bi_end_io = raid10_end_write_request;
>>                       bio_set_op_attrs(mbio, op, do_sync | do_fua);
>> @@ -1457,8 +1435,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>                       r10_bio->devs[i].repl_bio = mbio;
>>
>>                       mbio->bi_iter.bi_sector = (r10_bio->devs[i].addr +
>> -                                        choose_data_offset(
>> -                                                r10_bio, rdev));
>> +                                        choose_data_offset(r10_bio, rdev));
>>                       mbio->bi_bdev = rdev->bdev;
>>                       mbio->bi_end_io = raid10_end_write_request;
>>                       bio_set_op_attrs(mbio, op, do_sync | do_fua);
>> @@ -1503,6 +1480,36 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>       one_write_done(r10_bio);
>>  }
>>
>> +static void __make_request(struct mddev *mddev, struct bio *bio)
>> +{
>> +     struct r10conf *conf = mddev->private;
>> +     struct r10bio *r10_bio;
>> +
>> +     r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
>> +
>> +     r10_bio->master_bio = bio;
>> +     r10_bio->sectors = bio_sectors(bio);
>> +
>> +     r10_bio->mddev = mddev;
>> +     r10_bio->sector = bio->bi_iter.bi_sector;
>> +     r10_bio->state = 0;
>> +
>> +     /* We might need to issue multiple reads to different
>> +      * devices if there are bad blocks around, so we keep
>> +      * track of the number of reads in bio->bi_phys_segments.
>> +      * If this is 0, there is only one r10_bio and no locking
>> +      * will be needed when the request completes.  If it is
>> +      * non-zero, then it is the number of not-completed requests.
>> +      */
>> +     bio->bi_phys_segments = 0;
>> +     bio_clear_flag(bio, BIO_SEG_VALID);
>> +
>> +     if (bio_data_dir(bio) == READ)
>> +             raid10_read_request(mddev, bio, r10_bio);
>> +     else
>> +             raid10_write_request(mddev, bio, r10_bio);
>> +}
>> +
>>  static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>>  {
>>       struct r10conf *conf = mddev->private;
>> --
>> 2.10.2

^ permalink raw reply

* root in grub for raid1
From: Egbert Bouwman @ 2016-12-16 21:14 UTC (permalink / raw)
  To: linux-raid

Newbie, but this list seems to be for developers.
If that is true I'll ask my one question, and then retire.

Actually it is a question about grub, but I think the raid specialists
know more about this grub problem than the grubbers do.

Setting up raid1 for /dev/sda and /dev/sdb (actually for missing and /dev/sdb)
on /dev/md/data succeeded, but now I have to do a dpkg-reconfigure
grub-efi-amd64 and I don't know how to specify root in the two grub lines:
	set root=...
and the linux command line
	linux root=
Please note that I chose md/data, and not the common md0.

egbert
-- 
Egbert Bouwman - Keizersgracht 197 II - 1016 DS Amsterdam - 020 6257991

^ permalink raw reply

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Shaohua Li @ 2016-12-16 23:24 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: Jes Sorensen, NeilBrown, linux-raid
In-Reply-To: <c97a46de-f88a-f8f2-4520-e3c8a57d1b3b@intel.com>

On Thu, Dec 15, 2016 at 12:44:57PM +0100, Artur Paszkiewicz wrote:
> On 12/14/2016 08:47 PM, Shaohua Li wrote:
> > On Tue, Dec 13, 2016 at 10:25:04AM -0500, Jes Sorensen wrote:
> >> Shaohua Li <shli@kernel.org> writes:
> >>> On Wed, Dec 07, 2016 at 03:36:01PM +0100, Artur Paszkiewicz wrote:
> >>>> On 12/07/2016 01:32 AM, NeilBrown wrote:
> >>>>>
> >>>>> I would expect to see a description of what a PPL actually is and how
> >>>>> it works here... but there is none.
> >>>>>
> >>>>> The change-log for patch 06 has a tiny bit more information which is
> >>>>> just enough to be able to start trying to understand the code, but it
> >>>>> isn't much.
> >>>>> And none of this description gets into the code, or into the
> >>>>> Documentation/.  This makes it hard to review and hard to maintain.
> >>>>>
> >>>>> Remember: if you want people to review your code, it is in your interest
> >>>>> to make it easy.  That means give lots of details.
> >>>>
> >>>> Hi Neil,
> >>>>
> >>>> Thank you for taking the time to look at this and for your feedback. I
> >>>> didn't try to make it hard to review... Sometimes it's easy to forget
> >>>> how non-obvious things are after looking at them for too long :) I will
> >>>> improve the descriptions and address the issues that you found in the
> >>>> next version of the patches.
> >>>
> >>> Haven't looked at the patches yet, being busy recently, sorry! When you repost
> >>> these, I'd like to know why we need another log for raid5 considering we
> >>> already have one to fix a similar issue. What are the good and bad sides of
> >>> this new log? That Intel RSTe has such a feature doesn't sound like a
> >>> technical reason we should support this.
> >>
> >> Shaohua,
> >>
> >> Any further thought on these patches? I am considering doing a release
> >> of mdadm early in the new year. It would be nice to include these
> >> patches if the feature is going in.
> >>
> >> As for supporting it, if IMSM supports it and it is used in the field,
> >> then it seems legitimate for Linux to support it too. Just like we
> >> support so many other obscure pieces of hardware :)
> > 
> > Sure, I don't object to supporting it. I just need to understand how it works.
> > I had a brief review. The on-disk format looks good; that is probably mostly
> > relevant to mdadm. The disk format has an alignment issue, as Neil noted, which
> > would be unfriendly to non-x86 architectures. Will we stick to this disk format
> > or change it? We should make a decision.
> 
> This alignment issue will be fixed by extending the 'parity_disk' field
> to 4 bytes. The 'checksum' field will then be properly aligned and the
> size of the structure will be 24 bytes, also fixing the array alignment.
> 
> > For the implementation, I don't understand much of how the PPL works; there
> > aren't many details there. Two things I noted:
> > 
> > - The code skips the log for full stripe writes. This isn't good. It means that
> >   after an unclean shutdown/recovery, one disk can have arbitrary data, neither
> >   the old data nor the new data. This breaks an assumption filesystems make:
> >   after a failed write to a sector, the sector has either old or new data. Think
> >   about a write to the superblock. The data could be the old or the new
> >   superblock, but it's still a superblock, not something random.
> > 
> > - From patches 6 & 10, it looks like PPL only helps recover unwritten disks. If
> >   one disk of a stripe is dirty (e.g. it was written before the unclean
> >   shutdown), and it's lost in recovery, what will happen? It seems the data of
> >   the lost disk will be read as 0? That would break the assumption above too. If
> >   I understand the code correctly (maybe not, need clarification), this is a
> >   design flaw.
> 
> PPL is only used to update the parity for a stripe, data chunks are not
> modified at all during PPL recovery. The assumption was that it would
> protect only from silent data corruption, to eliminate the cases when
> data that was not touched by a write request could change. So if a dirty
> disk is lost, no recovery is performed for this stripe (parity is not
> updated). For full stripe write we only recalculate the parity after a
> dirty shutdown if all disks are available (like resync). So you are
> right that it is still possible to have arbitrary data in the written
> part of a stripe if that disk is lost. In such case the behavior is the
> same as in plain raid5.

Ok, this matches my understanding. This isn't a complete solution but does
help a lot. If users want to use this, there is no reason not to support it.
After you fix the alignment issue and describe the solution in detail, I'll
look at it again.
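
A minimal sketch of the partial-parity idea discussed above, assuming the
usual RAID5 XOR parity and a hypothetical stripe of three data chunks. The
chunk values and the helper name xor_chunks are illustrative only and are
not taken from the patches. It shows that the logged XOR of the untouched
chunks is enough to make the parity consistent again without rewriting any
data chunk:

```python
from functools import reduce


def xor_chunks(*chunks: bytes) -> bytes:
    """XOR equally sized chunks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Hypothetical stripe: three data chunks plus one parity chunk.
d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x44" * 4
parity_old = xor_chunks(d0, d1, d2)

# A write updates only d0. The partial parity covers the untouched chunks
# and is logged before the write is issued.
partial_parity = xor_chunks(d1, d2)
d0_new = b"\x99" * 4

# Crash: d0_new reached the disk, the parity update did not.
# Recovery recomputes parity from the log and whatever d0 now holds;
# the data chunks themselves are never rewritten.
parity_recovered = xor_chunks(partial_parity, d0_new)

# With consistent parity, an untouched chunk (d1) can be rebuilt later.
d1_rebuilt = xor_chunks(parity_recovered, d0_new, d2)

# With the stale parity the rebuilt chunk would be silently wrong.
d1_from_stale_parity = xor_chunks(parity_old, d0_new, d2)
```

As noted above, this does not protect the chunk that was being written: if
the disk holding d0 is lost, its content cannot be reconstructed from the
log.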

Thanks,
Shaohua

^ permalink raw reply

* Linux software RAID becomes unresponsive after removing a disk from server
From: PHP-Friends GmbH @ 2016-12-17  1:23 UTC (permalink / raw)
  To: linux-raid

Hello everyone,

First of all: this is a cross-post from Server Fault
(http://serverfault.com/questions/821195/linux-software-raid-becomes-unresponsive-after-removing-a-disk-from-server).
The user shodanshok recommended contacting this mailing list because
to him this looks like a possible bug in the Linux RAID software. I can
provide more logs and information if needed, but as the text is already
quite long I thought this would be enough for the moment.

I am running a CentOS 7 machine (standard kernel: 
3.10.0-327.36.3.el7.x86_64) with a software RAID-10 over 16x 1 TB SSDs 
(to be more precise, there are two RAID arrays on the disks; one of the 
arrays is providing the host's swap partition). Last week, an SSD failed:

13:18:07 kvm7 kernel: sd 1:0:2:0: attempting task abort! scmd(ffff887e57b916c0)
13:18:07 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
13:18:07 kvm7 kernel: scsi target1:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
13:18:07 kvm7 kernel: scsi target1:0:2: enclosure_logical_id(0x500304801c14a001), slot(2)
13:18:10 kvm7 kernel: sd 1:0:2:0: task abort: SUCCESS scmd(ffff887e57b916c0)
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Sense Key : Not Ready [current]
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] Add. Sense: Logical unit not ready, cause not reportable
13:18:11 kvm7 kernel: sd 1:0:2:0: [sdk] CDB: Write(10) 2a 08 02 55 20 08 00 00 01 00
13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
13:18:11 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
13:18:11 kvm7 kernel: md: super_written gets error=-5, uptodate=0
13:18:11 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
13:19:27 kvm7 kernel: sd 1:0:2:0: device_blocked, handle(0x000b)
13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronizing SCSI cache
13:19:29 kvm7 kernel: md: md3 still in use.
13:19:29 kvm7 kernel: sd 1:0:2:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
13:19:29 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
13:19:29 kvm7 kernel: md: md2 still in use.
13:19:29 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
13:19:29 kvm7 kernel: md: unbind<sdk3>
13:19:29 kvm7 kernel: md: export_rdev(sdk3)
13:19:29 kvm7 kernel: md: unbind<sdk2>
13:19:29 kvm7 kernel: md: export_rdev(sdk2)
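
The CDB bytes in the log above can be cross-checked against the reported
sector. A minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10)
layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte transfer
length, control); the function name decode_cdb10 is made up for
illustration:

```python
import struct


def decode_cdb10(hex_bytes: str) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as hex pairs."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Big-endian, no padding: 1 + 1 + 4 + 1 + 2 + 1 = 10 bytes.
    opcode, flags, lba, _group, length, _control = struct.unpack(">BBIBHB", cdb)
    names = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    return {
        "op": names.get(opcode, hex(opcode)),
        "fua": bool(flags & 0x08),  # Force Unit Access bit
        "lba": lba,                 # starting logical block address
        "blocks": length,           # transfer length in logical blocks
    }


# The write that failed in the log above.
failed_write = decode_cdb10("2a 08 02 55 20 08 00 00 01 00")
```

For the failed write this gives LBA 39133192 with a transfer length of one
block and the FUA bit set, which matches the sector in the
blk_update_request lines.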

/proc/mdstat looked as expected (1 faulty member) and the VMs kept 
running without any problems.

md3 : active raid10 sdp3[15] sdb3[2] sdg3[12] sde3[8] sdn3[11] sdl3[7] sdm3[9] sdf3[10] sdi3[1] sdk3[5](F) sdc3[4] sdd3[6] sdh3[14] sdo3[13] sda3[0] sdj3[3]
   7844052992 blocks super 1.2 128K chunks 2 near-copies [16/15] [UUUUU_UUUUUUUUUU]
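
A small sketch of how the [16/15] [UUUUU_UUUUUUUUUU] status can be checked
from a script, assuming the two counters mean total and active members and
that '_' marks a missing slot. parse_md_status is a hypothetical helper, not
part of mdadm:

```python
import re

# Matches e.g. "[16/15] [UUUUU_UUUUUUUUUU]", also when split across lines.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_md_status(text: str) -> dict:
    """Extract member counts and missing slots from an mdstat status line."""
    match = STATUS_RE.search(text)
    if match is None:
        raise ValueError("no [n/m] [UU_] status found")
    total, active = int(match.group(1)), int(match.group(2))
    flags = match.group(3)
    return {
        "total": total,
        "active": active,
        "missing_slots": [i for i, c in enumerate(flags) if c == "_"],
        "degraded": active < total,
    }


sample = ("7844052992 blocks super 1.2 128K chunks 2 near-copies "
          "[16/15] [UUUUU_UUUUUUUUUU]")
status = parse_md_status(sample)
```

For the output above this reports one missing slot at position 5, which is
where sdk3[5](F) is listed.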

The SSD had to be temporarily replaced with a bigger SSD as no 1 TB SSD
was available; we did so, started the rebuild and everything was fine.
Today the "right" SSD arrived, so the datacenter technician just pulled
the tray containing the temporary SSD and the system became unresponsive
within seconds. While the host itself kept running fine on a separate RAID
array, the VMs were unable to perform I/O. The load increased to > 800.
I was able to execute mdadm --detail /dev/md3, which showed a degraded
(but active / clean) array, so from this point of view the system was
absolutely fine. I tried to remove the faulty / missing drive from the
array, which of course failed ("no such device"), and suddenly even
mdadm --detail /dev/md3 didn't generate any output anymore; it simply
hung and I had to kill the SSH session to get out of this. After this,
I decided to force a reboot as I didn't even know how to remove this
faulty drive from the array, and everything came up correctly. Of
course the RAID was still degraded and needed a resync, but apart from
that: no problems.

I'm pretty sure that I should have removed the drive via mdadm after a
--set-faulty before pulling the tray out of the rack, though I can't
explain this behaviour of mdraid. In my opinion we "simulated" a regular
disk outage, so does anybody have an idea what caused this issue and how
I can make sure that the next regular disk outage won't cause the same
problem?
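
The planned-removal sequence mentioned above could be scripted roughly as
follows. This is only a sketch: it builds the mdadm command lines without
running them, the helper names are made up, and it assumes the member
partitions are sdk2 and sdk3 as in the logs. According to the mdadm man
page, --set-faulty is the same as --fail, and --remove also accepts the
keyword "detached" for devices that are no longer connected. Whether any of
this would have avoided the hang is not known.

```python
def planned_removal_commands(array, member):
    """Commands to run, in order, before physically pulling `member`."""
    return [
        ["mdadm", array, "--fail", member],    # --set-faulty is an alias
        ["mdadm", array, "--remove", member],
    ]


def detached_cleanup_command(array):
    """Drop members whose device node has already disappeared."""
    return ["mdadm", array, "--remove", "detached"]


# The disk carries one member of each array, so both need handling.
members = [("/dev/md2", "/dev/sdk2"), ("/dev/md3", "/dev/sdk3")]
plan = [cmd for array, dev in members
        for cmd in planned_removal_commands(array, dev)]
```

Each entry of plan is an argument list that could be passed to
subprocess.run() by a caller with root privileges.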

The kernel logged some messages, and what I find interesting is that the 
new device came up as sdq while the pulled device was known as sdk. So I 
assume that sdk was not kicked correctly from the array. When the 
initial SSD failure happened last week, I didn't see this behaviour; so 
the replacement drive also came up as sdk.

The log also shows 7 minutes between the failure of the old and the
insertion of the new SSD, so I don't think that a problem like the one
described at
http://superuser.com/questions/942886/fail-device-in-md-raid-when-ata-stops-responding
took place. Also, the VMs went down immediately and not 7 minutes later.
So, any thoughts on that? They would be greatly appreciated :)

11:45:36 kvm7 kernel: sd 1:0:8:0: device_blocked, handle(0x000b)
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069640
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069648
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069656
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069664
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069672
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069680
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069688
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069696
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069704
11:45:37 kvm7 kernel: md/raid10:md3: sdk3: rescheduling sector 4072069712
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] CDB: Read(10) 28 00 20 af f7 08 00 00 08 00
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 548402952
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 0
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133192
11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
11:45:37 kvm7 kernel: md/raid10:md3: Disk failure on sdk3, disabling device.#012md/raid10:md3: Operation continuing on 15 devices.
11:45:37 kvm7 kernel: md: md2 still in use.
11:45:37 kvm7 kernel: md/raid10:md2: Disk failure on sdk2, disabling device.#012md/raid10:md2: Operation continuing on 15 devices.
11:45:37 kvm7 kernel: blk_update_request: I/O error, dev sdk, sector 39133264
11:45:37 kvm7 kernel: md: super_written gets error=-5, uptodate=0
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronizing SCSI cache
11:45:37 kvm7 kernel: sd 1:0:8:0: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
11:45:37 kvm7 kernel: mpt3sas1: removing handle(0x000b), sas_addr(0x4433221102000000)
11:45:37 kvm7 kernel: md: unbind<sdk2>
11:45:37 kvm7 kernel: md: export_rdev(sdk2)
11:48:00 kvm7 kernel: INFO: task md3_raid10:1293 blocked for more than 120 seconds.
11:48:00 kvm7 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
11:48:00 kvm7 kernel: md3_raid10      D ffff883f26e55c00     0 1293      2 0x00000000
11:48:00 kvm7 kernel: ffff887f24bd7c58 0000000000000046 ffff887f212eb980 ffff887f24bd7fd8
11:48:00 kvm7 kernel: ffff887f24bd7fd8 ffff887f24bd7fd8 ffff887f212eb980 ffff887f23514400
11:48:00 kvm7 kernel: ffff887f235144dc 0000000000000001 ffff887f23514500 ffff8807fa4c4300
11:48:00 kvm7 kernel: Call Trace:
11:48:00 kvm7 kernel: [<ffffffff8163bb39>] schedule+0x29/0x70
11:48:00 kvm7 kernel: [<ffffffffa0104ef7>] freeze_array+0xb7/0x180 [raid10]
11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
11:48:00 kvm7 kernel: [<ffffffffa010880d>] handle_read_error+0x2bd/0x360 [raid10]
11:48:00 kvm7 kernel: [<ffffffff812c7412>] ? generic_make_request+0xe2/0x130
11:48:00 kvm7 kernel: [<ffffffffa0108a1d>] raid10d+0x16d/0x1440 [raid10]
11:48:00 kvm7 kernel: [<ffffffff814bb785>] md_thread+0x155/0x1a0
11:48:00 kvm7 kernel: [<ffffffff810a6b80>] ? wake_up_atomic_t+0x30/0x30
11:48:00 kvm7 kernel: [<ffffffff814bb630>] ? md_safemode_timeout+0x50/0x50
11:48:00 kvm7 kernel: [<ffffffff810a5b8f>] kthread+0xcf/0xe0
11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
11:48:00 kvm7 kernel: [<ffffffff81646a98>] ret_from_fork+0x58/0x90
11:48:00 kvm7 kernel: [<ffffffff810a5ac0>] ? kthread_create_on_node+0x140/0x140
11:48:00 kvm7 kernel: INFO: task qemu-kvm:26929 blocked for more than 120 seconds.

[several messages for stuck qemu-kvm processes]

11:52:42 kvm7 kernel: scsi 1:0:9:0: Direct-Access     ATA KINGSTON SKC400S 001A PQ: 0 ANSI: 6
11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: handle(0x000b), sas_addr(0x4433221102000000), phy(2), device_name(0x4d6b497569a68ba2)
11:52:42 kvm7 kernel: scsi 1:0:9:0: SATA: enclosure_logical_id(0x500304801c14a001), slot(2)
11:52:42 kvm7 kernel: scsi 1:0:9:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
11:52:42 kvm7 kernel: scsi 1:0:9:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
11:52:42 kvm7 kernel: sd 1:0:9:0: Attached scsi generic sg10 type 0
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] 2000409264 512-byte logical blocks: (1.02 TB/953 GiB)
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write Protect is off
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Write cache: enabled, read cache: enabled, supports DPO and FUA
11:52:42 kvm7 kernel: sdq: unknown partition table
11:52:42 kvm7 kernel: sd 1:0:9:0: [sdq] Attached SCSI disk

Best Regards,
Tim

^ permalink raw reply

* Inconsistent use of sectors vs 1k-blocks in sysfs and other places
From: John Brooks @ 2016-12-17  5:14 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown

I noticed that the md.rst (formerly md.txt) file in Documentation/ says that
the component_size attribute in sysfs is measured in sectors. What the
attribute actually returns is sectors/2 (1K blocks, perhaps?). This is the
relevant code from md.c:

static ssize_t
size_show(struct mddev *mddev, char *page)
{
        return sprintf(page, "%llu\n",
                (unsigned long long)mddev->dev_sectors / 2);
}

So the documentation doesn't match the code. Obviously, that needs fixing. But
in this case, I'm not sure which one is "wrong". mdadm's get_component_size()
function in sysfs.c reads this file and multiplies the result by 2 to get
sectors. So clearly this is a known behaviour and not just a forgotten typo.
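
A minimal sketch of that conversion, assuming the value is read as text from
/sys/block/mdX/md/component_size and that a sector is 512 bytes. The
function names are illustrative and are not mdadm's:

```python
SECTOR_BYTES = 512  # assumed sector size used by md


def component_size_to_sectors(sysfs_value: str) -> int:
    """Convert the text read from component_size (1K blocks) to sectors."""
    return int(sysfs_value.strip()) * 2


def component_size_to_bytes(sysfs_value: str) -> int:
    """Same value expressed in bytes."""
    return component_size_to_sectors(sysfs_value) * SECTOR_BYTES


# Example with a made-up attribute value of 1024 (i.e. 1 MiB).
example_sectors = component_size_to_sectors("1024\n")
```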

Looking further, I found that this seems to be a point of inconsistency in
multiple areas:

- The per-device "offset" attribute is in sectors (as documented).
  The per-device "size" attribute is in 1K blocks (which the documentation
  doesn't specify).

- The "sync_completed" attribute uses sectors.
  The "mdstat" file in procfs uses 1K blocks.

- mdadm --examine shows Avail Dev Size in sectors.
  mdadm --detail shows Used Dev Size in 1K blocks.
  And they both show Array Size in 1K blocks.

I suspect that the sysfs attributes have stayed the way they are in the
interest of not breaking programs that use them. The easiest solution would
probably be to leave the behaviour as-is but update the documentation so it's
clear what units are used where.
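
One way to write that down is a lookup table. The sketch below only encodes
the units listed in this message (it is not an audit of the code), and the
key names are invented labels rather than real paths:

```python
# Units as observed above; "1k" means 1K blocks, i.e. two 512-byte sectors.
UNITS = {
    "sysfs md/component_size": "1k",
    "sysfs per-device offset": "sectors",
    "sysfs per-device size": "1k",
    "sysfs md/sync_completed": "sectors",
    "procfs mdstat blocks": "1k",
    "mdadm --examine Avail Dev Size": "sectors",
    "mdadm --detail Used Dev Size": "1k",
    "mdadm Array Size": "1k",
}


def to_sectors(source: str, value: int) -> int:
    """Normalise a value from one of the sources above to sectors."""
    unit = UNITS[source]  # KeyError for sources not listed here
    return value * 2 if unit == "1k" else value
```

A caller would then compare, say, to_sectors("sysfs per-device size", n)
with a per-device offset without having to remember which one is halved.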

Somewhat related, suspend_{lo|hi}, resync_{min|max} attributes specify ranges
in sectors, but the documentation does not specify if they are ranges on the
array size or device size. And RAID10 may even handle resync_max differently
from the rest; I didn't look deeply into that but see commit c805cdecea.

The reason I found out about the component_size issue is that I was trying to
make use of the min/max attributes in a project I'm working on, found that they
wanted device-based sectors, and then found that none of the sysfs attributes
actually tell you the device size in sectors (or the array size for that
matter, by the way).

I'm interested to hear what the developers think. I didn't do a thorough audit
to find all the different places that the different units are used; I just
pointed out a few that I noticed while working with it. So I think it's a good
discussion to bring up with the people familiar with the code.

P.S. First mailing list post :)

^ permalink raw reply

* Re: root in grub for raid1
From: Robert L Mathews @ 2016-12-17 19:46 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20161216221414.7c67f80b@para.lan>

On 12/16/16 1:14 PM, Egbert Bouwman wrote:
> Newbie, but this list seems to be for developers.
> If that is true I'll ask my one question, and then retire.

It's for both, so this is fine.


> Actually it is a question about grub, but I think the raid specialists
> know more about this grub problem than the grubbers do.
> 
> Setting up raid1 for /dev/sda and /dev/sdb (actually for missing and /dev/sdb)
> on /dev/md/data succeeded, but now I have to do a dpkg-reconfigure
> grub-efi-amd64 and I don't know how to specify root in the two grub lines:
> 	set root=...
> and the linux command line
> 	linux root=
> Please note that I chose md/data, and not the common md0.

Here's how Debian's installer automatically set it up on one of our
machines, which should help.

df:
/dev/md0      497817      30722    456815   7% /boot


mdadm --detail /dev/md0:
 0       8        1        0      active sync   /dev/sda1
 1       8       17        1      active sync   /dev/sdb1
 2       8       33        2      active sync   /dev/sdc1


blkid:
/dev/sda1: UUID="97dad9d1-ffbd-302a-f0d9-91deb32240d0" UUID_SUB="cf4cfce6-2f61-d15c-2ebf-3aea07f9644c" TYPE="linux_raid_member"
/dev/sdb1: UUID="97dad9d1-ffbd-302a-f0d9-91deb32240d0" UUID_SUB="c879211d-9efa-3f08-6d54-0f2f3249a839" TYPE="linux_raid_member"
/dev/sdc1: UUID="97dad9d1-ffbd-302a-f0d9-91deb32240d0" UUID_SUB="51e0ab31-8fb4-8be3-7626-3ab449732c24" TYPE="linux_raid_member"

/dev/md0: UUID="4e19f8d1-d373-42f1-832b-4da7c1a72930" TYPE="ext3"


And finally, here are the grub.cfg lines:

 set root='(mduuid/97dad9d1ffbd302af0d991deb32240d0)'
 search --no-floppy --fs-uuid --set=root 4e19f8d1-d373-42f1-832b-4da7c1a72930
 echo    'Loading Linux 3.2.0-4-amd64 ...'
 linux   /vmlinuz-3.2.0-4-amd64 root=UUID=04324b83-7740-49cd-8866-cb88e1a54845 ro  quiet radeon.modeset=0

So it appears the first "set root=" line should be the blkid UUID of the
underlying partitions (/dev/sdb in your case?), and the second "search
--set=root" should be the UUID of the filesystem on the RAID device
(/dev/md/data in your case?).

Works on our machines, anyway. <shrug>
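
The relationship between the blkid output and the first grub line can be
sketched as follows, assuming grub's mduuid name is simply the array UUID
with the separators removed, which is what the example above suggests.
grub_mduuid is a hypothetical helper:

```python
import re


def grub_mduuid(array_uuid: str) -> str:
    """Strip separators from an md array UUID and wrap it for `set root=`."""
    hex_only = re.sub(r"[-:]", "", array_uuid.strip().lower())
    if not re.fullmatch(r"[0-9a-f]{32}", hex_only):
        raise ValueError(f"not a 128-bit UUID: {array_uuid!r}")
    return f"(mduuid/{hex_only})"


# UUID of the member partitions as printed by blkid above.
root = grub_mduuid("97dad9d1-ffbd-302a-f0d9-91deb32240d0")
```

The colon-separated form printed by mdadm is accepted as well, since only
the hex digits are kept.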

-- 
Robert L Mathews, Tiger Technologies, http://www.tigertech.net/

^ permalink raw reply

* Re: root in grub for raid1
From: Wols Lists @ 2016-12-17 21:11 UTC (permalink / raw)
  To: Egbert Bouwman, linux-raid
In-Reply-To: <20161216221414.7c67f80b@para.lan>

On 16/12/16 21:14, Egbert Bouwman wrote:
> Newbie, but this list seems to be for developers.
> If that is true I'll ask my one question, and then retire.
> 
> Actually it is a question about grub, but I think the raid specialists
> know more about this grub problem than the grubbers do.
> 
> Setting up raid1 for /dev/sda and /dev/sdb (actually for missing and /dev/sdb)
> on /dev/md/data succeeded, but now I have to do a dpkg-reconfigure
> grub-efi-amd64 and I don't know how to specify root in the two grub lines:
> 	set root=...
> and the linux command line
> 	linux root=
> Please note that I chose md/data, and not the common md0.
> 
> egbert
> 
The first thing I noticed in your post was "grub-efi-amd" - I thought on
modern UEFI systems, you used UEFI and not grub to boot linux ...

But, as another datapoint, I boot from almost exactly the same setup as
you, two mirrored disks on a GPT/BIOS system. My grub entry is

menuentry 'Gentoo GNU/Linux, with Linux 4.4.6-gentoo' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.4.6-gentoo-advanced-ab538350-d249-413b-86ef-4bd5280600b8' {
        load_video
        insmod gzio
        insmod part_gpt
        insmod part_gpt
        insmod diskfilter
        insmod mdraid1x
        insmod ext2
        set root='mduuid/69270eaca840f6e70199064bd5863c5d'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint='mduuid/69270eaca840f6e70199064bd5863c5d' ab538350-d249-413b-86ef-4bd5280600b8
        else
          search --no-floppy --fs-uuid --set=root ab538350-d249-413b-86ef-4bd5280600b8
        fi
        echo    'Loading Linux 4.4.6-gentoo ...'
        linux   /boot/vmlinuz-4.4.6-gentoo root=UUID=ab538350-d249-413b-86ef-4bd5280600b8 ro  domdadm
        echo    'Loading initial ramdisk ...'
        initrd  /boot/initramfs-genkernel-x86_64-4.4.6-gentoo
}

Cheers,
Wol

^ permalink raw reply

* Re: (user) Help needed: mdadm seems to constantly touch my disks
From: Jure Erznožnik @ 2016-12-18 19:40 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <CAJ=9zifGPd+3YnP1aLwLPXjtpygRc9DsDP37CNz-VHBHn1vDOg@mail.gmail.com>

My further attempts to solve this issue include the following (all
unsuccessful):

1. Installed a fresh Ubuntu and assembled the array.
2. Installed OpenSUSE and assembled the array.
3. Tore the array down and created it anew from scratch (it now has a new
UUID, but the data seems to have been preserved, so my bcache / LVM2
configuration remains the same). Interestingly, during the initial
array rebuild, which took the better part of today, there was no
clicking even though the drives were constantly in action. Either it was
inaudible or the touching didn't take place.

I think I'm barking up the wrong tree with these experiments. Not sure
how to proceed from here.

LP,
Jure

On Thu, Dec 15, 2016 at 8:01 AM, Jure Erznožnik
<jure.erznoznik@gmail.com> wrote:
> Thanks for helping Neil. I have run the suggested utilities and here
> are my findings:
>
> It is always [kworker/x:yy] (x:yy changes somewhat) or [0].
> A few lines from one of the outputs:
>
>   9,0    3        0     0.061577998     0  m   N raid5 rcw 3758609392 2 2 0
>   9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
>   9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
>   9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
>   9,0    3        0     0.061580084     0  m   N raid5 rcw 3758609400 2 2 0
>   9,0    0        1     0.065333879   283  C   W 11275825480 [0]
>   9,0    0        1     0.065333879   283  C   W 11275825480 [0]
>   9,0    0        1     0.065333879   283  C   W 11275825480 [0]
>   9,0    0        1     0.065333879   283  C   W 11275825480 [0]
>   9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
>   9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
>   9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
>   9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
>   9,0    0        2     1.054590402   283  C   W 11275826504 [0]
>   9,0    0        2     1.054590402   283  C   W 11275826504 [0]
>   9,0    0        2     1.054590402   283  C   W 11275826504 [0]
>   9,0    0        2     1.054590402   283  C   W 11275826504 [0]
>   9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
>   9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
>   9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
>   9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
>   9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
>   9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
>   9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
>   9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
>   9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
>   9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
>   9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
>   9,0    0        0     2.075250686     0  m   N raid5 rcw 3758619888 2 2 0
>   9,0    2        1     2.086924691   283  C   W 11275861232 [0]
>   9,0    2        1     2.086924691   283  C   W 11275861232 [0]
>   9,0    2        1     2.086924691   283  C   W 11275861232 [0]
>   9,0    2        1     2.086924691   283  C   W 11275861232 [0]
>   9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
>   9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
>   9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
>   9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
>   9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
>   9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
>   9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
>   9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
>   9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
>   9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
>   9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
>   9,0    0        0     3.101966398     0  m   N raid5 rcw 3758619928 2 0 1
>   9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
>   9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
>   9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
>   9,0    0        0     3.101969169     0  m   N raid5 rcw 3758619928 2 2 0
>   9,0    0        4     3.102340646   283  C   W 11275861272 [0]
>   9,0    0        4     3.102340646   283  C   W 11275861272 [0]
>   9,0    0        4     3.102340646   283  C   W 11275861272 [0]
>   9,0    0        4     3.102340646   283  C   W 11275861272 [0]
>   9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
>   9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
>   9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
>   9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
>   9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
>   9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
>   9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
>   9,0    3        0     4.137869804     0  m   N raid5 rcw 3758671440 2 0 1
>   9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
>   9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
>   9,0    3        0     4.137872647     0  m   N raid5 rcw 3758671448 2 0 1
>
> LP,
> Jure
>
> On Wed, Dec 14, 2016 at 2:15 AM, NeilBrown <neilb@suse.com> wrote:
>> On Tue, Dec 13 2016, Jure Erznožnik wrote:
>>
>>> First of all, I apologise if this mail list is not intended for layman
>>> help, but this is what I am and I couldn't get an explanation
>>> elsewhere.
>>>
>>> My problem is that (as it seems) mdadm is touching HDD superblocks
>>> once per second, once at address 8 (sectors), next at address 16.
>>> Total traffic is kilobytes per second, writes only, no other
>>> detectable traffic.
>>>
>>> I have detailed the problem here:
>>> http://unix.stackexchange.com/questions/329477/
>>>
>>> Shortened:
>>> kubuntu 16.10 4.8.0-30-generic #32, mdadm v3.4 2016-01-28
>>> My configuration: 4 spinning platters (/dev/sd[cdef]) assembled into a
>>> raid5 array, then bcache set to cache (hopefully) everything
>>> (cache_mode = writeback, sequential_cutoff = 0). On top of bcache
>>> volume I have set up lvm.
>>>
>>> * iostat shows traffic on sd[cdef] and md0
>>> * iotop shows no traffic
>>> * iosnoop shows COMM=[idle, md0_raid5, kworker] as processes working
>>> on the disk. Blocks reported are 8, 16 (data size a few KB) and
>>> 18446744073709500000 (data size 0). That last one must be some virtual
>>> thingie as the disks are nowhere near that large.
>>> * enabling block_dump shows md0_raid5 process writing to block 8 (1
>>> sectors) and 16 (8 sectors)
>>>
>>> This touching is caused by any write into the array and goes on for
>>> quite a while after the write has been done (a couple of hours for
>>> 60GB of writes). When services actually work with the array, this
>>> becomes pretty much constant.
>>>
>>> What am I observing and is there any way of stopping it?
>>
>> Start with the uppermost layer which has I/O that you cannot explain.
>> Presumably that is md0.
>> Run 'blktrace' on that device for a little while, then 'blkparse' to
>> look at the results.
>>
>>  blktrace -w 10 md0
>>  blkparse *blktrace*
>>
>> It will give the name of the process that initiated the request in [] at
>> the end of some lines.
>>
>> NeilBrown


* Re: (user) Help needed: mdadm seems to constantly touch my disks
From: Theophanis Kontogiannis @ 2016-12-18 21:30 UTC (permalink / raw)
  To: Jure Erznožnik; +Cc: NeilBrown, Linux RAID
In-Reply-To: <CAJ=9zieazbqvWaMXPF1hUL+M5rpvswx5kcbBqU5crcgYi7Qg5Q@mail.gmail.com>

Hello All,

Kind reminder that I had to start a similar thread last month.

https://marc.info/?t=147871214200005&r=1&w=2

Just in case it rings any bells.

BR
Theo



---
Best regards,
ΜΦΧ,

Theophanis Kontogiannis



On Sun, Dec 18, 2016 at 9:40 PM, Jure Erznožnik
<jure.erznoznik@gmail.com> wrote:
> My further attempts to solve this issue include the following (all
> unsuccessful):
>
> 1. Installing a fresh Ubuntu and assembling the array
> 2. Installing OpenSUSE and assembling the array
> 3. Tearing the array down and creating it anew from scratch (it now has
> a new UUID, but the data seems to have been preserved, so my bcache /
> LVM2 configuration remains the same). Interestingly, during the initial
> array rebuild, which took the better part of today, there was no
> clicking even though the drives were constantly in action. Either it was
> inaudible or the touching didn't take place.
>
> I think I'm barking up the wrong tree with these experiments. Not sure
> how to proceed from here.
>
> LP,
> Jure


* Re: Fwd: (user) Help needed: mdadm seems to constantly touch my disks
From: NeilBrown @ 2016-12-18 22:21 UTC (permalink / raw)
  To: Jure Erznožnik, linux-raid
In-Reply-To: <CAJ=9zifGPd+3YnP1aLwLPXjtpygRc9DsDP37CNz-VHBHn1vDOg@mail.gmail.com>

On Thu, Dec 15 2016, Jure Erznožnik wrote:

> Thanks for helping Neil. I have run the suggested utilities and here
> are my findings:
>
> It is always [kworker/x:yy] (x:yy changes somewhat) or [0].
> A few lines from one of the outputs:

That's disappointing.  "kworker" could be any work queue.

It might be useful to look for large scale patterns.
What different addresses are written to?  Is there a regular pattern?
What is the period?

md doesn't use work_queues for IO so these must be coming from
elsewhere.  bcache uses a few work-queues...
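
One way to look for such patterns is to pull the queued writes out of the blkparse text and list their addresses and the gaps between them. The following is only a rough sketch, not part of blktrace: it assumes the default blkparse line layout seen in the trace quoted in this thread, and the names QUEUED_WRITE, queued_writes and periods are made up for illustration. Repeated lines are collapsed, and completions, "m" messages and flushes without a sector are skipped.

```python
import re

# Matches blkparse "Q" (queued) write lines that carry a sector and a length, e.g.
#   9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
# Assumption: default blkparse output format, as in the quoted trace.
QUEUED_WRITE = re.compile(
    r"^\s*\d+,\d+\s+\d+\s+\d+\s+(?P<time>\d+\.\d+)\s+\d+\s+Q\s+"
    r"(?P<rwbs>[A-Z]*W[A-Z]*)\s+(?P<sector>\d+)\s+\+\s+(?P<count>\d+)\s+"
    r"\[(?P<comm>[^\]]*)\]"
)


def queued_writes(lines):
    """Return unique (time, sector, count, comm) tuples for queued writes."""
    seen = set()
    result = []
    for line in lines:
        match = QUEUED_WRITE.match(line)
        if match is None:
            continue  # completions, "m" messages and bare flushes are skipped
        record = (
            float(match["time"]),
            int(match["sector"]),
            int(match["count"]),
            match["comm"],
        )
        if record not in seen:  # the posted trace repeats each line
            seen.add(record)
            result.append(record)
    return result


def periods(writes):
    """Seconds between consecutive queued writes, in input order."""
    times = [w[0] for w in writes]
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


# A few lines copied from the trace quoted in this thread.
SAMPLE = """\
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    3        2     1.022155200  2861  Q   W 11275826504 + 32 [kworker/3:38]
  9,0    0        2     1.054590402   283  C   W 11275826504 [0]
  9,0    3        3     2.046065106  2861  Q   W 11275861232 + 8 [kworker/3:38]
  9,0    0        0     2.075247515     0  m   N raid5 rcw 3758619888 2 0 1
  9,0    0        3     2.967340614  1061  Q FWS [kworker/0:18]
  9,0    3        4     3.070092310  2861  Q   W 11275861272 + 8 [kworker/3:38]
  9,0    3        5     4.094666938  2861  Q   W 11276014160 + 336 [kworker/3:38]
""".splitlines()

writes = queued_writes(SAMPLE)
for t, sector, count, comm in writes:
    print(f"{t:12.6f}  sector {sector}  +{count}  [{comm}]")
print("periods (s):", periods(writes))
```

On the four queued writes in that excerpt the gaps all come out at slightly over one second, which at least suggests a timer-driven writer rather than random activity. A longer capture would be needed to say anything about the address pattern.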

NeilBrown



* Re: Inconsistent use of sectors vs 1k-blocks in sysfs and other places
From: NeilBrown @ 2016-12-18 22:32 UTC (permalink / raw)
  To: John Brooks, linux-raid
In-Reply-To: <20161217051420.GA29241@oldkitsune.fastquake.com>

On Sat, Dec 17 2016, John Brooks wrote:

> I noticed that the md.rst (formerly md.txt) file in Documentation/ says that
> the component_size attribute in sysfs is measured in sectors. What the
> attribute actually returns is sectors/2 (1K blocks, perhaps?). This is the
> relevant code from md.c:
>
> static ssize_t
> size_show(struct mddev *mddev, char *page)
> {
>         return sprintf(page, "%llu\n",
>                 (unsigned long long)mddev->dev_sectors / 2);
> }
>
> So the documentation doesn't match the code. Obviously, that needs fixing. But
> in this case, I'm not sure which one is "wrong". mdadm's get_component_size()
> function in sysfs.c reads this file and multiplies the result by 2 to get
> sectors. So clearly this is a known behaviour and not just a forgotten typo.

The code is right by definition.  As you say, mdadm uses this interface
so we cannot change it.

>
> Looking further, I found that this seems to be a point of inconsistency in
> multiple areas:

Yes.  Sad isn't it?
The original ioctl interface used 'K' so I copied that when I created
the sysfs interface.  After a while I realized that 'sectors' made a lot
more sense, so I changed to that for subsequent additions.


>
> - The per-device "offset" attribute is in sectors (as documented).
>   The per-device "size" attribute is in 1K blocks (which the documentation
>   doesn't specify).
>
> - The "sync_completed" attribute uses sectors.
>   The "mdstat" file in procfs uses 1K blocks.
>
> - mdadm --examine shows Avail Dev Size in sectors.
>   mdadm --detail shows Used Dev Size in 1K blocks.
>   And they both show Array Size in 1K blocks.
>
> I suspect that the sysfs attributes have stayed the way they are in the
> interest of not breaking programs that use them. The easiest solution would
> probably be to leave the behaviour as-is but update the documentation so it's
> clear what units are used where.

Yes.  Not just "easiest", but "only acceptable".

>
> Somewhat related, suspend_{lo|hi}, resync_{min|max} attributes specify ranges
> in sectors, but the documentation does not specify if they are ranges on the
> array size or device size. And RAID10 may even handle resync_max differently
> from the rest; I didn't look deeply into that but see commit c805cdecea.

suspend_{lo|hi} are array addresses
resync_{min|max} are array addresses for RAID1 and RAID10, and they are
device-addresses-offset-from-data_offset for RAID1 and RAID456.

>
> The reason I found out about the component_size issue is that I was trying to
> make use of the min/max attributes in a project I'm working on, found that they
> wanted device-based sectors, and then found that none of the sysfs attributes
> actually tell you the device size in sectors (or the array size for that
> matter, by the way).
>
> I'm interested to hear what the developers think. I didn't do a thorough audit
> to find all the different places that the different units are used; I just
> pointed out a few that I noticed while working with it. So I think it's a good
> discussion to bring up with the people familiar with the code.
>

I agree that it is sad that units aren't consistent, but there is not
much we can do about that.
It is bad that the documentation is incorrect and incomplete.  If you
were to post a patch which fixed some of it, I'm sure that would be
thankfully accepted.

In theory you could have a RAID1 with an odd number of sectors used in
each component, but I doubt that happens in practice.  So you can get
the component size in sectors by reading component_size and multiplying
by two.
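
As a small illustration of that conversion, here is a sketch that reads a component_size-style attribute (reported in 1K blocks) and returns 512-byte sectors. The helper names kib_to_sectors and component_size_sectors are hypothetical, and the path in the usage note assumes an array called md0. As noted above, a component with an odd number of sectors cannot be represented exactly this way.

```python
from pathlib import Path


def kib_to_sectors(kib):
    """Convert a count of 1K blocks to 512-byte sectors."""
    return kib * 2


def component_size_sectors(attr_path):
    """Read an attribute file holding a size in 1K blocks; return sectors.

    attr_path is whatever file the caller points at; nothing here is
    specific to one array.
    """
    text = Path(attr_path).read_text().strip()
    return kib_to_sectors(int(text))
```

On a system with an array named md0 this would be called as component_size_sectors("/sys/block/md0/md/component_size").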

Thanks,
NeilBrown


* Re: Fwd: (user) Help needed: mdadm seems to constantly touch my disks
From: NeilBrown @ 2016-12-19  4:01 UTC (permalink / raw)
  To: Jure Erznožnik; +Cc: linux-raid
In-Reply-To: <CAJ=9zidNV4sPj7KC7_mJEo8+=-YTKyWD5RiLsGG9p33CV12Qdg@mail.gmail.com>

[please remember to keep linux-raid cc:ed]

On Mon, Dec 19 2016, Jure Erznožnik wrote:

> I wrote this in OP: iosnoop shows COMM=[idle, md0_raid5, kworker] as
> processes working on the disk. Blocks reported are 8, 16 (data size a
> few KB) and 18446744073709500000 (data size 0). That last one must be
> some virtual thingie as the disks are nowhere near that large.
>
> Does this answer the question or did you mean something else?

Maybe you could just make the blktrace logs available somewhere and I
will look at them myself.

NeilBrown


* Re: Fwd: (user) Help needed: mdadm seems to constantly touch my disks
From: Jure Erznožnik @ 2016-12-19  7:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87twa0tvma.fsf@notabene.neil.brown.name>

I have made two blktraces at the same time: one for md0 and one for a
member of the array. I hope they will show something sensible.

I dropped them here:
http://expirebox.com/download/ee081fa4f85ffbd0bfad68e4ee257e11.html

The file will be available for 48 hours, or so they say.

LP,
Jure

On Mon, Dec 19, 2016 at 5:01 AM, NeilBrown <neilb@suse.com> wrote:
> [please remember to keep linux-raid cc:ed]
>
> On Mon, Dec 19 2016, Jure Erznožnik wrote:
>
>> I wrote this in OP: iosnoop shows COMM=[idle, md0_raid5, kworker] as
>> processes working on the disk. Blocks reported are 8, 16 (data size a
>> few KB) and 18446744073709500000 (data size 0). That last one must be
>> some virtual thingie as the disks are nowhere near that large.
>>
>> Does this answer the question or did you mean something else?
>
> Maybe if you just make the blktrace logs available somewhere and I will
> look at them myself.
>
> NeilBrown
