Linux RAID subsystem development

* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-23 21:14 UTC
  To: NeilBrown; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAOxFTcxaC1WOj7HeD5bRaPKV93fQZ6X-mBtHOFcQmPwWfjPxDQ@mail.gmail.com>

On Fri, Dec 23, 2016 at 5:17 PM, Giuseppe Bilotta
<giuseppe.bilotta@gmail.com> wrote:
>
> Now I wonder if it would be possible to combine this approach with
> something that simply hacked the metadata of each disk to re-establish
> the correct disk order to make it possible to reassemble this
> particular array without recreating anything. Are problems such as
> mine common enough to warrant making this kind of verified
> reassembly from assumed-clean disks easier?

Actually, now that the correct order is verified, I would like to know
why re-creating the array using mdadm -C --assume-clean with the disks
in the correct order works (the RAID is then accessible, and I can
read data off of it).

However, if I simply hand-edit the metadata to assign the correct
device order to the disks (I do this by restoring the correct device
roles in the dev_roles table, at the entries corresponding to the
disks' dev_numbers, in the correct order, and then adjusting the checksum
accordingly) and then assemble the array, I get I/O errors accessing
the array contents, even though raid6check doesn't report issues.

In the 'hacked dev role' case, the dmesg reads:

[  +0.002057] md: bind<dm-2>
[  +0.000936] md: bind<dm-1>
[  +0.000932] md: bind<dm-0>
[  +0.000925] md: bind<dm-3>
[  +0.001443] md/raid:md112: device dm-3 operational as raid disk 0
[  +0.000540] md/raid:md112: device dm-0 operational as raid disk 3
[  +0.000710] md/raid:md112: device dm-1 operational as raid disk 2
[  +0.000508] md/raid:md112: device dm-2 operational as raid disk 1
[  +0.009716] md/raid:md112: allocated 4374kB
[  +0.000555] md/raid:md112: raid level 6 active with 4 out of 4 devices, algorithm 2
[  +0.000531] RAID conf printout:
[  +0.000001]  --- level:6 rd:4 wd:4
[  +0.000001]  disk 0, o:1, dev:dm-3
[  +0.000001]  disk 1, o:1, dev:dm-2
[  +0.000000]  disk 2, o:1, dev:dm-1
[  +0.000001]  disk 3, o:1, dev:dm-0
[  +0.000449] created bitmap (22 pages) for device md112
[  +0.001865] md112: bitmap initialized from disk: read 2 pages, set 5 of 44711 bits
[  +0.533458] md112: detected capacity change from 0 to 6000916561920
[  +0.004194] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.003450] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001953] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001978] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001852] ldm_validate_partition_table(): Disk read failed.
[  +0.001889] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001875] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001834] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001596] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001551] Dev md112: unable to read RDB block 0
[  +0.001293] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001284] Buffer I/O error on dev md112, logical block 0, async page read
[  +0.001307]  md112: unable to read partition table


So the array assembles, and raid6check reports no error, but the data
is actually inaccessible. Am I missing other aspects of the metadata
that need to be restored?
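
For anyone repeating the hand-edit described above, the "adjust the checksum" step can be sketched roughly as follows. This is an illustration under stated assumptions, not a verified tool: it assumes the v1.x superblock layout as found in mdadm's super1.c (32-bit checksum at byte 216, `max_dev` at byte 220, 16-bit `dev_roles` entries starting at byte 256), and the function names `sb_checksum` and `set_dev_role` are hypothetical. It operates on an in-memory copy of the superblock bytes only and never touches a device.

```python
import struct

# Assumed v1.x superblock field offsets; verify against your mdadm/kernel sources.
CSUM_OFFSET = 216      # 32-bit little-endian checksum field
MAX_DEV_OFFSET = 220   # 32-bit little-endian number of dev_roles entries
ROLES_OFFSET = 256     # start of the 16-bit little-endian dev_roles table

def sb_checksum(sb: bytes) -> int:
    """Sum the superblock as little-endian 32-bit words, treating the
    checksum field as zero, then fold the carry bits back in."""
    max_dev = struct.unpack_from("<I", sb, MAX_DEV_OFFSET)[0]
    size = ROLES_OFFSET + 2 * max_dev
    buf = bytearray(sb[:size])
    buf[CSUM_OFFSET:CSUM_OFFSET + 4] = b"\0\0\0\0"
    total = 0
    pos = 0
    while size - pos >= 4:
        total += struct.unpack_from("<I", buf, pos)[0]
        pos += 4
    if size - pos == 2:  # an odd number of roles leaves a trailing 16-bit word
        total += struct.unpack_from("<H", buf, pos)[0]
    return ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF

def set_dev_role(sb: bytes, dev_number: int, role: int) -> bytes:
    """Return a copy of sb with dev_roles[dev_number] = role and a fresh checksum."""
    out = bytearray(sb)
    struct.pack_into("<H", out, ROLES_OFFSET + 2 * dev_number, role)
    struct.pack_into("<I", out, CSUM_OFFSET, sb_checksum(bytes(out)))
    return bytes(out)

# Synthetic all-zero buffer for demonstration only; this is not a real superblock.
demo = bytearray(1024)
struct.pack_into("<I", demo, MAX_DEV_OFFSET, 4)
patched = set_dev_role(bytes(demo), dev_number=2, role=1)
print(hex(struct.unpack_from("<I", patched, CSUM_OFFSET)[0]))
```

A matching checksum only makes the superblock acceptable to the assembler; it says nothing about other per-device metadata that may differ from a freshly created array.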


-- 
Giuseppe "Oblomov" Bilotta

* Re: Recovering a RAID6 after all disks were disconnected
From: NeilBrown @ 2016-12-23 22:46 UTC
  To: Giuseppe Bilotta; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAOxFTcxaC1WOj7HeD5bRaPKV93fQZ6X-mBtHOFcQmPwWfjPxDQ@mail.gmail.com>

On Sat, Dec 24 2016, Giuseppe Bilotta wrote:

> On Fri, Dec 23, 2016 at 12:25 AM, NeilBrown <neilb@suse.com> wrote:
>> On Fri, Dec 23 2016, Giuseppe Bilotta wrote:
>>> I also wrote a small script to test all combinations (nothing smart,
>>> really, simply enumeration of combos, but I'll consider putting it up
>>> on the wiki as well), and I was actually surprised by the results. To
>>> test if the RAID was being re-created correctly with each combination,
>>> I used `file -s` on the RAID, and verified that the results made
>>> sense. I am surprised to find out that there are multiple combinations
>>> that make sense (note that the disk names are shifted by one compared
>>> to previous emails due to a machine lockup that required a reboot,
>>> after which another disk butted in and changed the order):
>>>
>>> trying /dev/sdd /dev/sdf /dev/sde /dev/sdg
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sdd /dev/sdf /dev/sdg /dev/sde
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sde /dev/sdf /dev/sdd /dev/sdg
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sde /dev/sdf /dev/sdg /dev/sdd
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sdg /dev/sdf /dev/sde /dev/sdd
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> trying /dev/sdg /dev/sdf /dev/sdd /dev/sde
>>> /dev/md111: Linux rev 1.0 ext4 filesystem data,
>>> UUID=0031565c-38dd-4445-a707-f77aef1cbf7e, volume name "oneforall"
>>> (needs journal recovery) (extents) (large files) (huge files)
>>>
>>> So there are six out of 24 combinations that make sense, at least for
>>> the first block. I know from the pre-fail dmesg that the g-f-e-d order
>>> should be the correct one, but now I'm left wondering if there is a
>>> better way to verify this (other than manually sampling files to see
>>> if they make sense), or if the left-symmetric layout on a RAID6 simply
>>> allows some of the disk positions to be swapped without loss of data.
>
>> Your script has reported all arrangements with /dev/sdf as the second
>> device.  Presumably that is where the single block you are reading
>> resides.
>
> That makes sense.
>
>> To check if a RAID6 arrangement is credible, you can try the raid6check
>> program that is included in the mdadm source release.  There is a man
>> page.
>> If the order of devices is not correct raid6check will tell you about
>> it.
>
> That's a wonderful small utility, thanks for making it known to me!
> Checking even just a small number of stripes was enough in this case,
> as the expected combination (g f e d) was the only one that produced
> no errors.
>
> Now I wonder if it would be possible to combine this approach with
> something that simply hacked the metadata of each disk to re-establish
> the correct disk order to make it possible to reassemble this
> particular array without recreating anything. Are problems such as
> mine common enough to warrant making this kind of verified
> reassembly from assumed-clean disks easier?

The way I look at this sort of question is to ask "what is the root
cause?", and then "What is the best response to the consequences of that
root cause?".

In your case, I would look at the sequence of events that led to you
needing to re-create your array, and ask "At which point could md or
mdadm have done something differently?".

If you, or someone, can describe precisely how to reproduce your outcome
- so that I can reproduce it myself - then I'll happily have a look and
see at which point something different could have happened.

Until then, I think the best response to these situations is to ask for
help, and to have tools which allow details to be extracted and repairs to
be made.
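
For reference, the six matching device orders quoted above are exactly the orderings that keep /dev/sdf in the second slot. The sketch below reproduces that count. It is a minimal illustration, not the script that was actually used: the function name `candidate_orders` and its `fixed` argument are hypothetical, and the code only generates orderings without touching any disk.

```python
from itertools import permutations

def candidate_orders(devices, fixed=None):
    """Yield every ordering of devices.

    fixed is an optional {slot_index: device} mapping (hypothetical helper
    argument) that keeps only orderings with a device in a known slot.
    """
    fixed = fixed or {}
    for order in permutations(devices):
        if all(order[slot] == dev for slot, dev in fixed.items()):
            yield order

devices = ["/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
all_orders = list(candidate_orders(devices))
sdf_second = list(candidate_orders(devices, fixed={1: "/dev/sdf"}))
print(len(all_orders), len(sdf_second))
```

Four devices give 4! = 24 orderings, and pinning one device leaves 3! = 6. A check that reads only a single block, such as `file -s`, therefore cannot separate those six, which is why a parity-aware check like raid6check is needed to single out the right order.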

NeilBrown

* Re: Recovering a RAID6 after all disks were disconnected
From: NeilBrown @ 2016-12-23 22:50 UTC
  To: Giuseppe Bilotta; +Cc: John Stoffel, linux-raid
In-Reply-To: <CAOxFTcwaDfK+gKX2d4L4DuX=pf273M8qKGGOo8G+iAVSxurHhA@mail.gmail.com>

On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>
>
> So the array assembles, and raid6check reports no error, but the data
> is actually inaccessible. Am I missing other aspects of the metadata
> that need to be restored?

Presumably, yes.

If you provide "mdadm --examine" from devices in both the "working" and
the "not working" case, I might be able to point to the difference.

Alternately, use "mdadm --dump" to extract the metadata, then "tar
--sparse" to combine the (sparse) metadata files into a tar-archive, and
send that.  Then I would be able to experiment myself.

NeilBrown

* Re: Raid5 performance issue
From: Roman Mamedov @ 2016-12-24  6:46 UTC
  To: Doug Dumitru; +Cc: Marc Roos, linux-raid
In-Reply-To: <CAFx4rwQY3zF3cz-WV+L72mtmJodrYjO-kP1MUhC4YeX=zLJWmA@mail.gmail.com>

On Fri, 23 Dec 2016 11:24:13 -0800
Doug Dumitru <doug@easyco.com> wrote:

> With hard drives, I suspect your single-disk tests are taking
> advantage of the disks' on-controller cache, which is doing read-ahead
> and thus streaming.  With the array in place, you are probably doing
> 512K reads (check the array chunk size) so the disks will see bursts
> of 512K reads with big gaps.  The gaps are large enough that the
> rotation has gone too far and the caching makes you wait a rotation.
> This is just a guess.

You can compensate for that by setting a higher read-ahead for the array
itself, such as

  blockdev --setra 8192 /dev/mdX 

(or more)
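
To put a number on that setting: `blockdev --setra` takes its value in 512-byte sectors, so 8192 corresponds to 4 MiB of read-ahead. One rough way to choose a value is to cover a few full data stripes, and the sketch below does that arithmetic. The helper name `readahead_sectors`, the 512K chunk size (the guess quoted above), the three data disks and the factor of two are all illustrative assumptions rather than recommendations.

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors

def readahead_sectors(chunk_kib: int, data_disks: int, stripes: int = 2) -> int:
    """Sectors of read-ahead needed to cover `stripes` full data stripes."""
    stripe_bytes = chunk_kib * 1024 * data_disks
    return stripes * stripe_bytes // SECTOR_BYTES

# MiB of read-ahead that --setra 8192 corresponds to.
print(8192 * SECTOR_BYTES // (1024 * 1024))
# Sectors needed for two full stripes with the assumed geometry.
print(readahead_sectors(chunk_kib=512, data_disks=3))
```

The current value can be read back with `blockdev --getra /dev/mdX` before and after changing it.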

-- 
With respect,
Roman

* Re: SAS disk from RAID card (no RAID mode) problems
From: IW News @ 2016-12-24  7:49 UTC
  To: linux-raid
In-Reply-To: <CAFCYAse2Bq3dYXDrmC6dxdh4L435552crv2sDuZDb3pngZi8=A@mail.gmail.com>

Sandisk USA referred me to Sandisk EU, and from there to Sandisk Spain
(where I am).
At Sandisk Spain they know nothing about that product. They redirected me
to WD.
I can't contact WD because they need a part number, but they do not
accept the SMARTMOD disk's one.
Crazy.
There is almost no information about SMARTMOD on the net.

On 23/12/16 20:04, Jeff Johnson wrote:
> I would definitely contact Sandisk (which now owns the Smart Systems
> products) to ensure you have up-to-date firmware. I know there have been
> issues with command set support, among other things, on SAS and
> SATA-converted-to-SAS SSDs. I've been through three bugfix-related
> firmware updates to SanDisk Lightning SAS SSDs, so it's best to contact
> them and get the latest firmware update.
>
> --Jeff
>
> On Fri, Dec 23, 2016 at 10:49 AM, IW News <news@imagedworld.com> wrote:
>
>     On 23/12/16 16:03, Jeff Johnson wrote:
>
>         Which make/model of 400GB SAS SSD? They can vary in firmware
>         issues, etc.
>
>     Hi,
>
>     Both are:
>
>     SMARTMOD SG9XCA2E400GEEMC (E032)
>
>     Thanks.
>
>
>
>
> -- 
> ------------------------------
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.johnson@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage


* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-24 14:34 UTC
  To: NeilBrown; +Cc: John Stoffel, linux-raid
In-Reply-To: <87d1gip8kk.fsf@notabene.neil.brown.name>

On Fri, Dec 23, 2016 at 11:46 PM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>>
>> Now I wonder if it would be possible to combine this approach with
>> something that simply hacked the metadata of each disk to re-establish
>> the correct disk order to make it possible to reassemble this
>> particular array without recreating anything. Are problems such as
>> mine common enough to warrant making this kind of verified
>> reassembly from assumed-clean disks easier?
>
> The way I look at this sort of question is to ask "what is the root
> cause?", and then "What is the best response to the consequences of that
> root cause?".
>
> In your case, I would look at the sequence of events that led to you
> needing to re-create your array, and ask "At which point could md or
> mdadm have done something differently?".
>
> If you, or someone, can describe precisely how to reproduce your outcome
> - so that I can reproduce it myself - then I'll happily have a look and
> see at which point something different could have happened.

As I mentioned in the first post, the root of the issue is cheap
hardware plus user error. Basically, all disks in this RAID are hosted
on a JBOD that has a tendency to 'disappear' at times. I've seen this
happen generally when one of the disks acts up (in which case Linux
attempting to reset it leads to a reset of the whole JBOD, which makes
all disks disappear until the device recovers). The JBOD is connected
via USB3, but I had the same issues when using an eSATA connection
with a port multiplier, and from what I've read it's a known
limitation of SATA (as opposed to professional stuff based on SAS).

When this happens, md ends up removing all devices from the RAID. The
proper way to handle this, I've found, is to unmount the filesystem,
stop the array, and then reassemble it and remount it as soon as the
JBOD is back online. With this approach the RAID recovers in pretty
good shape (aside from the disk that is acting up, possibly). However,
it's a bit bothersome and may take some time to free up all filesystem
usage to allow for the unmounting, sometimes to the point of requiring
a reboot. So the last time this happened I tried something different,
and I made the mistake of trying a re-add of all the disks. This
resulted in the disks being marked as spares because md could not
restore the RAID functionality after having dropped to 0 disks.

I'm not sure this could be handled differently, unless mdraid could be
made to not kick all disks out if the whole JBOD disappears, but
rather wait for it to come back?

-- 
Giuseppe "Oblomov" Bilotta

* Re: Recovering a RAID6 after all disks were disconnected
From: Giuseppe Bilotta @ 2016-12-24 14:47 UTC
  To: NeilBrown; +Cc: John Stoffel, linux-raid
In-Reply-To: <87a8bmp8el.fsf@notabene.neil.brown.name>

On Fri, Dec 23, 2016 at 11:50 PM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Dec 24 2016, Giuseppe Bilotta wrote:
>>
>>
>> So the array assembles, and raid6check reports no error, but the data
>> is actually inaccessible. Am I missing other aspects of the metadata
>> that need to be restored?
>
> Presumably, yes.
>
> If you provide "mdadm --examine" from devices in both the "working" and
> the "not working" case, I might be able to point to the difference.

I found the culprit. All disks have bad block lists, and the bad block
lists include the initial data sectors (i.e. the sectors pointed at by
the data offset in the superblock). This is quite probably a side
effect of my stupid idea of trying a re-add of all disks after all of
them were kicked out during the JBOD disconnect. One more reason to
just stop the array when this situation arises.
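
The condition described here, bad-block entries covering the very start of the data area, comes down to simple interval arithmetic once the list and the data offset are known (for instance from `mdadm --examine` and `mdadm --examine-badblocks`). The sketch below is hypothetical: the function name, the data offset of 262144 sectors and the two entries are made up for illustration and are not taken from the affected disks.

```python
def blocks_overlapping(bad_blocks, start, length):
    """Return the (first_sector, count) bad-block entries that intersect
    the sector range [start, start + length)."""
    end = start + length
    return [(s, n) for s, n in bad_blocks if s < end and s + n > start]

# Made-up example values, in 512-byte sectors.
data_offset = 262144
bad_blocks = [(262144, 8), (500000, 16)]
hits = blocks_overlapping(bad_blocks, data_offset, 8)
print(hits)
```

An entry like the first one, sitting right at the data offset, would be consistent with reads of logical block 0 failing even though the array assembles.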

-- 
Giuseppe "Oblomov" Bilotta

* Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
From: MasterPrenium @ 2016-12-26 13:14 UTC
  To: linux-kernel, xen-users
  Cc: linux-raid, shli, MasterPrenium@gmail.com, xen-devel
In-Reply-To: <585D6C34.2020908@gmail.com>

Hi guys,

I've tested the same set-up with a software RAID 1 array instead; in that
case I get no issue at all.
It's definitely a RAID 5 problem.
As requested, I've tested re-creating the RAID 5 array (just to be
sure); the issue remains the same with both metadata 0.90 and metadata 1.2.

Thanks,

On 23/12/2016 19:25, MasterPrenium wrote:
> Hello Guys,
>
> I'm having some trouble on a new system I'm setting up. I'm getting a
> kernel BUG message that seems to be related to the use of Xen (when I
> boot the system _without_ Xen, I don't get any crash).
> Here is the configuration:
> - 3x hard drives in a software RAID 5 array created by mdadm
> - On top of it, DRBD for replication to another node (active/passive cluster)
> - On top of it, a Btrfs filesystem with a few subvolumes
> - On top of it, Xen VMs running.
>
> The BUG happens when I generate "huge" I/O (20MB/s with an rsync,
> for example) on the RAID 5 stack.
> I have to reset the system to make it work again.
>
> Reproducible: ALWAYS (once the I/O starts, it crashes within 2-5 minutes).
> Also reproducible on another system with the same hardware.
>
> Kernel versions affected (at least): kernel-4.4.26, kernel-4.8.15,
> kernel-4.9.0
>
> Here are the dmesg errors:
> [  937.123220] ------------[ cut here ]------------
> [  937.127549] kernel BUG at drivers/md/raid5.c:527!
> [  937.131891] invalid opcode: 0000 [#1] SMP
> [  937.136216] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
> [  937.145665] CPU: 2 PID: 9704 Comm: kworker/u16:8 Not tainted 4.9.0-gentoo #2
> [  937.150293] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
> [  937.155531] Workqueue: drbd0_submit do_submit
> [  937.160506] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
> [  937.164115] RIP: e030:[<ffffffff819e1fc1>] [<ffffffff819e1fc1>] raid5_get_active_stripe+0x5e1/0x670
> [  937.169584] RSP: e02b:ffffc9000a66fa58  EFLAGS: 00010086
> [  937.175070] RAX: 0000000000000000 RBX: ffff880249d50000 RCX: ffff8802648bb5d0
> [  937.180640] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880249d50000
> [  937.185505] RBP: ffffc9000a66faf0 R08: ffff8801f4813288 R09: 0000000000000000
> [  937.190631] R10: 0000000000000288 R11: 0000000000000000 R12: 0000000000000000
> [  937.196030] R13: 000000001e773e88 R14: ffff880249d50000 R15: ffff8802648bb400
> [  937.202011] FS:  0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
> [  937.206628] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  937.212372] CR2: 00007f68a101b520 CR3: 0000000257875000 CR4: 0000000000042660
> [  937.217538] Stack:
> [  937.223361]  ffff8802648bb400 ffff880269550b40 0000000000000000 0000000166cf3800
> [  937.229103]  000000001e773e88 ffff8802648bb5d0 0000000000000001 0000000000000000
> [  937.233707]  ffff8802648bb40c 0000000000000001 ffffc9000a66faf0 ffff880047cba958
> [  937.239736] Call Trace:
> [  937.244406]  [<ffffffff819e21cd>] raid5_make_request+0x17d/0xdf0
> [  937.250345]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [  937.256173]  [<ffffffff81a09c03>] md_make_request+0xe3/0x220
> [  937.261031]  [<ffffffff81483e9b>] generic_make_request+0xcb/0x1a0
> [  937.265615]  [<ffffffff81732537>] drbd_send_and_submit+0x497/0x1310
> [  937.271605]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [  937.276726]  [<ffffffff817339ba>] send_and_submit_pending+0x6a/0x90
> [  937.282292]  [<ffffffff81733e43>] do_submit+0x463/0x550
> [  937.288333]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [  937.293205]  [<ffffffff81095400>] process_one_work+0x170/0x420
> [  937.298982]  [<ffffffff810957d3>] worker_thread+0x123/0x500
> [  937.304154]  [<ffffffff810956b0>] ? process_one_work+0x420/0x420
> [  937.310314]  [<ffffffff810956b0>] ? process_one_work+0x420/0x420
> [  937.316013]  [<ffffffff8109b135>] kthread+0xc5/0xe0
> [  937.320918]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [  937.327029]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [  937.331994]  [<ffffffff81ccbbc5>] ret_from_fork+0x25/0x30
> [  937.338068] Code: 85 d0 fb ff ff f0 41 80 8f 98 02 00 00 04 e9 c2 fb ff ff f3 90 41 8b 47 70 a8 01 75 f6 89 45 a4 e9 e2 fd ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 49 89 d6 e9 e1 fa ff ff 49 8b 82 e8 01 00 00 4d 8b 8a e0
> [  937.349579] RIP  [<ffffffff819e1fc1>] raid5_get_active_stripe+0x5e1/0x670
> [  937.355290]  RSP <ffffc9000a66fa58>
> [  937.386587] ---[ end trace b870be01f61065a5 ]---
> [  941.931453] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [  941.937139] IP: [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
> [  941.943106] PGD 252dde067
> [  941.943219] PUD 252ee7067
> [  941.950107] PMD 0
>
> [  941.956080] Oops: 0000 [#2] SMP
> [  941.961919] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
> [  941.974933] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G      D         4.9.0-gentoo #2
> [  941.982080] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
> [  941.989296] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
> [  941.996831] RIP: e030:[<ffffffff810bcaa6>] [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
> [  942.004391] RSP: e02b:ffffc9000a66fe50  EFLAGS: 00010086
> [  942.011818] RAX: 0000000000000200 RBX: ffffc9000a66ff18 RCX: 0000000000000000
> [  942.019290] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9000a66ff18
> [  942.026779] RBP: ffffc9000a66fe88 R08: 0000000000000000 R09: 0000000000000000
> [  942.034246] R10: 0000000000000008 R11: 0000000000000001 R12: ffffc9000a66ff20
> [  942.041693] R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000003
> [  942.049166] FS:  0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
> [  942.056599] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  942.063953] CR2: 0000000000000028 CR3: 0000000257875000 CR4: 0000000000042660
> [  942.070841] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [  942.074862] BUG: unable to handle kernel paging request at ffffc9000234f8f8
> [  942.078910] IP: [<ffffc9000234f8f8>] 0xffffc9000234f8f8
> [  942.082961] PGD 1e9840067
> [  942.083010] PUD 1e983f067
> [  942.086963] PMD 26b42c067
> [  942.086978] PTE 801000026b66c067
>
> [  942.094822] Oops: 0011 [#3] SMP
> [  942.098734] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
> [  942.107222] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G      D         4.9.0-gentoo #2
> [  942.111581] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
> [  942.116050] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
> [  942.120530] RIP: e030:[<ffffc9000234f8f8>] [<ffffc9000234f8f8>] 0xffffc9000234f8f8
> [  942.125019] RSP: e02b:ffffc9000a66fb80  EFLAGS: 00010082
> [  942.129534] RAX: 0000000000000041 RBX: 0000000000042660 RCX: 0000000000000006
> [  942.134355] RDX: 0000000000000041 RSI: ffffffff824e00a0 RDI: ffff880270c8dd80
> [  942.138921] RBP: ffffc9000a66fbe0 R08: 0000000000000000 R09: 0000000000000000
> [  942.143564] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000080050033
> [  942.148172] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  942.152837] FS:  0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
> [  942.157525] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  942.162213] CR2: 0000000000000028 CR3: 0000000257875000 CR4: 0000000000042660
> [  942.166954] Stack:
> [  942.171576]  0000000257875000 0000000000000028 ffff880270c80000 ffff880270c80000
> [  942.176267]  0000000000000000 0000e0330a66c000 0000000000000000 ffffc9000a66fda8
> [  942.180918]  0000000000000000 ffffc9000a66fda8 0000000000000000 0000000000000000
> [  942.185521] Call Trace:
> [  942.190043]  [<ffffffff810302ad>] show_regs+0x2d/0x180
> [  942.194605]  [<ffffffff81030725>] __die+0xa5/0xf0
> [  942.199050]  [<ffffffff8106041e>] no_context+0x14e/0x3d0
> [  942.203562]  [<ffffffff81060798>] __bad_area_nosemaphore+0xf8/0x1c0
> [  942.208002]  [<ffffffff8106086f>] bad_area_nosemaphore+0xf/0x20
> [  942.212478]  [<ffffffff81061034>] __do_page_fault+0x84/0x4b0
> [  942.216797]  [<ffffffff8106148c>] do_page_fault+0x2c/0x40
> [  942.221021]  [<ffffffff81ccd808>] page_fault+0x28/0x30
> [  942.225184]  [<ffffffff810bcaa6>] ? __wake_up_common+0x26/0x80
> [  942.229287]  [<ffffffff810bcb0e>] __wake_up_locked+0xe/0x10
> [  942.233366]  [<ffffffff810bd4d2>] complete+0x32/0x50
> [  942.237330]  [<ffffffff8107a500>] mm_release+0xc0/0x160
> [  942.241216]  [<ffffffff81080206>] do_exit+0x136/0xb50
> [  942.245056]  [<ffffffff81ccdc07>] rewind_stack_do_exit+0x17/0x20
> [  942.248933] Code: c9 ff ff c0 cf 74 b7 01 88 ff ff 00 30 cf 66 02 88 ff ff 00 00 00 00 00 00 00 00 40 29 57 6b 02 88 ff ff b0 cf 0b 81 ff ff ff ff <70> fb 66 0a 00 c9 ff ff 88 b6 8b 64 02 88 ff ff 00 00 00 00 01
> [  942.257683] RIP  [<ffffc9000234f8f8>] 0xffffc9000234f8f8
> [  942.261814]  RSP <ffffc9000a66fb80>
> [  942.265860] CR2: ffffc9000234f8f8
> [  942.269830] ---[ end trace b870be01f61065a6 ]---
> [  942.293603] Fixing recursive fault but reboot is needed!
> [  962.926746] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  962.930582]  4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=5195
> [  962.934400]  (detected by 1, t=21002 jiffies, g=26732, c=26731, q=173)
> [  962.938231] Task dump for CPU 4:
> [  962.942054] md10_raid5      R  running task    13232  2654 2 0x00000008
> [  962.945939]  ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
> [  962.949899]  0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
> [  962.953781]  0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
> [  962.957570] Call Trace:
> [  962.961272]  [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
> [  962.964943]  [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
> [  962.968604]  [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
> [  962.972253]  [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
> [  962.975825]  [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
> [  962.979360]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [  962.982900]  [<ffffffff81a093e0>] ? find_pers+0x70/0x70
> [  962.986392]  [<ffffffff8109b135>] ? kthread+0xc5/0xe0
> [  962.989881]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [  962.993382]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [  962.996858]  [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
> [ 1025.932534] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1025.936027]  4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=20780
> [ 1025.939486]  (detected by 0, t=84014 jiffies, g=26732, c=26731, q=742)
> [ 1025.942969] Task dump for CPU 4:
> [ 1025.946373] md10_raid5      R  running task    13232  2654 2 0x00000008
> [ 1025.949909]  ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
> [ 1025.953451]  0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
> [ 1025.957015]  0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
> [ 1025.960601] Call Trace:
> [ 1025.964139]  [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
> [ 1025.967724]  [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
> [ 1025.971351]  [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
> [ 1025.975001]  [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
> [ 1025.978598]  [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
> [ 1025.982255]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 1025.985875]  [<ffffffff81a093e0>] ? find_pers+0x70/0x70
> [ 1025.989496]  [<ffffffff8109b135>] ? kthread+0xc5/0xe0
> [ 1025.993117]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [ 1025.996707]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [ 1026.000354]  [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
> [ 1088.937436] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1088.941109]  4-...: (1 GPs behind) idle=deb/140000000000000/0 
> softirq=51234/51234 fqs=36280
> [ 1088.944649]  (detected by 0, t=147019 jiffies, g=26732, c=26731, 
> q=1328)
> [ 1088.948180] Task dump for CPU 4:
> [ 1088.951671] md10_raid5      R  running task    13232  2654 2 
> 0x00000008
> [ 1088.955296]  ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 
> 0000000000000000
> [ 1088.958963]  0000000000000220 ffff8802648bb40c 0000000000000002 
> ffff8802648bb708
> [ 1088.962665]  0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 
> ffff8802648bb400
> [ 1088.966301] Call Trace:
> [ 1088.969868]  [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
> [ 1088.973451]  [<ffffffff819d87f4>] ? 
> release_inactive_stripe_list+0x44/0x180
> [ 1088.977020]  [<ffffffff819e5469>] ? 
> handle_active_stripes.isra.56+0x169/0x440
> [ 1088.980572]  [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
> [ 1088.984066]  [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
> [ 1088.987549]  [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 1088.991011]  [<ffffffff81a093e0>] ? find_pers+0x70/0x70
> [ 1088.994444]  [<ffffffff8109b135>] ? kthread+0xc5/0xe0
> [ 1088.997815]  [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [ 1089.001181]  [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [ 1089.004534]  [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
>
> (Another log here : http://pastebin.com/maxGFc1z)
>
> Xen versions affected (at least): xen-4.6, xen-4.7, xen-4.8
> xen-tools same version
>
> Userland is a gentoo linux box.
>
> Kernel .config : http://pastebin.com/p0EcHjbu
>
> All built with : gcc (Gentoo 4.9.3 p1.5, pie-0.6.4) 4.9.3
>
> -> scripts/ver_linux
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
>
> Linux Node_1 4.9.0-gentoo #2 SMP Fri Dec 23 16:37:48 CET 2016 x86_64 
> Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz GenuineIntel GNU/Linux
>
> GNU C                   4.9.3
> GNU Make                4.1
> Binutils                2.25.1
> Util-linux              2.26.2
> Mount                   2.26.2
> Module-init-tools       22
> E2fsprogs               1.43.3
> Linux C Library         2.22
> Dynamic linker (ldd)    2.22
> Linux C++ Library       6.0.20
> Procps                  3.3.12
> Net-tools               1.60
> Kbd                     2.0.3
> Console-tools           2.0.3
> Sh-utils                8.25
> Udev                    220
> Modules Loaded          ablk_helper aesni_intel aes_x86_64 coretemp 
> crc32c_intel mei mei_me mpt3sas x86_pkg_temp_thermal
>
> -> System is a SuperMicro Motherboard X10SDV-4C-7TP4F with 8GB of DDR 
> 4 ECC Registered memory
>
>
> Any help would be greatly appreciated !
>
> Thanks,
>

^ permalink raw reply

* Re: SAS disk from RAID card (no RAID mode) problems
From: Andrew Ryder @ 2016-12-26 18:55 UTC (permalink / raw)
  To: IW News, linux-raid
In-Reply-To: <bccd49d1-fdbe-01c5-505e-f0e8ad044e88@imagedworld.com>

Try EMC?

IBM has a similar part number with the last 3 letters being "IBM" not "EMC".

ftp://ftp.software.ibm.com/software/server/firmware/XceedIOPS.htm

On 12/24/2016 02:49 AM, IW News wrote:
> Sandisk USA referred me to Sandisk EU. From there to Sandisk Spain (where
> I am).
> At Sandisk Spain they know nothing about that product. They redirected me
> to WD.
> I can't contact WD because they need a part number but they do not
> accept the SMARTMOD disk's one.
> Crazy.
> There is almost no info about SMARTMOD on the net.
>
> On 23/12/16 20:04, Jeff Johnson wrote:
>> I would definitely contact Sandisk (now owns Smart Systems products)
>> to ensure you have up to date firmware. I know there have been issues
>> with command set support and other issues with SAS and SATA converted
>> to SAS SSDs. I've been through three bugfix related firmware updates
>> to SanDIsk Lightning SAS SSDs so best to contact them and get latest
>> firmware update.
>>
>> --Jeff
>>
>> On Fri, Dec 23, 2016 at 10:49 AM, IW News <news@imagedworld.com
>> <mailto:news@imagedworld.com>> wrote:
>>
>>     On 23/12/16 16:03, Jeff Johnson wrote:
>>
>>         Which make/model of 400GB SAS SSD? They can vary in firmware
>>         issues, etc.
>>
>>     Hi,
>>
>>     Both are:
>>
>>     SMARTMOD SG9XCA2E400GEEMC (E032)
>>
>>     Thanks.
>>
>>
>>
>>
>> --
>> ------------------------------
>> Jeff Johnson
>> Co-Founder
>> Aeon Computing
>>
>> jeff.johnson@aeoncomputing.com <mailto:jeff.johnson@aeoncomputing.com>
>> www.aeoncomputing.com <http://www.aeoncomputing.com>
>> t: 858-412-3810 x1001   f: 858-412-3845
>> m: 619-204-9061
>>
>> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>>
>> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: SAS disk from RAID card (no RAID mode) problems
From: Thomas Fjellstrom @ 2016-12-26 21:25 UTC (permalink / raw)
  To: IW News; +Cc: linux-raid
In-Reply-To: <2326bdf0-2948-5cbf-3033-27ed41803e23@imagedworld.com>

On Friday, December 23, 2016 9:01:40 AM MST IW News wrote:
> Hello,
> 
> First message here.
> 
> After looking for a solution without any luck I have found this list. I
> hope someone can help me with this.
> 
> I have an ASUS P6T Deluxe with a MARVELL 88SE63xx SAS RAID controller.
> There are two identical 400GB SAS SSD drives attached to it. One of them
> has a Windows 10 installation, the other one Linux.
> Grub is installed on the second disk.
> 
> Windows works as expected, but I have problems with the Linux
> installation: the desktop environment freezes for a few seconds once in a
> while. This occurs with Mint Cinnamon, OpenSuSe KDE, Ubuntu and Manjaro
> KDE. All of them are current installations. I'm now working in up to
> date Manjaro KDE (kernel 4.9.0).
> When the temporary freezes occur the mouse pointer moves, some windows
> are updated correctly, others are not, and the DE stops working.
> When this happens I always get a system log like this:

If that's the mvsas driver, god help you. I had nothing but trouble with it
for a good year or two or possibly more and it was never fixed or really
acknowledged, so I sold that card (a supermicro aoc-saslp-mv8) and picked up an
IBM m1015 (LSI 9211-8i a-like) and haven't looked back.

Basically it would reset a lot, and lock up. AND I believe it caused a lot of 
silent corruption as well.

> ______________________________________________________
> 23/12/16 8:29    kernel    sas: Enter sas_scsi_recover_host busy: 6
> failed: 6
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bd900
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bda8d2bd900
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bda8d2bd900 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bda8d2bd900 is done
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bce00
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bda8d2bce00
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bda8d2bce00 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bda8d2bce00 is done
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bd700
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bda8d2bd700
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bda8d2bd700 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bda8d2bd700 is done
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bde00
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bda8d2bde00
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bda8d2bde00 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bda8d2bde00 is done
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bda8d2bdc00
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bda8d2bdc00
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bda8d2bdc00 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bda8d2bdc00 is done
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bdcb0298800
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bdcb0298800
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bdcb0298800 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bdcb0298800 is done
> 23/12/16 8:29    kernel    sas: --- Exit sas_scsi_recover_host: busy: 0
> failed: 6 tries: 1
> 23/12/16 8:29    kernel    drivers/scsi/mvsas/mv_sas.c 1694:reuse same
> slot, retry command.
> 23/12/16 8:29    kernel    drivers/scsi/mvsas/mv_sas.c 1694:reuse same
> slot, retry command.
> 23/12/16 8:29    kernel    drivers/scsi/mvsas/mv_sas.c 1694:reuse same
> slot, retry command.
> 23/12/16 8:29    kernel    sas: Enter sas_scsi_recover_host busy: 2
> failed: 2
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bdb9453ae00
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bdb9453ae00
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bdb9453ae00 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bdb9453ae00 is done
> 23/12/16 8:29    kernel    sas: trying to find task 0xffff8bdb9453b000
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: aborting task
> 0xffff8bdb9453b000
> 23/12/16 8:29    kernel    sas: task done but aborted
> 23/12/16 8:29    kernel    sas: sas_scsi_find_task: task
> 0xffff8bdb9453b000 is done
> 23/12/16 8:29    kernel    sas: sas_eh_handle_sas_errors: task
> 0xffff8bdb9453b000 is done
> 23/12/16 8:29    kernel    sas: --- Exit sas_scsi_recover_host: busy: 0
> failed: 2 tries: 1
> ____________________________________________________________________________
> ______________
> 
> Sometimes shorter, sometimes longer.
> It looks like a controller/drive/cable problem?
> Any thoughts?
> 
> Thanks in advance.


-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

^ permalink raw reply

* Re: [PATCH v2 1/3] PM suspend/hibernate: Call notifier after freezing processes
From: Pali Rohár @ 2016-12-27 14:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: NeilBrown, Rafael J. Wysocki, Alasdair Kergon, Mike Snitzer,
	Neil Brown, Len Brown, Pavel Machek, dm-devel, linux-raid,
	Linux Kernel Mailing List, linux-pm@vger.kernel.org
In-Reply-To: <CAJZ5v0jGuLTJbsjW3WPe2Y8OgwMgcSSwTRNB8+PWaxwKpSDP-g@mail.gmail.com>

[-- Attachment #1: Type: Text/Plain, Size: 2736 bytes --]

On Wednesday 22 July 2015 01:03:23 Rafael J. Wysocki wrote:
> On Wed, Jul 22, 2015 at 1:00 AM, Rafael J. Wysocki <rafael@kernel.org>
> wrote:
> > Hi Neil,
> > 
> > On Wed, Jul 22, 2015 at 12:08 AM, NeilBrown <neilb@suse.com> wrote:
> >> On Mon, 20 Jul 2015 23:46:32 +0200 "Rafael J. Wysocki"
> >> 
> >> <rjw@rjwysocki.net> wrote:
> >>> On Monday, July 20, 2015 09:32:26 AM Pali Rohár wrote:
> >>> > On Saturday 18 July 2015 01:27:15 Rafael J. Wysocki wrote:
> >>> > > On Thursday, July 16, 2015 09:33:02 AM Pali Rohár wrote:
> >>> > > > On Thursday 16 July 2015 03:02:03 Rafael J. Wysocki wrote:
> >>> > > > > Also, if you're adding AFTER_FREEZE, it would be good to
> >>> > > > > add BEFORE_THAW too for symmetry.
> >>> > > > 
> >>> > > > But there is no use case for BEFORE_THAW. At least it is
> >>> > > > not needed for now.
> >>> > > 
> >>> > > For your use case, a single function pointer would be
> >>> > > sufficient too.
> >>> > 
> >>> > What do you mean by single function pointer? kernel/power is
> >>> > part of kernel image and dm-crypt is external kernel module.
> >>> 
> >>> Well, if there is a function pointer in the core suspend code
> >>> initially set to NULL and exported to modules such that the
> >>> dm-crypt code can set it to something else, that should be
> >>> sufficient, shouldn't it?
> >> 
> >> As long as the dm-crypt module is never unloaded.
> > 
> > OK, there is a race related to that.
> > 
> >> And as long as no other module could very possibly want
> >> functionality like this. Ever.
> > 
> > The point was that there were no other users currently, so dm-crypt
> > is going to be the only user for the time being.
> > 
> >> If a module wants to be notified - then providing a notifier chain
> >> really seems like the right thing to do...
> > 
> > Well, so please see my last response in this thread. :-)
> 
> So it was below: "Anyway, I guess the "post freeze" new one should be
> enough for now" which doesn't mean I'm really against the notifier.
> Or at least it is not supposed to mean so.  If there's any confusion
> related to that, I'm sorry.
> 
> Thanks,
> Rafael

In that case we are not able to distinguish whether the computer is going
to be hibernated or just suspended to RAM.

If we have both notifiers (one for suspend and one for hibernate) then
different actions can be configured for suspend and hibernate.

And it makes sense to configure different behaviour for suspend and for
hibernate. E.g. when the hibernation image is stored on an encrypted
partition, you do not have to wipe the keys before hibernating. But for
suspend to RAM you may want to wipe them.

-- 
Pali Rohár
pali.rohar@gmail.com

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2016-12-27 15:47 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang

'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for the raid1 I/O barrier. This idea
limits I/O barriers to happen only inside a sliding resync window; regular
I/Os outside this resync window no longer need to wait for the barrier.
On a large raid1 device, it helps a lot to improve parallel write
I/O throughput when background resync I/O is running at the
same time.

The idea of the sliding resync window is awesome, but it comes with
several challenges that are very difficult to solve,
 - code complexity
   The sliding resync window requires several variables to work
   collectively, which is complex and very hard to get right. Just grep
   "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
   fixing the original resync window patch. This is not the end; any
   further related modification may easily introduce more regressions.
 - multiple sliding resync windows
   Currently the raid1 code only has a single sliding resync window, so we
   cannot do parallel resync with the current I/O barrier implementation.
   Implementing multiple resync windows is much more complex, and very
   hard to get right.

Therefore I decided to implement a much simpler raid1 I/O barrier. By
removing the resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a global unique resync window
 - Use multiple hash buckets to reduce I/O barrier conflicts: regular
   I/O only has to wait for a resync I/O when both of them have the same
   barrier bucket index, and vice versa.
 - I/O barrier conflicts can be reduced to an acceptable number if there
   are enough barrier buckets

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
   A bio request won't cross a barrier unit boundary, which means the
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   9
        #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
   If multiple I/O requests hit different barrier units, they only need
   to compete for the I/O barrier with other I/Os which hit the same
   barrier bucket index. The index of the barrier bucket which a
   bio should look for is calculated by,
        int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
   where sector_nr is the start sector number of the bio. We use the
   function align_to_barrier_unit_end() to calculate the number of sectors
   from sector_nr to the next barrier unit boundary. If the requested bio
   goes across the boundary, we split the bio in raid1_make_request(), to
   make sure the final bio sent into generic_make_request() won't cross a
   barrier unit boundary.
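
The bucket lookup and boundary clamp described above can be sketched in
userspace C. This is only an illustration under stated assumptions, not
the code of this patch: BARRIER_UNIT_SECTOR_BITS is given an assumed
value of 17, bucket_idx() is a stand-in for the kernel's hash_long() on
a 64-bit machine (a multiplicative hash), and sectors_to_unit_end() is a
hypothetical helper that implements the behaviour described in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; the real ones are defined in raid1.h. */
#define BARRIER_UNIT_SECTOR_BITS 17
#define BARRIER_UNIT_SECTOR_SIZE (1ULL << BARRIER_UNIT_SECTOR_BITS)
#define BARRIER_BUCKETS_NR_BITS  9
#define BARRIER_BUCKETS_NR       (1 << BARRIER_BUCKETS_NR_BITS)

/*
 * Stand-in for hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS):
 * multiply by a 64-bit golden-ratio constant and keep the top bits.
 */
static int bucket_idx(uint64_t sector_nr)
{
	return (int)((sector_nr * 0x61C8864680B583EBULL) >>
		     (64 - BARRIER_BUCKETS_NR_BITS));
}

/*
 * Hypothetical helper: number of sectors from start to the end of the
 * barrier unit that start belongs to, capped at the request size.
 */
static uint64_t sectors_to_unit_end(uint64_t start, uint64_t sectors)
{
	uint64_t unit_end = (start | (BARRIER_UNIT_SECTOR_SIZE - 1)) + 1;
	uint64_t len = unit_end - start;

	return len < sectors ? len : sectors;
}

/* Small self-check of the two helpers. */
static void self_check(void)
{
	/* A request that fits inside one barrier unit is left whole. */
	assert(sectors_to_unit_end(0, 8) == 8);
	/* A request that crosses the boundary is clamped to the boundary. */
	assert(sectors_to_unit_end(BARRIER_UNIT_SECTOR_SIZE - 8, 16) == 8);
	/* Every sector maps to a valid bucket index. */
	assert(bucket_idx(123456789ULL) >= 0 &&
	       bucket_idx(123456789ULL) < BARRIER_BUCKETS_NR);
}
```

With the assumed unit size of 1<<17 sectors, a 16-sector request that
starts 8 sectors before a unit boundary is clamped to 8 sectors, and the
remainder would be submitted as a second bio.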

Compared to the single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will conflict within a single barrier unit. So it is similar to the
   single sliding resync window.
 - But a barrier bucket is shared by all barrier units with an identical
   barrier bucket index, so the probability of conflict might be higher
   than with a single sliding resync window, on condition that write I/Os
   always hit barrier units which have the same barrier bucket index as
   the resync I/Os. This is a very rare condition in real I/O workloads;
   I cannot imagine how it could happen in practice.
 - Therefore we can achieve a sufficiently low conflict rate with a much
   simpler barrier algorithm and implementation.

If the user has a (really) large raid1 device, for example 10PB in size,
we may just increase the number of buckets, BARRIER_BUCKETS_NR. For now
this is a macro; it could become a variable defined at raid1 creation
time in the future.

There are three changes that should be noticed,
 - In raid1d(), I change the code that decreases conf->nr_queued[idx] so
   that it runs inside a single loop; it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in the next patch
   of this patch set, it will be replaced by a single line of code,
        atomic_dec(&conf->nr_queued[idx]);
   So we don't need to worry about the spin lock cost here.
 - The original function raid1_make_request() is split into two functions,
   - raid1_make_read_request(): handles regular read requests and calls
     wait_read_barrier() for the I/O barrier.
   - raid1_make_write_request(): handles regular write requests and calls
     wait_barrier() for the I/O barrier.
   The difference is that wait_read_barrier() only waits if the array is
   frozen. Using a different barrier function in each code path makes the
   code cleaner and easier to read.
 - align_to_barrier_unit_end() is called to make sure both regular and
   resync I/O won't go across a barrier unit boundary.
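
The difference between the read and write barrier paths can be modelled
as two predicates. This is a minimal single-threaded sketch, assuming a
hypothetical struct barrier_model that keeps only the two fields the
wait conditions look at; locking, the wait queue and the nr_waiting and
nr_pending counters of the real code are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

#define BARRIER_BUCKETS_NR_BITS 9
#define BARRIER_BUCKETS_NR      (1 << BARRIER_BUCKETS_NR_BITS)

/* Hypothetical reduced model of the per-bucket barrier state. */
struct barrier_model {
	bool array_frozen;
	int  barrier[BARRIER_BUCKETS_NR];
};

/* A regular write waits if the array is frozen or its bucket is raised. */
static bool write_must_wait(const struct barrier_model *m, int idx)
{
	return m->array_frozen || m->barrier[idx] > 0;
}

/* A regular read only waits while the array is frozen. */
static bool read_must_wait(const struct barrier_model *m)
{
	return m->array_frozen;
}

/* Walk through the cases described in the text. */
static void barrier_model_demo(void)
{
	static struct barrier_model m;	/* zero-initialized */

	m.barrier[3] = 1;			/* resync raised bucket 3 */
	assert(write_must_wait(&m, 3));		/* same bucket: wait */
	assert(!write_must_wait(&m, 4));	/* other bucket: proceed */
	assert(!read_must_wait(&m));		/* reads ignore the barrier */

	m.array_frozen = true;
	assert(read_must_wait(&m));		/* frozen array blocks all */
	assert(write_must_wait(&m, 4));
}
```

A write to bucket 4 proceeds while resync holds bucket 3, which is the
parallelism the bucket scheme is meant to provide.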

Changelog
V1:
- Original RFC patch for comments
V2:
- Use bio_split() to split the original bio if it goes across a barrier
  unit boundary, to make the code simpler, as suggested by Shaohua and
  Neil.
- Use hash_long() to replace the original linear hash, to avoid a possible
  conflict between resync I/O and sequential write I/O, as suggested by
  Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
  control the number of parallel sync I/O barriers, as suggested by
  Shaohua.
- In the V1 patch the barrier-bucket-related members of r1conf below were
  allocated in a memory page. To make the code simpler, the V2 patch moves
  the memory space into struct r1conf, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change was suggested by Shaohua.
- Remove some irrelevant code comments, as suggested by Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 485 ++++++++++++++++++++++++++++++-----------------------
 drivers/md/raid1.h |  37 ++--
 2 files changed, 291 insertions(+), 231 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a1f3fbe..5813656 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -67,9 +67,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 #define raid1_log(md, fmt, args...)				\
 	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
@@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, r1_bio->sector);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -219,10 +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	int idx;
 
+	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -239,8 +239,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
-	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
 		unsigned long flags;
@@ -265,7 +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 		 * Wake up any possible resync thread that waits for the device
 		 * to go idle.
 		 */
-		allow_barrier(conf, start_next_window, bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 	}
 }
 
@@ -513,6 +511,25 @@ static void raid1_end_write_request(struct bio *bio)
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/* len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
+	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -809,168 +826,179 @@ static void flush_pending_writes(struct r1conf *conf)
  */
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
+	conf->total_barriers++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in sector number ranges corresponding to idx.
+	 * C: while conf->total_barriers >= RESYNC_DEPTH, meaning resync reach
+	 *    the max count which allowed on the whole raid1 device.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			     !conf->nr_pending[idx] &&
+			     conf->total_barriers < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
+	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
+
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->total_barriers--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		raid1_log(conf->mddev, "wait barrier");
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, int idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+/* conf->resync_lock should be held */
+static int get_all_pendings(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx];
+	return ret;
+}
+
+/* conf->resync_lock should be held */
+static int get_all_queued(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_queued[idx];
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
-	/* stop syncio and normal IO and wait for everything to
+	/* Stop sync I/O and normal I/O and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
-	 * This is called in the context of one normal IO request
-	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
-	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * This is called in two situations:
+	 * 1) management command handlers (reshape, remove disk, quiesce).
+	 * 2) one normal I/O request failed.
+
+	 * After array_frozen is set to 1, new sync I/O will be blocked at
+	 * raise_barrier(), and new normal I/O will be blocked at _wait_barrier().
+	 * The in-flight I/Os will either complete or be queued. When everything
+	 * goes quiet, there are only queued I/Os left.
+
+	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
+	 * barrier bucket index which this I/O request hits. When all sync and
+	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
+	 * of all conf->nr_queued[]. But normal I/O failure is an exception:
+	 * in handle_read_error(), we may call freeze_array() before trying to
+	 * fix the read error. In this case, the failed read I/O is not queued,
+	 * so get_all_pendings() == get_all_queued() + 1.
+	 *
+	 * Therefore before this function returns, we need to wait until
+	 * get_all_pendings(conf) gets equal to get_all_queued(conf)+extra. For
+	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
 	raid1_log(conf->mddev, "wait freeze");
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_all_pendings(conf) == get_all_queued(conf)+extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1066,64 +1094,23 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void raid1_make_request(struct mddev *mddev, struct bio * bio)
+static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
 	struct r1bio *r1_bio;
 	struct bio *read_bio;
-	int i, disks;
 	struct bitmap *bitmap;
-	unsigned long flags;
 	const int op = bio_op(bio);
-	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-	const unsigned long do_flush_fua = (bio->bi_opf &
-						(REQ_PREFLUSH | REQ_FUA));
-	struct md_rdev *blocked_rdev;
-	struct blk_plug_cb *cb;
-	struct raid1_plug_cb *plug = NULL;
-	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
+	int rdisk;
 
-	/*
-	 * Register the new request and wait if the reconstruction
-	 * thread has put up a bar for new requests.
-	 * Continue immediately if no resync is active currently.
+	/* Still need barrier for READ in case that whole
+	 * array is frozen.
 	 */
-
-	md_write_start(mddev, bio); /* wait on superblock update early */
-
-	if (bio_data_dir(bio) == WRITE &&
-	    ((bio_end_sector(bio) > mddev->suspend_lo &&
-	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
-	    (mddev_is_clustered(mddev) &&
-	     md_cluster_ops->area_resyncing(mddev, WRITE,
-		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
-		/* As the suspend_* range is controlled by
-		 * userspace, we want an interruptible
-		 * wait.
-		 */
-		DEFINE_WAIT(w);
-		for (;;) {
-			flush_signals(current);
-			prepare_to_wait(&conf->wait_barrier,
-					&w, TASK_INTERRUPTIBLE);
-			if (bio_end_sector(bio) <= mddev->suspend_lo ||
-			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
-			    (mddev_is_clustered(mddev) &&
-			     !md_cluster_ops->area_resyncing(mddev, WRITE,
-				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
-				break;
-			schedule();
-		}
-		finish_wait(&conf->wait_barrier, &w);
-	}
-
-	start_next_window = wait_barrier(conf, bio);
-
+	wait_read_barrier(conf, bio->bi_iter.bi_sector);
 	bitmap = mddev->bitmap;
 
 	/*
@@ -1149,12 +1136,9 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	bio->bi_phys_segments = 0;
 	bio_clear_flag(bio, BIO_SEG_VALID);
 
-	if (rw == READ) {
 		/*
 		 * read balancing logic:
 		 */
-		int rdisk;
-
 read_again:
 		rdisk = read_balance(conf, r1_bio, &max_sectors);
 
@@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 				   atomic_read(&bitmap->behind_writes) == 0);
 		}
 		r1_bio->read_disk = rdisk;
-		r1_bio->start_next_window = 0;
 
 		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 		} else
 			generic_make_request(read_bio);
 		return;
+}
+
+static void raid1_make_write_request(struct mddev *mddev, struct bio *bio)
+{
+	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio;
+	int i, disks;
+	struct bitmap *bitmap;
+	unsigned long flags;
+	const int op = bio_op(bio);
+	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
+	const unsigned long do_flush_fua = (bio->bi_opf &
+						(REQ_PREFLUSH | REQ_FUA));
+	struct md_rdev *blocked_rdev;
+	struct blk_plug_cb *cb;
+	struct raid1_plug_cb *plug = NULL;
+	int first_clone;
+	int sectors_handled;
+	int max_sectors;
+
+	/*
+	 * Register the new request and wait if the reconstruction
+	 * thread has put up a bar for new requests.
+	 * Continue immediately if no resync is active currently.
+	 */
+
+	md_write_start(mddev, bio); /* wait on superblock update early */
+
+	if (((bio_end_sector(bio) > mddev->suspend_lo &&
+	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
+	    (mddev_is_clustered(mddev) &&
+	     md_cluster_ops->area_resyncing(mddev, WRITE,
+		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
+		/* As the suspend_* range is controlled by
+		 * userspace, we want an interruptible
+		 * wait.
+		 */
+		DEFINE_WAIT(w);
+
+		for (;;) {
+			flush_signals(current);
+			prepare_to_wait(&conf->wait_barrier,
+					&w, TASK_INTERRUPTIBLE);
+			if (bio_end_sector(bio) <= mddev->suspend_lo ||
+			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
+			    (mddev_is_clustered(mddev) &&
+			     !md_cluster_ops->area_resyncing(
+						mddev,
+						WRITE,
+						bio->bi_iter.bi_sector,
+						bio_end_sector(bio))))
+				break;
+			schedule();
+		}
+		finish_wait(&conf->wait_barrier, &w);
 	}
 
+	wait_barrier(conf, bio->bi_iter.bi_sector);
+	bitmap = mddev->bitmap;
+
 	/*
-	 * WRITE:
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+
+	/* We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
 	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
+
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
 		raid1_log(mddev, "wait queued");
@@ -1256,7 +1317,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1324,25 +1384,15 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+		wait_barrier(conf, bio->bi_iter.bi_sector);
 		goto retry_write;
 	}
 
@@ -1464,6 +1514,31 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	wake_up(&conf->wait_barrier);
 }
 
+static void raid1_make_request(struct mddev *mddev, struct bio *bio)
+{
+	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
+	struct bio *split;
+	sector_t sectors;
+
+	make_request_fn = (bio_data_dir(bio) == READ) ?
+			  raid1_make_read_request :
+			  raid1_make_write_request;
+
+	/* if bio exceeds barrier unit boundary, split it */
+	do {
+		sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
+						    bio_sectors(bio));
+		if (sectors < bio_sectors(bio)) {
+			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+			bio_chain(split, bio);
+		} else {
+			split = bio;
+		}
+
+		make_request_fn(mddev, split);
+	} while (split != bio);
+}
+
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 {
 	struct r1conf *conf = mddev->private;
@@ -1552,19 +1627,11 @@ static void print_conf(struct r1conf *conf)
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -2311,8 +2378,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
 
 static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 {
-	int m;
+	int m, idx;
 	bool fail = false;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2338,7 +2406,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2460,6 +2529,7 @@ static void raid1d(struct md_thread *thread)
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	int idx;
 
 	md_check_recovery(mddev);
 
@@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread *thread)
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			idx = hash_long(r1_bio->sector,
+					BARRIER_BUCKETS_NR_BITS);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2498,7 +2569,8 @@ static void raid1d(struct md_thread *thread)
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2537,7 +2609,6 @@ static int init_resync(struct r1conf *conf)
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2615,7 +2687,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2642,6 +2714,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit boundary */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2927,9 +3001,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c52ef42..817115d 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -1,6 +1,14 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/* each barrier unit size is 64MB for now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+#define BARRIER_BUCKETS_NR_BITS		9
+#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +43,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +68,11 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			nr_pending[BARRIER_BUCKETS_NR];
+	int			nr_waiting[BARRIER_BUCKETS_NR];
+	int			nr_queued[BARRIER_BUCKETS_NR];
+	int			barrier[BARRIER_BUCKETS_NR];
+	int			total_barriers;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +125,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;
-- 
2.6.6



* [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Coly Li @ 2016-12-27 15:47 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1482853658-82535-1-git-send-email-colyli@suse.de>

When I ran a parallel read performance test on an md raid1 device built
from two NVMe SSDs, I observed surprisingly bad throughput: with fio at
64KB block size, 40 sequential read I/O jobs and an iodepth of 128, overall
throughput was only 2.7GB/s, around 50% of the ideal performance number.

perf reports lock contention in the allow_barrier() and wait_barrier()
code:
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is that these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
always take conf->resync_lock first, even when there is only regular
read I/O and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code that
only takes conf->resync_lock when it is really necessary.

The original idea is from Hannes Reinecke, and Neil Brown provided
comments to improve it. This patch is written on top of the new, simpler
raid1 I/O barrier code.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Calls _wait_barrier() in turn and is used for regular write I/O.
   If there is resync I/O happening on the same barrier bucket index, or
   the whole array is frozen, the task waits until there is no barrier on
   the same bucket index and the whole array is unfrozen.
 - wait_read_barrier()
   Regular read I/O won't interfere with resync I/O (read_balance()
   makes sure only up-to-date data is read out), so regular read I/O
   does not need to wait for the barrier; it only has to wait when the
   whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and
conf->barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier() in order to
avoid unnecessary spin locks in these functions. Once
conf->nr_pending[idx] is increased, a resync I/O with the same barrier
bucket index has to wait in raise_barrier(). Then, in _wait_barrier() or
wait_read_barrier(), if the array is not frozen and (for _wait_barrier())
no barrier is raised on the same barrier bucket index, the regular I/O
doesn't need to take conf->resync_lock; it can just increase
conf->nr_pending[idx] and return to its caller. For heavily parallel read
I/O, the lockless I/O barrier code gets rid of almost all spin lock cost.
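
The fast path described above can be sketched in userspace C11 as follows.
This is only an illustration, not the kernel code: struct barrier_state,
try_enter_write(), try_enter_read() and leave() are hypothetical names, and
default (sequentially consistent) C11 atomics stand in for the kernel's
atomic_t operations. A false return means the caller would fall back to the
slow path that sleeps under conf->resync_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_BUCKETS 512	/* mirrors BARRIER_BUCKETS_NR (1 << 9) */

static_assert((NR_BUCKETS & (NR_BUCKETS - 1)) == 0,
	      "bucket count is assumed to be a power of two");

/* Hypothetical stand-in for the barrier-related fields of struct r1conf. */
struct barrier_state {
	atomic_int nr_pending[NR_BUCKETS];
	atomic_int barrier[NR_BUCKETS];
	atomic_int array_frozen;
};

/*
 * Try to admit a regular write on bucket idx without taking any lock.
 * nr_pending[idx] is raised first so that a resync on the same bucket,
 * which waits for nr_pending[idx] == 0, cannot slip in afterwards.
 */
static bool try_enter_write(struct barrier_state *s, int idx)
{
	atomic_fetch_add(&s->nr_pending[idx], 1);
	if (!atomic_load(&s->array_frozen) && !atomic_load(&s->barrier[idx]))
		return true;
	/* Back out so the resync side is not blocked by a waiter. */
	atomic_fetch_sub(&s->nr_pending[idx], 1);
	return false;
}

/* Reads ignore the per-bucket barrier and only honour a frozen array. */
static bool try_enter_read(struct barrier_state *s, int idx)
{
	atomic_fetch_add(&s->nr_pending[idx], 1);
	if (!atomic_load(&s->array_frozen))
		return true;
	atomic_fetch_sub(&s->nr_pending[idx], 1);
	return false;
}

/* Counterpart of _allow_barrier(): the I/O on bucket idx has finished. */
static void leave(struct barrier_state *s, int idx)
{
	atomic_fetch_sub(&s->nr_pending[idx], 1);
}
```

With only regular reads in flight, every call takes the first return and no
spin lock is touched, which is where the throughput gain comes from.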

This patch significantly improves raid1 read performance. In my testing,
on a raid1 device built from two NVMe SSDs, running fio with 64KB block
size, 40 sequential read I/O jobs and an iodepth of 128, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Open question:
Shaohua points out that memory barriers should be added to some atomic
operations. I am now reading the documentation to learn how to add the
memory barriers correctly. If anyone has suggestions, please don't
hesitate to let me know.
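
To make the open question concrete, here is a userspace C11 sketch of the
ordering problem, under stated assumptions: struct bucket and the three
functions are hypothetical, and atomic_thread_fence(memory_order_seq_cst)
plays the role that a full barrier such as smp_mb__after_atomic() would
presumably play in the kernel. Each side publishes its own counter and then
reads the other side's counter; without a full fence in between, both sides
could read 0 and proceed together.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single barrier bucket. */
struct bucket {
	atomic_int nr_pending;	/* regular I/O in flight */
	atomic_int barrier;	/* resync barriers raised */
};

/* Regular I/O side: publish nr_pending, then look at barrier. */
static bool io_sees_no_barrier(struct bucket *b)
{
	atomic_fetch_add_explicit(&b->nr_pending, 1, memory_order_relaxed);
	/* Full fence: the increment must be ordered before the load. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&b->barrier, memory_order_relaxed) == 0;
}

/* Resync side: publish barrier, then look at nr_pending. */
static bool resync_sees_no_pending(struct bucket *b)
{
	atomic_fetch_add_explicit(&b->barrier, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&b->nr_pending, memory_order_relaxed) == 0;
}

/* Regular I/O completion. */
static void io_done(struct bucket *b)
{
	int prev = atomic_fetch_sub_explicit(&b->nr_pending, 1,
					     memory_order_release);
	assert(prev > 0);	/* completion without a matching start */
	(void)prev;		/* keeps NDEBUG builds warning-free */
}
```

With both fences in place, two threads racing from an all-zero bucket cannot
both get true, so either the I/O notices the barrier or the resync notices
the pending I/O.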

Changelog
V1:
- Original RFC patch for comments.
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no race when checking two
  atomic_t variables at the same time.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 134 +++++++++++++++++++++++++++++++----------------------
 drivers/md/raid1.h |  12 ++---
 2 files changed, 85 insertions(+), 61 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5813656..b1fb4c1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -222,7 +222,7 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued[idx]++;
+	atomic_inc(&conf->nr_queued[idx]);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -831,13 +831,13 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
-	conf->total_barriers++;
-
+	atomic_inc(&conf->barrier[idx]);
+	atomic_inc(&conf->total_barriers);
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
 	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
@@ -846,44 +846,69 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	 *    the max count which allowed on the whole raid1 device.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->array_frozen &&
-			     !conf->nr_pending[idx] &&
-			     conf->total_barriers < RESYNC_DEPTH,
+			    !atomic_read(&conf->array_frozen) &&
+			     !atomic_read(&conf->nr_pending[idx]) &&
+			     atomic_read(&conf->total_barriers) < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
-	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->total_barriers--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
+	BUG_ON(atomic_read(&conf->total_barriers) <= 0);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->total_barriers);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
 static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/* We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket. Also if the array is frozen, I/O
+	 * should be blocked until array is unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+
+	/* Don't worry about checking two atomic_t variables at the same time
+	 * here. conf->array_frozen MUST be checked first. The logic is,
+	 * if the array is frozen, no matter whether there is any barrier,
+	 * all I/O should be blocked. If there is no barrier in the current
+	 * barrier bucket, we still need to check whether the array is frozen,
+	 * otherwise I/O will happen on a frozen array, which is buggy.
+	 * If, while we check conf->barrier[idx], the array becomes frozen
+	 * (i.e. conf->array_frozen is set) and conf->barrier[idx] is 0, it is
+	 * safe to return and let the I/O continue. Because the array is
+	 * frozen, all I/O returned here will eventually complete or be
+	 * queued, see the code comment in freeze_array().
+	 */
+	if (!atomic_read(&conf->array_frozen) &&
+	    !atomic_read(&conf->barrier[idx]))
+		return;
 
-	conf->nr_pending[idx]++;
+	/* After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrier() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at the same time.
+	 */
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen) &&
+			     !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -891,18 +916,23 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/* Very similar to _wait_barrier(). The difference is, for read
+	 * I/O we don't need to wait for sync I/O, but if the whole array
+	 * is frozen, the read I/O still has to wait until the array is
+	 * unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	if (!atomic_read(&conf->array_frozen))
+		return;
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/* Wait for array to be unfrozen */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -923,11 +953,7 @@ static void wait_all_barriers(struct r1conf *conf)
 
 static void _allow_barrier(struct r1conf *conf, int idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -952,7 +978,7 @@ static int get_all_pendings(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx];
+		ret += atomic_read(&conf->nr_pending[idx]);
 	return ret;
 }
 
@@ -962,7 +988,7 @@ static int get_all_queued(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_queued[idx]);
 	return ret;
 }
 
@@ -992,7 +1018,7 @@ static void freeze_array(struct r1conf *conf, int extra)
 	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 1;
+	atomic_set(&conf->array_frozen, 1);
 	raid1_log(conf->mddev, "wait freeze");
 	wait_event_lock_irq_cmd(
 		conf->wait_barrier,
@@ -1005,7 +1031,7 @@ static void unfreeze_array(struct r1conf *conf)
 {
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 0;
+	atomic_set(&conf->array_frozen, 0);
 	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -2407,7 +2433,7 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
 		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
-		conf->nr_queued[idx]++;
+		atomic_inc(&conf->nr_queued[idx]);
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2546,9 +2572,7 @@ static void raid1d(struct md_thread *thread)
 			list_del(&r1_bio->retry_list);
 			idx = hash_long(r1_bio->sector,
 					BARRIER_BUCKETS_NR_BITS);
-			spin_lock_irqsave(&conf->device_lock, flags);
-			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			atomic_dec(&conf->nr_queued[idx]);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2570,7 +2594,7 @@ static void raid1d(struct md_thread *thread)
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
 		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
-		conf->nr_queued[idx]--;
+		atomic_dec(&conf->nr_queued[idx]);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2687,7 +2711,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -3316,7 +3340,7 @@ static void *raid1_takeover(struct mddev *mddev)
 		conf = setup_conf(mddev);
 		if (!IS_ERR(conf)) {
 			/* Array must appear to be quiesced */
-			conf->array_frozen = 1;
+			atomic_set(&conf->array_frozen, 1);
 			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
 			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
 		}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 817115d..bbe65f7 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -68,12 +68,12 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending[BARRIER_BUCKETS_NR];
-	int			nr_waiting[BARRIER_BUCKETS_NR];
-	int			nr_queued[BARRIER_BUCKETS_NR];
-	int			barrier[BARRIER_BUCKETS_NR];
-	int			total_barriers;
-	int			array_frozen;
+	atomic_t		nr_pending[BARRIER_BUCKETS_NR];
+	atomic_t		nr_waiting[BARRIER_BUCKETS_NR];
+	atomic_t		nr_queued[BARRIER_BUCKETS_NR];
+	atomic_t		barrier[BARRIER_BUCKETS_NR];
+	atomic_t		total_barriers;
+	atomic_t		array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
 	 * Cleared when a sync completes.
-- 
2.6.6



* [PATCH v1 00/54] block: support multipage bvec
From: Ming Lei @ 2016-12-27 15:55 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Al Viro, Andrew Morton,
	Bart Van Assche, Chaitanya Kulkarni, open list:GFS2 FILE SYSTEM,
	Damien Le Moal, Dan Williams, open list:DEVICE-MAPPER  LVM,
	open list:DRBD DRIVER, Eric Wheeler, Guoqing Jiang,
	Hannes Reinecke, Hannes Reinecke, Jiri Kosina, Joe Perches,
	Johannes Berg, Johannes Thumshirn, Kent Overstreet, linux-bcache

Hi,

This patchset brings multipage bvec into the block layer. Basic
xfstests (-a auto) over virtio-blk/virtio-scsi have been run
and no regression was found, so it should be good enough
to show the approach now. Any comments are welcome!

1) what is multipage bvec?

Multipage bvec means that one 'struct bio_vec' can hold
multiple physically contiguous pages, instead of the
single page that the Linux kernel has used for a long time.
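
As a rough illustration of this definition (not code from this patchset),
the userspace C sketch below coalesces a list of page frame numbers into
physically contiguous segments. struct mp_segment, coalesce_pages() and
the 4096-byte PAGE_SIZE_BYTES are assumptions made for the example; the
kernel's real 'struct bio_vec' holds a page pointer, a length and an offset.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u	/* assumed page size */

/* Hypothetical multipage segment: one entry may cover several pages. */
struct mp_segment {
	unsigned long first_pfn;	/* frame number of the first page */
	unsigned int len;		/* length in bytes */
};

/*
 * Merge each page into the previous segment when it directly follows it
 * in physical memory, otherwise start a new segment. Returns the number
 * of segments written to out; input beyond max_out segments is dropped.
 */
static size_t coalesce_pages(const unsigned long *pfns, size_t n,
			     struct mp_segment *out, size_t max_out)
{
	size_t nseg = 0;

	assert(pfns && out);
	for (size_t i = 0; i < n; i++) {
		if (nseg > 0) {
			struct mp_segment *last = &out[nseg - 1];
			unsigned long next_pfn =
				last->first_pfn + last->len / PAGE_SIZE_BYTES;

			if (pfns[i] == next_pfn) {
				last->len += PAGE_SIZE_BYTES;
				continue;
			}
		}
		if (nseg == max_out)
			break;
		out[nseg].first_pfn = pfns[i];
		out[nseg].len = PAGE_SIZE_BYTES;
		nseg++;
	}
	return nseg;
}
```

Six pages at frames 100, 101, 102, 200, 201 and 300 need six entries with
one page per bvec, but only three entries once contiguous pages are merged.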

2) why is multipage bvec introduced?

Kent proposed the idea[1] first. 

As system's RAM becomes much bigger than before, and 
at the same time huge page, transparent huge page and
memory compaction are widely used, it is a bit easy now
to see physically contiguous pages from fs in I/O.
On the other hand, from block layer's view, it isn't
necessary to store intermediate pages into bvec, and
it is enough to just store the physicallly contiguous
'segment'.

Also, huge pages are being brought to filesystems[2], so we
can do IO a hugepage at a time[3], which requires that one bio
can transfer at least one huge page at a time. It turns out that
simply changing BIO_MAX_PAGES isn't flexible[3]. Multipage bvec
fits this case very well.

With multipage bvec:

- bio size can be increased, which should in theory improve
some high-bandwidth IO cases[4].

- Inside the block layer, both bio splitting and sg mapping can
become more efficient than before by traversing just the
physically contiguous 'segment' instead of each page.

- in the future it may be possible to reduce the memory
footprint of bvec usage.

3) how is multipage bvec implemented in this patchset?

The first 9 patches comment on some special cases. As we saw,
most cases turn out to be safe for multipage bvec;
only fs/buffer, MD and btrfs need to be dealt with. Both fs/buffer
and btrfs are dealt with in the following patches, based on some
new block APIs for multipage bvec.

Given that a little more work is involved to clean up MD, this patchset
introduces QUEUE_FLAG_NO_MP for it, so this component can still
see/use singlepage bvecs. In the future, once the cleanup is done, the
flag can be killed.

The 2nd part (23 ~ 54) implements multipage bvec in block:

- put all tricks into the bvec/bio/rq iterators; as long as
drivers and fs use these standard iterators, they are happy
with multipage bvec

- bio_for_each_segment_all() changes
this helper passes a pointer to each bvec directly to the user, so
it has to be changed. Two new helpers (bio_for_each_segment_all_sp()
and bio_for_each_segment_all_mp()) are introduced.

Also convert current users of bio_for_each_segment_all() to the
above two.

- bio_clone() changes
By default bio_clone() still clones one new bio in the multipage bvec
way. A single page version of bio_clone() is also introduced
for some special cases, such as when only singlepage bvecs are used
for the newly cloned bio (bio bounce, ...)

- btrfs cleanup
just three patches for avoiding direct access to bvec table.
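
To illustrate what a singlepage ("sp") walk over a multipage bvec has
to do, here is a hedged userspace sketch. It is not the iterator code
from this patchset; struct sketch_bvec, sketch_split_to_sp() and
SKETCH_PAGE_SIZE are hypothetical, and pages are again modelled as
page frame numbers with an assumed 4096-byte page size.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical stand-in for a bvec: first page frame plus a byte range. */
struct sketch_bvec {
	unsigned long pfn;	/* first page frame of the segment */
	unsigned int len;	/* bytes covered, may exceed one page */
	unsigned int offset;	/* byte offset into the first page */
};

/*
 * Expand one multipage bvec into singlepage pieces, none of which
 * crosses a page boundary. Returns the number of pieces written to
 * 'out', or 0 if 'out' is too small (or the bvec is empty).
 */
static size_t sketch_split_to_sp(const struct sketch_bvec *mp,
				 struct sketch_bvec *out, size_t out_size)
{
	unsigned long pfn = mp->pfn;
	unsigned int offset = mp->offset;
	unsigned int remaining = mp->len;
	size_t n = 0;

	assert(mp->offset < SKETCH_PAGE_SIZE);

	while (remaining > 0) {
		/* bytes left in the current page, capped by what remains */
		unsigned int chunk = SKETCH_PAGE_SIZE - offset;

		if (chunk > remaining)
			chunk = remaining;
		if (n == out_size)
			return 0;

		out[n].pfn = pfn;
		out[n].offset = offset;
		out[n].len = chunk;
		n++;

		remaining -= chunk;
		pfn++;		/* next physically contiguous page */
		offset = 0;	/* only the first piece starts mid-page */
	}
	return n;
}
```

A caller that wants the "mp" view simply uses the original bvec as is,
while a caller that touches bv_page per page needs the expanded view.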

These patches can be found in the following git tree:

	https://github.com/ming1/linux/commits/mp-bvec-0.6-v4.10-rc

Thanks to Christoph for looking at the early version and providing
very good suggestions, such as introducing bio_init_with_vec_table(),
removing other unnecessary helpers for cleanup, and so on.

TODO:
	- cleanup direct access to bvec table for MD

V1:
	- against v4.10-rc1 and some cleanup in V0 are in -linus already
	- handle queue_virt_boundary() in mp bvec change and make NVMe happy
	- further BTRFS cleanup
	- remove QUEUE_FLAG_SPLIT_MP
	- rename for two new helpers of bio_for_each_segment_all()
	- fix bounce conversion
	- address comments in V0

[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=147735447100001&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2


Ming Lei (54):
  block: drbd: comment on direct access bvec table
  block: loop: comment on direct access to bvec table
  kernel/power/swap.c: comment on direct access to bvec table
  mm: page_io.c: comment on direct access to bvec table
  fs/buffer: comment on direct access to bvec table
  f2fs: f2fs_read_end_io: comment on direct access to bvec table
  bcache: comment on direct access to bvec table
  block: comment on bio_alloc_pages()
  block: comment on bio_iov_iter_get_pages()
  block: introduce flag QUEUE_FLAG_NO_MP
  md: set NO_MP for request queue of md
  dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE
  block: comments on bio_for_each_segment[_all]
  block: introduce multipage/single page bvec helpers
  block: implement sp version of bvec iterator helpers
  block: introduce bio_for_each_segment_mp()
  block: introduce bio_clone_sp()
  bvec_iter: introduce BVEC_ITER_ALL_INIT
  block: bounce: avoid direct access to bvec table
  block: bounce: don't access bio->bi_io_vec in copy_to_high_bio_irq
  block: introduce bio_can_convert_to_sp()
  block: bounce: convert multipage bvecs into singlepage
  bcache: handle bio_clone() & bvec updating for multipage bvecs
  blk-merge: compute bio->bi_seg_front_size efficiently
  block: blk-merge: try to make front segments in full size
  block: blk-merge: remove unnecessary check
  block: use bio_for_each_segment_mp() to compute segments count
  block: use bio_for_each_segment_mp() to map sg
  block: introduce bvec_for_each_sp_bvec()
  block: bio: introduce single/multi page version of
    bio_for_each_segment_all()
  block: introduce bio_segments_all()
  block: introduce bvec_get_last_sp()
  block: deal with dirtying pages for multipage bvec
  block: convert to singe/multi page version of
    bio_for_each_segment_all()
  bcache: convert to bio_for_each_segment_all_sp()
  dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()
  dm-crypt: convert to bio_for_each_segment_all_sp()
  md/raid1.c: convert to bio_for_each_segment_all_sp()
  fs/mpage: convert to bio_for_each_segment_all_sp()
  fs/direct-io: convert to bio_for_each_segment_all_sp()
  ext4: convert to bio_for_each_segment_all_sp()
  xfs: convert to bio_for_each_segment_all_sp()
  gfs2: convert to bio_for_each_segment_all_sp()
  f2fs: convert to bio_for_each_segment_all_sp()
  exofs: convert to bio_for_each_segment_all_sp()
  fs: crypto: convert to bio_for_each_segment_all_sp()
  fs/btrfs: convert to bio_for_each_segment_all_sp()
  fs/block_dev.c: convert to bio_for_each_segment_all_sp()
  fs/iomap.c: convert to bio_for_each_segment_all_sp()
  fs/buffer.c: use bvec iterator to truncate the bio
  btrfs: avoid access to .bi_vcnt directly
  btrfs: use bvec_get_last_sp to get the last singlepage bvec
  btrfs: comment on direct access bvec table
  block: enable multipage bvecs

 block/bio.c                      | 110 +++++++++++++++----
 block/blk-merge.c                | 227 +++++++++++++++++++++++++++++++--------
 block/blk-zoned.c                |   5 +-
 block/bounce.c                   |  75 +++++++++----
 drivers/block/drbd/drbd_bitmap.c |   1 +
 drivers/block/loop.c             |   5 +
 drivers/md/bcache/btree.c        |   4 +-
 drivers/md/bcache/debug.c        |  30 +++++-
 drivers/md/bcache/super.c        |   6 ++
 drivers/md/bcache/util.c         |   7 ++
 drivers/md/dm-crypt.c            |   4 +-
 drivers/md/dm.c                  |  11 +-
 drivers/md/md.c                  |  12 +++
 drivers/md/raid1.c               |   3 +-
 fs/block_dev.c                   |   6 +-
 fs/btrfs/check-integrity.c       |  12 ++-
 fs/btrfs/compression.c           |  12 ++-
 fs/btrfs/disk-io.c               |   3 +-
 fs/btrfs/extent_io.c             |  26 +++--
 fs/btrfs/extent_io.h             |   1 +
 fs/btrfs/file-item.c             |   6 +-
 fs/btrfs/inode.c                 |  34 ++++--
 fs/btrfs/raid56.c                |   6 +-
 fs/buffer.c                      |  24 +++--
 fs/crypto/crypto.c               |   3 +-
 fs/direct-io.c                   |   4 +-
 fs/exofs/ore.c                   |   3 +-
 fs/exofs/ore_raid.c              |   3 +-
 fs/ext4/page-io.c                |   3 +-
 fs/ext4/readpage.c               |   3 +-
 fs/f2fs/data.c                   |  13 ++-
 fs/gfs2/lops.c                   |   3 +-
 fs/gfs2/meta_io.c                |   3 +-
 fs/iomap.c                       |   3 +-
 fs/mpage.c                       |   3 +-
 fs/xfs/xfs_aops.c                |   3 +-
 include/linux/bio.h              | 164 ++++++++++++++++++++++++++--
 include/linux/blk_types.h        |   6 ++
 include/linux/blkdev.h           |   2 +
 include/linux/bvec.h             | 138 ++++++++++++++++++++++--
 kernel/power/swap.c              |   2 +
 mm/page_io.c                     |   2 +
 42 files changed, 829 insertions(+), 162 deletions(-)

-- 
2.7.4


^ permalink raw reply

* [PATCH v1 07/54] bcache: comment on direct access to bvec table
From: Ming Lei @ 2016-12-27 15:55 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Kent Overstreet,
	Shaohua Li, Guoqing Jiang, Zheng Liu, Mike Christie, Jiri Kosina,
	Eric Wheeler, Yijing Wang, Al Viro,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854250-13481-1-git-send-email-tom.leiming@gmail.com>

All of these direct accesses look safe once multipage bvec is supported.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/btree.c | 1 +
 drivers/md/bcache/super.c | 6 ++++++
 drivers/md/bcache/util.c  | 7 +++++++
 3 files changed, 14 insertions(+)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index a43eedd5804d..fc35cfb4d0f1 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -428,6 +428,7 @@ static void do_btree_node_write(struct btree *b)
 
 		continue_at(cl, btree_node_write_done, NULL);
 	} else {
+		/* No harm for multipage bvec since the new is just allocated */
 		b->bio->bi_vcnt = 0;
 		bch_bio_map(b->bio, i);
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3a19cbc8b230..607b022259dc 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -208,6 +208,7 @@ static void write_bdev_super_endio(struct bio *bio)
 
 static void __write_super(struct cache_sb *sb, struct bio *bio)
 {
+	/* single page bio, safe for multipage bvec */
 	struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page);
 	unsigned i;
 
@@ -1156,6 +1157,8 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	dc->bdev->bd_holder = dc;
 
 	bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
+
+	/* single page bio, safe for multipage bvec */
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
@@ -1799,6 +1802,7 @@ void bch_cache_release(struct kobject *kobj)
 	for (i = 0; i < RESERVE_NR; i++)
 		free_fifo(&ca->free[i]);
 
+	/* single page bio, safe for multipage bvec */
 	if (ca->sb_bio.bi_inline_vecs[0].bv_page)
 		put_page(ca->sb_bio.bi_io_vec[0].bv_page);
 
@@ -1854,6 +1858,8 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 	ca->bdev->bd_holder = ca;
 
 	bio_init(&ca->sb_bio, ca->sb_bio.bi_inline_vecs, 1);
+
+	/* single page bio, safe for multipage bvec */
 	ca->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index dde6172f3f10..5cc0b49a65fb 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -222,6 +222,13 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
 		: 0;
 }
 
+/*
+ * Generally it isn't good to access .bi_io_vec and .bi_vcnt
+ * directly, the preferred way is bio_add_page, but in
+ * this case, bch_bio_map() supposes that the bvec table
+ * is empty, so it is safe to access .bi_vcnt & .bi_io_vec
+ * in this way even after multipage bvec is supported.
+ */
 void bch_bio_map(struct bio *bio, void *base)
 {
 	size_t size = bio->bi_iter.bi_size;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v1 08/54] block: comment on bio_alloc_pages()
From: Ming Lei @ 2016-12-27 15:55 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Jens Axboe,
	Kent Overstreet, Shaohua Li, Mike Christie, Guoqing Jiang,
	Hannes Reinecke, open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854250-13481-1-git-send-email-tom.leiming@gmail.com>

This patch adds a comment on the usage of bio_alloc_pages(),
and also comments on one special case in bch_data_verify().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/bio.c               | 4 +++-
 drivers/md/bcache/debug.c | 6 ++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 2b375020fc49..d4a1e0b63ea0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -961,7 +961,9 @@ EXPORT_SYMBOL(bio_advance);
  * @bio: bio to allocate pages for
  * @gfp_mask: flags for allocation
  *
- * Allocates pages up to @bio->bi_vcnt.
+ * Allocates pages up to @bio->bi_vcnt, and this function should only
+ * be called on a new initialized bio, which means all pages aren't added
+ * to the bio via bio_add_page() yet.
  *
  * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are
  * freed.
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 06f55056aaae..48d03e8b3385 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,6 +110,12 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	struct bio_vec bv, cbv;
 	struct bvec_iter iter, citer = { 0 };
 
+	/*
+	 * Once multipage bvec is supported, the bio_clone()
+	 * has to make sure page count in this bio can be held
+	 * in the new cloned bio because each single page need
+	 * to assign to each bvec of the new bio.
+	 */
 	check = bio_clone(bio, GFP_NOIO);
 	if (!check)
 		return;
-- 
2.7.4


^ permalink raw reply related

* [PATCH v1 11/54] md: set NO_MP for request queue of md
From: Ming Lei @ 2016-12-27 15:56 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854250-13481-1-git-send-email-tom.leiming@gmail.com>

MD isn't ready for multipage bvecs, so mark it as
NO_MP.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/md.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 82821ee0d57f..63c6326bafde 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5162,6 +5162,16 @@ static void md_safemode_timeout(unsigned long data)
 
 static int start_dirty_degraded;
 
+/*
+ * MD isn't ready for multipage bvecs yet, and set the flag
+ * so that MD still can see singlepage bvecs bio
+ */
+static inline void md_set_no_mp(struct mddev *mddev)
+{
+	if (mddev->queue)
+		set_bit(QUEUE_FLAG_NO_MP, &mddev->queue->queue_flags);
+}
+
 int md_run(struct mddev *mddev)
 {
 	int err;
@@ -5381,6 +5391,8 @@ int md_run(struct mddev *mddev)
 	if (mddev->sb_flags)
 		md_update_sb(mddev, 0);
 
+	md_set_no_mp(mddev);
+
 	md_new_event(mddev);
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
-- 
2.7.4

^ permalink raw reply related

* [PATCH v1 12/54] dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE
From: Ming Lei @ 2016-12-27 15:56 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Alasdair Kergon,
	Mike Snitzer, maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854250-13481-1-git-send-email-tom.leiming@gmail.com>

For bio-based DM, some targets, such as the crypt target, aren't
ready to deal with incoming bios bigger than 1 MByte.
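
As a worked example of where the 1 MByte figure comes from, the
sketch below assumes BIO_MAX_PAGES is 256 and PAGE_SIZE is 4096; both
values are assumptions about the configuration, and the SKETCH_*
names and helpers are hypothetical. It works purely in bytes and does
not try to reproduce the units of the real max_io_len field.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIO_MAX_PAGES	256u	/* assumed value of BIO_MAX_PAGES */
#define SKETCH_PAGE_SIZE	4096u	/* assumed value of PAGE_SIZE */
#define SKETCH_SECTOR_SIZE	512u	/* 512-byte sectors */

/* Clamp a requested I/O size in bytes to BIO_MAX_PAGES * PAGE_SIZE. */
static uint32_t sketch_clamp_io_bytes(uint64_t requested_bytes)
{
	const uint32_t cap = SKETCH_BIO_MAX_PAGES * SKETCH_PAGE_SIZE;

	return requested_bytes < cap ? (uint32_t)requested_bytes : cap;
}

/* Convert a byte count to 512-byte sectors, rounding down. */
static uint32_t sketch_bytes_to_sectors(uint32_t bytes)
{
	return bytes / SKETCH_SECTOR_SIZE;
}
```

Under these assumptions the cap is 256 * 4096 = 1048576 bytes, which
is 1 MiB or 2048 sectors. Smaller requests pass through unchanged.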

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..6139bf7623f7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -899,7 +899,16 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
 		return -EINVAL;
 	}
 
-	ti->max_io_len = (uint32_t) len;
+	/*
+	 * BIO based queue uses its own splitting. When multipage bvecs
+	 * is switched on, size of the incoming bio may be too big to
+	 * be handled in some targets, such as crypt.
+	 *
+	 * When these targets are ready for the big bio, we can remove
+	 * the limit.
+	 */
+	ti->max_io_len = min_t(uint32_t, len,
+			       (BIO_MAX_PAGES * PAGE_SIZE));
 
 	return 0;
 }
-- 
2.7.4


^ permalink raw reply related

* [PATCH v1 23/54] bcache: handle bio_clone() & bvec updating for multipage bvecs
From: Ming Lei @ 2016-12-27 15:56 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Kent Overstreet,
	Shaohua Li, Mike Christie, Guoqing Jiang,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854250-13481-1-git-send-email-tom.leiming@gmail.com>

The incoming bio may be too big to be cloned into
one singlepage bvecs bio, so split the bio and
check the split bios one by one.
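
The control flow can be sketched in userspace as follows. This is a
simplified model, not the bcache code: a bio is reduced to a page
count, and sketch_verify_in_chunks(), sketch_verify_fn and
sketch_sum_pages() are hypothetical names. The loop peels off
full-sized pieces while the remainder is still too big, then handles
whatever is left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-piece callback: first page index, page count, context. */
typedef void (*sketch_verify_fn)(size_t first_page, size_t nr_pages,
				 void *ctx);

/* Example callback: accumulate the number of pages seen into *ctx. */
static void sketch_sum_pages(size_t first_page, size_t nr_pages, void *ctx)
{
	(void)first_page;
	*(size_t *)ctx += nr_pages;
}

/*
 * Verify 'nr_pages' pages in pieces of at most 'max_pages' pages.
 * Returns the number of pieces handed to 'verify'.
 */
static size_t sketch_verify_in_chunks(size_t nr_pages, size_t max_pages,
				      sketch_verify_fn verify, void *ctx)
{
	size_t done = 0;
	size_t chunks = 0;

	assert(max_pages > 0);

	/* split off full-sized pieces while the rest is still too big */
	while (nr_pages - done > max_pages) {
		verify(done, max_pages, ctx);
		done += max_pages;
		chunks++;
	}

	/* the remainder now fits in one piece, if anything is left */
	if (nr_pages - done > 0) {
		verify(done, nr_pages - done, ctx);
		chunks++;
	}
	return chunks;
}
```

For example, 600 pages with a limit of 256 pages per piece are
verified as three pieces of 256, 256 and 88 pages.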

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/debug.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 48d03e8b3385..18b2d2d138e3 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -103,7 +103,7 @@ void bch_btree_verify(struct btree *b)
 	up(&b->io_mutex);
 }
 
-void bch_data_verify(struct cached_dev *dc, struct bio *bio)
+static void __bch_data_verify(struct cached_dev *dc, struct bio *bio)
 {
 	char name[BDEVNAME_SIZE];
 	struct bio *check;
@@ -116,7 +116,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	 * in the new cloned bio because each single page need
 	 * to assign to each bvec of the new bio.
 	 */
-	check = bio_clone(bio, GFP_NOIO);
+	check = bio_clone_sp(bio, GFP_NOIO);
 	if (!check)
 		return;
 	check->bi_opf = REQ_OP_READ;
@@ -151,6 +151,26 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	bio_put(check);
 }
 
+void bch_data_verify(struct cached_dev *dc, struct bio *bio)
+{
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	struct bio *clone = bio_clone_fast(bio, GFP_NOIO, q->bio_split);
+	unsigned sectors;
+
+	while (!bio_can_convert_to_sp(clone, &sectors)) {
+		struct bio *split = bio_split(clone, sectors,
+					      GFP_NOIO, q->bio_split);
+
+		__bch_data_verify(dc, split);
+		bio_put(split);
+	}
+
+	if (bio_sectors(clone))
+		__bch_data_verify(dc, clone);
+
+	bio_put(clone);
+}
+
 #endif
 
 #ifdef CONFIG_DEBUG_FS
-- 
2.7.4


^ permalink raw reply related

* [PATCH v1 35/54] bcache: convert to bio_for_each_segment_all_sp()
From: Ming Lei @ 2016-12-27 16:04 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Kent Overstreet,
	Shaohua Li, Hannes Reinecke, Takashi Iwai, Jiri Kosina,
	Mike Christie, Guoqing Jiang, Zheng Liu,
	open list:BCACHE BLOCK LAYER CACHE,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854706-14128-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/bcache/btree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index fc35cfb4d0f1..61d40b1903f3 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -419,8 +419,9 @@ static void do_btree_node_write(struct btree *b)
 		int j;
 		struct bio_vec *bv;
 		void *base = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
+		struct bvec_iter_all bia;
 
-		bio_for_each_segment_all(bv, b->bio, j)
+		bio_for_each_segment_all_sp(bv, b->bio, j, bia)
 			memcpy(page_address(bv->bv_page),
 			       base + j * PAGE_SIZE, PAGE_SIZE);
 
-- 
2.7.4


^ permalink raw reply related

* [PATCH v1 36/54] dm-crypt: don't clear bvec->bv_page in crypt_free_buffer_pages()
From: Ming Lei @ 2016-12-27 16:04 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Alasdair Kergon,
	Mike Snitzer, maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854706-14128-1-git-send-email-tom.leiming@gmail.com>

The bio is always freed after running crypt_free_buffer_pages(),
so it isn't necessary to clear the bv->bv_page.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-crypt.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 7c6c57216bf2..593cdf88bf5f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1042,7 +1042,6 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 	bio_for_each_segment_all(bv, clone, i) {
 		BUG_ON(!bv->bv_page);
 		mempool_free(bv->bv_page, cc->page_pool);
-		bv->bv_page = NULL;
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH v1 37/54] dm-crypt: convert to bio_for_each_segment_all_sp()
From: Ming Lei @ 2016-12-27 16:04 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Alasdair Kergon,
	Mike Snitzer, maintainer:DEVICE-MAPPER LVM, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854706-14128-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/dm-crypt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 593cdf88bf5f..c6932fb85418 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1038,8 +1038,9 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
 	unsigned int i;
 	struct bio_vec *bv;
+	struct bvec_iter_all bia;
 
-	bio_for_each_segment_all(bv, clone, i) {
+	bio_for_each_segment_all_sp(bv, clone, i, bia) {
 		BUG_ON(!bv->bv_page);
 		mempool_free(bv->bv_page, cc->page_pool);
 	}
-- 
2.7.4

^ permalink raw reply related

* [PATCH v1 38/54] md/raid1.c: convert to bio_for_each_segment_all_sp()
From: Ming Lei @ 2016-12-27 16:04 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854706-14128-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a1f3fbed9100..4818f40e7ce9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -988,12 +988,13 @@ static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)
 {
 	int i;
 	struct bio_vec *bvec;
+	struct bvec_iter_all bia;
 	struct bio_vec *bvecs = kzalloc(bio->bi_vcnt * sizeof(struct bio_vec),
 					GFP_NOIO);
 	if (unlikely(!bvecs))
 		return;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all_sp(bvec, bio, i, bia) {
 		bvecs[i] = *bvec;
 		bvecs[i].bv_page = alloc_page(GFP_NOIO);
 		if (unlikely(!bvecs[i].bv_page))
-- 
2.7.4


^ permalink raw reply related
