* Re: [Lsf] Preliminary Agenda and Activities for LSF [not found] <1301373398.2590.20.camel@mulgrave.site> @ 2011-03-29 5:14 ` Amir Goldstein 2011-03-29 11:16 ` Ric Wheeler 2011-03-29 17:35 ` Chad Talbott 2 siblings, 0 replies; 138+ messages in thread From: Amir Goldstein @ 2011-03-29 5:14 UTC (permalink / raw) To: James Bottomley; +Cc: lsf, linux-fsdevel On Tue, Mar 29, 2011 at 6:36 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > Hi All, > > Since LSF is less than a week away, the programme committee put together > a just in time preliminary agenda for LSF. As you can see there is > still plenty of empty space, which you can make suggestions (to this > list with appropriate general list cc's) for filling: Hi James, I would like to give a session with an overview of Ext4 snapshots and a development status update. Perhaps on the 2nd day? I would also like to run a session on common APIs and common challenges for snapshotting file systems (ext4, ocfs2, nilfs2 and btrfs are the ones I know of). Amir. > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > If you don't make suggestions, the programme committee will feel > empowered to make arbitrary assignments based on your topic and attendee > email requests ... > > We're still not quite sure what rooms we will have at the Kabuki, but > we'll add them to the spreadsheet when we know (they should be close to > each other). > > The spreadsheet above also gives contact information for all the > attendees and the programme committee. > > Yours, > > James Bottomley > on behalf of LSF/MM Programme Committee > > > _______________________________________________ > Lsf mailing list > Lsf@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/lsf > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF [not found] <1301373398.2590.20.camel@mulgrave.site> 2011-03-29 5:14 ` [Lsf] Preliminary Agenda and Activities for LSF Amir Goldstein @ 2011-03-29 11:16 ` Ric Wheeler 2011-03-29 11:22 ` Matthew Wilcox ` (4 more replies) 2011-03-29 17:35 ` Chad Talbott 2 siblings, 5 replies; 138+ messages in thread From: Ric Wheeler @ 2011-03-29 11:16 UTC (permalink / raw) To: James Bottomley Cc: lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On 03/29/2011 12:36 AM, James Bottomley wrote: > Hi All, > > Since LSF is less than a week away, the programme committee put together > a just in time preliminary agenda for LSF. As you can see there is > still plenty of empty space, which you can make suggestions (to this > list with appropriate general list cc's) for filling: > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > If you don't make suggestions, the programme committee will feel > empowered to make arbitrary assignments based on your topic and attendee > email requests ... > > We're still not quite sure what rooms we will have at the Kabuki, but > we'll add them to the spreadsheet when we know (they should be close to > each other). > > The spreadsheet above also gives contact information for all the > attendees and the programme committee. > > Yours, > > James Bottomley > on behalf of LSF/MM Programme Committee > Here are a few topic ideas: (1) The first topic that might span IO & FS tracks (or just pull in device mapper people to an FS track) could be adding new commands that would allow users to grow/shrink/etc file systems in a generic way. The thought I had was that we have a reasonable model that we could reuse for these new commands like mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it could be nice to identify exactly what common operations users want to do and agree on how to implement them. Alasdair pointed out in the upstream thread that we had a prototype here in fsadm. (2) Very high speed, low latency SSD devices and testing. Have we settled on the need for these devices to all have block level drivers? For S-ATA or SAS devices, are there known performance issues that require enhancements in somewhere in the stack? (3) The union mount versus overlayfs debate - pros and cons. What each do well, what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes in Al's VFS session?) Thanks! Ric ^ permalink raw reply [flat|nested] 138+ messages in thread
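A concrete reference point for Ric's topic (1): lvm2's fsadm already hides the per-filesystem grow/shrink tools (resize2fs, xfs_growfs and friends) behind one command, which is roughly the model being proposed for a generic interface. A minimal sketch of how it is used today (the VG/LV names and sizes below are made up, and online shrink support still varies by filesystem):

    # grow the LV, then let fsadm pick the right per-filesystem resize tool
    lvresize -L +10G /dev/vg0/data
    fsadm resize /dev/vg0/data

    # or do both in one step
    lvresize -r -L +10G /dev/vg0/data

The open question is whether a mount.fs/fsck.fs-style convention (hypothetically resize.ext4, resize.btrfs, ...) should replace this kind of wrapper so that every filesystem, btrfs included, exposes the same set of operations.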
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 11:16 ` Ric Wheeler @ 2011-03-29 11:22 ` Matthew Wilcox 2011-03-29 12:17 ` Jens Axboe 2011-03-29 17:20 ` Shyam_Iyer ` (3 subsequent siblings) 4 siblings, 1 reply; 138+ messages in thread From: Matthew Wilcox @ 2011-03-29 11:22 UTC (permalink / raw) To: Ric Wheeler Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Tue, Mar 29, 2011 at 07:16:32AM -0400, Ric Wheeler wrote: > (2) Very high speed, low latency SSD devices and testing. Have we settled > on the need for these devices to all have block level drivers? For S-ATA > or SAS devices, are there known performance issues that require > enhancements in somewhere in the stack? I can throw together a quick presentation on this topic. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 11:22 ` Matthew Wilcox @ 2011-03-29 12:17 ` Jens Axboe 2011-03-29 13:09 ` Martin K. Petersen 0 siblings, 1 reply; 138+ messages in thread From: Jens Axboe @ 2011-03-29 12:17 UTC (permalink / raw) To: Matthew Wilcox Cc: lsf, linux-fsdevel, device-mapper development, Ric Wheeler, linux-scsi@vger.kernel.org On 2011-03-29 13:22, Matthew Wilcox wrote: > On Tue, Mar 29, 2011 at 07:16:32AM -0400, Ric Wheeler wrote: >> (2) Very high speed, low latency SSD devices and testing. Have we settled >> on the need for these devices to all have block level drivers? For S-ATA >> or SAS devices, are there known performance issues that require >> enhancements in somewhere in the stack? > > I can throw together a quick presentation on this topic. I'll join that too. -- Jens Axboe ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 12:17 ` Jens Axboe @ 2011-03-29 13:09 ` Martin K. Petersen 2011-03-29 13:12 ` Ric Wheeler 2011-03-29 13:38 ` James Bottomley 0 siblings, 2 replies; 138+ messages in thread From: Martin K. Petersen @ 2011-03-29 13:09 UTC (permalink / raw) To: Jens Axboe Cc: Matthew Wilcox, lsf, linux-fsdevel, device-mapper development, Ric Wheeler, linux-scsi@vger.kernel.org >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: >> I can throw together a quick presentation on this topic. Jens> I'll join that too. Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll cover what's going on with the SCSI over PCIe efforts... -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 13:09 ` Martin K. Petersen @ 2011-03-29 13:12 ` Ric Wheeler 2011-03-29 13:38 ` James Bottomley 1 sibling, 0 replies; 138+ messages in thread From: Ric Wheeler @ 2011-03-29 13:12 UTC (permalink / raw) To: Martin K. Petersen Cc: Jens Axboe, linux-scsi@vger.kernel.org, lsf, device-mapper development, linux-fsdevel, Ric Wheeler On 03/29/2011 09:09 AM, Martin K. Petersen wrote: > > Jens> I'll join that too. > > Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll > cover what's going on with the SCSI over PCIe efforts... That sounds interesting to me... Ric ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 13:09 ` Martin K. Petersen 2011-03-29 13:12 ` Ric Wheeler @ 2011-03-29 13:38 ` James Bottomley 1 sibling, 0 replies; 138+ messages in thread From: James Bottomley @ 2011-03-29 13:38 UTC (permalink / raw) To: Martin K. Petersen Cc: Jens Axboe, Matthew Wilcox, lsf, linux-fsdevel, device-mapper development, Ric Wheeler, linux-scsi@vger.kernel.org On Tue, 2011-03-29 at 09:09 -0400, Martin K. Petersen wrote: > >>>>> "Jens" == Jens Axboe <jaxboe@fusionio.com> writes: > > >> I can throw together a quick presentation on this topic. > > Jens> I'll join that too. > > Stack tuning aside, maybe Matthew can speak a bit about NVMe and I'll > cover what's going on with the SCSI over PCIe efforts... OK, I put you down for a joint session with FS and IO after the tea break on Tuesday. James ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 11:16 ` Ric Wheeler 2011-03-29 11:22 ` Matthew Wilcox @ 2011-03-29 17:20 ` Shyam_Iyer 2011-03-29 17:33 ` Vivek Goyal 2011-03-29 19:47 ` Nicholas A. Bellinger ` (2 subsequent siblings) 4 siblings, 1 reply; 138+ messages in thread From: Shyam_Iyer @ 2011-03-29 17:20 UTC (permalink / raw) To: rwheeler, James.Bottomley; +Cc: lsf, linux-fsdevel, linux-scsi, dm-devel > -----Original Message----- > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi- > owner@vger.kernel.org] On Behalf Of Ric Wheeler > Sent: Tuesday, March 29, 2011 7:17 AM > To: James Bottomley > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux- > scsi@vger.kernel.org; device-mapper development > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > On 03/29/2011 12:36 AM, James Bottomley wrote: > > Hi All, > > > > Since LSF is less than a week away, the programme committee put > together > > a just in time preliminary agenda for LSF. As you can see there is > > still plenty of empty space, which you can make suggestions (to this > > list with appropriate general list cc's) for filling: > > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > If you don't make suggestions, the programme committee will feel > > empowered to make arbitrary assignments based on your topic and > attendee > > email requests ... > > > > We're still not quite sure what rooms we will have at the Kabuki, but > > we'll add them to the spreadsheet when we know (they should be close > to > > each other). > > > > The spreadsheet above also gives contact information for all the > > attendees and the programme committee. > > > > Yours, > > > > James Bottomley > > on behalf of LSF/MM Programme Committee > > > > Here are a few topic ideas: > > (1) The first topic that might span IO & FS tracks (or just pull in > device > mapper people to an FS track) could be adding new commands that would > allow > users to grow/shrink/etc file systems in a generic way. The thought I > had was > that we have a reasonable model that we could reuse for these new > commands like > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the > road, it > could be nice to identify exactly what common operations users want to > do and > agree on how to implement them. Alasdair pointed out in the upstream > thread that > we had a prototype here in fsadm. > > (2) Very high speed, low latency SSD devices and testing. Have we > settled on the > need for these devices to all have block level drivers? For S-ATA or > SAS > devices, are there known performance issues that require enhancements > in > somewhere in the stack? > > (3) The union mount versus overlayfs debate - pros and cons. What each > do well, > what needs doing. Do we want/need both upstream? (Maybe this can get 10 > minutes > in Al's VFS session?) > > Thanks! > > Ric A few others that I think may span across I/O, Block fs..layers. 1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile. Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil. 2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. Architecting guidelines for accepting patches for hybrid devices. 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. 
Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem. -Shyam ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 17:20 ` Shyam_Iyer @ 2011-03-29 17:33 ` Vivek Goyal 2011-03-29 18:10 ` Shyam_Iyer 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-03-29 17:33 UTC (permalink / raw) To: Shyam_Iyer Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote: > > > > -----Original Message----- > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi- > > owner@vger.kernel.org] On Behalf Of Ric Wheeler > > Sent: Tuesday, March 29, 2011 7:17 AM > > To: James Bottomley > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux- > > scsi@vger.kernel.org; device-mapper development > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > > > On 03/29/2011 12:36 AM, James Bottomley wrote: > > > Hi All, > > > > > > Since LSF is less than a week away, the programme committee put > > together > > > a just in time preliminary agenda for LSF. As you can see there is > > > still plenty of empty space, which you can make suggestions (to this > > > list with appropriate general list cc's) for filling: > > > > > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > > > If you don't make suggestions, the programme committee will feel > > > empowered to make arbitrary assignments based on your topic and > > attendee > > > email requests ... > > > > > > We're still not quite sure what rooms we will have at the Kabuki, but > > > we'll add them to the spreadsheet when we know (they should be close > > to > > > each other). > > > > > > The spreadsheet above also gives contact information for all the > > > attendees and the programme committee. > > > > > > Yours, > > > > > > James Bottomley > > > on behalf of LSF/MM Programme Committee > > > > > > > Here are a few topic ideas: > > > > (1) The first topic that might span IO & FS tracks (or just pull in > > device > > mapper people to an FS track) could be adding new commands that would > > allow > > users to grow/shrink/etc file systems in a generic way. The thought I > > had was > > that we have a reasonable model that we could reuse for these new > > commands like > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the > > road, it > > could be nice to identify exactly what common operations users want to > > do and > > agree on how to implement them. Alasdair pointed out in the upstream > > thread that > > we had a prototype here in fsadm. > > > > (2) Very high speed, low latency SSD devices and testing. Have we > > settled on the > > need for these devices to all have block level drivers? For S-ATA or > > SAS > > devices, are there known performance issues that require enhancements > > in > > somewhere in the stack? > > > > (3) The union mount versus overlayfs debate - pros and cons. What each > > do well, > > what needs doing. Do we want/need both upstream? (Maybe this can get 10 > > minutes > > in Al's VFS session?) > > > > Thanks! > > > > Ric > > A few others that I think may span across I/O, Block fs..layers. > > 1) Dm-thinp target vs File system thin profile vs block map based thin/trim profile. > Facilitate I/O throttling for thin/trimmable storage. Online and Offline profil. Is above any different from block IO throttling we have got for block devices? > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters floating around in sysfs, procfs. 
Architecting guidelines for accepting patches for hybrid devices. > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for all and they have to help each other > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your subsystem and there are many non-cooperating B/W control constructs in each subsystem. Above is pretty generic. Do you have specific needs/ideas/concerns? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 17:33 ` Vivek Goyal @ 2011-03-29 18:10 ` Shyam_Iyer 2011-03-29 18:45 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Shyam_Iyer @ 2011-03-29 18:10 UTC (permalink / raw) To: vgoyal; +Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi > -----Original Message----- > From: Vivek Goyal [mailto:vgoyal@redhat.com] > Sent: Tuesday, March 29, 2011 1:34 PM > To: Iyer, Shyam > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com; > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm- > devel@redhat.com; linux-scsi@vger.kernel.org > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote: > > > > > > > -----Original Message----- > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi- > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler > > > Sent: Tuesday, March 29, 2011 7:17 AM > > > To: James Bottomley > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux- > > > scsi@vger.kernel.org; device-mapper development > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > > > > > On 03/29/2011 12:36 AM, James Bottomley wrote: > > > > Hi All, > > > > > > > > Since LSF is less than a week away, the programme committee put > > > together > > > > a just in time preliminary agenda for LSF. As you can see there > is > > > > still plenty of empty space, which you can make suggestions (to > this > > > > list with appropriate general list cc's) for filling: > > > > > > > > > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > > > > > If you don't make suggestions, the programme committee will feel > > > > empowered to make arbitrary assignments based on your topic and > > > attendee > > > > email requests ... > > > > > > > > We're still not quite sure what rooms we will have at the Kabuki, > but > > > > we'll add them to the spreadsheet when we know (they should be > close > > > to > > > > each other). > > > > > > > > The spreadsheet above also gives contact information for all the > > > > attendees and the programme committee. > > > > > > > > Yours, > > > > > > > > James Bottomley > > > > on behalf of LSF/MM Programme Committee > > > > > > > > > > Here are a few topic ideas: > > > > > > (1) The first topic that might span IO & FS tracks (or just pull > in > > > device > > > mapper people to an FS track) could be adding new commands that > would > > > allow > > > users to grow/shrink/etc file systems in a generic way. The > thought I > > > had was > > > that we have a reasonable model that we could reuse for these new > > > commands like > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the > > > road, it > > > could be nice to identify exactly what common operations users want > to > > > do and > > > agree on how to implement them. Alasdair pointed out in the > upstream > > > thread that > > > we had a prototype here in fsadm. > > > > > > (2) Very high speed, low latency SSD devices and testing. Have we > > > settled on the > > > need for these devices to all have block level drivers? For S-ATA > or > > > SAS > > > devices, are there known performance issues that require > enhancements > > > in > > > somewhere in the stack? > > > > > > (3) The union mount versus overlayfs debate - pros and cons. What > each > > > do well, > > > what needs doing. Do we want/need both upstream? 
(Maybe this can > get 10 > > > minutes > > > in Al's VFS session?) > > > > > > Thanks! > > > > > > Ric > > > > A few others that I think may span across I/O, Block fs..layers. > > > > 1) Dm-thinp target vs File system thin profile vs block map based > thin/trim profile. > > > Facilitate I/O throttling for thin/trimmable storage. Online and > Offline profil. > > Is above any different from block IO throttling we have got for block > devices? > Yes.. so the throttling would be capacity based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O. > > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters > floating around in sysfs, procfs. Architecting guidelines for accepting > patches for hybrid devices. > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for > all and they have to help each other For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too.. If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent. The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your > subsystem and there are many non-cooperating B/W control constructs in > each subsystem. > > Above is pretty generic. Do you have specific needs/ideas/concerns? > > Thanks > Vivek Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver. The TC classes route the network I/O to multiqueue groups and so theoretically you could have block queues 1:1 with the number of network multiqueues.. -Shyam ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 18:10 ` Shyam_Iyer @ 2011-03-29 18:45 ` Vivek Goyal 2011-03-29 19:13 ` Shyam_Iyer 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-03-29 18:45 UTC (permalink / raw) To: Shyam_Iyer Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote: > > > > -----Original Message----- > > From: Vivek Goyal [mailto:vgoyal@redhat.com] > > Sent: Tuesday, March 29, 2011 1:34 PM > > To: Iyer, Shyam > > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com; > > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm- > > devel@redhat.com; linux-scsi@vger.kernel.org > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > > > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com wrote: > > > > > > > > > > -----Original Message----- > > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi- > > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler > > > > Sent: Tuesday, March 29, 2011 7:17 AM > > > > To: James Bottomley > > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux- > > > > scsi@vger.kernel.org; device-mapper development > > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > > > > > > > On 03/29/2011 12:36 AM, James Bottomley wrote: > > > > > Hi All, > > > > > > > > > > Since LSF is less than a week away, the programme committee put > > > > together > > > > > a just in time preliminary agenda for LSF. As you can see there > > is > > > > > still plenty of empty space, which you can make suggestions (to > > this > > > > > list with appropriate general list cc's) for filling: > > > > > > > > > > > > > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz > > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > > > > > > > If you don't make suggestions, the programme committee will feel > > > > > empowered to make arbitrary assignments based on your topic and > > > > attendee > > > > > email requests ... > > > > > > > > > > We're still not quite sure what rooms we will have at the Kabuki, > > but > > > > > we'll add them to the spreadsheet when we know (they should be > > close > > > > to > > > > > each other). > > > > > > > > > > The spreadsheet above also gives contact information for all the > > > > > attendees and the programme committee. > > > > > > > > > > Yours, > > > > > > > > > > James Bottomley > > > > > on behalf of LSF/MM Programme Committee > > > > > > > > > > > > > Here are a few topic ideas: > > > > > > > > (1) The first topic that might span IO & FS tracks (or just pull > > in > > > > device > > > > mapper people to an FS track) could be adding new commands that > > would > > > > allow > > > > users to grow/shrink/etc file systems in a generic way. The > > thought I > > > > had was > > > > that we have a reasonable model that we could reuse for these new > > > > commands like > > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the > > > > road, it > > > > could be nice to identify exactly what common operations users want > > to > > > > do and > > > > agree on how to implement them. Alasdair pointed out in the > > upstream > > > > thread that > > > > we had a prototype here in fsadm. > > > > > > > > (2) Very high speed, low latency SSD devices and testing. Have we > > > > settled on the > > > > need for these devices to all have block level drivers? 
For S-ATA > > or > > > > SAS > > > > devices, are there known performance issues that require > > enhancements > > > > in > > > > somewhere in the stack? > > > > > > > > (3) The union mount versus overlayfs debate - pros and cons. What > > each > > > > do well, > > > > what needs doing. Do we want/need both upstream? (Maybe this can > > get 10 > > > > minutes > > > > in Al's VFS session?) > > > > > > > > Thanks! > > > > > > > > Ric > > > > > > A few others that I think may span across I/O, Block fs..layers. > > > > > > 1) Dm-thinp target vs File system thin profile vs block map based > > thin/trim profile. > > > > > Facilitate I/O throttling for thin/trimmable storage. Online and > > Offline profil. > > > > Is above any different from block IO throttling we have got for block > > devices? > > > Yes.. so the throttling would be capacity based.. when the storage array wants us to throttle the I/O. Depending on the event we may keep getting space allocation write protect check conditions for writes until a user intervenes to stop I/O. > Sounds like some user space daemon listening for these events and then modifying cgroup throttling limits dynamically? > > > > 2) Interfaces for SCSI, Ethernet/*transport configuration parameters > > floating around in sysfs, procfs. Architecting guidelines for accepting > > patches for hybrid devices. > > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room for > > all and they have to help each other > > For instance if you took a DM snapshot and the storage sent a check condition to the original dm device I am not sure if the DM snapshot would get one too.. > > If you had a scenario of taking H/W snapshot of an entire pool and decide to delete the individual DM snapshots the H/W snapshot would be inconsistent. > > The blocks being managed by a DM-device would have moved (SCSI referrals). I believe Hannes is working on the referrals piece.. > > > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick your > > subsystem and there are many non-cooperating B/W control constructs in > > each subsystem. > > > > Above is pretty generic. Do you have specific needs/ideas/concerns? > > > > Thanks > > Vivek > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O b/w via cgroups. Such bandwidth manipulations are network switch driven and cgroups never take care of these events from the Ethernet driver. So if IO is going over network and actual bandwidth control is taking place by throttling ethernet traffic then one does not have to specify block cgroup throttling policy and hence no need for cgroups to be worried about ethernet driver events? I think I am missing something here. Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 18:45 ` Vivek Goyal @ 2011-03-29 19:13 ` Shyam_Iyer 2011-03-29 19:57 ` Vivek Goyal 2011-03-29 19:59 ` Mike Snitzer 0 siblings, 2 replies; 138+ messages in thread From: Shyam_Iyer @ 2011-03-29 19:13 UTC (permalink / raw) To: vgoyal; +Cc: lsf, linux-scsi, dm-devel, linux-fsdevel, rwheeler > -----Original Message----- > From: Vivek Goyal [mailto:vgoyal@redhat.com] > Sent: Tuesday, March 29, 2011 2:45 PM > To: Iyer, Shyam > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com; > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm- > devel@redhat.com; linux-scsi@vger.kernel.org > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > On Tue, Mar 29, 2011 at 11:10:18AM -0700, Shyam_Iyer@Dell.com wrote: > > > > > > > -----Original Message----- > > > From: Vivek Goyal [mailto:vgoyal@redhat.com] > > > Sent: Tuesday, March 29, 2011 1:34 PM > > > To: Iyer, Shyam > > > Cc: rwheeler@redhat.com; James.Bottomley@hansenpartnership.com; > > > lsf@lists.linux-foundation.org; linux-fsdevel@vger.kernel.org; dm- > > > devel@redhat.com; linux-scsi@vger.kernel.org > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > > > > > On Tue, Mar 29, 2011 at 10:20:57AM -0700, Shyam_Iyer@dell.com > wrote: > > > > > > > > > > > > > -----Original Message----- > > > > > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi- > > > > > owner@vger.kernel.org] On Behalf Of Ric Wheeler > > > > > Sent: Tuesday, March 29, 2011 7:17 AM > > > > > To: James Bottomley > > > > > Cc: lsf@lists.linux-foundation.org; linux-fsdevel; linux- > > > > > scsi@vger.kernel.org; device-mapper development > > > > > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > > > > > > > > > On 03/29/2011 12:36 AM, James Bottomley wrote: > > > > > > Hi All, > > > > > > > > > > > > Since LSF is less than a week away, the programme committee > put > > > > > together > > > > > > a just in time preliminary agenda for LSF. As you can see > there > > > is > > > > > > still plenty of empty space, which you can make suggestions > (to > > > this > > > > > > list with appropriate general list cc's) for filling: > > > > > > > > > > > > > > > > > > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQz > > > > > M5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > > > > > > > > > If you don't make suggestions, the programme committee will > feel > > > > > > empowered to make arbitrary assignments based on your topic > and > > > > > attendee > > > > > > email requests ... > > > > > > > > > > > > We're still not quite sure what rooms we will have at the > Kabuki, > > > but > > > > > > we'll add them to the spreadsheet when we know (they should > be > > > close > > > > > to > > > > > > each other). > > > > > > > > > > > > The spreadsheet above also gives contact information for all > the > > > > > > attendees and the programme committee. > > > > > > > > > > > > Yours, > > > > > > > > > > > > James Bottomley > > > > > > on behalf of LSF/MM Programme Committee > > > > > > > > > > > > > > > > Here are a few topic ideas: > > > > > > > > > > (1) The first topic that might span IO & FS tracks (or just > pull > > > in > > > > > device > > > > > mapper people to an FS track) could be adding new commands that > > > would > > > > > allow > > > > > users to grow/shrink/etc file systems in a generic way. 
The > > > thought I > > > > > had was > > > > > that we have a reasonable model that we could reuse for these > new > > > > > commands like > > > > > mount and mount.fs or fsck and fsck.fs. With btrfs coming down > the > > > > > road, it > > > > > could be nice to identify exactly what common operations users > want > > > to > > > > > do and > > > > > agree on how to implement them. Alasdair pointed out in the > > > upstream > > > > > thread that > > > > > we had a prototype here in fsadm. > > > > > > > > > > (2) Very high speed, low latency SSD devices and testing. Have > we > > > > > settled on the > > > > > need for these devices to all have block level drivers? For S- > ATA > > > or > > > > > SAS > > > > > devices, are there known performance issues that require > > > enhancements > > > > > in > > > > > somewhere in the stack? > > > > > > > > > > (3) The union mount versus overlayfs debate - pros and cons. > What > > > each > > > > > do well, > > > > > what needs doing. Do we want/need both upstream? (Maybe this > can > > > get 10 > > > > > minutes > > > > > in Al's VFS session?) > > > > > > > > > > Thanks! > > > > > > > > > > Ric > > > > > > > > A few others that I think may span across I/O, Block fs..layers. > > > > > > > > 1) Dm-thinp target vs File system thin profile vs block map based > > > thin/trim profile. > > > > > > > Facilitate I/O throttling for thin/trimmable storage. Online and > > > Offline profil. > > > > > > Is above any different from block IO throttling we have got for > block > > > devices? > > > > > Yes.. so the throttling would be capacity based.. when the storage > array wants us to throttle the I/O. Depending on the event we may keep > getting space allocation write protect check conditions for writes > until a user intervenes to stop I/O. > > > > Sounds like some user space daemon listening for these events and then > modifying cgroup throttling limits dynamically? But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits. The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy.. The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events. This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem.. > > > > > > > 2) Interfaces for SCSI, Ethernet/*transport configuration > parameters > > > floating around in sysfs, procfs. Architecting guidelines for > accepting > > > patches for hybrid devices. > > > > 3) DM snapshot vs FS snapshots vs H/W snapshots. There is room > for > > > all and they have to help each other > > > > For instance if you took a DM snapshot and the storage sent a check > condition to the original dm device I am not sure if the DM snapshot > would get one too.. > > > > If you had a scenario of taking H/W snapshot of an entire pool and > decide to delete the individual DM snapshots the H/W snapshot would be > inconsistent. 
> > > > The blocks being managed by a DM-device would have moved (SCSI > referrals). I believe Hannes is working on the referrals piece.. > > > > > > 4) B/W control - VM->DM->Block->Ethernet->Switch->Storage. Pick > your > > > subsystem and there are many non-cooperating B/W control constructs > in > > > each subsystem. > > > > > > Above is pretty generic. Do you have specific needs/ideas/concerns? > > > > > > Thanks > > > Vivek > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O > b/w via cgroups. Such bandwidth manipulations are network switch driven > and cgroups never take care of these events from the Ethernet driver. > > So if IO is going over network and actual bandwidth control is taking > place by throttling ethernet traffic then one does not have to specify > block cgroup throttling policy and hence no need for cgroups to be > worried > about ethernet driver events? > > I think I am missing something here. > > Vivek Well.. here is the catch.. example scenario.. - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1 multipathed together. Let us say round-robin policy. - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1 The computation that the bandwidth configured is 40% of the available bandwidth is false in this case. What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint.. Policies are usually decided at different levels, SLAs and sometimes logistics determine these decisions etc. Sometimes the bandwidth lowering by the switch is traffic dependent but user level policies remain in tact. Typical case of network administrator not talking to the system administrator. -Shyam ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 19:13 ` Shyam_Iyer @ 2011-03-29 19:57 ` Vivek Goyal 2011-03-29 19:59 ` Mike Snitzer 1 sibling, 0 replies; 138+ messages in thread From: Vivek Goyal @ 2011-03-29 19:57 UTC (permalink / raw) To: Shyam_Iyer Cc: rwheeler, James.Bottomley, lsf, linux-fsdevel, dm-devel, linux-scsi On Tue, Mar 29, 2011 at 12:13:41PM -0700, Shyam_Iyer@Dell.com wrote: [..] > > > > Sounds like some user space daemon listening for these events and then > > modifying cgroup throttling limits dynamically? > > But we have dm-targets in the horizon like dm-thinp setting soft limits on capacity.. we could extend the concept to H/W imposed soft/hard limits. > > The user space could throttle the I/O but it had have to go about finding all processes running I/O on the LUN.. In some cases it could be an I/O process running within a VM.. Well, if there is one cgroup (root cgroup), then daemon does not have to find anything. This is one global space and there is provision to set per device limit. So daemon can just go and adjust device limits dynamically and that gets applicable for all processes. The problem will happen if there are more cgroups created and limits are per cgroup, per device. (For creating service differentiation). I would say in that case daemon needs to be more sophisticated and reduce the limit in each group by same % as required by thinly provisioned target. That way a higher rate group will still get higher IO rate on a thinly provisioned device which is imposing its own throttling. Otherwise we again run into issues where there is no service differentiation between faster group or slower group. IOW, if we are throttling thinly povisioned devices, I think throttling these using a user space daemon might be better as it will reuse the kernel throttling infrastructure as well as throttling will be cgroup aware. > > That would require a passthrough interface to inform it.. I doubt if we would be able to accomplish that any sooner with the multiple operating systems involved. Or requiring each application to register with the userland process. Doable but cumbersome and buggy.. > > The dm-thinp target can help in this scenario by setting a blanket storage limit. We could go about extending the limit dynamically based on hints/commands from the userland daemon listening to such events. > > This approach will probably not take care of scenarios where VM storage is over say NFS or clustered filesystem.. Even current blkio throttling does not work over NFS. This is one of the issues I wanted to discuss at LSF. [..] > Well.. here is the catch.. example scenario.. > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1 multipathed together. Let us say round-robin policy. > > - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1 > > The computation that the bandwidth configured is 40% of the available bandwidth is false in this case. What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. > > Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint.. > So we have multipathed two paths in a round robin manner and one path is faster and other is slower. 
I am not sure what multipath does in those scenarios but trying to send more IO on the faster path sounds like the right thing to do. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
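To make the daemon idea discussed above concrete: the blkio controller's throttling knobs (CONFIG_BLK_DEV_THROTTLING, 2.6.37+) can be rewritten at runtime, so an agent reacting to a "space allocation write protect" style event from the array could clamp and later restore per-device write bandwidth without touching the applications. A rough sketch, assuming the thin LUN is major:minor 8:32 and the hierarchy is mounted as shown (both details are made up):

    # mount the blkio controller (the mount point is arbitrary)
    mount -t cgroup -o blkio none /cgroup/blkio

    # array asked us to back off: cap writes to the thin LUN at 10 MB/s
    echo "8:32 10485760" > /cgroup/blkio/blkio.throttle.write_bps_device

    # condition cleared: writing a rate of 0 should remove the rule again
    echo "8:32 0" > /cgroup/blkio/blkio.throttle.write_bps_device

Per Vivek's point, once child cgroups exist the agent would instead scale each group's blkio.throttle.*_bps_device limits by the same factor so that the relative service differentiation survives the clamp.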
* Re: Preliminary Agenda and Activities for LSF 2011-03-29 19:13 ` Shyam_Iyer 2011-03-29 19:57 ` Vivek Goyal @ 2011-03-29 19:59 ` Mike Snitzer 2011-03-29 20:12 ` Shyam_Iyer 1 sibling, 1 reply; 138+ messages in thread From: Mike Snitzer @ 2011-03-29 19:59 UTC (permalink / raw) To: Shyam_Iyer Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler, device-mapper development On Tue, Mar 29 2011 at 3:13pm -0400, Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote: > > > > Above is pretty generic. Do you have specific needs/ideas/concerns? > > > > > > > > Thanks > > > > Vivek > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit I/O > > b/w via cgroups. Such bandwidth manipulations are network switch driven > > and cgroups never take care of these events from the Ethernet driver. > > > > So if IO is going over network and actual bandwidth control is taking > > place by throttling ethernet traffic then one does not have to specify > > block cgroup throttling policy and hence no need for cgroups to be > > worried > > about ethernet driver events? > > > > I think I am missing something here. > > > > Vivek > Well.. here is the catch.. example scenario.. > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1 multipathed together. Let us say round-robin policy. > > - The cgroup profile is to limit I/O bandwidth to 40% of the multipathed I/O bandwidth. But the switch may have limited the I/O bandwidth to 40% for the corresponding vlan associated with one of the eth interface say eth1 > > The computation that the bandwidth configured is 40% of the available bandwidth is false in this case. What we need to do is possibly push more I/O through eth0 as it is allowed to run at 100% of bandwidth by the switch. > > Now this is a dynamic decision and multipathing layer should take care of it.. but it would need a hint.. No hint should be needed. Just use one of the newer multipath path selectors that are dynamic by design: "queue-length" or "service-time". This scenario is exactly what those path selectors are meant to address. Mike ^ permalink raw reply [flat|nested] 138+ messages in thread
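For reference, the selector Mike mentions is a one-line multipath.conf change; a minimal fragment (the shipped default is still "round-robin 0" unless overridden here or per-device):

    defaults {
            # weight paths by outstanding I/O; "service-time 0" instead
            # weights by estimated per-path service time
            path_selector    "queue-length 0"
    }

Both selectors keep re-evaluating per-path load, so a path whose underlying Ethernet class gets squeezed should automatically receive less I/O, which is exactly what a static round-robin split cannot do.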
* RE: Preliminary Agenda and Activities for LSF 2011-03-29 19:59 ` Mike Snitzer @ 2011-03-29 20:12 ` Shyam_Iyer 2011-03-29 20:23 ` Mike Snitzer 0 siblings, 1 reply; 138+ messages in thread From: Shyam_Iyer @ 2011-03-29 20:12 UTC (permalink / raw) To: snitzer; +Cc: vgoyal, lsf, linux-scsi, linux-fsdevel, rwheeler, dm-devel > -----Original Message----- > From: Mike Snitzer [mailto:snitzer@redhat.com] > Sent: Tuesday, March 29, 2011 4:00 PM > To: Iyer, Shyam > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux- > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org; > rwheeler@redhat.com; device-mapper development > Subject: Re: Preliminary Agenda and Activities for LSF > > On Tue, Mar 29 2011 at 3:13pm -0400, > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote: > > > > > > Above is pretty generic. Do you have specific > needs/ideas/concerns? > > > > > > > > > > Thanks > > > > > Vivek > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit > I/O > > > b/w via cgroups. Such bandwidth manipulations are network switch > driven > > > and cgroups never take care of these events from the Ethernet > driver. > > > > > > So if IO is going over network and actual bandwidth control is > taking > > > place by throttling ethernet traffic then one does not have to > specify > > > block cgroup throttling policy and hence no need for cgroups to be > > > worried > > > about ethernet driver events? > > > > > > I think I am missing something here. > > > > > > Vivek > > Well.. here is the catch.. example scenario.. > > > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1 > multipathed together. Let us say round-robin policy. > > > > - The cgroup profile is to limit I/O bandwidth to 40% of the > multipathed I/O bandwidth. But the switch may have limited the I/O > bandwidth to 40% for the corresponding vlan associated with one of the > eth interface say eth1 > > > > The computation that the bandwidth configured is 40% of the available > bandwidth is false in this case. What we need to do is possibly push > more I/O through eth0 as it is allowed to run at 100% of bandwidth by > the switch. > > > > Now this is a dynamic decision and multipathing layer should take > care of it.. but it would need a hint.. > > No hint should be needed. Just use one of the newer multipath path > selectors that are dynamic by design: "queue-length" or "service-time". > > This scenario is exactly what those path selectors are meant to > address. > > Mike Since iSCSI multipaths are essentially sessions one could configure more than one session through the same ethX interface. The sessions need not be going to the same LUN and hence not governed by the same multipath selector but the bandwidth policy group would be for a group of resources. -Shyam ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: Preliminary Agenda and Activities for LSF 2011-03-29 20:12 ` Shyam_Iyer @ 2011-03-29 20:23 ` Mike Snitzer 2011-03-29 23:09 ` Shyam_Iyer 0 siblings, 1 reply; 138+ messages in thread From: Mike Snitzer @ 2011-03-29 20:23 UTC (permalink / raw) To: Shyam_Iyer Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal, device-mapper development On Tue, Mar 29 2011 at 4:12pm -0400, Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote: > > > > -----Original Message----- > > From: Mike Snitzer [mailto:snitzer@redhat.com] > > Sent: Tuesday, March 29, 2011 4:00 PM > > To: Iyer, Shyam > > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux- > > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org; > > rwheeler@redhat.com; device-mapper development > > Subject: Re: Preliminary Agenda and Activities for LSF > > > > On Tue, Mar 29 2011 at 3:13pm -0400, > > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote: > > > > > > > > Above is pretty generic. Do you have specific > > needs/ideas/concerns? > > > > > > > > > > > > Thanks > > > > > > Vivek > > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to limit > > I/O > > > > b/w via cgroups. Such bandwidth manipulations are network switch > > driven > > > > and cgroups never take care of these events from the Ethernet > > driver. > > > > > > > > So if IO is going over network and actual bandwidth control is > > taking > > > > place by throttling ethernet traffic then one does not have to > > specify > > > > block cgroup throttling policy and hence no need for cgroups to be > > > > worried > > > > about ethernet driver events? > > > > > > > > I think I am missing something here. > > > > > > > > Vivek > > > Well.. here is the catch.. example scenario.. > > > > > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1 > > multipathed together. Let us say round-robin policy. > > > > > > - The cgroup profile is to limit I/O bandwidth to 40% of the > > multipathed I/O bandwidth. But the switch may have limited the I/O > > bandwidth to 40% for the corresponding vlan associated with one of the > > eth interface say eth1 > > > > > > The computation that the bandwidth configured is 40% of the available > > bandwidth is false in this case. What we need to do is possibly push > > more I/O through eth0 as it is allowed to run at 100% of bandwidth by > > the switch. > > > > > > Now this is a dynamic decision and multipathing layer should take > > care of it.. but it would need a hint.. > > > > No hint should be needed. Just use one of the newer multipath path > > selectors that are dynamic by design: "queue-length" or "service-time". > > > > This scenario is exactly what those path selectors are meant to > > address. > > > > Mike > > Since iSCSI multipaths are essentially sessions one could configure > more than one session through the same ethX interface. The sessions > need not be going to the same LUN and hence not governed by the same > multipath selector but the bandwidth policy group would be for a group > of resources. Then the sessions don't correspond to the same backend LUN (and by definition aren't part of the same mpath device). You're really all over the map with your talking points. I'm having a hard time following you. Mike ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: Preliminary Agenda and Activities for LSF 2011-03-29 20:23 ` Mike Snitzer @ 2011-03-29 23:09 ` Shyam_Iyer 2011-03-30 5:58 ` [Lsf] " Hannes Reinecke 0 siblings, 1 reply; 138+ messages in thread From: Shyam_Iyer @ 2011-03-29 23:09 UTC (permalink / raw) To: snitzer; +Cc: linux-scsi, lsf, linux-fsdevel, rwheeler, vgoyal, dm-devel > -----Original Message----- > From: Mike Snitzer [mailto:snitzer@redhat.com] > Sent: Tuesday, March 29, 2011 4:24 PM > To: Iyer, Shyam > Cc: linux-scsi@vger.kernel.org; lsf@lists.linux-foundation.org; linux- > fsdevel@vger.kernel.org; rwheeler@redhat.com; vgoyal@redhat.com; > device-mapper development > Subject: Re: Preliminary Agenda and Activities for LSF > > On Tue, Mar 29 2011 at 4:12pm -0400, > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote: > > > > > > > > -----Original Message----- > > > From: Mike Snitzer [mailto:snitzer@redhat.com] > > > Sent: Tuesday, March 29, 2011 4:00 PM > > > To: Iyer, Shyam > > > Cc: vgoyal@redhat.com; lsf@lists.linux-foundation.org; linux- > > > scsi@vger.kernel.org; linux-fsdevel@vger.kernel.org; > > > rwheeler@redhat.com; device-mapper development > > > Subject: Re: Preliminary Agenda and Activities for LSF > > > > > > On Tue, Mar 29 2011 at 3:13pm -0400, > > > Shyam_Iyer@dell.com <Shyam_Iyer@dell.com> wrote: > > > > > > > > > > Above is pretty generic. Do you have specific > > > needs/ideas/concerns? > > > > > > > > > > > > > > Thanks > > > > > > > Vivek > > > > > > Yes.. if I limited by Ethernet b/w to 40% I don't need to > limit > > > I/O > > > > > b/w via cgroups. Such bandwidth manipulations are network > switch > > > driven > > > > > and cgroups never take care of these events from the Ethernet > > > driver. > > > > > > > > > > So if IO is going over network and actual bandwidth control is > > > taking > > > > > place by throttling ethernet traffic then one does not have to > > > specify > > > > > block cgroup throttling policy and hence no need for cgroups to > be > > > > > worried > > > > > about ethernet driver events? > > > > > > > > > > I think I am missing something here. > > > > > > > > > > Vivek > > > > Well.. here is the catch.. example scenario.. > > > > > > > > - Two iSCSI I/O sessions emanating from Ethernet ports eth0, eth1 > > > multipathed together. Let us say round-robin policy. > > > > > > > > - The cgroup profile is to limit I/O bandwidth to 40% of the > > > multipathed I/O bandwidth. But the switch may have limited the I/O > > > bandwidth to 40% for the corresponding vlan associated with one of > the > > > eth interface say eth1 > > > > > > > > The computation that the bandwidth configured is 40% of the > available > > > bandwidth is false in this case. What we need to do is possibly > push > > > more I/O through eth0 as it is allowed to run at 100% of bandwidth > by > > > the switch. > > > > > > > > Now this is a dynamic decision and multipathing layer should take > > > care of it.. but it would need a hint.. > > > > > > No hint should be needed. Just use one of the newer multipath path > > > selectors that are dynamic by design: "queue-length" or "service- > time". > > > > > > This scenario is exactly what those path selectors are meant to > > > address. > > > > > > Mike > > > > Since iSCSI multipaths are essentially sessions one could configure > > more than one session through the same ethX interface. The sessions > > need not be going to the same LUN and hence not governed by the same > > multipath selector but the bandwidth policy group would be for a > group > > of resources. 
> > Then the sessions don't correspond to the same backend LUN (and by > definition aren't part of the same mpath device). You're really all > over the map with your talking points. > > I'm having a hard time following you. > > Mike Let me back up here.. this has to be thought in not only the traditional Ethernet sense but also in a Data Centre Bridged environment. I shouldn't have wandered into the multipath constructs.. I think the statement on not going to the same LUN was a little erroneous. I meant different /dev/sdXs.. and hence different block I/O queues. Each I/O queue could be thought of as a bandwidth queue class being serviced through a corresponding network adapter's queue(assuming a multiqueue capable adapter) Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group corresponding to a weightage of 20% of the I/O bandwidth the user has configured this weight thinking that this will correspond to say 200Mb of bandwidth. Let us say the network bandwidth on the corresponding network queues corresponding was reduced by the DCB capable switch... We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed. In such a scenario the option is to move I/O to a different bandwidth priority queue in the network adapter. This could be moving I/O to a new network queue in eth0 or another queue in eth1 .. This requires mapping the block queue to the new network queue. One way of solving this is what is getting into the open-iscsi world i.e. creating a session tagged to the relevant DCB priority and thus the session gets mapped to the relevant tc queue which ultimately maps to one of the network adapters multiqueue.. But when multipath fails over to the different session path then the DCB bandwidth priority will not move with it.. Ok one could argue that is a user mistake to have configured bandwidth priorities differently but it may so happen that the bandwidth priority was just dynamically changed by the switch for the particular queue. Although I gave an example of a DCB environment but we could definitely look at doing a 1:n map of block queues to network adapter queues for non-DCB environments too.. -Shyam ^ permalink raw reply [flat|nested] 138+ messages in thread
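One building block that already exists for the queue-mapping idea: the mqprio qdisc exposes a NIC's hardware traffic classes and maps skb priorities onto ranges of TX queues, which is the hook DCB-aware drivers use. A rough sketch (syntax per tc-mqprio; the 1:1 priority-to-class map and queue layout are made up and depend on the adapter):

    # 8 traffic classes, priorities 0-7 mapped one to one, one TX queue per
    # class; "hw 1" hands the traffic-class setup to the driver/DCB firmware
    tc qdisc add dev eth0 root mqprio num_tc 8 \
        map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1

What is still missing, and what Shyam seems to be after, is a way for block-layer queues (or an iSCSI session) to follow those priorities around when multipath fails over or when the switch reshuffles bandwidth.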
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 23:09 ` Shyam_Iyer @ 2011-03-30 5:58 ` Hannes Reinecke 2011-03-30 14:02 ` James Bottomley 0 siblings, 1 reply; 138+ messages in thread From: Hannes Reinecke @ 2011-03-30 5:58 UTC (permalink / raw) To: Shyam_Iyer; +Cc: snitzer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote: > > Let me back up here.. this has to be thought in not only the traditional Ethernet > sense but also in a Data Centre Bridged environment. I shouldn't have wandered > into the multipath constructs.. > > I think the statement on not going to the same LUN was a little erroneous. I meant > different /dev/sdXs.. and hence different block I/O queues. > > Each I/O queue could be thought of as a bandwidth queue class being serviced through > a corresponding network adapter's queue(assuming a multiqueue capable adapter) > > Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group > corresponding to a weightage of 20% of the I/O bandwidth the user has configured > this weight thinking that this will correspond to say 200Mb of bandwidth. > > Let us say the network bandwidth on the corresponding network queues corresponding > was reduced by the DCB capable switch... > We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed. > > In such a scenario the option is to move I/O to a different bandwidth priority queue > in the network adapter. This could be moving I/O to a new network queue in eth0 or > another queue in eth1 .. > > This requires mapping the block queue to the new network queue. > > One way of solving this is what is getting into the open-iscsi world i.e. creating > a session tagged to the relevant DCB priority and thus the session gets mapped > to the relevant tc queue which ultimately maps to one of the network adapters multiqueue.. > > But when multipath fails over to the different session path then the DCB bandwidth > priority will not move with it.. > > Ok one could argue that is a user mistake to have configured bandwidth priorities > differently but it may so happen that the bandwidth priority was just dynamically > changed by the switch for the particular queue. > > Although I gave an example of a DCB environment but we could definitely look at > doing a 1:n map of block queues to network adapter queues for non-DCB environments too.. > That sounds quite convoluted enough to warrant it's own slot :-) No, seriously. I think it would be good to have a separate slot discussing DCB (be it FCoE or iSCSI) and cgroups. And how to best align these things. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 5:58 ` [Lsf] " Hannes Reinecke @ 2011-03-30 14:02 ` James Bottomley 2011-03-30 14:10 ` Hannes Reinecke 0 siblings, 1 reply; 138+ messages in thread From: James Bottomley @ 2011-03-30 14:02 UTC (permalink / raw) To: Hannes Reinecke Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: > On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote: > > > > Let me back up here.. this has to be thought in not only the traditional Ethernet > > sense but also in a Data Centre Bridged environment. I shouldn't > have wandered > > into the multipath constructs.. > > > > I think the statement on not going to the same LUN was a little erroneous. I meant > > different /dev/sdXs.. and hence different block I/O queues. > > > > Each I/O queue could be thought of as a bandwidth queue class being serviced through > > a corresponding network adapter's queue(assuming a multiqueue > capable adapter) > > > > Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group > > corresponding to a weightage of 20% of the I/O bandwidth the user > has configured > > this weight thinking that this will correspond to say 200Mb of > bandwidth. > > > > Let us say the network bandwidth on the corresponding network queues corresponding > > was reduced by the DCB capable switch... > > We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed. > > > > In such a scenario the option is to move I/O to a different bandwidth priority queue > > in the network adapter. This could be moving I/O to a new network > queue in eth0 or > > another queue in eth1 .. > > > > This requires mapping the block queue to the new network queue. > > > > One way of solving this is what is getting into the open-iscsi world i.e. creating > > a session tagged to the relevant DCB priority and thus the > session gets mapped > > to the relevant tc queue which ultimately maps to one of the > network adapters multiqueue.. > > > > But when multipath fails over to the different session path then the DCB bandwidth > > priority will not move with it.. > > > > Ok one could argue that is a user mistake to have configured bandwidth priorities > > differently but it may so happen that the bandwidth priority was > just dynamically > > changed by the switch for the particular queue. > > > > Although I gave an example of a DCB environment but we could definitely look at > > doing a 1:n map of block queues to network adapter queues for > non-DCB environments too.. > > > That sounds quite convoluted enough to warrant it's own slot :-) > > No, seriously. I think it would be good to have a separate slot > discussing DCB (be it FCoE or iSCSI) and cgroups. > And how to best align these things. OK, I'll go for that ... Data Centre Bridging; experiences, technologies and needs ... something like that. What about virtualisation and open vSwitch? James ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 14:02 ` James Bottomley @ 2011-03-30 14:10 ` Hannes Reinecke 2011-03-30 14:26 ` James Bottomley 0 siblings, 1 reply; 138+ messages in thread From: Hannes Reinecke @ 2011-03-30 14:10 UTC (permalink / raw) To: James Bottomley Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler On 03/30/2011 04:02 PM, James Bottomley wrote: > On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: >> On 03/30/2011 01:09 AM, Shyam_Iyer@dell.com wrote: >>> >>> Let me back up here.. this has to be thought in not only the traditional Ethernet >> > sense but also in a Data Centre Bridged environment. I shouldn't >> have wandered >> > into the multipath constructs.. >>> >>> I think the statement on not going to the same LUN was a little erroneous. I meant >> > different /dev/sdXs.. and hence different block I/O queues. >>> >>> Each I/O queue could be thought of as a bandwidth queue class being serviced through >> > a corresponding network adapter's queue(assuming a multiqueue >> capable adapter) >>> >>> Let us say /dev/sda(Through eth0) and /dev/sdb(eth1) are a cgroup bandwidth group >> > corresponding to a weightage of 20% of the I/O bandwidth the user >> has configured >> > this weight thinking that this will correspond to say 200Mb of >> bandwidth. >>> >>> Let us say the network bandwidth on the corresponding network queues corresponding >> > was reduced by the DCB capable switch... >>> We still need an SLA of 200Mb of I/O bandwidth but the underlying dynamics have changed. >>> >>> In such a scenario the option is to move I/O to a different bandwidth priority queue >> > in the network adapter. This could be moving I/O to a new network >> queue in eth0 or >> > another queue in eth1 .. >>> >>> This requires mapping the block queue to the new network queue. >>> >>> One way of solving this is what is getting into the open-iscsi world i.e. creating >> > a session tagged to the relevant DCB priority and thus the >> session gets mapped >> > to the relevant tc queue which ultimately maps to one of the >> network adapters multiqueue.. >>> >>> But when multipath fails over to the different session path then the DCB bandwidth >> > priority will not move with it.. >>> >>> Ok one could argue that is a user mistake to have configured bandwidth priorities >> > differently but it may so happen that the bandwidth priority was >> just dynamically >> > changed by the switch for the particular queue. >>> >>> Although I gave an example of a DCB environment but we could definitely look at >> > doing a 1:n map of block queues to network adapter queues for >> non-DCB environments too.. >>> >> That sounds quite convoluted enough to warrant it's own slot :-) >> >> No, seriously. I think it would be good to have a separate slot >> discussing DCB (be it FCoE or iSCSI) and cgroups. >> And how to best align these things. > > OK, I'll go for that ... Data Centre Bridging; experiences, technologies > and needs ... something like that. What about virtualisation and open > vSwitch? > Hmm. Not qualified enough to talk about the latter; I was more envisioning the storage-related aspects here (multiqueue mapping, QoS classes etc). With virtualisation and open vSwitch we're more in the network side of things; doubt open vSwitch can do DCB. And even if it could, virtio certainly can't :-) Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 
5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 14:10 ` Hannes Reinecke @ 2011-03-30 14:26 ` James Bottomley 2011-03-30 14:55 ` Hannes Reinecke 0 siblings, 1 reply; 138+ messages in thread From: James Bottomley @ 2011-03-30 14:26 UTC (permalink / raw) To: Hannes Reinecke Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote: > On 03/30/2011 04:02 PM, James Bottomley wrote: > > On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: > >> No, seriously. I think it would be good to have a separate slot > >> discussing DCB (be it FCoE or iSCSI) and cgroups. > >> And how to best align these things. > > > > OK, I'll go for that ... Data Centre Bridging; experiences, technologies > > and needs ... something like that. What about virtualisation and open > > vSwitch? > > > Hmm. Not qualified enough to talk about the latter; I was more > envisioning the storage-related aspects here (multiqueue mapping, > QoS classes etc). With virtualisation and open vSwitch we're more in > the network side of things; doubt open vSwitch can do DCB. > And even if it could, virtio certainly can't :-) Technically, the topic DCB is about Data Centre Ethernet enhancements and converged networks ... that's why it's naturally allied to virtual switching. I was thinking we might put up a panel of vendors to get us all an education on the topic ... James ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 14:26 ` James Bottomley @ 2011-03-30 14:55 ` Hannes Reinecke 2011-03-30 15:33 ` James Bottomley 0 siblings, 1 reply; 138+ messages in thread From: Hannes Reinecke @ 2011-03-30 14:55 UTC (permalink / raw) To: James Bottomley Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler On 03/30/2011 04:26 PM, James Bottomley wrote: > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote: >> On 03/30/2011 04:02 PM, James Bottomley wrote: >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: >>>> No, seriously. I think it would be good to have a separate slot >>>> discussing DCB (be it FCoE or iSCSI) and cgroups. >>>> And how to best align these things. >>> >>> OK, I'll go for that ... Data Centre Bridging; experiences, technologies >>> and needs ... something like that. What about virtualisation and open >>> vSwitch? >>> >> Hmm. Not qualified enough to talk about the latter; I was more >> envisioning the storage-related aspects here (multiqueue mapping, >> QoS classes etc). With virtualisation and open vSwitch we're more in >> the network side of things; doubt open vSwitch can do DCB. >> And even if it could, virtio certainly can't :-) > > Technically, the topic DCB is about Data Centre Ethernet enhancements > and converged networks ... that's why it's naturally allied to virtual > switching. > > I was thinking we might put up a panel of vendors to get us all an > education on the topic ... > Oh, but gladly. Didn't know we had some at the LSF. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 14:55 ` Hannes Reinecke @ 2011-03-30 15:33 ` James Bottomley 2011-03-30 15:46 ` Shyam_Iyer 2011-03-30 20:32 ` Giridhar Malavali 0 siblings, 2 replies; 138+ messages in thread From: James Bottomley @ 2011-03-30 15:33 UTC (permalink / raw) To: Hannes Reinecke Cc: Shyam_Iyer, linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote: > On 03/30/2011 04:26 PM, James Bottomley wrote: > > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote: > >> On 03/30/2011 04:02 PM, James Bottomley wrote: > >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: > >>>> No, seriously. I think it would be good to have a separate slot > >>>> discussing DCB (be it FCoE or iSCSI) and cgroups. > >>>> And how to best align these things. > >>> > >>> OK, I'll go for that ... Data Centre Bridging; experiences, technologies > >>> and needs ... something like that. What about virtualisation and open > >>> vSwitch? > >>> > >> Hmm. Not qualified enough to talk about the latter; I was more > >> envisioning the storage-related aspects here (multiqueue mapping, > >> QoS classes etc). With virtualisation and open vSwitch we're more in > >> the network side of things; doubt open vSwitch can do DCB. > >> And even if it could, virtio certainly can't :-) > > > > Technically, the topic DCB is about Data Centre Ethernet enhancements > > and converged networks ... that's why it's naturally allied to virtual > > switching. > > > > I was thinking we might put up a panel of vendors to get us all an > > education on the topic ... > > > Oh, but gladly. > Didn't know we had some at the LSF. OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and Emulex (James Smart) but any other attending vendors who want to pitch in, send me an email and I'll add you. James ^ permalink raw reply [flat|nested] 138+ messages in thread
* RE: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 15:33 ` James Bottomley @ 2011-03-30 15:46 ` Shyam_Iyer 2011-03-30 20:32 ` Giridhar Malavali 1 sibling, 0 replies; 138+ messages in thread From: Shyam_Iyer @ 2011-03-30 15:46 UTC (permalink / raw) To: James.Bottomley, hare; +Cc: linux-scsi, lsf, dm-devel, linux-fsdevel, rwheeler > -----Original Message----- > From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] > Sent: Wednesday, March 30, 2011 11:34 AM > To: Hannes Reinecke > Cc: Iyer, Shyam; linux-scsi@vger.kernel.org; lsf@lists.linux- > foundation.org; dm-devel@redhat.com; linux-fsdevel@vger.kernel.org; > rwheeler@redhat.com > Subject: Re: [Lsf] Preliminary Agenda and Activities for LSF > > On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote: > > On 03/30/2011 04:26 PM, James Bottomley wrote: > > > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote: > > >> On 03/30/2011 04:02 PM, James Bottomley wrote: > > >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: > > >>>> No, seriously. I think it would be good to have a separate slot > > >>>> discussing DCB (be it FCoE or iSCSI) and cgroups. > > >>>> And how to best align these things. > > >>> > > >>> OK, I'll go for that ... Data Centre Bridging; experiences, > technologies > > >>> and needs ... something like that. What about virtualisation and > open > > >>> vSwitch? > > >>> > > >> Hmm. Not qualified enough to talk about the latter; I was more > > >> envisioning the storage-related aspects here (multiqueue mapping, > > >> QoS classes etc). With virtualisation and open vSwitch we're more > in > > >> the network side of things; doubt open vSwitch can do DCB. > > >> And even if it could, virtio certainly can't :-) > > > > > > Technically, the topic DCB is about Data Centre Ethernet > enhancements > > > and converged networks ... that's why it's naturally allied to > virtual > > > switching. > > > > > > I was thinking we might put up a panel of vendors to get us all an > > > education on the topic ... > > > > > Oh, but gladly. > > Didn't know we had some at the LSF. > > OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and > Emulex (James Smart) but any other attending vendors who want to pitch > in, send me an email and I'll add you. > > James > Excellent. I would probably volunteer Giridhar(Qlogic) as well looking at the list of attendees as some of the CNA implementations vary.. -Shyam ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 15:33 ` James Bottomley 2011-03-30 15:46 ` Shyam_Iyer @ 2011-03-30 20:32 ` Giridhar Malavali 2011-03-30 20:45 ` James Bottomley 1 sibling, 1 reply; 138+ messages in thread From: Giridhar Malavali @ 2011-03-30 20:32 UTC (permalink / raw) To: James Bottomley, Hannes Reinecke Cc: Shyam_Iyer@dell.com, linux-scsi@vger.kernel.org, lsf@lists.linux-foundation.org, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, rwheeler@redhat.com >> >On Wed, 2011-03-30 at 16:55 +0200, Hannes Reinecke wrote: >> On 03/30/2011 04:26 PM, James Bottomley wrote: >> > On Wed, 2011-03-30 at 16:10 +0200, Hannes Reinecke wrote: >> >> On 03/30/2011 04:02 PM, James Bottomley wrote: >> >>> On Wed, 2011-03-30 at 07:58 +0200, Hannes Reinecke wrote: >> >>>> No, seriously. I think it would be good to have a separate slot >> >>>> discussing DCB (be it FCoE or iSCSI) and cgroups. >> >>>> And how to best align these things. >> >>> >> >>> OK, I'll go for that ... Data Centre Bridging; experiences, >>technologies >> >>> and needs ... something like that. What about virtualisation and >>open >> >>> vSwitch? >> >>> >> >> Hmm. Not qualified enough to talk about the latter; I was more >> >> envisioning the storage-related aspects here (multiqueue mapping, >> >> QoS classes etc). With virtualisation and open vSwitch we're more in >> >> the network side of things; doubt open vSwitch can do DCB. >> >> And even if it could, virtio certainly can't :-) >> > >> > Technically, the topic DCB is about Data Centre Ethernet enhancements >> > and converged networks ... that's why it's naturally allied to virtual >> > switching. >> > >> > I was thinking we might put up a panel of vendors to get us all an >> > education on the topic ... >> > >> Oh, but gladly. >> Didn't know we had some at the LSF. > >OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and >Emulex (James Smart) but any other attending vendors who want to pitch >in, send me an email and I'll add you. Can u please add me for this. -- Giridhar > >James > > >-- >To unsubscribe from this list: send the line "unsubscribe linux-scsi" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html > This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 20:32 ` Giridhar Malavali @ 2011-03-30 20:45 ` James Bottomley 0 siblings, 0 replies; 138+ messages in thread From: James Bottomley @ 2011-03-30 20:45 UTC (permalink / raw) To: Giridhar Malavali Cc: Hannes Reinecke, Shyam_Iyer@dell.com, linux-scsi@vger.kernel.org, lsf@lists.linux-foundation.org, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, rwheeler@redhat.com On Wed, 2011-03-30 at 13:32 -0700, Giridhar Malavali wrote: > >OK, so I scheduled this with Dell (Shyam Iyer), Intel (Robert Love) and > >Emulex (James Smart) but any other attending vendors who want to pitch > >in, send me an email and I'll add you. > > Can u please add me for this. I already did. (The agenda web actually updates about 5 minutes behind the driving spreadsheet). James ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 11:16 ` Ric Wheeler 2011-03-29 11:22 ` Matthew Wilcox 2011-03-29 17:20 ` Shyam_Iyer @ 2011-03-29 19:47 ` Nicholas A. Bellinger 2011-03-29 20:29 ` Jan Kara 2011-03-30 0:33 ` Mingming Cao 4 siblings, 0 replies; 138+ messages in thread From: Nicholas A. Bellinger @ 2011-03-29 19:47 UTC (permalink / raw) To: Ric Wheeler Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Tue, 2011-03-29 at 07:16 -0400, Ric Wheeler wrote: > On 03/29/2011 12:36 AM, James Bottomley wrote: > > Hi All, > > > > Since LSF is less than a week away, the programme committee put together > > a just in time preliminary agenda for LSF. As you can see there is > > still plenty of empty space, which you can make suggestions (to this > > list with appropriate general list cc's) for filling: > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > If you don't make suggestions, the programme committee will feel > > empowered to make arbitrary assignments based on your topic and attendee > > email requests ... > > > > We're still not quite sure what rooms we will have at the Kabuki, but > > we'll add them to the spreadsheet when we know (they should be close to > > each other). > > > > The spreadsheet above also gives contact information for all the > > attendees and the programme committee. > > > > Yours, > > > > James Bottomley > > on behalf of LSF/MM Programme Committee > > > > Here are a few topic ideas: > > (1) The first topic that might span IO & FS tracks (or just pull in device > mapper people to an FS track) could be adding new commands that would allow > users to grow/shrink/etc file systems in a generic way. The thought I had was > that we have a reasonable model that we could reuse for these new commands like > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it > could be nice to identify exactly what common operations users want to do and > agree on how to implement them. Alasdair pointed out in the upstream thread that > we had a prototype here in fsadm. > > (2) Very high speed, low latency SSD devices and testing. Have we settled on the > need for these devices to all have block level drivers? For S-ATA or SAS > devices, are there known performance issues that require enhancements in > somewhere in the stack? > > (3) The union mount versus overlayfs debate - pros and cons. What each do well, > what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes > in Al's VFS session?) > Hi Ric, James and LSF-PC chairs, Beyond my original LSF topic proposal for the next-generation QEMU/KVM Virtio-SCSI target driver here: http://marc.info/?l=linux-scsi&m=129706545408966&w=2 The following target mode related topics would be useful for the current attendees with interest in /drivers/target/ code if there is extra room available for local attendance within the IO/storage track. (4) Enabling mixed Target/Initiator mode in existing mainline SCSI LLDs that support HW target mode, and come to an consensus determination for how best to make the SCSI LLD / target fabric driver split when enabling mainline target infrastructure support into existing SCSI LLDs. 
This code is currently in flight for qla2xxx / tcm_qla2xxx for .40 (Hannes, Christoph, Mike, Qlogic and other LLD maintainers) (5) Driving target configfs group creation from kernel-space via a userspace passthrough using some form of portable / acceptable mainline interface. This is a topic that has been raised on the scsi list for the ibmvscsis target driver for .40, and is going to be useful for other in-flight HW target driver as well. (Tomo-san, Hannes, Mike, James, Joel) Thank you! --nab ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 11:16 ` Ric Wheeler ` (2 preceding siblings ...) 2011-03-29 19:47 ` Nicholas A. Bellinger @ 2011-03-29 20:29 ` Jan Kara 2011-03-29 20:31 ` Ric Wheeler 2011-03-30 0:33 ` Mingming Cao 4 siblings, 1 reply; 138+ messages in thread From: Jan Kara @ 2011-03-29 20:29 UTC (permalink / raw) To: Ric Wheeler Cc: James Bottomley, lsf, linux-fsdevel, device-mapper development, linux-scsi@vger.kernel.org On Tue 29-03-11 07:16:32, Ric Wheeler wrote: > On 03/29/2011 12:36 AM, James Bottomley wrote: > (3) The union mount versus overlayfs debate - pros and cons. What each do well, > what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes > in Al's VFS session?) It might be interesting but neither Miklos nor Val seems to be attending so I'm not sure how deep discussion we can have :). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 20:29 ` Jan Kara @ 2011-03-29 20:31 ` Ric Wheeler 0 siblings, 0 replies; 138+ messages in thread From: Ric Wheeler @ 2011-03-29 20:31 UTC (permalink / raw) To: Jan Kara Cc: Ric Wheeler, James Bottomley, lsf, device-mapper development, linux-fsdevel, linux-scsi@vger.kernel.org On 03/29/2011 04:29 PM, Jan Kara wrote: > On Tue 29-03-11 07:16:32, Ric Wheeler wrote: >> On 03/29/2011 12:36 AM, James Bottomley wrote: >> (3) The union mount versus overlayfs debate - pros and cons. What each do well, >> what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes >> in Al's VFS session?) > It might be interesting but neither Miklos nor Val seems to be attending > so I'm not sure how deep discussion we can have :). > > Honza Very true - probably best to keep that discussion focused upstream (but that seems to have quieted down as well)... Ric ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 11:16 ` Ric Wheeler ` (3 preceding siblings ...) 2011-03-29 20:29 ` Jan Kara @ 2011-03-30 0:33 ` Mingming Cao 2011-03-30 2:17 ` Dave Chinner 4 siblings, 1 reply; 138+ messages in thread From: Mingming Cao @ 2011-03-30 0:33 UTC (permalink / raw) To: Ric Wheeler Cc: James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Tue, 2011-03-29 at 07:16 -0400, Ric Wheeler wrote: > On 03/29/2011 12:36 AM, James Bottomley wrote: > > Hi All, > > > > Since LSF is less than a week away, the programme committee put together > > a just in time preliminary agenda for LSF. As you can see there is > > still plenty of empty space, which you can make suggestions (to this > > list with appropriate general list cc's) for filling: > > > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > > > If you don't make suggestions, the programme committee will feel > > empowered to make arbitrary assignments based on your topic and attendee > > email requests ... > > > > We're still not quite sure what rooms we will have at the Kabuki, but > > we'll add them to the spreadsheet when we know (they should be close to > > each other). > > > > The spreadsheet above also gives contact information for all the > > attendees and the programme committee. > > > > Yours, > > > > James Bottomley > > on behalf of LSF/MM Programme Committee > > > > Here are a few topic ideas: > > (1) The first topic that might span IO & FS tracks (or just pull in device > mapper people to an FS track) could be adding new commands that would allow > users to grow/shrink/etc file systems in a generic way. The thought I had was > that we have a reasonable model that we could reuse for these new commands like > mount and mount.fs or fsck and fsck.fs. With btrfs coming down the road, it > could be nice to identify exactly what common operations users want to do and > agree on how to implement them. Alasdair pointed out in the upstream thread that > we had a prototype here in fsadm. > > (2) Very high speed, low latency SSD devices and testing. Have we settled on the > need for these devices to all have block level drivers? For S-ATA or SAS > devices, are there known performance issues that require enhancements in > somewhere in the stack? > > (3) The union mount versus overlayfs debate - pros and cons. What each do well, > what needs doing. Do we want/need both upstream? (Maybe this can get 10 minutes > in Al's VFS session?) > Ric, May I propose some discussion about concurrent direct IO support for ext4? Direct IO write are serialized by the single i_mutex lock. This lock contention becomes significant when running database or direct IO heavy workload on guest, where the host pass a file image to guest as a block device. All the parallel IOs in guests are being serialized by the i_mutex lock on the host disk image file. This greatly penalize the data base application performance in KVM. I am looking for some discussion about removing the i_mutex lock in the direct IO write code path for ext4, when multiple threads are direct write to different offset of the same file. This would require some way to track the in-fly DIO IO range, either done at ext4 level or above th vfs layer. Thanks, ^ permalink raw reply [flat|nested] 138+ messages in thread
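As a concrete (hypothetical) illustration of the workload pattern Mingming describes, the sketch below has several threads issuing O_DIRECT writes to disjoint, aligned offsets of the same file; with the behaviour described above, these writes all serialize on the file's i_mutex even though they never overlap. This is only a userspace reproducer sketch, not anything proposed for the kernel. Build with: gcc -O2 -o dio_writers dio_writers.c -lpthread

/* Sketch of N threads doing non-overlapping O_DIRECT writes to one file. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4
#define IOSIZE   (1024 * 1024)		/* multiple of the sector size, as O_DIRECT requires */

static int fd;

static void *dio_writer(void *arg)
{
	long id = (long)arg;
	void *buf;

	/* O_DIRECT needs an aligned buffer, offset and length. */
	if (posix_memalign(&buf, 4096, IOSIZE))
		return NULL;
	memset(buf, 'a' + id, IOSIZE);

	/* Each thread writes its own disjoint region of the file. */
	for (int i = 0; i < 64; i++) {
		off_t off = (off_t)id * 64 * IOSIZE + (off_t)i * IOSIZE;
		if (pwrite(fd, buf, IOSIZE, off) != IOSIZE)
			perror("pwrite");
	}
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t[NTHREADS];

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, dio_writer, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	close(fd);
	return 0;
}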
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 0:33 ` Mingming Cao @ 2011-03-30 2:17 ` Dave Chinner 2011-03-30 11:13 ` Theodore Tso 2011-03-30 21:49 ` Mingming Cao 0 siblings, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-03-30 2:17 UTC (permalink / raw) To: Mingming Cao Cc: Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development

On Tue, Mar 29, 2011 at 05:33:30PM -0700, Mingming Cao wrote:
> Ric,
>
> May I propose some discussion about concurrent direct IO support for
> ext4?

Just look at the way XFS does it and copy that? i.e. it has a filesystem-level IO lock and an inode lock, both with shared/exclusive semantics. These lie below the i_mutex (i.e. the locking order is i_mutex, i_iolock, i_ilock), and effectively result in the i_mutex only being used for VFS level synchronisation; as such it is rarely used inside XFS itself.

Inode attribute operations are protected by the inode lock, while synchronisation of IO operations and truncation is provided by the IO lock.

So for buffered IO, the IO lock is used in shared mode for reads and exclusive mode for writes. This gives normal POSIX buffered IO semantics, and holding the IO lock exclusive allows synchronisation against new IO of any kind for truncate.

For direct IO, the IO lock is always taken in shared mode, so we can have concurrent read and write operations taking place at once regardless of the offset into the file.

> I am looking for some discussion about removing the i_mutex lock in the
> direct IO write code path for ext4, when multiple threads are
> direct write to different offset of the same file. This would require
> some way to track the in-fly DIO IO range, either done at ext4 level or
> above th vfs layer.

Direct IO semantics have always been that the application is allowed to overlap IO to the same range if it wants to. The result is undefined (just like issuing overlapping reads and writes to a disk at the same time), so it's the application's responsibility to avoid overlapping IO if it is a problem.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 138+ messages in thread
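For readers less familiar with the scheme, here is a minimal userspace model of the IO-lock behaviour Dave describes, with a single pthread rwlock standing in for the per-inode IO lock. It illustrates the locking pattern only and is not the actual XFS code.

/* Toy model of the IO-lock rules: shared vs. exclusive per operation type. */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t iolock = PTHREAD_RWLOCK_INITIALIZER;

static void buffered_read(void)
{
	pthread_rwlock_rdlock(&iolock);	/* shared: many readers at once */
	puts("buffered read");
	pthread_rwlock_unlock(&iolock);
}

static void buffered_write(void)
{
	pthread_rwlock_wrlock(&iolock);	/* exclusive: POSIX buffered write semantics */
	puts("buffered write");
	pthread_rwlock_unlock(&iolock);
}

static void direct_io(const char *op)
{
	pthread_rwlock_rdlock(&iolock);	/* shared for both DIO reads and DIO writes */
	printf("direct IO %s\n", op);
	pthread_rwlock_unlock(&iolock);
}

static void truncate_file(void)
{
	pthread_rwlock_wrlock(&iolock);	/* exclusive: blocks all new IO during truncate */
	puts("truncate");
	pthread_rwlock_unlock(&iolock);
}

int main(void)
{
	buffered_read();
	buffered_write();
	direct_io("read");
	direct_io("write");
	truncate_file();
	return 0;
}

The point of the pattern is that direct IO never takes the lock exclusively, so concurrent DIO reads and writes proceed in parallel regardless of offset, while truncate can still fence off all new IO.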
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 2:17 ` Dave Chinner @ 2011-03-30 11:13 ` Theodore Tso 2011-03-30 11:28 ` Ric Wheeler 2011-03-30 21:49 ` Mingming Cao 1 sibling, 1 reply; 138+ messages in thread From: Theodore Tso @ 2011-03-30 11:13 UTC (permalink / raw) To: Dave Chinner Cc: Mingming Cao, Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote: > Direct IO semantics have always been that the application is allowed > to overlap IO to the same range if it wants to. The result is > undefined (just like issuing overlapping reads and writes to a disk > at the same time) so it's the application's responsibility to avoid > overlapping IO if it is a problem. Even if the overlapping read/writes are taking place in different processes? DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors. The lack of formal specifications of what applications are guaranteed to receive is unfortunate.... -- Ted ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 11:13 ` Theodore Tso @ 2011-03-30 11:28 ` Ric Wheeler 2011-03-30 14:07 ` Chris Mason 2011-04-01 15:19 ` Ted Ts'o 0 siblings, 2 replies; 138+ messages in thread From: Ric Wheeler @ 2011-03-30 11:28 UTC (permalink / raw) To: Theodore Tso Cc: Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley, device-mapper development, linux-fsdevel, Ric Wheeler On 03/30/2011 07:13 AM, Theodore Tso wrote: > On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote: > >> Direct IO semantics have always been that the application is allowed >> to overlap IO to the same range if it wants to. The result is >> undefined (just like issuing overlapping reads and writes to a disk >> at the same time) so it's the application's responsibility to avoid >> overlapping IO if it is a problem. > Even if the overlapping read/writes are taking place in different processes? > > DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors. The lack of formal specifications of what applications are guaranteed to receive is unfortunate.... > > -- Ted What possible semantics could you have? If you ever write concurrently from multiple processes without locking, you clearly are at the mercy of the scheduler and the underlying storage which could fragment a single write into multiple IO's sent to the backend device. I would agree with Dave, let's not make it overly complicated or try to give people "atomic" unbounded size writes just because they set the O_DIRECT flag :) Ric ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 11:28 ` Ric Wheeler @ 2011-03-30 14:07 ` Chris Mason 2011-04-01 15:19 ` Ted Ts'o 1 sibling, 0 replies; 138+ messages in thread From: Chris Mason @ 2011-03-30 14:07 UTC (permalink / raw) To: Ric Wheeler Cc: Theodore Tso, Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley, device-mapper development, linux-fsdevel, Ric Wheeler Excerpts from Ric Wheeler's message of 2011-03-30 07:28:34 -0400: > On 03/30/2011 07:13 AM, Theodore Tso wrote: > > On Mar 29, 2011, at 10:17 PM, Dave Chinner wrote: > > > >> Direct IO semantics have always been that the application is allowed > >> to overlap IO to the same range if it wants to. The result is > >> undefined (just like issuing overlapping reads and writes to a disk > >> at the same time) so it's the application's responsibility to avoid > >> overlapping IO if it is a problem. > > Even if the overlapping read/writes are taking place in different processes? > > > > DIO has never been standardized, and was originally implemented as gentleman's agreements between various database manufacturers and proprietary unix vendors. The lack of formal specifications of what applications are guaranteed to receive is unfortunate.... > > > > -- Ted > > What possible semantics could you have? > > If you ever write concurrently from multiple processes without locking, you > clearly are at the mercy of the scheduler and the underlying storage which could > fragment a single write into multiple IO's sent to the backend device. > > I would agree with Dave, let's not make it overly complicated or try to give > people "atomic" unbounded size writes just because they set the O_DIRECT flag :) We've talked about this with the oracle database people at least, any concurrent O_DIRECT ios to the same area would be considered a db bug. As long as it doesn't make the kernel crash or hang, we can return one of these: http://www.youtube.com/watch?v=rX7wtNOkuHo IBM might have a different answer, but I don't see how you can have good results from mixing concurrent IOs. -chris ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 11:28 ` Ric Wheeler 2011-03-30 14:07 ` Chris Mason @ 2011-04-01 15:19 ` Ted Ts'o 2011-04-01 16:30 ` Amir Goldstein 2011-04-01 21:43 ` Joel Becker 1 sibling, 2 replies; 138+ messages in thread From: Ted Ts'o @ 2011-04-01 15:19 UTC (permalink / raw) To: Ric Wheeler Cc: Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley, device-mapper development, linux-fsdevel, Ric Wheeler

On Wed, Mar 30, 2011 at 07:28:34AM -0400, Ric Wheeler wrote:
>
> What possible semantics could you have?
>
> If you ever write concurrently from multiple processes without
> locking, you clearly are at the mercy of the scheduler and the
> underlying storage which could fragment a single write into multiple
> IO's sent to the backend device.
>
> I would agree with Dave, let's not make it overly complicated or try
> to give people "atomic" unbounded size writes just because they set
> the O_DIRECT flag :)

I just want to have it written down. After getting burned with ext3's semantics promising more than what the standard guaranteed, I've just gotten paranoid about application programmers getting upset when things change on them --- and in the case of direct I/O, this stuff isn't even clearly documented anywhere official.

I just think it's best that we document the fact that concurrent DIOs to the same region may result in completely arbitrary behaviour, make sure it's well publicized to likely users (and I'm more worried about the open source code bases than Oracle DB), and then call it a day.

The closest place that we have to any official documentation about O_DIRECT semantics is the open(2) man page in the Linux manpages, and it doesn't say anything about this. It does give a recommendation against mixing buffered and O_DIRECT accesses to the same file, but it does promise that things will work in that case. (Even if it does, do we really want to make the promise that it will always work?)

In any case, adding some text in that paragraph, or just after that paragraph, to the effect that two concurrent DIO accesses to the same file block, even by two different processes, will result in undefined behavior would be a good start.

- Ted

^ permalink raw reply	[flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-04-01 15:19 ` Ted Ts'o @ 2011-04-01 16:30 ` Amir Goldstein 2011-04-01 21:46 ` Joel Becker 2011-04-01 21:43 ` Joel Becker 1 sibling, 1 reply; 138+ messages in thread From: Amir Goldstein @ 2011-04-01 16:30 UTC (permalink / raw) To: Theodore Tso Cc: Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley, device-mapper development, linux-fsdevel, Ric Wheeler, Yongqiang Yang On Fri, Apr 1, 2011 at 8:19 AM, Ted Ts'o <tytso@mit.edu> wrote: > On Wed, Mar 30, 2011 at 07:28:34AM -0400, Ric Wheeler wrote: >> >> What possible semantics could you have? >> >> If you ever write concurrently from multiple processes without >> locking, you clearly are at the mercy of the scheduler and the >> underlying storage which could fragment a single write into multiple >> IO's sent to the backend device. >> >> I would agree with Dave, let's not make it overly complicated or try >> to give people "atomic" unbounded size writes just because they set >> the O_DIRECT flag :) > > I just want to have it written down. After getting burned with ext3's > semantics promising more than what the standard guaranteed, I've just > gotten paranoid about application programmers getting upset when > things change on them --- and in the case of direct I/O, this stuff > isn't even clearly documented anywhere official. > > I just think it's best that we document it the fact that concurrent > DIO's to the same region may result in completely arbitrary behaviour, > make sure it's well publicized to likely users (and I'm more worried > about the open source code bases than Oracle DB), and then call it a day. > > The closest place that we have to any official documentation about > O_DIRECT semantics is the open(2) man page in the Linux manpages, and > it doesn't say anything about this. It does give a recommendation > against not mixing buffered and O_DIRECT accesses to the same file, > but it does promise that things will work in that case. (Even if it > does, do we really want to make the promise that it will always work?) when writing DIO to indirect mapped file holes, we fall back to buffered write (so we won't expose stale data in the case of a crash) concurrent DIO reads to that file (before data writeback) can expose stale data. right? do you consider this case mixing buffered and DIO access? do you consider that as a problem? the case interests me because I am afraid we may have to use the fallback trick for extent move on write from DIO (we did so in current implementation anyway). of course, if we end up implementing in-memory extent tree, we will probably be able to cope with DIO MOW without fallback to buffered IO. > > In any case, adding some text in that paragraph, or just after that > paragraph, to the effect that two concurrent DIO accesses to the same > file block, even by two different processes will result in undefined > behavior would be a good start. > > - Ted > _______________________________________________ > Lsf mailing list > Lsf@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/lsf > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-04-01 16:30 ` Amir Goldstein @ 2011-04-01 21:46 ` Joel Becker 2011-04-02 3:26 ` Amir Goldstein 0 siblings, 1 reply; 138+ messages in thread From: Joel Becker @ 2011-04-01 21:46 UTC (permalink / raw) To: Amir Goldstein Cc: Theodore Tso, Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley, device-mapper development, linux-fsdevel, Ric Wheeler, Yongqiang Yang On Fri, Apr 01, 2011 at 09:30:04AM -0700, Amir Goldstein wrote: > when writing DIO to indirect mapped file holes, we fall back to buffered write > (so we won't expose stale data in the case of a crash) concurrent DIO reads > to that file (before data writeback) can expose stale data. right? > do you consider this case mixing buffered and DIO access? > do you consider that as a problem? I do not consider this 'mixing', nor do I consider it a problem. ocfs2 does exactly this for holes, unwritten extents, and CoW. It does not violate the user's expectation that the data will be on disk when the write(2) returns. Falling back to buffered on read(2) is a different story; the caller wants the current state of the disk block, not five minutes ago. So we can't do that. But we also don't need to. O_DIRECT users that are worried about any possible space usage in the page cache have already pre-allocated their disk blocks and don't get here. Joel -- "Under capitalism, man exploits man. Under Communism, it's just the opposite." - John Kenneth Galbraith http://www.jlbec.org/ jlbec@evilplan.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-04-01 21:46 ` Joel Becker @ 2011-04-02 3:26 ` Amir Goldstein 0 siblings, 0 replies; 138+ messages in thread From: Amir Goldstein @ 2011-04-02 3:26 UTC (permalink / raw) To: Joel Becker Cc: Theodore Tso, Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org, James Bottomley, device-mapper development, linux-fsdevel, Ric Wheeler, Yongqiang Yang

On Fri, Apr 1, 2011 at 2:46 PM, Joel Becker <jlbec@evilplan.org> wrote:
> On Fri, Apr 01, 2011 at 09:30:04AM -0700, Amir Goldstein wrote:
>> when writing DIO to indirect mapped file holes, we fall back to buffered write
>> (so we won't expose stale data in the case of a crash) concurrent DIO reads
>> to that file (before data writeback) can expose stale data. right?
>> do you consider this case mixing buffered and DIO access?
>> do you consider that as a problem?
>
> I do not consider this 'mixing', nor do I consider it a problem.
> ocfs2 does exactly this for holes, unwritten extents, and CoW. It does
> not violate the user's expectation that the data will be on disk when
> the write(2) returns.
> Falling back to buffered on read(2) is a different story; the
> caller wants the current state of the disk block, not five minutes ago.
> So we can't do that. But we also don't need to.

The issue is that a DIO read exposing uninitialized data on disk is a security problem; it's not about giving the read what it expects to see.

> O_DIRECT users that are worried about any possible space usage in
> the page cache have already pre-allocated their disk blocks and don't
> get here.
>
> Joel
>
> --
>
> "Under capitalism, man exploits man. Under Communism, it's just
> the opposite."
> - John Kenneth Galbraith
>
> http://www.jlbec.org/
> jlbec@evilplan.org
>
-- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-04-01 15:19 ` Ted Ts'o 2011-04-01 16:30 ` Amir Goldstein @ 2011-04-01 21:43 ` Joel Becker 1 sibling, 0 replies; 138+ messages in thread From: Joel Becker @ 2011-04-01 21:43 UTC (permalink / raw) To: Ted Ts'o, Ric Wheeler, Dave Chinner, lsf, linux-scsi@vger.kernel.org On Fri, Apr 01, 2011 at 11:19:07AM -0400, Ted Ts'o wrote: > The closest place that we have to any official documentation about > O_DIRECT semantics is the open(2) man page in the Linux manpages, and > it doesn't say anything about this. It does give a recommendation > against not mixing buffered and O_DIRECT accesses to the same file, > but it does promise that things will work in that case. (Even if it > does, do we really want to make the promise that it will always work?) No, we do not. Some OSes will silently turn buffered I/O into direct I/O if another file already has it opened O_DIRECT. Some OSes will fail the write, or the open, or both, if it doesn't match the mode of an existing fd. Some just leave O_DIRECT and buffered access inconsistent. I think that Linux should strive to make the mixed buffered/direct case work; it's the nicest thing we can do. But we should not promise it. Joel -- Life's Little Instruction Book #24 "Drink champagne for no reason at all." http://www.jlbec.org/ jlbec@evilplan.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 2:17 ` Dave Chinner 2011-03-30 11:13 ` Theodore Tso @ 2011-03-30 21:49 ` Mingming Cao 2011-03-31 0:05 ` Matthew Wilcox 2011-03-31 1:00 ` Joel Becker 1 sibling, 2 replies; 138+ messages in thread From: Mingming Cao @ 2011-03-30 21:49 UTC (permalink / raw) To: Dave Chinner Cc: Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote: > On Tue, Mar 29, 2011 at 05:33:30PM -0700, Mingming Cao wrote: > > Ric, > > > > May I propose some discussion about concurrent direct IO support for > > ext4? > > Just look at the way XFS does it and copy that? i.e. it has a > filesytem level IO lock and an inode lock both with shared/exclusive > semantics. These lie below the i_mutex (i.e. locking order is > i_mutex, i_iolock, i_ilock), and effectively result in the i_mutex > only being used for VFS level synchronisation and as such is rarely > used inside XFS itself. > > Inode attribute operations are protected by the inode lock, while IO > operations and truncation synchronisation is provided by the IO > lock. > Right, inode attribute operations should be covered by the i_lock. in ext4 the i_mutex is used to protect IO and truncation synch, along with the i_datasem to pretect concurrent access.modification to file's allocation. > So for buffered IO, the IO lock is used in shared mode for reads > and exclusive mode for writes. This gives normal POSIX buffered IO > semantics and holding the IO lock exclusive allows sycnhronisation > against new IO of any kind for truncate. > > For direct IO, the IO lock is always taken in shared mode, so we can > have concurrent read and write operations taking place at once > regardless of the offset into the file. > thanks for reminding me,in xfs concurrent direct IO write to the same offset is allowed. > > I am looking for some discussion about removing the i_mutex lock in the > > direct IO write code path for ext4, when multiple threads are > > direct write to different offset of the same file. This would require > > some way to track the in-fly DIO IO range, either done at ext4 level or > > above th vfs layer. > > Direct IO semantics have always been that the application is allowed > to overlap IO to the same range if it wants to. The result is > undefined (just like issuing overlapping reads and writes to a disk > at the same time) so it's the application's responsibility to avoid > overlapping IO if it is a problem. > I was thinking along the line to provide finer granularity lock to allow concurrent direct IO to different offset/range, but to same offset, they have to be serialized. If it's undefined behavior, i.e. overlapping is allowed, then concurrent dio implementation is much easier. But not sure if any apps currently using DIO aware of the ordering has to be done at the application level. > Cheers, > > Dave. ^ permalink raw reply [flat|nested] 138+ messages in thread
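As a purely illustrative sketch of the "track the in-flight DIO range" idea mentioned above (invented names, userspace only; an in-kernel range lock would look quite different), overlapping DIO writes wait while disjoint ones proceed concurrently:

/* Sketch of per-range serialization: overlapping ranges wait, disjoint ones run. */
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct dio_range {
	off_t start, end;		/* [start, end) */
	struct dio_range *next;
};

static struct dio_range *in_flight;
static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t range_cond = PTHREAD_COND_INITIALIZER;

static int overlaps(off_t s, off_t e)
{
	for (struct dio_range *r = in_flight; r; r = r->next)
		if (s < r->end && r->start < e)
			return 1;
	return 0;
}

/* Block until no in-flight DIO overlaps [start, end), then record the range. */
struct dio_range *dio_range_start(off_t start, off_t end)
{
	struct dio_range *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->start = start;
	r->end = end;
	pthread_mutex_lock(&range_lock);
	while (overlaps(start, end))
		pthread_cond_wait(&range_cond, &range_lock);
	r->next = in_flight;
	in_flight = r;
	pthread_mutex_unlock(&range_lock);
	return r;
}

/* Remove the range once the IO completes and wake any waiters. */
void dio_range_end(struct dio_range *r)
{
	pthread_mutex_lock(&range_lock);
	for (struct dio_range **p = &in_flight; *p; p = &(*p)->next) {
		if (*p == r) {
			*p = r->next;
			break;
		}
	}
	pthread_cond_broadcast(&range_cond);
	pthread_mutex_unlock(&range_lock);
	free(r);
}

In this sketch a DIO write path would call dio_range_start() before submitting the IO and dio_range_end() on completion; whether that extra bookkeeping is worth it, versus simply documenting overlapping DIO as undefined, is exactly the question being debated in this thread.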
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 21:49 ` Mingming Cao @ 2011-03-31 0:05 ` Matthew Wilcox 2011-03-31 1:00 ` Joel Becker 1 sibling, 0 replies; 138+ messages in thread From: Matthew Wilcox @ 2011-03-31 0:05 UTC (permalink / raw) To: Mingming Cao Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote: > > Direct IO semantics have always been that the application is allowed > > to overlap IO to the same range if it wants to. The result is > > undefined (just like issuing overlapping reads and writes to a disk > > at the same time) so it's the application's responsibility to avoid > > overlapping IO if it is a problem. > > I was thinking along the line to provide finer granularity lock to allow > concurrent direct IO to different offset/range, but to same offset, they > have to be serialized. If it's undefined behavior, i.e. overlapping is > allowed, then concurrent dio implementation is much easier. But not sure > if any apps currently using DIO aware of the ordering has to be done at > the application level. Yes, they're aware of it. And they consider it a bug if they ever do concurrent I/O to the same sector. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-30 21:49 ` Mingming Cao 2011-03-31 0:05 ` Matthew Wilcox @ 2011-03-31 1:00 ` Joel Becker 2011-04-01 21:34 ` Mingming Cao 1 sibling, 1 reply; 138+ messages in thread From: Joel Becker @ 2011-03-31 1:00 UTC (permalink / raw) To: Mingming Cao Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote: > On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote: > > For direct IO, the IO lock is always taken in shared mode, so we can > > have concurrent read and write operations taking place at once > > regardless of the offset into the file. > > > > thanks for reminding me,in xfs concurrent direct IO write to the same > offset is allowed. ocfs2 as well, with the same sort of strategem (including across the cluster). > > Direct IO semantics have always been that the application is allowed > > to overlap IO to the same range if it wants to. The result is > > undefined (just like issuing overlapping reads and writes to a disk > > at the same time) so it's the application's responsibility to avoid > > overlapping IO if it is a problem. > > > > I was thinking along the line to provide finer granularity lock to allow > concurrent direct IO to different offset/range, but to same offset, they > have to be serialized. If it's undefined behavior, i.e. overlapping is > allowed, then concurrent dio implementation is much easier. But not sure > if any apps currently using DIO aware of the ordering has to be done at > the application level. Oh dear God no. One of the major DIO use cases is to tell the kernel, "I know I won't do that, so don't spend any effort protecting me." Joel -- "I don't want to achieve immortality through my work; I want to achieve immortality through not dying." - Woody Allen http://www.jlbec.org/ jlbec@evilplan.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-31 1:00 ` Joel Becker @ 2011-04-01 21:34 ` Mingming Cao 2011-04-01 21:49 ` Joel Becker 0 siblings, 1 reply; 138+ messages in thread From: Mingming Cao @ 2011-04-01 21:34 UTC (permalink / raw) To: Joel Becker Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Wed, 2011-03-30 at 18:00 -0700, Joel Becker wrote: > On Wed, Mar 30, 2011 at 02:49:58PM -0700, Mingming Cao wrote: > > On Wed, 2011-03-30 at 13:17 +1100, Dave Chinner wrote: > > > For direct IO, the IO lock is always taken in shared mode, so we can > > > have concurrent read and write operations taking place at once > > > regardless of the offset into the file. > > > > > > > thanks for reminding me,in xfs concurrent direct IO write to the same > > offset is allowed. > > ocfs2 as well, with the same sort of strategem (including across > the cluster). > Thanks for providing view from OCFS2 side. This is good to know. > > > Direct IO semantics have always been that the application is allowed > > > to overlap IO to the same range if it wants to. The result is > > > undefined (just like issuing overlapping reads and writes to a disk > > > at the same time) so it's the application's responsibility to avoid > > > overlapping IO if it is a problem. > > > > > > > I was thinking along the line to provide finer granularity lock to allow > > concurrent direct IO to different offset/range, but to same offset, they > > have to be serialized. If it's undefined behavior, i.e. overlapping is > > allowed, then concurrent dio implementation is much easier. But not sure > > if any apps currently using DIO aware of the ordering has to be done at > > the application level. > > Oh dear God no. One of the major DIO use cases is to tell the > kernel, "I know I won't do that, so don't spend any effort protecting > me." > > Joel > Looks like so - So I think we could have a mode to turn on/off concurrent dio if the non heavy duty applications relies on filesystem to take care of the serialization. Mingming ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-04-01 21:34 ` Mingming Cao @ 2011-04-01 21:49 ` Joel Becker 0 siblings, 0 replies; 138+ messages in thread From: Joel Becker @ 2011-04-01 21:49 UTC (permalink / raw) To: Mingming Cao Cc: Dave Chinner, Ric Wheeler, James Bottomley, lsf, linux-fsdevel, linux-scsi@vger.kernel.org, device-mapper development On Fri, Apr 01, 2011 at 02:34:26PM -0700, Mingming Cao wrote: > > > I was thinking along the line to provide finer granularity lock to allow > > > concurrent direct IO to different offset/range, but to same offset, they > > > have to be serialized. If it's undefined behavior, i.e. overlapping is > > > allowed, then concurrent dio implementation is much easier. But not sure > > > if any apps currently using DIO aware of the ordering has to be done at > > > the application level. > > > > Oh dear God no. One of the major DIO use cases is to tell the > > kernel, "I know I won't do that, so don't spend any effort protecting > > me." > > > > Joel > > > > Looks like so - > > So I think we could have a mode to turn on/off concurrent dio if the non > heavy duty applications relies on filesystem to take care of the > serialization. I would prefer to leave this complexity out. If you must have it, unsafe, concurrent DIO must be the default. Let the people who really want it turn on serialized DIO. Joel -- "Get right to the heart of matters. It's the heart that matters more." http://www.jlbec.org/ jlbec@evilplan.org ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF [not found] <1301373398.2590.20.camel@mulgrave.site> 2011-03-29 5:14 ` [Lsf] Preliminary Agenda and Activities for LSF Amir Goldstein 2011-03-29 11:16 ` Ric Wheeler @ 2011-03-29 17:35 ` Chad Talbott 2011-03-29 19:09 ` Vivek Goyal 2011-03-30 4:18 ` Dave Chinner 2 siblings, 2 replies; 138+ messages in thread From: Chad Talbott @ 2011-03-29 17:35 UTC (permalink / raw) To: James Bottomley; +Cc: lsf, linux-fsdevel, Curt Wohlgemuth I'd like to propose a discussion topic: IO-less Dirty Throttling Considered Harmful... to isolation and cgroup IO schedulers in general. The disk scheduler is knocked out of the picture unless it can see the IO generated by each group above it. The world of memcg-aware writeback stacked on top of block-cgroups is a complicated one. Throttling in balance_dirty_pages() will likely be a non-starter for current users of group-aware CFQ. I'd like a discussion that covers the system-wide view of: memory -> memcg groups -> block cgroups -> multiple block devices Chad On Mon, Mar 28, 2011 at 9:36 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > Hi All, > > Since LSF is less than a week away, the programme committee put together > a just in time preliminary agenda for LSF. As you can see there is > still plenty of empty space, which you can make suggestions (to this > list with appropriate general list cc's) for filling: > > https://spreadsheets.google.com/pub?hl=en&hl=en&key=0AiQMl7GcVa7OdFdNQzM5UDRXUnVEbHlYVmZUVHQ2amc&output=html > > If you don't make suggestions, the programme committee will feel > empowered to make arbitrary assignments based on your topic and attendee > email requests ... > > We're still not quite sure what rooms we will have at the Kabuki, but > we'll add them to the spreadsheet when we know (they should be close to > each other). > > The spreadsheet above also gives contact information for all the > attendees and the programme committee. > > Yours, > > James Bottomley > on behalf of LSF/MM Programme Committee > > > _______________________________________________ > Lsf mailing list > Lsf@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/lsf > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 17:35 ` Chad Talbott @ 2011-03-29 19:09 ` Vivek Goyal 2011-03-29 20:14 ` Chad Talbott 2011-03-29 20:35 ` Jan Kara 2011-03-30 4:18 ` Dave Chinner 1 sibling, 2 replies; 138+ messages in thread From: Vivek Goyal @ 2011-03-29 19:09 UTC (permalink / raw) To: Chad Talbott; +Cc: James Bottomley, lsf, linux-fsdevel On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: > I'd like to propose a discussion topic: > > IO-less Dirty Throttling Considered Harmful... > I see that writeback has extended session at 10.00. I am assuming IO less throttling will be discussed there. Is it possible to discuss its effect on block cgroups there? I am not sure enough time is there because it ties in memory cgroup also. Or there is a session at 12.30 "memcg dirty limits and writeback", it can probably be discussed there too. > to isolation and cgroup IO schedulers in general. The disk scheduler > is knocked out of the picture unless it can see the IO generated by > each group above it. The world of memcg-aware writeback stacked on > top of block-cgroups is a complicated one. Throttling in > balance_dirty_pages() will likely be a non-starter for current users > of group-aware CFQ. Can't a single flusher thread keep all the groups busy/full on slow SATA device. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 19:09 ` Vivek Goyal @ 2011-03-29 20:14 ` Chad Talbott 2011-03-29 20:35 ` Jan Kara 1 sibling, 0 replies; 138+ messages in thread From: Chad Talbott @ 2011-03-29 20:14 UTC (permalink / raw) To: Vivek Goyal; +Cc: James Bottomley, lsf, linux-fsdevel On Tue, Mar 29, 2011 at 12:09 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: >> I'd like to propose a discussion topic: >> >> IO-less Dirty Throttling Considered Harmful... >> > > I see that writeback has extended session at 10.00. I am assuming > IO less throttling will be discussed there. Is it possible to > discuss its effect on block cgroups there? I am not sure enough > time is there because it ties in memory cgroup also. > > Or there is a session at 12.30 "memcg dirty limits and writeback", it > can probably be discussed there too. I just want to make sure that the topic is discussed and I don't want to eat into someone else's time. I'll be sure to bring it up if it's not granted a dedicated session. >> to isolation and cgroup IO schedulers in general. The disk scheduler >> is knocked out of the picture unless it can see the IO generated by >> each group above it. The world of memcg-aware writeback stacked on >> top of block-cgroups is a complicated one. Throttling in >> balance_dirty_pages() will likely be a non-starter for current users >> of group-aware CFQ. > > Can't a single flusher thread keep all the groups busy/full on slow > SATA device. A single flusher thread *could* keep all the groups busy and full, but the current implementation does nothing explicit to make that happen. I'd like to make sure that this case is considered, independent of a particular implementation. Chad -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 19:09 ` Vivek Goyal 2011-03-29 20:14 ` Chad Talbott @ 2011-03-29 20:35 ` Jan Kara 2011-03-29 21:08 ` Greg Thelen 1 sibling, 1 reply; 138+ messages in thread From: Jan Kara @ 2011-03-29 20:35 UTC (permalink / raw) To: Vivek Goyal; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel, gthelen On Tue 29-03-11 15:09:21, Vivek Goyal wrote: > On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: > > I'd like to propose a discussion topic: > > > > IO-less Dirty Throttling Considered Harmful... > > > > I see that writeback has extended session at 10.00. I am assuming > IO less throttling will be discussed there. Is it possible to > discuss its effect on block cgroups there? I am not sure enough > time is there because it ties in memory cgroup also. > > Or there is a session at 12.30 "memcg dirty limits and writeback", it > can probably be discussed there too. Yes, I'd like to have this discussion in this session if Greg agrees. We've been discussing about how to combine IO-less throttling and memcg awareness of the writeback and Greg was designing some framework to do this... Greg? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 20:35 ` Jan Kara @ 2011-03-29 21:08 ` Greg Thelen 0 siblings, 0 replies; 138+ messages in thread From: Greg Thelen @ 2011-03-29 21:08 UTC (permalink / raw) To: Jan Kara; +Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Tue, Mar 29, 2011 at 1:35 PM, Jan Kara <jack@suse.cz> wrote: > On Tue 29-03-11 15:09:21, Vivek Goyal wrote: >> On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: >> > I'd like to propose a discussion topic: >> > >> > IO-less Dirty Throttling Considered Harmful... >> > >> >> I see that writeback has extended session at 10.00. I am assuming >> IO less throttling will be discussed there. Is it possible to >> discuss its effect on block cgroups there? I am not sure enough >> time is there because it ties in memory cgroup also. >> >> Or there is a session at 12.30 "memcg dirty limits and writeback", it >> can probably be discussed there too. > Yes, I'd like to have this discussion in this session if Greg agrees. It's fine with me if the morning session considers IO-less dirty throttling with block cgroup service differentiation, but defers memcg aspects to 12:30. > We've been discussing about how to combine IO-less throttling and memcg > awareness of the writeback and Greg was designing some framework to do > this... Greg? My initial patches are between memcg and the current IO-full throttling code. However, the framework ideally will also allow for IO-less dirty throttling with memcg. I have not wrapped my head around how this should work with block cgroup isolation. I am hoping others can help out with the block aspects. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] Preliminary Agenda and Activities for LSF 2011-03-29 17:35 ` Chad Talbott 2011-03-29 19:09 ` Vivek Goyal @ 2011-03-30 4:18 ` Dave Chinner 2011-03-30 15:37 ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-03-30 4:18 UTC (permalink / raw) To: Chad Talbott; +Cc: James Bottomley, lsf, linux-fsdevel, Curt Wohlgemuth On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: > I'd like to propose a discussion topic: > > IO-less Dirty Throttling Considered Harmful... > > to isolation and cgroup IO schedulers in general. Why is that, exactly? The current writeback infrastructure isn't cgroup aware at all, so isn't that the problem you need to solve first? i.e. how to delegate page cache writeback from one context to another and account for it correctly? Once you solve that problem, triggering cgroup specific writeback from the throttling code is the same regardless of whether we are doing IO directly from the throttling code or via a separate flusher thread. Hence I don't really understand why you think IO-less throttling is really a problem. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-30 4:18 ` Dave Chinner @ 2011-03-30 15:37 ` Vivek Goyal 2011-03-30 22:20 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-03-30 15:37 UTC (permalink / raw) To: Dave Chinner; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel On Wed, Mar 30, 2011 at 03:18:02PM +1100, Dave Chinner wrote: > On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: > > I'd like to propose a discussion topic: > > > > IO-less Dirty Throttling Considered Harmful... > > > > to isolation and cgroup IO schedulers in general. > > Why is that, exactly? The current writeback infrastructure isn't > cgroup aware at all, so isn't that the problem you need to solve > first? i.e. how to delegate page cache writeback from > one context to another and account for it correctly? > > Once you solve that problem, triggering cgroup specific writeback > from the throttling code is the same regardless of whether we > are doing IO directly from the throttling code or via a separate > flusher thread. Hence I don't really understand why you think > IO-less throttling is really a problem. Dave, We are planning to track the IO context of original submitter of IO by storing that information in page_cgroup. So that is not the problem. The problem google guys are trying to raise is that can a single flusher thread keep all the groups on bdi busy in such a way so that higher prio group can get more IO done. It should not happen that flusher thread gets blocked somewhere (trying to get request descriptors on request queue) or it tries to dispatch too much IO from an inode which primarily contains pages from low prio cgroup and high prio cgroup task does not get enough pages dispatched to device hence not getting any prio over low prio group. Currently we can do some IO in the context of the writing process also hence faster group can try to dispatch its own pages to bdi for writeout. With IO less throttling, that notion will disappear. So the concern they raised that is single flusher thread per device is enough to keep faster cgroup full at the bdi and hence get the service differentiation. My take on this is that on a slow SATA device it might be enough, as long as we make sure that the flusher thread does not block on individual groups and also try to select inodes intelligently (cgroup aware manner). If it really becomes an issue on faster devices, would a flusher thread per cgroup per bdi make sense? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-30 15:37 ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) Vivek Goyal @ 2011-03-30 22:20 ` Dave Chinner 2011-03-30 22:49 ` Chad Talbott 2011-03-31 14:16 ` Vivek Goyal 0 siblings, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-03-30 22:20 UTC (permalink / raw) To: Vivek Goyal; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote: > On Wed, Mar 30, 2011 at 03:18:02PM +1100, Dave Chinner wrote: > > On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote: > > > I'd like to propose a discussion topic: > > > > > > IO-less Dirty Throttling Considered Harmful... > > > > > > to isolation and cgroup IO schedulers in general. > > > > Why is that, exactly? The current writeback infrastructure isn't > > cgroup aware at all, so isn't that the problem you need to solve > > first? i.e. how to delegate page cache writeback from > > one context to another and account for it correctly? > > > > Once you solve that problem, triggering cgroup specific writeback > > from the throttling code is the same regardless of whether we > > are doing IO directly from the throttling code or via a separate > > flusher thread. Hence I don't really understand why you think > > IO-less throttling is really a problem. > > Dave, > > We are planning to track the IO context of original submitter of IO > by storing that information in page_cgroup. So that is not the problem. > > The problem google guys are trying to raise is that can a single flusher > thread keep all the groups on bdi busy in such a way so that higher > prio group can get more IO done. Which has nothing to do with IO-less dirty throttling at all! > It should not happen that flusher > thread gets blocked somewhere (trying to get request descriptors on > request queue) A major design principle of the bdi-flusher threads is that they are supposed to block when the request queue gets full - that's how we got rid of all the congestion garbage from the writeback stack. There are plans to move the bdi-flusher threads to work queues, and once that is done all your concerns about blocking and parallelism are pretty much gone because it's trivial to have multiple writeback works in progress at once on the same bdi with that infrastructure. > or it tries to dispatch too much IO from an inode which > primarily contains pages from low prio cgroup and high prio cgroup > task does not get enough pages dispatched to device hence not getting > any prio over low prio group. That's a writeback scheduling issue independent of how we throttle, and something we don't do at all right now. Our only decision on what to write back is based on how long ago the inode was dirtied. You need to completely rework the dirty inode tracking if you want to efficiently prioritise writeback between different groups. Given that filesystems don't all use the VFS dirty inode tracking infrastructure and specific filesystems have different ideas of the order of writeback, you've got a really difficult problem there. e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity purposes which will completely screw any sort of prioritised writeback. Remember the ext3 "fsync = global sync" latency problems?
> Currently we can do some IO in the context of the writing process also > hence faster group can try to dispatch its own pages to bdi for writeout. > With IO less throttling, that notion will disappear. We'll still do exactly the same amount of throttling - what we write back is still the same decision, just made in a different place with a different trigger. > So the concern they raised that is single flusher thread per device > is enough to keep faster cgroup full at the bdi and hence get the > service differentiation. I think there's much bigger problems than that. > My take on this is that on a slow SATA device it might be enough, as long as > we make sure that the flusher thread does not block on individual groups I don't think you can ever guarantee that - e.g. Delayed allocation will need metadata to be read from disk to perform the allocation so preventing blocking is impossible. Besides, see above about using work queues rather than threads for flushing. > and also try to select inodes intelligently (cgroup aware manner). Such selection algorithms would need to be able to handle hundreds of thousands of newly dirtied inodes per second so sorting and selecting them efficiently will be a major issue... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-30 22:20 ` Dave Chinner @ 2011-03-30 22:49 ` Chad Talbott 2011-03-31 3:00 ` Dave Chinner 2011-03-31 14:16 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Chad Talbott @ 2011-03-30 22:49 UTC (permalink / raw) To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote: >> We are planning to track the IO context of original submitter of IO >> by storing that information in page_cgroup. So that is not the problem. >> >> The problem google guys are trying to raise is that can a single flusher >> thread keep all the groups on bdi busy in such a way so that higher >> prio group can get more IO done. > > Which has nothing to do with IO-less dirty throttling at all! Not quite. Pre IO-less dirty throttling, any thread which was dirtying did the writeback itself. Because there's no shortage of threads to do the work, the IO scheduler sees a bunch of threads doing writes against a given BDI and schedules them against each other. This is how async IO isolation works for us. >> It should not happen that flusher >> thread gets blocked somewhere (trying to get request descriptors on >> request queue) > > A major design principle of the bdi-flusher threads is that they > are supposed to block when the request queue gets full - that's how > we got rid of all the congestion garbage from the writeback > stack. With IO cgroups and async write isolation, there are multiple queues per disk that all need to be filled to allow cgroup-aware CFQ schedule between them. If the per-BDI threads could be taught to fill each per-cgroup queue before giving up on a BDI, then IO-less throttling could work. Also, having per-(BDI, blkio cgroup)-flusher threads would work. I think it's complicated enough to warrant a discussion. > There are plans to move the bdi-flusher threads to work queues, and > once that is done all your concerns about blocking and parallelism > are pretty much gone because it's trivial to have multiple writeback > works in progress at once on the same bdi with that infrastructure. This sounds promising. >> So the concern they raised that is single flusher thread per device >> is enough to keep faster cgroup full at the bdi and hence get the >> service differentiation. > > I think there's much bigger problems than that. We seem to be agreeing that it's a complicated problem. That's why I think async write isolation needs some design-level discussion. Chad ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-30 22:49 ` Chad Talbott @ 2011-03-31 3:00 ` Dave Chinner 0 siblings, 0 replies; 138+ messages in thread From: Dave Chinner @ 2011-03-31 3:00 UTC (permalink / raw) To: Chad Talbott; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Wed, Mar 30, 2011 at 03:49:17PM -0700, Chad Talbott wrote: > On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote: > >> We are planning to track the IO context of original submitter of IO > >> by storing that information in page_cgroup. So that is not the problem. > >> > >> The problem google guys are trying to raise is that can a single flusher > >> thread keep all the groups on bdi busy in such a way so that higher > >> prio group can get more IO done. > > > > Which has nothing to do with IO-less dirty throttling at all! > > Not quite. Pre IO-less dirty throttling, any thread which was > dirtying did the writeback itself. Because there's no shortage of > threads to do the work, the IO scheduler sees a bunch of threads doing > writes against a given BDI and schedules them against each other. > This is how async IO isolation works for us. And it's precisely this behaviour that makes foreground throttling a scalability limitation, both from a list/lock contention POV and from an IO optimisation POV. > >> So the concern they raised that is single flusher thread per device > >> is enough to keep faster cgroup full at the bdi and hence get the > >> service differentiation. > > > > I think there's much bigger problems than that. > > We seem to be agreeing that it's a complicated problem. That's why I > think async write isolation needs some design-level discussion. From my perspective, we've still got a significant amount of work to get writeback into a scalable form for current generation machines, let alone future machines. Fixing the writeback code is a slow process because of all the subtle interactions with different filesystems and different workloads, which is made more complex by the fact that many filesystems implement their own writeback paths and have their own writeback semantics. We need to make the right decision on what IO to issue, not just issue lots of IO and hope it all turns out OK in the end. If we can't get that decision matrix right for the simple case of a global context, then we have no hope of extending it to cgroup-aware writeback. IOWs, we need to get writeback working in a scalable manner before we complicate it immensely with all this cgroup and isolation madness. Hence I think trying to make writeback cgroup-aware is probably 6-12 months premature at this point and trying to do it now will only serve to make it harder to get the common, simple cases working as we desire them to... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-30 22:20 ` Dave Chinner 2011-03-30 22:49 ` Chad Talbott @ 2011-03-31 14:16 ` Vivek Goyal 2011-03-31 14:34 ` Chris Mason 2011-03-31 14:50 ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen 1 sibling, 2 replies; 138+ messages in thread From: Vivek Goyal @ 2011-03-31 14:16 UTC (permalink / raw) To: Dave Chinner; +Cc: Chad Talbott, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: [..] > > It should not happen that flusher > > thread gets blocked somewhere (trying to get request descriptors on > > request queue) > > A major design principle of the bdi-flusher threads is that they > are supposed to block when the request queue gets full - that's how > we got rid of all the congestion garbage from the writeback > stack. Instead of blocking flusher threads, can they voluntarily stop submitting more IO when they realize too much IO is in progress? We already keep stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and the flusher thread can use that? Jens mentioned this idea of getting rid of this request accounting at request queue level and moving it somewhere up, say at bdi level. > > There are plans to move the bdi-flusher threads to work queues, and > once that is done all your concerns about blocking and parallelism > are pretty much gone because it's trivial to have multiple writeback > works in progress at once on the same bdi with that infrastructure. Will this essentially not nullify the advantage of IO less throttling? I thought that we did not want to have multiple threads doing writeback at the same time to avoid number of seeks and achieve better throughput. Now with this I am assuming that multiple works can be in progress doing writeback. Maybe we can limit writeback work to one per group so in the global context only one work will be active. > > > or it tries to dispatch too much IO from an inode which > > primarily contains pages from low prio cgroup and high prio cgroup > > task does not get enough pages dispatched to device hence not getting > > any prio over low prio group. > > That's a writeback scheduling issue independent of how we throttle, > and something we don't do at all right now. Our only decision on > what to write back is based on how long ago the inode was dirtied. > You need to completely rework the dirty inode tracking if you want > to efficiently prioritise writeback between different groups. > > Given that filesystems don't all use the VFS dirty inode tracking > infrastructure and specific filesystems have different ideas of the > order of writeback, you've got a really difficult problem there. > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity > purposes which will completely screw any sort of prioritised > writeback. Remember the ext3 "fsync = global sync" latency problems? Ok, so if one issues a fsync when filesystem is mounted in "data=ordered" mode we will flush all the writes to disk before committing meta data. I have no knowledge of filesystem code so here comes a stupid question. Do multiple fsyncs get completely serialized or they can progress in parallel? IOW, if a fsync is in progress and we slow down the writeback of that inode's pages, can other fsync still make progress without getting stuck behind the previous fsync? For me knowing this is also important in another context of absolute IO throttling.
- If a fsync is in progress and gets throttled at the device, what impact does it have on other file system operations? What gets serialized behind it? [..] > > and also try to select inodes intelligently (cgroup aware manner). > Such selection algorithms would need to be able to handle hundreds > of thousands of newly dirtied inodes per second so sorting and > selecting them efficiently will be a major issue... There was a proposal of the memory cgroup maintaining a per memory cgroup, per bdi structure which will keep a list of inodes which need writeback from that cgroup. So any cgroup looking for writeback will queue up this structure on the bdi and flusher threads can walk through this list and figure out which memory cgroups and which inodes within a memory cgroup need to be written back. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
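[Editor's note: a minimal C sketch of the per-memcg, per-bdi structure Vivek describes in the message above. All names and fields here are illustrative assumptions for discussion, not taken from any posted patch series.]

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/backing-dev.h>

    struct mem_cgroup;  /* opaque here; defined by the memory controller */

    /* hypothetical: one instance per (memory cgroup, bdi) pair */
    struct memcg_bdi_writeback {
            struct mem_cgroup       *memcg;    /* cgroup that dirtied the pages */
            struct backing_dev_info *bdi;      /* device the inodes belong to */
            struct list_head        b_dirty;   /* inodes needing writeback for this memcg */
            struct list_head        bdi_node;  /* linkage on the bdi's list of waiting memcgs */
            spinlock_t              lock;      /* protects b_dirty */
    };

A cgroup wanting writeback would queue such a structure on the bdi, and the flusher would walk the bdi's list of these structures to decide which memcg's inodes to write next, as described above.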
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-31 14:16 ` Vivek Goyal @ 2011-03-31 14:34 ` Chris Mason 2011-03-31 22:14 ` Dave Chinner 2011-03-31 22:25 ` Vivek Goyal 2011-03-31 14:50 ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen 1 sibling, 2 replies; 138+ messages in thread From: Chris Mason @ 2011-03-31 14:34 UTC (permalink / raw) To: Vivek Goyal Cc: Dave Chinner, Chad Talbott, James Bottomley, lsf, linux-fsdevel Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400: > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > > [..] > > > It should not happen that flusher > > > thread gets blocked somewhere (trying to get request descriptors on > > > request queue) > > > > A major design principle of the bdi-flusher threads is that they > > are supposed to block when the request queue gets full - that's how > > we got rid of all the congestion garbage from the writeback > > stack. > > Instead of blocking flusher threads, can they voluntarily stop submitting > more IO when they realize too much IO is in progress. We aready keep > stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and > flusher tread can use that? We could, but the difficult part is keeping the hardware saturated as requests complete. The voluntarily stopping part is pretty much the same thing the congestion code was trying to do. > > Jens mentioned this idea of how about getting rid of this request accounting > at request queue level and move it somewhere up say at bdi level. > > > > > There are plans to move the bdi-flusher threads to work queues, and > > once that is done all your concerns about blocking and parallelism > > are pretty much gone because it's trivial to have multiple writeback > > works in progress at once on the same bdi with that infrastructure. > > Will this essentially not nullify the advantage of IO less throttling? > I thought that we did not want have multiple threads doing writeback > at the same time to avoid number of seeks and achieve better throughput. Work queues alone are probably not appropriate, at least for spinning storage. It will introduce seeks into what would have been sequential writes. I had to make the btrfs worker thread pools after having a lot of trouble cramming writeback into work queues. > > Now with this I am assuming that multiple work can be on progress doing > writeback. May be we can limit writeback work one per group so in global > context only one work will be active. > > > > > > or it tries to dispatch too much IO from an inode which > > > primarily contains pages from low prio cgroup and high prio cgroup > > > task does not get enough pages dispatched to device hence not getting > > > any prio over low prio group. > > > > That's a writeback scheduling issue independent of how we throttle, > > and something we don't do at all right now. Our only decision on > > what to write back is based on how low ago the inode was dirtied. > > You need to completely rework the dirty inode tracking if you want > > to efficiently prioritise writeback between different groups. > > > > Given that filesystems don't all use the VFS dirty inode tracking > > infrastructure and specific filesystems have different ideas of the > > order of writeback, you've got a really difficult problem there. > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity > > purposes which will completely screw any sort of prioritised > > writeback. 
Remember the ext3 "fsync = global sync" latency problems? > > Ok, so if one issues a fsync when filesystem is mounted in "data=ordered" > mode we will flush all the writes to disk before committing meta data. > > I have no knowledge of filesystem code so here comes a stupid question. > Do multiple fsyncs get completely serialized or they can progress in > parallel? IOW, if a fsync is in progress and we slow down the writeback > of that inode's pages, can other fsync still make progress without > getting stuck behind the previous fsync? An fsync has two basic parts 1) write the file data pages 2a) flush data=ordered in reiserfs/ext34 2b) do the real transaction commit We can do part one in parallel across any number of writers. For part two, there is only one running transaction. If the FS is smart, the commit will only force down the transaction that last modified the file. 50 procs running fsync may only need to trigger one commit. btrfs and xfs do data=ordered differently. They still avoid exposing stale data but we don't pull the plug on the whole bathtub for every commit. In the btrfs case, we don't update metadata until the data is written, so commits never have to force data writes. xfs does something lighter weight but with similar benefits. ext4 with delayed allocation on and data=ordered will only end up forcing down writes that are not under delayed allocation. This is a much smaller subset of the IO than ext3/reiserfs will do. > > For me knowing this is also important in another context of absolute IO > throttling. > > - If a fsync is in progress and gets throttled at device, what impact it > has on other file system operations. What gets serialized behind it. It depends. atime updates log inodes and logging needs a transaction and transactions sometimes need to wait for the last transaction to finish. So its very possible you'll make anything using the FS appear to stop. -chris ^ permalink raw reply [flat|nested] 138+ messages in thread
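[Editor's note: a rough sketch of the two-part fsync Chris describes above. This is not any real filesystem's ->fsync; example_commit_transaction() is a hypothetical stand-in for step 2, and the prototype only approximates the fsync signature of this era.]

    #include <linux/fs.h>

    static int example_commit_transaction(struct inode *inode); /* hypothetical: steps 2a/2b */

    static int example_fsync(struct file *file, int datasync)
    {
            struct address_space *mapping = file->f_mapping;
            int ret;

            /* part 1: write the file's dirty data pages; this part can run
             * in parallel across any number of fsync callers */
            ret = filemap_write_and_wait(mapping);
            if (ret)
                    return ret;

            /* part 2: commit the transaction that last modified the file.
             * In ext3/reiserfs data=ordered this first flushes the ordered
             * data list; btrfs and XFS avoid forcing data writes here. */
            return example_commit_transaction(mapping->host);
    }

Fifty processes calling this concurrently may still trigger only one commit, since part 2 is serialized on the single running transaction.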
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-31 14:34 ` Chris Mason @ 2011-03-31 22:14 ` Dave Chinner 2011-03-31 23:43 ` Chris Mason 2011-04-01 1:34 ` Vivek Goyal 1 sibling, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-03-31 22:14 UTC (permalink / raw) To: Chris Mason Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote: > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400: > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > > > > [..] > > > > It should not happen that flusher > > > > thread gets blocked somewhere (trying to get request descriptors on > > > > request queue) > > > > > > A major design principle of the bdi-flusher threads is that they > > > are supposed to block when the request queue gets full - that's how > > > we got rid of all the congestion garbage from the writeback > > > stack. > > > > Instead of blocking flusher threads, can they voluntarily stop submitting > > more IO when they realize too much IO is in progress. We already keep > > stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and > > flusher thread can use that? > > We could, but the difficult part is keeping the hardware saturated as > requests complete. The voluntarily stopping part is pretty much the > same thing the congestion code was trying to do. And it was the bit that was causing most problems. IMO, we don't want to go back to that single threaded mechanism, especially as we have no shortage of cores and threads available... > > > There are plans to move the bdi-flusher threads to work queues, and > > > once that is done all your concerns about blocking and parallelism > > > are pretty much gone because it's trivial to have multiple writeback > > > works in progress at once on the same bdi with that infrastructure. > > > > Will this essentially not nullify the advantage of IO less throttling? > > I thought that we did not want to have multiple threads doing writeback > > at the same time to avoid number of seeks and achieve better throughput. > > Work queues alone are probably not appropriate, at least for spinning > storage. It will introduce seeks into what would have been > sequential writes. I had to make the btrfs worker thread pools after > having a lot of trouble cramming writeback into work queues. That was before the cmwq infrastructure, right? cmwq changes the behaviour of workqueues in such a way that they can simply be thought of as having a thread pool of a specific size.... As a strict translation of the existing one flusher thread per bdi, only allowing one work at a time to be issued (i.e. workqueue concurrency of 1) would give the same behaviour without having all the thread management issues. i.e. regardless of the writeback parallelism mechanism we have the same issue of managing writeback to minimise seeking. cmwq just makes the implementation far simpler, IMO. As to whether that causes seeks or not, that depends on how we are driving the concurrent works/threads.
Hence I don't see any particular problem with using workqueues to achieve the necessary writeback parallelism that cgroup aware throttling requires.... > > > > > or it tries to dispatch too much IO from an inode which > > > > > primarily contains pages from low prio cgroup and high prio cgroup > > > > > task does not get enough pages dispatched to device hence not getting > > > > > any prio over low prio group. > > > > > > > > That's a writeback scheduling issue independent of how we throttle, > > > > and something we don't do at all right now. Our only decision on > > > > what to write back is based on how long ago the inode was dirtied. > > > > You need to completely rework the dirty inode tracking if you want > > > > to efficiently prioritise writeback between different groups. > > > > > > > > Given that filesystems don't all use the VFS dirty inode tracking > > > > infrastructure and specific filesystems have different ideas of the > > > > order of writeback, you've got a really difficult problem there. > > > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity > > > > purposes which will completely screw any sort of prioritised > > > > writeback. Remember the ext3 "fsync = global sync" latency problems? > > > > > > Ok, so if one issues a fsync when filesystem is mounted in "data=ordered" > > > mode we will flush all the writes to disk before committing meta data. > > > > > > I have no knowledge of filesystem code so here comes a stupid question. > > > Do multiple fsyncs get completely serialized or they can progress in > > > parallel? IOW, if a fsync is in progress and we slow down the writeback > > > of that inode's pages, can other fsync still make progress without > > > getting stuck behind the previous fsync? > > > > An fsync has two basic parts > > > > 1) write the file data pages > > 2a) flush data=ordered in reiserfs/ext34 > > 2b) do the real transaction commit > > > > > > We can do part one in parallel across any number of writers. For part > > two, there is only one running transaction. If the FS is smart, the > > commit will only force down the transaction that last modified the > > file. 50 procs running fsync may only need to trigger one commit. Right. However the real issue here, I think, is that the IO comes from a thread not associated with writeback nor is in any way cgroup aware. IOWs, getting the right context to each block being written back will be complex and filesystem specific. The other thing that concerns me is how metadata IO is accounted and throttled. Doing stuff like creating lots of small files will generate as much or more metadata IO than data IO, and none of that will be associated with a cgroup. Indeed, in XFS metadata doesn't even use the pagecache anymore, and it's written back by a thread (soon to be a workqueue) deep inside XFS's journalling subsystem, so it's pretty much impossible to associate that IO with any specific cgroup. What happens to that IO? Blocking it arbitrarily can have the same effect as blocking transaction completion - it can cause the filesystem to completely stop.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
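[Editor's note: a minimal sketch of the cmwq arrangement Dave outlines above, assuming the planned conversion of the bdi-flusher threads; the names and the per-cgroup work structure are illustrative. With max_active of 1 this is the "strict translation" of one flusher per bdi; queueing one work item per dirty cgroup would give the IO scheduler the concurrency it needs to see.]

    #include <linux/workqueue.h>
    #include <linux/backing-dev.h>

    /* assumed shape of a per-cgroup writeback request */
    struct cgroup_wb_work {
            struct work_struct      work;
            struct backing_dev_info *bdi;
            /* plus identification of the cgroup whose inodes to flush */
    };

    static struct workqueue_struct *bdi_writeback_wq;

    static int __init bdi_writeback_wq_init(void)
    {
            /* max_active == 1 mimics today's single flusher; raising it (or
             * queueing one cgroup_wb_work per dirty cgroup) allows multiple
             * writeback works in progress on the same bdi */
            bdi_writeback_wq = alloc_workqueue("bdi_writeback",
                                               WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
            return bdi_writeback_wq ? 0 : -ENOMEM;
    }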
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-31 22:14 ` Dave Chinner @ 2011-03-31 23:43 ` Chris Mason 2011-04-01 0:55 ` Dave Chinner 2011-04-01 1:34 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Chris Mason @ 2011-03-31 23:43 UTC (permalink / raw) To: Dave Chinner Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel Excerpts from Dave Chinner's message of 2011-03-31 18:14:25 -0400: > On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote: > > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400: > > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > > > > > > [..] > > > > > It should not happen that flusher > > > > > thread gets blocked somewhere (trying to get request descriptors on > > > > > request queue) > > > > > > > > A major design principle of the bdi-flusher threads is that they > > > > are supposed to block when the request queue gets full - that's how > > > > we got rid of all the congestion garbage from the writeback > > > > stack. > > > > > > Instead of blocking flusher threads, can they voluntarily stop submitting > > > more IO when they realize too much IO is in progress. We aready keep > > > stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and > > > flusher tread can use that? > > > > We could, but the difficult part is keeping the hardware saturated as > > requests complete. The voluntarily stopping part is pretty much the > > same thing the congestion code was trying to do. > > And it was the bit that was causing most problems. IMO, we don't want to > go back to that single threaded mechanism, especially as we have > no shortage of cores and threads available... Getting rid of the congestion code was my favorite part of the per-bdi work. > > > > > There are plans to move the bdi-flusher threads to work queues, and > > > > once that is done all your concerns about blocking and parallelism > > > > are pretty much gone because it's trivial to have multiple writeback > > > > works in progress at once on the same bdi with that infrastructure. > > > > > > Will this essentially not nullify the advantage of IO less throttling? > > > I thought that we did not want have multiple threads doing writeback > > > at the same time to avoid number of seeks and achieve better throughput. > > > > Work queues alone are probably not appropriate, at least for spinning > > storage. It will introduce seeks into what would have been > > sequential writes. I had to make the btrfs worker thread pools after > > having a lot of trouble cramming writeback into work queues. > > That was before the cmwq infrastructure, right? cmwq changes the > behaviour of workqueues in such a way that they can simply be > thought of as having a thread pool of a specific size.... > > As a strict translation of the existing one flusher thread per bdi, > then only allowing one work at a time to be issued (i.e. workqueue > concurency of 1) would give the same behaviour without having all > the thread management issues. i.e. regardless of the writeback > parallelism mechanism we have the same issue of managing writeback > to minimise seeking. cmwq just makes the implementation far simpler, > IMO. > > As to whether that causes seeks or not, that depends on how we are > driving the concurrent works/threads. 
If we drive a concurrent work > per dirty cgroup that needs writing back, then we achieve the > concurrency needed to make the IO scheduler appropriately throttle > the IO. For the case of no cgroups, then we still only have a single > writeback work in progress at a time and behaviour is no different > to the current setup. Hence I don't see any particular problem with > using workqueues to acheive the necessary writeback parallelism that > cgroup aware throttling requires.... Yes, as long as we aren't trying to shotgun style spread the inodes across a bunch of threads, it should work well enough. The trick will just be making sure we don't end up with a lot of inode interleaving in the delalloc allocations. > > > > > > or it tries to dispatch too much IO from an inode which > > > > > primarily contains pages from low prio cgroup and high prio cgroup > > > > > task does not get enough pages dispatched to device hence not getting > > > > > any prio over low prio group. > > > > > > > > That's a writeback scheduling issue independent of how we throttle, > > > > and something we don't do at all right now. Our only decision on > > > > what to write back is based on how low ago the inode was dirtied. > > > > You need to completely rework the dirty inode tracking if you want > > > > to efficiently prioritise writeback between different groups. > > > > > > > > Given that filesystems don't all use the VFS dirty inode tracking > > > > infrastructure and specific filesystems have different ideas of the > > > > order of writeback, you've got a really difficult problem there. > > > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity > > > > purposes which will completely screw any sort of prioritised > > > > writeback. Remember the ext3 "fsync = global sync" latency problems? > > > > > > Ok, so if one issues a fsync when filesystem is mounted in "data=ordered" > > > mode we will flush all the writes to disk before committing meta data. > > > > > > I have no knowledge of filesystem code so here comes a stupid question. > > > Do multiple fsyncs get completely serialized or they can progress in > > > parallel? IOW, if a fsync is in progress and we slow down the writeback > > > of that inode's pages, can other fsync still make progress without > > > getting stuck behind the previous fsync? > > > > An fsync has two basic parts > > > > 1) write the file data pages > > 2a) flush data=ordered in reiserfs/ext34 > > 2b) do the real transaction commit > > > > > > We can do part one in parallel across any number of writers. For part > > two, there is only one running transaction. If the FS is smart, the > > commit will only force down the transaction that last modified the > > file. 50 procs running fsync may only need to trigger one commit. > > Right. However the real issue here, I think, is that the IO comes > from a thread not associated with writeback nor is in any way cgroup > aware. IOWs, getting the right context to each block being written > back will be complex and filesystem specific. The ext3 style data=ordered requires that we give the same amount of bandwidth to all the data=ordered IO during commit. Otherwise we end up making the commit wait for some poor page in the data=ordered list and that slows everyone down. ick. > > The other thing that concerns me is how metadata IO is accounted and > throttled. Doing stuff like creating lots of small files will > generate as much or more metadata IO than data IO, and none of that > will be associated with a cgroup. 
Indeed, in XFS metadata doesn't > even use the pagecache anymore, and it's written back by a thread > (soon to be a workqueue) deep inside XFS's journalling subsystem, so > it's pretty much impossible to associate that IO with any specific > cgroup. > > What happens to that IO? Blocking it arbitrarily can have the same > effect as blocking transaction completion - it can cause the > filesystem to completely stop.... ick again, it's the same problem as the data=ordered stuff exactly. -chris ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-31 23:43 ` Chris Mason @ 2011-04-01 0:55 ` Dave Chinner 0 siblings, 0 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-01 0:55 UTC (permalink / raw) To: Chris Mason Cc: Vivek Goyal, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 07:43:27PM -0400, Chris Mason wrote: > Excerpts from Dave Chinner's message of 2011-03-31 18:14:25 -0400: > > On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote: > > > Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400: > > > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > > > > > There are plans to move the bdi-flusher threads to work queues, and > > > > > once that is done all your concerns about blocking and parallelism > > > > > are pretty much gone because it's trivial to have multiple writeback > > > > > works in progress at once on the same bdi with that infrastructure. > > > > > > > > Will this essentially not nullify the advantage of IO less throttling? > > > > I thought that we did not want to have multiple threads doing writeback > > > > at the same time to avoid number of seeks and achieve better throughput. > > > > > > Work queues alone are probably not appropriate, at least for spinning > > > storage. It will introduce seeks into what would have been > > > sequential writes. I had to make the btrfs worker thread pools after > > > having a lot of trouble cramming writeback into work queues. > > > > That was before the cmwq infrastructure, right? cmwq changes the > > behaviour of workqueues in such a way that they can simply be > > thought of as having a thread pool of a specific size.... > > > > As a strict translation of the existing one flusher thread per bdi, > > then only allowing one work at a time to be issued (i.e. workqueue > > concurrency of 1) would give the same behaviour without having all > > the thread management issues. i.e. regardless of the writeback > > parallelism mechanism we have the same issue of managing writeback > > to minimise seeking. cmwq just makes the implementation far simpler, > > IMO. > > > > As to whether that causes seeks or not, that depends on how we are > > driving the concurrent works/threads. If we drive a concurrent work > > per dirty cgroup that needs writing back, then we achieve the > > concurrency needed to make the IO scheduler appropriately throttle > > the IO. For the case of no cgroups, then we still only have a single > > writeback work in progress at a time and behaviour is no different > > to the current setup. Hence I don't see any particular problem with > > using workqueues to achieve the necessary writeback parallelism that > > cgroup aware throttling requires.... > > Yes, as long as we aren't trying to shotgun style spread the > inodes across a bunch of threads, it should work well enough. The trick > will just be making sure we don't end up with a lot of inode > interleaving in the delalloc allocations. That's a problem for any concurrent writeback mechanism as it passes through the filesystem. It comes down to filesystems also needing to have either concurrency- or cgroup-aware allocation mechanisms. It's just another piece of the puzzle, really. In the case of XFS, cgroup awareness could be as simple as associating each cgroup with a specific allocation group and keeping each cgroup as isolated as possible.
There is precedent for doing this in XFS - the filestreams allocator makes these sorts of dynamic associations on a per-directory basis. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-31 22:14 ` Dave Chinner 2011-03-31 23:43 ` Chris Mason @ 2011-04-01 1:34 ` Vivek Goyal 2011-04-01 4:36 ` Dave Chinner 1 sibling, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-01 1:34 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote: [..] > > An fsync has two basic parts > > > > 1) write the file data pages > > 2a) flush data=ordered in reiserfs/ext34 > > 2b) do the real transaction commit > > > > > > We can do part one in parallel across any number of writers. For part > > two, there is only one running transaction. If the FS is smart, the > > commit will only force down the transaction that last modified the > > file. 50 procs running fsync may only need to trigger one commit. > > Right. However the real issue here, I think, is that the IO comes > from a thread not associated with writeback nor is in any way cgroup > aware. IOWs, getting the right context to each block being written > back will be complex and filesystem specific. > > The other thing that concerns me is how metadata IO is accounted and > throttled. Doing stuff like creating lots of small files will > generate as much or more metadata IO than data IO, and none of that > will be associated with a cgroup. Indeed, in XFS metadata doesn't > even use the pagecache anymore, and it's written back by a thread > (soon to be a workqueue) deep inside XFS's journalling subsystem, so > it's pretty much impossible to associate that IO with any specific > cgroup. > > What happens to that IO? Blocking it arbitrarily can have the same > effect as blocking transaction completion - it can cause the > filesystem to completely stop.... Dave, As of today, the cgroup/context of IO is decided from the IO submitting thread context. So any IO submitted by kernel threads (flusher, kjournald, workqueue threads) goes to root group IO which should remain unthrottled. (It is not a good idea to put throttling rules for root group). Now any meta data operation happening in the context of a process will still be subject to throttling (is there any?). If that's a concern, can the filesystem mark that bio (REQ_META?) and the throttling logic can possibly let these bios pass through. Determining the cgroup/context from the submitting process has the issue that any writeback IO is not throttled and we are looking for a way to control buffered writes also. If we start determining the cgroup from some information stored in page_cgroup, then we are more likely to run into issues of priority inversion (filesystem in ordered mode flushing data first before committing meta data changes). So should we throttle buffered writes when the page cache is being dirtied and not when these writes are being written back to the device? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
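[Editor's note: a hedged sketch of the two rules Vivek describes above: IO submitted from kernel thread context is attributed to the root group and left unthrottled, and metadata bios tagged REQ_META could be allowed to bypass throttling. The helper name is made up for illustration and is not the actual blk-throttle code; as the following messages note, REQ_META tagging is not consistently used today.]

    #include <linux/bio.h>
    #include <linux/sched.h>

    /* hypothetical check at the throttling entry point */
    static bool throtl_should_bypass(struct bio *bio)
    {
            /* flusher/kjournald/workqueue submissions land in the root group,
             * which carries no throttling rules */
            if (current->flags & PF_KTHREAD)
                    return true;

            /* filesystem-tagged metadata: let it pass to limit priority
             * inversion behind throttled data IO */
            if (bio->bi_rw & REQ_META)
                    return true;

            return false;
    }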
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-04-01 1:34 ` Vivek Goyal @ 2011-04-01 4:36 ` Dave Chinner 2011-04-01 6:32 ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig 2011-04-01 14:49 ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] " Vivek Goyal 0 siblings, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-01 4:36 UTC (permalink / raw) To: Vivek Goyal Cc: Chris Mason, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 09:34:24PM -0400, Vivek Goyal wrote: > On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote: > > [..] > > > An fsync has two basic parts > > > > > > 1) write the file data pages > > > 2a) flush data=ordered in reiserfs/ext34 > > > 2b) do the real transaction commit > > > > > > > > > We can do part one in parallel across any number of writers. For part > > > two, there is only one running transaction. If the FS is smart, the > > > commit will only force down the transaction that last modified the > > > file. 50 procs running fsync may only need to trigger one commit. > > > > Right. However the real issue here, I think, is that the IO comes > > from a thread not associated with writeback nor is in any way cgroup > > aware. IOWs, getting the right context to each block being written > > back will be complex and filesystem specific. > > > > The other thing that concerns me is how metadata IO is accounted and > > throttled. Doing stuff like creating lots of small files will > > generate as much or more metadata IO than data IO, and none of that > > will be associated with a cgroup. Indeed, in XFS metadata doesn't > > even use the pagecache anymore, and it's written back by a thread > > (soon to be a workqueue) deep inside XFS's journalling subsystem, so > > it's pretty much impossible to associate that IO with any specific > > cgroup. > > > > What happens to that IO? Blocking it arbitrarily can have the same > > effect as blocking transaction completion - it can cause the > > filesystem to completely stop.... > > Dave, > > As of today, the cgroup/context of IO is decided from the IO submitting > thread context. So any IO submitted by kernel threads (flusher, kjournald, > workqueue threads) goes to root group IO which should remain unthrottled. > (It is not a good idea to put throttling rules for root group). > > Now any meta data operation happening in the context of a process will > still be subject to throttling (is there any?). Certainly - almost all metadata _reads_ will occur in process context, though for XFS _most_ writes occur in kernel thread context. That being said, we can still get kernel threads hung up on metadata read IO that has been throttled in process context. e.g. a process is creating a new inode, which causes allocation to occur, which triggers a read of a free space btree block, which gets throttled. Writeback comes along, tries to do delayed allocation, gets hung up trying to allocate out of the same AG that is locked by the process creating a new inode. A single allocation can lock multiple AGs, and so if we get enough backed-up allocations this can cause all AGs in the filesystem to become locked. At this point no new allocation can complete until the throttled IO is submitted, completed and the allocation is committed and the AG unlocked.... > If that's a concern, > can the filesystem mark that bio (REQ_META?) and the throttling logic can possibly > let these bios pass through.
We already tag most metadata IO in this way. However, you can't just not throttle metadata IO. e.g. a process doing a directory traversal (e.g. a find) will issue hundreds of IOs per second so if you don't throttle them it will adversely affect the throughput of other groups that you are trying to guarantee a certain throughput or iops rate for. Indeed, not throttling metadata writes will seriously harm throughput for controlled cgroups when the log fills up and the filesystem pushes out thousands of metadata IOs in a very short period of time. Yet if we combine that with the problem that, anywhere you delay metadata IO for arbitrarily long periods of time (read or write) via priority based mechanisms, you risk causing a train-smash of blocked processes all waiting for the throttled IO to complete. And that will seriously harm throughput for controlled cgroups because they can't make any modifications to the filesystem. I'm not sure if there is any middle ground here - I can't see any at this point... > Determining the cgroup/context from the submitting process has the > issue that any writeback IO is not throttled and we are looking > for a way to control buffered writes also. If we start determining > the cgroup from some information stored in page_cgroup, then we > are more likely to run into issues of priority inversion > (filesystem in ordered mode flushing data first before committing > meta data changes). So should we throttle > buffered writes when the page cache is being dirtied and not when > these writes are being written back to the device? I'm not sure what you mean by this paragraph - AFAICT, this is exactly the way we throttle buffered writes right now. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 4:36 ` Dave Chinner @ 2011-04-01 6:32 ` Christoph Hellwig 2011-04-01 7:23 ` Dave Chinner 2011-04-01 14:49 ` IO less throttling and cgroup aware writeback (Was: Re: [Lsf] " Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Christoph Hellwig @ 2011-04-01 6:32 UTC (permalink / raw) To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote: > > If that's a concern, > > can filesystem mark that bio (REQ_META?) and throttling logic can possibly > > let these bio pass through. > > We already tag most metadata IO in this way. Actually we don't tag any I/O that way right now. That's mostly because REQ_META assumes it's synchronous I/O and cfq and the block layer give it additional priority, while in XFS metadata writes are mostly asynchronous. We'll need a properly defined REQ_META to use it, which currently is not the case. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 6:32 ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig @ 2011-04-01 7:23 ` Dave Chinner 2011-04-01 12:56 ` Christoph Hellwig 0 siblings, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-01 7:23 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 02:32:54AM -0400, Christoph Hellwig wrote: > On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote: > > > If that's a concern, > > > can filesystem mark that bio (REQ_META?) and throttling logic can possibly > > > let these bio pass through. > > > > We already tag most metadata IO in this way. > > Actually we don't tag any I/O that way right now. That's mostly > because REQ_META assumes it's synchronous I/O and cfq and the block > layer give it additional priority, while in XFS metadata writes > are mostly asynchronous. We'll need a properly defined REQ_META > to use it, which currently is not the case. Oh, I misread the code in _xfs_buf_read that fiddles with _XBF_RUN_QUEUES. That flag is dead then, as is the XBF_LOG_BUFFER code which appears to have been superseded by the new XBF_ORDERED code. Definitely needs cleaning up. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 7:23 ` Dave Chinner @ 2011-04-01 12:56 ` Christoph Hellwig 2011-04-21 15:07 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Christoph Hellwig @ 2011-04-01 12:56 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 06:23:48PM +1100, Dave Chinner wrote: > Oh, I misread the code in _xfs_buf_read that fiddles with > _XBF_RUN_QUEUES. That flag is dead then, as is the XBF_LOG_BUFFER > code which appears to have been superseded by the new XBF_ORDERED > code. Definitely needs cleaning up. Yes, that's been on my todo list for a while, but I first want a sane definition of REQ_META in the block layer. ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 12:56 ` Christoph Hellwig @ 2011-04-21 15:07 ` Vivek Goyal 0 siblings, 0 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-21 15:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel, Jens Axboe On Fri, Apr 01, 2011 at 08:56:54AM -0400, Christoph Hellwig wrote: > On Fri, Apr 01, 2011 at 06:23:48PM +1100, Dave Chinner wrote: > > Oh, I misread the code in _xfs_buf_read that fiddles with > > _XBF_RUN_QUEUES. That flag is dead then, as is the XBF_LOG_BUFFER > > code which appears to have been superseded by the new XBF_ORDERED > > code. Definitely needs cleaning up. > > Yes, that's been on my todo list for a while, but I first want a sane > definition of REQ_META in the block layer. Would splitting REQ_META in two help? Say REQ_META_SYNC and REQ_META_ASYNC. So metadata requests which don't require any kind of priority boost at CFQ can be marked REQ_META_ASYNC (as in XFS).
- So we retain the capability to mark metadata requests.
- Priority boost only for selected metadata.
- Throttling can use this to avoid throttling metadata.
Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
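A rough sketch of that split (REQ_META_SYNC and REQ_META_ASYNC are hypothetical flags that do not exist in the block layer; the bit values and helpers are made up): CFQ would boost only the synchronous flavour, while both flavours stay identifiable for throttling decisions.

    #include <linux/bio.h>

    /* Hypothetical flags - not part of the kernel; bit values chosen arbitrarily. */
    #define REQ_META_SYNC   (1UL << 29)   /* metadata a task is waiting on       */
    #define REQ_META_ASYNC  (1UL << 30)   /* background metadata writeback (XFS) */

    /* CFQ: give a priority boost only to synchronous metadata requests. */
    static bool cfq_boost_meta(struct bio *bio)
    {
            return (bio->bi_rw & REQ_META_SYNC) != 0;
    }

    /* Throttling: either flavour can be recognised and exempted if desired. */
    static bool throtl_meta_exempt(struct bio *bio)
    {
            return (bio->bi_rw & (REQ_META_SYNC | REQ_META_ASYNC)) != 0;
    }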
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-04-01 4:36 ` Dave Chinner 2011-04-01 6:32 ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Christoph Hellwig @ 2011-04-01 14:49 ` Vivek Goyal 1 sibling, 0 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-01 14:49 UTC (permalink / raw) To: Dave Chinner Cc: Chris Mason, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote: > On Thu, Mar 31, 2011 at 09:34:24PM -0400, Vivek Goyal wrote: > > On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote: > > > > [..] > > > > An fsync has two basic parts > > > > > > > > 1) write the file data pages > > > > 2a) flush data=ordered in reiserfs/ext34 > > > > 2b) do the real transaction commit > > > > > > > > > > > > We can do part one in parallel across any number of writers. For part > > > > two, there is only one running transaction. If the FS is smart, the > > > > commit will only force down the transaction that last modified the > > > > file. 50 procs running fsync may only need to trigger one commit. > > > > > > Right. However the real issue here, I think, is that the IO comes > > > from a thread not associated with writeback nor is in any way cgroup > > > aware. IOWs, getting the right context to each block being written > > > back will be complex and filesystem specific. > > > > > > The other thing that concerns me is how metadata IO is accounted and > > > throttled. Doing stuff like creating lots of small files will > > > generate as much or more metadata IO than data IO, and none of that > > > will be associated with a cgroup. Indeed, in XFS metadata doesn't > > > even use the pagecache anymore, and it's written back by a thread > > > (soon to be a workqueue) deep inside XFS's journalling subsystem, so > > > it's pretty much impossible to associate that IO with any specific > > > cgroup. > > > > > > What happens to that IO? Blocking it arbitrarily can have the same > > > effect as blocking transaction completion - it can cause the > > > filesystem to completely stop.... > > > > Dave, > > > > As of today, the cgroup/context of IO is decided from the IO submitting > > thread context. So any IO submitted by kernel threads (flusher, kjournald, > > workqueue threads) goes to root group IO which should remain unthrottled. > > (It is not a good idea to put throttling rules for root group). > > > > Now any meta data operation happening in the context of process will > > still be subject to throttling (is there any?). > > Certainly - almost all metadata _reads_ will occur in process > context, though for XFS _most_ writes occur in kernel thread context. > That being said, we can still get kernel threads hung up on metadata > read IO that has been throttled in process context. > > e.g. a process is creating a new inode, which causes allocation to > occur, which triggers a read of a free space btree block, which gets > throttled. Writeback comes along, tries to do delayed allocation, > gets hung up trying to allocate out of the same AG that is locked by > the process creating a new inode. A signle allocation can lock > multiple AGs, and so if we get enough backed-up allocations this can > cause all AGs in the filesystem to become locked. AT this point no > new allocation can complete until the throttled IO is submitted, > completed and the allocation is committed and the AG unlocked.... > > > If that's a concern, > > can filesystem mark that bio (REQ_META?) 
and throttling logic can possibly > > let these bio pass through. > > We already tag most metadata IO in this way. > > However, you can't just not throttle metadata IO. e.g. a process > doing a directory traversal (e.g. a find) will issue hundreds of IOs > per second so if you don't throttle them it will adversely affect > the throughput of other groups that you are trying to guarantee a > certain throughput or iops rate for. Indeed, not throttling metadata > writes will seriously harm throughput for controlled cgroups when > the log fills up and the filesystem pushes out thousands metadata > IOs in a very short period of time. > > Yet if we combine that with the problem that anywhere you delay > metadata IO for arbitrarily long periods of time (read or write) via > priority based mechanisms, you risk causing a train-smash of blocked > processes all waiting for the throttled IO to complete. And that will > seriously harm throughput for controlled cgroups because they can't > make any modifications to the filesystem. > > I'm not sure if there is any middle ground here - I can't see any at > this point... This is indeed a tricky situation. Especially the case of write getting blocked behind reads. I think virtual machine is best use case where one can avoid using host's file system and avoid all the issues related to serialization in host file system. Or we can probably advise not to set very low limits on any cgroup. That way even if things get serialized, once in a while, it will be resolved soon. It hurts scalability and performance though. Or modify file systems where they can mark *selective* meta data IO as REQ_NOTHROTTLE. If filesystem can determine that a write is dependent on read meta data request, then mark that read as REQ_NOTHROTTLE. Like in above example, we are performing a read of free space blktree to do an allocation of inode. Or live with reduced isolation by not throttling meta data IO. > > > Determining the cgroup/context from submitting process has the > > issue of that any writeback IO is not throttled and we are looking > > for a way to control buffered writes also. If we start determining > > the cgroup from some information stored in page_cgroup, then we > > are more likely to run into issues of priority inversion > > (filesystem in ordered mode flushing data first before committing > > meta data changes). So should we throttle > > buffered writes when page cache is being dirtied and not when > > these writes are being written back to device. > > I'm not sure what you mean by this paragraph - AFAICT, this > is exactly the way we throttle buffered writes right now. Actually I was referring to throttling in terms of IO rate (bytes_per_second or io_per_second). Notion of dirty_ratio or dirty_bytes for throttling itself is not sufficient. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
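The REQ_NOTHROTTLE idea might look like this (the flag and the helper are hypothetical; nothing like them exists in the block layer): the filesystem knows that writeback's delayed allocation depends on this free-space btree read, so it tags the bio and the throttle layer dispatches it immediately instead of queueing it behind the group's limit.

    #include <linux/fs.h>
    #include <linux/bio.h>
    #include <linux/blk_types.h>

    /* Hypothetical flag - not in the kernel; bit value chosen arbitrarily. */
    #define REQ_NOTHROTTLE  (1UL << 28)

    /*
     * Sketch: a free-space btree block read issued for an allocation that
     * writeback is waiting on.  Delaying it could stall the whole filesystem,
     * so it is tagged as exempt at submission; the throttle layer would test
     * bi_rw for REQ_NOTHROTTLE and dispatch the bio without queueing it.
     */
    static void fs_read_free_space_block(struct bio *bio)
    {
            submit_bio(READ | REQ_META | REQ_NOTHROTTLE, bio);
    }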
* Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF) 2011-03-31 14:34 ` Chris Mason 2011-03-31 22:14 ` Dave Chinner @ 2011-03-31 22:25 ` Vivek Goyal 1 sibling, 0 replies; 138+ messages in thread From: Vivek Goyal @ 2011-03-31 22:25 UTC (permalink / raw) To: Chris Mason Cc: Dave Chinner, Chad Talbott, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote: [..] > > > > For me knowing this is also important in another context of absolute IO > > throttling. > > > > - If a fsync is in progress and gets throttled at device, what impact it > > has on other file system operations. What gets serialized behind it. > > It depends. atime updates log inodes and logging needs a transaction > and transactions sometimes need to wait for the last transaction to > finish. So its very possible you'll make anything using the FS appear > to stop. I think I have run into this. I created a cgroup and gave ridiculously low limit of 1bytes/sec and did a fsync. This process blocks. Later I did "ls" in the directory where fsync process is blocked and ls also hangs. Following is backtrace. Looks like atime update led to some kind of blocking in do_get_write_access(). ls D ffffffff8160b060 0 5936 5192 0x00000000 ffff880138729c48 0000000000000086 0000000000000000 0000000100000010 0000000000000000 ffff88013fc40100 ffff88013ac7ac00 000000012e5d70f3 ffff8801353d7af8 ffff880138729fd8 000000000000f558 ffff8801353d7af8 Call Trace: [<ffffffffa035b0dd>] do_get_write_access+0x29d/0x500 [jbd2] [<ffffffff8108e150>] ? wake_bit_function+0x0/0x50 [<ffffffffa035b491>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [<ffffffffa03a7328>] __ext4_journal_get_write_access+0x38/0x80 [ext4] [<ffffffffa0383843>] ext4_reserve_inode_write+0x73/0xa0 [ext4] [<ffffffffa037c618>] ? call_filldir+0x78/0xe0 [ext4] [<ffffffffa03838bc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] [<ffffffff81041594>] ? __do_page_fault+0x1e4/0x480 [<ffffffffa0383bb0>] ext4_dirty_inode+0x40/0x60 [ext4] [<ffffffff8119a21b>] __mark_inode_dirty+0x3b/0x160 [<ffffffff8118acad>] touch_atime+0x12d/0x170 [<ffffffff81184c00>] ? filldir+0x0/0xe0 [<ffffffff81184e96>] vfs_readdir+0xd6/0xe0 [<ffffffff81185009>] sys_getdents+0x89/0xf0 [<ffffffff814dc635>] ? page_fault+0x25/0x30 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b The vim process doing fsync trace is as follows. This is waiting for some IO to finish which has been throttled at the device. vim D ffffffff8110d1f0 0 5934 4452 0x00000000 ffff880107e2dcc8 0000000000000086 0000000100000000 0000000000000003 ffff8801351f4538 0000000000000000 ffff880107e2dc68 ffffffff810e7da2 ffff8801353d70b8 ffff880107e2dfd8 000000000000f558 ffff8801353d70b8 Call Trace: [<ffffffff810e7da2>] ? ring_buffer_lock_reserve+0xa2/0x160 [<ffffffff81098cb9>] ? ktime_get_ts+0xa9/0xe0 [<ffffffff8110d1f0>] ? sync_page+0x0/0x50 [<ffffffff814da123>] io_schedule+0x73/0xc0 [<ffffffff8110d22d>] sync_page+0x3d/0x50 [<ffffffff814da98f>] __wait_on_bit+0x5f/0x90 [<ffffffff8110d3e3>] wait_on_page_bit+0x73/0x80 [<ffffffff8108e150>] ? wake_bit_function+0x0/0x50 [<ffffffff81123195>] ? pagevec_lookup_tag+0x25/0x40 [<ffffffff8110d7fb>] wait_on_page_writeback_range+0xfb/0x190 [<ffffffff81122341>] ? do_writepages+0x21/0x40 [<ffffffff8110d94b>] ? 
__filemap_fdatawrite_range+0x5b/0x60 [<ffffffff8110d9c8>] filemap_write_and_wait_range+0x78/0x90 [<ffffffff8119f5fe>] vfs_fsync_range+0x7e/0xe0 [<ffffffff8119f6cd>] vfs_fsync+0x1d/0x20 [<ffffffff8119f70e>] do_fsync+0x3e/0x60 [<ffffffff8119f760>] sys_fsync+0x10/0x20 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-03-31 14:16 ` Vivek Goyal 2011-03-31 14:34 ` Chris Mason @ 2011-03-31 14:50 ` Greg Thelen 2011-03-31 22:27 ` Dave Chinner 1 sibling, 1 reply; 138+ messages in thread From: Greg Thelen @ 2011-03-31 14:50 UTC (permalink / raw) To: Vivek Goyal; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > > [..] >> > It should not happen that flusher >> > thread gets blocked somewhere (trying to get request descriptors on >> > request queue) >> >> A major design principle of the bdi-flusher threads is that they >> are supposed to block when the request queue gets full - that's how >> we got rid of all the congestion garbage from the writeback >> stack. > > Instead of blocking flusher threads, can they voluntarily stop submitting > more IO when they realize too much IO is in progress. We aready keep > stats of how much IO is under writeback on bdi (BDI_WRITEBACK) and > flusher tread can use that? > > Jens mentioned this idea of how about getting rid of this request accounting > at request queue level and move it somewhere up say at bdi level. > >> >> There are plans to move the bdi-flusher threads to work queues, and >> once that is done all your concerns about blocking and parallelism >> are pretty much gone because it's trivial to have multiple writeback >> works in progress at once on the same bdi with that infrastructure. > > Will this essentially not nullify the advantage of IO less throttling? > I thought that we did not want have multiple threads doing writeback > at the same time to avoid number of seeks and achieve better throughput. > > Now with this I am assuming that multiple work can be on progress doing > writeback. May be we can limit writeback work one per group so in global > context only one work will be active. > >> >> > or it tries to dispatch too much IO from an inode which >> > primarily contains pages from low prio cgroup and high prio cgroup >> > task does not get enough pages dispatched to device hence not getting >> > any prio over low prio group. >> >> That's a writeback scheduling issue independent of how we throttle, >> and something we don't do at all right now. Our only decision on >> what to write back is based on how low ago the inode was dirtied. >> You need to completely rework the dirty inode tracking if you want >> to efficiently prioritise writeback between different groups. >> >> Given that filesystems don't all use the VFS dirty inode tracking >> infrastructure and specific filesystems have different ideas of the >> order of writeback, you've got a really difficult problem there. >> e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity >> purposes which will completely screw any sort of prioritised >> writeback. Remember the ext3 "fsync = global sync" latency problems? > > Ok, so if one issues a fsync when filesystem is mounted in "data=ordered" > mode we will flush all the writes to disk before committing meta data. > > I have no knowledge of filesystem code so here comes a stupid question. > Do multiple fsyncs get completely serialized or they can progress in > parallel? IOW, if a fsync is in progress and we slow down the writeback > of that inode's pages, can other fsync still make progress without > getting stuck behind the previous fsync? 
> > For me knowing this is also important in another context of absolute IO > throttling. > > - If a fsync is in progress and gets throttled at device, what impact it > has on other file system operations. What gets serialized behind it. > > [..] >> > and also try to select inodes intelligently (cgroup aware manner). >> >> Such selection algorithms would need to be able to handle hundreds >> of thousands of newly dirtied inodes per second so sorting and >> selecting them efficiently will be a major issue... > > There was proposal of memory cgroup maintaining a per memory cgroup per > bdi structure which will keep a list of inodes which need writeback > from that cgroup. FYI, I have patches which implement this per memcg per bdi dirty inode list. I want to debug a few issues before posting an RFC series. But it is getting close. > So any cgroup looking for a writeback will queue up this structure on > bdi and flusher threads can walk though this list and figure out > which memory cgroups and which inodes within memory cgroup need to > be written back. The way these memcg-writeback patches are currently implemented is that when a memcg is over background dirty limits, it will queue the memcg a on a global over_bg_limit list and wakeup bdi flusher. There is no context (memcg or otherwise) given to the bdi flusher. After the bdi flusher checks system-wide background limits, it uses the over_bg_limit list to find (and rotate) an over limit memcg. Using the memcg, then the per memcg per bdi dirty inode list is walked to find inode pages to writeback. Once the memcg dirty memory usage drops below the memcg-thresh, the memcg is removed from the global over_bg_limit list. > Thanks > Vivek > _______________________________________________ > Lsf mailing list > Lsf@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/lsf > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
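The containers being described might look roughly like this (struct and field names are invented for illustration; the actual patches may organise this differently):

    #include <linux/list.h>

    /* One per (memcg, bdi) pair: the dirty inodes this memcg owns on this bdi. */
    struct memcg_bdi {
            struct list_head  dirty_inodes;   /* inodes dirtied by this memcg   */
            struct list_head  bdi_node;       /* link into the memcg's bdi list */
    };

    /* Per-memcg writeback state. */
    struct memcg_wb_state {
            struct list_head  bdi_list;       /* struct memcg_bdi, one per bdi           */
            struct list_head  over_bg_node;   /* link onto the global over_bg_limit list */
    };

    /*
     * Global list of memcgs over their background dirty threshold, consumed by
     * the bdi flusher.  (The global nature of this list is what the next reply
     * objects to.)
     */
    static LIST_HEAD(over_bg_limit);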
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-03-31 14:50 ` [Lsf] IO less throttling and cgroup aware writeback (Was: " Greg Thelen @ 2011-03-31 22:27 ` Dave Chinner 2011-04-01 17:18 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-03-31 22:27 UTC (permalink / raw) To: Greg Thelen; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > >> > and also try to select inodes intelligently (cgroup aware manner). > >> > >> Such selection algorithms would need to be able to handle hundreds > >> of thousands of newly dirtied inodes per second so sorting and > >> selecting them efficiently will be a major issue... > > > > There was proposal of memory cgroup maintaining a per memory cgroup per > > bdi structure which will keep a list of inodes which need writeback > > from that cgroup. > > FYI, I have patches which implement this per memcg per bdi dirty inode > list. I want to debug a few issues before posting an RFC series. But > it is getting close. That's all well and good, but we're still trying to work out how to scale this list in a sane fashion. We just broke it out into its own global lock, so it's going to change soon so that the list+lock is not a contention point on large machines. Just breaking it into a list per cgroup doesn't solve this problem - it just adds another container to the list. Also, you have the problem that some filesystems don't use the bdi dirty inode list for all the dirty inodes in the filesystem - XFS has recently changed to only track VFS dirtied inodes in that list, instead using its own active item list to track all logged modifications. IIUC, btrfs and ext3/4 do something similar as well. My current plans are to modify the dirty inode code to allow filesystems to say to the VFS "don't track this dirty inode - I'm doing it myself" so that we can reduce the VFS dirty inode list to only those inodes with dirty pages.... > > So any cgroup looking for a writeback will queue up this structure on > > bdi and flusher threads can walk though this list and figure out > > which memory cgroups and which inodes within memory cgroup need to > > be written back. > > The way these memcg-writeback patches are currently implemented is > that when a memcg is over background dirty limits, it will queue the > memcg a on a global over_bg_limit list and wakeup bdi flusher. No global lists and locks, please. That's one of the big problems with the current foreground IO based throttling - it _hammers_ the global inode writeback list locks such that on an 8p machine we can waste 2-3 entire CPUs just contending on it when all 8 CPUs are trying to throttle and write back at the same time..... > There > is no context (memcg or otherwise) given to the bdi flusher. After > the bdi flusher checks system-wide background limits, it uses the > over_bg_limit list to find (and rotate) an over limit memcg. Using > the memcg, then the per memcg per bdi dirty inode list is walked to > find inode pages to writeback. Once the memcg dirty memory usage > drops below the memcg-thresh, the memcg is removed from the global > over_bg_limit list. If you want controlled hand-off of writeback, you need to pass the memcg that triggered the throttling directly to the bdi.
You already know what both the bdi and memcg that need writeback are. Yes, this needs concurrency at the BDI flush level to handle, but see my previous email in this thread for that.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
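What "pass the memcg directly to the bdi" could look like, sketched loosely against the existing per-bdi work queueing in fs/fs-writeback.c (the wb_writeback_work struct is trimmed to a few fields here, and the memcg member plus bdi_start_memcg_writeback() are hypothetical):

    #include <linux/slab.h>
    #include <linux/sched.h>
    #include <linux/writeback.h>
    #include <linux/backing-dev.h>

    struct mem_cgroup;

    /* Trimmed-down work item with a hypothetical memcg field added. */
    struct wb_writeback_work {
            long                            nr_pages;
            enum writeback_sync_modes       sync_mode;
            struct mem_cgroup               *memcg;   /* group to flush */
            struct list_head                list;
    };

    /* Queue background writeback for one memcg on one bdi - no global list. */
    static void bdi_start_memcg_writeback(struct backing_dev_info *bdi,
                                          struct mem_cgroup *memcg, long nr_pages)
    {
            struct wb_writeback_work *work;

            work = kzalloc(sizeof(*work), GFP_NOWAIT);
            if (!work)
                    return;                 /* caller falls back to flushing in-line */

            work->nr_pages  = nr_pages;
            work->sync_mode = WB_SYNC_NONE;
            work->memcg     = memcg;        /* flusher writes only this group's inodes */

            spin_lock_bh(&bdi->wb_lock);
            list_add_tail(&work->list, &bdi->work_list);
            spin_unlock_bh(&bdi->wb_lock);

            wake_up_process(bdi->wb.task);  /* simplified: assumes the flusher task exists */
    }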
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-03-31 22:27 ` Dave Chinner @ 2011-04-01 17:18 ` Vivek Goyal 2011-04-01 21:49 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-01 17:18 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > > On Thu, Mar 31, 2011 at 7:16 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote: > > >> > and also try to select inodes intelligently (cgroup aware manner). > > >> > > >> Such selection algorithms would need to be able to handle hundreds > > >> of thousands of newly dirtied inodes per second so sorting and > > >> selecting them efficiently will be a major issue... > > > > > > There was proposal of memory cgroup maintaining a per memory cgroup per > > > bdi structure which will keep a list of inodes which need writeback > > > from that cgroup. > > > > FYI, I have patches which implement this per memcg per bdi dirty inode > > list. I want to debug a few issues before posting an RFC series. But > > it is getting close. > > That's all well and good, but we're still trying to work out how to > scale this list in a sane fashion. We just broke it out into it's > own global lock, so it's going to change soon so that the list+lock > is not a contention point on large machines. Just breaking it into a > list per cgroup doesn't solve this problem - it just adds another > container to the list. > > Also, you have the problem that some filesystems don't use the bdi > dirty inode list for all the dirty inodes in the filesytem - XFS has > recent changed to only track VFS dirtied inodes in that list, intead > using it's own active item list to track all logged modifications. > IIUC, btrfs and ext3/4 do something similar as well. My current plans > are to modify the dirty inode code to allow filesystems to say tot > the VFS "don't track this dirty inode - I'm doing it myself" so that > we can reduce the VFS dirty inode list to only those inodes with > dirty pages.... > > > > So any cgroup looking for a writeback will queue up this structure on > > > bdi and flusher threads can walk though this list and figure out > > > which memory cgroups and which inodes within memory cgroup need to > > > be written back. > > > > The way these memcg-writeback patches are currently implemented is > > that when a memcg is over background dirty limits, it will queue the > > memcg a on a global over_bg_limit list and wakeup bdi flusher. > > No global lists and locks, please. That's one of the big problems > with the current foreground IO based throttling - it _hammers_ the > global inode writeback list locks such that one an 8p machine we can > be wasted 2-3 entire CPUs just contending on it when all 8 CPUs are > trying to throttle and write back at the same time..... > > > There > > is no context (memcg or otherwise) given to the bdi flusher. After > > the bdi flusher checks system-wide background limits, it uses the > > over_bg_limit list to find (and rotate) an over limit memcg. Using > > the memcg, then the per memcg per bdi dirty inode list is walked to > > find inode pages to writeback. Once the memcg dirty memory usage > > drops below the memcg-thresh, the memcg is removed from the global > > over_bg_limit list. 
> > If you want controlled hand-off of writeback, you need to pass the > memcg that triggered the throttling directly to the bdi. You already > know what both the bdi and memcg that need writeback are. Yes, this > needs concurrency at the BDI flush level to handle, but see my > previous email in this thread for that.... > Even with memcg being passed around I don't think that we get rid of global list lock. The reason being that inodes are not exclusive to the memory cgroups. Multiple memory cgroups might be writing to the same inode. So inode still remains in the global list and memory cgroups kind of will have a pointer to it. So to start writeback on an inode you still shall have to take global lock, IIUC. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 17:18 ` Vivek Goyal @ 2011-04-01 21:49 ` Dave Chinner 2011-04-02 7:33 ` Greg Thelen 2011-04-05 13:13 ` Vivek Goyal 0 siblings, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-01 21:49 UTC (permalink / raw) To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > > > There > > > is no context (memcg or otherwise) given to the bdi flusher. After > > > the bdi flusher checks system-wide background limits, it uses the > > > over_bg_limit list to find (and rotate) an over limit memcg. Using > > > the memcg, then the per memcg per bdi dirty inode list is walked to > > > find inode pages to writeback. Once the memcg dirty memory usage > > > drops below the memcg-thresh, the memcg is removed from the global > > > over_bg_limit list. > > > > If you want controlled hand-off of writeback, you need to pass the > > memcg that triggered the throttling directly to the bdi. You already > > know what both the bdi and memcg that need writeback are. Yes, this > > needs concurrency at the BDI flush level to handle, but see my > > previous email in this thread for that.... > > > > Even with memcg being passed around I don't think that we get rid of > global list lock. You need to - we're getting rid of global lists and locks from writeback for scalability reasons so any new functionality needs to avoid global locks for the same reason. > The reason being that inodes are not exclusive to > the memory cgroups. Multiple memory cgroups might be writing to same > inode. So inode still remains in the global list and memory cgroups > kind of will have pointer to it. So two dirty inode lists that have to be kept in sync? That doesn't sound particularly appealing. Nor does it scale to an inode being dirty in multiple cgroups. Besides, if you've got multiple memory groups dirtying the same inode, then you cannot expect isolation between groups. I'd consider this a broken configuration in this case - how often does this actually happen, and what is the use case for supporting it? Besides, the implications are that we'd have to break up contiguous IOs in the writeback path simply because two sequential pages are associated with different groups. That's really nasty, and exactly the opposite of all the write combining we try to do throughout the writeback path. Supporting this is also a mess, as we'd have to touch quite a lot of filesystem code (i.e. .writepage(s) implementations) to do this. > So to start writeback on an inode > you still shall have to take global lock, IIUC. Why not simply bdi -> list of dirty cgroups -> list of dirty inodes in cgroup, and go from there? I mean, really all that cgroup-aware writeback needs is just adding a new container for managing dirty inodes in the writeback path and a method for selecting that container for writeback, right? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 21:49 ` Dave Chinner @ 2011-04-02 7:33 ` Greg Thelen 2011-04-02 7:34 ` Greg Thelen 2011-04-05 13:13 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Greg Thelen @ 2011-04-02 7:33 UTC (permalink / raw) To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: >> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: >> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: >> > > There >> > > is no context (memcg or otherwise) given to the bdi flusher. After >> > > the bdi flusher checks system-wide background limits, it uses the >> > > over_bg_limit list to find (and rotate) an over limit memcg. Using >> > > the memcg, then the per memcg per bdi dirty inode list is walked to >> > > find inode pages to writeback. Once the memcg dirty memory usage >> > > drops below the memcg-thresh, the memcg is removed from the global >> > > over_bg_limit list. >> > >> > If you want controlled hand-off of writeback, you need to pass the >> > memcg that triggered the throttling directly to the bdi. You already >> > know what both the bdi and memcg that need writeback are. Yes, this >> > needs concurrency at the BDI flush level to handle, but see my >> > previous email in this thread for that.... >> > >> >> Even with memcg being passed around I don't think that we get rid of >> global list lock. > > You need to - we're getting rid of global lists and locks from > writeback for scalability reasons so any new functionality needs to > avoid global locks for the same reason. > >> The reason being that inodes are not exclusive to >> the memory cgroups. Multiple memory cgroups might be writting to same >> inode. So inode still remains in the global list and memory cgroups >> kind of will have pointer to it. > > So two dirty inode lists that have to be kept in sync? That doesn't > sound particularly appealing. Nor does it scale to an inode being > dirty in multiple cgroups > > Besides, if you've got multiple memory groups dirtying the same > inode, then you cannot expect isolation between groups. I'd consider > this a broken configuration in this case - how often does this > actually happen, and what is the use case for supporting > it? > > Besides, the implications are that we'd have to break up contiguous > IOs in the writeback path simply because two sequential pages are > associated with different groups. That's really nasty, and exactly > the opposite of all the write combining we try to do throughout the > writeback path. Supporting this is also a mess, as we'd have to touch > quite a lot of filesystem code (i.e. .writepage(s) inplementations) > to do this. > >> So to start writeback on an inode >> you still shall have to take global lock, IIUC. > > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes > in cgroup, and go from there? I mean, really all that cgroup-aware > writeback needs is just adding a new container for managing > dirty inodes in the writeback path and a method for selecting that > container for writeback, right? I feel compelled to optimize for multiple cgroup's concurrently dirtying an inode. I see sharing as legitimate if a file is handed off between jobs (cgroups). But I do not see concurrent writing as a common use case. 
If anyone else feels this is a requirement, please speak up. However, I would like the system tolerate sharing, though it does not have to do so in optimal fashion. Here are two approaches that do not optimize for sharing. Though, each approach tries to tolerate sharing without falling over. Approach 1 (inspired from Dave's comments): bdi ->1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode * when setting I_DIRTY in a memcg, insert inode into bdi_memcg_dirty_inodes rather than b_dirty. * when clearing I_DIRTY, remove inode from bdi_memcg_dirty_inode * balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi) if over bg limit, then queue memcg writeback to bdi flusher. if over fg limit, then queue memcg-waiting description to bdi flusher (IO less throttle). * bdi_flusher(bdi): using bdi,memcg write “some” of the bdi_memcg_dirty_inodes list. “Some” is for fairness. if bdi flusher is unable to bring memcg dirty usage below bg limit after bdi_memcg_dirty_inodes list is empty, then need to do “something” to make forward progress. This could be caused by either (a) memcg dirtying multiple bdi, or (b) a freeloading memcg dirtying inodes previously dirtied by another memcg therefore the first dirtying memcg is the one that will write it back. Case A) If memcg dirties multiple bdi and then hits memcg bg limit, queue bg writeback for the bdi being written to. This may not writeback other useful bdi. System-wide background limit has similar issue. Could link bdi_memcg together and wakeup peer bdi. For now, defer the problem. Case B) Dirtying another cgroup’s dirty inode. While is not a common use case, it could happen. Options to avoid lockup: + When an inode becomes dirty shared, then move the inode from the per bdi per memcg bdi_memcg_dirty_inode list to an otherwise unused bdi wide b_unknown_memcg_dirty (promiscuous inode) list. b_unknown_memcg_dirty is written when memcg writeback is invoked to the bdi. When an inode is cleaned and later redirtied it is added to the normal bdi_memcg_dirty_inode_list. + Considered: when file page goes dirty, then do not account the dirty page to the memcg where the page was charged, instead recharge the page to the memcg that the inode was billed to (by inode i_memcg field). Inode would require a memcg reference that would make memcg cleanup tricky. + Scan memcg lru for dirty file pages -> associated inodes -> bdi -> writeback(bdi, inode) + What if memcg dirty limits are simply ignored in case-B? Ineffective memcg background writeback would be queued as usage grows. Once memcg foreground limit is hit, then it would throttle waiting for the ineffective background writeback to never catch up. This could wait indefinitely. Could argue that the hung cgroup deserves this for writing to another cgroup’s inode. However, the other cgroup could be the trouble maker who sneaks in to dirty the file and assume dirty ownership before the innocent (now hung) cgroup starts writing. I am not worried about making this optimal, just making forward progress. Fallback to scanning memcg lru looking for inode’s of dirty pages. This may be expensive, but should only happen with dirty inodes shared between memcg. 
Approach 2 : do something even simpler: http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359 * __set_page_dirty() either set i_memcg=memcg or i_memcg=~0 no memcg reference needed, i_memcg is not dereferenced * mem_cgroup_balance_dirty_pages(memcg, bdi) if over bg limit, then queue memcg to bdi for background writeback if over fg limit, then queue memcg-waiting description to bdi flusher (IO less throttle) * bdi_flusher(bdi) if doing memcg writeback, scan b_dirty filtering using is_memcg_inode(inode,memcg), which checks i_memcg field: return i_memcg in [~0, memcg] if unable to get memcg below its dirty memory limit: + If memcg dirties multiple bdi and then hits memcg bg limit, queue bg writeback for the bdi being written to. This may not writeback other useful bdi. System-wide background limit has similar issue. - con: If degree of sharing exceeds compile time max supported sharing degree (likely 1), then ANY writeback (per-memcg or system-wide) will writeback the over-shared inode. This is undesirable because it punishes innocent cgroups that are not abusively sharing. - con: have to scan entire b_dirty list which may involve skipping many inodes not in over-limit cgroup. A memcg constantly hitting its limit would monopolize a bdi flusher. Both approaches are complicated by the (rare) possibility that an inode has been claimed (from a dirtying memcg perspective) by memcg M1 but later M2 writes more dirty pages. When M2 exceeds its dirty limit it would be nice to find the inode, even if this requires some extra work. > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
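A compact sketch of the Approach 2 filter described above, assuming an i_memcg field is added to struct inode as this mail proposes (the sentinel, the helpers, and the use of a plain id rather than a pointer follow the description, but none of this is existing kernel code):

    #include <linux/fs.h>

    #define I_MEMCG_SHARED  (~0UL)    /* inode dirtied by more than one memcg */

    /* Called from __set_page_dirty() with the id of the memcg dirtying the page. */
    static void inode_attach_memcg(struct inode *inode, unsigned long memcg_id)
    {
            if (inode->i_memcg == 0)
                    inode->i_memcg = memcg_id;          /* first dirtier claims the inode */
            else if (inode->i_memcg != memcg_id)
                    inode->i_memcg = I_MEMCG_SHARED;    /* shared: matches any memcg      */
    }

    /* b_dirty scan filter used when the flusher works on behalf of one memcg. */
    static bool is_memcg_inode(struct inode *inode, unsigned long memcg_id)
    {
            return inode->i_memcg == memcg_id || inode->i_memcg == I_MEMCG_SHARED;
    }

Because i_memcg is only an id and never dereferenced, no reference on the memcg has to be held, which is the point of the proposal; the cost is the cons listed above when inodes are shared.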
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-02 7:33 ` Greg Thelen @ 2011-04-02 7:34 ` Greg Thelen 0 siblings, 0 replies; 138+ messages in thread From: Greg Thelen @ 2011-04-02 7:34 UTC (permalink / raw) To: Dave Chinner; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Sat, Apr 2, 2011 at 12:33 AM, Greg Thelen <gthelen@google.com> wrote: > On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@fromorbit.com> wrote: >> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: >>> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: >>> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: >>> > > There >>> > > is no context (memcg or otherwise) given to the bdi flusher. After >>> > > the bdi flusher checks system-wide background limits, it uses the >>> > > over_bg_limit list to find (and rotate) an over limit memcg. Using >>> > > the memcg, then the per memcg per bdi dirty inode list is walked to >>> > > find inode pages to writeback. Once the memcg dirty memory usage >>> > > drops below the memcg-thresh, the memcg is removed from the global >>> > > over_bg_limit list. >>> > >>> > If you want controlled hand-off of writeback, you need to pass the >>> > memcg that triggered the throttling directly to the bdi. You already >>> > know what both the bdi and memcg that need writeback are. Yes, this >>> > needs concurrency at the BDI flush level to handle, but see my >>> > previous email in this thread for that.... >>> > >>> >>> Even with memcg being passed around I don't think that we get rid of >>> global list lock. >> >> You need to - we're getting rid of global lists and locks from >> writeback for scalability reasons so any new functionality needs to >> avoid global locks for the same reason. >> >>> The reason being that inodes are not exclusive to >>> the memory cgroups. Multiple memory cgroups might be writting to same >>> inode. So inode still remains in the global list and memory cgroups >>> kind of will have pointer to it. >> >> So two dirty inode lists that have to be kept in sync? That doesn't >> sound particularly appealing. Nor does it scale to an inode being >> dirty in multiple cgroups >> >> Besides, if you've got multiple memory groups dirtying the same >> inode, then you cannot expect isolation between groups. I'd consider >> this a broken configuration in this case - how often does this >> actually happen, and what is the use case for supporting >> it? >> >> Besides, the implications are that we'd have to break up contiguous >> IOs in the writeback path simply because two sequential pages are >> associated with different groups. That's really nasty, and exactly >> the opposite of all the write combining we try to do throughout the >> writeback path. Supporting this is also a mess, as we'd have to touch >> quite a lot of filesystem code (i.e. .writepage(s) inplementations) >> to do this. >> >>> So to start writeback on an inode >>> you still shall have to take global lock, IIUC. >> >> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes >> in cgroup, and go from there? I mean, really all that cgroup-aware >> writeback needs is just adding a new container for managing >> dirty inodes in the writeback path and a method for selecting that >> container for writeback, right? > > I feel compelled to optimize for multiple cgroup's concurrently Correction: I do NOT feel compelled to optimizing for sharing... > dirtying an inode. 
I see sharing as legitimate if a file is handed > off between jobs (cgroups). But I do not see concurrent writing as a > common use case. > If anyone else feels this is a requirement, please speak up. > However, I would like the system tolerate sharing, though it does not > have to do so in optimal fashion. > Here are two approaches that do not optimize for sharing. Though, > each approach tries to tolerate sharing without falling over. > > Approach 1 (inspired from Dave's comments): > > bdi ->1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode > > * when setting I_DIRTY in a memcg, insert inode into > bdi_memcg_dirty_inodes rather than b_dirty. > > * when clearing I_DIRTY, remove inode from bdi_memcg_dirty_inode > > * balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi) > if over bg limit, then queue memcg writeback to bdi flusher. > if over fg limit, then queue memcg-waiting description to bdi flusher > (IO less throttle). > > * bdi_flusher(bdi): > using bdi,memcg write “some” of the bdi_memcg_dirty_inodes list. > “Some” is for fairness. > > if bdi flusher is unable to bring memcg dirty usage below bg limit > after bdi_memcg_dirty_inodes list is empty, then need to do > “something” to make forward progress. This could be caused by either > (a) memcg dirtying multiple bdi, or (b) a freeloading memcg dirtying > inodes previously dirtied by another memcg therefore the first > dirtying memcg is the one that will write it back. > > Case A) If memcg dirties multiple bdi and then hits memcg bg limit, > queue bg writeback for the bdi being written to. This may not > writeback other useful bdi. System-wide background limit has similar > issue. Could link bdi_memcg together and wakeup peer bdi. For now, > defer the problem. > > Case B) Dirtying another cgroup’s dirty inode. While is not a common > use case, it could happen. Options to avoid lockup: > > + When an inode becomes dirty shared, then move the inode from the per > bdi per memcg bdi_memcg_dirty_inode list to an otherwise unused bdi > wide b_unknown_memcg_dirty (promiscuous inode) list. > b_unknown_memcg_dirty is written when memcg writeback is invoked to > the bdi. When an inode is cleaned and later redirtied it is added to > the normal bdi_memcg_dirty_inode_list. > > + Considered: when file page goes dirty, then do not account the dirty > page to the memcg where the page was charged, instead recharge the > page to the memcg that the inode was billed to (by inode i_memcg > field). Inode would require a memcg reference that would make memcg > cleanup tricky. > > + Scan memcg lru for dirty file pages -> associated inodes -> bdi -> > writeback(bdi, inode) > > + What if memcg dirty limits are simply ignored in case-B? > Ineffective memcg background writeback would be queued as usage grows. > Once memcg foreground limit is hit, then it would throttle waiting > for the ineffective background writeback to never catch up. This > could wait indefinitely. Could argue that the hung cgroup deserves > this for writing to another cgroup’s inode. However, the other cgroup > could be the trouble maker who sneaks in to dirty the file and assume > dirty ownership before the innocent (now hung) cgroup starts writing. > I am not worried about making this optimal, just making forward > progress. Fallback to scanning memcg lru looking for inode’s of dirty > pages. This may be expensive, but should only happen with dirty > inodes shared between memcg. 
> > > Approach 2 : do something even simpler: > > http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359 > > * __set_page_dirty() > > either set i_memcg=memcg or i_memcg=~0 > no memcg reference needed, i_memcg is not dereferenced > > * mem_cgroup_balance_dirty_pages(memcg, bdi) > > if over bg limit, then queue memcg to bdi for background writeback > if over fg limit, then queue memcg-waiting description to bdi flusher > (IO less throttle) > > * bdi_flusher(bdi) > > if doing memcg writeback, scan b_dirty filtering using > is_memcg_inode(inode,memcg), which checks i_memcg field: return > i_memcg in [~0, memcg] > > if unable to get memcg below its dirty memory limit: > > + If memcg dirties multiple bdi and then hits memcg bg limit, queue bg > writeback for the bdi being written to. This may not writeback other > useful bdi. System-wide background limit has similar issue. > > - con: If degree of sharing exceeds compile time max supported sharing > degree (likely 1), then ANY writeback (per-memcg or system-wide) will > writeback the over-shared inode. This is undesirable because it > punishes innocent cgroups that are not abusively sharing. > > - con: have to scan entire b_dirty list which may involve skipping > many inodes not in over-limit cgroup. A memcg constantly hitting its > limit would monopolize a bdi flusher. > > > Both approaches are complicated by the (rare) possibility when an > inode has been been claimed (from a dirtying memcg perspective) by > memcg M1 but later M2 writes more dirty pages. When M2 exceeds its > dirty limit it would be nice to find the inode, even if this requires > some extra work. > >> Cheers, >> >> Dave. >> -- >> Dave Chinner >> david@fromorbit.com >> > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-01 21:49 ` Dave Chinner 2011-04-02 7:33 ` Greg Thelen @ 2011-04-05 13:13 ` Vivek Goyal 2011-04-05 22:56 ` Dave Chinner 1 sibling, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-05 13:13 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote: > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > > > > There > > > > is no context (memcg or otherwise) given to the bdi flusher. After > > > > the bdi flusher checks system-wide background limits, it uses the > > > > over_bg_limit list to find (and rotate) an over limit memcg. Using > > > > the memcg, then the per memcg per bdi dirty inode list is walked to > > > > find inode pages to writeback. Once the memcg dirty memory usage > > > > drops below the memcg-thresh, the memcg is removed from the global > > > > over_bg_limit list. > > > > > > If you want controlled hand-off of writeback, you need to pass the > > > memcg that triggered the throttling directly to the bdi. You already > > > know what both the bdi and memcg that need writeback are. Yes, this > > > needs concurrency at the BDI flush level to handle, but see my > > > previous email in this thread for that.... > > > > > > > Even with memcg being passed around I don't think that we get rid of > > global list lock. > > You need to - we're getting rid of global lists and locks from > writeback for scalability reasons so any new functionality needs to > avoid global locks for the same reason. Ok. > > > The reason being that inodes are not exclusive to > > the memory cgroups. Multiple memory cgroups might be writting to same > > inode. So inode still remains in the global list and memory cgroups > > kind of will have pointer to it. > > So two dirty inode lists that have to be kept in sync? That doesn't > sound particularly appealing. Nor does it scale to an inode being > dirty in multiple cgroups > > Besides, if you've got multiple memory groups dirtying the same > inode, then you cannot expect isolation between groups. I'd consider > this a broken configuration in this case - how often does this > actually happen, and what is the use case for supporting > it? > > Besides, the implications are that we'd have to break up contiguous > IOs in the writeback path simply because two sequential pages are > associated with different groups. That's really nasty, and exactly > the opposite of all the write combining we try to do throughout the > writeback path. Supporting this is also a mess, as we'd have to touch > quite a lot of filesystem code (i.e. .writepage(s) inplementations) > to do this. We did not plan on breaking up contigous IO even if these belonged to different cgroup for performance reason. So probably can live with some inaccuracy and just trigger the writeback for one inode even if that meant that it could writeback the pages of some other cgroups doing IO on that inode. > > > So to start writeback on an inode > > you still shall have to take global lock, IIUC. > > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes > in cgroup, and go from there? 
I mean, really all that cgroup-aware > writeback needs is just adding a new container for managing > dirty inodes in the writeback path and a method for selecting that > container for writeback, right? This was the initial design, where one inode is associated with one cgroup even if processes from multiple cgroups are doing IO to the same inode. Then somebody raised the concern that it is probably too coarse. IMHO, as a first step, associating an inode with one cgroup exclusively simplifies things considerably and we can target that first. So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes makes sense and is a relatively simple way of doing things, at the expense of not being accurate for the shared inode case. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-05 13:13 ` Vivek Goyal @ 2011-04-05 22:56 ` Dave Chinner 2011-04-06 14:49 ` Curt Wohlgemuth 2011-04-06 15:37 ` Vivek Goyal 0 siblings, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-05 22:56 UTC (permalink / raw) To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: > On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote: > > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: > > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: > > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > > > > > There > > > > > is no context (memcg or otherwise) given to the bdi flusher. After > > > > > the bdi flusher checks system-wide background limits, it uses the > > > > > over_bg_limit list to find (and rotate) an over limit memcg. Using > > > > > the memcg, then the per memcg per bdi dirty inode list is walked to > > > > > find inode pages to writeback. Once the memcg dirty memory usage > > > > > drops below the memcg-thresh, the memcg is removed from the global > > > > > over_bg_limit list. > > > > > > > > If you want controlled hand-off of writeback, you need to pass the > > > > memcg that triggered the throttling directly to the bdi. You already > > > > know what both the bdi and memcg that need writeback are. Yes, this > > > > needs concurrency at the BDI flush level to handle, but see my > > > > previous email in this thread for that.... > > > > > > > > > > Even with memcg being passed around I don't think that we get rid of > > > global list lock. ..... > > > The reason being that inodes are not exclusive to > > > the memory cgroups. Multiple memory cgroups might be writting to same > > > inode. So inode still remains in the global list and memory cgroups > > > kind of will have pointer to it. > > > > So two dirty inode lists that have to be kept in sync? That doesn't > > sound particularly appealing. Nor does it scale to an inode being > > dirty in multiple cgroups > > > > Besides, if you've got multiple memory groups dirtying the same > > inode, then you cannot expect isolation between groups. I'd consider > > this a broken configuration in this case - how often does this > > actually happen, and what is the use case for supporting > > it? > > > > Besides, the implications are that we'd have to break up contiguous > > IOs in the writeback path simply because two sequential pages are > > associated with different groups. That's really nasty, and exactly > > the opposite of all the write combining we try to do throughout the > > writeback path. Supporting this is also a mess, as we'd have to touch > > quite a lot of filesystem code (i.e. .writepage(s) inplementations) > > to do this. > > We did not plan on breaking up contigous IO even if these belonged to > different cgroup for performance reason. So probably can live with some > inaccuracy and just trigger the writeback for one inode even if that > meant that it could writeback the pages of some other cgroups doing IO > on that inode. Which, to me, violates the principle of isolation as it's been described that this functionality is supposed to provide. It also means you will have handle the case of a cgroup over a throttle limit and no inodes on it's dirty list. 
It's not a case of "probably can live with" the resultant mess, the mess will occur and so handling it needs to be designed in from the start. > > > So to start writeback on an inode > > > you still shall have to take global lock, IIUC. > > > > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes > > in cgroup, and go from there? I mean, really all that cgroup-aware > > writeback needs is just adding a new container for managing > > dirty inodes in the writeback path and a method for selecting that > > container for writeback, right? > > This was the initial design where one inode is associated with one cgroup > even if process from multiple cgroups are doing IO to same inode. Then > somebody raised the concern that it probably is too coarse. Got a pointer? > IMHO, as a first step, associating inode to one cgroup exclusively > simplifies the things considerably and we can target that first. > > So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes > makes sense and is relatively simple way of doing things at the expense > of not being accurate for shared inode case. Can someone describe a valid shared inode use case? If not, we should not even consider it as a requirement and explicitly document it as a "not supported" use case. As it is, I'm hearing different ideas and requirements from the people working on the memcg side of this vs the IO controller side. Perhaps the first step is documenting a common set of functional requirements that demonstrates how everything will play well together? e.g. Defining what isolation means, when and if it can be violated, how violations are handled, when inodes in multiple memcgs are acceptable and how they need to be accounted and handled by the writepage path, how memcg's over the dirty threshold with no dirty inodes are to be handled, how metadata IO is going to be handled by IO controllers, what kswapd is going to do writeback when the pages it's trying to writeback during a critical low memory event belong to a cgroup that is throttled at the IO level, etc. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-05 22:56 ` Dave Chinner @ 2011-04-06 14:49 ` Curt Wohlgemuth 2011-04-06 15:39 ` Vivek Goyal 2011-04-06 23:08 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner 2011-04-06 15:37 ` Vivek Goyal 1 sibling, 2 replies; 138+ messages in thread From: Curt Wohlgemuth @ 2011-04-06 14:49 UTC (permalink / raw) To: Dave Chinner Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: >> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote: >> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: >> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: >> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: >> > > > > There >> > > > > is no context (memcg or otherwise) given to the bdi flusher. After >> > > > > the bdi flusher checks system-wide background limits, it uses the >> > > > > over_bg_limit list to find (and rotate) an over limit memcg. Using >> > > > > the memcg, then the per memcg per bdi dirty inode list is walked to >> > > > > find inode pages to writeback. Once the memcg dirty memory usage >> > > > > drops below the memcg-thresh, the memcg is removed from the global >> > > > > over_bg_limit list. >> > > > >> > > > If you want controlled hand-off of writeback, you need to pass the >> > > > memcg that triggered the throttling directly to the bdi. You already >> > > > know what both the bdi and memcg that need writeback are. Yes, this >> > > > needs concurrency at the BDI flush level to handle, but see my >> > > > previous email in this thread for that.... >> > > > >> > > >> > > Even with memcg being passed around I don't think that we get rid of >> > > global list lock. > ..... >> > > The reason being that inodes are not exclusive to >> > > the memory cgroups. Multiple memory cgroups might be writting to same >> > > inode. So inode still remains in the global list and memory cgroups >> > > kind of will have pointer to it. >> > >> > So two dirty inode lists that have to be kept in sync? That doesn't >> > sound particularly appealing. Nor does it scale to an inode being >> > dirty in multiple cgroups >> > >> > Besides, if you've got multiple memory groups dirtying the same >> > inode, then you cannot expect isolation between groups. I'd consider >> > this a broken configuration in this case - how often does this >> > actually happen, and what is the use case for supporting >> > it? >> > >> > Besides, the implications are that we'd have to break up contiguous >> > IOs in the writeback path simply because two sequential pages are >> > associated with different groups. That's really nasty, and exactly >> > the opposite of all the write combining we try to do throughout the >> > writeback path. Supporting this is also a mess, as we'd have to touch >> > quite a lot of filesystem code (i.e. .writepage(s) inplementations) >> > to do this. >> >> We did not plan on breaking up contigous IO even if these belonged to >> different cgroup for performance reason. So probably can live with some >> inaccuracy and just trigger the writeback for one inode even if that >> meant that it could writeback the pages of some other cgroups doing IO >> on that inode. 
> > Which, to me, violates the principle of isolation as it's been > described that this functionality is supposed to provide. > > It also means you will have handle the case of a cgroup over a > throttle limit and no inodes on it's dirty list. It's not a case of > "probably can live with" the resultant mess, the mess will occur and > so handling it needs to be designed in from the start. > >> > > So to start writeback on an inode >> > > you still shall have to take global lock, IIUC. >> > >> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes >> > in cgroup, and go from there? I mean, really all that cgroup-aware >> > writeback needs is just adding a new container for managing >> > dirty inodes in the writeback path and a method for selecting that >> > container for writeback, right? >> >> This was the initial design where one inode is associated with one cgroup >> even if process from multiple cgroups are doing IO to same inode. Then >> somebody raised the concern that it probably is too coarse. > > Got a pointer? > >> IMHO, as a first step, associating inode to one cgroup exclusively >> simplifies the things considerably and we can target that first. >> >> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes >> makes sense and is relatively simple way of doing things at the expense >> of not being accurate for shared inode case. > > Can someone describe a valid shared inode use case? If not, we > should not even consider it as a requirement and explicitly document > it as a "not supported" use case. At the very least, when a task is moved from one cgroup to another, we've got a shared inode case. This probably won't happen more than once for most tasks, but it will likely be common. Curt > > As it is, I'm hearing different ideas and requirements from the > people working on the memcg side of this vs the IO controller side. > Perhaps the first step is documenting a common set of functional > requirements that demonstrates how everything will play well > together? > > e.g. Defining what isolation means, when and if it can be violated, > how violations are handled, when inodes in multiple memcgs are > acceptable and how they need to be accounted and handled by the > writepage path, how memcg's over the dirty threshold with no dirty > inodes are to be handled, how metadata IO is going to be handled by > IO controllers, what kswapd is going to do writeback when the pages > it's trying to writeback during a critical low memory event belong > to a cgroup that is throttled at the IO level, etc. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 14:49 ` Curt Wohlgemuth @ 2011-04-06 15:39 ` Vivek Goyal 2011-04-06 19:49 ` Greg Thelen 2011-04-06 23:07 ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen 1 sibling, 2 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-06 15:39 UTC (permalink / raw) To: Curt Wohlgemuth Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote: [..] > > Can someone describe a valid shared inode use case? If not, we > > should not even consider it as a requirement and explicitly document > > it as a "not supported" use case. > > At the very least, when a task is moved from one cgroup to another, > we've got a shared inode case. This probably won't happen more than > once for most tasks, but it will likely be common. I am hoping that for such cases, sooner or later, inode movement will take place automatically. At some point the inode will be clean and no longer on the memcg_bdi list. And when it is dirtied again, I am hoping it will be queued on the new group's list and not on the old group's list? Greg? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 15:39 ` Vivek Goyal @ 2011-04-06 19:49 ` Greg Thelen 2011-04-06 23:07 ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen 1 sibling, 0 replies; 138+ messages in thread From: Greg Thelen @ 2011-04-06 19:49 UTC (permalink / raw) To: Vivek Goyal Cc: Curt Wohlgemuth, Dave Chinner, James Bottomley, lsf, linux-fsdevel On Wed, Apr 6, 2011 at 8:39 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote: > > [..] >> > Can someone describe a valid shared inode use case? If not, we >> > should not even consider it as a requirement and explicitly document >> > it as a "not supported" use case. >> >> At the very least, when a task is moved from one cgroup to another, >> we've got a shared inode case. This probably won't happen more than >> once for most tasks, but it will likely be common. > > I am hoping that for such cases sooner or later inode movement will > automatically take place. At some point of time, inode will be clean > and no more on memcg_bdi list. And when it is dirtied again, I am > hoping it will be queued on new groups's list and not on old group's > list? Greg? > > Thanks > Vivek When an inode is marked dirty, current->memcg is used to determine which per memcg b_dirty list within the bdi is used to queue the inode. When the inode is marked clean, then the inode is removed from the per memcg b_dirty list. So, as Vivek said, when a process is migrated between memcg, then the previously dirtied inodes will not be moved. Once such inodes are marked clean, and the re-dirtied, then they will be requeued to the correct per memcg dirty inode list. Here's an overview of the approach, which is assumes inode sharing is rare but possible. Thus, such sharing is tolerated (no live locks, etc) but not optimized. bdi -> 1:N -> bdi_memcg -> 1:N -> inode mark_inode_dirty(inode) If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty using current->memcg as a key to select the correct list. This will require memory allocation of bdi_memcg, if this is the first inode within the bdi,memcg. If the allocation fails (rare, but possible), then fallback to adding the memcg to the root cgroup dirty inode list. If I_DIRTY is already set, then do nothing. When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg if the list is now empty. balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) if over bg limit, then set bdi_memcg->b_over_limit If there is no bdi_memcg (because all inodes of current’s memcg dirty pages where first dirtied by other memcg) then memcg lru to find inode and call writeback_single_inode(). This is to handle uncommon sharing. reference memcg for bdi flusher awake bdi flusher if over fg limit IO-full: write bdi_memcg directly (if empty use memcg lru to find inode to write) IO-less: queue memcg-waiting description to bdi flusher. bdi_flusher(bdi): process work queue, which will not include any memcg flusher work - just like current code. once work queue is empty: wb_check_old_data_flush(): write old inodes from each of the per-memcg dirty lists. wb_check_background_flush(): if any of bdi_memcg->b_over_limit is set, then write bdi_memcg->b_dirty inodes until under limit. After writing some data, recheck to see if memcg is still over bg_thresh. If under limit, then clear b_over_limit and release memcg reference. 
If unable to bring memcg dirty usage below the bg limit after bdi_memcg->b_dirty is empty, release the memcg reference and return. Next time the memcg calls balance_dirty_pages it will either select another bdi or use the lru to find an inode. Use over_bground_thresh() to check the global background limit. When a memcg is deleted it may leave behind bdi_memcg structures. These memcg pointers are not referenced. As the remaining inodes are cleaned, the bdi_memcg b_dirty list will become empty and the bdi_memcg will be deleted. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
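As a rough illustration of the queueing step described above, the hook called from __mark_inode_dirty() could look something like the sketch below. bdi_memcg_find_or_create(), bdi_memcg_root() and the current->memcg shorthand are assumptions taken from this thread, not existing kernel interfaces, and locking of the bdi lists is omitted:

/*
 * Sketch: queue a newly dirtied inode on the per-memcg dirty list of its
 * bdi, falling back to the root group's list if allocating a bdi_memcg
 * fails (the rare case mentioned above).
 */
static void memcg_queue_dirty_inode(struct inode *inode)
{
        struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
        struct mem_cgroup *memcg = current->memcg;      /* shorthand used in this thread */
        struct bdi_memcg *bm;

        bm = bdi_memcg_find_or_create(bdi, memcg);      /* hypothetical helper */
        if (!bm)
                bm = bdi_memcg_root(bdi);               /* allocation failed: root fallback */

        /* replaces the list_move() to bdi->wb.b_dirty done today */
        list_move(&inode->i_wb_list, &bm->b_dirty);
}

The matching step on the clean side is simply removing the inode from bm->b_dirty and freeing the bdi_memcg once that list goes empty, as described above.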
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-06 15:39 ` Vivek Goyal 2011-04-06 19:49 ` Greg Thelen @ 2011-04-06 23:07 ` Greg Thelen 2011-04-06 23:36 ` Dave Chinner 1 sibling, 1 reply; 138+ messages in thread From: Greg Thelen @ 2011-04-06 23:07 UTC (permalink / raw) To: Vivek Goyal Cc: Curt Wohlgemuth, Dave Chinner, James Bottomley, lsf, linux-fsdevel Vivek Goyal <vgoyal@redhat.com> writes: > On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote: > > [..] >> > Can someone describe a valid shared inode use case? If not, we >> > should not even consider it as a requirement and explicitly document >> > it as a "not supported" use case. >> >> At the very least, when a task is moved from one cgroup to another, >> we've got a shared inode case. This probably won't happen more than >> once for most tasks, but it will likely be common. > > I am hoping that for such cases sooner or later inode movement will > automatically take place. At some point of time, inode will be clean > and no more on memcg_bdi list. And when it is dirtied again, I am > hoping it will be queued on new groups's list and not on old group's > list? Greg? > > Thanks > Vivek After more thought, a few tweaks to the previous design have emerged. I noted such differences with 'Clarification' below. When an inode is marked dirty, current->memcg is used to determine which per memcg b_dirty list within the bdi is used to queue the inode. When the inode is marked clean, then the inode is removed from the per memcg b_dirty list. So, as Vivek said, when a process is migrated between memcg, then the previously dirtied inodes will not be moved. Once such inodes are marked clean, and the re-dirtied, then they will be requeued to the correct per memcg dirty inode list. Here's an overview of the approach, which is assumes inode sharing is rare but possible. Thus, such sharing is tolerated (no live locks, etc) but not optimized. bdi -> 1:N -> bdi_memcg -> 1:N -> inode mark_inode_dirty(inode) If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty using current->memcg as a key to select the correct list. This will require memory allocation of bdi_memcg, if this is the first inode within the bdi,memcg. If the allocation fails (rare, but possible), then fallback to adding the memcg to the root cgroup dirty inode list. If I_DIRTY is already set, then do nothing. When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg if the list is now empty. balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) if over bg limit, then set bdi_memcg->b_over_limit If there is no bdi_memcg (because all inodes of current’s memcg dirty pages where first dirtied by other memcg) then memcg lru to find inode and call writeback_single_inode(). This is to handle uncommon sharing. reference memcg for bdi flusher awake bdi flusher if over fg limit IO-full: write bdi_memcg directly (if empty use memcg lru to find inode to write) Clarification: In IO-less: queue memcg-waiting description to bdi flusher waiters (balance_list). Clarification: wakeup_flusher_threads(): would take an optional memcg parameter, which would be included in the created work item. try_to_free_pages() would pass in a memcg. Other callers would pass in NULL. bdi_flusher(bdi): Clarification: When processing the bdi work queue, some work items may include a memcg (see wakeup_flusher_threads above). If present, use the specified memcg to determine which bdi_memcg (and thus b_dirty list) should be used. 
If NULL, then all bdi_memcg would be considered to process all inodes within the bdi. Once the work queue is empty: wb_check_old_data_flush(): write old inodes from each of the per-memcg dirty lists. wb_check_background_flush(): if any bdi_memcg->b_over_limit is set, then write bdi_memcg->b_dirty inodes until under the limit. After writing some data, recheck to see if the memcg is still over bg_thresh. If under the limit, then clear b_over_limit and release the memcg reference. If unable to bring memcg dirty usage below the bg limit after bdi_memcg->b_dirty is empty, release the memcg reference and return. Next time the memcg calls balance_dirty_pages it will either select another bdi or use the lru to find an inode. Use over_bground_thresh() to check the global background limit. When a memcg is deleted it may leave behind bdi_memcg structures. These memcg pointers are not referenced. As the remaining inodes are cleaned, the bdi_memcg b_dirty list will become empty and the bdi_memcg will be deleted. Too much code churn in writeback is not good. So these memcg writeback enhancements should probably wait for IO-less dirty throttling to get worked out. These memcg messages are design-level discussions to get me heading in the right direction. I plan on implementing memcg-aware writeback in the background while IO-less balance_dirty_pages is worked out so I can follow it up. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
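A sketch of the wakeup_flusher_threads()/work-item side of the clarification above is given below. The memcg field and bdi_start_memcg_writeback() are hypothetical additions; the existing wb_writeback_work fields and bdi_queue_work() usage are only reproduced approximately, from memory:

/* Hypothetical extension of the flusher work item: a NULL memcg keeps
 * today's behaviour of writing back the whole bdi. */
struct wb_writeback_work {
        long nr_pages;
        struct super_block *sb;
        enum writeback_sync_modes sync_mode;
        struct mem_cgroup *memcg;       /* NEW: restrict writeback to this cgroup */
        /* ... remaining existing fields (for_background, list, done, ...) ... */
};

/* Caller side, e.g. wakeup_flusher_threads() acting on behalf of
 * try_to_free_pages(): ask the bdi flusher to clean one cgroup. */
static void bdi_start_memcg_writeback(struct backing_dev_info *bdi,
                                      struct mem_cgroup *memcg, long nr_pages)
{
        struct wb_writeback_work *work;

        work = kzalloc(sizeof(*work), GFP_ATOMIC);
        if (!work)
                return;                 /* best effort, as with existing nowait callers */

        work->nr_pages = nr_pages;
        work->sync_mode = WB_SYNC_NONE;
        work->memcg = memcg;            /* flusher picks the matching bdi_memcg */
        bdi_queue_work(bdi, work);
}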
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-06 23:07 ` [Lsf] IO less throttling and cgroup aware writeback Greg Thelen @ 2011-04-06 23:36 ` Dave Chinner 2011-04-07 19:24 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-06 23:36 UTC (permalink / raw) To: Greg Thelen Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 04:07:14PM -0700, Greg Thelen wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > > On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote: > > > > [..] > >> > Can someone describe a valid shared inode use case? If not, we > >> > should not even consider it as a requirement and explicitly document > >> > it as a "not supported" use case. > >> > >> At the very least, when a task is moved from one cgroup to another, > >> we've got a shared inode case. This probably won't happen more than > >> once for most tasks, but it will likely be common. > > > > I am hoping that for such cases sooner or later inode movement will > > automatically take place. At some point of time, inode will be clean > > and no more on memcg_bdi list. And when it is dirtied again, I am > > hoping it will be queued on new groups's list and not on old group's > > list? Greg? > > > > Thanks > > Vivek > > After more thought, a few tweaks to the previous design have emerged. I > noted such differences with 'Clarification' below. > > When an inode is marked dirty, current->memcg is used to determine > which per memcg b_dirty list within the bdi is used to queue the > inode. When the inode is marked clean, then the inode is removed from > the per memcg b_dirty list. So, as Vivek said, when a process is > migrated between memcg, then the previously dirtied inodes will not be > moved. Once such inodes are marked clean, and the re-dirtied, then > they will be requeued to the correct per memcg dirty inode list. > > Here's an overview of the approach, which is assumes inode sharing is > rare but possible. Thus, such sharing is tolerated (no live locks, > etc) but not optimized. > > bdi -> 1:N -> bdi_memcg -> 1:N -> inode > > mark_inode_dirty(inode) > If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty > using current->memcg as a key to select the correct list. > This will require memory allocation of bdi_memcg, if this is the > first inode within the bdi,memcg. If the allocation fails (rare, > but possible), then fallback to adding the memcg to the root > cgroup dirty inode list. > If I_DIRTY is already set, then do nothing. This is where it gets tricky. Page cache dirtiness is tracked via I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and I_DIRTY_SYNC are for inode metadata changes, and a lot of filesystems track those themselves. Indeed, XFS doesn't mark inodes dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any more, and there's no way that tracking can be made cgroup aware. Hence it can be the case that only I_DIRTY_PAGES is tracked in the VFS dirty lists, and that is the flag you need to care about here. Further, we are actually looking at formalising this - changing the .dirty_inode() operation to take the dirty flags and return a result that indicates whether the inode should be tracked in the VFS dirty list at all. This would stop double tracking of dirty inodes and go a long way to solving some of the behavioural issues we have now (e.g. the VFS tracking and trying to writeback inodes that the filesystem has already cleaned). 
Hence I think you need to be explicit that this tracking is specifically for I_DIRTY_PAGES state, though will handle other dirty inode states if desired by the filesystem. > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg > if the list is now empty. > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) > if over bg limit, then > set bdi_memcg->b_over_limit > If there is no bdi_memcg (because all inodes of current’s > memcg dirty pages where first dirtied by other memcg) then > memcg lru to find inode and call writeback_single_inode(). > This is to handle uncommon sharing. We don't want to introduce any new IO sources into balance_dirty_pages(). This needs to trigger memcg-LRU based bdi flusher writeback, not try to write back inodes itself. Alternatively, this problem won't exist if you transfer page cache state from one memcg to another when you move the inode from one memcg to another. > reference memcg for bdi flusher > awake bdi flusher > if over fg limit > IO-full: write bdi_memcg directly (if empty use memcg lru to find > inode to write) > > Clarification: In IO-less: queue memcg-waiting description to bdi > flusher waiters (balance_list). I'd be looking at designing for IO-less throttling up front.... > Clarification: > wakeup_flusher_threads(): > would take an optional memcg parameter, which would be included in the > created work item. > > try_to_free_pages() would pass in a memcg. Other callers would pass > in NULL. > > > bdi_flusher(bdi): > Clarification: When processing the bdi work queue, some work items > may include a memcg (see wakeup_flusher_threads above). If present, > use the specified memcg to determine which bdi_memcg (and thus > b_dirty list) should be used. If NULL, then all bdi_memcg would be > considered to process all inodes within the bdi. > > once work queue is empty: > wb_check_old_data_flush(): > write old inodes from each of the per-memcg dirty lists. > > wb_check_background_flush(): > if any of bdi_memcg->b_over_limit is set, then write > bdi_memcg->b_dirty inodes until under limit. > > After writing some data, recheck to see if memcg is still over > bg_thresh. If under limit, then clear b_over_limit and release > memcg reference. > > If unable to bring memcg dirty usage below bg limit after > bdi_memcg->b_dirty is empty, release memcg reference and return. > Next time memcg calls balance_dirty_pages it will either select > another bdi or use lru to find an inode. I think all the background flush cares about is bringing memcg's under the dirty limit. What balance_dirty_pages() does is irrelevant to the background flush. > use over_bground_thresh() to check global background limit. the background flush needs to continue while over the global limit even if all the memcg's are under their limits. In which case, we need to consider if we need to be fair when writing back memcg's on a bdi i.e. do we cycle an inode at a time until b_io is empty, then cycle to the next memcg, and not come back to the first memcg with inodes queued on b_more_io until they all have empty b_io queues? > When a memcg is deleted it may leave behind bdi_memcg structures. These memcg > pointers are not referenced. As such inodes are cleaned, the bdi_memcg b_dirty > list will become empty and bdi_memcg will be deleted. So you need to reference count the bdi_memcg structures? > Too much code churn in writeback is not good. So these memcg writeback enhancements should probably wait for IO-less dirty throttling to get worked out. Agreed.
We're probably looking at .41 or .42 for any memcg writeback enhancements. > These memcg messages are design level discussions to get me > heading the right direction. I plan on implementing memcg aware > writeback in the background while IO-less balance_dirty_pages is worked > out so I can follow it up. Great! Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
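Dave's I_DIRTY_PAGES point translates into a small guard in the dirtying path. A sketch, reusing the hypothetical memcg_queue_dirty_inode() helper from earlier in the thread; field names are as of the kernels being discussed:

/*
 * Sketch: called from __mark_inode_dirty(inode, flags) when the inode
 * first becomes dirty.  Only page-cache dirtiness is tracked per memcg;
 * metadata-only dirtying (I_DIRTY_SYNC / I_DIRTY_DATASYNC) stays on the
 * plain per-bdi list -- and for filesystems like XFS may never reach the
 * VFS dirty lists at all.
 */
static void queue_dirty_inode(struct inode *inode, int flags)
{
        struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

        if (flags & I_DIRTY_PAGES)
                memcg_queue_dirty_inode(inode);         /* per-memcg b_dirty list */
        else
                list_move(&inode->i_wb_list, &bdi->wb.b_dirty); /* existing behaviour */
}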
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-06 23:36 ` Dave Chinner @ 2011-04-07 19:24 ` Vivek Goyal 2011-04-07 20:33 ` Christoph Hellwig 2011-04-07 23:42 ` Dave Chinner 0 siblings, 2 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-07 19:24 UTC (permalink / raw) To: Dave Chinner Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote: [..] > > mark_inode_dirty(inode) > > If I_DIRTY is clear, set I_DIRTY and inserted inode into bdi_memcg->b_dirty > > using current->memcg as a key to select the correct list. > > This will require memory allocation of bdi_memcg, if this is the > > first inode within the bdi,memcg. If the allocation fails (rare, > > but possible), then fallback to adding the memcg to the root > > cgroup dirty inode list. > > If I_DIRTY is already set, then do nothing. > > This is where it gets tricky. Page cache dirtiness is tracked via > I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and > I_DIRTY_SYNC are for inode metadata changes, and a lot of > filesystems track those themselves. Indeed, XFS doesn't mark inodes > dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any > more, and there's no way that tracking can be made cgroup aware. > > Hence it can be the case that only I_DIRTY_PAGES is tracked in > the VFS dirty lists, and that is the flag you need to care about > here. > > Further, we are actually looking at formalising this - changing the > .dirty_inode() operation to take the dirty flags and return a result > that indicates whether the inode should be tracked in the VFS dirty > list at all. This would stop double tracking of dirty inodes and go > a long way to solving some of the behavioural issues we have now > (e.g. the VFS tracking and trying to writeback inodes that the > filesystem has already cleaned). > > Hence I think you need to be explicit that this tracking is > specifically for I_DIRTY_PAGES state, though will handle other dirty > inode states if desired by the filesytem. Ok, that makes sense. We are interested primarily in I_DIRTY_PAGES state only. IIUC, so first we need to fix existing code where we seem to be moving any inode on bdi writeback list based on I_DIRTY flag. BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To me both seem to mean that data needs to be written back and not the inode itself. > > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg > > if the list is now empty. > > > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) > > if over bg limit, then > > set bdi_memcg->b_over_limit > > If there is no bdi_memcg (because all inodes of current’s > > memcg dirty pages where first dirtied by other memcg) then > > memcg lru to find inode and call writeback_single_inode(). > > This is to handle uncommon sharing. > > We don't want to introduce any new IO sources into > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi > flusher writeback, not try to write back inodes itself. Will we not enjoy more sequtial IO traffic once we find an inode by traversing memcg->lru list? So isn't that better than pure LRU based flushing? > > Alternatively, this problem won't exist if you transfer page щache > state from one memcg to another when you move the inode from one > memcg to another. But in case of shared inode problem still remains. inode is being written from two cgroups and it can't be in both the groups as per the exisiting design. 
> > > reference memcg for bdi flusher > > awake bdi flusher > > if over fg limit > > IO-full: write bdi_memcg directly (if empty use memcg lru to find > > inode to write) > > > > Clarification: In IO-less: queue memcg-waiting description to bdi > > flusher waiters (balance_list). > > I'd be looking at designing for IO-less throttling up front.... Agreed. Let's design it on top of the IO-less throttling patches. We will also have to modify IO-less throttling a bit so that page completions are not distributed uniformly across all threads; instead we need to account for groups first and then distribute completions uniformly within a group. [..] > > use over_bground_thresh() to check global background limit. > > the background flush needs to continue while over the global limit > even if all the memcg's are under their limits. In which case, we > need to consider if we need to be fair when writing back memcg's on > a bdi i.e. do we cycle an inode at a time until b_io is empty, then > cycle to the next memcg, and not come back to the first memcg with > inodes queued on b_more_io until they all have empty b_io queues? > I think continuing to cycle through memcg's even in this case makes sense. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-07 19:24 ` Vivek Goyal @ 2011-04-07 20:33 ` Christoph Hellwig 2011-04-07 21:34 ` Vivek Goyal 2011-04-07 23:42 ` Dave Chinner 1 sibling, 1 reply; 138+ messages in thread From: Christoph Hellwig @ 2011-04-07 20:33 UTC (permalink / raw) To: Vivek Goyal; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote: > IIUC, so first we need to fix existing code where we seem to be moving > any inode on bdi writeback list based on I_DIRTY flag. I_DIRTY is a set of flags. Inodes are on the dirty list if any of the flags is set. > BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To > me both seem to mean that data needs to be written back and not the > inode itself. I_DIRTY_PAGES means dirty data (pages) I_DIRTY_DATASYNC means dirty metadata which needs to be written for fdatasync I_DIRTY_SYNC means dirty metadata which only needs to be written for fsync ^ permalink raw reply [flat|nested] 138+ messages in thread
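For reference, the definitions in include/linux/fs.h (reproduced from memory, so worth double-checking against the tree) encode exactly that relationship:

#define I_DIRTY_SYNC            (1 << 0)        /* inode metadata dirty; not needed for fdatasync() */
#define I_DIRTY_DATASYNC        (1 << 1)        /* inode metadata dirty; needed even for fdatasync() */
#define I_DIRTY_PAGES           (1 << 2)        /* inode has dirty page cache data */

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)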
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-07 20:33 ` Christoph Hellwig @ 2011-04-07 21:34 ` Vivek Goyal 0 siblings, 0 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-07 21:34 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 04:33:03PM -0400, Christoph Hellwig wrote: > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote: > > IIUC, so first we need to fix existing code where we seem to be moving > > any inode on bdi writeback list based on I_DIRTY flag. > > I_DIRTY is a set of flags. Inodes are on the dirty list if any of > the flags is set. > > > BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES? To > > me both seem to mean that data needs to be written back and not the > > inode itself. > > I_DIRTY_PAGES means dirty data (pages) > I_DIRTY_DATASYNC means dirty metadata which needs to be written for fdatasync > I_DIRTY_SYNC means dirty metadata which only needs to be written for fsync Ok, that helps. Thanks. So an fdatasync() can write back some metadata too if I_DIRTY_DATASYNC is set. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-07 19:24 ` Vivek Goyal 2011-04-07 20:33 ` Christoph Hellwig @ 2011-04-07 23:42 ` Dave Chinner 2011-04-08 0:59 ` Greg Thelen 2011-04-08 13:43 ` Vivek Goyal 1 sibling, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-07 23:42 UTC (permalink / raw) To: Vivek Goyal Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote: > On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote: [...] > > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg > > > if the list is now empty. > > > > > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) > > > if over bg limit, then > > > set bdi_memcg->b_over_limit > > > If there is no bdi_memcg (because all inodes of current’s > > > memcg dirty pages where first dirtied by other memcg) then > > > memcg lru to find inode and call writeback_single_inode(). > > > This is to handle uncommon sharing. > > > > We don't want to introduce any new IO sources into > > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi > > flusher writeback, not try to write back inodes itself. > > Will we not enjoy more sequtial IO traffic once we find an inode by > traversing memcg->lru list? So isn't that better than pure LRU based > flushing? Sorry, I wasn't particularly clear there, What I meant was that we ask the bdi-flusher thread to select the inode to write back from the LRU, not do it directly from balance_dirty_pages(). i.e. bdp stays IO-less. > > Alternatively, this problem won't exist if you transfer page щache > > state from one memcg to another when you move the inode from one > > memcg to another. > > But in case of shared inode problem still remains. inode is being written > from two cgroups and it can't be in both the groups as per the exisiting > design. But we've already determined that there is no use case for this shared inode behaviour, so we aren't going to explictly support it, right? Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-07 23:42 ` Dave Chinner @ 2011-04-08 0:59 ` Greg Thelen 2011-04-08 1:25 ` Dave Chinner 2011-04-08 13:43 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Greg Thelen @ 2011-04-08 0:59 UTC (permalink / raw) To: Dave Chinner Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel, linux-mm cc: linux-mm Dave Chinner <david@fromorbit.com> writes: > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote: >> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote: > [...] >> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg >> > > if the list is now empty. >> > > >> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) >> > > if over bg limit, then >> > > set bdi_memcg->b_over_limit >> > > If there is no bdi_memcg (because all inodes of current’s >> > > memcg dirty pages where first dirtied by other memcg) then >> > > memcg lru to find inode and call writeback_single_inode(). >> > > This is to handle uncommon sharing. >> > >> > We don't want to introduce any new IO sources into >> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi >> > flusher writeback, not try to write back inodes itself. >> >> Will we not enjoy more sequtial IO traffic once we find an inode by >> traversing memcg->lru list? So isn't that better than pure LRU based >> flushing? > > Sorry, I wasn't particularly clear there, What I meant was that we > ask the bdi-flusher thread to select the inode to write back from > the LRU, not do it directly from balance_dirty_pages(). i.e. > bdp stays IO-less. > >> > Alternatively, this problem won't exist if you transfer page щache >> > state from one memcg to another when you move the inode from one >> > memcg to another. >> >> But in case of shared inode problem still remains. inode is being written >> from two cgroups and it can't be in both the groups as per the exisiting >> design. > > But we've already determined that there is no use case for this > shared inode behaviour, so we aren't going to explictly support it, > right? > > Cheers, > > Dave. I am thinking that we should avoid ever scanning the memcg lru for dirty pages or corresponding dirty inodes previously associated with other memcg. I think the only reason we considered scanning the lru was to handle the unexpected shared inode case. When such inode sharing occurs the sharing memcg will not be confined to the memcg's dirty limit. There's always the memcg hard limit to cap memcg usage. I'd like to add a counter (or at least tracepoint) to record when such unsupported usage is detected. Here's an example time line of such sharing: 1. memcg_1/process_a, writes to /var/log/messages and closes the file. This marks the inode in the bdi_memcg for memcg_1. 2. memcg_2/process_b, continually writes to /var/log/messages. This drives up memcg_2 dirty memory usage to the memcg_2 background threshold. mem_cgroup_balance_dirty_pages() would normally mark the corresponding bdi_memcg as over-bg-limit and kick the bdi_flusher and then return to the dirtying process. However, there is no bdi_memcg because there are no dirty inodes for memcg_2. So the bdi flusher sees no bdi_memcg as marked over-limit, so bdi flusher writes nothing (assuming we're still below system background threshold). 3. memcg_2/process_b, continues writing to /var/log/messages hitting the memcg_2 dirty memory foreground threshold. 
Using IO-less balance_dirty_pages(), normally mem_cgroup_balance_dirty_pages() would block waiting for the previously kicked bdi flusher to clean some memcg_2 pages. In this case mem_cgroup_balance_dirty_pages() sees no bdi_memcg and concludes that bdi flusher will not be lowering memcg dirty memory usage. This is the unsupported sharing case, so mem_cgroup_balance_dirty_pages() fires a tracepoint and just returns allowing memcg_2 dirty memory to exceed its foreground limit growing upwards to the memcg_2 memory limit_in_bytes. Once limit_in_bytes is hit it will use per memcg direct reclaim to recycle memcg_2 pages, including the previously written memcg_2 /var/log/messages dirty pages. By cutting out lru scanning the code should be simpler and still handle the common case well. If we later find that this supposed uncommon shared inode case is important then we can either implement the previously described lru scanning in mem_cgroup_balance_dirty_pages() or consider extending the bdi/memcg/inode data structures (perhaps with a memcg_mapping) to describe such sharing. > Cheers, > > Dave. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 138+ messages in thread
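A sketch of the fallback described in step 3 follows. Every helper and the tracepoint name below are hypothetical; the point is only that the unsupported shared case degrades to "trace it and let the memcg hard limit bound the damage" rather than scanning the LRU:

/* Sketch of mem_cgroup_balance_dirty_pages(memcg, bdi) in the IO-less
 * scheme.  All helpers below are hypothetical. */
static void mem_cgroup_balance_dirty_pages(struct mem_cgroup *memcg,
                                           struct backing_dev_info *bdi)
{
        struct bdi_memcg *bm = bdi_memcg_lookup(bdi, memcg);

        if (!memcg_over_bg_thresh(memcg))
                return;

        if (!bm) {
                /*
                 * Unsupported sharing: this memcg's dirty pages sit on
                 * inodes owned by other cgroups, so there is nothing on
                 * our own lists to flush.  Record the event and return;
                 * the memcg hard limit / direct reclaim caps the damage.
                 */
                trace_memcg_foreign_dirty(memcg, bdi);
                return;
        }

        bm->over_bg_limit = true;
        bdi_start_memcg_writeback(bdi, memcg, 0);       /* kick the flusher */

        if (memcg_over_fg_thresh(memcg))
                memcg_wait_for_writeback(memcg, bdi);   /* IO-less throttle point */
}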
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-08 0:59 ` Greg Thelen @ 2011-04-08 1:25 ` Dave Chinner 2011-04-12 3:17 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-08 1:25 UTC (permalink / raw) To: Greg Thelen Cc: Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel, linux-mm On Thu, Apr 07, 2011 at 05:59:35PM -0700, Greg Thelen wrote: > cc: linux-mm > > Dave Chinner <david@fromorbit.com> writes: > > > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote: > >> On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote: > > [...] > >> > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg > >> > > if the list is now empty. > >> > > > >> > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) > >> > > if over bg limit, then > >> > > set bdi_memcg->b_over_limit > >> > > If there is no bdi_memcg (because all inodes of current’s > >> > > memcg dirty pages where first dirtied by other memcg) then > >> > > memcg lru to find inode and call writeback_single_inode(). > >> > > This is to handle uncommon sharing. > >> > > >> > We don't want to introduce any new IO sources into > >> > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi > >> > flusher writeback, not try to write back inodes itself. > >> > >> Will we not enjoy more sequtial IO traffic once we find an inode by > >> traversing memcg->lru list? So isn't that better than pure LRU based > >> flushing? > > > > Sorry, I wasn't particularly clear there, What I meant was that we > > ask the bdi-flusher thread to select the inode to write back from > > the LRU, not do it directly from balance_dirty_pages(). i.e. > > bdp stays IO-less. > > > >> > Alternatively, this problem won't exist if you transfer page щache > >> > state from one memcg to another when you move the inode from one > >> > memcg to another. > >> > >> But in case of shared inode problem still remains. inode is being written > >> from two cgroups and it can't be in both the groups as per the exisiting > >> design. > > > > But we've already determined that there is no use case for this > > shared inode behaviour, so we aren't going to explictly support it, > > right? > > I am thinking that we should avoid ever scanning the memcg lru for dirty > pages or corresponding dirty inodes previously associated with other > memcg. I think the only reason we considered scanning the lru was to > handle the unexpected shared inode case. When such inode sharing occurs > the sharing memcg will not be confined to the memcg's dirty limit. > There's always the memcg hard limit to cap memcg usage. Yup, fair enough. > I'd like to add a counter (or at least tracepoint) to record when such > unsupported usage is detected. Definitely. Very good idea. > 1. memcg_1/process_a, writes to /var/log/messages and closes the file. > This marks the inode in the bdi_memcg for memcg_1. > > 2. memcg_2/process_b, continually writes to /var/log/messages. This > drives up memcg_2 dirty memory usage to the memcg_2 background > threshold. mem_cgroup_balance_dirty_pages() would normally mark the > corresponding bdi_memcg as over-bg-limit and kick the bdi_flusher and > then return to the dirtying process. However, there is no bdi_memcg > because there are no dirty inodes for memcg_2. So the bdi flusher > sees no bdi_memcg as marked over-limit, so bdi flusher writes nothing > (assuming we're still below system background threshold). > > 3. 
memcg_2/process_b, continues writing to /var/log/messages hitting the > memcg_2 dirty memory foreground threshold. Using IO-less > balance_dirty_pages(), normally mem_cgroup_balance_dirty_pages() > would block waiting for the previously kicked bdi flusher to clean > some memcg_2 pages. In this case mem_cgroup_balance_dirty_pages() > sees no bdi_memcg and concludes that bdi flusher will not be lowering > memcg dirty memory usage. This is the unsupported sharing case, so > mem_cgroup_balance_dirty_pages() fires a tracepoint and just returns > allowing memcg_2 dirty memory to exceed its foreground limit growing > upwards to the memcg_2 memory limit_in_bytes. Once limit_in_bytes is > hit it will use per memcg direct reclaim to recycle memcg_2 pages, > including the previously written memcg_2 /var/log/messages dirty > pages. Thanks for the good, simple example. > By cutting out lru scanning the code should be simpler and still > handle the common case well. Agreed. > If we later find that this supposed uncommon shared inode case is > important then we can either implement the previously described lru > scanning in mem_cgroup_balance_dirty_pages() or consider extending the > bdi/memcg/inode data structures (perhaps with a memcg_mapping) to > describe such sharing. Hmm, another idea I just had. What we're trying to avoid is needing to a) track inodes in multiple lists, and b) scanning to find something appropriate to write back. Rather than tracking at page or inode granularity, how about tracking "associated" memcgs at the memcg level? i.e. when we detect an inode is already dirty in another memcg, link the current memcg to the one that contains the inode. Hence if we get a situation where a memcg is throttling with no dirty inodes, it can quickly find and start writeback in an "associated" memcg that it _knows_ contain shared dirty inodes. Once we've triggered writeback on an associated memcg, it is removed from the list.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
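One way to picture the "associated memcg" idea (purely illustrative; none of these fields or helpers exist, and locking and allocation context are glossed over): each memcg keeps a short list of other memcgs known to own inodes it has dirtied, so a throttled memcg with an empty dirty list has somewhere cheap to point the flusher:

/* Sketch only: remember which other cgroup owns an inode we just dirtied. */
struct memcg_assoc {
        struct list_head link;          /* entry in memcg->associated (hypothetical) */
        struct mem_cgroup *owner;       /* memcg whose dirty list holds our inodes */
};

static void memcg_note_foreign_inode(struct mem_cgroup *me,
                                     struct mem_cgroup *inode_owner)
{
        struct memcg_assoc *a;

        if (me == inode_owner || memcg_is_associated(me, inode_owner))
                return;

        a = kmalloc(sizeof(*a), GFP_ATOMIC);    /* called from the dirtying path */
        if (!a)
                return;                         /* best effort */
        a->owner = inode_owner;
        list_add_tail(&a->link, &me->associated);
}

/*
 * Throttling path with no dirty inodes of its own: pop an associated
 * memcg, start background writeback on it, and drop it from the list,
 * as described above.
 */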
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-08 1:25 ` Dave Chinner @ 2011-04-12 3:17 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 138+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-04-12 3:17 UTC (permalink / raw) To: Dave Chinner Cc: Greg Thelen, Vivek Goyal, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel, linux-mm On Fri, 8 Apr 2011 11:25:56 +1000 Dave Chinner <david@fromorbit.com> wrote: > On Thu, Apr 07, 2011 at 05:59:35PM -0700, Greg Thelen wrote: > > cc: linux-mm > > > > Dave Chinner <david@fromorbit.com> writes: > > If we later find that this supposed uncommon shared inode case is > > important then we can either implement the previously described lru > > scanning in mem_cgroup_balance_dirty_pages() or consider extending the > > bdi/memcg/inode data structures (perhaps with a memcg_mapping) to > > describe such sharing. > > Hmm, another idea I just had. What we're trying to avoid is needing > to a) track inodes in multiple lists, and b) scanning to find > something appropriate to write back. > > Rather than tracking at page or inode granularity, how about > tracking "associated" memcgs at the memcg level? i.e. when we detect > an inode is already dirty in another memcg, link the current memcg > to the one that contains the inode. Hence if we get a situation > where a memcg is throttling with no dirty inodes, it can quickly > find and start writeback in an "associated" memcg that it _knows_ > contain shared dirty inodes. Once we've triggered writeback on an > associated memcg, it is removed from the list.... > Thank you for the idea. I think we can start with the following. 0. add some feature to set a 'preferred inode' for a memcg. I think fadvise(fd, MAKE_THIS_FILE_UNDER_MY_MEMCG) or echo fd > /memory.move_file_here can be added. 1. account dirty pages per memcg, as Greg does. 2. at the same time, account dirty pages made dirty by threads in a memcg. (to check whether an internal or external thread made the page dirty.) 3. calculate the gap between internal and external dirty pages. With that gap, we have several choices. 4-a. If it exceeds some threshold, send a notification. A userland daemon can decide whether or not to move pages to some memcg. (Of course, if the _shared_ dirtying can be caught before the page is made dirty, the user daemon can move the inode before it is dirtied, via inotify().) I like help from userland because it can be more flexible than the kernel; it can eat config files. 4-b. set a flag on the memcg meaning 'this memcg is dirty-busy because of some external threads'. When a page is newly dirtied, check the thread's memcg. If the memcg of the thread and the page differ, write a memo saying 'please check this memcgid, too' in the task_struct and do a double-memcg check in balance_dirty_pages(). (How to clear the per-task flag is difficult ;) I don't want to handle the case where 3-100 threads do shared writes this way ;) for that we'll need 4-a. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 138+ messages in thread
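For item 0, the userspace side of the proposed interface would presumably look something like the sketch below. Both the advice value and the memory.move_file_here file are hypothetical proposals from the mail above (and the cgroup mount point is only an example), so this shows nothing more than the shape of the call:

#include <fcntl.h>
#include <stdio.h>

/* Made-up advice value for fadvise(fd, MAKE_THIS_FILE_UNDER_MY_MEMCG);
 * no such advice exists in any kernel. */
#define POSIX_FADV_MOVE_TO_MY_MEMCG     42

static int claim_file_for_my_memcg(int fd)
{
        /* Variant A: the proposed fadvise() hint, via the posix_fadvise() wrapper. */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_MOVE_TO_MY_MEMCG) == 0)
                return 0;

        /* Variant B: echo fd > <memcg>/memory.move_file_here (path is an example). */
        FILE *f = fopen("/cgroup/memory/mygroup/memory.move_file_here", "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", fd);
        fclose(f);
        return 0;
}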
* Re: [Lsf] IO less throttling and cgroup aware writeback 2011-04-07 23:42 ` Dave Chinner 2011-04-08 0:59 ` Greg Thelen @ 2011-04-08 13:43 ` Vivek Goyal 1 sibling, 0 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-08 13:43 UTC (permalink / raw) To: Dave Chinner Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf, linux-fsdevel On Fri, Apr 08, 2011 at 09:42:49AM +1000, Dave Chinner wrote: > On Thu, Apr 07, 2011 at 03:24:24PM -0400, Vivek Goyal wrote: > > On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote: > [...] > > > > When I_DIRTY is cleared, remove inode from bdi_memcg->b_dirty. Delete bdi_memcg > > > > if the list is now empty. > > > > > > > > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi) > > > > if over bg limit, then > > > > set bdi_memcg->b_over_limit > > > > If there is no bdi_memcg (because all inodes of current’s > > > > memcg dirty pages where first dirtied by other memcg) then > > > > memcg lru to find inode and call writeback_single_inode(). > > > > This is to handle uncommon sharing. > > > > > > We don't want to introduce any new IO sources into > > > balance_dirty_pages(). This needs to trigger memcg-LRU based bdi > > > flusher writeback, not try to write back inodes itself. > > > > Will we not enjoy more sequtial IO traffic once we find an inode by > > traversing memcg->lru list? So isn't that better than pure LRU based > > flushing? > > Sorry, I wasn't particularly clear there, What I meant was that we > ask the bdi-flusher thread to select the inode to write back from > the LRU, not do it directly from balance_dirty_pages(). i.e. > bdp stays IO-less. Agreed. Even with cgroup aware writeback, we use bdi-flusher threads to do writeback and no direct writeback in bdp. > > > > Alternatively, this problem won't exist if you transfer page щache > > > state from one memcg to another when you move the inode from one > > > memcg to another. > > > > But in case of shared inode problem still remains. inode is being written > > from two cgroups and it can't be in both the groups as per the exisiting > > design. > > But we've already determined that there is no use case for this > shared inode behaviour, so we aren't going to explictly support it, > right? Well, we are not designing for shared inode to begin with but one can easily create that situation. So atleast we need to have some defined behavior that what happens if inodes are shared across multiple processes in same cgroup and across cgroups. Database might have multiple threads/processes doing IO to single file. What if somebody moves some threads out to a separate cgroup etc. So I am not saying that is common configuration but we need to define system behavior properly if sharing does happen. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 14:49 ` Curt Wohlgemuth 2011-04-06 15:39 ` Vivek Goyal @ 2011-04-06 23:08 ` Dave Chinner 2011-04-07 20:04 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-06 23:08 UTC (permalink / raw) To: Curt Wohlgemuth Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 07:49:25AM -0700, Curt Wohlgemuth wrote: > On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: > >> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote: > >> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: > >> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: > >> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > >> > > > > There > >> > > > > is no context (memcg or otherwise) given to the bdi flusher. After > >> > > > > the bdi flusher checks system-wide background limits, it uses the > >> > > > > over_bg_limit list to find (and rotate) an over limit memcg. Using > >> > > > > the memcg, then the per memcg per bdi dirty inode list is walked to > >> > > > > find inode pages to writeback. Once the memcg dirty memory usage > >> > > > > drops below the memcg-thresh, the memcg is removed from the global > >> > > > > over_bg_limit list. > >> > > > > >> > > > If you want controlled hand-off of writeback, you need to pass the > >> > > > memcg that triggered the throttling directly to the bdi. You already > >> > > > know what both the bdi and memcg that need writeback are. Yes, this > >> > > > needs concurrency at the BDI flush level to handle, but see my > >> > > > previous email in this thread for that.... > >> > > > > >> > > > >> > > Even with memcg being passed around I don't think that we get rid of > >> > > global list lock. > > ..... > >> > > The reason being that inodes are not exclusive to > >> > > the memory cgroups. Multiple memory cgroups might be writting to same > >> > > inode. So inode still remains in the global list and memory cgroups > >> > > kind of will have pointer to it. > >> > > >> > So two dirty inode lists that have to be kept in sync? That doesn't > >> > sound particularly appealing. Nor does it scale to an inode being > >> > dirty in multiple cgroups > >> > > >> > Besides, if you've got multiple memory groups dirtying the same > >> > inode, then you cannot expect isolation between groups. I'd consider > >> > this a broken configuration in this case - how often does this > >> > actually happen, and what is the use case for supporting > >> > it? > >> > > >> > Besides, the implications are that we'd have to break up contiguous > >> > IOs in the writeback path simply because two sequential pages are > >> > associated with different groups. That's really nasty, and exactly > >> > the opposite of all the write combining we try to do throughout the > >> > writeback path. Supporting this is also a mess, as we'd have to touch > >> > quite a lot of filesystem code (i.e. .writepage(s) inplementations) > >> > to do this. > >> > >> We did not plan on breaking up contigous IO even if these belonged to > >> different cgroup for performance reason. So probably can live with some > >> inaccuracy and just trigger the writeback for one inode even if that > >> meant that it could writeback the pages of some other cgroups doing IO > >> on that inode. 
> > > > Which, to me, violates the principle of isolation as it's been > > described that this functionality is supposed to provide. > > > > It also means you will have handle the case of a cgroup over a > > throttle limit and no inodes on it's dirty list. It's not a case of > > "probably can live with" the resultant mess, the mess will occur and > > so handling it needs to be designed in from the start. > > > >> > > So to start writeback on an inode > >> > > you still shall have to take global lock, IIUC. > >> > > >> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes > >> > in cgroup, and go from there? I mean, really all that cgroup-aware > >> > writeback needs is just adding a new container for managing > >> > dirty inodes in the writeback path and a method for selecting that > >> > container for writeback, right? > >> > >> This was the initial design where one inode is associated with one cgroup > >> even if process from multiple cgroups are doing IO to same inode. Then > >> somebody raised the concern that it probably is too coarse. > > > > Got a pointer? > > > >> IMHO, as a first step, associating inode to one cgroup exclusively > >> simplifies the things considerably and we can target that first. > >> > >> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes > >> makes sense and is relatively simple way of doing things at the expense > >> of not being accurate for shared inode case. > > > > Can someone describe a valid shared inode use case? If not, we > > should not even consider it as a requirement and explicitly document > > it as a "not supported" use case. > > At the very least, when a task is moved from one cgroup to another, > we've got a shared inode case. This probably won't happen more than > once for most tasks, but it will likely be common. That's not a shared case, that's a transfer of ownership. If the task changes groups, you have to charge all it's pages to the new group, right? Otherwise you've got a problem where a task that is not part of a specific cgroup is still somewhat controlled by it's previous cgroup. It would also still influence that previous group even though it's no longer a member. Not good for isolation purposes. And if you are transfering the state, moving the inode from the dirty list of one cgroup to another is trivial and avoids any need for the dirty state to be shared.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 23:08 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Dave Chinner @ 2011-04-07 20:04 ` Vivek Goyal 2011-04-07 23:47 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-07 20:04 UTC (permalink / raw) To: Dave Chinner Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote: [..] > > At the very least, when a task is moved from one cgroup to another, > > we've got a shared inode case. This probably won't happen more than > > once for most tasks, but it will likely be common. > > That's not a shared case, that's a transfer of ownership. If the > task changes groups, you have to charge all it's pages to the new > group, right? Otherwise you've got a problem where a task that is > not part of a specific cgroup is still somewhat controlled by it's > previous cgroup. It would also still influence that previous group > even though it's no longer a member. Not good for isolation purposes. > > And if you are transfering the state, moving the inode from the > dirty list of one cgroup to another is trivial and avoids any need > for the dirty state to be shared.... I am wondering how do you map a task to an inode. Multiple tasks in the group might have written to same inode. Now which task owns it? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-07 20:04 ` Vivek Goyal @ 2011-04-07 23:47 ` Dave Chinner 2011-04-08 13:50 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-07 23:47 UTC (permalink / raw) To: Vivek Goyal Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 04:04:37PM -0400, Vivek Goyal wrote: > On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote: > > [..] > > > At the very least, when a task is moved from one cgroup to another, > > > we've got a shared inode case. This probably won't happen more than > > > once for most tasks, but it will likely be common. > > > > That's not a shared case, that's a transfer of ownership. If the > > task changes groups, you have to charge all it's pages to the new > > group, right? Otherwise you've got a problem where a task that is > > not part of a specific cgroup is still somewhat controlled by it's > > previous cgroup. It would also still influence that previous group > > even though it's no longer a member. Not good for isolation purposes. > > > > And if you are transfering the state, moving the inode from the > > dirty list of one cgroup to another is trivial and avoids any need > > for the dirty state to be shared.... > > I am wondering how do you map a task to an inode. Multiple tasks in the > group might have written to same inode. Now which task owns it? That sounds like a completely broken configuration to me. If you are using cgroups for isolation, you simple do not share *anything* between them. Right now the only use case that has been presented for shared inodes is transfering a task from one cgroup to another. Why on earth would you do that if it is sharing resources with other tasks in the original cgroup? What use case does this represent, how often is it likely to happen, and who cares about it anyway? Let's not overly complicate things by making up requirements that nobody cares about.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-07 23:47 ` Dave Chinner @ 2011-04-08 13:50 ` Vivek Goyal 2011-04-11 1:05 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-08 13:50 UTC (permalink / raw) To: Dave Chinner Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Fri, Apr 08, 2011 at 09:47:17AM +1000, Dave Chinner wrote: > On Thu, Apr 07, 2011 at 04:04:37PM -0400, Vivek Goyal wrote: > > On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote: > > > > [..] > > > > At the very least, when a task is moved from one cgroup to another, > > > > we've got a shared inode case. This probably won't happen more than > > > > once for most tasks, but it will likely be common. > > > > > > That's not a shared case, that's a transfer of ownership. If the > > > task changes groups, you have to charge all it's pages to the new > > > group, right? Otherwise you've got a problem where a task that is > > > not part of a specific cgroup is still somewhat controlled by it's > > > previous cgroup. It would also still influence that previous group > > > even though it's no longer a member. Not good for isolation purposes. > > > > > > And if you are transfering the state, moving the inode from the > > > dirty list of one cgroup to another is trivial and avoids any need > > > for the dirty state to be shared.... > > > > I am wondering how do you map a task to an inode. Multiple tasks in the > > group might have written to same inode. Now which task owns it? > > That sounds like a completely broken configuration to me. If you are > using cgroups for isolation, you simple do not share *anything* > between them. > > Right now the only use case that has been presented for shared > inodes is transfering a task from one cgroup to another. Moving applications dynamically across cgroups happens quite often, just to put a task in the right cgroup after it has been launched, or if a task has been running for some time and the system admin decides that it is causing heavy IO impacting other cgroups' IO. Then the system admin might move it into a separate cgroup on the fly. > Why on > earth would you do that if it is sharing resources with other tasks > in the original cgroup? What use case does this represent, how often > is it likely to happen, and who cares about it anyway? > > Let's not overly complicate things by making up requirements that > nobody cares about.... Ok, so you are suggesting that we always assume that only one task has written pages to an inode, and if that's not the case it is a broken configuration. So if a task moves across cgroups, determine the pages and associated inodes and move everything to the new cgroup. If the inode happened to be shared, then the inode moves irrespective of the fact that somebody else was also doing IO to it. I guess that's a reasonable first step. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-08 13:50 ` Vivek Goyal @ 2011-04-11 1:05 ` Dave Chinner 0 siblings, 0 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-11 1:05 UTC (permalink / raw) To: Vivek Goyal Cc: Curt Wohlgemuth, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Fri, Apr 08, 2011 at 09:50:58AM -0400, Vivek Goyal wrote: > On Fri, Apr 08, 2011 at 09:47:17AM +1000, Dave Chinner wrote: > > On Thu, Apr 07, 2011 at 04:04:37PM -0400, Vivek Goyal wrote: > > > On Thu, Apr 07, 2011 at 09:08:04AM +1000, Dave Chinner wrote: > > > > > > [..] > > > > > At the very least, when a task is moved from one cgroup to another, > > > > > we've got a shared inode case. This probably won't happen more than > > > > > once for most tasks, but it will likely be common. > > > > > > > > That's not a shared case, that's a transfer of ownership. If the > > > > task changes groups, you have to charge all it's pages to the new > > > > group, right? Otherwise you've got a problem where a task that is > > > > not part of a specific cgroup is still somewhat controlled by it's > > > > previous cgroup. It would also still influence that previous group > > > > even though it's no longer a member. Not good for isolation purposes. > > > > > > > > And if you are transfering the state, moving the inode from the > > > > dirty list of one cgroup to another is trivial and avoids any need > > > > for the dirty state to be shared.... > > > > > > I am wondering how do you map a task to an inode. Multiple tasks in the > > > group might have written to same inode. Now which task owns it? > > > > That sounds like a completely broken configuration to me. If you are > > using cgroups for isolation, you simple do not share *anything* > > between them. > > > > Right now the only use case that has been presented for shared > > inodes is transfering a task from one cgroup to another. > > Moving applications dynamically across cgroups happens quite often > just to put task in right cgroup after it has been launched If it's just been launched, it won't have dirtied very many files so I think shared dirty inodes for this use case is not an issue. > or if > a task has been running for sometime and system admin decides that > it is causing heavy IO impacting other cgroup's IO. Then system > admin might move it into a separate cgroup on the fly. And I'd expect manual load balancing to be the exception rather than the rule. Even so, if that process is doing lots of IO to the same file as other tasks that it is interfering with, then there's an application level problem there.... > > Why on > > earth would you do that if it is sharing resources with other tasks > > in the original cgroup? What use case does this represent, how often > > is it likely to happen, and who cares about it anyway? > > > > > Let's not overly complicate things by making up requirements that > > nobody cares about.... > > Ok, so you are suggesting that always assume that only one task has > written pages to inode and if that's not the case it is broken > cofiguration. Not broken, but initially unsupported. > So if a task moves across cgroups, determine the pages and associated > inodes and move everything to the new cgroup. If inode happend to be > shared, then inode moves irrespective of the fact somebody else also > was doing IO to it. I guess reasonable first step. 
It seems like the simplest way to start - once we have code that works doing the simple things right we can start to complicate it ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
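A tiny user-space illustration of the "transfer of ownership" step agreed on above: when a task changes cgroups, a dirty inode it owns is simply unlinked from the old memcg's dirty list and relinked onto the new one, so no dirty state ever has to be shared between the two groups. The structures and names here are made up for the example and are not kernel code.

#include <stdio.h>

#define MAX_INODES 8

struct memcg {
    int id;
    int dirty[MAX_INODES];   /* inode numbers on this memcg's dirty list */
    int ndirty;
};

/* Move one dirty inode from the old memcg's list to the new one.
 * This models the transfer-of-ownership idea: the inode just changes
 * lists when the task that owns it changes cgroups. */
static void transfer_dirty_inode(struct memcg *from, struct memcg *to, int ino)
{
    for (int i = 0; i < from->ndirty; i++) {
        if (from->dirty[i] == ino) {
            from->dirty[i] = from->dirty[--from->ndirty]; /* unlink */
            to->dirty[to->ndirty++] = ino;                /* relink */
            return;
        }
    }
}

int main(void)
{
    struct memcg old_grp = { .id = 1, .dirty = { 42 }, .ndirty = 1 };
    struct memcg new_grp = { .id = 2 };

    transfer_dirty_inode(&old_grp, &new_grp, 42);  /* task (and inode 42) moved */
    printf("memcg %d now has %d dirty inode(s), memcg %d has %d\n",
           old_grp.id, old_grp.ndirty, new_grp.id, new_grp.ndirty);
    return 0;
}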
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-05 22:56 ` Dave Chinner 2011-04-06 14:49 ` Curt Wohlgemuth @ 2011-04-06 15:37 ` Vivek Goyal 2011-04-06 16:08 ` Vivek Goyal 2011-04-06 23:50 ` Dave Chinner 1 sibling, 2 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-06 15:37 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote: > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: > > On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote: > > > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote: > > > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote: > > > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote: > > > > > > There > > > > > > is no context (memcg or otherwise) given to the bdi flusher. After > > > > > > the bdi flusher checks system-wide background limits, it uses the > > > > > > over_bg_limit list to find (and rotate) an over limit memcg. Using > > > > > > the memcg, then the per memcg per bdi dirty inode list is walked to > > > > > > find inode pages to writeback. Once the memcg dirty memory usage > > > > > > drops below the memcg-thresh, the memcg is removed from the global > > > > > > over_bg_limit list. > > > > > > > > > > If you want controlled hand-off of writeback, you need to pass the > > > > > memcg that triggered the throttling directly to the bdi. You already > > > > > know what both the bdi and memcg that need writeback are. Yes, this > > > > > needs concurrency at the BDI flush level to handle, but see my > > > > > previous email in this thread for that.... > > > > > > > > > > > > > Even with memcg being passed around I don't think that we get rid of > > > > global list lock. > ..... > > > > The reason being that inodes are not exclusive to > > > > the memory cgroups. Multiple memory cgroups might be writting to same > > > > inode. So inode still remains in the global list and memory cgroups > > > > kind of will have pointer to it. > > > > > > So two dirty inode lists that have to be kept in sync? That doesn't > > > sound particularly appealing. Nor does it scale to an inode being > > > dirty in multiple cgroups > > > > > > Besides, if you've got multiple memory groups dirtying the same > > > inode, then you cannot expect isolation between groups. I'd consider > > > this a broken configuration in this case - how often does this > > > actually happen, and what is the use case for supporting > > > it? > > > > > > Besides, the implications are that we'd have to break up contiguous > > > IOs in the writeback path simply because two sequential pages are > > > associated with different groups. That's really nasty, and exactly > > > the opposite of all the write combining we try to do throughout the > > > writeback path. Supporting this is also a mess, as we'd have to touch > > > quite a lot of filesystem code (i.e. .writepage(s) inplementations) > > > to do this. > > > > We did not plan on breaking up contigous IO even if these belonged to > > different cgroup for performance reason. So probably can live with some > > inaccuracy and just trigger the writeback for one inode even if that > > meant that it could writeback the pages of some other cgroups doing IO > > on that inode. > > Which, to me, violates the principle of isolation as it's been > described that this functionality is supposed to provide. 
> > It also means you will have handle the case of a cgroup over a > throttle limit and no inodes on it's dirty list. It's not a case of > "probably can live with" the resultant mess, the mess will occur and > so handling it needs to be designed in from the start. This behavior can happen due to shared page accounting. One possible way to mitigate this problem is to traverse through the LRU list of pages of the memcg and find an inode to do the writeback. > > > > > So to start writeback on an inode > > > > you still shall have to take global lock, IIUC. > > > > > > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes > > > in cgroup, and go from there? I mean, really all that cgroup-aware > > > writeback needs is just adding a new container for managing > > > dirty inodes in the writeback path and a method for selecting that > > > container for writeback, right? > > > > This was the initial design where one inode is associated with one cgroup > > even if process from multiple cgroups are doing IO to same inode. Then > > somebody raised the concern that it probably is too coarse. > > Got a pointer? This was briefly discussed at the last LSF and some people seemed to like the idea of associating an inode with one cgroup. I guess a database would be a case where a large file can be shared by multiple processes? Now one can argue why one would put all these processes in separate cgroups. Anyway, I am not arguing for solving the case of shared inodes. I personally prefer the first simple step of an inode being associated with one memcg, and if we run into issues due to shared inodes, then look into how to solve this problem. > > > IMHO, as a first step, associating inode to one cgroup exclusively > > simplifies the things considerably and we can target that first. > > > > So yes, I agree that bdi->list_of_dirty_cgroups->list_of_drity_inodes > > makes sense and is relatively simple way of doing things at the expense > > of not being accurate for shared inode case. > > Can someone describe a valid shared inode use case? If not, we > should not even consider it as a requirement and explicitly document > it as a "not supported" use case. I asked the same question yesterday at the LSF session and we don't have any good workload example yet. > > As it is, I'm hearing different ideas and requirements from the > people working on the memcg side of this vs the IO controller side. > Perhaps the first step is documenting a common set of functional > requirements that demonstrates how everything will play well > together? > > e.g. Defining what isolation means, when and if it can be violated, > how violations are handled, > when inodes in multiple memcgs are > acceptable and how they need to be accounted and handled by the > writepage path, After yesterday's discussion it looked like people agreed that to begin with we keep it simple and maintain the notion of one inode on one memcg list. So instead of the inode being on the global bdi dirty list it will be on a per-memcg per-bdi dirty list. Greg, would you like to elaborate more on the design? > how memcg's over the dirty threshold with no dirty > inodes are to be handled, As I said above, one of the proposals was to traverse through the LRU list of the memcg if the memcg is above its dirty ratio and there are no inodes on that memcg. Maybe there are other better ways to handle this. > how metadata IO is going to be handled by > IO controllers, So the IO controller provides two mechanisms:
- IO throttling (bytes_per_second, io_per_second interface)
- Proportional weight disk sharing

In case of proportional weight disk sharing, we don't run into issues of priority inversion and metadata handling should not be a concern. For the throttling case, apart from metadata, I found that with simple throttling of data I ran into issues with journalling with ext4 mounted in ordered mode. So it was suggested that WRITE IO throttling should not be done at the device level; instead, try to do it in higher layers, possibly balance_dirty_pages(), and throttle the process early. So yes, I agree that a little more documentation and more clarity on this would be good. All this cgroup aware writeback primarily is being done for CFQ's proportional disk sharing at the moment. > what kswapd is going to do writeback when the pages > it's trying to writeback during a critical low memory event belong > to a cgroup that is throttled at the IO level, etc. Throttling will move up so kswapd will not be throttled. Even today, kswapd is part of the root group and we do not suggest throttling the root group. For the case of proportional disk sharing, we will probably account IO to the respective cgroups (pages submitted by kswapd) and that should flush to disk fairly fast and should not block for a long time as it is a work-conserving mechanism. Do you see an issue with kswapd IO being accounted to the respective cgroups for proportional IO? For the throttling case, all IO would go to the root group, which is unthrottled, and the real issue of dirtying too many pages by processes will be handled by throttling processes when they are dirtying the page cache. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
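For what it's worth, a rough user-space model of the fallback mentioned above, where a memcg is over its background dirty limit but has no inodes on its own per-memcg dirty list (because the shared inodes were charged to another memcg): walk that memcg's page LRU until a dirty page is found and hand the owning inode to the bdi flusher. All names are invented for the example, and the cost of such a walk is exactly what gets questioned later in the thread.

#include <stdio.h>

#define NPAGES 6

struct page { int inode; int dirty; };

/* Per-memcg page LRU, oldest first. */
struct memcg {
    struct page lru[NPAGES];
    int over_bg_limit;
};

/* Fallback inode selection: scan the memcg LRU for a dirty page and
 * return the inode that owns it, so the bdi flusher (not the dirtying
 * task itself) can be asked to write that inode back. Returns -1 if
 * the memcg has no dirty pages at all. */
static int pick_inode_from_lru(const struct memcg *m)
{
    for (int i = 0; i < NPAGES; i++)
        if (m->lru[i].dirty)
            return m->lru[i].inode;
    return -1;
}

int main(void)
{
    struct memcg m = {
        .lru = { {10, 0}, {10, 0}, {37, 1}, {37, 1}, {99, 0}, {37, 1} },
        .over_bg_limit = 1,
    };

    if (m.over_bg_limit)
        printf("ask flusher to write back inode %d\n", pick_inode_from_lru(&m));
    return 0;
}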
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 15:37 ` Vivek Goyal @ 2011-04-06 16:08 ` Vivek Goyal 2011-04-06 17:10 ` Jan Kara 2011-04-06 23:50 ` Dave Chinner 1 sibling, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-06 16:08 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: [..] > > what kswapd is going to do writeback when the pages > > it's trying to writeback during a critical low memory event belong > > to a cgroup that is throttled at the IO level, etc. > > Throttling will move up so kswapd will not be throttled. Even today, > kswapd is part of root group and we do not suggest throttling root group. > > For the case of proportional disk sharing, we will probably account > IO to respective cgroups (pages submitted by kswapd) and that should > not flush to disk fairly fast and should not block for long time as it is > work consering mechanism. > > Do you see an issue with kswapd IO being accounted to respective cgroups > for proportional IO. For throttling case, all IO would go to root group > which is unthrottled and real issue of dirtying too many pages by > processes will be handled by throttling processes when they are dirtying > page cache. Or may be it is not a good idea to try to account pages to associated cgroups when memory is low and kswapd is doing IO. We can probably mark kswapd with some flag and account all IO to root group even for proportional weight mechanism. In this case isolation will be broken but I guess one can not do much. To avoid this situation, one should not have allowed too many writes and I think that's where low dirty ratio can come into the picture. I thought one of the use case of this was that a high prio buffered writer should be able to do more writes than a low prio writer. That I think should be possible by accounting flusher writes. Dave you have any suggestions on how to handle this? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 16:08 ` Vivek Goyal @ 2011-04-06 17:10 ` Jan Kara 2011-04-06 17:14 ` Curt Wohlgemuth 2011-04-08 1:58 ` Dave Chinner 0 siblings, 2 replies; 138+ messages in thread From: Jan Kara @ 2011-04-06 17:10 UTC (permalink / raw) To: Vivek Goyal; +Cc: Dave Chinner, James Bottomley, lsf, linux-fsdevel On Wed 06-04-11 12:08:05, Vivek Goyal wrote: > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: > > [..] > > > what kswapd is going to do writeback when the pages > > > it's trying to writeback during a critical low memory event belong > > > to a cgroup that is throttled at the IO level, etc. > > > > Throttling will move up so kswapd will not be throttled. Even today, > > kswapd is part of root group and we do not suggest throttling root group. > > > > For the case of proportional disk sharing, we will probably account > > IO to respective cgroups (pages submitted by kswapd) and that should > > not flush to disk fairly fast and should not block for long time as it is > > work consering mechanism. > > > > Do you see an issue with kswapd IO being accounted to respective cgroups > > for proportional IO. For throttling case, all IO would go to root group > > which is unthrottled and real issue of dirtying too many pages by > > processes will be handled by throttling processes when they are dirtying > > page cache. > > Or may be it is not a good idea to try to account pages to associated > cgroups when memory is low and kswapd is doing IO. We can probably mark > kswapd with some flag and account all IO to root group even for > proportional weight mechanism. In this case isolation will be broken but > I guess one can not do much. To avoid this situation, one should not > have allowed too many writes and I think that's where low dirty ratio > can come into the picture. Well, I wouldn't bother too much with kswapd handling. MM people plan to get rid of writeback from direct reclaim and just remove the dirty page from LRU and recycle it once flusher thread writes it... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 17:10 ` Jan Kara @ 2011-04-06 17:14 ` Curt Wohlgemuth 2011-04-08 1:58 ` Dave Chinner 1 sibling, 0 replies; 138+ messages in thread From: Curt Wohlgemuth @ 2011-04-06 17:14 UTC (permalink / raw) To: Jan Kara; +Cc: Vivek Goyal, Dave Chinner, James Bottomley, lsf, linux-fsdevel On Wed, Apr 6, 2011 at 10:10 AM, Jan Kara <jack@suse.cz> wrote: > On Wed 06-04-11 12:08:05, Vivek Goyal wrote: >> On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: >> >> [..] >> > > what kswapd is going to do writeback when the pages >> > > it's trying to writeback during a critical low memory event belong >> > > to a cgroup that is throttled at the IO level, etc. >> > >> > Throttling will move up so kswapd will not be throttled. Even today, >> > kswapd is part of root group and we do not suggest throttling root group. >> > >> > For the case of proportional disk sharing, we will probably account >> > IO to respective cgroups (pages submitted by kswapd) and that should >> > not flush to disk fairly fast and should not block for long time as it is >> > work consering mechanism. >> > >> > Do you see an issue with kswapd IO being accounted to respective cgroups >> > for proportional IO. For throttling case, all IO would go to root group >> > which is unthrottled and real issue of dirtying too many pages by >> > processes will be handled by throttling processes when they are dirtying >> > page cache. >> >> Or may be it is not a good idea to try to account pages to associated >> cgroups when memory is low and kswapd is doing IO. We can probably mark >> kswapd with some flag and account all IO to root group even for >> proportional weight mechanism. In this case isolation will be broken but >> I guess one can not do much. To avoid this situation, one should not >> have allowed too many writes and I think that's where low dirty ratio >> can come into the picture. > Well, I wouldn't bother too much with kswapd handling. MM people plan to > get rid of writeback from direct reclaim and just remove the dirty page > from LRU and recycle it once flusher thread writes it... But still, it matters which memcg is "responsible" for the background writeout from direct reclaim. One could argue that direct reclaim should just specify the root cgroup... Curt > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 17:10 ` Jan Kara 2011-04-06 17:14 ` Curt Wohlgemuth @ 2011-04-08 1:58 ` Dave Chinner 2011-04-19 14:26 ` Wu Fengguang 1 sibling, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-08 1:58 UTC (permalink / raw) To: Jan Kara; +Cc: Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 07:10:17PM +0200, Jan Kara wrote: > On Wed 06-04-11 12:08:05, Vivek Goyal wrote: > > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: > Well, I wouldn't bother too much with kswapd handling. MM people plan to > get rid of writeback from direct reclaim and just remove the dirty page > from LRU and recycle it once flusher thread writes it... kswapd is not in the direct reclaim path - it's the background memory reclaim path. Writeback from direct reclaim is a problem because of stack usage, and that problem doesn't exist for kswapd. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-08 1:58 ` Dave Chinner @ 2011-04-19 14:26 ` Wu Fengguang 0 siblings, 0 replies; 138+ messages in thread From: Wu Fengguang @ 2011-04-19 14:26 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, Vivek Goyal, James Bottomley, lsf, linux-fsdevel On Fri, Apr 08, 2011 at 11:58:41AM +1000, Dave Chinner wrote: > On Wed, Apr 06, 2011 at 07:10:17PM +0200, Jan Kara wrote: > > On Wed 06-04-11 12:08:05, Vivek Goyal wrote: > > > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: > > Well, I wouldn't bother too much with kswapd handling. MM people plan to > > get rid of writeback from direct reclaim and just remove the dirty page > > from LRU and recycle it once flusher thread writes it... > > kswapd is not in the direct reclaim path - it's the background > memory reclaim path. Writeback from direct reclaim is a problem > because of stack usage, and that problem doesn't exist for kswapd. FYI the IO initiated from pageout() in kswapd/direct reclaim can mostly be transfered to the flushers. Here is the early RFC patch, and I'll submit an update soon. http://www.spinics.net/lists/linux-mm/msg09199.html Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 15:37 ` Vivek Goyal 2011-04-06 16:08 ` Vivek Goyal @ 2011-04-06 23:50 ` Dave Chinner 2011-04-07 17:55 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-06 23:50 UTC (permalink / raw) To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote: > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: > > It also means you will have handle the case of a cgroup over a > > throttle limit and no inodes on it's dirty list. It's not a case of > > "probably can live with" the resultant mess, the mess will occur and > > so handling it needs to be designed in from the start. > > This behavior can happen due to shared page accounting. One possible > way to mitigate this problme is to traverse through LRU list of pages > of memcg and find an inode to do the writebak. Page LRU ordered writeback is something we need to avoid. It causes havoc with IO and allocation patterns. Also, how expensive is such a walk? If it's a common operation, then it's a non-starter for the generic writeback code. BTW, how is "shared page accounting" different to the shared dirty inode case we've been discussing? > After yesterday's discussion it looked like people agreed that to > begin with keep it simple and maintain the notion of one inode on > one memcg list. So instead of inode being on global bdi dirty list > it will be on per memecg per bdi dirty list. Good to hear. > > how metadata IO is going to be handled by > > IO controllers, > > So IO controller provides two mechanisms. > > - IO throttling(bytes_per_second, io_per_second interface) > - Proportional weight disk sharing > > In case of proportional weight disk sharing, we don't run into issues of > priority inversion and metadata handing should not be a concern. Though metadata IO will affect how much bandwidth/iops is available for applications to use. > For throttling case, apart from metadata, I found that with simple > throttling of data I ran into issues with journalling with ext4 mounuted > in ordered mode. So it was suggested that WRITE IO throttling should > not be done at device level instead try to do it in higher layers, > possibly balance_dirty_pages() and throttle process early. The problem with doing it at the page cache entry level is that cache hits then get throttled. It's not really an IO controller at that point, and the impact on application performance could be huge (i.e. MB/s instead of GB/s). > So yes, I agree that little more documentation and more clarity on this > would be good. All this cgroup aware writeback primarily is being done > for CFQ's proportional disk sharing at the moment. > > > what kswapd is going to do writeback when the pages > > > it's trying to writeback during a critical low memory event belong > > > to a cgroup that is throttled at the IO level, etc. > > > > Throttling will move up so kswapd will not be throttled. Even today, > > kswapd is part of root group and we do not suggest throttling root group. So once again you have the problem of writeback from kswapd (which is ugly to begin with) affecting all the groups. Given kswapd likes to issue what is effectively random IO, this could have a devastating effect on everything else....
> For the case of proportional disk sharing, we will probably account > IO to respective cgroups (pages submitted by kswapd) and that should > not flush to disk fairly fast and should not block for long time as it is > work consering mechanism. Well, it depends. I can still see how, with proportional IO, kswapd would get slowed cleaning dirty pages on one memcg when there are clean pages in another memcg that it could reclaim without doing any IO. i.e. it has potential to slow down memory reclaim significantly. (Note, I'm assuming proportional IO doesn't mean "no throttling" it just means there is a much lower delay on IO issue.) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-06 23:50 ` Dave Chinner @ 2011-04-07 17:55 ` Vivek Goyal 2011-04-11 1:36 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-07 17:55 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 09:50:39AM +1000, Dave Chinner wrote: > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: > > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote: > > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: > > > It also means you will have handle the case of a cgroup over a > > > throttle limit and no inodes on it's dirty list. It's not a case of > > > "probably can live with" the resultant mess, the mess will occur and > > > so handling it needs to be designed in from the start. > > > > This behavior can happen due to shared page accounting. One possible > > way to mitigate this problme is to traverse through LRU list of pages > > of memcg and find an inode to do the writebak. > > Page LRU ordered writeback is something we need to avoid. It causes > havok with IO and allocation patterns. Also, how expensive is such a > walk? If it's a common operation, then it's a non-starter for the > generic writeback code. > Agreed that LRU ordered writeback needs to be avoided as it is going to be expensive. That's why the notion of inode on per memcg list and do per inode writeback. This seems to be only backup plan in case when there is no inode to do IO due to shared inode accounting issues. Do you have ideas on better way to handle it? The other proposal of maintaining a list memcg_mapping, which tracks which inode this cgroup has dirtied has been deemed complex and been kind of rejected at least for the first step. > BTW, how is "shared page accounting" different to the shared dirty > inode case we've been discussing? IIUC, there are two problems. - Issues because of shared page accounting - Issues because of shared inode accouting. So in shared page accounting, if two process do IO to same page, IO gets charged to cgroup who first touched the page. So if a cgroup is writting on lots of shared pages, it will be charged to the other cgroup who brought the page in memory to begin with and will drive its dirty ratio up. So this seems to be case of weaker isolation in case of shared pages, and we got to live with it. Similarly if inode is shared, inode gets put on the list of memcg who dirtied it first. So now if two cgroups are dirtying pages on inode, then pages should be charged to respective cgroup but inode will be only on one memcg and once writeback is performed it might happen that cgroup is over its background limit but there are no inodes to do writeback. > > > After yesterday's discussion it looked like people agreed that to > > begin with keep it simple and maintain the notion of one inode on > > one memcg list. So instead of inode being on global bdi dirty list > > it will be on per memecg per bdi dirty list. > > Good to hear. > > > > how metadata IO is going to be handled by > > > IO controllers, > > > > So IO controller provides two mechanisms. > > > > - IO throttling(bytes_per_second, io_per_second interface) > > - Proportional weight disk sharing > > > > In case of proportional weight disk sharing, we don't run into issues of > > priority inversion and metadata handing should not be a concern. 
> > Though metadata IO will affect how much bandwidth/iops is available > for applications to use. I think meta data IO will be accounted to the process submitting the meta data IO. (IO tracking stuff will be used only for page cache pages during page dirtying time). So yes, the process doing meta data IO will be charged for it. I think I am missing something here and not understanding your concern exactly here. > > > For throttling case, apart from metadata, I found that with simple > > throttling of data I ran into issues with journalling with ext4 mounuted > > in ordered mode. So it was suggested that WRITE IO throttling should > > not be done at device level instead try to do it in higher layers, > > possibly balance_dirty_pages() and throttle process early. > > The problem with doing it at the page cache entry level is that > cache hits then get throttled. It's not really a an IO controller at > that point, and the impact on application performance could be huge > (i.e. MB/s instead of GB/s). Agreed that throttling cache hits is not a good idea. Can we determine if page being asked for is in cache or not and charge for IO accordingly. > > > So yes, I agree that little more documentation and more clarity on this > > would be good. All this cgroup aware writeback primarily is being done > > for CFQ's proportional disk sharing at the moment. > > > > > what kswapd is going to do writeback when the pages > > > it's trying to writeback during a critical low memory event belong > > > to a cgroup that is throttled at the IO level, etc. > > > > Throttling will move up so kswapd will not be throttled. Even today, > > kswapd is part of root group and we do not suggest throttling root group. > > So once again you have the problem of writeback from kswapd (which > is ugly to begin with) affecting all the groups. Given kswapd likes > to issue what is effectively random IO, this coul dhave devastating > effect on everything else.... Implementing throttling at higher layer has the problem of IO spikes at the end level devices when flusher or kswapd decide to do bunch of IO. I really don't have a good answer for that. Doing throttling at device level runs into issues with journalling. So I guess issues of IO spikes is lesser concern as compared to issue of choking filesystem. Following two things might help though a bit with IO spikes. - Keep per cgroup background dirty ratio low so that flusher tries to flush out pages sooner than later. - All the IO coming from flusher/kswapd will be going in root group from throttling perspective. We can try to throttle it again to some reasonable value to reduce the impact of IO spikes. Ideas to handle this better? > > > For the case of proportional disk sharing, we will probably account > > IO to respective cgroups (pages submitted by kswapd) and that should > > not flush to disk fairly fast and should not block for long time as it is > > work consering mechanism. > > Well, it depends. I can still see how, with proportional IO, kswapd > would get slowed cleaning dirty pages on one memcg when there are > clean pages in another memcg that it could reclaim without doing any > IO. i.e. it has potential to slow down memory reclaim significantly. > (Note, I'm assuming proportional IO doesn't mean "no throttling" it > just means there is a much lower delay on IO issue.) Proportional IO can delay submitting an IO only if there is IO happening in other groups. So IO can still be throttled and limits are decided by fair share of a group. 
But if other groups are not doing IO and not using their fair share, then the group doing IO gets a bigger share. So yes, if heavy IO is happening at the disk while kswapd is also trying to reclaim memory, then IO submitted by kswapd can be delayed and this can slow down reclaim. (Does kswapd have to block after submitting IO from a memcg? Can't it just move on to the next memcg and either free pages if they are not dirty, or also submit IO from the next memcg?) Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-07 17:55 ` Vivek Goyal @ 2011-04-11 1:36 ` Dave Chinner 2011-04-15 21:07 ` Vivek Goyal 2011-04-19 14:17 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang 0 siblings, 2 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-11 1:36 UTC (permalink / raw) To: Vivek Goyal; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Thu, Apr 07, 2011 at 01:55:37PM -0400, Vivek Goyal wrote: > On Thu, Apr 07, 2011 at 09:50:39AM +1000, Dave Chinner wrote: > > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote: > > > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote: > > > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote: > > > > It also means you will have handle the case of a cgroup over a > > > > throttle limit and no inodes on it's dirty list. It's not a case of > > > > "probably can live with" the resultant mess, the mess will occur and > > > > so handling it needs to be designed in from the start. > > > > > > This behavior can happen due to shared page accounting. One possible > > > way to mitigate this problme is to traverse through LRU list of pages > > > of memcg and find an inode to do the writebak. > > > > Page LRU ordered writeback is something we need to avoid. It causes > > havok with IO and allocation patterns. Also, how expensive is such a > > walk? If it's a common operation, then it's a non-starter for the > > generic writeback code. > > > > Agreed that LRU ordered writeback needs to be avoided as it is going to be > expensive. That's why the notion of inode on per memcg list and do per inode > writeback. This seems to be only backup plan in case when there is no > inode to do IO due to shared inode accounting issues. This shouldn't be hidden inside memcg relaim - memcg reclaim should do exactly what the MM subsystem normal does without memcg being in the picture. That is, you need to convince the MM guys to change the way reclaim does writeback from the LRU. We've been asking them to do this for years.... > Do you have ideas on better way to handle it? The other proposal of > maintaining a list memcg_mapping, which tracks which inode this cgroup > has dirtied has been deemed complex and been kind of rejected at least > for the first step. Fix the mm subsystem to DTRT first? > > > BTW, how is "shared page accounting" different to the shared dirty > > inode case we've been discussing? > > IIUC, there are two problems. > > - Issues because of shared page accounting > - Issues because of shared inode accouting. > > So in shared page accounting, if two process do IO to same page, IO gets > charged to cgroup who first touched the page. So if a cgroup is writting > on lots of shared pages, it will be charged to the other cgroup who > brought the page in memory to begin with and will drive its dirty ratio > up. So this seems to be case of weaker isolation in case of shared pages, > and we got to live with it. > > Similarly if inode is shared, inode gets put on the list of memcg who dirtied > it first. So now if two cgroups are dirtying pages on inode, then pages should > be charged to respective cgroup but inode will be only on one memcg and once > writeback is performed it might happen that cgroup is over its background > limit but there are no inodes to do writeback. 
> > > > > > After yesterday's discussion it looked like people agreed that to > > > begin with keep it simple and maintain the notion of one inode on > > > one memcg list. So instead of inode being on global bdi dirty list > > > it will be on per memecg per bdi dirty list. > > > > Good to hear. > > > > > > how metadata IO is going to be handled by > > > > IO controllers, > > > > > > So IO controller provides two mechanisms. > > > > > > - IO throttling(bytes_per_second, io_per_second interface) > > > - Proportional weight disk sharing > > > > > > In case of proportional weight disk sharing, we don't run into issues of > > > priority inversion and metadata handing should not be a concern. > > > > Though metadata IO will affect how much bandwidth/iops is available > > for applications to use. > > I think meta data IO will be accounted to the process submitting the meta > data IO. (IO tracking stuff will be used only for page cache pages during > page dirtying time). So yes, the process doing meta data IO will be > charged for it. > > I think I am missing something here and not understanding your concern > exactly here. XFS can issue thousands of delayed metadata write IO per second from it's writeback threads when it needs to (e.g. tail pushing the journal). Completely unthrottled due to the context they are issued from(*) and can basically consume all the disk iops and bandwidth capacity for seconds at a time. Also, XFS doesn't use the page cache for metadata buffers anymore so page cache accounting, throttling and reclaim mechanisms are never going to work for controlling XFS metadata IO (*) It'll be IO issued by workqueues rather than threads RSN: http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39 And this will become _much_ more common in the not-to-distant future. So context passing between threads and to workqueues is something you need to think about sooner rather than later if you want metadata IO to be throttled in any way.... > > > For throttling case, apart from metadata, I found that with simple > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > not be done at device level instead try to do it in higher layers, > > > possibly balance_dirty_pages() and throttle process early. > > > > The problem with doing it at the page cache entry level is that > > cache hits then get throttled. It's not really a an IO controller at > > that point, and the impact on application performance could be huge > > (i.e. MB/s instead of GB/s). > > Agreed that throttling cache hits is not a good idea. Can we determine > if page being asked for is in cache or not and charge for IO accordingly. You'd need hooks in find_or_create_page(), though you have no context of whether a read or a write is in progress at that point. > > > So yes, I agree that little more documentation and more clarity on this > > > would be good. All this cgroup aware writeback primarily is being done > > > for CFQ's proportional disk sharing at the moment. > > > > > > > what kswapd is going to do writeback when the pages > > > > it's trying to writeback during a critical low memory event belong > > > > to a cgroup that is throttled at the IO level, etc. > > > > > > Throttling will move up so kswapd will not be throttled. Even today, > > > kswapd is part of root group and we do not suggest throttling root group. 
> > > > So once again you have the problem of writeback from kswapd (which > > is ugly to begin with) affecting all the groups. Given kswapd likes > > to issue what is effectively random IO, this coul dhave devastating > > effect on everything else.... > > Implementing throttling at higher layer has the problem of IO spikes > at the end level devices when flusher or kswapd decide to do bunch of > IO. I really don't have a good answer for that. Doing throttling at > device level runs into issues with journalling. So I guess issues of > IO spikes is lesser concern as compared to issue of choking filesystem. > > Following two things might help though a bit with IO spikes. > > - Keep per cgroup background dirty ratio low so that flusher tries to > flush out pages sooner than later. Which has major performance impacts. > > - All the IO coming from flusher/kswapd will be going in root group > from throttling perspective. We can try to throttle it again to > some reasonable value to reduce the impact of IO spikes. Don't do writeback from kswapd at all? Push it all to the flusher thread which has a context to work from? > > > For the case of proportional disk sharing, we will probably account > > > IO to respective cgroups (pages submitted by kswapd) and that should > > > not flush to disk fairly fast and should not block for long time as it is > > > work consering mechanism. > > > > Well, it depends. I can still see how, with proportional IO, kswapd > > would get slowed cleaning dirty pages on one memcg when there are > > clean pages in another memcg that it could reclaim without doing any > > IO. i.e. it has potential to slow down memory reclaim significantly. > > (Note, I'm assuming proportional IO doesn't mean "no throttling" it > > just means there is a much lower delay on IO issue.) > > Proportional IO can delay submitting an IO only if there is IO happening > in other groups. So IO can still be throttled and limits are decided > by fair share of a group. But if other groups are not doing IO and not > using their fair share, then the group doing IO gets bigger share. > > So yes, if heavy IO is happening at disk while kswapd is also trying > to reclaim memory, then IO submitted by kswapd can be delayed and > this can slow down reclaim. (Does kswapd has to block after submitting > IO from a memcg. Can't it just move onto next memcg and either free > pages if not dirty, or also submit IO from next memcg?) No idea - you'll need to engage the mm guys to get help there. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
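One way to picture the context passing Dave is asking for above: capture the submitter's cgroup at the time the work is queued, and attribute the IO to that saved context when the workqueue later issues it, rather than to the worker thread itself. This is only a toy user-space model; all names are invented and it says nothing about how the real kernel plumbing would look.

#include <stdio.h>

/* A deferred IO request that remembers who queued it. */
struct io_work {
    int cgroup_id;     /* captured from the submitter at queue time */
    int nr_bytes;
};

static int current_cgroup = 7;   /* models the submitting task's cgroup */

static struct io_work queue_io_work(int nr_bytes)
{
    /* capture the context now, not when the worker runs */
    struct io_work w = { .cgroup_id = current_cgroup, .nr_bytes = nr_bytes };
    return w;
}

static void worker_issue_io(const struct io_work *w)
{
    /* charge the IO to the saved context, not to the worker thread */
    printf("charging %d bytes of IO to cgroup %d\n", w->nr_bytes, w->cgroup_id);
}

int main(void)
{
    struct io_work w = queue_io_work(4096);
    current_cgroup = 0;          /* worker runs in a different context (root) */
    worker_issue_io(&w);
    return 0;
}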
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-11 1:36 ` Dave Chinner @ 2011-04-15 21:07 ` Vivek Goyal 2011-04-16 3:06 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-15 21:07 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote: [..] > > > > > how metadata IO is going to be handled by > > > > > IO controllers, > > > > > > > > So IO controller provides two mechanisms. > > > > > > > > - IO throttling(bytes_per_second, io_per_second interface) > > > > - Proportional weight disk sharing > > > > > > > > In case of proportional weight disk sharing, we don't run into issues of > > > > priority inversion and metadata handing should not be a concern. > > > > > > Though metadata IO will affect how much bandwidth/iops is available > > > for applications to use. > > > > I think meta data IO will be accounted to the process submitting the meta > > data IO. (IO tracking stuff will be used only for page cache pages during > > page dirtying time). So yes, the process doing meta data IO will be > > charged for it. > > > > I think I am missing something here and not understanding your concern > > exactly here. > > XFS can issue thousands of delayed metadata write IO per second from > it's writeback threads when it needs to (e.g. tail pushing the > journal). Completely unthrottled due to the context they are issued > from(*) and can basically consume all the disk iops and bandwidth > capacity for seconds at a time. > > Also, XFS doesn't use the page cache for metadata buffers anymore > so page cache accounting, throttling and reclaim mechanisms > are never going to work for controlling XFS metadata IO > > > (*) It'll be IO issued by workqueues rather than threads RSN: > > http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39 > > And this will become _much_ more common in the not-to-distant > future. So context passing between threads and to workqueues is > something you need to think about sooner rather than later if you > want metadata IO to be throttled in any way.... Ok, so this seems to be a similar case to the WRITE traffic from flusher threads, which can disrupt IO on the end device even if we have done throttling in balance_dirty_pages(). How about doing throttling at two layers? All the data throttling is done in higher layers and then we also retain the mechanism of throttling at the end device. That way an admin can put an overall limit on such common write traffic (XFS metadata coming from workqueues, the flusher threads, kswapd, etc.). Anyway, we can't attribute this IO to a per-process context/group, otherwise most likely something will get serialized in higher layers. Right now I am speaking purely from the IO throttling point of view and not even thinking about CFQ and IO tracking stuff. This increases the complexity of the IO cgroup interface, as now we seem to have four combinations:

Global throttling
  Throttling at lower layers
  Throttling at higher layers

Per device throttling
  Throttling at lower layers
  Throttling at higher layers

Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-15 21:07 ` Vivek Goyal @ 2011-04-16 3:06 ` Vivek Goyal 2011-04-18 21:58 ` Jan Kara 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-16 3:06 UTC (permalink / raw) To: Dave Chinner; +Cc: Greg Thelen, James Bottomley, lsf, linux-fsdevel On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote: > On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote: > > [..] > > > > > > how metadata IO is going to be handled by > > > > > > IO controllers, > > > > > > > > > > So IO controller provides two mechanisms. > > > > > > > > > > - IO throttling(bytes_per_second, io_per_second interface) > > > > > - Proportional weight disk sharing > > > > > > > > > > In case of proportional weight disk sharing, we don't run into issues of > > > > > priority inversion and metadata handing should not be a concern. > > > > > > > > Though metadata IO will affect how much bandwidth/iops is available > > > > for applications to use. > > > > > > I think meta data IO will be accounted to the process submitting the meta > > > data IO. (IO tracking stuff will be used only for page cache pages during > > > page dirtying time). So yes, the process doing meta data IO will be > > > charged for it. > > > > > > I think I am missing something here and not understanding your concern > > > exactly here. > > > > XFS can issue thousands of delayed metadata write IO per second from > > it's writeback threads when it needs to (e.g. tail pushing the > > journal). Completely unthrottled due to the context they are issued > > from(*) and can basically consume all the disk iops and bandwidth > > capacity for seconds at a time. > > > > Also, XFS doesn't use the page cache for metadata buffers anymore > > so page cache accounting, throttling and reclaim mechanisms > > are never going to work for controlling XFS metadata IO > > > > > > (*) It'll be IO issued by workqueues rather than threads RSN: > > > > http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39 > > > > And this will become _much_ more common in the not-to-distant > > future. So context passing between threads and to workqueues is > > something you need to think about sooner rather than later if you > > want metadata IO to be throttled in any way.... > > Ok, > > So this seems to the similar case as WRITE traffic from flusher threads > which can disrupt IO on end device even if we have done throttling in > balance_dirty_pages(). > > How about doing throttling at two layers. All the data throttling is > done in higher layers and then also retain the mechanism of throttling > at end device. That way an admin can put a overall limit on such > common write traffic. (XFS meta data coming from workqueues, flusher > thread, kswapd etc). > > Anyway, we can't attribute this IO to per process context/group otherwise > most likely something will get serialized in higher layers. > > Right now I am speaking purely from IO throttling point of view and not > even thinking about CFQ and IO tracking stuff. > > This increases the complexity in IO cgroup interface as now we see to have > four combinations. > > Global Throttling > Throttling at lower layers > Throttling at higher layers. > > Per device throttling > Throttling at lower layers > Throttling at higher layers. Dave, I wrote above but I myself am not fond of coming up with 4 combinations. Want to limit it two. Per device throttling or global throttling. 
Here are some more thoughts in general about both the throttling policy and the proportional policy of the IO controller. For the throttling policy, I am primarily concerned with how to avoid file system serialization issues.

Proportional IO (CFQ)
---------------------
- Make writeback cgroup aware, and kernel threads (flusher) which are
  cgroup aware can be marked with a task flag (GROUP_AWARE). If a
  cgroup aware kernel thread throws IO at CFQ, then the IO is accounted
  to the cgroup of the task who originally dirtied the page. Otherwise we
  use the task context to account the IO to.

  So any IO submitted by flusher threads will go to the respective cgroups
  and a higher weight cgroup should be able to do more WRITES.

  IO submitted by other kernel threads like kjournald, XFS async metadata
  submission, kswapd etc all goes to the thread context and that is the
  root group.

- If kswapd is a concern then either make kswapd cgroup aware or let
  kswapd use the cgroup aware flusher to do IO (Dave Chinner's idea).

Open Issues
-----------
- We do not get isolation for metadata IO. In a virtualized setup, to
  achieve stronger isolation do not use the host filesystem. Export block
  devices into guests.

IO throttling
------------
READS
-----
- Do not throttle metadata IO. The filesystem needs to mark READ metadata
  IO so that we can avoid throttling it. This way ordered filesystems
  will not get serialized behind a throttled read in a slow group.

  Maybe one can account metadata reads to a group and try to use that
  to throttle data IO in the same cgroup as compensation.

WRITES
------
- Throttle tasks. Do not throttle bios. That means that when a task
  submits a direct write, let it go to disk. Do the accounting and if the
  task is exceeding the IO rate make it sleep. Something similar to
  balance_dirty_pages().

  That way, any direct WRITES should not run into any serialization issues
  in ordered mode. We can continue to use the blkio_throtle_bio() hook in
  generic_make_request().

- For buffered WRITES, design a throttling hook similar to
  balance_dirty_pages() and throttle tasks according to rules while they
  are dirtying the page cache.

- Do not throttle buffered writes again at the end device as these have
  been throttled already while writing to the page cache. Also throttling
  WRITES at the end device will lead to serialization issues with file
  systems in ordered mode.

- The cgroup of an IO is always attributed to the submitting thread. That
  way all metadata writes will go in the root cgroup and remain
  unthrottled. If one is too concerned with lots of metadata IO, then
  probably one can put a throttling rule in the root cgroup.

Open Issues
-----------
- IO spikes at end devices

  Because buffered writes are controlled at page dirtying time, we can
  have a spike of IO later at the end device when the flusher thread
  decides to do writeback.

  I am not sure how to solve this issue. Part of the problem can be
  handled by using a per cgroup dirty ratio and keeping each cgroup's
  ratio low so that we don't build up huge dirty caches. This can lead to
  a performance drop for applications. So this is a performance vs
  isolation trade off and the user chooses one.

  This issue exists in a virtualized environment only if the host file
  system is used. The best way to achieve maximum isolation would be to
  export block devices into the guest and then perform throttling per
  block device.

- Poor isolation for metadata.

  We can't account and throttle metadata in each cgroup, otherwise we
  again run into file system serialization issues in ordered mode. So this
  is a trade off of using file systems. You primarily get throttling for
  data IO and not metadata IO.
Again, export block devices into virtual machines and create file systems on those instead of using the host filesystem, and one can achieve very good isolation. Thoughts? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
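To make the "throttle the task while it dirties the page cache" idea a little more concrete, here is a small userspace analogy in plain C (this is not kernel code): the writer is put to sleep whenever it runs ahead of its allowed dirty rate, which is roughly what a per-cgroup balance_dirty_pages()-style hook would do at page dirtying time. The 1MB/s budget and the file name are arbitrary examples.

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define RATE_BPS	(1024 * 1024)	/* example dirty budget: 1MB/s */

/* Sleep until 'dirtied' bytes fit inside the elapsed-time budget. */
static void throttle_dirty(long long dirtied, const struct timespec *start)
{
	struct timespec now;
	double elapsed, allowed;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - start->tv_sec) +
		  (now.tv_nsec - start->tv_nsec) / 1e9;
	allowed = elapsed * RATE_BPS;
	if (dirtied > allowed)
		usleep((useconds_t)((dirtied - allowed) / RATE_BPS * 1e6));
}

int main(void)
{
	char buf[4096];
	long long dirtied = 0;
	int i;
	struct timespec start;
	FILE *f = fopen("/tmp/throttle-demo", "w");

	if (!f)
		return 1;
	memset(buf, 'x', sizeof(buf));
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < 2560; i++) {	/* 10MB in 4KB chunks */
		fwrite(buf, 1, sizeof(buf), f);
		dirtied += sizeof(buf);
		throttle_dirty(dirtied, &start);
	}
	fclose(f);
	return 0;
}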
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-16 3:06 ` Vivek Goyal @ 2011-04-18 21:58 ` Jan Kara 2011-04-18 22:51 ` cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Jan Kara @ 2011-04-18 21:58 UTC (permalink / raw) To: Vivek Goyal Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Fri 15-04-11 23:06:02, Vivek Goyal wrote: > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote: > > How about doing throttling at two layers. All the data throttling is > > done in higher layers and then also retain the mechanism of throttling > > at end device. That way an admin can put a overall limit on such > > common write traffic. (XFS meta data coming from workqueues, flusher > > thread, kswapd etc). > > > > Anyway, we can't attribute this IO to per process context/group otherwise > > most likely something will get serialized in higher layers. > > > > Right now I am speaking purely from IO throttling point of view and not > > even thinking about CFQ and IO tracking stuff. > > > > This increases the complexity in IO cgroup interface as now we see to have > > four combinations. > > > > Global Throttling > > Throttling at lower layers > > Throttling at higher layers. > > > > Per device throttling > > Throttling at lower layers > > Throttling at higher layers. > > Dave, > > I wrote above but I myself am not fond of coming up with 4 combinations. > Want to limit it two. Per device throttling or global throttling. Here > are some more thoughts in general about both throttling policy and > proportional policy of IO controller. For throttling policy, I am > primarily concerned with how to avoid file system serialization issues. > > Proportional IO (CFQ) > --------------------- > - Make writeback cgroup aware and kernel threads (flusher) which are > cgroup aware can be marked with a task flag (GROUP_AWARE). If a > cgroup aware kernel threads throws IO at CFQ, then IO is accounted > to cgroup of task who originally dirtied the page. Otherwise we use > task context to account the IO to. > > So any IO submitted by flusher threads will go to respective cgroups > and higher weight cgroup should be able to do more WRITES. > > IO submitted by other kernel threads like kjournald, XFS async metadata > submission, kswapd etc all goes to thread context and that is root > group. > > - If kswapd is a concern then either make kswapd cgroup aware or let > kswapd use cgroup aware flusher to do IO (Dave Chinner's idea). > > Open Issues > ----------- > - We do not get isolation for meta data IO. In virtualized setup, to > achieve stronger isolation do not use host filesystem. Export block > devices into guests. > > IO throttling > ------------ > > READS > ----- > - Do not throttle meta data IO. Filesystem needs to mark READ metadata > IO so that we can avoid throttling it. This way ordered filesystems > will not get serialized behind a throttled read in slow group. > > May be one can account meta data read to a group and try to use that > to throttle data IO in same cgroup as a compensation. > > WRITES > ------ > - Throttle tasks. Do not throttle bios. That means that when a task > submits direct write, let it go to disk. Do the accounting and if task > is exceeding the IO rate make it sleep. Something similar to > balance_dirty_pages(). 
> > That way, any direct WRITES should not run into any serialization issues > in ordered mode. We can continue to use blkio_throtle_bio() hook in > generic_make request(). > > - For buffered WRITES, design a throttling hook similar to > balance_drity_pages() and throttle tasks according to rules while they > are dirtying page cache. > > - Do not throttle buffered writes again at the end device as these have > been throttled already while writting to page cache. Also throttling > WRITES at end device will lead to serialization issues with file systems > in ordered mode. > > - Cgroup of a IO is always attributed to submitting thread. That way all > meta data writes will go in root cgroup and remain unthrottled. If one > is too concerned with lots of meta data IO, then probably one can > put a throttling rule in root cgroup. But I think the above scheme basically allows agressive buffered writer to occupy as much of disk throughput as throttling at page dirty time allows. So either you'd have to seriously limit the speed of page dirtying for each cgroup (effectively giving each write properties like direct write) or you'd have to live with cgroup taking your whole disk throughput. Neither of which seems very appealing. Grumble, not that I have a good solution to this problem... > Open Issues > ----------- > - IO spikes at end devices > > Because buffered writes are controlled at page dirtying time, we can > have a spike of IO later at end device when flusher thread decides to > do writeback. > > I am not sure how to solve this issue. Part of the problem can be > handled by using per cgroup dirty ratio and keeping each cgroup's > ratio low so that we don't build up huge dirty caches. This can lead > to performance drop of applications. So this is performance vs isolation > trade off and user chooses one. > > This issue exists in virtualized environment only if host file system > is used. The best way to achieve maximum isolation would be to export > block devices into guest and then perform throttling per block device. > > - Poor isolation for meta data. > > We can't account and throttle meta data in each cgroup otherwise we > should again run into file system serialization issues in ordered > mode. So this is a trade off of using file systems. You primarily get > throttling for data IO and not meta data IO. > > Again, export block devices in virtual machines and create file systems > on that and do not use host filesystem and one can achieve a very good > isolation. > > Thoughts? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-18 21:58 ` Jan Kara @ 2011-04-18 22:51 ` Vivek Goyal 2011-04-19 0:33 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-18 22:51 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote: > On Fri 15-04-11 23:06:02, Vivek Goyal wrote: > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote: > > > How about doing throttling at two layers. All the data throttling is > > > done in higher layers and then also retain the mechanism of throttling > > > at end device. That way an admin can put a overall limit on such > > > common write traffic. (XFS meta data coming from workqueues, flusher > > > thread, kswapd etc). > > > > > > Anyway, we can't attribute this IO to per process context/group otherwise > > > most likely something will get serialized in higher layers. > > > > > > Right now I am speaking purely from IO throttling point of view and not > > > even thinking about CFQ and IO tracking stuff. > > > > > > This increases the complexity in IO cgroup interface as now we see to have > > > four combinations. > > > > > > Global Throttling > > > Throttling at lower layers > > > Throttling at higher layers. > > > > > > Per device throttling > > > Throttling at lower layers > > > Throttling at higher layers. > > > > Dave, > > > > I wrote above but I myself am not fond of coming up with 4 combinations. > > Want to limit it two. Per device throttling or global throttling. Here > > are some more thoughts in general about both throttling policy and > > proportional policy of IO controller. For throttling policy, I am > > primarily concerned with how to avoid file system serialization issues. > > > > Proportional IO (CFQ) > > --------------------- > > - Make writeback cgroup aware and kernel threads (flusher) which are > > cgroup aware can be marked with a task flag (GROUP_AWARE). If a > > cgroup aware kernel threads throws IO at CFQ, then IO is accounted > > to cgroup of task who originally dirtied the page. Otherwise we use > > task context to account the IO to. > > > > So any IO submitted by flusher threads will go to respective cgroups > > and higher weight cgroup should be able to do more WRITES. > > > > IO submitted by other kernel threads like kjournald, XFS async metadata > > submission, kswapd etc all goes to thread context and that is root > > group. > > > > - If kswapd is a concern then either make kswapd cgroup aware or let > > kswapd use cgroup aware flusher to do IO (Dave Chinner's idea). > > > > Open Issues > > ----------- > > - We do not get isolation for meta data IO. In virtualized setup, to > > achieve stronger isolation do not use host filesystem. Export block > > devices into guests. > > > > IO throttling > > ------------ > > > > READS > > ----- > > - Do not throttle meta data IO. Filesystem needs to mark READ metadata > > IO so that we can avoid throttling it. This way ordered filesystems > > will not get serialized behind a throttled read in slow group. > > > > May be one can account meta data read to a group and try to use that > > to throttle data IO in same cgroup as a compensation. > > > > WRITES > > ------ > > - Throttle tasks. Do not throttle bios. That means that when a task > > submits direct write, let it go to disk. 
Do the accounting and if task > > is exceeding the IO rate make it sleep. Something similar to > > balance_dirty_pages(). > > > > That way, any direct WRITES should not run into any serialization issues > > in ordered mode. We can continue to use blkio_throtle_bio() hook in > > generic_make request(). > > > > - For buffered WRITES, design a throttling hook similar to > > balance_drity_pages() and throttle tasks according to rules while they > > are dirtying page cache. > > > > - Do not throttle buffered writes again at the end device as these have > > been throttled already while writting to page cache. Also throttling > > WRITES at end device will lead to serialization issues with file systems > > in ordered mode. > > > > - Cgroup of a IO is always attributed to submitting thread. That way all > > meta data writes will go in root cgroup and remain unthrottled. If one > > is too concerned with lots of meta data IO, then probably one can > > put a throttling rule in root cgroup. > But I think the above scheme basically allows agressive buffered writer > to occupy as much of disk throughput as throttling at page dirty time > allows. So either you'd have to seriously limit the speed of page dirtying > for each cgroup (effectively giving each write properties like direct write) > or you'd have to live with cgroup taking your whole disk throughput. Neither > of which seems very appealing. Grumble, not that I have a good solution to > this problem... [CCing lkml] Hi Jan, I agree that if we do throttling in balance_dirty_pages() to solve the issue of file system ordered mode, then we allow flusher threads to write data at high rate which is bad. Keeping write throttling at device level runs into issues of file system ordered mode write. I think problem is that file systems are not cgroup aware (/me runs for cover) and we are just trying to work around that hence none of the proposed problem solution is not satisfying. To get cgroup thing right, we shall have to make whole stack cgroup aware. In this case because file system journaling is not cgroup aware and is essentially a serialized operation and life becomes hard. Throttling is in higher layer is not a good solution and throttling in lower layer is not a good solution either. Ideally, throttling in generic_make_request() is good as long as all the layers sitting above it (file systems, flusher writeback, page cache share) can be made cgroup aware. So that if a cgroup is throttled, others cgroup are more or less not impacted by throttled cgroup. We have talked about making flusher cgroup aware and per cgroup dirty ratio thing, but making file system journalling cgroup aware seems to be out of question (I don't even know if it is possible to do and how much work does it involve). I will try to summarize the options I have thought about so far. - Keep throttling at device level. Do not use it with host filesystems especially with ordered mode. So this is primarily useful in case of virtualization. Or recommend user to not configure too low limits on each cgroup. So once in a while file systems in ordered mode will get serialized and it will impact scalability but will not livelock the system. - Move all write throttling in balance_dirty_pages(). This avoids ordering issues but introduce the issue of flusher writting at high speed also people have been looking for limiting traffic from a host coming to shared storage. It does not work very well there as we limit the IO rate coming into page cache and not going out of device. 
So there will be a lot of bursts. - Keep throttling at device level and do something magical in the filesystem journalling code so that it is more parallel and cgroup aware. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
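As a rough illustration of what the "make the whole stack cgroup aware" direction implies at the lowest layer, the device-level throttle would have to charge writeback IO to the cgroup that originally dirtied the page rather than to the flusher thread that submits it. A pseudo-code sketch follows; every helper name in it is invented for illustration and none of this is existing kernel code.

/* pseudo-code only -- helpers are hypothetical */
static void page_dirty_track_cgroup(struct page *page)
{
	/* at page dirtying time: remember which blkio cgroup dirtied it */
	page_io_info(page)->blkcg_id = task_blkcg_id(current);
}

static void throttle_writeback_bio(struct request_queue *q, struct bio *bio)
{
	/* at submission time: charge the original dirtier, not the
	 * flusher thread that happens to be sending the bio down */
	unsigned int id = page_io_info(bio_page(bio))->blkcg_id;

	blkcg_charge_and_maybe_wait(q, id, bio);
}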
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-18 22:51 ` cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) Vivek Goyal @ 2011-04-19 0:33 ` Dave Chinner 2011-04-19 14:30 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Dave Chinner @ 2011-04-19 0:33 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Mon, Apr 18, 2011 at 06:51:18PM -0400, Vivek Goyal wrote: > On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote: > > On Fri 15-04-11 23:06:02, Vivek Goyal wrote: > > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote: > > > > How about doing throttling at two layers. All the data throttling is > > > > done in higher layers and then also retain the mechanism of throttling > > > > at end device. That way an admin can put a overall limit on such > > > > common write traffic. (XFS meta data coming from workqueues, flusher > > > > thread, kswapd etc). > > > > > > > > Anyway, we can't attribute this IO to per process context/group otherwise > > > > most likely something will get serialized in higher layers. > > > > > > > > Right now I am speaking purely from IO throttling point of view and not > > > > even thinking about CFQ and IO tracking stuff. > > > > > > > > This increases the complexity in IO cgroup interface as now we see to have > > > > four combinations. > > > > > > > > Global Throttling > > > > Throttling at lower layers > > > > Throttling at higher layers. > > > > > > > > Per device throttling > > > > Throttling at lower layers > > > > Throttling at higher layers. > > > > > > Dave, > > > > > > I wrote above but I myself am not fond of coming up with 4 combinations. > > > Want to limit it two. Per device throttling or global throttling. Here > > > are some more thoughts in general about both throttling policy and > > > proportional policy of IO controller. For throttling policy, I am > > > primarily concerned with how to avoid file system serialization issues. > > > > > > Proportional IO (CFQ) > > > --------------------- > > > - Make writeback cgroup aware and kernel threads (flusher) which are > > > cgroup aware can be marked with a task flag (GROUP_AWARE). If a > > > cgroup aware kernel threads throws IO at CFQ, then IO is accounted > > > to cgroup of task who originally dirtied the page. Otherwise we use > > > task context to account the IO to. > > > > > > So any IO submitted by flusher threads will go to respective cgroups > > > and higher weight cgroup should be able to do more WRITES. > > > > > > IO submitted by other kernel threads like kjournald, XFS async metadata > > > submission, kswapd etc all goes to thread context and that is root > > > group. > > > > > > - If kswapd is a concern then either make kswapd cgroup aware or let > > > kswapd use cgroup aware flusher to do IO (Dave Chinner's idea). > > > > > > Open Issues > > > ----------- > > > - We do not get isolation for meta data IO. In virtualized setup, to > > > achieve stronger isolation do not use host filesystem. Export block > > > devices into guests. > > > > > > IO throttling > > > ------------ > > > > > > READS > > > ----- > > > - Do not throttle meta data IO. Filesystem needs to mark READ metadata > > > IO so that we can avoid throttling it. 
This way ordered filesystems > > > will not get serialized behind a throttled read in slow group. > > > > > > May be one can account meta data read to a group and try to use that > > > to throttle data IO in same cgroup as a compensation. > > > > > > WRITES > > > ------ > > > - Throttle tasks. Do not throttle bios. That means that when a task > > > submits direct write, let it go to disk. Do the accounting and if task > > > is exceeding the IO rate make it sleep. Something similar to > > > balance_dirty_pages(). > > > > > > That way, any direct WRITES should not run into any serialization issues > > > in ordered mode. We can continue to use blkio_throtle_bio() hook in > > > generic_make request(). > > > > > > - For buffered WRITES, design a throttling hook similar to > > > balance_drity_pages() and throttle tasks according to rules while they > > > are dirtying page cache. > > > > > > - Do not throttle buffered writes again at the end device as these have > > > been throttled already while writting to page cache. Also throttling > > > WRITES at end device will lead to serialization issues with file systems > > > in ordered mode. > > > > > > - Cgroup of a IO is always attributed to submitting thread. That way all > > > meta data writes will go in root cgroup and remain unthrottled. If one > > > is too concerned with lots of meta data IO, then probably one can > > > put a throttling rule in root cgroup. > > But I think the above scheme basically allows agressive buffered writer > > to occupy as much of disk throughput as throttling at page dirty time > > allows. So either you'd have to seriously limit the speed of page dirtying > > for each cgroup (effectively giving each write properties like direct write) > > or you'd have to live with cgroup taking your whole disk throughput. Neither > > of which seems very appealing. Grumble, not that I have a good solution to > > this problem... > > [CCing lkml] > > Hi Jan, > > I agree that if we do throttling in balance_dirty_pages() to solve the > issue of file system ordered mode, then we allow flusher threads to > write data at high rate which is bad. Keeping write throttling at device > level runs into issues of file system ordered mode write. > > I think problem is that file systems are not cgroup aware (/me runs for > cover) and we are just trying to work around that hence none of the proposed > problem solution is not satisfying. > > To get cgroup thing right, we shall have to make whole stack cgroup aware. > In this case because file system journaling is not cgroup aware and is > essentially a serialized operation and life becomes hard. Throttling is > in higher layer is not a good solution and throttling in lower layer > is not a good solution either. > > Ideally, throttling in generic_make_request() is good as long as all the > layers sitting above it (file systems, flusher writeback, page cache share) > can be made cgroup aware. So that if a cgroup is throttled, others cgroup > are more or less not impacted by throttled cgroup. We have talked about > making flusher cgroup aware and per cgroup dirty ratio thing, but making > file system journalling cgroup aware seems to be out of question (I don't > even know if it is possible to do and how much work does it involve). If you want to throttle journal operations, then we probably need to throttle metadata operations that commit to the journal, not the journal IO itself. 
The journal is a shared global resource that all cgroups use, so throttling journal IO inappropriately will affect the performance of all cgroups, not just the one that is "hogging" it. In XFS, you could probably do this at the transaction reservation stage where log space is reserved. We know everything about the transaction at this point in time, and we throttle here already when the journal is full. Adding cgroup transaction limits to this point would be the place to do it, but the control parameter for it would be very XFS specific (i.e. number of transactions/s). Concurrency is not an issue - the XFS transaction subsystem is only limited in concurrency by the space available in the journal for reservations (hundred to thousands of concurrent transactions). FWIW, this would even allow per-bdi-flusher thread transaction throttling parameters to be set, so writeback triggered metadata IO could possibly be limited as well. I'm not sure whether this is possible with other filesystems, and ext3/4 would still have the issue of ordered writeback causing much more writeback than expected at times (e.g. fsync), but I suspect there is nothing that can really be done about this. > I will try to summarize the options I have thought about so far. > > - Keep throttling at device level. Do not use it with host filesystems > especially with ordered mode. So this is primarily useful in case of > virtualization. > > Or recommend user to not configure too low limits on each cgroup. So > once in a while file systems in ordered mode will get serialized and > it will impact scalability but will not livelock the system. > > - Move all write throttling in balance_dirty_pages(). This avoids ordering > issues but introduce the issue of flusher writting at high speed also > people have been looking for limiting traffic from a host coming to > shared storage. It does not work very well there as we limit the IO > rate coming into page cache and not going out of device. So there > will be lot of bursts. > > - Keep throttling at device level and do something magical in file systems > journalling code so that it is more parallel and cgroup aware. I think the third approach is the best long term approach. FWIW, if you really want cgroups integrated properly into XFS, then they need to be integrated into the allocator as well so we can push isolateed cgroups into different, non-contending regions of the filesystem (similar to filestreams containers). I started on an general allocation policy framework for XFS a few years ago, but never had more than a POC prototype. I always intended this framework to implement (at the time) a cpuset aware policy, so I'm pretty sure such an approach would work for cgroups, too. Maybe it's time to dust off that patch set.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
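To illustrate where the hook Dave describes would sit, here is a very rough pseudo-code sketch of a per-cgroup transactions-per-second limiter applied at transaction reservation time, i.e. the place where XFS already blocks callers when the journal is full. All names are invented, and locking/wakeup details are omitted; only the placement is the point.

/* pseudo-code sketch, not real XFS code */
struct xfs_tg_limit {
	unsigned long	slice_start;	/* jiffies when current 1s slice began */
	unsigned long	dispatched;	/* transactions started in this slice */
	unsigned long	trans_per_sec;	/* per-cgroup limit */
};

static void xfs_tg_trans_throttle(struct xfs_tg_limit *tg)
{
	for (;;) {
		if (time_after(jiffies, tg->slice_start + HZ)) {
			tg->slice_start = jiffies;
			tg->dispatched = 0;
		}
		if (tg->dispatched < tg->trans_per_sec) {
			tg->dispatched++;
			return;		/* proceed to reserve log space */
		}
		/* over budget: sleep until the current slice expires */
		schedule_timeout_interruptible(tg->slice_start + HZ - jiffies);
	}
}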
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-19 0:33 ` Dave Chinner @ 2011-04-19 14:30 ` Vivek Goyal 2011-04-19 14:45 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 14:30 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote: > On Mon, Apr 18, 2011 at 06:51:18PM -0400, Vivek Goyal wrote: > > On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote: > > > On Fri 15-04-11 23:06:02, Vivek Goyal wrote: > > > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote: > > > > > How about doing throttling at two layers. All the data throttling is > > > > > done in higher layers and then also retain the mechanism of throttling > > > > > at end device. That way an admin can put a overall limit on such > > > > > common write traffic. (XFS meta data coming from workqueues, flusher > > > > > thread, kswapd etc). > > > > > > > > > > Anyway, we can't attribute this IO to per process context/group otherwise > > > > > most likely something will get serialized in higher layers. > > > > > > > > > > Right now I am speaking purely from IO throttling point of view and not > > > > > even thinking about CFQ and IO tracking stuff. > > > > > > > > > > This increases the complexity in IO cgroup interface as now we see to have > > > > > four combinations. > > > > > > > > > > Global Throttling > > > > > Throttling at lower layers > > > > > Throttling at higher layers. > > > > > > > > > > Per device throttling > > > > > Throttling at lower layers > > > > > Throttling at higher layers. > > > > > > > > Dave, > > > > > > > > I wrote above but I myself am not fond of coming up with 4 combinations. > > > > Want to limit it two. Per device throttling or global throttling. Here > > > > are some more thoughts in general about both throttling policy and > > > > proportional policy of IO controller. For throttling policy, I am > > > > primarily concerned with how to avoid file system serialization issues. > > > > > > > > Proportional IO (CFQ) > > > > --------------------- > > > > - Make writeback cgroup aware and kernel threads (flusher) which are > > > > cgroup aware can be marked with a task flag (GROUP_AWARE). If a > > > > cgroup aware kernel threads throws IO at CFQ, then IO is accounted > > > > to cgroup of task who originally dirtied the page. Otherwise we use > > > > task context to account the IO to. > > > > > > > > So any IO submitted by flusher threads will go to respective cgroups > > > > and higher weight cgroup should be able to do more WRITES. > > > > > > > > IO submitted by other kernel threads like kjournald, XFS async metadata > > > > submission, kswapd etc all goes to thread context and that is root > > > > group. > > > > > > > > - If kswapd is a concern then either make kswapd cgroup aware or let > > > > kswapd use cgroup aware flusher to do IO (Dave Chinner's idea). > > > > > > > > Open Issues > > > > ----------- > > > > - We do not get isolation for meta data IO. In virtualized setup, to > > > > achieve stronger isolation do not use host filesystem. Export block > > > > devices into guests. > > > > > > > > IO throttling > > > > ------------ > > > > > > > > READS > > > > ----- > > > > - Do not throttle meta data IO. 
Filesystem needs to mark READ metadata > > > > IO so that we can avoid throttling it. This way ordered filesystems > > > > will not get serialized behind a throttled read in slow group. > > > > > > > > May be one can account meta data read to a group and try to use that > > > > to throttle data IO in same cgroup as a compensation. > > > > > > > > WRITES > > > > ------ > > > > - Throttle tasks. Do not throttle bios. That means that when a task > > > > submits direct write, let it go to disk. Do the accounting and if task > > > > is exceeding the IO rate make it sleep. Something similar to > > > > balance_dirty_pages(). > > > > > > > > That way, any direct WRITES should not run into any serialization issues > > > > in ordered mode. We can continue to use blkio_throtle_bio() hook in > > > > generic_make request(). > > > > > > > > - For buffered WRITES, design a throttling hook similar to > > > > balance_drity_pages() and throttle tasks according to rules while they > > > > are dirtying page cache. > > > > > > > > - Do not throttle buffered writes again at the end device as these have > > > > been throttled already while writting to page cache. Also throttling > > > > WRITES at end device will lead to serialization issues with file systems > > > > in ordered mode. > > > > > > > > - Cgroup of a IO is always attributed to submitting thread. That way all > > > > meta data writes will go in root cgroup and remain unthrottled. If one > > > > is too concerned with lots of meta data IO, then probably one can > > > > put a throttling rule in root cgroup. > > > But I think the above scheme basically allows agressive buffered writer > > > to occupy as much of disk throughput as throttling at page dirty time > > > allows. So either you'd have to seriously limit the speed of page dirtying > > > for each cgroup (effectively giving each write properties like direct write) > > > or you'd have to live with cgroup taking your whole disk throughput. Neither > > > of which seems very appealing. Grumble, not that I have a good solution to > > > this problem... > > > > [CCing lkml] > > > > Hi Jan, > > > > I agree that if we do throttling in balance_dirty_pages() to solve the > > issue of file system ordered mode, then we allow flusher threads to > > write data at high rate which is bad. Keeping write throttling at device > > level runs into issues of file system ordered mode write. > > > > I think problem is that file systems are not cgroup aware (/me runs for > > cover) and we are just trying to work around that hence none of the proposed > > problem solution is not satisfying. > > > > To get cgroup thing right, we shall have to make whole stack cgroup aware. > > In this case because file system journaling is not cgroup aware and is > > essentially a serialized operation and life becomes hard. Throttling is > > in higher layer is not a good solution and throttling in lower layer > > is not a good solution either. > > > > Ideally, throttling in generic_make_request() is good as long as all the > > layers sitting above it (file systems, flusher writeback, page cache share) > > can be made cgroup aware. So that if a cgroup is throttled, others cgroup > > are more or less not impacted by throttled cgroup. We have talked about > > making flusher cgroup aware and per cgroup dirty ratio thing, but making > > file system journalling cgroup aware seems to be out of question (I don't > > even know if it is possible to do and how much work does it involve). 
> > If you want to throttle journal operations, then we probably need to > throttle metadata operations that commit to the journal, not the > journal IO itself. The journal is a shared global resource that all > cgroups use, so throttling journal IO inappropriately will affect > the performance of all cgroups, not just the one that is "hogging" > it. Agreed. > > In XFS, you could probably do this at the transaction reservation > stage where log space is reserved. We know everything about the > transaction at this point in time, and we throttle here already when > the journal is full. Adding cgroup transaction limits to this point > would be the place to do it, but the control parameter for it would > be very XFS specific (i.e. number of transactions/s). Concurrency is > not an issue - the XFS transaction subsystem is only limited in > concurrency by the space available in the journal for reservations > (hundred to thousands of concurrent transactions). Instead of transaction per second, can we implement some kind of upper limit of pending transactions per cgroup. And that limit does not have to be user tunable to begin with. The effective transactions/sec rate will automatically be determined by IO throttling rate of the cgroup at the end nodes. I think effectively what we need is that the notion of parallel transactions so that transactions of one cgroup can make progress independent of transactions of other cgroup. So if a process does an fsync and it is throttled then it should block transaction of only that cgroup and not other cgroups. You mentioned that concurrency is not an issue in XFS and hundreds of thousands of concurrent trasactions can progress depending on log space available. If that's the case, I think to begin with we might not have to do anything at all. Processes can still get blocked but as long as we have enough log space, this might not be a frequent event. I will do some testing with XFS and see can I livelock the system with very low IO limits. > > FWIW, this would even allow per-bdi-flusher thread transaction > throttling parameters to be set, so writeback triggered metadata IO > could possibly be limited as well. How does writeback trigger metadata IO? In the first step I was looking to not throttle meta data IO as that will require even more changes in file system layer. I was thinking that if we provide throttling only for data and do changes in filesystems so that concurrent transactions can exist and make progress and file system IO does not serialize behind slow throttled cgroup. This leads to weaker isolation but atleast we don't run into livelocking or filesystem scalability issues. Once that's resolved, we can handle the case of throttling meta data IO also. In fact if metadata is dependent on data (in ordered mode) and if we are throttling data, then we automatically throttle meata for select cases. > > I'm not sure whether this is possible with other filesystems, and > ext3/4 would still have the issue of ordered writeback causing much > more writeback than expected at times (e.g. fsync), but I suspect > there is nothing that can really be done about this. Can't this be modified so that multiple per cgroup transactions can make progress. So if one fsync is blocked, then processes in other cgroup should still be able to do IO using a separate transaction and be able to commit it. > > > I will try to summarize the options I have thought about so far. > > > > - Keep throttling at device level. 
Do not use it with host filesystems > > especially with ordered mode. So this is primarily useful in case of > > virtualization. > > > > Or recommend user to not configure too low limits on each cgroup. So > > once in a while file systems in ordered mode will get serialized and > > it will impact scalability but will not livelock the system. > > > > - Move all write throttling in balance_dirty_pages(). This avoids ordering > > issues but introduce the issue of flusher writting at high speed also > > people have been looking for limiting traffic from a host coming to > > shared storage. It does not work very well there as we limit the IO > > rate coming into page cache and not going out of device. So there > > will be lot of bursts. > > > > - Keep throttling at device level and do something magical in file systems > > journalling code so that it is more parallel and cgroup aware. > > I think the third approach is the best long term approach. I also like the third approach. It is complex but more sustainable in the long term. > > FWIW, if you really want cgroups integrated properly into XFS, then > they need to be integrated into the allocator as well so we can push > isolateed cgroups into different, non-contending regions of the > filesystem (similar to filestreams containers). I started on an > general allocation policy framework for XFS a few years ago, but > never had more than a POC prototype. I always intended this > framework to implement (at the time) a cpuset aware policy, so I'm > pretty sure such an approach would work for cgroups, too. Maybe it's > time to dust off that patch set.... So is having separate allocation areas/groups for separate groups useful from a locking perspective? Is it useful even if we do not throttle metadata? I will be willing to test these patches if you decide to dust off the old patches. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
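The "upper limit of pending transactions per cgroup" idea could look something like the pseudo-code below: a counter bumped where log space is reserved and dropped at commit/cancel, with callers sleeping when their group is at its cap. Again, all names are invented and this is only a sketch of the shape of the mechanism, not a real patch.

/* pseudo-code sketch, names invented */
struct xfs_cgroup_throttle {
	atomic_t		pending;	/* transactions reserved, not yet committed */
	int			max_pending;	/* cap, admin-set or derived from IO limit */
	wait_queue_head_t	wait;
};

/* called from the transaction reservation path */
static void xfs_cgroup_trans_get(struct xfs_cgroup_throttle *xct)
{
	wait_event(xct->wait,
		   atomic_add_unless(&xct->pending, 1, xct->max_pending));
}

/* called at transaction commit or cancel */
static void xfs_cgroup_trans_put(struct xfs_cgroup_throttle *xct)
{
	if (atomic_dec_return(&xct->pending) < xct->max_pending)
		wake_up(&xct->wait);
}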
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-19 14:30 ` Vivek Goyal @ 2011-04-19 14:45 ` Jan Kara 2011-04-19 17:17 ` Vivek Goyal 2011-04-21 0:29 ` Dave Chinner 2 siblings, 0 replies; 138+ messages in thread From: Jan Kara @ 2011-04-19 14:45 UTC (permalink / raw) To: Vivek Goyal Cc: Dave Chinner, Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Tue 19-04-11 10:30:22, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote: > > If you want to throttle journal operations, then we probably need to > > throttle metadata operations that commit to the journal, not the > > journal IO itself. The journal is a shared global resource that all > > cgroups use, so throttling journal IO inappropriately will affect > > the performance of all cgroups, not just the one that is "hogging" > > it. > > Agreed. > > > > > In XFS, you could probably do this at the transaction reservation > > stage where log space is reserved. We know everything about the > > transaction at this point in time, and we throttle here already when > > the journal is full. Adding cgroup transaction limits to this point > > would be the place to do it, but the control parameter for it would > > be very XFS specific (i.e. number of transactions/s). Concurrency is > > not an issue - the XFS transaction subsystem is only limited in > > concurrency by the space available in the journal for reservations > > (hundred to thousands of concurrent transactions). > > Instead of transaction per second, can we implement some kind of upper > limit of pending transactions per cgroup. And that limit does not have > to be user tunable to begin with. The effective transactions/sec rate > will automatically be determined by IO throttling rate of the cgroup > at the end nodes. > > I think effectively what we need is that the notion of parallel > transactions so that transactions of one cgroup can make progress > independent of transactions of other cgroup. So if a process does > an fsync and it is throttled then it should block transaction of > only that cgroup and not other cgroups. > > You mentioned that concurrency is not an issue in XFS and hundreds of > thousands of concurrent trasactions can progress depending on log space > available. If that's the case, I think to begin with we might not have > to do anything at all. Processes can still get blocked but as long as > we have enough log space, this might not be a frequent event. I will > do some testing with XFS and see can I livelock the system with very > low IO limits. > > > > > FWIW, this would even allow per-bdi-flusher thread transaction > > throttling parameters to be set, so writeback triggered metadata IO > > could possibly be limited as well. > > How does writeback trigger metadata IO? Because by writing data, you may need to do block allocation or mark blocks as written on disk, or similar changes to metadata... > In the first step I was looking to not throttle meta data IO as that > will require even more changes in file system layer. I was thinking > that if we provide throttling only for data and do changes in filesystems > so that concurrent transactions can exist and make progress and file > system IO does not serialize behind slow throttled cgroup. Yes, I think not throttling metadata is a good start. 
> This leads to weaker isolation but atleast we don't run into livelocking > or filesystem scalability issues. Once that's resolved, we can handle the > case of throttling meta data IO also. > > In fact if metadata is dependent on data (in ordered mode) and if we are > throttling data, then we automatically throttle meata for select cases. > > > > > I'm not sure whether this is possible with other filesystems, and > > ext3/4 would still have the issue of ordered writeback causing much > > more writeback than expected at times (e.g. fsync), but I suspect > > there is nothing that can really be done about this. > > Can't this be modified so that multiple per cgroup transactions can make > progress. So if one fsync is blocked, then processes in other cgroup > should still be able to do IO using a separate transaction and be able > to commit it. Not really. Ext3/4 always has a single running transaction and all metadata updates from all threads are recorded in it. When the transaction grows large/old enough, we commit it and start a new transaction. The fact that there is always just one running transaction is heavily used in the journaling code so it would need a serious rewrite of JBD2... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
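A grossly simplified illustration of the constraint Jan is describing (this is not the real jbd2 code; the two helpers are made up): every metadata update starts a handle, and every handle joins the one and only running transaction, regardless of which cgroup the caller belongs to, so stalling that transaction's commit eventually stalls everyone.

/* illustration only, not real jbd2 code */
handle_t *journal_start(journal_t *journal, int nblocks)
{
	handle_t *handle = alloc_handle(nblocks);	/* made-up helper */

	wait_for_transaction_space(journal, nblocks);	/* made-up helper */

	/* there is exactly one transaction accepting new handles */
	handle->h_transaction = journal->j_running_transaction;
	return handle;
}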
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-19 14:30 ` Vivek Goyal 2011-04-19 14:45 ` Jan Kara @ 2011-04-19 17:17 ` Vivek Goyal 2011-04-19 18:30 ` Vivek Goyal 2011-04-21 0:29 ` Dave Chinner 2 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 17:17 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote: [..] > > > > In XFS, you could probably do this at the transaction reservation > > stage where log space is reserved. We know everything about the > > transaction at this point in time, and we throttle here already when > > the journal is full. Adding cgroup transaction limits to this point > > would be the place to do it, but the control parameter for it would > > be very XFS specific (i.e. number of transactions/s). Concurrency is > > not an issue - the XFS transaction subsystem is only limited in > > concurrency by the space available in the journal for reservations > > (hundred to thousands of concurrent transactions). > > Instead of transaction per second, can we implement some kind of upper > limit of pending transactions per cgroup. And that limit does not have > to be user tunable to begin with. The effective transactions/sec rate > will automatically be determined by IO throttling rate of the cgroup > at the end nodes. > > I think effectively what we need is that the notion of parallel > transactions so that transactions of one cgroup can make progress > independent of transactions of other cgroup. So if a process does > an fsync and it is throttled then it should block transaction of > only that cgroup and not other cgroups. > > You mentioned that concurrency is not an issue in XFS and hundreds of > thousands of concurrent trasactions can progress depending on log space > available. If that's the case, I think to begin with we might not have > to do anything at all. Processes can still get blocked but as long as > we have enough log space, this might not be a frequent event. I will > do some testing with XFS and see can I livelock the system with very > low IO limits. Wow, XFS seems to be doing pretty good here. I created a group of 1 bytes/sec limit and wrote few bytes in a file and write quit it (vim). That led to an fsync and process got blocked. From a different cgroup, in the same directory I seem to be able to do all other regular operations like ls, opening a new file, editing it etc. ext4 will lockup immediately. So concurrent transactions do seem to work in XFS. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-19 17:17 ` Vivek Goyal @ 2011-04-19 18:30 ` Vivek Goyal 2011-04-21 0:32 ` Dave Chinner 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 18:30 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Tue, Apr 19, 2011 at 01:17:23PM -0400, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote: > > [..] > > > > > > In XFS, you could probably do this at the transaction reservation > > > stage where log space is reserved. We know everything about the > > > transaction at this point in time, and we throttle here already when > > > the journal is full. Adding cgroup transaction limits to this point > > > would be the place to do it, but the control parameter for it would > > > be very XFS specific (i.e. number of transactions/s). Concurrency is > > > not an issue - the XFS transaction subsystem is only limited in > > > concurrency by the space available in the journal for reservations > > > (hundred to thousands of concurrent transactions). > > > > Instead of transaction per second, can we implement some kind of upper > > limit of pending transactions per cgroup. And that limit does not have > > to be user tunable to begin with. The effective transactions/sec rate > > will automatically be determined by IO throttling rate of the cgroup > > at the end nodes. > > > > I think effectively what we need is that the notion of parallel > > transactions so that transactions of one cgroup can make progress > > independent of transactions of other cgroup. So if a process does > > an fsync and it is throttled then it should block transaction of > > only that cgroup and not other cgroups. > > > > You mentioned that concurrency is not an issue in XFS and hundreds of > > thousands of concurrent trasactions can progress depending on log space > > available. If that's the case, I think to begin with we might not have > > to do anything at all. Processes can still get blocked but as long as > > we have enough log space, this might not be a frequent event. I will > > do some testing with XFS and see can I livelock the system with very > > low IO limits. > > Wow, XFS seems to be doing pretty good here. I created a group of > 1 bytes/sec limit and wrote few bytes in a file and write quit it (vim). > That led to an fsync and process got blocked. From a different cgroup, in the > same directory I seem to be able to do all other regular operations like ls, > opening a new file, editing it etc. > > ext4 will lockup immediately. So concurrent transactions do seem to work in > XFS. Well, I used tedso's fsync tester test case which wrote a file of 1MB and then did fsync. I launched this test case in two cgroups. One is throttled and other is not. Looks like unthrottled one gets blocked somewhere and can't make progress. So there are dependencies somewhere even with XFS. Thanks Vivek > > Thanks > Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
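For anyone who wants to reproduce this, a minimal loop along the lines of that fsync test (not the exact program) is below; run one copy in the throttled cgroup and one in an unthrottled cgroup and compare the reported fsync times.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define SIZE	(1024 * 1024)		/* 1MB written before each fsync */

int main(void)
{
	static char buf[SIZE];
	struct timeval start, end;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'a', SIZE);
	for (;;) {
		gettimeofday(&start, NULL);
		if (pwrite(fd, buf, SIZE, 0) != SIZE)
			break;
		fsync(fd);
		gettimeofday(&end, NULL);
		printf("fsync time: %.4fs\n",
		       (end.tv_sec - start.tv_sec) +
		       (end.tv_usec - start.tv_usec) / 1e6);
		sleep(1);
	}
	return 0;
}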
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-19 18:30 ` Vivek Goyal @ 2011-04-21 0:32 ` Dave Chinner 0 siblings, 0 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-21 0:32 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Tue, Apr 19, 2011 at 02:30:22PM -0400, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 01:17:23PM -0400, Vivek Goyal wrote: > > On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote: > > > > [..] > > > > > > > > In XFS, you could probably do this at the transaction reservation > > > > stage where log space is reserved. We know everything about the > > > > transaction at this point in time, and we throttle here already when > > > > the journal is full. Adding cgroup transaction limits to this point > > > > would be the place to do it, but the control parameter for it would > > > > be very XFS specific (i.e. number of transactions/s). Concurrency is > > > > not an issue - the XFS transaction subsystem is only limited in > > > > concurrency by the space available in the journal for reservations > > > > (hundred to thousands of concurrent transactions). > > > > > > Instead of transaction per second, can we implement some kind of upper > > > limit of pending transactions per cgroup. And that limit does not have > > > to be user tunable to begin with. The effective transactions/sec rate > > > will automatically be determined by IO throttling rate of the cgroup > > > at the end nodes. > > > > > > I think effectively what we need is that the notion of parallel > > > transactions so that transactions of one cgroup can make progress > > > independent of transactions of other cgroup. So if a process does > > > an fsync and it is throttled then it should block transaction of > > > only that cgroup and not other cgroups. > > > > > > You mentioned that concurrency is not an issue in XFS and hundreds of > > > thousands of concurrent trasactions can progress depending on log space > > > available. If that's the case, I think to begin with we might not have > > > to do anything at all. Processes can still get blocked but as long as > > > we have enough log space, this might not be a frequent event. I will > > > do some testing with XFS and see can I livelock the system with very > > > low IO limits. > > > > Wow, XFS seems to be doing pretty good here. I created a group of > > 1 bytes/sec limit and wrote few bytes in a file and write quit it (vim). > > That led to an fsync and process got blocked. From a different cgroup, in the > > same directory I seem to be able to do all other regular operations like ls, > > opening a new file, editing it etc. > > > > ext4 will lockup immediately. So concurrent transactions do seem to work in > > XFS. > > Well, I used tedso's fsync tester test case which wrote a file of 1MB > and then did fsync. I launched this test case in two cgroups. One is > throttled and other is not. Looks like unthrottled one gets blocked > somewhere and can't make progress. So there are dependencies somewhere > even with XFS. Yes, if you throttle the journal commit IO then other transaction commits will stall when we run out of log buffers to write new commits to disk. Like I said - the journal is a shared resource and stalling it will eventually stop _everything_. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: cgroup IO throttling and filesystem ordered mode (Was: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)) 2011-04-19 14:30 ` Vivek Goyal 2011-04-19 14:45 ` Jan Kara 2011-04-19 17:17 ` Vivek Goyal @ 2011-04-21 0:29 ` Dave Chinner 2 siblings, 0 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-21 0:29 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, Greg Thelen, James Bottomley, lsf, linux-fsdevel, linux kernel mailing list On Tue, Apr 19, 2011 at 10:30:22AM -0400, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 10:33:39AM +1000, Dave Chinner wrote: > > On Mon, Apr 18, 2011 at 06:51:18PM -0400, Vivek Goyal wrote: > > > On Mon, Apr 18, 2011 at 11:58:44PM +0200, Jan Kara wrote: > > > > On Fri 15-04-11 23:06:02, Vivek Goyal wrote: > > > > > On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote: > > > > > > How about doing throttling at two layers. All the data throttling is > > > > > > done in higher layers and then also retain the mechanism of throttling > > > > > > at end device. That way an admin can put a overall limit on such > > > > > > common write traffic. (XFS meta data coming from workqueues, flusher > > > > > > thread, kswapd etc). > > > > > > > > > > > > Anyway, we can't attribute this IO to per process context/group otherwise > > > > > > most likely something will get serialized in higher layers. > > > > > > > > > > > > Right now I am speaking purely from IO throttling point of view and not > > > > > > even thinking about CFQ and IO tracking stuff. > > > > > > > > > > > > This increases the complexity in IO cgroup interface as now we see to have > > > > > > four combinations. > > > > > > > > > > > > Global Throttling > > > > > > Throttling at lower layers > > > > > > Throttling at higher layers. > > > > > > > > > > > > Per device throttling > > > > > > Throttling at lower layers > > > > > > Throttling at higher layers. > > > > > > > > > > Dave, > > > > > > > > > > I wrote above but I myself am not fond of coming up with 4 combinations. > > > > > Want to limit it two. Per device throttling or global throttling. Here > > > > > are some more thoughts in general about both throttling policy and > > > > > proportional policy of IO controller. For throttling policy, I am > > > > > primarily concerned with how to avoid file system serialization issues. > > > > > > > > > > Proportional IO (CFQ) > > > > > --------------------- > > > > > - Make writeback cgroup aware and kernel threads (flusher) which are > > > > > cgroup aware can be marked with a task flag (GROUP_AWARE). If a > > > > > cgroup aware kernel threads throws IO at CFQ, then IO is accounted > > > > > to cgroup of task who originally dirtied the page. Otherwise we use > > > > > task context to account the IO to. > > > > > > > > > > So any IO submitted by flusher threads will go to respective cgroups > > > > > and higher weight cgroup should be able to do more WRITES. > > > > > > > > > > IO submitted by other kernel threads like kjournald, XFS async metadata > > > > > submission, kswapd etc all goes to thread context and that is root > > > > > group. > > > > > > > > > > - If kswapd is a concern then either make kswapd cgroup aware or let > > > > > kswapd use cgroup aware flusher to do IO (Dave Chinner's idea). > > > > > > > > > > Open Issues > > > > > ----------- > > > > > - We do not get isolation for meta data IO. In virtualized setup, to > > > > > achieve stronger isolation do not use host filesystem. Export block > > > > > devices into guests. 
> > > > > > > > > > IO throttling > > > > > ------------ > > > > > > > > > > READS > > > > > ----- > > > > > - Do not throttle meta data IO. Filesystem needs to mark READ metadata > > > > > IO so that we can avoid throttling it. This way ordered filesystems > > > > > will not get serialized behind a throttled read in slow group. > > > > > > > > > > May be one can account meta data read to a group and try to use that > > > > > to throttle data IO in same cgroup as a compensation. > > > > > > > > > > WRITES > > > > > ------ > > > > > - Throttle tasks. Do not throttle bios. That means that when a task > > > > > submits direct write, let it go to disk. Do the accounting and if task > > > > > is exceeding the IO rate make it sleep. Something similar to > > > > > balance_dirty_pages(). > > > > > > > > > > That way, any direct WRITES should not run into any serialization issues > > > > > in ordered mode. We can continue to use blkio_throtle_bio() hook in > > > > > generic_make request(). > > > > > > > > > > - For buffered WRITES, design a throttling hook similar to > > > > > balance_drity_pages() and throttle tasks according to rules while they > > > > > are dirtying page cache. > > > > > > > > > > - Do not throttle buffered writes again at the end device as these have > > > > > been throttled already while writting to page cache. Also throttling > > > > > WRITES at end device will lead to serialization issues with file systems > > > > > in ordered mode. > > > > > > > > > > - Cgroup of a IO is always attributed to submitting thread. That way all > > > > > meta data writes will go in root cgroup and remain unthrottled. If one > > > > > is too concerned with lots of meta data IO, then probably one can > > > > > put a throttling rule in root cgroup. > > > > But I think the above scheme basically allows agressive buffered writer > > > > to occupy as much of disk throughput as throttling at page dirty time > > > > allows. So either you'd have to seriously limit the speed of page dirtying > > > > for each cgroup (effectively giving each write properties like direct write) > > > > or you'd have to live with cgroup taking your whole disk throughput. Neither > > > > of which seems very appealing. Grumble, not that I have a good solution to > > > > this problem... > > > > > > [CCing lkml] > > > > > > Hi Jan, > > > > > > I agree that if we do throttling in balance_dirty_pages() to solve the > > > issue of file system ordered mode, then we allow flusher threads to > > > write data at high rate which is bad. Keeping write throttling at device > > > level runs into issues of file system ordered mode write. > > > > > > I think problem is that file systems are not cgroup aware (/me runs for > > > cover) and we are just trying to work around that hence none of the proposed > > > problem solution is not satisfying. > > > > > > To get cgroup thing right, we shall have to make whole stack cgroup aware. > > > In this case because file system journaling is not cgroup aware and is > > > essentially a serialized operation and life becomes hard. Throttling is > > > in higher layer is not a good solution and throttling in lower layer > > > is not a good solution either. > > > > > > Ideally, throttling in generic_make_request() is good as long as all the > > > layers sitting above it (file systems, flusher writeback, page cache share) > > > can be made cgroup aware. So that if a cgroup is throttled, others cgroup > > > are more or less not impacted by throttled cgroup. 
We have talked about > > > making flusher cgroup aware and per cgroup dirty ratio thing, but making > > > file system journalling cgroup aware seems to be out of question (I don't > > > even know if it is possible to do and how much work does it involve). > > > > If you want to throttle journal operations, then we probably need to > > throttle metadata operations that commit to the journal, not the > > journal IO itself. The journal is a shared global resource that all > > cgroups use, so throttling journal IO inappropriately will affect > > the performance of all cgroups, not just the one that is "hogging" > > it. > > Agreed. > > > > > In XFS, you could probably do this at the transaction reservation > > stage where log space is reserved. We know everything about the > > transaction at this point in time, and we throttle here already when > > the journal is full. Adding cgroup transaction limits to this point > > would be the place to do it, but the control parameter for it would > > be very XFS specific (i.e. number of transactions/s). Concurrency is > > not an issue - the XFS transaction subsystem is only limited in > > concurrency by the space available in the journal for reservations > > (hundred to thousands of concurrent transactions). > > Instead of transaction per second, can we implement some kind of upper > limit of pending transactions per cgroup. And that limit does not have > to be user tunable to begin with. The effective transactions/sec rate > will automatically be determined by IO throttling rate of the cgroup > at the end nodes. Sure - that's just another measure of the same thing, really. > I think effectively what we need is that the notion of parallel > transactions so that transactions of one cgroup can make progress > independent of transactions of other cgroup. So if a process does > an fsync and it is throttled then it should block transaction of > only that cgroup and not other cgroups. Parallel transactions only get you so far - there's still the serialisation of the transaction commit that occurs. > You mentioned that concurrency is not an issue in XFS and hundreds of > thousands of concurrent trasactions can progress depending on log space "hundreds _to_ thousands of concurrent transactions". You read a couple of orders of magnitude larger number there ;) > > FWIW, this would even allow per-bdi-flusher thread transaction > > throttling parameters to be set, so writeback triggered metadata IO > > could possibly be limited as well. > > How does writeback trigger metadata IO? Allocation might need to read free space btree blocks, transaction reservation can trigger a log tail push becuase there isn't enough space in the log, transaction commit might cause journal writes.... > > I'm not sure whether this is possible with other filesystems, and > > ext3/4 would still have the issue of ordered writeback causing much > > more writeback than expected at times (e.g. fsync), but I suspect > > there is nothing that can really be done about this. > > Can't this be modified so that multiple per cgroup transactions can make > progress. So if one fsync is blocked, then processes in other cgroup > should still be able to do IO using a separate transaction and be able > to commit it. That would be for the ext4 guys to answer. 
> > FWIW, if you really want cgroups integrated properly into XFS, then > > they need to be integrated into the allocator as well so we can push > > isolateed cgroups into different, non-contending regions of the > > filesystem (similar to filestreams containers). I started on an > > general allocation policy framework for XFS a few years ago, but > > never had more than a POC prototype. I always intended this > > framework to implement (at the time) a cpuset aware policy, so I'm > > pretty sure such an approach would work for cgroups, too. Maybe it's > > time to dust off that patch set.... > > So having separate allocation areas/groups for separate group is useful > from locking perspective? Is it useful even if we do not throttle > meta data? Yes. Allocation groups have their own locking and can operate completely in parallel. The only typical serialisation point between allocation transactions in different AGs is the transaction commit... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
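As a concrete illustration of the reservation-stage idea discussed in the message above: a per-cgroup cap on in-flight transactions, checked where XFS reserves log space and released at commit/cancel time, could have roughly the shape below. Everything here is hypothetical -- the xfs_cgroup_* names, the structure layout and the hook placement do not exist in the kernel; this only sketches the "block while the cgroup is at its pending-transaction limit" behaviour.

/*
 * Hypothetical sketch only -- none of these names exist in the kernel.
 * A per-cgroup cap on in-flight transactions, checked where XFS
 * reserves log space, and released at commit/cancel time.
 */
struct xfs_cgroup_tps {
	atomic_t		pending;  /* transactions reserved, not yet committed */
	int			limit;    /* allowed in-flight transactions */
	wait_queue_head_t	wait;
};

/* Called from the transaction reservation path, before log space is taken. */
static void xfs_cgroup_trans_throttle(struct xfs_cgroup_tps *tps)
{
	/* Racy as written (two tasks can both pass the check), but good
	 * enough to show the shape of the throttle. */
	wait_event(tps->wait, atomic_read(&tps->pending) < tps->limit);
	atomic_inc(&tps->pending);
}

/* Called from transaction commit/cancel. */
static void xfs_cgroup_trans_done(struct xfs_cgroup_tps *tps)
{
	if (atomic_dec_return(&tps->pending) < tps->limit)
		wake_up(&tps->wait);
}

The effective transactions/s for a cgroup would then fall out of how quickly its IO is allowed to complete at the device, as suggested above, rather than being a separate tunable.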
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-11 1:36 ` Dave Chinner 2011-04-15 21:07 ` Vivek Goyal @ 2011-04-19 14:17 ` Wu Fengguang 2011-04-19 14:34 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-19 14:17 UTC (permalink / raw) To: Dave Chinner Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf, linux-fsdevel [snip] > > > > For throttling case, apart from metadata, I found that with simple > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > not be done at device level instead try to do it in higher layers, > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > The problem with doing it at the page cache entry level is that > > > cache hits then get throttled. It's not really a an IO controller at > > > that point, and the impact on application performance could be huge > > > (i.e. MB/s instead of GB/s). > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > if page being asked for is in cache or not and charge for IO accordingly. > > You'd need hooks in find_or_create_page(), though you have no > context of whether a read or a write is in progress at that point. I'm confused. Where is the throttling at cache hits? The balance_dirty_pages() throttling kicks in at write() syscall and page fault time. For example, generic_perform_write(), do_wp_page() and __do_fault() will explicitly call balance_dirty_pages_ratelimited() to do the write throttling. Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
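For readers wanting to see where that write-side throttle sits: the buffered write path boils down to the shape below (heavily simplified, not the verbatim kernel code; the ->write_begin()/->write_end() details and error handling are omitted). do_wp_page() and __do_fault() make the same balance_dirty_pages_ratelimited() call after dirtying a page through a fault.

/* Simplified shape of the buffered write loop; not the verbatim code. */
static ssize_t buffered_write_loop(struct file *file, struct iov_iter *i,
				   loff_t pos)
{
	struct address_space *mapping = file->f_mapping;
	ssize_t written = 0;

	while (iov_iter_count(i)) {
		/* ->write_begin(), copy from user, ->write_end():
		 * this is where the page actually becomes dirty. */

		/*
		 * Throttle the task that is dirtying pages here, rather
		 * than when the flusher later writes the pages back.
		 */
		balance_dirty_pages_ratelimited(mapping);
	}
	return written;
}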
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 14:17 ` [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) Wu Fengguang @ 2011-04-19 14:34 ` Vivek Goyal 2011-04-19 14:48 ` Jan Kara 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 14:34 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, Greg Thelen, James Bottomley, lsf, linux-fsdevel On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > [snip] > > > > > For throttling case, apart from metadata, I found that with simple > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > not be done at device level instead try to do it in higher layers, > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > The problem with doing it at the page cache entry level is that > > > > cache hits then get throttled. It's not really a an IO controller at > > > > that point, and the impact on application performance could be huge > > > > (i.e. MB/s instead of GB/s). > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > You'd need hooks in find_or_create_page(), though you have no > > context of whether a read or a write is in progress at that point. > > I'm confused. Where is the throttling at cache hits? > > The balance_dirty_pages() throttling kicks in at write() syscall and > page fault time. For example, generic_perform_write(), do_wp_page() > and __do_fault() will explicitly call > balance_dirty_pages_ratelimited() to do the write throttling. This comment was in the context of what if we move block IO controller read throttling also in higher layers. Then we don't want to throttle reads which are already in cache. Currently throttling hook is in generic_make_request() and it kicks in only if data is not present in page cache and actual disk IO is initiated. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
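To spell out why the generic_make_request() placement only charges real disk IO: by the time a bio reaches the block layer the page cache has already been consulted, so cache hits never get that far. A paraphrase of the idea follows -- the helper name is made up, standing in for the blkio throttle hook mentioned earlier in the thread, and the queueing behaviour is glossed over.

/* Paraphrase only: where device-level throttling hooks in.
 * cgroup_throttle_bio() is a made-up name for the real blk-throttle
 * entry point. */
static void submit_one_bio(struct request_queue *q, struct bio *bio)
{
	/*
	 * A bio only exists here because real device IO is about to be
	 * issued; reads satisfied from the page cache never reach this
	 * point, so they are never charged.
	 */
	if (cgroup_throttle_bio(q, bio))
		return;		/* held by the throttler, dispatched later */

	q->make_request_fn(q, bio);
}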
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 14:34 ` Vivek Goyal @ 2011-04-19 14:48 ` Jan Kara 2011-04-19 15:11 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Jan Kara @ 2011-04-19 14:48 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, James Bottomley, lsf, linux-fsdevel, Dave Chinner On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > [snip] > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > that point, and the impact on application performance could be huge > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > context of whether a read or a write is in progress at that point. > > > > I'm confused. Where is the throttling at cache hits? > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > page fault time. For example, generic_perform_write(), do_wp_page() > > and __do_fault() will explicitly call > > balance_dirty_pages_ratelimited() to do the write throttling. > > This comment was in the context of what if we move block IO controller read > throttling also in higher layers. Then we don't want to throttle reads > which are already in cache. > > Currently throttling hook is in generic_make_request() and it kicks in > only if data is not present in page cache and actual disk IO is initiated. You can always throttle in readpage(). It's not much higher than generic_make_request() but basically as high as it can get I suspect (otherwise you'd have to deal with lots of different code paths like page faults, splice, read, ...). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
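A readpage()-level hook along the lines Jan suggests could look roughly like this. The charge helper and the myfs_* names are invented; the point is only that ->readpage() is reached exclusively on page cache misses, so cached reads are never throttled.

/* Sketch only: cgroup_charge_read() and the myfs_* names are made up. */
static int myfs_throttled_readpage(struct file *file, struct page *page)
{
	/* Only a page cache miss gets us here, so cache hits stay fast. */
	cgroup_charge_read(current, page->mapping->host, PAGE_CACHE_SIZE);

	return myfs_do_readpage(file, page);	/* the filesystem's real read */
}

static const struct address_space_operations myfs_aops = {
	.readpage	= myfs_throttled_readpage,
	/* ... */
};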
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 14:48 ` Jan Kara @ 2011-04-19 15:11 ` Vivek Goyal 2011-04-19 15:22 ` Wu Fengguang 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 15:11 UTC (permalink / raw) To: Jan Kara; +Cc: Wu Fengguang, James Bottomley, lsf, linux-fsdevel, Dave Chinner On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > [snip] > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > that point, and the impact on application performance could be huge > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > context of whether a read or a write is in progress at that point. > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > and __do_fault() will explicitly call > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > This comment was in the context of what if we move block IO controller read > > throttling also in higher layers. Then we don't want to throttle reads > > which are already in cache. > > > > Currently throttling hook is in generic_make_request() and it kicks in > > only if data is not present in page cache and actual disk IO is initiated. > You can always throttle in readpage(). It's not much higher than > generic_make_request() but basically as high as it can get I suspect > (otherwise you'd have to deal with lots of different code paths like page > faults, splice, read, ...). Yep, I was thinking that what do I gain by moving READ throttling up. The only thing generic_make_request() does not catch is network file systems. I think for that I can introduce another hook say in NFS and I might be all set. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 15:11 ` Vivek Goyal @ 2011-04-19 15:22 ` Wu Fengguang 2011-04-19 15:31 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-19 15:22 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > [snip] > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > that point, and the impact on application performance could be huge > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > and __do_fault() will explicitly call > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > This comment was in the context of what if we move block IO controller read > > > throttling also in higher layers. Then we don't want to throttle reads > > > which are already in cache. > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > only if data is not present in page cache and actual disk IO is initiated. > > You can always throttle in readpage(). It's not much higher than > > generic_make_request() but basically as high as it can get I suspect > > (otherwise you'd have to deal with lots of different code paths like page > > faults, splice, read, ...). > > Yep, I was thinking that what do I gain by moving READ throttling up. > The only thing generic_make_request() does not catch is network file > systems. I think for that I can introduce another hook say in NFS and > I might be all set. Basically all data reads go through the readahead layer, and the __do_page_cache_readahead() function. Just one more option for your tradeoffs :) Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 15:22 ` Wu Fengguang @ 2011-04-19 15:31 ` Vivek Goyal 2011-04-19 16:58 ` Wu Fengguang 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 15:31 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > [snip] > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > and __do_fault() will explicitly call > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > which are already in cache. > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > only if data is not present in page cache and actual disk IO is initiated. > > > You can always throttle in readpage(). It's not much higher than > > > generic_make_request() but basically as high as it can get I suspect > > > (otherwise you'd have to deal with lots of different code paths like page > > > faults, splice, read, ...). > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > The only thing generic_make_request() does not catch is network file > > systems. I think for that I can introduce another hook say in NFS and > > I might be all set. > > Basically all data reads go through the readahead layer, and the > __do_page_cache_readahead() function. > > Just one more option for your tradeoffs :) But this does not cover direct IO? But I guess if I split the hook into two parts (one in direct IO path and one in __do_page_cache_readahead()), then filesystems don't have to mark meta data READS. I will look into it. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 15:31 ` Vivek Goyal @ 2011-04-19 16:58 ` Wu Fengguang 2011-04-19 17:05 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-19 16:58 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > [snip] > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > and __do_fault() will explicitly call > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > which are already in cache. > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > You can always throttle in readpage(). It's not much higher than > > > > generic_make_request() but basically as high as it can get I suspect > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > faults, splice, read, ...). > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > The only thing generic_make_request() does not catch is network file > > > systems. I think for that I can introduce another hook say in NFS and > > > I might be all set. > > > > Basically all data reads go through the readahead layer, and the > > __do_page_cache_readahead() function. > > > > Just one more option for your tradeoffs :) > > But this does not cover direct IO? Yes, sorry! 
> But I guess if I split the hook into two parts (one in direct IO path > and one in __do_page_cache_readahead()), then filesystems don't have > to mark meta data READS. I will look into it. Right, and the hooks should be trivial to add. The readahead code is typically invoked in three ways: - sync readahead, on page cache miss, => page_cache_sync_readahead() - async readahead, on hitting PG_readahead (tagged on one page per readahead window), => page_cache_async_readahead() - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() ext3/4 also call into readahead on readdir(). The readahead window size is typically 128K, but much larger for software raid, btrfs and NFS, typically multiple MB and even more. Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
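Fengguang's list above points at the attraction of the split: all three entry points funnel into __do_page_cache_readahead(), so one charge there plus one in the direct IO submission path would cover data reads without filesystems having to tag anything. A rough sketch of the readahead half -- the charge helper is hypothetical, and the function body is reduced to the one added call:

/* Sketch: charge buffered reads once, in the common readahead path that
 * page_cache_sync_readahead(), page_cache_async_readahead() and
 * force_page_cache_readahead() all funnel into.
 * cgroup_charge_read() is a made-up name. */
static int __do_page_cache_readahead(struct address_space *mapping,
				     struct file *filp, pgoff_t offset,
				     unsigned long nr_to_read,
				     unsigned long lookahead_size)
{
	/* Only pages that are not already cached are read here. */
	cgroup_charge_read(current, mapping->host,
			   nr_to_read << PAGE_CACHE_SHIFT);

	/* ... allocate the missing pages and submit them via
	 *     ->readpages()/->readpage() as before ... */
	return 0;
}

The direct IO half would be an equivalent charge where the filesystem's direct IO path submits its bios.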
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 16:58 ` Wu Fengguang @ 2011-04-19 17:05 ` Vivek Goyal 2011-04-19 20:58 ` Jan Kara 2011-04-20 1:16 ` Wu Fengguang 0 siblings, 2 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-19 17:05 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > [snip] > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > and __do_fault() will explicitly call > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > which are already in cache. > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > You can always throttle in readpage(). It's not much higher than > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > faults, splice, read, ...). > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > The only thing generic_make_request() does not catch is network file > > > > systems. I think for that I can introduce another hook say in NFS and > > > > I might be all set. > > > > > > Basically all data reads go through the readahead layer, and the > > > __do_page_cache_readahead() function. 
> > > > > > Just one more option for your tradeoffs :) > > > > But this does not cover direct IO? > > Yes, sorry! > > > But I guess if I split the hook into two parts (one in direct IO path > > and one in __do_page_cache_readahead()), then filesystems don't have > > to mark meta data READS. I will look into it. > > Right, and the hooks should be trivial to add. > > The readahead code is typically invoked in three ways: > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > => page_cache_async_readahead() > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > ext3/4 also call into readahead on readdir(). So this will be called for even meta data READS. Then there is no advantage of moving the throttle hook out of generic_make_request()? Instead what I will need is that ask file systems to mark meta data IO so that I can avoid throttling. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
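On marking metadata so the throttler can skip it: the bio-flag route already half-exists, since some filesystems of this era tag metadata IO with REQ_META when submitting bios. The skip logic below is invented, but it shows how a device-level throttler could give such reads a free pass to avoid the ordered-mode serialisation problem.

/* Filesystem side: tag a metadata read at submission time. */
submit_bio(READ | REQ_META, bio);

/* Throttler side (sketch only): let tagged metadata reads through,
 * since a journalled/ordered filesystem may serialise everyone else's
 * IO behind a throttled metadata read. */
static bool bio_should_be_throttled(struct bio *bio)
{
	if (bio->bi_rw & REQ_META)
		return false;	/* metadata: never delay */
	return true;		/* plain data IO: apply the cgroup's limits */
}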
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 17:05 ` Vivek Goyal @ 2011-04-19 20:58 ` Jan Kara 2011-04-20 1:21 ` Wu Fengguang 2011-04-20 1:16 ` Wu Fengguang 1 sibling, 1 reply; 138+ messages in thread From: Jan Kara @ 2011-04-19 20:58 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Tue 19-04-11 13:05:43, Vivek Goyal wrote: > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > [snip] > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > and __do_fault() will explicitly call > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > which are already in cache. > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > You can always throttle in readpage(). It's not much higher than > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > faults, splice, read, ...). > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > > The only thing generic_make_request() does not catch is network file > > > > > systems. 
I think for that I can introduce another hook say in NFS and > > > > > I might be all set. > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > __do_page_cache_readahead() function. > > > > > > > > Just one more option for your tradeoffs :) > > > > > > But this does not cover direct IO? > > > > Yes, sorry! > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > to mark meta data READS. I will look into it. > > > > Right, and the hooks should be trivial to add. > > > > The readahead code is typically invoked in three ways: > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > => page_cache_async_readahead() > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > ext3/4 also call into readahead on readdir(). > > So this will be called for even meta data READS. Then there is no > advantage of moving the throttle hook out of generic_make_request()? No, generally it won't. I think Fengguang was wrong - only ext2 carries directories in page cache and thus uses readahead code. All other filesystems handle directories specially and don't use readpage for them. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 20:58 ` Jan Kara @ 2011-04-20 1:21 ` Wu Fengguang 2011-04-20 10:56 ` Jan Kara 0 siblings, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-20 1:21 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote: > On Tue 19-04-11 13:05:43, Vivek Goyal wrote: > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > > [snip] > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > > and __do_fault() will explicitly call > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > > which are already in cache. > > > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > > You can always throttle in readpage(). It's not much higher than > > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > > faults, splice, read, ...). > > > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > The only thing generic_make_request() does not catch is network file > > > > > > systems. I think for that I can introduce another hook say in NFS and > > > > > > I might be all set. > > > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > > __do_page_cache_readahead() function. > > > > > > > > > > Just one more option for your tradeoffs :) > > > > > > > > But this does not cover direct IO? > > > > > > Yes, sorry! > > > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > > to mark meta data READS. I will look into it. > > > > > > Right, and the hooks should be trivial to add. > > > > > > The readahead code is typically invoked in three ways: > > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > > => page_cache_async_readahead() > > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > > > ext3/4 also call into readahead on readdir(). > > > > So this will be called for even meta data READS. Then there is no > > advantage of moving the throttle hook out of generic_make_request()? > No, generally it won't. I think Fengguang was wrong - only ext2 carries > directories in page cache and thus uses readahead code. All other > filesystems handle directories specially and don't use readpage for them. So ext2 is implicitly using readahead? ext3/4 behave different in that ext4_readdir() has an explicit call to page_cache_sync_readahead(), passing the blockdev mapping as the page cache container. Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
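For reference, the ext4_readdir() readahead mentioned above is issued against the block device's page cache, since that is where the directory blocks live; in that era's fs/ext4/dir.c the call looks roughly like this (paraphrased, not copied verbatim).

/* Paraphrased from ext4_readdir(): pblk is the physical block number
 * of the directory block that was just mapped. */
pgoff_t index = pblk >> (PAGE_CACHE_SHIFT - inode->i_blkbits);

page_cache_sync_readahead(sb->s_bdev->bd_inode->i_mapping,
			  &filp->f_ra, filp, index, 1);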
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 1:21 ` Wu Fengguang @ 2011-04-20 10:56 ` Jan Kara 2011-04-20 11:19 ` Wu Fengguang 0 siblings, 1 reply; 138+ messages in thread From: Jan Kara @ 2011-04-20 10:56 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Vivek Goyal, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed 20-04-11 09:21:31, Wu Fengguang wrote: > On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote: > > On Tue 19-04-11 13:05:43, Vivek Goyal wrote: > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > > > [snip] > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > > > and __do_fault() will explicitly call > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > > > which are already in cache. > > > > > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > > > You can always throttle in readpage(). It's not much higher than > > > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > > > faults, splice, read, ...). 
> > > > > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > > > > The only thing generic_make_request() does not catch is network file > > > > > > > systems. I think for that I can introduce another hook say in NFS and > > > > > > > I might be all set. > > > > > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > > > __do_page_cache_readahead() function. > > > > > > > > > > > > Just one more option for your tradeoffs :) > > > > > > > > > > But this does not cover direct IO? > > > > > > > > Yes, sorry! > > > > > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > > > to mark meta data READS. I will look into it. > > > > > > > > Right, and the hooks should be trivial to add. > > > > > > > > The readahead code is typically invoked in three ways: > > > > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > > > => page_cache_async_readahead() > > > > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > > > > > ext3/4 also call into readahead on readdir(). > > > > > > So this will be called for even meta data READS. Then there is no > > > advantage of moving the throttle hook out of generic_make_request()? > > No, generally it won't. I think Fengguang was wrong - only ext2 carries > > directories in page cache and thus uses readahead code. All other > > filesystems handle directories specially and don't use readpage for them. > > So ext2 is implicitly using readahead? ext3/4 behave different in that > ext4_readdir() has an explicit call to page_cache_sync_readahead(), > passing the blockdev mapping as the page cache container. Yes, ext2 uses implicitely readahead because it uses read_mapping_page() for directory inodes. I forgot that ext3/4 call page_cache_sync_readahead() so you were right that they actually use it for the device inode. I'm sorry for the noise. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 10:56 ` Jan Kara @ 2011-04-20 11:19 ` Wu Fengguang 2011-04-20 14:42 ` Jan Kara 0 siblings, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-20 11:19 UTC (permalink / raw) To: Jan Kara Cc: Vivek Goyal, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed, Apr 20, 2011 at 06:56:06PM +0800, Jan Kara wrote: > On Wed 20-04-11 09:21:31, Wu Fengguang wrote: > > On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote: > > > On Tue 19-04-11 13:05:43, Vivek Goyal wrote: > > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > > > > [snip] > > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > > > > and __do_fault() will explicitly call > > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > > > > which are already in cache. > > > > > > > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > > > > You can always throttle in readpage(). 
It's not much higher than > > > > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > > > > faults, splice, read, ...). > > > > > > > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > > > > > The only thing generic_make_request() does not catch is network file > > > > > > > > systems. I think for that I can introduce another hook say in NFS and > > > > > > > > I might be all set. > > > > > > > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > > > > __do_page_cache_readahead() function. > > > > > > > > > > > > > > Just one more option for your tradeoffs :) > > > > > > > > > > > > But this does not cover direct IO? > > > > > > > > > > Yes, sorry! > > > > > > > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > > > > to mark meta data READS. I will look into it. > > > > > > > > > > Right, and the hooks should be trivial to add. > > > > > > > > > > The readahead code is typically invoked in three ways: > > > > > > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > > > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > > > > => page_cache_async_readahead() > > > > > > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > > > > > > > ext3/4 also call into readahead on readdir(). > > > > > > > > So this will be called for even meta data READS. Then there is no > > > > advantage of moving the throttle hook out of generic_make_request()? > > > No, generally it won't. I think Fengguang was wrong - only ext2 carries > > > directories in page cache and thus uses readahead code. All other > > > filesystems handle directories specially and don't use readpage for them. > > > > So ext2 is implicitly using readahead? ext3/4 behave different in that > > ext4_readdir() has an explicit call to page_cache_sync_readahead(), > > passing the blockdev mapping as the page cache container. > Yes, ext2 uses implicitely readahead because it uses read_mapping_page() > for directory inodes. I forgot that ext3/4 call > page_cache_sync_readahead() so you were right that they actually use it for > the device inode. I'm sorry for the noise. Never mind. However I cannot find readahead calls in the read_mapping_page() call chain. ext2 readdir() may not be doing readahead at all... read_mapping_page() read_cache_page() read_cache_page_async() do_read_cache_page() __read_cache_page() Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
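The ext2 side, for comparison: fs/ext2/dir.c fetches directory pages one at a time through read_mapping_page(), and nothing in the call chain Fengguang lists touches the readahead state, hence no implicit readahead. A stripped-down paraphrase, with the page-checking details of the real ext2 helper dropped:

/* Paraphrased: ext2's directory page lookup is one synchronous
 * read_mapping_page() -> read_cache_page() -> ->readpage() per page,
 * with no readahead anywhere on that path. */
static struct page *ext2_get_dir_page(struct inode *dir, unsigned long n)
{
	struct page *page = read_mapping_page(dir->i_mapping, n, NULL);

	if (!IS_ERR(page))
		kmap(page);	/* callers expect the page mapped */
	return page;
}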
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 11:19 ` Wu Fengguang @ 2011-04-20 14:42 ` Jan Kara 0 siblings, 0 replies; 138+ messages in thread From: Jan Kara @ 2011-04-20 14:42 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Vivek Goyal, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed 20-04-11 19:19:57, Wu Fengguang wrote: > On Wed, Apr 20, 2011 at 06:56:06PM +0800, Jan Kara wrote: > > On Wed 20-04-11 09:21:31, Wu Fengguang wrote: > > > On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote: > > > > On Tue 19-04-11 13:05:43, Vivek Goyal wrote: > > > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > > > > > [snip] > > > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > > > > > and __do_fault() will explicitly call > > > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > > > > > which are already in cache. > > > > > > > > > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > > > > > You can always throttle in readpage(). 
It's not much higher than > > > > > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > > > > > faults, splice, read, ...). > > > > > > > > > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > > > > > > The only thing generic_make_request() does not catch is network file > > > > > > > > > systems. I think for that I can introduce another hook say in NFS and > > > > > > > > > I might be all set. > > > > > > > > > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > > > > > __do_page_cache_readahead() function. > > > > > > > > > > > > > > > > Just one more option for your tradeoffs :) > > > > > > > > > > > > > > But this does not cover direct IO? > > > > > > > > > > > > Yes, sorry! > > > > > > > > > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > > > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > > > > > to mark meta data READS. I will look into it. > > > > > > > > > > > > Right, and the hooks should be trivial to add. > > > > > > > > > > > > The readahead code is typically invoked in three ways: > > > > > > > > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > > > > > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > > > > > => page_cache_async_readahead() > > > > > > > > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > > > > > > > > > ext3/4 also call into readahead on readdir(). > > > > > > > > > > So this will be called for even meta data READS. Then there is no > > > > > advantage of moving the throttle hook out of generic_make_request()? > > > > No, generally it won't. I think Fengguang was wrong - only ext2 carries > > > > directories in page cache and thus uses readahead code. All other > > > > filesystems handle directories specially and don't use readpage for them. > > > > > > So ext2 is implicitly using readahead? ext3/4 behave different in that > > > ext4_readdir() has an explicit call to page_cache_sync_readahead(), > > > passing the blockdev mapping as the page cache container. > > Yes, ext2 uses implicitely readahead because it uses read_mapping_page() > > for directory inodes. I forgot that ext3/4 call > > page_cache_sync_readahead() so you were right that they actually use it for > > the device inode. I'm sorry for the noise. > > Never mind. However I cannot find readahead calls in the > read_mapping_page() call chain. ext2 readdir() may not be doing > readahead at all... > > read_mapping_page() > read_cache_page() > read_cache_page_async() > do_read_cache_page() > __read_cache_page() Right, I've now checked the real code and it would have to use read_cache_pages() to have some readahead. I'm not completely sure where did I get from that ext2 performs directory readahead - some papers about ext2 I found in the Internet say so and I believe Andrew mentioned it as well. But I cannot find a kernel where this would happen... So thanks for correcting me :). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
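(For reference, a minimal sketch of the difference being discussed, paraphrased from kernels of that era rather than quoted verbatim: ext4_readdir() issues explicit readahead against the block device mapping, while ext2's directory code reads one page at a time through read_mapping_page(), which never reaches the readahead layer.)

	/* ext4 (fs/ext4/dir.c, paraphrased): explicit readahead on the block
	 * device mapping before reading a directory block. */
	if (!ra_has_index(&filp->f_ra, index))
		page_cache_sync_readahead(sb->s_bdev->bd_inode->i_mapping,
					  &filp->f_ra, filp, index, 1);

	/* ext2 (fs/ext2/dir.c, paraphrased): a plain single-page read that goes
	 * through do_read_cache_page() and therefore bypasses readahead. */
	page = read_mapping_page(dir->i_mapping, n, NULL);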
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-19 17:05 ` Vivek Goyal 2011-04-19 20:58 ` Jan Kara @ 2011-04-20 1:16 ` Wu Fengguang 2011-04-20 18:44 ` Vivek Goyal 1 sibling, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-20 1:16 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote: > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > [snip] > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > and __do_fault() will explicitly call > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > which are already in cache. > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > You can always throttle in readpage(). It's not much higher than > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > faults, splice, read, ...). > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > > The only thing generic_make_request() does not catch is network file > > > > > systems. 
I think for that I can introduce another hook say in NFS and > > > > > I might be all set. > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > __do_page_cache_readahead() function. > > > > > > > > Just one more option for your tradeoffs :) > > > > > > But this does not cover direct IO? > > > > Yes, sorry! > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > to mark meta data READS. I will look into it. > > > > Right, and the hooks should be trivial to add. > > > > The readahead code is typically invoked in three ways: > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > => page_cache_async_readahead() > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > ext3/4 also call into readahead on readdir(). > > So this will be called for even meta data READS. Then there is no > advantage of moving the throttle hook out of generic_make_request()? > Instead what I will need is that ask file systems to mark meta data > IO so that I can avoid throttling. Do you want to avoid meta data itself, or to avoid overall performance being impacted as a result of meta data read throttling? Either way, you have the freedom to test whether the passed filp is a normal file or a directory "file", and do conditional throttling. Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
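(A minimal sketch of the conditional throttling Fengguang suggests above. The hook placement and the blk_throtl_charge_and_wait() helper are hypothetical; only the S_ISREG() test on the mapping's host inode is a real kernel interface.)

	/* Hypothetical read-side hook, e.g. called from __do_page_cache_readahead(). */
	static void readahead_throttle_hook(struct address_space *mapping,
					    unsigned long nr_pages)
	{
		struct inode *inode = mapping->host;

		/* Directory and block-device mappings are metadata reads:
		 * account them, but do not block the task here. */
		if (!S_ISREG(inode->i_mode))
			return;

		/* Hypothetical blkio-cgroup call: charge nr_pages of READ and
		 * sleep if the task's group is over its configured rate. */
		blk_throtl_charge_and_wait(current, READ, nr_pages);
	}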
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 1:16 ` Wu Fengguang @ 2011-04-20 18:44 ` Vivek Goyal 2011-04-20 19:16 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 138+ messages in thread From: Vivek Goyal @ 2011-04-20 18:44 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote: > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote: > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > > [snip] > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > > and __do_fault() will explicitly call > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > > which are already in cache. > > > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > > You can always throttle in readpage(). It's not much higher than > > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > > faults, splice, read, ...). > > > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > The only thing generic_make_request() does not catch is network file > > > > > > systems. I think for that I can introduce another hook say in NFS and > > > > > > I might be all set. > > > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > > __do_page_cache_readahead() function. > > > > > > > > > > Just one more option for your tradeoffs :) > > > > > > > > But this does not cover direct IO? > > > > > > Yes, sorry! > > > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > > to mark meta data READS. I will look into it. > > > > > > Right, and the hooks should be trivial to add. > > > > > > The readahead code is typically invoked in three ways: > > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > > => page_cache_async_readahead() > > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > > > ext3/4 also call into readahead on readdir(). > > > > So this will be called for even meta data READS. Then there is no > > advantage of moving the throttle hook out of generic_make_request()? > > Instead what I will need is that ask file systems to mark meta data > > IO so that I can avoid throttling. > > Do you want to avoid meta data itself, or to avoid overall performance > being impacted as a result of meta data read throttling? I wanted to avoid throttling metadata because it might lead to reduced overall performance due to dependencies in file system layer. > > Either way, you have the freedom to test whether the passed filp is a > normal file or a directory "file", and do conditional throttling. Ok, will look into it. That will probably take care of READS. What about WRITES and meta data. Is it safe to assume that any meta data write will come in some journalling thread context and not in user process context? Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 18:44 ` Vivek Goyal @ 2011-04-20 19:16 ` Jan Kara 2011-04-21 0:17 ` Dave Chinner 2011-04-21 15:06 ` Wu Fengguang 2 siblings, 0 replies; 138+ messages in thread From: Jan Kara @ 2011-04-20 19:16 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Wed 20-04-11 14:44:33, Vivek Goyal wrote: > Ok, will look into it. That will probably take care of READS. What > about WRITES and meta data. Is it safe to assume that any meta data > write will come in some jounalling thread context and not in user > process context? For ext3/4 it is either the journal thread context or the flusher thread context: after metadata is written to the journal by the journal thread, the buffers are left dirty in the block device's page cache, so the flusher thread can come along and write them - and these writes hold the buffer lock and thus also block any manipulation of the metadata. I don't know about other filesystems. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 18:44 ` Vivek Goyal 2011-04-20 19:16 ` Jan Kara @ 2011-04-21 0:17 ` Dave Chinner 2011-04-21 15:06 ` Wu Fengguang 2 siblings, 0 replies; 138+ messages in thread From: Dave Chinner @ 2011-04-21 0:17 UTC (permalink / raw) To: Vivek Goyal Cc: Wu Fengguang, Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org On Wed, Apr 20, 2011 at 02:44:33PM -0400, Vivek Goyal wrote: > On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote: > > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote: > > > So this will be called for even meta data READS. Then there is no > > > advantage of moving the throttle hook out of generic_make_request()? > > > Instead what I will need is that ask file systems to mark meta data > > > IO so that I can avoid throttling. > > > > Do you want to avoid meta data itself, or to avoid overall performance > > being impacted as a result of meta data read throttling? > > I wanted to avoid throttling metadata beacause it might lead to reduced > overall performance due to dependencies in file system layer. > > > > > Either way, you have the freedom to test whether the passed filp is a > > normal file or a directory "file", and do conditional throttling. > > Ok, will look into it. That will probably take care of READS. What > about WRITES and meta data. Is it safe to assume that any meta data > write will come in some jounalling thread context and not in user > process context? No. Journal writes in XFS come from the context that forces them to occur, whether it be user, bdi-flusher or background kernel thread context. Indeed, we can even have journal writes coming from workqueues and there is a possibility that they will always come from a workqueue context in the next release or so. As for metadata buffer writes themselves, they currently come from background kernel threads or workqueues in most normal operational cases. However, in certain situations (e.g. sync(1), filesystem freeze and unmount) we can issue write IO on metadata buffers directly from the user process context.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 138+ messages in thread
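(Since the submitting context cannot be relied upon, as Dave describes above, the "mark meta data IO" idea from earlier in the thread would presumably have to be carried in the bio itself. A rough sketch; REQ_META was already a real request flag in kernels of this era, but the controller-side check is hypothetical.)

	/* Filesystem side: tag the bio as metadata when it is built. */
	bio->bi_rw |= REQ_META;
	submit_bio(WRITE, bio);

	/* Controller side (hypothetical check in the throttling path):
	 * account the IO to the group but never delay metadata bios. */
	if (bio->bi_rw & REQ_META)
		return false;	/* do not throttle */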
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-20 18:44 ` Vivek Goyal 2011-04-20 19:16 ` Jan Kara 2011-04-21 0:17 ` Dave Chinner @ 2011-04-21 15:06 ` Wu Fengguang 2011-04-21 15:10 ` Wu Fengguang 2011-04-21 17:20 ` Vivek Goyal 2 siblings, 2 replies; 138+ messages in thread From: Wu Fengguang @ 2011-04-21 15:06 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Thu, Apr 21, 2011 at 02:44:33AM +0800, Vivek Goyal wrote: > On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote: > > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote: > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote: > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote: > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote: > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote: > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote: > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote: > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote: > > > > > > > > > > [snip] > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers, > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early. > > > > > > > > > > > > > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that > > > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at > > > > > > > > > > > > > that point, and the impact on application performance could be huge > > > > > > > > > > > > > (i.e. MB/s instead of GB/s). > > > > > > > > > > > > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly. > > > > > > > > > > > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no > > > > > > > > > > > context of whether a read or a write is in progress at that point. > > > > > > > > > > > > > > > > > > > > I'm confused. Where is the throttling at cache hits? > > > > > > > > > > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page() > > > > > > > > > > and __do_fault() will explicitly call > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling. > > > > > > > > > > > > > > > > > > This comment was in the context of what if we move block IO controller read > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads > > > > > > > > > which are already in cache. > > > > > > > > > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated. > > > > > > > > You can always throttle in readpage(). 
It's not much higher than > > > > > > > > generic_make_request() but basically as high as it can get I suspect > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page > > > > > > > > faults, splice, read, ...). > > > > > > > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. > > > > > > > The only thing generic_make_request() does not catch is network file > > > > > > > systems. I think for that I can introduce another hook say in NFS and > > > > > > > I might be all set. > > > > > > > > > > > > Basically all data reads go through the readahead layer, and the > > > > > > __do_page_cache_readahead() function. > > > > > > > > > > > > Just one more option for your tradeoffs :) > > > > > > > > > > But this does not cover direct IO? > > > > > > > > Yes, sorry! > > > > > > > > > But I guess if I split the hook into two parts (one in direct IO path > > > > > and one in __do_page_cache_readahead()), then filesystems don't have > > > > > to mark meta data READS. I will look into it. > > > > > > > > Right, and the hooks should be trivial to add. > > > > > > > > The readahead code is typically invoked in three ways: > > > > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead() > > > > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window), > > > > => page_cache_async_readahead() > > > > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead() > > > > > > > > ext3/4 also call into readahead on readdir(). > > > > > > So this will be called for even meta data READS. Then there is no > > > advantage of moving the throttle hook out of generic_make_request()? > > > Instead what I will need is that ask file systems to mark meta data > > > IO so that I can avoid throttling. > > > > Do you want to avoid meta data itself, or to avoid overall performance > > being impacted as a result of meta data read throttling? > > I wanted to avoid throttling metadata beacause it might lead to reduced > overall performance due to dependencies in file system layer. You can get meta data "throttling" and performance at the same time. See below ideas. > > > > Either way, you have the freedom to test whether the passed filp is a > > normal file or a directory "file", and do conditional throttling. > > Ok, will look into it. That will probably take care of READS. What > about WRITES and meta data. Is it safe to assume that any meta data > write will come in some jounalling thread context and not in user > process context? It's very possible to throttle meta data READS/WRITES, as long as they can be attributed to the original task (assuming task oriented throttling instead of bio/request oriented). The trick is to separate the concepts of THROTTLING and ACCOUNTING. You can ACCOUNT data and meta data reads/writes to the right task, and only to THROTTLE the task when it's doing data reads/writes. FYI I played the same trick for balance_dirty_pages_ratelimited() for another reason: _accurate_ accounting of dirtied pages. That trick should play well with most applications who do interleaved data and meta data reads/writes. For the special case of "find" who does pure meta data reads, we can still throttle it by playing another trick: to THROTTLE meta data reads/writes with a much higher threshold than that of data. So normal applications will be almost always be throttled at data accesses while "find" will be throttled at meta data accesses. 
For a real example of how it works, you can check this patch (plus the attached one) writeback: IO-less balance_dirty_pages() http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556 Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause is the threshold for THROTTLING. When tsk->nr_dirtied > tsk->nr_dirtied_pause The task will voluntarily enter balance_dirty_pages() for taking a nap (pause time will be proportional to tsk->nr_dirtied), and when finished, start a new account-and-throttle period by resetting tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more reasonable pause time at next sleep. BTW, I'd like to advocate balance_dirty_pages() based IO controller :) As you may have noticed, it's not all that hard: the main functions blkcg_update_bandwidth()/blkcg_update_dirty_ratelimit() can fit nicely in one screen! writeback: async write IO controllers http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hp=5b6fcb3125ea52ff04a2fad27a51307842deb1a0 Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
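(For illustration, a minimal sketch of the ACCOUNTING vs THROTTLING split and the "much higher threshold for meta data" trick described above. current->nr_dirtied and nr_dirtied_pause follow the referenced patches; the helper name, the metadata flag and the x16 factor are purely illustrative.)

	static void account_and_maybe_throttle(struct address_space *mapping,
					       unsigned long nr_pages, bool metadata)
	{
		unsigned long pause_at;

		/* ACCOUNTING: always attribute the IO to the current task. */
		current->nr_dirtied += nr_pages;

		/* THROTTLING: data pauses at the normal threshold, metadata only
		 * at a much higher one, so interleaved workloads are paused on
		 * their data accesses while "find"-style pure metadata loads
		 * still stay bounded. */
		pause_at = metadata ? 16 * current->nr_dirtied_pause
				    : current->nr_dirtied_pause;

		if (current->nr_dirtied > pause_at)
			balance_dirty_pages(mapping, current->nr_dirtied);
	}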
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-21 15:06 ` Wu Fengguang @ 2011-04-21 15:10 ` Wu Fengguang 2011-04-21 17:20 ` Vivek Goyal 1 sibling, 0 replies; 138+ messages in thread From: Wu Fengguang @ 2011-04-21 15:10 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 2294 bytes --] Sorry, attached is the "separate ACCOUNTING from THROTTLING" patch. > It's very possible to throttle meta data READS/WRITES, as long as they > can be attributed to the original task (assuming task oriented throttling > instead of bio/request oriented). > > The trick is to separate the concepts of THROTTLING and ACCOUNTING. > You can ACCOUNT data and meta data reads/writes to the right task, and > only to THROTTLE the task when it's doing data reads/writes. > > FYI I played the same trick for balance_dirty_pages_ratelimited() for > another reason: _accurate_ accounting of dirtied pages. > > That trick should play well with most applications who do interleaved > data and meta data reads/writes. For the special case of "find" who > does pure meta data reads, we can still throttle it by playing another > trick: to THROTTLE meta data reads/writes with a much higher threshold > than that of data. So normal applications will be almost always be > throttled at data accesses while "find" will be throttled at meta data > accesses. > > For a real example of how it works, you can check this patch (plus the > attached one) > > writeback: IO-less balance_dirty_pages() > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556 > > Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause > is the threshold for THROTTLING. When > > tsk->nr_dirtied > tsk->nr_dirtied_pause > > The task will voluntarily enter balance_dirty_pages() for taking a > nap (pause time will be proportional to tsk->nr_dirtied), and when > finished, start a new account-and-throttle period by resetting > tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more > reasonable pause time at next sleep. > > BTW, I'd like to advocate balance_dirty_pages() based IO controller :) > > As you may have noticed, it's not all that hard: the main functions > blkcg_update_bandwidth()/blkcg_update_dirty_ratelimit() can fit nicely > in one screen! 
> > writeback: async write IO controllers > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hp=5b6fcb3125ea52ff04a2fad27a51307842deb1a0 > > Thanks, > Fengguang [-- Attachment #2: writeback-accurate-task-dirtied.patch --] [-- Type: text/x-diff, Size: 924 bytes --] Subject: writeback: accurately account dirtied pages Date: Thu Apr 14 07:52:37 CST 2011 Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- linux-next.orig/mm/page-writeback.c 2011-04-16 11:28:41.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-04-16 11:28:41.000000000 +0800 @@ -1352,8 +1352,6 @@ void balance_dirty_pages_ratelimited_nr( if (!bdi_cap_account_dirty(bdi)) return; - current->nr_dirtied += nr_pages_dirtied; - if (dirty_exceeded_recently(bdi, MAX_PAUSE)) { unsigned long max = current->nr_dirtied + (128 >> (PAGE_SHIFT - 10)); @@ -1819,6 +1817,7 @@ void account_page_dirtied(struct page *p __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED); task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); + current->nr_dirtied++; } } EXPORT_SYMBOL(account_page_dirtied); ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-21 15:06 ` Wu Fengguang 2011-04-21 15:10 ` Wu Fengguang @ 2011-04-21 17:20 ` Vivek Goyal 2011-04-22 4:21 ` Wu Fengguang 1 sibling, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-21 17:20 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Thu, Apr 21, 2011 at 11:06:18PM +0800, Wu Fengguang wrote: [..] > > You can get meta data "throttling" and performance at the same time. > See below ideas. > > > > > > > Either way, you have the freedom to test whether the passed filp is a > > > normal file or a directory "file", and do conditional throttling. > > > > Ok, will look into it. That will probably take care of READS. What > > about WRITES and meta data. Is it safe to assume that any meta data > > write will come in some jounalling thread context and not in user > > process context? > > It's very possible to throttle meta data READS/WRITES, as long as they > can be attributed to the original task (assuming task oriented throttling > instead of bio/request oriented). Even in bio oriented throttling we attribute the bio to a task and hence to the group (atleast as of today). So from that perspective, it should not make much difference. > > The trick is to separate the concepts of THROTTLING and ACCOUNTING. > You can ACCOUNT data and meta data reads/writes to the right task, and > only to THROTTLE the task when it's doing data reads/writes. Agreed. I too mentioned this idea in one of the mails that account meta data but do not throttle meta data and use that meta data accounting to throttle data for longer period of times. For this to implement, I need to know whether an IO is regular IO or metadata IO and looks like one of the ways will that filesystems mark that info in bio for meta data requests. > > FYI I played the same trick for balance_dirty_pages_ratelimited() for > another reason: _accurate_ accounting of dirtied pages. > > That trick should play well with most applications who do interleaved > data and meta data reads/writes. For the special case of "find" who > does pure meta data reads, we can still throttle it by playing another > trick: to THROTTLE meta data reads/writes with a much higher threshold > than that of data. So normal applications will be almost always be > throttled at data accesses while "find" will be throttled at meta data > accesses. Ok, that makes sense. If an application is doing lots of meta data transactions only then try to limit it after some high limit I am not very sure if it will run into issues of some file system dependencies and hence priority inversion. > > For a real example of how it works, you can check this patch (plus the > attached one) Ok, I will go through the patches for more details. > > writeback: IO-less balance_dirty_pages() > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556 > > Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause > is the threshold for THROTTLING. 
When > > tsk->nr_dirtied > tsk->nr_dirtied_pause > > The task will voluntarily enter balance_dirty_pages() for taking a > nap (pause time will be proportional to tsk->nr_dirtied), and when > finished, start a new account-and-throttle period by resetting > tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more > reasonable pause time at next sleep. > > BTW, I'd like to advocate balance_dirty_pages() based IO controller :) > Actually implementing throttling in balance_dirty_pages() is not hard. I think it has following issues. - One controls the IO rate coming into the page cache and does not control the IO rate at the outgoing devices. So a flusher thread can still throw lots of writes at a device and completely disrupting read latencies. If buffered WRITES can disrupt READ latencies unexpectedly, then it kind of renders IO controller/throttling useless. - For the application performance, I thought a better mechanism would be that we come up with per cgroup dirty ratio. This is equivalent to partitioning the page cache and coming up with cgroup's share. Now an application can write to this cache as fast as it want and is only throttled either by balance_dirty_pages() rules. All this IO must be going to some device and if an admin has put this cgroup in a low bandwidth group, then pages from this cgroup will be written slowly hence tasks in this group will be blocked for longer time. If we can make this work, then application can write to cache at higher rate at the same time not create a havoc at the end device. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
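(A sketch of the per-cgroup dirty ratio idea above: give each cgroup a dirty threshold proportional to its share of memory. vm_dirty_ratio is the existing global knob; the memcg helper is hypothetical, since memcg dirty limits were still being developed at the time.)

	static unsigned long memcg_dirty_thresh(struct mem_cgroup *memcg)
	{
		/* Hypothetical helper returning the cgroup's memory limit in pages. */
		unsigned long memcg_pages = mem_cgroup_limit_pages(memcg);

		return memcg_pages * vm_dirty_ratio / 100;
	}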
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-21 17:20 ` Vivek Goyal @ 2011-04-22 4:21 ` Wu Fengguang 2011-04-22 15:25 ` Vivek Goyal 0 siblings, 1 reply; 138+ messages in thread From: Wu Fengguang @ 2011-04-22 4:21 UTC (permalink / raw) To: Vivek Goyal Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Fri, Apr 22, 2011 at 01:20:40AM +0800, Vivek Goyal wrote: > On Thu, Apr 21, 2011 at 11:06:18PM +0800, Wu Fengguang wrote: > > [..] > > > > You can get meta data "throttling" and performance at the same time. > > See below ideas. > > > > > > > > > > Either way, you have the freedom to test whether the passed filp is a > > > > normal file or a directory "file", and do conditional throttling. > > > > > > Ok, will look into it. That will probably take care of READS. What > > > about WRITES and meta data. Is it safe to assume that any meta data > > > write will come in some jounalling thread context and not in user > > > process context? > > > > It's very possible to throttle meta data READS/WRITES, as long as they > > can be attributed to the original task (assuming task oriented throttling > > instead of bio/request oriented). > > Even in bio oriented throttling we attribute the bio to a task and hence > to the group (atleast as of today). So from that perspective, it should > not make much difference. OK, good to learn about that :) > > > > The trick is to separate the concepts of THROTTLING and ACCOUNTING. > > You can ACCOUNT data and meta data reads/writes to the right task, and > > only to THROTTLE the task when it's doing data reads/writes. > > Agreed. I too mentioned this idea in one of the mails that account meta data > but do not throttle meta data and use that meta data accounting to throttle > data for longer period of times. That's great. > For this to implement, I need to know whether an IO is regular IO or > metadata IO and looks like one of the ways will that filesystems mark > that info in bio for meta data requests. OK. > > > > FYI I played the same trick for balance_dirty_pages_ratelimited() for > > another reason: _accurate_ accounting of dirtied pages. > > > > That trick should play well with most applications who do interleaved > > data and meta data reads/writes. For the special case of "find" who > > does pure meta data reads, we can still throttle it by playing another > > trick: to THROTTLE meta data reads/writes with a much higher threshold > > than that of data. So normal applications will be almost always be > > throttled at data accesses while "find" will be throttled at meta data > > accesses. > > Ok, that makes sense. If an application is doing lots of meta data > transactions only then try to limit it after some high limit > > I am not very sure if it will run into issues of some file system > dependencies and hence priority inversion. It's safe at least for task-context reads? For meta data writes, we may also differentiate task-context DIRTY, kernel-context DIRTY and WRITEOUT. We should still be able to throttle task-context meta data DIRTY, probably not for kernel-context DIRTY, and never for WRITEOUT. > > For a real example of how it works, you can check this patch (plus the > > attached one) > > Ok, I will go through the patches for more details. Thanks! FYI this document describes the basic ideas in the first 14 pages. 
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf > > > > writeback: IO-less balance_dirty_pages() > > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556 > > > > Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause > > is the threshold for THROTTLING. When > > > > tsk->nr_dirtied > tsk->nr_dirtied_pause > > > > The task will voluntarily enter balance_dirty_pages() for taking a > > nap (pause time will be proportional to tsk->nr_dirtied), and when > > finished, start a new account-and-throttle period by resetting > > tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more > > reasonable pause time at next sleep. > > > > BTW, I'd like to advocate balance_dirty_pages() based IO controller :) > > > > Actually implementing throttling in balance_dirty_pages() is not hard. I > think it has following issues. > > - One controls the IO rate coming into the page cache and does not control > the IO rate at the outgoing devices. So a flusher thread can still throw > lots of writes at a device and completely disrupting read latencies. > > If buffered WRITES can disrupt READ latencies unexpectedly, then it kind > of renders IO controller/throttling useless. Hmm..I doubt IO controller is the right solution to this problem at all. It's such a fundamental problem that it would be Linux's failure to recommend normal users to use IO controller for the sake of good read latencies in the presence of heavy writes. It actually helps reducing seeks when the flushers submit async write requests in bursts (eg. 1 second). It will then kind of optimally "work on this bdi area on behalf of this flusher for 1 second, and then to the other area for 1 second...". The IO scheduler should have similar optimizations, which should generally work better with more clustered data supplies from the flushers. (Sorry I'm not tracking the cfq code, so it's all general hypothesis and please correct me...) The IO scheduler looks like the right owner to safeguard read latencies. Where you already have the commit 365722bb917b08b7 ("cfq-iosched: delay async IO dispatch, if sync IO was just done") and friends. They do such a good job that if there are continual reads, the async writes will be totally starved. But yeah that still leaves sporadic reads at the mercy of heavy writes, where the default policy will prefer write throughput to read latencies. And there is the "no heavy writes to saturate the disk in long term, but still temporal heavy writes created by the bursty flushing" case. In this case the device level throttling has the nice side effect of smoothing writes out without performance penalties. However, if it's so useful so that you regard it as an important target, why not build some smoothing logic into the flushers? It has the great prospect of benefiting _all_ users _by default_ :) > - For the application performance, I thought a better mechanism would be > that we come up with per cgroup dirty ratio. This is equivalent to > partitioning the page cache and coming up with cgroup's share. Now > an application can write to this cache as fast as it want and is only > throttled either by balance_dirty_pages() rules. > > All this IO must be going to some device and if an admin has put this cgroup > in a low bandwidth group, then pages from this cgroup will be written > slowly hence tasks in this group will be blocked for longer time. 
> > If we can make this work, then application can write to cache at higher > rate at the same time not create a havoc at the end device. The memcg dirty ratio is fundamentally different from blkio throttling. The former aims to eliminate excessive pageout()s when reclaiming pages from the memcg LRU lists. It treats "dirty pages" as throttle goal, and has the side effect throttling the task at the rate the memcg's dirty inodes can be flushed to disk. Its complexity originates from the correlation with "how the flusher selects the inodes to writeout". Unfortunately the flusher by nature works in a coarse way.. OTOH, blkio-cgroup don't need to care about inode selection at all. It's enough to account and throttle tasks' dirty rate, and let the flusher freely work on whatever dirtied inodes. In this manner, blkio-cgroup dirty rate throttling is more user oriented. While memcg dirty pages throttling looks like a complex solution to some technical problems (if me understand it right). The blkio-cgroup dirty throttling code can mainly go to page-writeback.c, while the memcg code will mainly go to fs-writeback.c (balance_dirty_pages() will also be involved, but that's actually a more trivial part). The correlations seem to be, - you can get the page tagging functionality from memcg, if doing async write throttling at device level - the side effect of rate limiting by memcg's dirty pages throttling, which is much less controllable than blkio-cgroup's rate limiting Thanks, Fengguang ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF) 2011-04-22 4:21 ` Wu Fengguang @ 2011-04-22 15:25 ` Vivek Goyal 2011-04-22 16:28 ` Andrea Arcangeli 0 siblings, 1 reply; 138+ messages in thread From: Vivek Goyal @ 2011-04-22 15:25 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, Dave Chinner On Fri, Apr 22, 2011 at 12:21:23PM +0800, Wu Fengguang wrote: [..] > > > BTW, I'd like to advocate balance_dirty_pages() based IO controller :) > > > > > > > Actually implementing throttling in balance_dirty_pages() is not hard. I > > think it has following issues. > > > > - One controls the IO rate coming into the page cache and does not control > > the IO rate at the outgoing devices. So a flusher thread can still throw > > lots of writes at a device and completely disrupting read latencies. > > > > If buffered WRITES can disrupt READ latencies unexpectedly, then it kind > > of renders IO controller/throttling useless. > > Hmm..I doubt IO controller is the right solution to this problem at all. > > It's such a fundamental problem that it would be Linux's failure to > recommend normal users to use IO controller for the sake of good read > latencies in the presence of heavy writes. It is and we have modified CFQ a lot to tackle that but still... Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root disk and then try to launch firefox and browse few websites and see if you are happy with the response of the firefox. It took me more than a minute to launch firefox and be able to input and load first website. But I agree that READ latencies in presence of WRITES can be a problem independent of IO controller. Also there is another case, a cluster where IO is coming to storage from multiple hosts, and one probably does not want a flurry of WRITES from one host to severely impact the IO of the other hosts. In that case the IO scheduler can't do much, as it only has the view of a single system. Secondly, the whole point of the IO controller is that it provides the user more control of IO instead of living with a default system specific policy. For example, an admin might want to just look for better latencies for READS and is willing to give up on WRITE throughput. So if the IO controller is properly implemented, he might put his WRITE intensive application in a cgroup with a WRITE limit of 20MB/s. Now the READ latencies in the root cgroup should be better, and may be predictable too, as we know the WRITE rate to disk never exceeds 20MB/s. Also it is only CFQ which provides READS so much preference over WRITES. deadline and noop do not which we typically use on faster storage. There we might take a bigger hit on READ latencies depending on what storage is and how affected it is with a burst of WRITES. I guess it boils down to better system control and better predictability. So I think throttling buffered writes in balance_dirty_pages() is better than not providing any way to control buffered WRITES at all, but controlling it at the end device provides much better control over IO and serves more use cases. > > It actually helps reducing seeks when the flushers submit async write > requests in bursts (eg. 1 second). It will then kind of optimally > "work on this bdi area on behalf of this flusher for 1 second, and > then to the other area for 1 second...". 
The IO scheduler should have > similar optimizations, which should generally work better with more > clustered data supplies from the flushers. (Sorry I'm not tracking the > cfq code, so it's all general hypothesis and please correct me...) > Isolation and throughput are orthogonal. You go for better isolation and you essentially pay with reduced throughput. As a user one can decide what his priorities are. I see it as a slider where one end is 100% isolation and the other end is 100% throughput. A user can slide the slider and keep it somewhere in between depending on his/her needs. One of the goals of the IO controller is to provide that fine grained control. By implementing throttling in balance_dirty_pages() we really lose that capability. Also, the flusher will still submit requests in bursts. The flusher will still pick one inode at a time so IO is as sequential as possible. We will still do the IO-less throttling to reduce seeks. If we do IO throttling below the page cache, it also gives us the capability to control the flusher IO burst. That gives the user fine grained control which is lost if we do the control while entering the page cache. > The IO scheduler looks like the right owner to safeguard read latencies. > Where you already have the commit 365722bb917b08b7 ("cfq-iosched: > delay async IO dispatch, if sync IO was just done") and friends. > They do such a good job that if there are continual reads, the async > writes will be totally starved. > > But yeah that still leaves sporadic reads at the mercy of heavy > writes, where the default policy will prefer write throughput to read > latencies. Well, there is no default policy as such. CFQ tries to prioritize READs as much as it can. Deadline does not as much. So as I said previously, we really are not controlling the burst. We are leaving it to the IO scheduler to handle as per its policy and lose isolation between the groups, which is the primary purpose of the IO controller. IOW, doing throttling below the page cache allows us much better/smoother control of IO. > > And there is the "no heavy writes to saturate the disk in long term, > but still temporal heavy writes created by the bursty flushing" case. > In this case the device level throttling has the nice side effect of > smoothing writes out without performance penalties. However, if it's > so useful so that you regard it as an important target, why not build > some smoothing logic into the flushers? It has the great prospect of > benefiting _all_ users _by default_ :) We have already implemented the control at lower layers, so we really don't have to build a secondary control now; the rest of the subsystems just have to be aware of cgroups and play nicely. At a high level, smoothing logic is just another throttling technique - whether to throttle the process abruptly or to use a more complex technique to smooth out the traffic - it is just a knob. The key question here is where to put the knob in the stack for the maximum degree of control. Flusher logic is already complicated. I am not sure what we would gain by training the flushers about IO rates and throttling based on user policies. We can let the lower layers do it, as long as we can make sure the flusher is aware of cgroups and can select inodes to flush in such a manner that it does not get blocked behind slow cgroups and can keep all the cgroups busy. The challenge I am facing here is the file system dependencies on IO. 
One example is that if I throttle fsync IO, then it leads to issues with journalling, and other IO in the filesystem seems to stop. > > > - For the application performance, I thought a better mechanism would be > > that we come up with per cgroup dirty ratio. This is equivalent to > > partitioning the page cache and coming up with cgroup's share. Now > > an application can write to this cache as fast as it want and is only > > throttled either by balance_dirty_pages() rules. > > > > All this IO must be going to some device and if an admin has put this cgroup > > in a low bandwidth group, then pages from this cgroup will be written > > slowly hence tasks in this group will be blocked for longer time. > > > > If we can make this work, then application can write to cache at higher > > rate at the same time not create a havoc at the end device. > > The memcg dirty ratio is fundamentally different from blkio > throttling. The former aims to eliminate excessive pageout()s when > reclaiming pages from the memcg LRU lists. It treats "dirty pages" as > throttle goal, and has the side effect throttling the task at the rate > the memcg's dirty inodes can be flushed to disk. Its complexity > originates from the correlation with "how the flusher selects the > inodes to writeout". Unfortunately the flusher by nature works in a > coarse way.. The memcg dirty ratio is a different problem, but it needs to work with the IO controller to solve the whole issue. If IO were just direct IO, with no page cache in the picture, we would not need memcg. But the moment the page cache comes into the picture, the notion of logically dividing that page cache among cgroups immediately follows, and with it the notion of a per cgroup dirty ratio, so that even if overall cache usage is low, once this cgroup has consumed its share of dirty pages we need to throttle it and ask the flusher to send IO to the underlying devices. The IO controller sits below the page cache. So we need to make sure that memcg is enhanced to support a per cgroup dirty ratio, and train the flusher threads so that they are aware of cgroup presence and can do writeout in a per-memcg aware manner. Greg Thelen is working on putting these two pieces together. So the memcg dirty ratio is a different problem but is required to make the IO controller work for buffered WRITES. > > OTOH, blkio-cgroup don't need to care about inode selection at all. > It's enough to account and throttle tasks' dirty rate, and let the > flusher freely work on whatever dirtied inodes. That goes back to the model of putting the knob in balance_dirty_pages(). Yes, it simplifies the implementation, but it also takes away the capability of better control. One would still see the burst of WRITES at the end devices. > > In this manner, blkio-cgroup dirty rate throttling is more user > oriented. While memcg dirty pages throttling looks like a complex > solution to some technical problems (if me understand it right). If we implement IO throttling in balance_dirty_pages(), then we don't require the memcg dirty ratio for it to work. But we will still require the memcg dirty ratio for other reasons. - Proportional IO control for CFQ - memcg's own problem of having to start writing out pages from a cgroup earlier. > > The blkio-cgroup dirty throttling code can mainly go to > page-writeback.c, while the memcg code will mainly go to > fs-writeback.c (balance_dirty_pages() will also be involved, but > that's actually a more trivial part). 
> > The correlations seem to be, > > - you can get the page tagging functionality from memcg, if doing > async write throttling at device level > > - the side effect of rate limiting by memcg's dirty pages throttling, > which is much less controllable than blkio-cgroup's rate limiting Well, I thought memcg's per cgroup ratio and the IO controller's rate limit would work together. memcg will keep track of the per cgroup share of the page cache, and when cache usage is more than a certain percentage it will ask the flusher to send IO to the device, and then the IO controller will throttle that IO. Now if the rate limit of the cgroup is low, then tasks of that cgroup will be throttled for longer in balance_dirty_pages(). So throttling is happening at two layers. One throttling is in balance_dirty_pages() which is actually not dependent on user inputted parameters. It is more dependent on what's the page cache share of this cgroup and what's the effective IO rate this cgroup is getting. The real IO throttling is happening at device level which is dependent on parameters inputted by user and which in-turn indirectly should decide how tasks are throttled in balance_dirty_pages(). I have yet to look at your implementation of throttling, but keep in mind that once the IO controller comes into the picture, the throttling/smoothing mechanism also needs to be able to take into account direct writes, and we should be able to use the same algorithms for throttling READS. Thanks Vivek ^ permalink raw reply [flat|nested] 138+ messages in thread
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-22 15:25 ` Vivek Goyal
@ 2011-04-22 16:28 ` Andrea Arcangeli
  2011-04-25 18:19   ` Vivek Goyal
  0 siblings, 1 reply; 138+ messages in thread
From: Andrea Arcangeli @ 2011-04-22 16:28 UTC (permalink / raw)
To: Vivek Goyal
Cc: Wu Fengguang, James Bottomley, lsf@lists.linux-foundation.org,
    Dave Chinner, linux-fsdevel@vger.kernel.org

On Fri, Apr 22, 2011 at 11:25:31AM -0400, Vivek Goyal wrote:
> It is, and we have modified CFQ a lot to tackle that, but still...
>
> Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
> disk and then try to launch firefox and browse a few websites, and see
> if you are happy with the responsiveness of firefox. It took me more
> than a minute to launch firefox and be able to input and load the
> first website.
>
> But I agree that READ latencies in the presence of WRITES can be a
> problem independent of the IO controller.

Reading this gives me some deja vu; this is literally a decade-old problem,
so old that when I first worked on it the elevator had no notion of latency
and could starve any I/O (read or write) at the end of the disk
indefinitely, as long as I/O earlier on the disk kept coming in ;).

We're orders of magnitude better these days, but one thing I didn't see
mentioned is that, as far as I remember, a lot of it had to do with the way
the DMA command size can grow to the maximum allowed by the sg table for
writes. Reads (especially metadata and small files, where readahead is less
effective) won't grow to the maximum; even when they do, the readahead may
not be useful (userland will seek again rather than read into the
readahead), and even if no synchronous metadata reads are involved, another
physical readahead will be submitted after only a small userland read has
been satisfied. So even with a totally unfair IO scheduler that always
places the next read request at the top of the queue (ignoring any fairness
requirement), the small synchronous read DMA is still left waiting at the
top of the queue for the large write DMA to complete.

The time I got the dd if=/dev/zero case working best was when I broke
throughput by massively reducing the DMA size (by mistake or intentionally,
I frankly don't remember). SATA requires a ~64k DMA to run at peak speed,
and I expect that if you reduce it to 4k it will behave a lot better than
the current 256k. Some very old SCSI device I had performed best at 512k
DMA (much faster than 64k). The max sector size is still 512k today,
probably 256k (or only 128k) for SATA, but likely above 64k (as it saves
CPU even if throughput can be maxed out at ~64k DMA as far as the platter
is concerned).

> Also it is only CFQ which provides READS so much preference over WRITES.
> deadline and noop, which we typically use on faster storage, do not.
> There we might take a bigger hit on READ latencies depending on what the
> storage is and how affected it is by a burst of WRITES.
>
> I guess it boils down to better system control and better predictability.

I tend to think that to get even better read latency and predictability,
the IO scheduler could dynamically and temporarily reduce the max sector
size of the write DMA (and also ensure any readahead is reduced to the
dynamically reduced sector size, or it would be detrimental to the number
of read DMAs issued for each userland read).

Maybe with tagged queuing things are better and the DMA size doesn't make a
difference anymore, I don't know. Surely Jens knows this best and can tell
me if I'm wrong.

Anyway, it should be really easy to test: a two-liner reducing the max
sector size in scsi_lib and the max readahead should let you see how fast
firefox starts with cfq while dd if=/dev/zero is running, and whether there
is any difference at all. I've seen huge work go into cfq, but the max
merging still remains at the top and doesn't decrease dynamically, and I
doubt you can make writeback truly unnoticeable to reads without such a
change, no matter how the IO scheduler is otherwise implemented.

I'm unsure whether this will ever be really viable in a single-user
environment (often absolute throughput is more important, and that is
clearly higher - at least for the writeback - with the max sector kept
fixed at the maximum), but if cgroups want to make a
dd if=/dev/zero of=zero bs=10M oflag=direct in one group unnoticeable to
the other cgroups that are reading, it's worth researching whether this is
still an actual issue with today's hardware. I guess SSD won't change it
much, as it's a DMA duration issue, not a seek issue; in fact it may be far
more noticeable on SSD, since seeks will be less costly, leaving the
duration effect more visible.

> So throttling is happening at two layers. One throttling is in
> balance_dirty_pages(), which is actually not dependent on user-supplied
> parameters; it depends more on the page cache share of this cgroup and
> on the effective IO rate this cgroup is getting. The real IO throttling
> is happening at the device level, which is dependent on the parameters
> supplied by the user and which in turn should indirectly decide how
> tasks are throttled in balance_dirty_pages().

This sounds like a fine design to me.

Thanks,
Andrea

^ permalink raw reply [flat|nested] 138+ messages in thread
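Andrea's proposed experiment can be approximated from userspace, without patching scsi_lib, by shrinking the request size and the readahead through sysfs. A rough sketch, assuming the disk under test is /dev/sda; the values are purely illustrative:

  # note the current limits so they can be restored afterwards
  cat /sys/block/sda/queue/max_sectors_kb /sys/block/sda/queue/read_ahead_kb

  # shrink the maximum request size and the readahead window
  echo 16 > /sys/block/sda/queue/max_sectors_kb
  echo 16 > /sys/block/sda/queue/read_ahead_kb

  # generate the writeback load, then time how long an interactive
  # reader (firefox cold start, first page load) takes by stopwatch
  dd if=/dev/zero of=/zerofile bs=1M count=4K &
  firefox http://lwn.net/ &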
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-22 16:28 ` Andrea Arcangeli
@ 2011-04-25 18:19 ` Vivek Goyal
  2011-04-26 14:37   ` Vivek Goyal
  0 siblings, 1 reply; 138+ messages in thread
From: Vivek Goyal @ 2011-04-25 18:19 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Wu Fengguang, James Bottomley, lsf@lists.linux-foundation.org,
    Dave Chinner, linux-fsdevel@vger.kernel.org, Jens Axboe

On Fri, Apr 22, 2011 at 06:28:29PM +0200, Andrea Arcangeli wrote:

[..]
> > Also it is only CFQ which provides READS so much preference over WRITES.
> > deadline and noop, which we typically use on faster storage, do not.
> > There we might take a bigger hit on READ latencies depending on what the
> > storage is and how affected it is by a burst of WRITES.
> >
> > I guess it boils down to better system control and better predictability.
>
> I tend to think that to get even better read latency and predictability,
> the IO scheduler could dynamically and temporarily reduce the max sector
> size of the write DMA (and also ensure any readahead is reduced to the
> dynamically reduced sector size, or it would be detrimental to the number
> of read DMAs issued for each userland read).
>
> Maybe with tagged queuing things are better and the DMA size doesn't make
> a difference anymore, I don't know. Surely Jens knows this best and can
> tell me if I'm wrong.
>
> Anyway, it should be really easy to test: a two-liner reducing the max
> sector size in scsi_lib and the max readahead should let you see how fast
> firefox starts with cfq while dd if=/dev/zero is running, and whether
> there is any difference at all.

I did some quick runs.

- The default queue depth is 31 on my SATA disk. Reducing the queue depth
  to 1 helps a bit.

  In CFQ we already try to reduce the queue depth of WRITES if READS are
  going on.

- I reduced /sys/block/sda/queue/max_sectors_kb to 16. That seemed to help
  with firefox launch time.

There are a couple of interesting observations though.

- Even after I reduced max_sectors_kb to 16, I saw requests of 1024
  sectors coming from the flusher threads.

- Firefox launch time was reduced by lowering max_sectors_kb, but it did
  not help much when I tried to load the first website, "lwn.net". It
  still took me a little more than 1 minute to be able to select lwn.net
  from the cached entries and then really load and display the page.

I will spend more time figuring out what's happening here. But in general,
reducing the max request size dynamically sounds interesting. I am not
sure how the upper layers (dm etc.) are impacted by this.

Thanks
Vivek

^ permalink raw reply [flat|nested] 138+ messages in thread
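One way to check what request sizes are actually reaching the disk (and to catch the kind of surprise noted above, where 1024-sector requests still show up) is to watch the per-device average request size or to trace the queue directly. A sketch, assuming sysstat and blktrace are available; the device names are examples:

  # avgrq-sz is the average request size in 512-byte sectors; compare
  # the physical disk with any dm device stacked on top of it
  iostat -x sda dm-0 1

  # or trace sda and look at the size of each completed request
  blktrace -d /dev/sda -o - | blkparse -i - | grep ' C '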
* Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
  2011-04-25 18:19 ` Vivek Goyal
@ 2011-04-26 14:37 ` Vivek Goyal
  0 siblings, 0 replies; 138+ messages in thread
From: Vivek Goyal @ 2011-04-26 14:37 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Wu Fengguang, James Bottomley, lsf@lists.linux-foundation.org,
    Dave Chinner, linux-fsdevel@vger.kernel.org, Jens Axboe

On Mon, Apr 25, 2011 at 02:19:54PM -0400, Vivek Goyal wrote:
> On Fri, Apr 22, 2011 at 06:28:29PM +0200, Andrea Arcangeli wrote:
>
> [..]
> > > Also it is only CFQ which provides READS so much preference over
> > > WRITES. deadline and noop, which we typically use on faster storage,
> > > do not. There we might take a bigger hit on READ latencies depending
> > > on what the storage is and how affected it is by a burst of WRITES.
> > >
> > > I guess it boils down to better system control and better
> > > predictability.
> >
> > I tend to think that to get even better read latency and
> > predictability, the IO scheduler could dynamically and temporarily
> > reduce the max sector size of the write DMA (and also ensure any
> > readahead is reduced to the dynamically reduced sector size, or it
> > would be detrimental to the number of read DMAs issued for each
> > userland read).
> >
> > Maybe with tagged queuing things are better and the DMA size doesn't
> > make a difference anymore, I don't know. Surely Jens knows this best
> > and can tell me if I'm wrong.
> >
> > Anyway, it should be really easy to test: a two-liner reducing the max
> > sector size in scsi_lib and the max readahead should let you see how
> > fast firefox starts with cfq while dd if=/dev/zero is running, and
> > whether there is any difference at all.
>
> I did some quick runs.
>
> - The default queue depth is 31 on my SATA disk. Reducing the queue
>   depth to 1 helps a bit.
>
>   In CFQ we already try to reduce the queue depth of WRITES if READS
>   are going on.
>
> - I reduced /sys/block/sda/queue/max_sectors_kb to 16. That seemed to
>   help with firefox launch time.
>
> There are a couple of interesting observations though.
>
> - Even after I reduced max_sectors_kb to 16, I saw requests of 1024
>   sectors coming from the flusher threads.
>

I realized that I had a dm device sitting on top of sda and I was changing
max_sectors_kb only on sda and not on the dm device, hence the request size
was still 1024 sectors. I changed max_sectors_kb on the dm device to 16 as
well, and that seems to help. The time to launch firefox and load the first
website comes down from about 1 minute to roughly 30 seconds.

At the dm layer no IO scheduler is running, so the IO scheduler really
can't do much to control the request size dynamically based on what's
happening on the device. I am not sure whether one could break requests
into smaller pieces in the IO scheduler while reads are going on.

Thanks
Vivek

^ permalink raw reply [flat|nested] 138+ messages in thread
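The stacking detail is worth spelling out: when a dm device sits on top of the disk, writeback submits its IO against the dm queue, so the limit has to be lowered there as well as on the underlying disk. A sketch with example device names (dm-0 on top of sda); the sysfs slaves directory shows which physical devices a dm device maps to:

  # which physical devices does this dm device sit on?
  ls /sys/block/dm-0/slaves/

  # lower the limit on the dm queue as well as on the disk underneath
  echo 16 > /sys/block/dm-0/queue/max_sectors_kb
  echo 16 > /sys/block/sda/queue/max_sectors_kb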