Linux Container Development
  • * [RFC] writeback and cgroup
    From: Tejun Heo @ 2012-04-03 18:36 UTC
      To: Fengguang Wu, Jan Kara, vgoyal-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe
      Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
    	andrea-oIIqvOZpAevzfdHfmsDf5w,
    	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    	linux-kernel-u79uwXL29TY76Z2rM5mHXA, sjayaraman-IBi9RG/b67k,
    	lsf-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
    	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    	cgroups-u79uwXL29TY76Z2rM5mHXA
    
    Hello, guys.
    
    So, during LSF, I, Fengguang and Jan had a chance to sit down and talk
    about how to add cgroup support to writeback.  Here's what I got from it.
    
    Fengguang's opinion is that the throttling algorithm implemented in
    writeback is good enough and blkcg parameters can be exposed to
    writeback such that those limits can be applied from writeback.  As
    for reads and direct IOs, Fengguang opined that the algorithm can
    easily be extended to cover those cases and IIUC all IOs, whether
    buffered writes, reads or direct IOs can eventually all go through
    the writeback layer, which will be the one layer controlling all IOs.
    
    Unfortunately, I don't agree with that at all.  I think it's a gross
    layering violation and lacks any longterm design.  We have a well
    working model of applying and propagating resource pressure - we apply
    the pressure where the resource exists and propagate the back
    pressure through buffers to upper layers up to the originator.  Think
    about networking: the pressure exists or is applied at the in/egress
    points, gets propagated through socket buffers and eventually
    throttles the originator.
    
    Writeback, without cgroup, isn't different.  It constitutes a part of the
    pressure propagation chain anchored at the IO device.  IO devices
    these days generate very high pressure, which gets propagated through
    the IO sched and buffered requests, which in turn creates pressure at
    writeback.  Here, the buffering happens in page cache and pressure at
    writeback increases the amount of dirty page cache.  Propagating this
    IO pressure to the dirtying task is one of the biggest
    responsibilities of the writeback code, and this is the underlying
    design of the whole thing.
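
    The propagation chain described above can be illustrated with a toy
    userspace model.  This is only a sketch, not kernel code: the Stage
    class, the buffer capacities and the simulate() helper are all made up
    for illustration.  A slow device leaves the request queue full, which
    leaves the dirty page cache full, which finally stalls the dirtier.

```python
from collections import deque


class Stage:
    """A bounded buffer in the IO path (hypothetical model, not a kernel object)."""

    def __init__(self, name, capacity, downstream=None):
        self.name = name
        self.capacity = capacity
        self.downstream = downstream
        self.items = deque()

    def full(self):
        return len(self.items) >= self.capacity

    def push(self, item):
        """Accept an item, or refuse it if this buffer is full."""
        if self.full():
            return False
        self.items.append(item)
        return True

    def drain(self, budget):
        """Move up to `budget` items downstream; stop when downstream refuses."""
        moved = 0
        while self.items and moved < budget:
            if self.downstream is not None and not self.downstream.push(self.items[0]):
                break  # downstream is full: the pressure stays in this buffer
            self.items.popleft()
            moved += 1
        return moved


def simulate(ticks, device_rate, dirty_rate):
    """Run the chain for `ticks` steps; return (pages written, stalled dirty attempts)."""
    queue = Stage("request queue", capacity=4)  # drained by the device
    cache = Stage("dirty page cache", capacity=8, downstream=queue)
    written = 0
    throttled = 0
    for _ in range(ticks):
        written += queue.drain(device_rate)  # the device completes IO
        cache.drain(len(cache.items))        # writeback feeds the request queue
        for _ in range(dirty_rate):          # a task dirties pages
            if not cache.push("page"):
                throttled += 1               # the originator is held back
    return written, throttled
```

    With a dirtier that is faster than the device, e.g.
    simulate(100, device_rate=1, dirty_rate=3), the stall count becomes
    non-zero once both buffers fill up.  With device_rate=3 and
    dirty_rate=2 nothing ever backs up and the dirtier is never stalled.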
    
    IIUC, without cgroup, the current writeback code works more or less
    like this.  Throwing in cgroup doesn't really change the fundamental
    design.  Instead of a single pipe going down, we just have multiple
    pipes to the same device, each of which should be treated separately.
    Of course, a spinning disk can't be divided that easily and their
    performance characteristics will be inter-dependent, but the place to
    solve that problem is where the problem is, the block layer.
    
    We may have to look for optimizations and expose some details to
    improve the overall behavior and such optimizations may require some
    deviation from the fundamental design, but such optimizations should
    be justified and such deviations kept at a minimum, so, no, I don't
    think we're gonna expose blkcg / block / elevator parameters
    directly to writeback.  Unless someone can *really* convince me
    otherwise, I'll be vetoing any change toward that direction.
    
    Let's please keep the layering clear.  IO limitations will be applied
    at the block layer and pressure will be formed there and then
    propagated upwards eventually to the originator.  Sure, exposing the
    whole information might result in better behavior for certain
    workloads, but down the road, say, in three or five years, devices
    which can be shared without worrying too much about seeks might be
    commonplace and we could be swearing at a disgusting structural mess,
    and sadly various cgroup support seems to be a prominent source of
    such design failures.
    
    IMHO, treating each cgroup - device/bdi pair as a separate device should
    suffice as the underlying design.  After all, blkio cgroup support's
    ultimate goal is dividing the IO resource into separate bins.
    Implementation details might change as underlying technology changes
    and we learn more about how to do it better but that is the goal which
    we'll always try to keep close to.  Writeback should (be able to)
    treat them as separate devices.  We surely will need adjustments and
    optimizations to make things work at least somewhat reasonably but
    that is the baseline.
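
    As a rough illustration of that baseline, the sketch below keeps
    independent dirty accounting per (cgroup, bdi) pair, so throttling one
    pair never touches another.  It is a toy model under stated
    assumptions: PairState, Writeback, the per-pair dirty limit and the
    method names are hypothetical and do not correspond to existing kernel
    structures.

```python
from dataclasses import dataclass


@dataclass
class PairState:
    """Writeback state for one (cgroup, bdi) pair, treated as its own device."""
    dirty_limit: int  # pages this pair may have dirty at once (assumed knob)
    dirty: int = 0    # pages currently dirty

    def over_limit(self):
        return self.dirty >= self.dirty_limit


class Writeback:
    """Toy writeback layer that only ever looks at per-pair state."""

    def __init__(self, default_limit):
        self.default_limit = default_limit
        self.pairs = {}  # (cgroup, bdi) -> PairState

    def pair(self, cgroup, bdi):
        key = (cgroup, bdi)
        if key not in self.pairs:
            self.pairs[key] = PairState(self.default_limit)
        return self.pairs[key]

    def dirty_page(self, cgroup, bdi):
        """Return True if the page was dirtied, False if the task must wait."""
        state = self.pair(cgroup, bdi)
        if state.over_limit():
            return False
        state.dirty += 1
        return True

    def complete_io(self, cgroup, bdi, pages):
        """The block layer finished writing `pages` pages for this pair."""
        state = self.pair(cgroup, bdi)
        state.dirty = max(0, state.dirty - pages)
```

    In this model cgroup "A" hitting its limit on "sda" does not stop
    cgroup "B" from dirtying pages for "sda", nor "A" from dirtying pages
    for "sdb".  How much each pair is actually allowed to push is decided
    below writeback, at the block layer.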
    
    In the discussion, for such implementation, the following obstacles
    were identified.
    
    * There are a lot of cases where IOs are issued by a task which isn't
      the originator, e.g. writeback issues IOs for pages which are
      dirtied by some other tasks.  So, by the time an IO reaches the
      block layer, we don't know which cgroup the IO belongs to.
    
      Recently, the block layer has grown support to attach a task to a bio
      which causes the bio to be handled as if it were issued by the
      associated task regardless of the actual issuing task.  It currently
      only allows attaching %current to a bio - bio_associate_current() -
      but changing it to support other tasks is trivial.
    
      We'll need to update the async issuers to tag the IOs they issue but
      the mechanism is already there.
    
    * There's a single request pool shared by all issuers per request
      queue.  This can lead to priority inversion among cgroups.  Note
      that the problem also exists without cgroups.  A lower ioprio issuer
      may be holding a request, holding back a highprio issuer.
    
      We'll need to make request allocation cgroup (and hopefully ioprio)
      aware.  Probably in the form of separate request pools.  This will
      take some work but I don't think this will be too challenging.  I'll
      work on it.
    
    * cfq cgroup policy throws all async IOs, which all buffered writes
      are, into the shared cgroup regardless of the actual cgroup.  This
      behavior is, I believe, mostly historical and changing it isn't
      difficult.  Prolly only a few tens of lines of changes.  This may
      cause significant changes to actual IO behavior with cgroups tho.  I
      personally think the previous behavior was too wrong to keep (the
      weight was completely ignored for buffered writes) but we may want
      to introduce a switch to toggle between the two behaviors.
    
      Note that blk-throttle doesn't have this problem.
    
    * Unlike dirty data pages, metadata tends to have strict ordering
      requirements and thus is susceptible to priority inversion.  Two
      solutions were suggested - 1. allow overdraw for metadata writes so
      that low prio metadata writes don't block the whole FS, 2. provide
      an interface to query and wait for bdi-cgroup congestion which can
      be called from FS metadata paths to throttle metadata operations
      before they enter the stream of ordered operations.
    
      I think a combination of the above two should be enough for solving
      the problem.  I *think* the second can be implemented as part of
      cgroup aware request allocation update.  The first one needs a bit
      more thinking but there can be easier interim solutions (e.g. throw
      META writes to the head of the cgroup queue or just plain ignore
      cgroup limits for META writes) for now.
    
    * I'm sure there are a lot of design choices to be made in the
      writeback implementation but IIUC Jan seems to agree that the
      simplest would be to simply treat different cgroup-bdi pairs as
      completely separate which shouldn't add too much complexity to the
      already intricate writeback code.
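
    Two of the obstacles above, tagging IOs with their originator and
    making request allocation cgroup aware, can be sketched together in a
    toy model.  Everything here is hypothetical (Request, RequestQueue,
    flusher_submit and the fixed pool size are illustrative names, not
    block layer interfaces).  The point is only that the request is
    charged to the cgroup that dirtied the page, not to the flusher that
    issues it, and that one cgroup exhausting its pool cannot hold back
    another.

```python
class Request:
    def __init__(self, cgroup, payload):
        self.cgroup = cgroup    # the originator's cgroup, not the issuer's
        self.payload = payload


class RequestQueue:
    """Request queue with a separate, fixed-size request pool per cgroup."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.in_flight = {}     # cgroup -> number of allocated requests

    def alloc(self, origin_cgroup, payload):
        """Allocate from the originator's pool; None means that pool is exhausted."""
        used = self.in_flight.get(origin_cgroup, 0)
        if used >= self.pool_size:
            return None
        self.in_flight[origin_cgroup] = used + 1
        return Request(origin_cgroup, payload)

    def free(self, request):
        """Return a completed request to its cgroup's pool."""
        self.in_flight[request.cgroup] -= 1


def flusher_submit(queue, dirty_pages):
    """Model a flusher issuing IO on behalf of other tasks.

    `dirty_pages` is a list of (origin_cgroup, payload) tuples.  Each IO is
    tagged with the cgroup that dirtied the page.  Returns the submitted
    requests and the pages deferred because their cgroup's pool was empty.
    """
    submitted, deferred = [], []
    for origin, payload in dirty_pages:
        request = queue.alloc(origin, payload)
        if request is None:
            deferred.append((origin, payload))
        else:
            submitted.append(request)
    return submitted, deferred
```

    With a pool size of 2, five pages from a "low" cgroup followed by one
    page from a "high" cgroup result in two "low" requests and the "high"
    request being submitted, while the remaining three "low" pages wait.
    With a single shared pool the "high" request could have been stuck
    behind them.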
    
    So, I think we have something which sounds like a plan, which at least
    I can agree with and seems doable without adding a lot of complexity.
    
    Jan, Fengguang, I'm pretty sure I missed some stuff from writeback's
    side and IIUC Fengguang doesn't agree with this approach too much, so
    please voice your opinions & comments.
    
    Thank you.
    
    --
    tejun
    