To:	"Hirokazu Takahashi" <taka@xxxxxxxxxxxxx>
Subject:	Re: [Xen-devel] Re: [dm-devel] Re: dm-ioband + bio-cgroup benchmarks
From:	"Balbir Singh" <balbir@xxxxxxxxxxxxxxxxxx>
Date:	Wed, 24 Sep 2008 16:37:00 +0530
Cc:	xen-devel@xxxxxxxxxxxxxxxxxxx, containers@xxxxxxxxxxxxxxxxxxxxxxxxxx, jens.axboe@xxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx, dm-devel@xxxxxxxxxx, agk@xxxxxxxxxxxxxx, ryov@xxxxxxxxxxxxx, xemul@xxxxxxxxxx, fernando@xxxxxxxxxxxxx, vgoyal@xxxxxxxxxx, righi.andrea@xxxxxxxxx
Delivery-date:	Wed, 24 Sep 2008 04:07:28 -0700
In-reply-to:	<661de9470809240404i62300942o15337ecec335fe22@xxxxxxxxxxxxxx>
References:	<48D267B5.20402@xxxxxxxxx> <20080918150634.GH20640@xxxxxxxxxx> <48D2715A.6060002@xxxxxxxxx> <20080919.123405.91829935.taka@xxxxxxxxxxxxx> <661de9470809240404i62300942o15337ecec335fe22@xxxxxxxxxxxxxx>
Sender:	xen-devel-bounces@xxxxxxxxxxxxxxxxxxx

On Wed, Sep 24, 2008 at 4:34 PM, Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
>
>
> On Fri, Sep 19, 2008 at 9:04 AM, Hirokazu Takahashi <taka@xxxxxxxxxxxxx>
> wrote:
>>
>> Hi,
>>
>> > >> Vivek Goyal wrote:
>> > >>> On Thu, Sep 18, 2008 at 09:04:18PM +0900, Ryo Tsuruta wrote:
>> > >>>> Hi All,
>> > >>>>
>> > >>>> I have got excellent results with dm-ioband, which controls the disk
>> > >>>> I/O bandwidth even when it accepts delayed write requests.
>> > >>>>
>> > >>>> This time, I ran some benchmarks on high-end storage. The reason
>> > >>>> was to avoid a performance bottleneck due to mechanical factors
>> > >>>> such as seek time.
>> > >>>>
>> > >>>> You can see the details of the benchmarks at:
>> > >>>> http://people.valinux.co.jp/~ryov/dm-ioband/hps/
>> > >>>>
>> > >>> Hi Ryo,
>> > >>>
>> > >>> I had a query about the dm-ioband patches. IIUC, they will break
>> > >>> the notion of process priority in CFQ, because the dm-ioband device
>> > >>> now holds each bio and issues it to the lower layers later, based on
>> > >>> which bios become ready. Hence the context that actually submits the
>> > >>> bio might be different, and because CFQ derives the io_context from
>> > >>> the current task, prioritization will be broken.
>> > >>>
>> > >>> To mitigate that problem, we probably need to implement Fernando's
>> > >>> suggestion of putting an io_context pointer in the bio.
>> > >>>
>> > >>> Have you already done something to solve this issue?
>> > >>>
>> > >>> Secondly, why do we have to create an additional dm-ioband device
>> > >>> for every device we want to control using rules? This looks a little
>> > >>> odd, at least to me. Can't we keep it in line with the rest of the
>> > >>> controllers, where task grouping takes place using cgroups and rules
>> > >>> are specified in the cgroup itself (the way Andrea Righi does for the
>> > >>> io-throttling patches)?
>> > >>>
>> > >>> To avoid stacking another device (dm-ioband) on top of every device
>> > >>> we want to subject to rules, I was thinking of maintaining an
>> > >>> rb-tree per request queue. Requests will first go into this rb-tree
>> > >>> upon __make_request() and will then filter down to the elevator
>> > >>> associated with the queue (if there is one). This will give us
>> > >>> control over releasing bios to the elevator based on policies
>> > >>> (proportional weight, max bandwidth, etc.) with no need to stack an
>> > >>> additional block device.
>> > >>>
>> > >>> I am working on some experimental proof-of-concept patches. It will
>> > >>> take some time though.
>> > >>>
>> > >>> I was thinking of the following.
>> > >>>
>> > >>> - Adopt Andrea Righi's style of specifying rules for devices and
>> > >>>   group the tasks using cgroups.
>> > >>>
>> > >>> - To begin with, adopt dm-ioband's approach of a proportional
>> > >>>   bandwidth controller. It makes sense to me to limit bandwidth
>> > >>>   usage only in case of contention. If there is really a need to
>> > >>>   limit max bandwidth, then we can probably implement additional
>> > >>>   rules or some policy switcher where the user can decide what kind
>> > >>>   of policies need to be implemented.
>> > >>>
>> > >>> - Get rid of dm-ioband and instead buffer requests on an rb-tree on
>> > >>>   every request queue, controlled by some kind of cgroup rules.
>> > >>>
>> > >>> It would be good to discuss now whether the above approach makes
>> > >>> sense or not. I think it is a kind of fusion of the io-throttling and
>> > >>> dm-ioband patches, with the additional idea of doing io-control just
>> > >>> above the elevator on the request queue using an rb-tree.
>> > >> Thanks Vivek. All sounds reasonable to me and I think this is the
>> > >> right way to proceed.
>> > >>
>> > >> I'll try to design and implement your rb-tree per request-queue idea
>> > >> in my io-throttle controller; maybe we can also reuse it for a more
>> > >> generic solution. Feel free to send me your experimental proof of
>> > >> concept if you want, even if it's not yet complete. I can review it,
>> > >> test and contribute.
>> > >
>> > > Currently I have taken code from bio-cgroup to implement cgroups and
>> > > to provide functionality to associate a bio with a cgroup. I need
>> > > this to be able to queue the bios at the right node in the rb-tree
>> > > and also to decide when is the right time to release a few requests.
>> > >
>> > > Right now, in the crude implementation, I am working on making the
>> > > system boot. Once the patches are at least in somewhat working shape,
>> > > I will send them to you to have a look.
>> > >
>> > > Thanks
>> > > Vivek
>> >
>> > I wonder... wouldn't it be simpler to just use the memory controller
>> > to retrieve this information starting from struct page?
>> >
>> > I mean, following this path (in short, obviously using the appropriate
>> > interfaces for locking and referencing the different objects):
>> >
>> > cgrp = page->page_cgroup->mem_cgroup->css.cgroup
>> >
>> > Once you get the cgrp it's very easy to use the corresponding controller
>> > structure.
>> >
>> > Actually, this is what I'm doing in cgroup-io-throttle to associate a
>> > bio with a cgroup. What other functionality or advantages does
>> > bio-cgroup provide in addition to that?
>>
>> I've decided to get Ryo to post the accurate dirty-page tracking patch
>> for bio-cgroup, which isn't perfect yet though. The memory controller
>> never wants to support this tracking because migrating a page between
>> memory cgroups is really heavy.

It depends on the migration. The cost is proportional to the number of
pages moved. The cost can be brought down (I do have a design on paper
from long, long ago) by moving mm's, which reduces the cost of
migration but adds an additional dereference in the common path.
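
A minimal sketch of that trade-off, for illustration only. The structures
and function names below are hypothetical stand-ins and not the memory
controller's real data structures; the sketch also assumes that "moving
mm's" means pages reference their mm and the mm references the cgroup.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct cgroup_sk { long charged_pages; };
struct mm_sk { struct cgroup_sk *owner; };

/* Scheme A: every page points directly at its cgroup. */
struct page_direct { struct cgroup_sk *cg; };

/* Scheme B (assumed): every page points at its mm, the mm at the cgroup. */
struct page_via_mm { struct mm_sk *mm; };

/* Scheme A migration: every page is touched, so cost grows with npages.
 * Returns the number of pages that had to be re-charged. */
static size_t migrate_direct(struct page_direct *pages, size_t npages,
                             struct cgroup_sk *to)
{
    size_t touched = 0;

    for (size_t i = 0; i < npages; i++) {
        if (pages[i].cg == to)
            continue;
        pages[i].cg->charged_pages--;
        to->charged_pages++;
        pages[i].cg = to;
        touched++;
    }
    return touched;
}

/* Scheme B migration: one pointer update plus a bulk accounting transfer,
 * independent of how many pages the mm owns. */
static void migrate_mm(struct mm_sk *mm, long npages, struct cgroup_sk *to)
{
    mm->owner->charged_pages -= npages;
    to->charged_pages += npages;
    mm->owner = to;
}

/* Common-path lookup: scheme B pays one extra dereference per lookup. */
static struct cgroup_sk *lookup_direct(const struct page_direct *p)
{
    return p->cg;
}

static struct cgroup_sk *lookup_via_mm(const struct page_via_mm *p)
{
    return p->mm->owner;
}
```

The per-page loop is where the "proportional to the number of pages
moved" cost comes from; the extra hop in lookup_via_mm() is the price
paid on every lookup in exchange for a cheap move.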

>
>>
>> I also thought enhancing the memory controller would be good enough,
>> but a lot of people said they wanted to control memory resource and
>> block I/O resource separately.
>
> Yes, ideally we do want that.
>
>>
>> So you can create several bio-cgroups in one memory-cgroup,
>> or you can use bio-cgroup without memory-cgroup.
>>
>> I also have a plan to implement a more accurate tracking mechanism
>> on bio-cgroup, one that won't be supported by memory-cgroup, after
>> the memory cgroup team re-implements the infrastructure.
>> When a process is moved into another memory cgroup,
>> the pages belonging to the process don't move to the new cgroup
>> because migrating pages is so heavy. It's hard to find the pages
>> from the process, and migrating pages may cause some memory pressure.
>> I'll implement this feature only on bio-cgroup, with minimum overhead.
>

Kamezawa has also wanted the page migration feature, and we've agreed
to provide a per-cgroup flag to turn migration on or off. I would not
mind refactoring memcontrol.c if that can help the IO controller. If
you want migration, force the migration flag on and warn the user if
they try to turn it off.
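
Roughly, the flag handling could look like the sketch below. All names
are hypothetical and nothing here exists in memcontrol.c; rejecting the
write (rather than only warning) is also an assumption made for the
example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-cgroup state; not a real memcontrol.c structure. */
struct memcg_sk {
    bool migrate_on_move;        /* the per-cgroup migration flag */
    bool io_ctrl_needs_migrate;  /* set while an IO controller relies on it */
};

/* Called when the IO controller starts depending on page migration:
 * it forces the flag on. */
static void io_ctrl_attach(struct memcg_sk *cg)
{
    cg->io_ctrl_needs_migrate = true;
    cg->migrate_on_move = true;
}

/* User-visible setter. Returns 0 on success, -1 if the request was
 * refused (assumed behaviour) because the IO controller needs migration. */
static int memcg_set_migration(struct memcg_sk *cg, bool on)
{
    if (!on && cg->io_ctrl_needs_migrate) {
        fprintf(stderr,
                "warning: IO controller requires page migration; "
                "leaving it enabled\n");
        return -1;
    }
    cg->migrate_on_move = on;
    return 0;
}
```

Without an IO controller attached, the flag stays under the user's
control; once attached, attempts to clear it only produce the warning.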

Balbir
