xen-devel

Re: [Xen-devel] blktap wedges when block-attached to dom0

To: jake <jwires@xxxxxxxxxxxxx>
Subject: Re: [Xen-devel] blktap wedges when block-attached to dom0
From: Brendan Cully <brendan@xxxxxxxxx>
Date: Wed, 21 Mar 2007 09:59:55 -0700
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Wed, 21 Mar 2007 09:58:51 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <1167788265.6625.16.camel@patagonia>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Mail-followup-to: jwires@xxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxx
References: <1167788265.6625.16.camel@patagonia>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.14 (2007-03-20)
Any chance this will be refreshed for 2.6.18? I very much enjoy being
able to block-attach in domain 0, but am less enamoured of the
frequent hangs when I fsck those devices...

On Tuesday, 02 January 2007 at 17:37, jake wrote:
> blktap devices attached to dom0 are liable to wedge during IO transfers.
> The problem does not occur in typical usage scenarios (i.e., virtual
> devices attached to guest domains); it is unique to the unanticipated
> case in which virtual devices are attached to dom0. 
> 
> The problem arises when processes in dom0 generate a large number of
> dirty pages while writing to a block-attached device.  Once the number
> of dirty pages reaches a certain threshold, the dom0 kernel begins
> throttling IO in balance_dirty_pages; processes traversing the buffered
> IO path will block in this function until the number of dirty pages
> decreases. 
> 
> This is bad for the tapdisk process, which is responsible for servicing
> IO requests from the blktap driver.  The tapdisk process normally
> performs direct IO, but if it writes to a hole in a sparse file, it
> falls into the buffered IO path.  If the tapdisk process blocks in
> balance_dirty_pages, it will do so indefinitely, because it is the only
> process that cleans the pages dirtied by the processes writing to the
> virtual device.  Thus dirty pages continue to amass in dom0 as IO is
> performed on the virtual device, but none of them make it to the
> physical devices because the tapdisk process is unable to service the
> requests. 
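
The wedge described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct dom0_model`, `wedges`, and the round-based accounting are hypothetical names and simplifications, not anything taken from the kernel or from blktap. It captures only the dependency described in the quoted text: writers stop dirtying pages once a threshold is reached, and tapdisk, the only cleaner, stops too unless it is exempt.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of dom0 dirty-page accounting. This is not kernel code; the
 * struct, its fields, and the numbers fed to it are illustrative only. */
struct dom0_model {
    int dirty;            /* dirty pages currently outstanding in dom0 */
    int threshold;        /* level at which writers are throttled */
    bool tapdisk_exempt;  /* true if tapdisk skips the throttle */
};

/* Run up to `rounds` rounds. In each round a dom0 writer dirties `produced`
 * pages unless it is throttled, then tapdisk cleans up to `capacity` pages
 * unless it is throttled too. Returns true as soon as both are blocked in
 * the same round, which is the wedge: nothing can clean pages any more. */
bool wedges(struct dom0_model m, int rounds, int produced, int capacity)
{
    assert(produced > 0 && capacity > 0);

    for (int i = 0; i < rounds; i++) {
        bool writer_throttled = m.dirty >= m.threshold;
        if (!writer_throttled)
            m.dirty += produced;

        bool tapdisk_blocked = !m.tapdisk_exempt && m.dirty >= m.threshold;
        if (writer_throttled && tapdisk_blocked)
            return true;

        if (!tapdisk_blocked)
            m.dirty -= (m.dirty < capacity) ? m.dirty : capacity;
    }
    return false;
}
```

With a threshold of 100 pages, a writer producing 30 pages per round and a tapdisk cleaning 20, the model wedges when tapdisk is subject to the throttle and keeps making progress when it is exempt. If tapdisk can clean faster than the writer dirties, the threshold is never reached and the model does not wedge either way.
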
> 
> Note that when used as originally intended, blktap does not suffer from
> this problem: when blktap devices are attached to guest domains,
> performing IO on them dirties pages in the guest domain, not in dom0, so
> the tapdisk process doesn't get throttled in balance_dirty_pages.
> 
> Attached is a patch that works around the dom0 problem by exempting the
> tapdisk process from blocking in balance_dirty_pages.  tapdisk processes
> servicing dom0-attached devices are granted special status using a
> modified setpriority syscall; a check in balance_dirty_pages ensures
> that such processes do not block indefinitely. 
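
The argument handling that the modified setpriority syscall performs can be sketched in userspace as follows. This is a paraphrase of the `sys_setpriority` hunk in the patch, not the kernel function itself: `normalize_niceval` and the `WHICH_*` selectors are hypothetical names introduced here so the logic can be compiled and exercised on its own, while `PRIO_SPECIAL_IO` and its value come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PRIO_SPECIAL_IO (-9999)  /* sentinel value taken from the patch */

/* Local stand-ins for PRIO_PROCESS and PRIO_PGRP, renamed so that they do
 * not clash with the definitions in <sys/resource.h>. */
enum which_sel { WHICH_PROCESS = 0, WHICH_PGRP = 1 };

/* Mirror of the patched normalization step: the sentinel is passed through
 * untouched, but only for a single process; every other value is clamped
 * to the usual nice range of -20..19. Returns 0 and stores the result in
 * *out, or returns -EINVAL if the sentinel is used with another selector. */
int normalize_niceval(int which, int niceval, int *out)
{
    assert(out != NULL);

    if (niceval == PRIO_SPECIAL_IO) {
        if (which != WHICH_PROCESS)
            return -EINVAL;
    } else {
        if (niceval < -20)
            niceval = -20;
        if (niceval > 19)
            niceval = 19;
    }
    *out = niceval;
    return 0;
}
```

One consequence worth keeping in mind: on a kernel without this patch, the same clamping code turns -9999 into -20, so the `setpriority` call added to blktapctrl.c would attempt to renice tapdisk to -20 instead of marking it as an IO process.
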
> 
> This is clearly a hacky solution; any suggestions for improvement are
> welcome.

> # HG changeset patch
> # User Jake Wires <jwires@xxxxxxxxxxxxx>
> # Date 1166551978 28800
> # Node ID 34c6a9a2983ae46fad5dbba7e4b49520fb639a8c
> # Parent  df1e7ae878b4badf4e5555df12a1c4d233170fb9
> [BLKTAP] prevent tapdisk processes from blocking in balance_dirty_pages
> 
> This patch modifies the setpriority syscall to enable marking processes as special
> IO processes.  IO processes are exempted from blocking in balance_dirty_pages.
> This patch is intended to avoid deadlocks when block-attaching a blktap VDI to
> dom0.
> 
> diff -r df1e7ae878b4 -r 34c6a9a2983a patches/linux-2.6.16.33/series
> +++ b/patches/linux-2.6.16.33/series  Tue Dec 19 10:12:58 2006 -0800
> @@ -5,6 +5,7 @@ git-4bfaaef01a1badb9e8ffb0c0a37cd2379008
>  git-4bfaaef01a1badb9e8ffb0c0a37cd2379008d21f.patch
>  linux-2.6.19-rc1-kexec-move_segment_code-x86_64.patch
>  blktap-aio-16_03_06.patch
> +blktap-ioprio.patch
>  device_bind.patch
>  fix-hz-suspend.patch
>  fix-ide-cd-pio-mode.patch
> diff -r df1e7ae878b4 -r 34c6a9a2983a tools/blktap/drivers/blktapctrl.c
> +++ b/tools/blktap/drivers/blktapctrl.c       Tue Dec 19 10:12:58 2006 -0800
> @@ -51,6 +51,7 @@
>  #include <xs.h>
>  #include <printf.h>
>  #include <sys/time.h>
> +#include <sys/resource.h>
>  #include <syslog.h>
>                                                                       
>  #include "blktaplib.h"
> @@ -535,6 +536,14 @@ int blktapctrl_new_blkif(blkif_t *blkif)
>                       goto fail;
>               }
>  
> +             /* exempt tapdisk from flushing when attached to dom0 */
> +             if (blkif->domid == 0) 
> +                     if (setpriority(PRIO_PROCESS, 
> +                                     blkif->tappid, PRIO_SPECIAL_IO)) {
> +                             DPRINTF("Unable to prioritize tapdisk proc\n");
> +                             goto fail;
> +                     }
> +
>               /* Both of the following read and write calls will block up to 
>                * max_timeout val*/
>               if (write_msg(blkif->fds[WRITE], CTLMSG_PARAMS, blkif, ptr) 
> diff -r df1e7ae878b4 -r 34c6a9a2983a tools/blktap/lib/blktaplib.h
> +++ b/tools/blktap/lib/blktaplib.h    Tue Dec 19 10:12:58 2006 -0800
> @@ -57,6 +57,8 @@
>  #define BLKTAP_QUERY_ALLOC_REQS      8
>  #define BLKTAP_IOCTL_FREEINTF             9
>  #define BLKTAP_IOCTL_PRINT_IDXS      100   
> +
> +#define PRIO_SPECIAL_IO             -9999
>  
>  /* blktap switching modes: (Set with BLKTAP_IOCTL_SETMODE)             */
>  #define BLKTAP_MODE_PASSTHROUGH      0x00000000  /* default            */
> diff -r df1e7ae878b4 -r 34c6a9a2983a patches/linux-2.6.16.33/blktap-ioprio.patch
> +++ b/patches/linux-2.6.16.33/blktap-ioprio.patch     Tue Dec 19 10:12:58 2006 -0800
> @@ -0,0 +1,81 @@
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/sched.h ./include/linux/sched.h
> +--- ../orig-linux-2.6.16.33/include/linux/sched.h    2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/sched.h  2006-12-18 18:46:07.000000000 -0800
> +@@ -706,6 +706,7 @@ struct task_struct {
> +     prio_array_t *array;
> + 
> +     unsigned short ioprio;
> ++    short special_prio;
> + 
> +     unsigned long sleep_avg;
> +     unsigned long long timestamp, last_ran;
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/resource.h ./include/linux/resource.h
> +--- ../orig-linux-2.6.16.33/include/linux/resource.h 2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/resource.h       2006-12-18 18:44:35.000000000 -0800
> +@@ -44,6 +44,7 @@ struct rlimit {
> + 
> + #define     PRIO_MIN        (-20)
> + #define     PRIO_MAX        20
> ++#define PRIO_SPECIAL_IO -9999
> + 
> + #define     PRIO_PROCESS    0
> + #define     PRIO_PGRP       1
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/init_task.h ./include/linux/init_task.h
> +--- ../orig-linux-2.6.16.33/include/linux/init_task.h        2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/init_task.h      2006-12-18 18:45:56.000000000 -0800
> +@@ -85,6 +85,7 @@ extern struct group_info init_groups;
> +     .lock_depth     = -1,                                           \
> +     .prio           = MAX_PRIO-20,                                  \
> +     .static_prio    = MAX_PRIO-20,                                  \
> ++        .special_prio   = 0,                                            \
> +     .policy         = SCHED_NORMAL,                                 \
> +     .cpus_allowed   = CPU_MASK_ALL,                                 \
> +     .mm             = NULL,                                         \
> +diff -pruN ../orig-linux-2.6.16.33/kernel/sys.c ./kernel/sys.c
> +--- ../orig-linux-2.6.16.33/kernel/sys.c     2006-12-18 18:42:00.000000000 -0800
> ++++ ./kernel/sys.c   2006-12-18 18:43:30.000000000 -0800
> +@@ -245,6 +245,11 @@ static int set_one_prio(struct task_stru
> +             error = -EPERM;
> +             goto out;
> +     }
> ++    if (niceval == PRIO_SPECIAL_IO) {
> ++            p->special_prio = PRIO_SPECIAL_IO;
> ++            error = 0;
> ++            goto out;
> ++    }
> +     if (niceval < task_nice(p) && !can_nice(p, niceval)) {
> +             error = -EACCES;
> +             goto out;
> +@@ -272,10 +277,15 @@ asmlinkage long sys_setpriority(int whic
> + 
> +     /* normalize: avoid signed division (rounding problems) */
> +     error = -ESRCH;
> +-    if (niceval < -20)
> +-            niceval = -20;
> +-    if (niceval > 19)
> +-            niceval = 19;
> ++    if (niceval == PRIO_SPECIAL_IO) {
> ++            if (which != PRIO_PROCESS)
> ++                    return -EINVAL;
> ++    } else {
> ++            if (niceval < -20)
> ++                    niceval = -20;
> ++            if (niceval > 19)
> ++                    niceval = 19;
> ++    }
> + 
> +     read_lock(&tasklist_lock);
> +     switch (which) {
> +diff -pruN ../orig-linux-2.6.16.33/mm/page-writeback.c ./mm/page-writeback.c
> +--- ../orig-linux-2.6.16.33/mm/page-writeback.c      2006-12-19 10:03:59.000000000 -0800
> ++++ ./mm/page-writeback.c    2006-12-19 10:04:17.000000000 -0800
> +@@ -231,6 +231,9 @@ static void balance_dirty_pages(struct a
> +                     pages_written += write_chunk - wbc.nr_to_write;
> +                     if (pages_written >= write_chunk)
> +                             break;          /* We've done our duty */
> ++                    if (current->special_prio == PRIO_SPECIAL_IO)
> ++                            break;          /* Exempt IO processes */
> ++
> +             }
> +             blk_congestion_wait(WRITE, HZ/10);
> +     }

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel

