xen-devel

RE: [Xen-devel] [PATCH] x86: add SSE-based copy_page()

To: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "Cui, Dexuan" <dexuan.cui@xxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH] x86: add SSE-based copy_page()
From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Date: Mon, 12 Jan 2009 23:29:55 +0000 (GMT)
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Mon, 12 Jan 2009 15:31:10 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <C54A33EF.1F69E%keir.fraser@xxxxxxxxxxxxx>
> On 19/11/08 20:24, "Dan Magenheimer" 
> <dan.magenheimer@xxxxxxxxxx> wrote:
> 
> > I haven't had a chance to test this further yet,
> > but I see the patch was already taken (c/s 18772).
> > 
> > Why, given that performance gets worse under some
> > circumstances?  At least maybe there should be two
> > interfaces: copy_page_cold_cache() and
> > copy_page_warm_cache() rather than just assume?
> > 
> > I'll post measurements when I get a chance to test,
> > but bring this up as a placeholder for now.
> 
> If more extensive testing shows it not to be a win in general 
> then we can
> revert the patch.
> 
>  -- Keir

I finally got around to measuring this.  On my two machines,
an Intel "Weybridge" box and an Intel TBD quad-core box,
the new SSE2 code was at best nearly the same as memcpy for
cold cache and much worse for warm cache.

I can't explain the sampling variation as I have interrupts off,
a lock held, and pre-warmed TLB... I suppose maybe another
processor could be causing rare TLB misses?  But in any case
the min number is probably best for comparison.

I'm guessing the gcc optimizer for the memcpy code was tuned
for an Intel pipeline... Jan, were you measuring on an
AMD processor?

I've included the raw data and measurement code below.

Dan (whose reason for interest in page-copy performance is now public)

=================

Dual core:

(XEN) Cycles for cold sse2: avg=5811, max=25839, min=4383, samples=208965
(XEN) Cycles for hot sse2: avg=2177, max=19665, min=1980, samples=208965
(XEN) Cycles for cold memcpy: avg=6125, max=27171, min=3969, samples=208965
(XEN) Cycles for hot memcpy: avg=668, max=17460, min=594, samples=208965
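
To make the comparison concrete, the dual-core figures above can be reduced
to SSE2-to-memcpy ratios. The following is only a sketch: the struct, the
array, and the function names are illustrative and are not part of the Xen
tree; the numbers are copied from the (XEN) lines above.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the dual-core output; field names are illustrative. */
struct cyc_result {
    const char *label;
    unsigned avg, max, min;
};

/* Numbers copied verbatim from the (XEN) lines above. */
static const struct cyc_result dual_core[] = {
    { "cold sse2",   5811, 25839, 4383 },
    { "hot sse2",    2177, 19665, 1980 },
    { "cold memcpy", 6125, 27171, 3969 },
    { "hot memcpy",   668, 17460,  594 },
};

/* Ratio of SSE2 cycles to memcpy cycles; above 1.0 means SSE2 is slower. */
static double slowdown(unsigned sse2_cycles, unsigned memcpy_cycles)
{
    assert(memcpy_cycles != 0);
    return (double)sse2_cycles / (double)memcpy_cycles;
}

/* Print the min and avg ratios for the cold-cache and hot-cache cases. */
static void print_ratios(void)
{
    printf("cold: min %.2fx, avg %.2fx\n",
           slowdown(dual_core[0].min, dual_core[2].min),
           slowdown(dual_core[0].avg, dual_core[2].avg));
    printf("hot:  min %.2fx, avg %.2fx\n",
           slowdown(dual_core[1].min, dual_core[3].min),
           slowdown(dual_core[1].avg, dual_core[3].avg));
}
```

Calling print_ratios() from a small main() shows the hot-cache SSE2 copy at
roughly 3.3x the memcpy cycle count on both min and avg, while the cold-cache
case is within about 10% either way (slower on min, slightly faster on avg).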

Quad core:

(raw numbers removed pending Intel OK, but the ratios reinforce
my claim)

Measurement code:

/* interrupts are off and lock is held */
void tmem_copy_page(char *to, char *from)
{
    *to = *from;  /* touch both pages first: don't measure TLB misses */
    /* flush both pages from the cache so the first copy runs cold */
    flush_area_local(to, FLUSH_CACHE | FLUSH_ORDER(0));
    flush_area_local(from, FLUSH_CACHE | FLUSH_ORDER(0));
    START_CYC_COUNTER(pg_copy1);
    copy_page_sse2(to, from);  /* cold cache */
    END_CYC_COUNTER(pg_copy1);
    START_CYC_COUNTER(pg_copy2);
    copy_page_sse2(to, from);  /* hot cache */
    END_CYC_COUNTER(pg_copy2);
    /* flush again so memcpy starts from the same cold state */
    flush_area_local(to, FLUSH_CACHE | FLUSH_ORDER(0));
    flush_area_local(from, FLUSH_CACHE | FLUSH_ORDER(0));
    START_CYC_COUNTER(pg_copy3);
    memcpy(to, from, PAGE_SIZE);  /* cold cache */
    END_CYC_COUNTER(pg_copy3);
    START_CYC_COUNTER(pg_copy4);
    memcpy(to, from, PAGE_SIZE);  /* hot cache */
    END_CYC_COUNTER(pg_copy4);
}

/* The per-counter variables (x_start, x_sum_cycles, x_count, x_min_cycles,
 * x_max_cycles) are declared elsewhere and not shown here. */
#define START_CYC_COUNTER(x) x##_start = get_cycles()
#define END_CYC_COUNTER(x) \
do { \
    /* elapsed cycles, computed in 32 bits; take the absolute value */ \
    x##_start = (int32_t)get_cycles() - x##_start; \
    if ((int32_t)(x##_start) < 0) x##_start = -x##_start; \
    if (x##_start < 10000000) { /* ignore context switches etc */ \
     x##_sum_cycles += x##_start; x##_count++; \
     if (x##_start < x##_min_cycles) x##_min_cycles = x##_start; \
     if (x##_start > x##_max_cycles) x##_max_cycles = x##_start; \
    } \
} while (0)
/* wrapped in do/while so it is safe inside an unbraced if/else */
#define PRINTK_CYC_COUNTER(x,text) \
do { \
  if (x##_count) printk(text " avg=%" PRIu64 ", max=%" PRId32 ", " \
  "min=%" PRId32 ", samples=%" PRIu64 "\n", \
  x##_sum_cycles ? (x##_sum_cycles/x##_count) : 0, \
  x##_max_cycles, x##_min_cycles, x##_count); \
} while (0)
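
For readers who want to reuse the same bookkeeping outside of macros, here is
a minimal sketch of the min/max/avg accumulation as a struct with inline
helpers. All names (cyc_stats, cyc_stats_record, CYC_OUTLIER_LIMIT) are
hypothetical and not from the Xen tree. It also differs from the macros above
in one respect: it takes a 64-bit unsigned difference of two timestamps
instead of a 32-bit difference followed by an absolute value.

```c
#include <assert.h>
#include <stdint.h>

/* Same cutoff as the macro version: larger samples are discarded. */
#define CYC_OUTLIER_LIMIT 10000000u

struct cyc_stats {
    uint64_t sum;    /* total cycles over all accepted samples */
    uint64_t count;  /* number of accepted samples */
    uint32_t min;    /* smallest accepted sample */
    uint32_t max;    /* largest accepted sample */
};

static void cyc_stats_init(struct cyc_stats *s)
{
    s->sum = 0;
    s->count = 0;
    s->min = UINT32_MAX;  /* so the first sample always becomes the min */
    s->max = 0;
}

/* Record one measurement given the timestamps read before and after it. */
static void cyc_stats_record(struct cyc_stats *s, uint64_t start, uint64_t end)
{
    uint64_t delta = end - start;  /* unsigned, so wraparound is well defined */

    if (delta >= CYC_OUTLIER_LIMIT)
        return;  /* treat as an outlier and ignore it */

    s->sum += delta;
    s->count++;
    if (delta < s->min)
        s->min = (uint32_t)delta;
    if (delta > s->max)
        s->max = (uint32_t)delta;
}

/* Average cycles per accepted sample, or 0 if nothing was recorded. */
static uint64_t cyc_stats_avg(const struct cyc_stats *s)
{
    return s->count ? s->sum / s->count : 0;
}
```

A caller would read the cycle counter before and after the code under test
and pass both values to cyc_stats_record(); min is then the statistic least
affected by the rare large samples discussed above.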
