This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
Home Products Support Community News


[Xen-devel] Re: Xen spinlock questions

To: Jan Beulich <jbeulich@xxxxxxxxxx>
Subject: [Xen-devel] Re: Xen spinlock questions
From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date: Wed, 06 Aug 2008 01:47:50 -0700
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Delivery-date: Wed, 06 Aug 2008 01:48:27 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <48996BB8.76E4.0078.0@xxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <4896F39A.76E4.0078.0@xxxxxxxxxx> <48975A30.2080408@xxxxxxxx> <48981005.76E4.0078.0@xxxxxxxxxx> <4898976A.3080406@xxxxxxxx> <48996BB8.76E4.0078.0@xxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Thunderbird (X11/20080501)
Jan Beulich wrote:
More on that: You'll really need two per-CPU variables afaics, one for the
current non-irq lock being spun upon, and one for the current irqs-disabled
one. The latter one might not need saving/restoring as long as you don't
re-enable interrupts, but the code might turn out cleaner when doing the
save/restore regardless, e.g. for me (doing ticket locking):

Not sure I follow. How do you use the second array at the kick end of the process?

I ended up just storing the previous value locally, and then restoring it. It assumes that locks will strictly nest, of course, but I think that's reasonable.

--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -135,25 +135,39 @@
static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

-static inline void spinning_lock(struct xen_spinlock *xl)
+ * Mark a cpu as interested in a lock.  Returns the CPU's previous
+ * lock of interest, in case we got preempted by an interrupt.
+ */
+static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
-       __get_cpu_var(lock_spinners) = xl;
-       wmb();                  /* set lock of interest before count */
+       struct xen_spinlock *prev;
+       prev = xchg(&__get_cpu_var(lock_spinners), xl);
+       /* xchg is a barrier */
        asm(LOCK_PREFIX " incw %0"
            : "+m" (xl->spinners) : : "memory");
+       return prev;

-static inline void unspinning_lock(struct xen_spinlock *xl)
+ * Mark a cpu as no longer interested in a lock.  Restores previous
+ * lock of interest (NULL for none).
+ */
+static inline void unspinning_lock(struct xen_spinlock *xl, struct 
xen_spinlock *prev)
        asm(LOCK_PREFIX " decw %0"
            : "+m" (xl->spinners) : : "memory");
-       wmb();                  /* decrement count before clearing lock */
-       __get_cpu_var(lock_spinners) = NULL;
+       wmb();                  /* decrement count before restoring lock */
+       __get_cpu_var(lock_spinners) = prev;

static noinline int xen_spin_lock_slow(struct raw_spinlock *lock, bool 
        struct xen_spinlock *xl = (struct xen_spinlock *)lock;
+       struct xen_spinlock *prev;
        int irq = __get_cpu_var(lock_kicker_irq);
        int ret;
        unsigned long flags;
@@ -163,7 +177,8 @@
                return 0;

        /* announce we're spinning */
-       spinning_lock(xl);
+       prev = spinning_lock(xl);
        flags = __raw_local_save_flags();
        if (irq_enable) {
                ADD_STATS(taken_slow_irqenable, 1);
@@ -189,7 +204,7 @@

-       unspinning_lock(xl);
+       unspinning_lock(xl, prev);
        return ret;

int xen_spin_wait(raw_spinlock_t *lock, unsigned int token)
        struct spinning *spinning;
        int rc;

        /* if kicker interrupt not initialized yet, just spin */
        if (spinlock_irq < 0)
                return 0;

        /* announce we're spinning */
        spinning = &__get_cpu_var(spinning);
        if (spinning->lock) {
                spinning = &__get_cpu_var(spinning_irq);
        spinning->ticket = token >> TICKET_SHIFT;
        spinning->lock = lock;

        /* clear pending */

        /* check again make sure it didn't become free while
           we weren't looking  */
        rc = __raw_spin_trylock(lock);
        if (!rc) {
                /* block until irq becomes pending */

        /* announce we're done */
        spinning->lock = NULL;

        return rc;

On an 8-core system I'm seeing between 20,000 (x86-64) and 35,000
(i686) wakeup interrupts per CPU. I'm not certain this still counts as rare.
Though that number may go down a little once the hypervisor doesn't
needlessly wake all polling vCPU-s anymore.

What workload are you seeing that on? 20-35k interrupts over what time period?

In my tests, I only see it fall into the slow path a couple of thousand times per cpu for a kernbench run.

The main reason for ticket locks is to break the egregious unfairness that (some) bus protocols implement. That level of fairness shouldn't be necessary here because once the cpus fall to blocking in the hypervisor, it's up to Xen to tie-break.

Why? The hypervisor doing the tie-break makes it possibly even more
unfair, whereas with tickets and a way to kick the next owner (and only
it) almost the same level of fairness as on native can be achieved.

I don't think strict fairness is a particularly desirable property; it's certainly not an unambiguous win. The important thing is solves is total starvation, and if the Xen scheduler ends up starving a CPU then that's a scheduler bug we can fix. We can't fix the cache coherence protocols, so we need to use something like a ticket lock.

only heuristic to determine by measurement is the number of spin loops
before going into poll mode (which your original patch's description and
implementation for some reason disagree about - the description says
2^16 loops, the implementation uses 2^10). Obviously, the optimal
numbers may turn out different for byte and ticket locks.

Yes, there's a bit of confusion about loop iterations vs cycles. The original paper said that after 2^16 *cycles* 90% of locks have been taken, which I map (approximately) to 2^10 loop iterations (but originally I had 2^16 iterations). I don't think it's all that critical; it probably depends too much on workload and precise system and VM configuration to really finely tune, and doesn't end up making all that much difference.

That said, I've implemented a pile of debugfs infrastructure for extracting lots of details about lock performance so there's some scope for tuning it (including being able to change the timeout on the fly to see how things change).

And doing a wake-them-all approach isn't good here, as was
(supposedly, I wasn't there) explained in a talk on the last summit.

Well, yes, wake-all is a total disaster for ticket locks, since it just causes scheduler thrashing. But for byte locks it doesn't matter all that much since whoever runs next will take the lock and make progress. The others will wake up and spin for a bit before sleeping; not great, but still better than plain spinning.

It might be worth using a smaller timeout for re-lock attempts after waking up, but that could also lead to starvation as a sleeper will be at a disadvantage against someone who's currently spinning on the lock, and so will tend to lose against new lock takers.

I think we can also mitigate poll's wake-all behaviour by seeing if our particular per-cpu interrupt is pending and drop back into poll immediately if not (ie, detect a spurious wakeup).


Xen-devel mailing list