[Xen-devel] sedf scheduler may cause a CPU fatal trap

Hello,

 I have been testing the sEDF scheduler included in xen-3.0-testing.hg.
Everything works fine except for a fatal CPU trap that appeared
several times. Here is what I did on an SMP (two-processor) machine:

 I started two unprivileged domains and compiled a kernel in
each of them using the command:

    # time sh -c "make O=/home/guill/build/k2614 oldconfig \
               && make O=/home/guill/build/k2614"


1) Two domains with default sedf settings (which seem to be best-effort):

          |  domain 1   |  domain 2  |
          |-------------|------------|
     real | 11m43.034s  | 11m46.293s |     
     user | 10m20.220s  | 10m25.140s |
     sys  |  1m08.330s  |  1m09.100s |
           --------------------------

   xentop showed that domain 1 was using around 99% of the CPU, and
   the same was true for domain 2.

2) Two domains with period/slice of 20ms/5ms (i.e. 25% of CPU time) and
   20ms/15ms (i.e. 75% of CPU time), with no extra time:
     xm sched-sedf 1 20000000 5000000 0 0 0
     xm sched-sedf 2 20000000 15000000 0 0 0
  
          |  domain 1   |  domain 2  |
          |-------------|------------|
     real | 45m35.626s  | 15m04.808s |     
     user | 41m04.300s  | 13m37.940s |
     sys  |  4m24.050s  |  1m25.160s |
           --------------------------

   xentop showed that domain 1 was using around 25% of the CPU,
   whereas domain 2 was using around 75%.
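
   As a rough cross-check of the figures above, the sketch below derives
   the reserved CPU share from the period and slice values (in
   nanoseconds, as passed to xm sched-sedf in this test) and compares the
   "real" times of case 2 with the best-effort times of case 1. The
   helper names are hypothetical and are not part of xm or Xen.

```python
# Hypothetical helpers for checking the numbers in this report.
# They are not part of xm or Xen.

def cpu_share(period_ns, slice_ns):
    """Fraction of one CPU reserved by a (period, slice) pair."""
    if period_ns <= 0:
        raise ValueError("period must be positive")
    return slice_ns / period_ns


def to_seconds(stamp):
    """Convert a `time` stamp such as '45m35.626s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def slowdown(limited, best_effort):
    """Ratio of wall-clock time under a reservation to best-effort time."""
    return to_seconds(limited) / to_seconds(best_effort)


if __name__ == "__main__":
    print(cpu_share(20_000_000, 5_000_000))      # domain 1 reservation
    print(cpu_share(20_000_000, 15_000_000))     # domain 2 reservation
    print(slowdown("45m35.626s", "11m43.034s"))  # domain 1, case 2 vs case 1
    print(slowdown("15m04.808s", "11m46.293s"))  # domain 2, case 2 vs case 1
```

   The slowdowns come out at roughly 3.9 and 1.3, close to the
   1/0.25 = 4 and 1/0.75 = 1.33 one would expect if each domain were
   held to exactly its reservation.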

3) Two domains with period/slice of 20ms/5ms (i.e. 25% of CPU time) and
   20ms/15ms (i.e. 75% of CPU time), with extra time:
     xm sched-sedf 1 20000000 5000000 0 1 0
     xm sched-sedf 2 20000000 15000000 0 1 0
  
          |  domain 1   |  domain 2  |
          |-------------|------------|
     real | 11m48.687s  | 11m50.909s |     
     user | 10m36.870s  | 10m36.180s |
     sys  |  1m08.320s  |  1m09.540s |
           --------------------------

   With extra time enabled, xentop showed that domain 1 and domain 2
   were each using around 97% of the CPU.


4) Two domains with period/slice of 20ms/5ms (i.e. 25% of CPU time) and
   20ms/15ms (i.e. 75% of CPU time), without extra time, but changing
   the policy once the compilation in the second domain has finished:
     xm sched-sedf 1 20000000 5000000 0 0 0
     xm sched-sedf 2 20000000 15000000 0 0 0
   then, once the second domain has finished its job:
     xm sched-sedf 1 20000000 0 0 1 0
     xm sched-sedf 2 20000000 0 0 1 0
   
 When I changed the policy, the Xen hypervisor crashed and I got the
following error:

(XEN) CPU:    1
(XEN) EIP:    e008:[<ff108d7e>] __qdivrem+0x4e/0x580
(XEN) EFLAGS: 00010046   CONTEXT: hypervisor
(XEN) eax: 00000001   ebx: 00000000   ecx: 00000000   edx: 00000000
(XEN) esi: c4b40000   edi: 00000004   ebp: 00000000   esp: ff1afd94
(XEN) cr0: 8005003b   cr3: 6d236000
(XEN) ds: e010   es: e010   fs: 0000   gs: 0033   ss: e010   cs: e008
(XEN) Xen stack trace from esp=ff1afd94:
(XEN)    00000002 00000001 00007100 ff1afe20 00000989 0000ff1f 00000002 00000009
(XEN)    00000002 00000001 0000c000 ff1afde0 ff1afdfc ff1afe18 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000991 ff1afe38 00000571 00000000
(XEN)    00000000 00000000 00000000 0000ff1f 0000c000 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 0000c944 00004000 00000004 00000000 ff1b5e84
(XEN)    ff1b6d84 ffbfa980 ff10c8b0 00000000 c4b40000 00000004 00000000 ff1092ff
(XEN)    c4b40000 00000004 00000000 00000000 00000000 ff1ad080 b46d68de ff1b5e88
(XEN)    ff1b5e80 c4b40000 00000004 ff10e443 c4b40000 00000004 00000000 00000000
(XEN)    ff1afee4 b2a993ef 000012b4 ff10d898 00001000 00000001 ff1b5080 ffbfa980
(XEN)    ff1b5e80 b2aea7e3 000012b4 ff10d8c0 b2aea7e3 000012b4 ff1b5080 00000080
(XEN)    0000efff 0000fe80 e6525499 000012b4 00000001 ff1924a0 b3355354 000012b4
(XEN)    ffbfa988 00000001 ffbfa990 ffbfa998 00000096 00000001 bfb12eb8 00000096
(XEN)    00000000 00000000 ff174010 ff174010 b2aea7e3 000012b4 ff1aff74 ff10ec3b
(XEN)    ff1aff74 b2aea7e3 000012b4 00000033 0000000c 00000000 00000000 ff12111d
(XEN)    0000000c 00000000 00055080 ff1b5080 00000080 00000000 00000001 ff1b5080
(XEN)    ff1affb4 00000000 ff1249ce ff1affb4 ff1affb4 00000020 00000000 00000080
(XEN)    b7efa860 00000005 bfb12eb8 ff10f732 00000005 bfb12eb8 ff1b5080 ff1354c6
(XEN)    b7ef8ff4 00000000 00000001 b7efa860 00000005 bfb12eb8 00000000 000d0000
(XEN)    b7e2e549 00000073 00010286 bfb12e90 0000007b 0000007b 0000007b 00000000
(XEN)    00000033 00000001 ff1b5080
(XEN) Xen call trace:
(XEN)    [<ff108d7e>] __qdivrem+0x4e/0x580
(XEN)    [<ff10c8b0>] runq_comp+0x0/0x70
(XEN)    [<ff1092ff>] __divdi3+0x4f/0xa0
(XEN)    [<ff10e443>] desched_extra_dom+0x1f3/0x210
(XEN)    [<ff10d898>] sedf_do_schedule+0x228/0x260
(XEN)    [<ff10d8c0>] sedf_do_schedule+0x250/0x260
(XEN)    [<ff10ec3b>] __enter_scheduler+0x7b/0x2e0
(XEN)    [<ff12111d>] mod_l1_entry+0x9d/0xf0
(XEN)    [<ff1249ce>] do_general_protection+0xbe/0x180
(XEN)    [<ff10f732>] do_softirq+0x32/0x50
(XEN)    [<ff1354c6>] process_softirqs+0x6/0x8
(XEN)
(XEN) ************************************
(XEN) CPU1 FATAL TRAP 0 (divide error), ERROR_CODE 0000, IN INTERRUPT CONTEXT. 
(XEN) System shutting down -- need manual reset.
(XEN) ************************************


This fatal trap does not appear if the slice is left non-zero, for example:
  xm sched-sedf 1 20000000 5000000 0 1 0
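
The call trace shows __divdi3 (the 64-bit signed division helper) being
called from desched_extra_dom, and the crashing commands in case 4 set
the slice to 0. One plausible reading, not confirmed against the
scheduler source, is that an extra-time score is computed by dividing
by the slice. The sketch below models that assumption; the formula, the
10-bit fixed-point scaling, and both function names are hypothetical,
and Python's ZeroDivisionError merely stands in for the CPU divide
error.

```python
# Assumption: the extra-time utilisation score is period/slice in fixed
# point with 10 fractional bits. This is a hypothetical model for
# illustration, not code taken from Xen.

def extra_util_score(period_ns, slice_ns):
    """Unguarded score: raises ZeroDivisionError when slice_ns is 0."""
    return (period_ns << 10) // slice_ns


def guarded_score(period_ns, slice_ns):
    """Same score, but returns None instead of dividing by a zero slice."""
    if slice_ns == 0:
        return None  # extra-time-only domain: no utilisation score
    return extra_util_score(period_ns, slice_ns)


if __name__ == "__main__":
    print(guarded_score(20_000_000, 5_000_000))  # non-zero slice, no trap
    print(guarded_score(20_000_000, 0))          # zero slice handled safely
    try:
        extra_util_score(20_000_000, 0)          # parameters from case 4
    except ZeroDivisionError as exc:
        print("division by zero:", exc)
```

If this reading is right, either rejecting a zero slice when the
parameters are set or guarding the division as in guarded_score would
avoid the trap.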


Has anyone else seen this problem? I can reproduce the bug on my Xeon
x86_64 box, so I can provide more information if needed.



Hope this helps,
Best regards,

Guillaume

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel