Re: oxenstored performance issue when starting VMs in parallel
On Mon, 2020-09-21 at 17:40 +0100, Julien Grall wrote:
>
> On 21/09/2020 14:05, jerome leseinne wrote:
> > Hello,
>
> Hello,
>
> I am only CCing Edwin, who is working on OXenStored. Hopefully, he
> will be able to give you some pointers.
>
> > We are developing a solution based on Xen 4.13 which is constantly
> > creating / destroying VMs.
> >
> > To summarize our lifecycle:
> >
> > - xl restore vmX
> > - xl cd-insert ....
> > - We do our stuff for ~ 2 minutes
> > - xl destroy vmX
> >
> > So our VMs have a lifetime of approximately 2 minutes.
> >
> > The number of VMs we run in parallel depends on the underlying
> > server.
> >
> > We are seeing the issue on our largest server, which runs 30 HVM
> > VMs in parallel.
> >
> > On this server oxenstored is constantly running at 100% CPU usage,
> > and some VMs are almost stuck or unresponsive.
> >
> > This is not a hardware issue: the server has 72 Xeon cores, 160 GB
> > of memory, and a very fast I/O subsystem.
> > Everything else is running smoothly on the server.
> >
> > What we see in xenstore-access.log is that the number of WATCH
> > events matches the number of currently running VMs.
> >
> > So, for example, a single WRITE event is followed by around 30
> > watch events:
> >
> > [20200918T15:15:18.045Z] A41354 write
> > /local/domain/0/backend/qdisk/1311/5632
> > [20200918T15:15:18.046Z] A41248 w event
> > backend/qdisk/1311/5632 38ed11d9-9a38-4022-ad75-7c571d4886ed
> > [20200918T15:15:18.046Z] A41257 w event
> > backend/qdisk/1311/5632 98fa91b8-e88b-4667-9813-d95196257288
> > [20200918T15:15:18.046Z] A40648 w event
> > backend/qdisk/1311/5632 e6fd9a35-61ec-4750-93eb-999fb7f662fc
> > [20200918T15:15:18.046Z] A40542 w event
> > backend/qdisk/1311/5632 6a39c858-2fd4-46e4-a810-485a41328f8c
Hi,

Each A<id> here refers to a different unix domain socket connection.
They have all registered a watch on this key (or a parent).
To find out what key they are watching, you can look for the watch
registration:
grep 'A41248 .*watch' /var/log/xenstored-access.log
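If you want a quick overview of which paths have the most watches
registered, something along these lines should work (a rough sketch:
it assumes the registration lines use the same field layout as the
event lines above, so adjust the awk field number if your log format
differs):
# count watch registrations per path; the path is the 4th field in
# "<timestamp> <conn> watch <path> <token>" lines
grep -w watch /var/log/xenstored-access.log | awk '{print $4}' | sort | uniq -c | sort -rn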
You can find all the xenstore clients connected to oxenstored by
listing the unix sockets and noting the numbers in the port column:
ss -a --unix -p|grep xenstored
Then look up those numbers, e.g. assuming you see 22269:
ss -a --unix -p|grep 22269
This should tell you what the processes are.
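To avoid doing that lookup by hand, a small loop along these lines
should work (a sketch only: the column layout of ss varies between
versions, so verify the awk field number against your output):
# list every established connection to the xenstored socket, then
# resolve each peer number to the owning process ($8 is the peer
# port column here; adjust if your ss lays columns out differently)
ss -a --unix -p | awk '/xenstored/ && /ESTAB/ {print $8}' | while read peer; do
    ss -a --unix -p | grep -w "$peer"
done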
I have a suspicion that it is related to this code in qemu:
    xenbus->backend_watch =
        xen_bus_add_watch(xenbus, "", /* domain root node */
                          "backend", xen_bus_backend_changed,
                          &local_err);
If I read the code right, this seems to set up a watch on
/local/domain/0/backend, so every write anywhere under that subtree
fires one watch event per running qemu. If you have N qemus running
(one for each disk), a single write therefore fires N watch events,
and since the number of writes also grows with N, you get on the
order of N^2 watch events fired in total.
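You can demonstrate this multiplication without qemu at all; here is
a minimal sketch (the path is taken from your log above, and the
count of 30 matches your VM count):

for i in $(seq 1 30); do
    # each background watcher stands in for one qemu watching the
    # backend root; note that a xenstore watch also fires once
    # immediately on registration, so expect an initial burst
    xenstore-watch /local/domain/0/backend &
done
xenstore-write /local/domain/0/backend/qdisk/1311/5632 test
# the single write above should make all 30 watchers fire
jobs -p | xargs kill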
Best regards,
--Edwin
> > [20200918T15:15:18.046Z] A41141 w event
> > backend/qdisk/1311/5632 8762d552-b4b4-41ef-a2aa-23700f790ea2
> > [20200918T15:15:18.046Z] A41310 w event
> > backend/qdisk/1311/5632 4dc2a9ae-6388-4b0c-9c98-df3c897a832f
> > [20200918T15:15:18.046Z] A40660 w event
> > backend/qdisk/1311/5632 6abf244d-5939-4540-b176-4ec7d14b392c
> > [20200918T15:15:18.046Z] A41347 w event
> > backend/qdisk/1311/5632 ecb93157-9929-43e2-8ed4-f5e78ab2f37d
> > [20200918T15:15:18.046Z] A41015 w event
> > backend/qdisk/1311/5632 a1fec49f-e7cc-4059-87d3-ce43f386746e
> > [20200918T15:15:18.046Z] A41167 w event
> > backend/qdisk/1311/5632 e9419014-9fd2-47c0-b79d-30f99d9530d6
> > [20200918T15:15:18.046Z] A41100 w event
> > backend/qdisk/1311/5632 a2754a91-ecd6-4b6b-87ea-b68db8b888df
> > [20200918T15:15:18.046Z] A41147 w event
> > backend/qdisk/1311/5632 176a1c3c-add7-4710-a7ee-3b5548d7a56a
> > [20200918T15:15:18.046Z] A41305 w event
> > backend/qdisk/1311/5632 afe7933b-c92d-4403-8d6c-2e530558c937
> > [20200918T15:15:18.046Z] A40616 w event
> > backend/qdisk/1311/5632 35fa45e0-21e8-4666-825b-0c3d629f378d
> > [20200918T15:15:18.046Z] A40951 w event
> > backend/qdisk/1311/5632 230eb42f-d700-46ce-af61-89242847a978
> > [20200918T15:15:18.046Z] A40567 w event
> > backend/qdisk/1311/5632 39cc7ffb-5045-4120-beb7-778073927c93
> > [20200918T15:15:18.046Z] A41363 w event
> > backend/qdisk/1311/5632 9e42e74a-80fb-46e8-81f2-718628bf70f6
> > [20200918T15:15:18.046Z] A40740 w event
> > backend/qdisk/1311/5632 1a64af31-fee6-45be-b8d8-c98baa5e162f
> > [20200918T15:15:18.046Z] A40632 w event
> > backend/qdisk/1311/5632 466ef522-cb76-4117-8e93-42471897c353
> > [20200918T15:15:18.046Z] A41319 w event
> > backend/qdisk/1311/5632 19ea986b-e303-4180-b833-c691b2b32819
> > [20200918T15:15:18.046Z] A40677 w event
> > backend/qdisk/1311/5632 fb01629a-033b-41d6-8349-cec82e570238
> > [20200918T15:15:18.046Z] A41152 w event
> > backend/qdisk/1311/5632 84ce9e29-a5cc-42a1-a47b-497b95767885
> > [20200918T15:15:18.047Z] A41233 w event
> > backend/qdisk/1311/5632 ea944ad3-3af6-4688-8076-db1eac25d8e9
> > [20200918T15:15:18.047Z] A41069 w event
> > backend/qdisk/1311/5632 ce57e169-e1ea-4fb5-b97f-23e651f49d79
> > [20200918T15:15:18.047Z] A41287 w event
> > backend/qdisk/1311/5632 d31110c8-ae0b-4b9d-b71f-aa2985addd1a
> > [20200918T15:15:18.047Z] A40683 w event
> > backend/qdisk/1311/5632 f0e4b0a0-fad0-4bb7-b01e-b8a31107ba3d
> > [20200918T15:15:18.047Z] A41177 w event
> > backend/qdisk/1311/5632 9ff80e49-4cca-4ec9-901a-d30198104f29
> > [20200918T15:15:18.047Z] D0 w event
> > backend/qdisk/1311/5632 FFFFFFFF8276B520
> > [20200918T15:15:18.047Z] A40513 w event
> > backend/qdisk/1311/5632 d35a9a42-c15e-492c-a70d-d8b20bafec8f
> > [20200918T15:15:18.047Z] A41354 w event
> > backend/qdisk/1311/5632 e4456ca4-70f4-4afc-9ba1-4a1cfd74c8e6
> >
> > We are not sure this is the root cause of the issue, but this is
> > the only real difference we can see in the log.
> >
> > We don't understand why the number of WATCH events is related to
> > the number of concurrently running VMs.
> > A watch should be registered and fired only for the relevant
> > domain ID, so a write to a specific node path should trigger only
> > one watch event, not 30 as in our case.
> >
> > Any ideas / comments?
>
> Cheers,
>