On Fri, Mar 25, 2011 at 11:57 AM, Teck Choon Giam
> On Thu, Mar 24, 2011 at 7:57 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx> wrote:
>> On Wed, Mar 16, 2011 at 12:40:01PM -0400, Konrad Rzeszutek Wilk wrote:
>>> > > - turn on CONFIG_DEBUG_PAGEALLOC
>>> > > - turn on CONFIG_DEBUG_LIST
>>> > > - turn on CONFIG_DEBUG_KMEMLEAK
>>> > > - turn on CONFIG_JBD_DEBUG, CONFIG_JBD2_DEBUG
>>> > > - turn on CONFIG_SLUB_DEBUG_ON
>>> > >
>>> > > And see if anything starts coming out.
>>> > >
>>> > Thanks a lot for both of you spending time to do so. It isn't easy as I
>>> > believe this is something related to kernel 2.6.32.x and just wondering is
>>> > there something related to *sched_domains? I read recent mails in LKML
>>> Hmmm.. no idea.
>>> > about rebuild_sched_domains consider dangerous issues... and that is about
>>> > recent kernels but won't know what recent kernels that refer to... ...
>>> > I will do those config changes on one of my test servers when time
>>> > permits and will post the results/output here when done.
>>> OK. Thank you!
>> I've been using Jeremy's latest tree: 188.8.131.52 (there is even a 184.108.40.206)
>> and I can't hit this bug anymore. Would appreciate your input if you still
>> see this.
> This is a report back to you: unfortunately, I am still able to hit this bug with
> git url = git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
> git branch = xen/next-2.6.32
> git commit = 4306ea8f6db3d83a5a2bbfe5448dd78e6846475a
> Kindest regards,
> Giam Teck Choon
Maybe this is good news ;)
This is my report on the kernel configuration options suggested by
Konrad and Jeremy. I think I have found the option that prevents this
same BUG from happening, so Konrad and Jeremy should have fewer places
to look. Sorry, this will be a little lengthy, and sorry for my poor
English. I am using the following:
git url = git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
git branch = xen/next-2.6.32
git commit = df3a5560166da5a05de93f2fc36b718cc43c6c3c
hg_root = http://xenbits.xensource.com/xen-4.0-testing.hg
hg_changeset = 21465
With my old kernel config, I still hit this BUG easily with testcrash
loop 100. In fact, I mostly hit it with fewer than 30 loops. My two
test servers are set up with at least 20 x 5GB LVs, so each loop cycle
does at least 20 lvcreate/lvremove operations.
With the configuration changes suggested by Konrad and Jeremy, I am
unable to reproduce this same BUG with testcrash.sh loop 1000 on either
of my two test servers. A short summary follows:
> - turn on CONFIG_DEBUG_PAGEALLOC
> - turn on CONFIG_DEBUG_LIST
Already set originally.
> - turn on CONFIG_DEBUG_KMEMLEAK
I don't think I can enable this on x86_64, as there is no option for
it in the x86_64 arch. However, I can see this option in the x86_32
arch, so I guess it depends on x86_32. Anyway, I don't think this
matters for my case... why? Read on... :P
> - turn on CONFIG_JBD_DEBUG, CONFIG_JBD2_DEBUG
> - turn on CONFIG_SLUB_DEBUG_ON
OK, set. I needed to change from CONFIG_SLAB to CONFIG_SLUB, which
also sets CONFIG_SLUB_DEBUG=y besides CONFIG_SLUB_DEBUG_ON=y.
So from the testcrash results on my two servers, I know the difference
must come from the kernel configuration changes, and one of them is
what prevents this BUG. I am now setting one of the mentioned CONFIG
options at a time and running the same testcrash again, to determine
which single option stops this BUG from triggering. The results are
below, all using my old config as the base with *only one CONFIG
option changed at a time*, running testcrash loop 100:
With CONFIG_DEBUG_PAGEALLOC=y: Result : I think this is the one that
prevents this same BUG, as one of my test servers has already passed
testcrash loop cycle 100... now testing testcrash loop 10000 :P
With CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y: Result : CRASH
With CONFIG_SLUB=y, CONFIG_SLUB_DEBUG=y and CONFIG_SLUB_DEBUG_ON=y:
Result : CRASH
With CONFIG_JBD_DEBUG=y: Result : CRASH
With CONFIG_JBD2_DEBUG=y: Result : CRASH
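The one-option-at-a-time procedure above can be sketched roughly as
below. This is only an illustration, not the script actually used for
these tests: the names VARIANTS and enable_options are made up here,
and the sketch only produces the text of each config variant from a
base config. It does not handle choice groups such as CONFIG_SLAB
versus CONFIG_SLUB, so the result would still need to go through
`make oldconfig` (or menuconfig) before building.

```python
import re

# Option sets that were each tested alone against the old base config.
# The dictionary keys are hypothetical labels chosen for this sketch.
VARIANTS = {
    "pagealloc": ["CONFIG_DEBUG_PAGEALLOC"],
    "slub": ["CONFIG_SLUB", "CONFIG_SLUB_DEBUG"],
    "slub_debug_on": ["CONFIG_SLUB", "CONFIG_SLUB_DEBUG", "CONFIG_SLUB_DEBUG_ON"],
    "jbd": ["CONFIG_JBD_DEBUG"],
    "jbd2": ["CONFIG_JBD2_DEBUG"],
}


def enable_options(base_config, options):
    """Return a copy of base_config (a string) with each option set to =y."""
    lines = base_config.splitlines()
    for opt in options:
        # Match either "# CONFIG_FOO is not set" or "CONFIG_FOO=<anything>".
        pattern = re.compile(
            r"^(# %s is not set|%s=.*)$" % (re.escape(opt), re.escape(opt))
        )
        replaced = False
        for i, line in enumerate(lines):
            if pattern.match(line):
                lines[i] = opt + "=y"
                replaced = True
        if not replaced:
            # Option not mentioned in the base config at all: append it.
            lines.append(opt + "=y")
    return "\n".join(lines) + "\n"
```

Each variant would then be written out as its own .config, built,
booted, and run through the same testcrash loop, so that any change in
behaviour can be pinned on that one variant.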
Can others who hit this same BUG confirm that their kernel config does
not have CONFIG_DEBUG_PAGEALLOC set? I think most production servers
will not have this config option enabled by default. If so, can you
test with CONFIG_DEBUG_PAGEALLOC=y instead? Sorry, I am still in the
testing phase for this configuration, and hopefully one of my servers
can pass this testcrash with loop 10000 (am I crazy? LOL).
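For anyone who wants to check their running kernel quickly, here is a
minimal sketch. The two file locations are assumptions: many distros
install /boot/config-<release>, and /proc/config.gz exists only if the
kernel was built with CONFIG_IKCONFIG_PROC. The function names are
made up for this example.

```python
import gzip
import os
import platform


def option_state(config_text, option):
    """Return the value of option (e.g. "y" or "m"), or None if it is not set."""
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines start with "#" and never match here.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


def read_running_config():
    """Try the assumed locations of the running kernel's config file."""
    candidates = ["/boot/config-" + platform.release(), "/proc/config.gz"]
    for path in candidates:
        if os.path.exists(path):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as fh:
                return fh.read()
    return None


if __name__ == "__main__":
    text = read_running_config()
    if text is None:
        print("no kernel config found in the assumed locations")
    else:
        state = option_state(text, "CONFIG_DEBUG_PAGEALLOC")
        print("CONFIG_DEBUG_PAGEALLOC:", state if state else "not set")
```

If neither file exists, the .config from the kernel build directory
can be passed to option_state() as text instead.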
If this is really the case (I hope), then I guess there must be some
code that behaves differently depending on CONFIG_DEBUG_PAGEALLOC, as
without CONFIG_DEBUG_PAGEALLOC set the BUG is hit but with it set the
BUG is not hit (at least as of composing this mail reply/report)...
I will report back my final testcrash loop 10000 result when it
finishes... keeping fingers crossed!!!
Can anyone test the kernel version 2.6.38 PVOPS tree with
CONFIG_DEBUG_PAGEALLOC both unset and set, to see whether this BUG
exists there?
I hope this report is useful, especially to Konrad and Jeremy... ;)
Giam Teck Choon
Xen-devel mailing list