WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

xen-ia64-devel

RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on dom

To: "Yang, Fred" <fred.yang@xxxxxxxxx>, "Xu, Anthony" <anthony.xu@xxxxxxxxx>, "Tian, Kevin" <kevin.tian@xxxxxxxxx>, <xen-ia64-devel@xxxxxxxxxxxxxxxxxxx>
Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on domU on tip?
From: "Magenheimer, Dan (HP Labs Fort Collins)" <dan.magenheimer@xxxxxx>
Date: Thu, 22 Dec 2005 13:08:34 -0800
Delivery-date: Thu, 22 Dec 2005 21:11:33 +0000
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-ia64-devel-request@lists.xensource.com?subject=help>
List-id: Discussion of the ia64 port of Xen <xen-ia64-devel.lists.xensource.com>
List-post: <mailto:xen-ia64-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-ia64-devel>, <mailto:xen-ia64-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-ia64-devel>, <mailto:xen-ia64-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-ia64-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcYBELBGMu7ZHSeYRUaEav49mCDvgwAAVFMwAADLrDAAIbQEoAARHDxgAAFwgsAApRrdUAAOEX2AABW4J/AAUJyDYAAareHQABB8dxAABifEIAAC14UQAAO4ruAAAk0GkA==
Thread-topic: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel] Console problem on domU on tip?
> I/Dcache sync for loaded code segments has been implemented as 
> a standard feature in Linux; please check 
> lazy_mmu_prot_update(), which is used across.

In what version of Linux?  In 2.6.12 and 2.6.14, the routine
ia64_pal_cache_flush is defined but never used, and
lazy_mmu_prot_update calls flush_icache_range (an assembly
routine), not PAL.  Has this perhaps changed to use the PAL call
in 2.6.15-rcX?  If not, perhaps Xen/ia64 should be using the same
flush_icache_range code as Linux instead of using the PAL call.
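
To make the flush_icache_range alternative concrete, here is a
rough C sketch of a range-based flush loop.  It is not the Linux
routine (which is ia64 assembly).  The names flush_range_sketch,
flush_line_fn and record_line, and the 32-byte stride, are all
assumptions made for illustration; the callback stands in for the
real per-line flush instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-line hook; real code would issue a flush instruction. */
typedef void (*flush_line_fn)(uintptr_t addr);

/* Assumed flush stride in bytes (illustrative, not taken from the thread). */
#define FLUSH_STRIDE 32u

/* Stub used for experimenting: remembers the last line address flushed. */
static uintptr_t last_flushed;
static void record_line(uintptr_t addr) { last_flushed = addr; }

/*
 * Walk [start, end) one stride at a time and flush each line.
 * Returns the number of lines flushed.  Addresses near the top of
 * the address space (where addr could wrap) are not handled here.
 */
static size_t flush_range_sketch(uintptr_t start, uintptr_t end,
                                 flush_line_fn flush_line)
{
    size_t lines = 0;
    uintptr_t addr;

    assert(flush_line != NULL);
    if (end <= start)
        return 0;

    /* Round down so a partially covered first line is included. */
    addr = start & ~(uintptr_t)(FLUSH_STRIDE - 1);
    for (; addr < end; addr += FLUSH_STRIDE) {
        flush_line(addr);
        lines++;
    }
    /* Real code would follow the loop with the required sync/serialize. */
    return lines;
}
```

Calling flush_range_sketch(start, end, record_line) over the range
a domain image was copied into touches only that range, which is
the property that matters here: no firmware call is involved.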

> Note the usage in Xen/ia64 is
> 
> 1. load dom0 code to memory#1
> 2. unzip/copy dom0 to memory#2
> 3. Exec Dom0 code from memory#2

No, the crash for me ONLY happens when launching domU.  Since
the call works fine for dom0, and crashes in PAL when launching
domU, my guess is that for domU something in the PAL code
is accessing a memory location that isn't pinned.

> As you pointed out, this system crash only happens on your HP 
> rx2620 and not other HP boxes.  This is a good opportunity to 
> track down whether there is any Xen/ia64 software quality issue. 
> This is really not a specific processor issue.  Have you 
> confirmed how many rx2620 boxes have this crash?

I do not have easy access to other machines and do not have
the means to debug PAL code.  I do not know whether this
happens only on an rx2620, happens on all rx2620s, or
may happen at another time on another machine.

> Bypassing this issue by commenting out the code, without 
> identifying the real reason for the crash, only delays this bug, 
> not to mention you are penalizing people who are using Tiger4 for 
> Xen/ia64.
> 
> The community needs the real reason behind the HP rx2620 crash 
> to clarify that it is not a Xen/ia64 software issue

I agree that it would be good to "root cause" this problem.
However PAL is fairly obscure and infrequently used.  I do not
understand it well and I suspect there is a pre-condition
for this call that we do not understand.  For example, from
reading the code in Linux, I think this is the only PAL call
that is made with psr.ic off.  This means that if the PAL
code has any exceptions, Xen will crash.  To me, this call
should be avoided and another method should be used.  If that
is not possible, we should limit exposure by ifdef or runtime
conditional code.
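
As a rough illustration of the run-time conditional, the C sketch
below probes once at boot and then gates the cache sync on a
global flag, warning rather than panicking if the sync fails.
Every name in it (split_cache_present, detect_split_cache,
sync_caches_if_needed, the status enum and the stub probes) is
hypothetical; how to actually detect a split I/D cache is exactly
the open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical status codes standing in for firmware return values. */
enum sync_status { SYNC_OK = 0, SYNC_UNIMPLEMENTED = -1 };

typedef bool (*probe_fn)(void);
typedef enum sync_status (*sync_fn)(void);

/* Hypothetical global, set once early in boot. */
static bool split_cache_present = false;

/* Stubs for experimenting; real probes and syncs are platform code. */
static bool probe_unified(void) { return false; }
static bool probe_split(void)   { return true; }
static enum sync_status sync_ok(void)   { return SYNC_OK; }
static enum sync_status sync_fail(void) { return SYNC_UNIMPLEMENTED; }

/* Run the (assumed) boot-time probe and remember the answer. */
static void detect_split_cache(probe_fn probe)
{
    assert(probe != NULL);
    split_cache_present = probe();
}

/*
 * Replaces a compile-time #ifdef with a run-time test.
 * Returns 0 if the sync was skipped, 1 if it ran and succeeded,
 * and -1 if it ran and failed (a warning is printed, no panic).
 */
static int sync_caches_if_needed(sync_fn do_sync)
{
    assert(do_sync != NULL);
    if (!split_cache_present)
        return 0;
    if (do_sync() != SYNC_OK) {
        fprintf(stderr, "warning: I/D cache sync failed; continuing\n");
        return -1;
    }
    return 1;
}
```

With this shape, a machine that does not need the sync never
reaches the risky call at all, and a machine that does need it
gets the same behavior as the ifdef'd build.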

(There's an old joke:  Man: "Doctor, it hurts when I do this."
Doctor: "Then don't do that" :-)

I am not trying to penalize Tiger4 users.  I am observing that
a patch you added to fix a problem on an unreleased machine
has caused a problem on a released machine.  Though my short
term workaround (ifdef) causes an inconvenience for you, I have
suggested other alternatives.  If we sync I-cache and D-cache
only on machines where it is necessary and the bug is never seen
again, then we are done.  If it happens again, the symptoms are
very obvious.

Thanks,
Dan

> Magenheimer, Dan (HP Labs Fort Collins) wrote:
> > Hi Fred --
> > 
> > I understand your pain.  I too wasted time building and
> > testing bits with the code turned on.  However:
> > 
> > 1) It is not uncommon in the open source community for the
> >    needs of publicly-available machines to take precedence
> >    over the needs of unreleased future machines.
> > 2) It is not uncommon in the open source community for a
> >    unreleased future machine to require different config
> >    files than the defaults.
> > 3) This specific code that is failing is not even needed
> >    on publicly-available machines.  It is not uncommon in
> >    the open source community to refuse patches that are
> >    only needed for unreleased future machines.
> > 
> > That said, I understand that your team as well as other
> > Xen developers are primarily using these future machines
> > for development and testing, so let me suggest a compromise:
> > 
> > Is there a way to dynamically test early in boot to determine
> > if this machine has split I-D caches?  If so, you could provide
> > a patch that sets a global or cpu variable appropriately and
> > changes the compile-time ifdef to a run-time if test.
> > 
> > Dan
> > 
> > P.S. Thinking about this makes me realize... the PAL cache flush
> > code may be inadequate anyway when we get to SMP-guest support
> > because the stale mapping may be on another processor.
> > 
> >> -----Original Message-----
> >> From: Yang, Fred [mailto:fred.yang@xxxxxxxxx]
> >> Sent: Thursday, December 22, 2005 9:15 AM
> >> To: Magenheimer, Dan (HP Labs Fort Collins); Xu, Anthony;
> >> Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >> Subject: RE: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel]
> >> Console problem on domU on tip?
> >> 
> >> Dan,
> >> 
> >> We spent a long time tracking down Cset#8383 yesterday, and
> >> the issue identified so far is that the I/Dcache patch was not
> >> turned on in the default build!  Hope other community members
> >> won't hit this problem again. 
> >> 
> >> From the discussion, it is definitely an issue on the
> >> specific HP box when accessing the PAL call.   The correct
> >> approach is to track it down and find the
> >> potential implementation or platform issue.
> >> 
> >> Hope you can track this down ASAP to remove this hurdle.
> >> 
> >> -Fred
> >> 
> >> Magenheimer, Dan (HP Labs Fort Collins) wrote:
> >>> With CONFIG_IA64_SPLIT_CACHE on, a new user may encounter
> >>> the problem on a shipping machine and the symptom is that
> >>> the machine immediately crashes when a domU is launched.
> >>> 
> >>> With CONFIG_IA64_SPLIT_CACHE off, a developer may encounter
> >>> a different problem on an unreleased machine.
> >>> 
> >>> I know that you are focused primarily on the unreleased machine,
> >>> but in this case, I think we should be cautious for the new user
> >>> as the developer knows to change the option when running
> >>> on the unreleased machine.
> >>> 
> >>> I will spend some more time on this when I have a chance.
> >>> I think it is a real bug (probably PAL accessing some address
> >>> which isn't pinned) that occurs only on some boxes due
> >>> to some factor like memory configuration.
> >>> 
> >>> Thanks,
> >>> Dan
> >>> 
> >>> P.S. The debug output just before the crash was:
> >>> ia64_fault: General Exception: IA-64 Reserved Register/Field fault
> >>> (data access): reflecting 
> >>> 
> >>>> -----Original Message-----
> >>>> From: Yang, Fred [mailto:fred.yang@xxxxxxxxx]
> >>>> Sent: Wednesday, December 21, 2005 10:34 PM
> >>>> To: Magenheimer, Dan (HP Labs Fort Collins); Xu, Anthony;
> >>>> Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >>>> Subject: CONFIG_IA64_SPLIT_CACHE was: [Xen-ia64-devel]
> >>>> Console problem on domU on tip?
> >>>> 
> >>>> Dan,
> >>>> 
> >>>> Can we suggest always turning on #CONFIG_IA64_SPLIT_CACHE in
> >>>> the default build configuration?  People may not be aware of
> >>>> this build flag and miss it on each new build.
> >>>> 
> >>>> All the newer generation ia64 processors will come with
> >>>> split I/Dcache, as discussed in the previous mail thread,
> >>>> and the Itanium architecture documents a possible
> >>>> split cache for future implementations.  With the default
> >>>> turned off, it is a potential bug for all Tiger4 systems
> >>>> used for daily development and for future platforms to come.
> >>>> 
> >>>> Your mail also indicates that only the HP rx2620
> >>>> system has the issue and not the other HP boxes.  Can you track
> >>>> down this issue, rather than put in a kludge for the rx2620 box?
> >>>> 
> >>>> Thanks,
> >>>> 
> >>>> -Fred
> >>>> 
> >>>> 
> >>>> Magenheimer, Dan (HP Labs Fort Collins) wrote:
> >>>>> Committed (but without removal of ifdefs until we
> >>>>> track down this problem).
> >>>>> 
> >>>>>> -----Original Message-----
> >>>>>> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> >>>>>> Sent: Monday, December 19, 2005 7:15 PM
> >>>>>> To: Magenheimer, Dan (HP Labs Fort Collins); Tian, Kevin;
> >>>>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >>>>>> 
> >>>>>> I guess maybe the firmware on your machine doesn't implement
> >>>>>> this PAL call because there was no split I/D cache at that
> >>>>>> time, so when you make this PAL call, it will return
> >>>>>> PAL_STATUS_UNIMPLEMENTED. Could you please turn on
> >>>>>> CONFIG_IA64_SPLIT_CACHE  and try this new patch to see
> >>>>>> whether your machine can boot domain0?
> >>>>>> If this patch works, could you please remove all
> >>>>>> CONFIG_IA64_SPLIT_CACHE macro?
> >>>>>> 
> >>>>>> Thanks
> >>>>>> -Anthony
> >>>>>> 
> >>>>>>> -----Original Message-----
> >>>>>>> From: Magenheimer, Dan (HP Labs Fort Collins)
> >>>>>> [mailto:dan.magenheimer@xxxxxx]
> >>>>>>> Sent: December 19, 2005 23:48
> >>>>>>> To: Xu, Anthony; Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >>>>>>> 
> >>>>>>> I have been distracted tracking another bug...
> >>>>>>> 
> >>>>>>> Here's where I got:
> >>>>>>> 
> >>>>>>> The machine is a new (April 2005) HP rx2620 so it is
> >>>>>>> not old firmware.   I can't reproduce it on a machine
> >>>>>>> with an ITP (which does have older firmware).
> >>>>>>> 
> >>>>>>> This PAL call is never used in Linux, though there is a
> >>>>>>> routine coded for it.  It is the only
> >>>>>>> PAL call coded in Linux that occurs with psr.ic off.
> >>>>>>> 
> >>>>>>> The crash I am seeing occurs either during the PAL call or
> >>>>>>> immediately upon return. 
> >>>>>>> 
> >>>>>>> Is it OK to
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Xu, Anthony [mailto:anthony.xu@xxxxxxxxx]
> >>>>>>>> Sent: Monday, December 19, 2005 2:02 AM
> >>>>>>>> To: Tian, Kevin; Magenheimer, Dan (HP Labs Fort Collins);
> >>>>>>>> xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >>>>>>>> 
> >>>>>>>> Dan,
> >>>>>>>> Have you got time to verify below discussion?
> >>>>>>>> 
> >>>>>>>> Thanks
> >>>>>>>> -Anthony
> >>>>>>>> 
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Tian, Kevin
> >>>>>>>>> Sent: December 16, 2005 10:16
> >>>>>>>>> To: Xu, Anthony; 'Magenheimer, Dan (HP Labs Fort Collins)';
> >>>>>>>>> 'xen-ia64-devel@xxxxxxxxxxxxxxxxxxx'
> >>>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on tip?
> >>>>>>>>> 
> >>>>>>>>>> From: Xu, Anthony
> >>>>>>>>>> Sent: December 16, 2005 9:54
> >>>>>>>>>> 
> >>>>>>>>>>> Also, why panic if it fails?
> >>>>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Panic is not required here, and we could just print out a
> >>>>>>>>> warning message. Previously panic is kept there to help our
> >>>>>>>>> debug in early stage. 
> >>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> Does the problem happen only on VTI?  Or both VTI and
> >>>>>>>>>>> non-VTI on split-cache machines?
> >>>>>>>>>> 
> >>>>>>>>>> Sometimes, it makes domain0 crash at the very beginning of
> >>>>>>>>>> the domain0 boot process, especially on MP machine.
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> Thanks
> >>>>>>>>>> -Anthony
> >>>>>>>>> 
> >>>>>>>>> One addition: that problem definitely exists on new
> >>>>>>>>> split-cache processors, for dom0/domU. For VTI domain, we have
> >>>>>>>>> logic within device model to ensure consistency.
> >>>>>>>>> 
> >>>>>>>>> Thanks,
> >>>>>>>>> Kevin
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Magenheimer, Dan (HP Labs Fort Collins)
> >>>>>>>>>> [mailto:dan.magenheimer@xxxxxx]
> >>>>>>>>>>> Sent: December 16, 2005 1:39
> >>>>>>>>>>> To: Tian, Kevin; xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
> >>>>>>>>>>> Cc: Xu, Anthony
> >>>>>>>>>>> Subject: RE: [Xen-ia64-devel] Console problem on domU on
> >>>>>>>>>>> tip? 
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Is this code fragment necessary for VTI to boot domU
> >>>>>>>>>>>>> or is it OK to remove?
> >>>>>>>>>>>> 
> >>>>>>>>>>>>  The comment is inaccurate and it should be domU.
> >>>>>>>>>>>> That I/D cache sync step is mandatory to boot domU on
> >>>>>>>>>>>> new IA64 processor which has split L2 I/D cache. If
> >>>>>>>>>>>> without such I/D cache sync, control panel loads domU's
> >>>>>>>>>>>> kernel image which only affects D side cache. If
> >>>>>>>>>>>> there're some stale entry on I-side cache within same
> >>>>>>>>>>>> range of dom0 image, people will see machine going
> >>>>>>>>>>>> weird.
> >>>>>>>>>>> 
> >>>>>>>>>>> I don't understand... how can there be stale entries in the
> >>>>>>>>>>> I-cache? The instructions have just been written to memory
> >>>>>>>>>>> (through D-cache) and no instructions in this domain have
> >>>>>>>>>>> yet been executed. I do see that the D-cache needs to be
> >>>>>>>>>>> flushed so that memory is coherent but are there better
> >>>>>>>>>>> ways to do that without a pal call? 
> >>>>>>>>>>> 
> >>>>>>>>>>>>  Normally I/D cache sync shouldn't force any problem.
> >>>>>>>>>>>> Possibly there's some problem with the pal calling code,
> >>>>>>>>>>>> like incorrect ITLB mapping for pal or similar issue...
> >>>>>>>>>>> 
> >>>>>>>>>>> Although the ia64_pal_cache_flush routine is defined in
> >>>>>>>>>>> linux's pal.h, it doesn't appear to be used anywhere in
> >>>>>>>>>>> Linux so there is no use model to copy.  I suspect there is
> >>>>>>>>>>> some use model for the call that we don't understand, for
> >>>>>>>>>>> example maybe it should only be called with physical
> >>>>>>>>>>> &progress?  It definitely fails every time on one of my
> >>>>>>>>>>> (newer) machines and disabling the pal call makes the
> >>>>>>>>>>> problem go away. 
> >>>>>>>>>>> 
> >>>>>>>>>>>> Though it's intermittent, please
> >>>>>>>>>>>> keep this code
> >>>>>>>>>>>> there for correctness.
> >>>>>>>>>>> 
> >>>>>>>>>>> Since the call is definitely failing under some
> >>>>>>>>>>> circumstances that we don't understand, I'm inclined to at
> >>>>>>>>>>> least put the code in an #ifdef CONFIG_SPLIT_CACHE
> >>>>>>>>>>> 
> >>>>>>>>>>> Does the problem happen only on VTI?  Or both VTI and
> >>>>>>>>>>> non-VTI on split-cache machines? 
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Dan
> >>>>>>>>>>> 
> >>>>>>>>>>> P.S. I tried Anthony's patch (which moves the PAL call after
> >>>>>>>>>>> new_thread()) but it still crashes.
> 
> 
_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel