Forum Topic - M7 + gcc4.2.1: devb-ram SEGV: (9 Items)

Andrew Pierson

06/19/2008 1:04 PM

post9440

When we use the devb-ram driver, it SEGVs a short time after it is started.  The faulting addresses are within the shared memory it allocates; we verified this by inspecting the core file for the offending instruction and the register contents it used.  You can reproduce the problem by applying the following patch to sim.c and running devb-ram by itself.

$ svn diff sim.c
Index: sim.c
===================================================================
--- sim.c       (revision 10453)
+++ sim.c       (working copy)
@@ -521,6 +521,19 @@
       return EXIT_FAILURE;
    }
 
+   // *** This will crash at a random point. ***
+   unsigned int i;
+   for( i=0 ; i<mapSize; i+=8 )
+   {
+      fprintf( stderr, "%p:  %08x %08x %08x %08x  %08x %08x %08x %08x\n",
+               addr + i,
+               addr[i  ],addr[i+1],addr[i+2],addr[i+3],
+               addr[i+4],addr[i+5],addr[i+6],addr[i+7]
+             );
+   }
+   // Never get here.
+   fprintf( stderr, "\n---END OF MEMORY---\n" );
+
        if( ram_ctrl.cflags & RAM_CFLAG_SCAN ) {
                if( ( hba = ram_alloc_hba( ) ) == NULL ) {
                        return( CAM_FAILURE );
===================================================================

Here is how we start it:
     devb-ram cam quiet ram capacity=57344 dos exe=all blk cache=128k

Since we are accessing memory that we allocated, why are we losing permission to access this memory?  How can we work 
around this problem?

Thanks,
Andy

dave carlson(deleted)

08/20/2008 7:58 AM

post12066

Re: devb-ram SEGV with ARM FCSE shmctl

Using the 640M5, we are continuing to see the problem Andy reported here several weeks ago.  Since our devb-ram was very old and very hacked, I have compiled the trunk version of devb-ram so that it is "compatible" with all the 6.4 runtime pieces (libcam, etc.).  Thus, the hacking is limited to the tiny diff below, which adds the ARM shm_ctl() call so that a ramdisk can be larger than tiny.  We need a minimum of 28MB, which is not happening without SHMCTL_GLOBAL.

Findings:
1.  A virgin driver straight from the trunk *will not crash*, except that I am on ARM with FCSE, so my ramdisk needed to be derated from 28MB to 12MB.

When I replace the cam_calloc version with the canonical shm_open/shm_ctl/mmap, I find:

2.  If I run the shm_ctl version with FreeMem:88Mb/128Mb, it SEGVs in the shm memory array nearly immediately (within a few thousand accesses -- most often in memcpy).  It is as if the mapping is "lost".

3.  If I kill some of my running apps (so that pidin info shows 98+MB free rather than 60MB free), devb-ram *will not crash*.  Core files in /dev/shmem appear to cause the same shm failure -- i.e., my idle apps are not actively causing the failure -- so it appears to depend simply on how much memory is in use.

Enclosed is a diff -u for the trivial change to devb-ram/sim.c required to demonstrate the problem with ARM FCSE.

This is a show stopper for us.

Attachment:

devb-ram_arm.diff 1.78 KB
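
The attachment itself is not reproduced in this thread. For readers unfamiliar with the "canonical shm_open/shm_ctl/mmap" sequence mentioned above, here is a minimal sketch of what such an allocation might look like. It is not the code from devb-ram_arm.diff: the function name ramdisk_map and the object name passed to it are hypothetical, and the non-QNX branch exists only so the sketch builds on other POSIX systems.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map `size` bytes of shared memory to back a ramdisk.
 * Hypothetical helper for illustration; returns NULL on failure.
 */
void *ramdisk_map(const char *name, size_t size)
{
    void *addr = MAP_FAILED;
    int rc;
    int fd;

    assert(name != NULL && size > 0);

    fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return NULL;

#if defined(__QNXNTO__)
    /* QNX: back the object with anonymous memory. SHMCTL_GLOBAL is the
     * ARM-specific hint that the object should be globally accessible,
     * which is what the post relies on to get past the FCSE size limit. */
    rc = shm_ctl(fd, SHMCTL_ANON | SHMCTL_GLOBAL, 0, size);
#else
    /* Assumption: plain POSIX sizing as a stand-in on other systems. */
    rc = ftruncate(fd, (off_t)size);
#endif

    if (rc == 0)
        addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping keeps the memory alive; the name and fd can go. */
    shm_unlink(name);
    close(fd);

    return (addr == MAP_FAILED) ? NULL : addr;
}
```

A caller would do something like `ramdisk_map("/ramdisk", 28u * 1024 * 1024)` and release the memory later with munmap().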

Sunil Kittur(deleted)

08/20/2008 8:41 AM

post12069

Re: devb-ram SEGV with ARM FCSE shmctl

When did you initially report this problem?
Did you get a TicketID or PR number?

	Sunil.


dave carlson(deleted)

08/20/2008 9:42 AM

post12080

Re: devb-ram SEGV with ARM FCSE shmctl

No, we have been (trying) not to use our priority support ($$$) for 6.4 pre-release integration issues.  The response on the forums has been our main support.  So, we reported this as topic 3140 back in June.

You guys fixed the unlink panic (topic 3141) in 24 hours.  But 3140 has languished.

We have not been pursuing it due to the libc.3 ABI change and the fact that the bsp libraries (libcam, cam-disk, io-char, etc.) had not been released yet.  Once we obtained the 6.4/libc.3 compilation of the bsp libs, I started to characterize the problem further.

It was at this point that simply saying devb-ram was broken was not going to help you track the problem.

I thought reducing the problem to a 20 line diff to trunk would help. :-)

BTW, I have thought about our apps scribbling on the page tables -- but I reject this.  1) Our apps are idle.  Idle = 99%.  The failure is nearly immediate.  2) Our apps are rock solid.  A random scribbler would (should) cause random apps to fail.  Even kernel panics.  Devb-ram is the only death.  Note also, we have ~10 other shmctl global memory segments in use that "never fail".  Whatever the interaction, it seems unique to the "large" devb-ram chunk.

Thanks (as always) for your help.

dave

Sunil Kittur(deleted)

08/20/2008 12:36 PM

post12108

Re: devb-ram SEGV with ARM FCSE shmctl

Does this happen with a 6.3.2 procnto?
The memory manager went through a major overhaul in 6.3.2 and a few
ARM shm_ctl() things fell through the cracks so I was curious if it
was something new in 6.4.0.

Also, am I correct in assuming all I need to do to reproduce it is
build a devb-ram with your sim.c modifications then run:

devb-ram cam quiet ram capacity=57344 dos exe=all blk cache=128k

Simply doing this will cause devb-ram to sigsegv?

	Sunil.


dave carlson(deleted)

08/20/2008 12:51 PM

post12110

RE: devb-ram SEGV with ARM FCSE shmctl

Sunil,

We have that odd-ball patched OS from last year + the MsgCurrent
prio-inversion fix.

Our shipping OS is:
2008/02/08-12:42:22EST
6.3.2

The devb-ram has never failed with this kernel.  NB: I have not tested
the 6.4trunk devb-ram with the old kernel/libcam/etc.  But the old
devb-ram and the trunk devb-ram+shmctl fail identically.

Note also, this kernel was recently patched with the MsgCurrent
prio-inv fix.

The command line below (as Andy supplied) will fail as described.

Our filesystem init code (mkdosfs, cp small files to /dos, etc.) will
fail 50% of the time during the small file copies.

To force a failure for the other 50%, I do:

while true ; do
	cp /usr/bin/someBigFile /dos
	echo done
	rm /dos/someBigFile
done

That loop fails immediately or runs forever -- I use 20 minutes as "proof
of life".

dave


Sunil Kittur(deleted)

08/21/2008 4:38 PM

post12225

Re: RE: devb-ram SEGV with ARM FCSE shmctl

OK, I think I've found the bug...

It's due to the variable page size support.
If you need a quick workaround, use the -m~v option
to procnto to disable the variable page size support.

Basically what is happening is that the mappings can
be coalesced from 4K small pages to 1M section
mappings. However, the cpu_pte_split/merge code
was unconditionally setting the L1 descriptors with
the process's current domain id instead of taking into
account that these PRIV/LOWERPROT mappings have
no domain id. This means that if the domain
happens to get stolen, the L1 entries' domain field
is no longer valid, so we get a domain fault the
next time we access the mapping.

To fix the problem we need to preserve the L1 
domain field that has been set when we split/merge
L1 entries.

I'll be posting a diff for review in the OSrev forum
(this was assigned PR60487).

    Sunil.
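
The failure mode described above can be illustrated with a small model of the ARM L1 descriptor's domain field and the Domain Access Control Register (DACR). This is only a sketch and not the procnto code or the PR60487 diff: the function names are hypothetical, and the choice of domain 0 for the global mapping and domain 5 for the process is arbitrary. The field positions follow the ARMv4/v5 short-descriptor format, where bits [8:5] of an L1 entry hold the domain and the DACR holds two bits per domain.

```c
#include <assert.h>
#include <stdint.h>

/* ARM short-descriptor L1 entry fields (ARMv4/v5 layout). */
#define L1_TYPE_MASK     0x3u
#define L1_TYPE_COARSE   0x1u   /* points to an L2 table of 4K pages */
#define L1_TYPE_SECTION  0x2u   /* maps 1M directly */
#define L1_DOMAIN_SHIFT  5
#define L1_DOMAIN_MASK   (0xFu << L1_DOMAIN_SHIFT)
#define L1_SECTION_BASE  0xFFF00000u

/* DACR: 2 bits per domain; a value of 0 means any access domain-faults. */
unsigned dacr_access(uint32_t dacr, unsigned domain)
{
    return (dacr >> (2u * domain)) & 0x3u;
}

/* Extract the domain number from an L1 descriptor. */
unsigned l1_domain(uint32_t desc)
{
    return (desc & L1_DOMAIN_MASK) >> L1_DOMAIN_SHIFT;
}

/* Bits of `attrs` that a section descriptor may take from the caller. */
static uint32_t section_attrs(uint32_t attrs)
{
    return attrs & ~(L1_SECTION_BASE | L1_DOMAIN_MASK | L1_TYPE_MASK);
}

/* The behaviour described as the bug: stamp the current process's domain. */
uint32_t merge_to_section_buggy(uint32_t old_l1, uint32_t paddr,
                                uint32_t attrs, unsigned cur_domain)
{
    (void)old_l1;   /* the existing domain field is ignored */
    return (paddr & L1_SECTION_BASE) | section_attrs(attrs)
         | ((uint32_t)cur_domain << L1_DOMAIN_SHIFT) | L1_TYPE_SECTION;
}

/* The behaviour described as the fix: keep the domain already in the entry. */
uint32_t merge_to_section_fixed(uint32_t old_l1, uint32_t paddr,
                                uint32_t attrs)
{
    return (paddr & L1_SECTION_BASE) | section_attrs(attrs)
         | (old_l1 & L1_DOMAIN_MASK) | L1_TYPE_SECTION;
}
```

In this model, an entry merged with the buggy variant while the process held domain 5 stops being accessible as soon as the DACR no longer grants access to domain 5, whereas the fixed variant keeps whatever domain the entry had before the merge.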

dave carlson(deleted)

08/21/2008 5:04 PM

post12228

RE: RE: devb-ram SEGV with ARM FCSE shmctl

Sunil,

That sounds pretty good -- Occam is satisfied.  I will try your
workaround.  But I am on vaca next week so it may take a bit.

Thanks for your attention on this matter.

A pleasure doing business with you.  :-)

dave

dave carlson(deleted)

08/21/2008 6:09 PM

post12230

RE: TicketID85015 - devb-ram SEGV with ARM FCSE shmctl

Adrian/Sunil,
 
The -m~v option appears to fix the problem.  (ya!)
 
I will test the procnto patch on the other side of my holiday.
 
Nice job popping our showstopper.
 
dave
