Forum Topic - mbuf corruption (21 items)
mbuf corruption  
We're trying to track down a problem where an mbuf gets corrupted somewhere along the line, which eventually causes a 
fault and io-pkt crashes.

It's not obvious what's going on, and we've already spent quite a bit of time investigating.

However, before we continue digging into it, I just wanted to ask if there have been any changes since the 6.5.0 release
 that might cause one or more fields of an mbuf to be invalid, particularly m_ext.ext_page.

If there are no known issues, we'll continue with what we're doing, but I didn't want to spend a whole lot of time on it
 if this is a known issue.

Thanks,
lew
Re: mbuf corruption  
Which h/w driver?
Re: mbuf corruption  
In this case, e1000.
Re: mbuf corruption  
The e1000 driver doesn't change the ext_page variable. All the driver
does is use this offset to get the physical address of the buffer.


-- 
Hugh Brown                      (613) 591-0931 ext. 2209 (voice)
QNX Software Systems Ltd.        (613) 591-3579           (fax)
175 Terence Matthews Cres.       email:  hsbrown@qnx.com
Kanata, Ontario, Canada.
K2M 1W8
 


Re: mbuf corruption  
> The e1000 driver doesn't change the ext_page variable. All the driver
> does is use this offset to get the physical address of the buffer.

I agree.  I didn't say it was a problem in the hardware driver, I was just answering the question :)

That's why I asked whether there have been any other changes in the rest of the stack that might cause mbuf corruption.
Re: mbuf corruption  
Not that I'm aware of.



RE: mbuf corruption  


We are also having issues with io-pkt-v4 crashing. In your case, is it related to tcpip or QNET?  In our case, the problem happened when we handled one extra tcpip stream in our application.  We finally got access to the network source last week, so I'll investigate this further in the following weeks.  We have a case open, but since I'm not able to create a simple test case for QNX to reproduce the issue, they aren't really able to assist.

Re: RE: mbuf corruption  
> We are also having issues with io-pkt-v4 crashing. In your case is it related 
> to tcpip or QNET?  

We've seen crashes from both, but the cause is a big mystery.  Generally, we see a crash in the mbuf allocator, where 
something in the mbuf that it pulled from the pool was previously corrupted.

So while we have seen qnet call something that allocates an mbuf and then crashes, removing qnet has very little effect; it still crashes elsewhere.

We've seen crashes from tcp_input() calling tcp_output(), which calls the mbuf allocator.

We also have our own modules that use the pfil hooks, and there are crashes when those modules call the mbuf allocator.

And, there are also crashes in the driver, where the receive code allocates a replacement mbuf.

The problem is that the crash occurs some time after the corruption actually occurred, i.e., only when the mbuf gets 
recycled, so we don't know when it was originally corrupted.

Anyway, we'll continue to investigate, I just didn't want to spend a lot more time on it if it's a "known" problem 
that's already been fixed.

lew
Re: RE: mbuf corruption  
There was a fix recently that may explain this.

Regards,

-seanb
Re: RE: mbuf corruption  
> There was a fix recently that may explain this.

Thanks, Sean.

Can you shed any more light, i.e., what was fixed and where we could get the fixed code to give it a try?

I'm pretty sure that we can reproduce it within a reasonably short period of time.

lew
Re: RE: mbuf corruption  
On Fri, Jun 03, 2011 at 08:52:33AM -0400, Lewis Donzis wrote:
> > There was a fix recently that may explain this.
> 
> Thanks, Sean.
> 
> Can you shed any more light, i.e., what was fixed

A mutexing issue in the pool allocator.

> and where we could get
> the fixed code to give it a try?

This I'm not sure of these days...

Re: RE: mbuf corruption  
> A mutexing issue in the pool allocator.

That sounds promising and would explain a lot.

> > and where we could get
> > the fixed code to give it a try?
> 
> This I'm not sure of these days...

Well, we've compiled io-pkt from the released source, so if we knew what to change, we could modify the source and try 
it here.

Or maybe I should ask our FAE for an update?

What would be the best course of action?

Thanks,
lew
Re: mbuf corruption  
Talk to your FAE and ask how to get the fix.

You may need a support plan and the FAE will be glad to enroll you in one.



Re: mbuf corruption  
> Talk to your FAE and ask how to get the fix.
> 
> You may need a support plan and the FAE will be glad to enroll you in one.

No problem.  We already have an active support plan, so I just opened a case.

Thanks,

lew
Re: RE: mbuf corruption  
> A mutexing issue in the pool allocator.

Sean,

We're trying to come up with a test case that would let you see this more easily, and that is proving difficult.  One thing that helps a lot is changing the e1000 driver to do receive processing in its own separate thread, and using poke_stack_pkt_q() to get another thread to become the stack (more like the io-net model).  At that point, we can make it fail reasonably often by directing a decent load (about 500,000 packets/sec) at the machine and then running "tcpdump -c1" in a loop, to cause the interface to go in/out of promiscuous mode.  If we don't use poke_stack_pkt_q(), it fails so rarely as to be useless.  I was thinking that this might be similar to how the shim drivers operate, so we'll try replicating it with a released shim driver.

Note that the problem is a lot more serious for us than going into promiscuous mode -- we have occasional random 
failures, but they are very rare, so we're just searching for some way to cause it to happen more frequently.

In any event, after tracing through the driver, the mbuf library, and the pool allocator, it appears that something in 
the pool gets corrupted while it's unallocated.  99% of the time, everything works fine.  But every now and then, 
pcg_get() returns an object which points to 0x0f320700, which is not a valid pointer (which of course leads to a SEGV). 
 Sometimes, the pointer is null, and rarely it's -1, but about 90% of the time, it's the above magic value.  We can see 
that pcg_get() is returning a corrupted object, but we also instrumented pcg_put() and nothing ever intentionally frees 
a "bad" object.

Does any of this sound familiar?

Support sent us an io-pkt-v6 to try, and it made no difference, but I don't know how to verify whether it actually 
contains the fix you mentioned previously.

Thanks,
lew
Re: RE: mbuf corruption  
We found the problem.  It's in the e1000 driver, where it doesn't clear the status of packets when initializing the ring.

We had previously mentioned this to Hugh and had implemented a fix, but our fix wasn't in the right part of the code to 
get executed in this case.
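
For reference, the shape of the change is roughly the sketch below.  The descriptor layout follows the usual e1000 legacy receive descriptor, and the names are only illustrative -- this isn't a diff against the shipped driver source:

#include <stdint.h>

/* Legacy e1000 receive descriptor layout (illustrative). */
struct e1000_rx_desc {
    uint64_t buffer_addr;   /* physical address of the receive buffer */
    uint16_t length;
    uint16_t csum;
    uint8_t  status;        /* DD/EOP etc., written by the hardware */
    uint8_t  errors;
    uint16_t special;
};

static void
rx_ring_init(struct e1000_rx_desc *ring, int ndesc)
{
    int i;

    for (i = 0; i < ndesc; i++) {
        /* Clearing status is the important part: a stale DD bit left
         * over from before makes the receive code believe the descriptor
         * already holds a completed packet, so an mbuf that the hardware
         * is still DMAing into can be handed up too early. */
        ring[i].status = 0;
        ring[i].errors = 0;
        ring[i].length = 0;
    }
}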

lew
Re: RE: mbuf corruption  
Sean,

We thought we had this fixed, but the prior (e1000-related) problem was something entirely different -- the previous problem was corrupting mbufs by DMAing into them at bad times.

In any event, after almost a month of trying, we found a method of reproducing a highly intermittent problem in a matter
 of minutes.  Unfortunately, the lab setup is so complex that it would not be easy to replicate for you.  But the 
problem is likely due to passing an mbuf to another thread and having that thread free it, thus exercising the pool 
allocator because each thread's local mbuf cache is either empty or full all the time.

Long story short, there is a corruption occurring inside the pool allocator.  After several days of looking into it, 
there appear to be cases in the pool allocator where the PR_PROTECT flag is not set and therefore mutex locking is not 
occurring on certain pools.

Empirically, we found that we can eliminate the problem by locking a mutex on all pools, so now we're trying to find out
 exactly which one of the pools was causing the problem.
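
Conceptually, the conditional locking we're talking about has the shape sketched below.  The structure and names are only meant to illustrate the pattern (and our unconditional-lock workaround); this is not a quote of the io-pkt pool code:

#include <pthread.h>
#include <stddef.h>

#define PR_PROTECT  0x01    /* illustrative stand-in for the real flag */

struct pool_sketch {
    int              pr_flags;
    pthread_mutex_t  pr_lock;
    void            *pr_freelist;   /* head of the free-object list */
};

static void *
pool_sketch_get(struct pool_sketch *pp)
{
    void *obj;

    /* The suspect pattern: the lock is only taken when PR_PROTECT is
     * set, so a pool created without it can be raced by two threads.
     * Our empirical workaround is to take the lock unconditionally. */
    if (pp->pr_flags & PR_PROTECT)
        pthread_mutex_lock(&pp->pr_lock);

    obj = pp->pr_freelist;
    if (obj != NULL)
        pp->pr_freelist = *(void **)obj;    /* pop first free object */

    if (pp->pr_flags & PR_PROTECT)
        pthread_mutex_unlock(&pp->pr_lock);

    return obj;
}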

You mentioned that a mutexing problem had been found & fixed in the pool allocator, and we were theoretically given a 
copy of io-pkt that contained the fix, but we have no way to verify that the fix was contained within.

So my question is, can you tell us EXACTLY what was changed?  We'd like to make sure that we're in sync, both with 
whatever you fixed previously, as well as our findings in this particular case.

This is a very serious problem for us; we've even held up a major release due to this lurking problem.  We now have light at the end of the tunnel, but would like to discuss further so we can make sure that our findings are consistent with your much deeper knowledge of this code, and that the fix is made appropriately so that it can be contributed back.


Thanks,
lew
RE: RE: mbuf corruption  

Same here!  That being said, we were given a debug version of io-pkt to help debug the crash, but this debug version isn't crashing.  That's what we had to put in the field; otherwise we couldn't have delivered our product.

Re: RE: mbuf corruption  
Lewis,
can you please tell me who provided you with the latest patch, and what date it was from?

Thanks,
Sabtain
RE: RE: mbuf corruption  
The version we are using that is NOT crashing is a debug version:

NAME=io-pkt-v4
DESCRIPTION=TCP/IP protocol module.
DATE=2011/01/14-01:04:25-ang
STATE=Experimental
HOST=localhost
USER=root
VERSION=skhan.1


Then later on I tried this version, following some comments on the forum about a possible fix, but it crashed:

NAME=io-pkt-v6
DESCRIPTION=TCP/IP protocol module.
DATE=2011/06/03-13:10:11-ang
STATE=Experimental
HOST=vmware-650
USER=root
VERSION=skhan.1

Then a few days later, you gave me this version to try, which also ended up crashing:

NAME=io-pkt-v4
DESCRIPTION=TCP/IP protocol module.
DATE=2011/06/28-09:31:48-ang
STATE=Experimental
HOST=vmware-650
USER=root
VERSION=skhan.1


I'm leaving at 15:30 today.

Re: RE: mbuf corruption  
The original build that didn't correct it is documented under case 00109733.

However, I posted details of our findings in case 00110356 and it was confirmed as a known problem that has already been
 fixed.

So it sounds like it's all under control.

Thanks!
lew