Forum Topic - mbuf corruption (21 items)
mbuf corruption  
We're trying to track down a problem where an mbuf gets corrupted somewhere along the line, which eventually causes a 
fault and io-pkt crashes.

It's not obvious what's going on, and we've already spent quite a bit of time investigating.

However, before we continue digging into it, I just wanted to ask if there have been any changes since the 6.5.0 release
 that might cause one or more fields of an mbuf to be invalid, particularly m_ext.ext_page.

If there are no known issues, we'll continue with what we're doing, but I didn't want to spend a whole lot of time on it
 if this is a known issue.

Thanks,
lew
Re: mbuf corruption  
Which h/w driver?
Re: mbuf corruption  
In this case, e1000.
Re: mbuf corruption  
The e1000 driver doesn't change the ext_page variable. All the driver
does is use this offset to get the physical address of the buffer.


-- 
Hugh Brown                      (613) 591-0931 ext. 2209 (voice)
QNX Software Systems Ltd.        (613) 591-3579           (fax)
175 Terence Matthews Cres.       email:  hsbrown@qnx.com
Kanata, Ontario, Canada.
K2M 1W8
 


Re: mbuf corruption  
> The e1000 driver doesn't change the ext_page variable. All the driver
> does is use this offset to get the physical address of the buffer.

I agree.  I didn't say it was a problem in the hardware driver, I was just answering the question :)

That's why I asked whether there have been any other changes in the rest of the stack that might cause mbuf corruption.
Re: mbuf corruption  
Not that I'm aware of.



RE: mbuf corruption  


We are also having issues with io-pkt-v4 crashing. In your case, is it related to tcpip or QNET?  In our case, the problem happened when we handled one extra tcpip stream in our application.  We finally got access to the network source last week, so I'll investigate this further in the following weeks.  We have a case open, but since I'm not able to create a simple test case for QNX to reproduce the issue, they aren't really able to assist.

Re: RE: mbuf corruption  
> We are also having issues with io-pkt-v4 crashing. In your case is it related 
> to tcpip or QNET?  

We've seen crashes from both, but the cause is a big mystery.  Generally, we see a crash in the mbuf allocator, where 
something in the mbuf that it pulled from the pool was previously corrupted.

So while we have seen qnet call something that allocates an mbuf and then crashes, removing qnet has very little effect; it still crashes elsewhere.

We've seen crashes from tcp_input() calling tcp_output(), which calls the mbuf allocator.

We also have our own modules that use the pfil hooks, and there are crashes when those modules call the mbuf allocator.

And, there are also crashes in the driver, where the receive code allocates a replacement mbuf.

The problem is that the crash occurs some time after the corruption actually occurred, i.e., only when the mbuf gets 
recycled, so we don't know when it was originally corrupted.

Anyway, we'll continue to investigate, I just didn't want to spend a lot more time on it if it's a "known" problem 
that's already been fixed.

lew
Re: RE: mbuf corruption  
There was a fix recently that may explain this.

Regards,

-seanb
Re: RE: mbuf corruption  
> There was a fix recently that may explain this.

Thanks, Sean.

Can you shed any more light, i.e., what was fixed and where we could get the fixed code to give it a try?

I'm pretty sure that we can reproduce it within a reasonably short period of time.

lew
Re: RE: mbuf corruption  
On Fri, Jun 03, 2011 at 08:52:33AM -0400, Lewis Donzis wrote:
> > There was a fix recently that may explain this.
> 
> Thanks, Sean.
> 
> Can you shed any more light, i.e., what was fixed

A mutexing issue in the pool allocator.

> and where we could get
> the fixed code to give it a try?

This I'm not sure of these days...

Re: RE: mbuf corruption  
> A mutexing issue in the pool allocator.

That sounds promising and would explain a lot.

> > and where we could get
> > the fixed code to give it a try?
> 
> This I'm not sure of these days...

Well, we've compiled io-pkt from the released source, so if we knew what to change, we could modify the source and try 
it here.

Or maybe I should ask our FAE for an update?

What would be the best course of action?

Thanks,
lew
Re: mbuf corruption  
Talk to your FAE and ask how to get the fix.

You may need a support plan and the FAE will be glad to enroll you in one.



Re: mbuf corruption  
> Talk to your FAE and ask how to get the fix.
> 
> You may need a support plan and the FAE will be glad to enroll you in one.

No problem.  We already have an active support plan, so I just opened a case.

Thanks,

lew
Re: RE: mbuf corruption  
> A mutexing issue in the pool allocator.

Sean,

We're trying to come up with a test case that would let you see this more easily, and that is proving difficult.  One thing that helps a lot is changing the e1000 driver to do receive processing in its own separate thread, and using poke_stack_pkt_q() to get another thread to become the stack (more like the io-net model).  At that point, we can make it fail reasonably often by directing a decent load (about 500,000 packets/sec) at the machine and then running "tcpdump -c1" in a loop, to cause the interface to go in/out of promiscuous mode.  If we don't use poke_stack_pkt_q(), it fails so rarely as to be useless.  I was thinking that this might be similar to how the shim drivers operate, so we'll try replicating it with a released shim driver.

Note that the problem is a lot more serious for us than going into promiscuous mode -- we have occasional random 
failures, but they are very rare, so we're just searching for some way to cause it to happen more frequently.

In any event, after tracing through the driver, the mbuf library, and the pool allocator, it appears that something in 
the pool gets corrupted while it's unallocated.  99% of the time, everything works fine.  But every now and then, 
pcg_get() returns an object which points to 0x0f320700, which is not a valid pointer (which of course leads to a SEGV). 
 Sometimes, the pointer is null, and rarely it's -1, but about 90% of the time, it's the above magic value.  We can see 
that pcg_get() is returning a corrupted object, but we also instrumented pcg_put() and nothing ever intentionally frees 
a "bad" object.

Does any of this sound familiar?

Support sent us an io-pkt-v6 to try, and it made no difference, but I don't know how to verify whether it actually 
contains the fix you mentioned previously.

Thanks,
lew
Re: RE: mbuf corruption  
We found the problem.  It's in the e1000 driver, where it doesn't clear the status of packets when initializing the ring.

We had previously mentioned this to Hugh and had implemented a fix, but our fix wasn't in the right part of the code to 
get executed in this case.
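
For reference, the shape of the change is roughly the sketch below.  The descriptor layout follows the usual e1000 legacy receive descriptor, and the names are only illustrative -- this isn't a diff against the shipped driver source:

#include <stdint.h>

/* Legacy e1000 receive descriptor layout (illustrative). */
struct e1000_rx_desc {
    uint64_t buffer_addr;   /* physical address of the receive buffer */
    uint16_t length;
    uint16_t csum;
    uint8_t  status;        /* DD/EOP etc., written by the hardware */
    uint8_t  errors;
    uint16_t special;
};

static void
rx_ring_init(struct e1000_rx_desc *ring, int ndesc)
{
    int i;

    for (i = 0; i < ndesc; i++) {
        /* Clearing status is the important part: a stale DD bit left
         * over from before makes the receive code believe the descriptor
         * already holds a completed packet, so an mbuf that the hardware
         * is still DMAing into can be handed up too early. */
        ring[i].status = 0;
        ring[i].errors = 0;
        ring[i].length = 0;
    }
}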

lew
Re: RE: mbuf corruption  
Sean,

We thought we had this fixed, but the prior (e1000-related) problem was something entirely different -- the previous problem was corrupting mbufs by DMAing into them at bad times.

In any event, after almost a month of trying, we found a method of reproducing a highly intermittent problem in a matter
 of minutes.  Unfortunately, the lab setup is so complex that it would not be easy to replicate for you.  But the 
problem is likely due to passing an mbuf to another thread and having that thread free it, thus exercising the pool 
allocator because each thread's local mbuf cache is either empty or full all the time.

Long story short, there is a corruption occurring inside the pool allocator.  After several days of looking into it, 
there appear to be cases in the pool allocator where the PR_PROTECT flag is not set and therefore mutex locking is not 
occurring on certain pools.

Empirically, we found that we can eliminate the problem by locking a mutex on all pools, so now we're trying to find out
 exactly which one of the pools was causing the problem.
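
Conceptually, the conditional locking we're talking about has the shape sketched below.  The structure and names are only meant to illustrate the pattern (and our unconditional-lock workaround); this is not a quote of the io-pkt pool code:

#include <pthread.h>
#include <stddef.h>

#define PR_PROTECT  0x01    /* illustrative stand-in for the real flag */

struct pool_sketch {
    int              pr_flags;
    pthread_mutex_t  pr_lock;
    void            *pr_freelist;   /* head of the free-object list */
};

static void *
pool_sketch_get(struct pool_sketch *pp)
{
    void *obj;

    /* The suspect pattern: the lock is only taken when PR_PROTECT is
     * set, so a pool created without it can be raced by two threads.
     * Our empirical workaround is to take the lock unconditionally. */
    if (pp->pr_flags & PR_PROTECT)
        pthread_mutex_lock(&pp->pr_lock);

    obj = pp->pr_freelist;
    if (obj != NULL)
        pp->pr_freelist = *(void **)obj;    /* pop first free object */

    if (pp->pr_flags & PR_PROTECT)
        pthread_mutex_unlock(&pp->pr_lock);

    return obj;
}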

You mentioned that a mutexing problem had been found & fixed in the pool allocator, and we were theoretically given a 
copy of io-pkt that contained the fix, but we have no way to verify that the fix was contained within.

So my question is, can you tell us EXACTLY what was changed?  We'd like to make sure that we're in sync, both with 
whatever you fixed previously, as well as our findings in this particular case.

This is a very serious problem for us; we've even held up a major release due to this lurking problem.  We now have light at the end of the tunnel, but would like to discuss further so we can make sure that our findings are consistent with your much deeper knowledge of this code, and that the fix is made appropriately so that it can be contributed back.


Thanks,
lew
RE: RE: mbuf corruption  

Same here!  That being said, we were given a debug version of io-pkt to help debug the crash, but this debug version isn't crashing.  That's what we had to put in the field; otherwise we couldn't have delivered our product.

Re: RE: mbuf corruption  
Lewis,
can you please tell me who provided you with the latest patch, and what date it was from?

Thanks,
Sabtain
RE: RE: mbuf corruption  
The version we are using that is NOT crashing is a debug version:

NAME=io-pkt-v4
DESCRIPTION=TCP/IP protocol module.
DATE=2011/01/14-01:04:25-ang
STATE=Experimental
HOST=localhost
USER=root
VERSION=skhan.1


Then later on I tried this version, following some comments on the forum about a possible fix, but it crashed:

NAME=io-pkt-v6
DESCRIPTION=TCP/IP protocol module.
DATE=2011/06/03-13:10:11-ang
STATE=Experimental
HOST=vmware-650
USER=root
VERSION=skhan.1

Then a few days later, you gave me this version to try, which also ended up crashing:

NAME=io-pkt-v4
DESCRIPTION=TCP/IP protocol module.
DATE=2011/06/28-09:31:48-ang
STATE=Experimental
HOST=vmware-650
USER=root
VERSION=skhan.1


I'm leaving at 15:30 today.

Re: RE: mbuf corruption  
The original build that didn't correct it is documented under case 00109733.

However, I posted details of our findings in case 00110356 and it was confirmed as a known problem that has already been
 fixed.

So it sounds like it's all under control.

Thanks!
lew