Cold Fusion Memory Leak
Cold
Fusion Memory Leak
The Problem
For the past year or so one
of our cold fusion server instances has been consuming more and more memory. I
thought it was because the application load had increased, but the main problem
seems to be a memory leak in cold fusion.
I increased the max heap
size and tweaked the jvm config until I was blue in the face, but this only had
the effect of delaying the inevitable. To compound the issue, one of the
servers in our cluster died, forcing the entire load through one server. It was
a kind of blessing in disguise, in that it forced the memory leak to show its
ugly little face more rapidly. The server would not make it through a 24 hour
period before it would eat up all its allocated memory and eventually crashed
horribly.
The Analysis of the
Problem
I started by enabling
metrics logging in the servers jrun.xml file. By doing that I could see the
state of the server at hang time and noticed that the memory was steadily
increasing up until it hung. It seemed like garbage collector was on holidays.
Ok .. so now I know that
it's memory usage problem, my initial reaction was to think that there must be
something wrong with my app. I start thinking about all the places where I'm
caching data and looking for areas where I can improve the memory management of
the app. I thought that soft references may be the answer to my memory woes
and read through these interesting articles with the view to using them for
caching.
http://www-128.ibm.com/developerworks/java/library/j-jtp11225/index.html
http://www-128.ibm.com/developerworks/java/library/j-jtp01246.html
The next question in my mind
is, What is using all the memory? I started looking for java memory profiling
tools. Hprof looked promising, but is really only useful for profiling J2SE applications,
not J2EE apps. HAT was also another option, but would only be useful for
analysing static heap dumps. What I needed was a real time memory profiler and
Borland's Optimizit has done the job nicely. There's a very good article here
on integrating Optimizit with Jrun 4.
http://info.borland.com/techpubs/optimizeit/optimizeit6/integ_guide/JRun4.html
A ten day trial version can
be downloaded here.
http://www.borland.com/downloads/download_optimizeit.html
Mike Schierberl's Blog
Post on ColdFusion Memory Leaks
Building on Mike's 3 posts
on the cold fusion memory leak he discovered, I created some more test cases
and can confidently agree with most of his conclusions. There is definitely a
memory leak in CF.
http://www.schierberl.com/cfblog/index.cfm/2006/10/12/ColdFusion_memoryLeak_profiler
http://www.schierberl.com/cfblog/index.cfm/2006/10/16/memoryLeak_variablesScope
http://www.schierberl.com/cfblog/index.cfm/2006/10/25/memoryLeaks_session
He suggest using BEA Jrockit
JVM in place of the built in Sun jvm in JRun. The only problem with that is
that it raised a few questions about whether or not it was the JVM that was
causing the issue or actually CF's problem. As I mentioned previously, I got
Borlands Optimizit profiler working with the built in jvm in Jrun. I ran the
same tests as in Mike's posts and have replicated his results.
The test case I created actually
shows that the problem is more severe that what Mike is suggesting. Under
the particular conditions of the test case, it appears as though any cfc's that
are instantiated in the variables scope of the calling cfm page are actually
kept in memory for the life of the app. Not only that, but the temporary class
created by CF are retained in memory for the life of the app. The problem seems
to be in the garbage collection of the temporary CFM classes.
Test Case
I tried to simplify Mark's
examples a little to remove any un-necessary complexity. The source code for my
test case is listed here:
testCase.cfm
------------------------------------------------------------------------------
<CFapplication NAME="memory_test"
sessionMANAGEMENT="YES"
sessionTIMEOUT=#CreateTimespan(0,0,0,10)#
SetClientCookies="no" clientmanagement="no">
<cfif not structKeyExists(application,"test") OR isDefined("URL.reset")>
<cfset
application.test = createObject("component","test").init() />
<cflocation url="#cgi.SCRIPT_NAME#" addtoken="false">
</cfif>
<cfset foo =
application.test />
<cfset fooB = createObject("component","testBBB") >
<cfset session[createUUID()] = fooB />
<cfoutput>
<a href="#cgi.SCRIPT_NAME#?reset=true">reset
application.test</a>
</cfoutput>
<!--- <cfset structClear(variables) /> --->
------------------------------------------------------------------------------
The cfc's I'm using are as
simple as can be, but you could use any cfc to demonstrate the issue. i.e.
<cfcomponent name="test">
<cffunction name="init">
<cfreturn this >
</cffunction>
</cfcomponent>
The first time you run this
script you get the following output from the profiler.

Note: I'm filtering the
class list using *cftest* on the classname
You can see that there are
two instances of testCase.cfm, 1 instance of test.cfc and 1 instance of
testBBB.cfc. After the session times out (after 10 seconds), the garbage
collector should take testCase.cfm and testBBB.cfc back to 0 instances. This
doesn't happen, It appears that the number of cfm instances retained is
directly proportional to the number of cfc instances retained.
After running it again you
get the following output. You can see that the cfm instance count has increase
by one and the testBBB.cfc count has increase by one. test.cfc stays the same
since it's being referenced from the existing instance in the application
scope.

After the session times out
again, the garbage collector does collect the new instances.

Clicking the link to reset
application.test produces the following.

Session times out, garbage
collection is done again and it returns to the base state.

Now for the tricky part ,
refresh the page 5 times and then click the link to reset the application.test
instance.

After the session times out
and the garbage is collected the leak is becoming more evident. The five
instances of test.cfm and testBBB.cfc we created before we reset application.test
are retained in memory.

Refresh the page again..

Garbage is collected ..

The leaked objects are
maintained on the heap.
The Solution
Now what we have done as per
Marks recommendation is to clear the variables scope on request end. This is
more of a work around rather than a solution, but it has stopped our servers
from hanging. Before this our server was maxing out at 500mb and crashing and
now the same server is humming along at around 250 mb heap usage. If this isn't
a leak I don't know what is.
One minor hiccup that I ran
into with this work around is that I had to get rid of application.cfc and
replace it with application.cfm .. onRequestEnd.cfm won't work in conjunction
with application.cfc. I did try using structclear("variables") in the
onRequestEnd method of application.cfc, but it was only clearing the local
scope of the cfc and not the variables scope of the request.
Conclusions
It has been a very time
consuming exercise in diagnosing this issue and I would much rather have spent
the time working on my actual development projects. However, the upside is that
I,ve learned a hell of a lot about Java
- Jason's blog
- 23747 reads


Comments
Kudos for performing an exhaustive and well documented test. I am sure the CF Team has seen this by now. It will be interesting to see how they respond.
Well done.
Thanks for the hard work on investigating and documenting this issue. We've been looking into a very similar JVM/memory issue for the past 6 months that's only arisen since moving to using CFCs for key functionality. We've implemented one of your recommended workarounds so will monitor and feed back the result.
Thanks, we're investigating.
btw Jason, are you deploying to your servers using CF's J2EE functionality? we are...
There is a hotfix now available for this issue. Please see http://www.adobe.com/cfusion/knowledgebase/index.cfm?id=kb401239
This should fix the problem.
Wow. That was fast!
I came to say the same thing Dan... that /was/ fast.
Cheers Hemant, I'll remove my work around and let you know if it fixes the problem. Much appreciated. :)
Jason, I'm curious - did any of the profiling tools you used also give you the size in memory of the classes (I realise 1 class != instances) ?
Hi Kola, optomizit only showed total instances, not memory usage per instance. There where some features in there that i didn't really try, so it may still have that functionality available. Also, just a note, I rang Borland to get pricing on Optimizit and was told that it has been depreciated in place of some other enterprize suite they're developing.. can't remember what it was called though, but it may have what you're looking for.
FWIW, this hotfix has corrected some chronic memory problems that surfaced once we started widely employing CFCs. My opinion of the scaleability of CF OOP is now on the rise :0)
I think we owe Mike and Jason a huge vote of thanks for this - I don't even want to think about the number of hours they must have burned. Adobe, how about coughing up a couple of trips to MAX or something?
BTW, did anyone notice the date on that KB article? This fix was released last month along with 7.0.2 cumulative hot fix 2.
Jaime Metcher
Cheers Jaime, I don't know if the trip to MAX would make up for my frustrations at this point. Maybe a trip to the moon or free lifetime upgrades to CF server enterprize. Sadly, the saga continues.... After running the hotfix for close to a week, it seems the memory is still not being cleaned up properly. I'm going to re-introduce the onrquestend.cfm structClear(variables) work around and see how it goes. Alternatively, I may become a fisherman .. how hard can it be ???
Jason, that's a real bummer to hear you say the problem remains. Have you mentioned this to Damon? He posted a link to this thread on his blog when he announced the fix (http://www.dcooper.org/blog/client/index.cfm?mode=entry&entry=1FAC4D6A-4...). You may want to leave a comment there if your problem really does remain. I'm sure many of us would like to hear.
BTW, are you saying you applied the hotfix? Also, it appears from the "Additional info" section of the technote (http://www.adobe.com/cfusion/knowledgebase/index.cfm?id=kb400996) that you need to apply this fix as a manual extra step. Just confirming that you caught and did that.
Hi Charlie, previously I added the hotfix manually to our servers as suggested and the memory usage problem re-appeared. I have re-introduced the work around to see if it fixes the problem, but don't have any solid results at the moment. i'm going to monitor it for a couple more days before I make any conclusions.
I am having a similar issue with CF 6x. Is this a problem in 6x also? Will moving to 7x resolve it?
Chris, not sure about 6 I haven't tested it.
I am 200% sure there is memory leakage.
The solution I found is to store all complex objects in server scope rather in Application scope or sesssion scope.
my suggestion is to store all complex object in server scope and you will see there is no memory leakage problem.
I am happy to help anybody who is struggling like these issues.
Thanks
Sana
I had also installed the hotfix, without any luck. Even if it is related, it is poorly described as being related to file upload issues. I'm trying clearing the variables onRequestEnd.cfm I've also tried every possible jvm config known to man, and always the heap grows despite gc.
@Ryan Brueske
are you using any application scope to store CFC or storing query results in Application scope.
If then use server scope rather Application, you don't need to call GC yourself, calling GC does not make any effects. If still problem persist then you can catch me on sanaullah73@gmail.com
Thanks
Sana
Ok, let me dissent for a second.
I have also spent a lot of time trying to resolve a memory issue with our CF server. It was about 6 months ago, on a CFMX6.1 server (W2K server, now W23K server). I turned on some extensive logging, and found that with my jvm tweaked settings, that the garbage collection was running about every 3-5 seconds or so. The data in the young generation would be promoted to the permanent generation when a request spanned that 5 second garbage collection gap. Meaning that if you had a long running page, the variables you instantiate would end up in the permanent generation.
Now, that's actually Ok, because a full garbage collection would run when the memory usage finally made it up to the max memory setting in your config file, and at that time the jvm would release that memory from the permanent generation for re-use. We noted here that the task manager still showed our max memory setting (~1.3Gb), even though the garbage collection logging showed that memory usage was released down to under 100Mb. I now no longer am concerned by seeing the 1.3Gb in the task manager like I used to be. In fact, we run the whole day just fine at our max, because in actual fact that only represents what's been allocated to the jrun.exe, and not what is actually in use by the jrun.exe. We do restart every night.
Ok, having said all that, our server, which had been running fine with my tweaked config for over 12 months, just started dying on us again. 4 hangs now in the last 3 weeks, and my client is getting worried, as am I. So, I start looking for the problem again, and find this blog post.
I just thought I'd mention that my previous issue looked like a CF memory leak, and smelled like a CF memory leak, but in actual fact was not technically a CF memory leak. My testing showed that the memory would be eaten up (i.e. transferred to the permanent generation) when the page request spanned the garbage collection interval - which of course is extremely easy to cause by hitting F5 5 times fast. Even a simple page would cause the young generation variables to make it into the permanent generation when run multiple times, which is quite a common occurrence with web programming in general (think multiple users - and especially if you use some CFLocks to protect inserts and the like). In this situation though, the full garbage collection should release that memory again when it runs, as long as those variables are not used again. Therefore, perhaps the issues you guys have been having are not in fact a memory leak inside CF, but incorrectly tuned jvm settings? Perhaps you've been too concerned about the instances not being immediately released, when instead you should be concerned about why they do not get released from a full garbage collection? ... I guess you could argue that what I have described IS what a memory leak is... I'm just saying that I don't think it's caused by a programmatical bug in CF more than it simply is just an incorrectly tuned jvm.
Also note that the posted hotfix does not actually apply to the problem as you've described it. It stated it was for a CFFile Action="Upload" problem that they had discovered, and not a generic variable scope instantiation not being release from active memory.
Back to my current problem, I am almost certain it is because we are now storing some configuration data in the application scope, and I didn't realize how much data it was going to be until I just saw how big it was today. A struct with 4,000 odd stucture of arrays... Although I don't think I can tell exactly how many bytes it is using, I'm fairly confident that it is the reason for my hangs now - that struct has (clearly) made it into the permanent generation, and is in constant use (it's there all the day long), so it will never be cleared even by the full garbage collection process. I either have to re-tweak my config, or more likely rethink that idea of using the application scope for this large struct.
I may give that StructClear(Variables) a try, but I actually don't think that will help my problem (mine's in the application scope, and I can't very well clear that every page request, can I?)! :)
I hope I didn't muddy things too much for anyone out there. I have been very happy with our (Macromedia-assisted) jvm settings, and in case you are interested, here they are:
java.args=-server -DJINTEGRA_NATIVE_MODE -DJINTEGRA_PREFETCH_ENUMS -Xms1284m -Xmx1284m -XX:NewSize=64m -XX:MaxNewSize=64m -XX:SurvivorRatio=3 -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseParNewGC -XX:+DisableExplicitGC -Dsun.rmi.dgc.client.gcInterval=600000 -Dsun.rmi.dgc.server.gcInterval=600000 -Dsun.io.useCanonCaches=false -Xbootclasspath/a:"{application.home}/../lib/webchartsJava2D.jar" -Djavax.xml.parsers.SAXParserFactory=com.macromedia.crimson.jaxp.SAXParserFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=com.macromedia.crimson.jaxp.DocumentBuilderFactoryImpl
And yes, I did mean Macromedia and not Adobe. Sarge helped us.
Feel free to disagree with me. I am just posting the fruits of my similar days-long trudge through the jvm memory settings swamp. I had thought I had left it behind me... :)
Good luck!
Steven Van Gemert
Check that, that first line - 6 months should read 12 months. We'd been going just fine for a full 12 months now.
Steven Van Gemert
Well, it's been a while since my head was deep into this. I could easily have mistaken something.
>Are you sure the full GC is only executed when the heap reaches the max heap size? I don't think this is the case.
I think you are right. I found that there was a gc happening often, and that it would clear the used memory back down to what it was before the page execution. However, when you run multiple pages, and cause objects to make it to the tenured generation (man, I may still be wrong with where it goes - I mean when it leaves the young generation because it is still in use), the overall memory usage rises. When it reaches the limit, at that point there is a more complete garbage collection performed because there is less than x% of the room left in the tenured generation (again now not sure of which generation). I think that x% was configurable, but I thought that at that point a more complete gc takes place, which would then release a huge chunk of memory back for availability to the jvm. This more complete one was only triggered because that threshhold was reached. Thus I called it a "full gc".
At least, that's what I saw. My memory creeped up over 5 or 6 bouts of holding down F5 for 30 seconds at a time, and when it reached the top, having had many (read 100) gc's occur but not clear out the memory, when it reached the top a final gc actually released a huge chunk (over a gig) of memory in one collection, that took quite a long time (5 seconds or so) compared to the others that took a few to a teen milliseconds each.
Again, that's what I remember, and now that I've thrown those log files away (oh, and it's a different server now too), I don't think I can verify it. Take it for what it's worth I guess...
>...it's requirement in CF7 to have -XX:+UseParallelGC enabled.
I recall finding some documentation that the new parallel gc was better than that. In fact, at one time I knew the exact reason why, and determined that I'd better use the new one or I'd crash and burn, hence the "XX:+UseParNewGC" in my jvm config.
>I'm also wondering why you have the -XX:MaxPermSize set so low. If as you say, "if you had a long running page, the variables you instantiate would end up in the permanent generation.", I think you're actually refering to the tenured generation.
Yeah I was referring to the tenured generation. My current problem might be that the tenured is making it to the permanent, and I don't have enough room for that (thanks for correcting me by the way).
My permanent so low? I thought it was quite large actually... I recall testing larger perm generation sizes and noting no performance difference in my testing. Of course, my testing might not simulate real-life usage, but it was what I went with. I think 128MB is quite a large permanent generation... you disagree?
Steven Van Gemert
Hi,
I think we have nailed down a similar problem and fixed it as part of our
efforts on performance and scalability.
We will see if we can get this available on CF7.0.2.
Thanks,
Hemant
I am very interested in this issue and have seen the same results.
There is also an issue with large numbers of DSN's on the same box.
Would setting up multible instances allow you to use 2gb per instance?
Hi Bill, if you had 2Gb of physical ram. You'd need to allocate a portion to the OS and other system tasks and the rest can would divided between you're separate J2EE servers. You'd need to allocate memory to each instance based on it's usage.
Just an update:
I mentioned earlier that the hotfix cured our memory problems. Well, it cured the fast leak. A slow leak remained in both the tenured and permanent generations. Clearing the variables scope in onRequestEnd has cured the slow leak in the tenured generation. I haven't yet been able to reproduce the perm gen leak on my test environment.
Jaime Metcher
Having what looks like the same memory leak problem...
causes me to restart ColdFusion 2-3 times per day when site is under load.
This site uses lots of CFCs. Other sites that do not use CFC's have no problem.
Have installed 702 and the bundled hotfixes available as of today and not seen any resolution.
Have setup the workaround to clear out variables structure as well (which may have slowed the leak some).
Mostly just documenting this to be updated on any further developments, but if anyone has
any other ideas, please feel free to reply.
Anyone tested this on CF8 RC?
is everyone running cf on jrun here? one potentially related thing I've noticed is that if one instance is having a memory meltdown, other instances go at the same time. theoretically this shouldn't happen, so i'm wondering if this is more of a jrun problem than a cf problem.
Yep, on JRun but only the single instance setup. Total memory is now stable up to 500,000 hits (haven't tested beyond that). Not sure what's happening with the perm gen because the GC logging keeps failing.
Post new comment