During last week I discovered that our new webservice was slowly leaking memory. Just a little bit after one of my test clients finished but after hundreds of test-runs from my client it became clear that this would be a problem in a production environment.
I decided to fix this right now because in the current state the codebase of the new service hasn't become too large. Since there are more features waiting to be added it will grow.
The first test was replacing the way we create our backends to see if the problem lies within the service code or more in python-opsi. With a faked backend that just returns fixed values it becomes clear that the problem is in our Python library. But where?
For diagnosis I first used the gc module. Combined with PeriodicCallback from Tornado I had a function that presents me with information about the running service in no time. The small function showed me the amount of types currently living and some information about the generations of the garbage collector. At least this did show that the BackendManager instances were still alive even though their sessions have already been deleted.
Side note: using
gc.set_debug(gc.DEBUG_LEAK) sounded like a good idea first but it will keep references to collected objects which was counterproductive in this case.
I hacked together some filtering to reduce the amount of information gathered by this. Unfortunately this did not clear things up enough to spot the problem right away. However I sensed a direction where to look now.
Time to bring out objgraph. With this library you can draw a graph of your references. This helped quite a lot! A few graphs later I found out what the problem was.
The method I decorated with
@lru_cache(maxsize=None) was __loadBackendConfig.
Can you spot the problem?
Okay, here it is: lru_cache caches a function call and uses the parameters to look up if it can return you an already computed result.
The call that has been made was something like
BackendManager.__loadBackendConfig(<some BackendManager instance>, 'file') and therefore the object instance was part of our cache. D'oh!
The cache has been introduced with the thought to speed up the creation of new instances of a BackendManagers since the configuration rarely changes. Could I refactor the parameters of the function? Probably but I decided the time for refactoring is currently better spent elsewhere. Should I throw the decorator out? It leads to never freeing memory, of course we throw this out!
Will we be left with slow BackendManager creation? Not in the context of the webservice. The good part is that reading the configuration files is already done in a way that caches their content. This only will benefit Python programs that create more than one backend though. But another good thing in this regard that I worked on making the creation of the BackendManagers in the webservice async which should improve how the service responds when handling multiple new sessions.
Lessons learned today:
- Do not apply lru_cache to methods referencing a class instance if you want the instance be collected somewhere later.
- Using PeriodicCallback makes running repeatedly running functions in an async setting super easy.