Why does the engine do a lot of frame-by-frame calls to malloc?

:information_source: Attention Topic was automatically imported from the old Question2Answer platform.
:bust_in_silhouette: Asked By beepdavid
:warning: Old Version Published before Godot 3 was released.

hello, my studio is currently evaluating Godot for our next project. I’ve been digging around the source code, which is refreshingly easy to understand and potentially extend as we need.

But one area did strike me as needing further investigation. Godot seems to do a lot of frame-by-frame calls to malloc, especially for things like adding commands to render lists. This strikes me as a red flag for performance, heap fragmentation, and cache locality, and is a little at odds with our intended use cases for the engine.

Are there any plans to address these kinds of performance concerns moving forward, and is there a way we could contribute to this?

@Akien could answer for this…
(how to link a user who is not on this thread? hm…)

volzhs | 2016-06-27 15:42

:bust_in_silhouette: Reply From: Juan Linietsky

In general there are not that many calls to malloc(), and rendering itself does little to no allocation.

Regarding fragmentation, Godot does not mix large and small allocations, so it only does malloc() with small allocations (large allocations go to a customizable pool that can be compacted as needed). If allocations are kept relatively small, fragmentation is never a problem in practice.

Godot is generally fast, but it’s not as optimal as it could be. We only optimize for real-world use cases, so if you have a situation where you need more performance, we would gladly work together with you on improving that specific use case.

thank you for the quick reply! Forgive me if I’m misunderstanding something here, but stepping through from Sprite’s NOTIFICATION_DRAW leads to:

VisualServerRaster::canvas_item_add_texture_rect_region,

which, to add the canvas item command, does:

CanvasItem::CommandRect * rect = memnew( CanvasItem::CommandRect );

memnew (on Windows) eventually does a HeapAlloc through the WinAPI. Is this the common pattern for all draw commands, even individual particles?

beepdavid | 2016-06-27 16:27

call stack from particle draw

beepdavid | 2016-06-27 17:02

For Sprite this is not too bad, because NOTIFICATION_DRAW only happens when the sprite changes some property (not transform), like frame or h/v flipping.

The CanvasItem::Command structures are cached in the server and only re-created if something changes. Most objects in a game don’t really change often.

That said, if required, modifying Godot so they are pre-allocated should be pretty simple, but as this has never really been a bottleneck and no one has complained in real use cases, it hasn’t been optimized.

The case of Particles2D is indeed a bottleneck; our plan after 2.1 is out is to change this API to add instancing support for them, or even render them using the GPU.

Juan Linietsky | 2016-06-27 17:15

ah! This is where my confusion was coming from: I thought the command list was being re-created from scratch every frame (more like a classic render command list). This could still be a concern for highly dynamic scenes, but, as you say, worry about it when it’s actually an issue.

beepdavid | 2016-06-27 17:25

just as a quick follow-up to this, I put a static counter into MemoryPoolStaticMalloc::alloc and tested the Platformer2d scene for a few frames after some warm-up time. It seems there are around 500-700 allocations per frame, many from string handling and script boundaries

beepdavid | 2016-06-27 18:28

To tell the truth, 500-700 heap allocations per frame isn’t very much; allocations are cheap.
That said, if you know a good profiler for tracing heap allocations, that would be appreciated, as it would probably make them very easy to reduce.

Juan Linietsky | 2016-06-27 19:53

I just want to preface this comment by saying I appreciate that Godot has had different development priorities, and is a great solution for the types of projects OKAM has had use cases for. A priority for us is how 2D performance can scale to highly dynamic platform games and simulation projects in the vein of Prison Architect etc.

I agree that this many allocations per frame is not a huge concern, but this is also a very simple scene with just a few scripts running. If allocations scale with scene complexity, this number could become problematic. You are at the mercy of the platform runtime as to what it feels like doing on each call; as you say, malloc is (relatively) cheap, but it can also cause hard-to-track frame spikes later in the project cycle.

A greater concern to me is cache locality when various systems are walking the heap to do their work. In our experience this is the number one performance killer, and it is very hard to solve with profiling alone, as there can just be an overall feeling of slowness that does not really spike anywhere.

To move forward from this (as you also say), we would want to identify a test scene where this is actually an issue. From there my plan would probably be to identify allocations that really happen frame by frame and move these to some sort of stack allocator that can be refreshed each frame. I would also want to identify areas, such as command lists, and make sure these are being walked in linear memory; this may take some more heavy-handed refactoring (no idea yet).

In terms of identifying which systems make heavy use of allocations, I would probably start by adding some functionality to your current alloc wrapper to record watermarks for various systems. But to be honest, I usually start by just putting a breakpoint in malloc and looking at what is calling it each frame, then start to look at a system as a whole.

Q&A does not seem like a good fit for discussions such as this. If we feel we can contribute, is it better to open a forum thread or a GitHub issue for this kind of thing?

beepdavid | 2016-06-28 11:23

I think IRC is probably a better place for discussion (#godotengine-devel on Freenode)

That said, I still think the best approach is to find a use case where performance is a problem and then work from it. If you can make a simple test case that shows a performance problem, we can work from that. Doing optimization without any basis will just make the code base needlessly more complex.

Juan Linietsky | 2016-06-28 11:56

:bust_in_silhouette: Reply From: punto

A lot of these come from the fact that the different allocators are ultimately implemented with malloc on PC, but the layer is in place to implement better allocation when that becomes a problem. It’s definitely something we consider important to plan ahead for, but since the default implementation just works, it has stuck…