A Mysterious Case of Skinned Mesh Disappearances
Babylon Native has had its share of hard-to-track-down issues, but this recent issue has taken the team on quite the trip. This is a story about investigating and debugging a rendering issue for Babylon Native as reported by a member of the open source community.
The Issue
This issue shows multiple problems, but we are only going to focus on the mesh disappearances, i.e. where the face and body of the model are not rendered for some reason.
The Investigation
As always, the first step is to reproduce the issue. Although the issue is reported as happening only on certain Android devices, it turns out many different devices exhibit it. All of the devices we have on hand (e.g. OnePlus 8, Samsung 10+, and Asus ROG Phone from our devices lab) reproduce the behavior. The only “device” that doesn’t repro (that we have readily available) is the Android Emulator. It would have been better to also have a physical device that works, but the emulator will have to do. We also narrowed down the problem to just loading the GLB model by itself with no other post-processing (e.g. highlighting as noted in the issue).
Now that we have a repro of both working and not working scenarios, it will be quick to figure out, right? Unfortunately not. To make matters worse, this issue does not throw any exceptions or output any errors in the log. Everything seems to be fine except that the mesh doesn’t actually render on screen.
As a quick stab, we try removing the skin and notice that the model renders. Why does skinning cause the mesh to disappear? Other skinned models work just fine.
We also quickly tried loading the model in the Babylon.js sandbox with Chrome to see if the device was capable of rendering the model at all. After all, this model has 157 bones and it may be hitting some limits. We even turned off bone textures in the sandbox, since bone textures are not yet supported by Babylon Native and turning them off makes the sandbox pass the bones as uniforms, just as Babylon Native does. But no, all is well when running with the sandbox in Chrome. What gives?
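To see why the bone count looked suspicious, a rough estimate helps. The sketch below assumes each bone is passed to the vertex shader as one mat4 uniform and that a mat4 takes four vec4 uniform vectors. The function names and the example limit are ours for illustration; they are not values measured on the test devices.

```typescript
// Rough estimate of the vertex uniform vectors consumed by bone matrices.
// Assumption: one mat4 per bone, and each mat4 occupies four vec4 slots.
const VECTORS_PER_MAT4 = 4;

function boneUniformVectors(boneCount: number): number {
    return boneCount * VECTORS_PER_MAT4;
}

// Hypothetical helper: do the bones, plus any other uniforms, fit in a limit?
// In WebGL the real limit would come from
// gl.getParameter(gl.MAX_VERTEX_UNIFORM_VECTORS) on the device in question.
function bonesFit(
    boneCount: number,
    maxVertexUniformVectors: number,
    otherVectors: number = 0
): boolean {
    return boneUniformVectors(boneCount) + otherVectors <= maxVertexUniformVectors;
}

console.log(boneUniformVectors(157)); // 628
console.log(bonesFit(157, 256)); // false for this illustrative limit of 256
```

Under these assumptions, 157 bones alone need 628 uniform vectors before counting anything else in the shader, which is why a limit was a plausible culprit.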
It is time to use RenderDoc to capture a frame to see what is happening. This will certainly tell us what is going on, right? Alas, this is where the situation gets weird. RenderDoc actually shows that the model data is there. But, for some inexplicable reason, the rendering fails to make it to the resulting back buffer. Each draw call should be accumulating the rendered result. What is going on? There is no extra clear call anywhere between these draw calls. Is the depth buffer messed up somehow? How is this possible???
At this point, since we know that skinning causes the issue, we decide that the best course of action is to manipulate the glTF model (in a binary search fashion) until the model starts working. Hopefully this will shed some light. After simplifying the model to just one mesh, removing textures, and a bunch of iterations, we discover that if the number of bones is reduced from 157 to 114 or fewer, the model will render. The model, of course, looks like garbage because many bones have been removed, but still, it renders!
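The bone-count part of that search can be sketched as a plain bisection. This is only an illustration: in practice each "test" meant editing the glTF by hand and loading it on a device, so the `renders` callback below is a hypothetical stand-in, and the sketch assumes that if a given bone count renders, every smaller count renders too.

```typescript
// Find the largest bone count in [low, high] for which `renders` returns true.
// Assumes monotonic behavior: once a count fails, all larger counts fail.
function largestWorkingBoneCount(
    low: number,
    high: number,
    renders: (boneCount: number) => boolean
): number | null {
    if (!renders(low)) {
        return null; // nothing in the range works
    }
    let good = low; // known to render
    let bad = high + 1; // known (or assumed) not to render
    while (bad - good > 1) {
        const mid = Math.floor((good + bad) / 2);
        if (renders(mid)) {
            good = mid;
        } else {
            bad = mid;
        }
    }
    return good;
}

// Stand-in predicate mirroring the result we observed on the devices.
const observed = (boneCount: number): boolean => boneCount <= 114;
console.log(largestWorkingBoneCount(1, 157, observed)); // 114
```

Bisection keeps the number of manual edit-and-load cycles to roughly log2 of the range, about eight for 157 bones.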
💡! After discussing with Justin Murray, we realize that Chrome is probably using Vulkan and Vulkan has a different driver than OpenGL which is still what we are using in Babylon Native (Vulkan support is coming). Thus, we try disabling Vulkan on Chrome and voilà! Chrome instantly dies on the model with an error.
This more or less confirms that there is a problem with the number of uniforms as we originally suspected, which makes a lot of sense given that the model renders once the number of bones is reduced. We drilled into the source code for the error reported by Chrome, which apparently comes from ANGLE. ANGLE points to “GLSL ES 1.00.17 spec, Appendix A, section 7” and has explicit code to check this. The driver, bgfx, and RenderDoc apparently do not check this. The driver appears to silently fail!
The Solution
Phew! We learned a lot from investigating and debugging this issue. There are two solutions that should resolve this issue.
- Support bone textures in Babylon Native. This will reduce the number of uniforms such that we won’t hit the limit.
- Support Vulkan on Android.
For the first solution, here is the pull request to support bone textures which is now merged. Implementing this in Babylon Native exposed a bunch of issues in the code, especially for updating textures, but we will have to save explaining this for another time 😉. The Vulkan solution probably will also work, but we will have to see once that work is completed.
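To illustrate why bone textures sidestep the uniform limit, here is a minimal sketch of the general idea: the bone matrices are packed into a floating-point RGBA texture, four texels per matrix, and the vertex shader samples them instead of reading a large uniform array. This is not the code from the pull request; `Mat4` and `packBoneMatrices` are hypothetical names, and the one-row layout is an assumption made for simplicity.

```typescript
// A 4x4 matrix stored as 16 numbers.
type Mat4 = number[];

interface BoneTextureData {
    data: Float32Array; // RGBA float texels, 4 floats per texel
    width: number; // in texels
    height: number; // in texels
}

// Pack bone matrices into a single-row RGBA float texture.
// Each matrix uses 4 texels (4 texels x 4 channels = 16 floats).
function packBoneMatrices(matrices: Mat4[]): BoneTextureData {
    const TEXELS_PER_MATRIX = 4;
    const FLOATS_PER_MATRIX = 16;
    const data = new Float32Array(matrices.length * FLOATS_PER_MATRIX);
    matrices.forEach((matrix, index) => {
        if (matrix.length !== FLOATS_PER_MATRIX) {
            throw new Error(`Bone ${index} does not have 16 elements`);
        }
        data.set(matrix, index * FLOATS_PER_MATRIX);
    });
    return { data, width: matrices.length * TEXELS_PER_MATRIX, height: 1 };
}

const identity: Mat4 = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1];
const bones: Mat4[] = Array.from({ length: 157 }, () => identity.slice());
const packed = packBoneMatrices(bones);
console.log(packed.width, packed.height); // 628 1
```

With this layout the shader needs only a sampler and a small amount of bookkeeping regardless of bone count, rather than four uniform vectors per bone.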
There are also a few corollary items that we should probably do.
- Add code similar to the check in ANGLE to either Babylon Native when compiling shaders or to bgfx directly.
- File an issue on RenderDoc that it reports incorrect information when using too many uniforms.
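As a rough idea of what the first corollary item could look like, the sketch below counts uniform rows and fails loudly when a limit is exceeded. It is a deliberately simplified, conservative estimate, not the packing algorithm from the GLSL ES spec that ANGLE implements: every scalar or vector is charged a full row, so it may reject shaders that a real packer would accept. The types, function names, uniform names, and limits are all hypothetical.

```typescript
type UniformType = "float" | "vec2" | "vec3" | "vec4" | "mat2" | "mat3" | "mat4";

interface UniformDecl {
    name: string;
    type: UniformType;
    arraySize?: number; // omitted for non-array uniforms
}

// Conservative cost: one row per scalar/vector, one row per matrix column.
const ROWS_PER_TYPE: Record<UniformType, number> = {
    float: 1, vec2: 1, vec3: 1, vec4: 1,
    mat2: 2, mat3: 3, mat4: 4,
};

function estimateUniformRows(uniforms: UniformDecl[]): number {
    return uniforms.reduce((total, u) => {
        const count = u.arraySize === undefined ? 1 : u.arraySize;
        return total + ROWS_PER_TYPE[u.type] * count;
    }, 0);
}

// Throw instead of letting the driver fail silently.
function checkUniformBudget(uniforms: UniformDecl[], maxRows: number): void {
    const rows = estimateUniformRows(uniforms);
    if (rows > maxRows) {
        throw new Error(`Shader needs about ${rows} uniform rows but the limit is ${maxRows}`);
    }
}

// Illustrative declarations for a skinned vertex shader with 157 bones.
const skinnedShader: UniformDecl[] = [
    { name: "viewProjection", type: "mat4" },
    { name: "bones", type: "mat4", arraySize: 157 },
];
console.log(estimateUniformRows(skinnedShader)); // 632
```

A check along these lines at shader compile time would have turned the silent disappearance into an immediate, descriptive error.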
But wait, there is more! The author of the original issue reports that somehow, by enabling bone textures, the Android devices that didn’t work before are now working, but the Android devices that worked before are now not working?? The issue is now reversed! The mystery continues. Follow along in the issue for the latest developments.
Gary Hsu — Babylon Native Team Lead