Identifying GPU bottlenecks using PVRTrace in Fantasy Warrior 3D

Datetime: 2016-08-23 04:28:53          Topic: Cocos2d-X

In my previous post, I explained how to use the advanced features of PVRTune to find the specific causes of performance bottlenecks in the Fantasy Warrior 3D game built using the Cocos2d-x game engine.

In this post we will demonstrate how developers can use PVRTrace to locate GPU-related issues in this Cocos2d-x based game. We will also show the improvements that these changes will make to the overall performance of the optimized code.

Environment setup

To capture OpenGL ES call streams on an Android device, we used our PVRHub device-side debugging utility. Once a performance analysis file (*.pvrt) was generated, we used the PVRTrace GUI to inspect the API calls.

You can find my recording files here; they were recorded in the same environment as the previous PVRTune files.

The list of performance problems

Known issues

There are a number of known issues that I highlighted in my previous articles:

  • The game is CPU limited
  • The geometry data needs to be sorted

What we need to check with PVRTrace

  • The number of culled triangles is too high
    • Possible causes: sub-pixel, back-facing and off-screen polygons
  • HSR efficiency is too low
    • Possible cause: too many blended objects.
  • ISP usage is very high
    • Possible cause: high overdraw
  • Texturing load and Texture overload are high
    • Possible causes: high overdraw or too many blended objects
  • Too many API calls
    • Possible causes: too many draw calls and too many buffer data uploads.

Tracking and locating bottlenecks with PVRTrace

We can use PVRTrace to track and locate bottlenecks in this game. I randomly picked frame number 102 in the recording file (*.pvrt).

Analyzing frame data

Below is a summary of the draw calls that are issued by the application every frame:

#0: draw the terrain. Most of the terrain geometry is beyond the bounds of the screen.

#1: draw the water. It is off-screen, so this draw could be removed.

class Water : public cocos2d::Sprite

#2-7: draw the knight. #2-4 are the color render and #5-7 are the outline render (done in the vertex shader).

Since our GPU load is quite low, we can do the outline render in the fragment shader instead, which reduces the number of draw calls from six to three.

#8-13: draw the mage. Same as above.

#14-23: draw the archer. Same as above.

#24-29: draw the dragon. Same as above, but the dragon is not actually drawn on the frame buffer.

#30-33 & #34-37 & #38-41 & #42-45 & #46-49 & #50-53: draw the slime. Same as above. Since this device uses an OpenGL ES 3.x-capable GPU (PowerVR G6200), we can use instancing to combine them into one draw call.

#54-57 & #58-61: draw the piglet. Same as above. We can also instance these draws.
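One way to prepare the slime and piglet draws for instancing is to group the per-object draw requests by mesh on the CPU and collect one transform per instance. The sketch below is not Cocos2d-x code: DrawRequest, InstancedBatch and buildInstancedBatches are hypothetical names used only for illustration, and it assumes every instance of a mesh can share the same shader and textures. Each resulting batch could then be submitted with a single glDrawElementsInstanced call on OpenGL ES 3.0.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical types for illustration; not part of Cocos2d-x.
using Mat4 = std::array<float, 16>;

struct DrawRequest {
    int meshId;      // identifies the shared vertex/index buffers
    Mat4 transform;  // per-object model matrix
};

struct InstancedBatch {
    int meshId;
    std::vector<Mat4> transforms;  // one entry per instance
};

// Groups requests that share a mesh so that each group can be drawn with
// one instanced call. Batches keep the order in which meshes first appear.
std::vector<InstancedBatch> buildInstancedBatches(const std::vector<DrawRequest>& requests) {
    std::map<int, std::size_t> batchIndexByMesh;
    std::vector<InstancedBatch> batches;
    for (const DrawRequest& r : requests) {
        auto it = batchIndexByMesh.find(r.meshId);
        if (it == batchIndexByMesh.end()) {
            it = batchIndexByMesh.emplace(r.meshId, batches.size()).first;
            batches.push_back(InstancedBatch{r.meshId, {}});
        }
        batches[it->second].transforms.push_back(r.transform);
    }
    return batches;
}
```

With six slimes and two piglets as input, this produces two batches, so the per-object draws collapse into one instanced draw per mesh (per render pass).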

#62 & #66 & #68 & #70-74 & #81 & #88 & #92: draw the FX – shadow & effects. Same as above: we can instance these draws, and we can group the meshes for the others.

class CC_DLL BillBoard : public Sprite.

#63-65 & #67 & #69 & #75-80 & #82-87 & #89-91: We’re drawing a sprite with a null texture assigned. Maybe we can remove it.

#93 & #100 & #108: draw UI – avatar & HUD & critical effects. These can be batched into one draw.

#94-99 & #101-106: draw UI – blood bar & angry bar. These can be batched into one draw.

In local BattlefieldUI = class(“battlefieldUI”,function() return cc.Layer:create() end)

Class cocos2d::Sprite

#107: draw the countdown.

#109-111: draw on-screen number & FPS & GL counters.

Based on the initial analysis described above, we should be able to reduce the number of draws each frame from 112 (draw calls #0 to #111) to 46.

Reduce the number of culled triangles

There are lots of objects outside of the screen space. To decrease the number of triangles culled by the GPU, CPU-side culling (e.g. view frustum culling) will be required. Additionally, we should enable back-face culling for opaque objects wherever possible.
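As a rough illustration of CPU-side view frustum culling, the sketch below tests an axis-aligned bounding box against six frustum planes. It is a generic technique, not the Cocos2d-x implementation; the Vec3, Plane and AABB types and the isOutsideFrustum function are hypothetical, and the planes are assumed to have normals that point into the frustum.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Plane n.x*x + n.y*y + n.z*z + d = 0, with the normal pointing into the frustum.
struct Plane { Vec3 n; float d; };

struct AABB { Vec3 min, max; };

// Returns true when the box lies entirely on the outer side of at least one plane.
bool isOutsideFrustum(const AABB& box, const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes) {
        // Pick the box corner that is farthest along the plane normal.
        const Vec3 v{p.n.x >= 0 ? box.max.x : box.min.x,
                     p.n.y >= 0 ? box.max.y : box.min.y,
                     p.n.z >= 0 ? box.max.z : box.min.z};
        if (p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d < 0) {
            return true;  // even the farthest corner is behind this plane
        }
    }
    return false;  // conservatively treated as visible
}
```

The test is conservative: a box that passes may still be invisible, but a box that is reported as outside can safely be skipped before any draw call is issued.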

Analyzing frame data

According to the analysis data, many objects are off-screen, especially the terrain (one big mesh with 18159 vertices). PVRTrace has a useful feature for detecting them: if you enable Fragment Analysis mode in the Shader Analysis widget, the Draw Calls widget shows a fragment count for each draw call. Several draw calls have a fragment count of 0, which means that these draws didn’t contribute to the color of the render target (i.e. they are off-screen).

#0-61: draw calls for all the meshes.

GL state             Current value
GL_CULL_FACE         GL_TRUE
GL_CULL_FACE_MODE    GL_BACK (GL_FRONT for the outline effect)
GL_FRONT_FACE        GL_CCW

Conclusion

From the information presented above, we need to split the big geometry objects into appropriate pieces. We also need to implement CPU-side culling (e.g. view frustum culling).
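To make the idea of splitting a large mesh concrete, here is a minimal sketch that cuts a terrain into chunks and records the height range of each chunk, so that every chunk gets its own bounding box for culling. It assumes the terrain is a regular height grid, which may not match the actual Fantasy Warrior 3D terrain mesh; TerrainChunk and splitTerrain are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct TerrainChunk {
    int x0, z0, x1, z1;  // inclusive vertex index range in the grid
    float minY, maxY;    // height bounds for the chunk's bounding box
};

// Splits a width x depth vertex grid into chunks of at most chunkCells cells
// per side. Neighbouring chunks share their border vertices so no gaps appear.
std::vector<TerrainChunk> splitTerrain(const std::vector<float>& heights,
                                       int width, int depth, int chunkCells) {
    assert(width > 1 && depth > 1 && chunkCells > 0);
    assert(heights.size() == static_cast<std::size_t>(width) * depth);
    std::vector<TerrainChunk> chunks;
    for (int z0 = 0; z0 < depth - 1; z0 += chunkCells) {
        for (int x0 = 0; x0 < width - 1; x0 += chunkCells) {
            TerrainChunk c{x0, z0,
                           std::min(x0 + chunkCells, width - 1),
                           std::min(z0 + chunkCells, depth - 1),
                           heights[z0 * width + x0], heights[z0 * width + x0]};
            for (int z = c.z0; z <= c.z1; ++z) {
                for (int x = c.x0; x <= c.x1; ++x) {
                    const float h = heights[z * width + x];
                    c.minY = std::min(c.minY, h);
                    c.maxY = std::max(c.maxY, h);
                }
            }
            chunks.push_back(c);
        }
    }
    return chunks;
}
```

The chunk size is a trade-off: smaller chunks cull more off-screen vertices but add draw calls, so it should be tuned against the draw call budget discussed in this post.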

Reduce overdraw

Increasing HSR efficiency means reducing the number of transparent objects that overlap each other in each frame. We therefore need to reduce the number of alpha-blended objects, and avoid both discard in the fragment shader and alpha testing.

We need to reorder the objects rendered in each frame as follows:

[opaque objects] -> [alpha test objects/with discard in fragment shader (try to avoid this to benefit from the PowerVR TBDR architecture)] -> [alpha blended objects]
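The ordering above can be sketched as a sort over render items. This is an illustration rather than the Cocos2d-x renderer: RenderItem, Pass and sortForTBDR are hypothetical names. The sketch also sorts blended items back to front, which is a common convention that the ordering rule above does not spell out; opaque items are left in submission order because HSR on PowerVR removes hidden opaque fragments regardless of draw order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Pass { Opaque = 0, AlphaTest = 1, Blended = 2 };

struct RenderItem {
    int id;
    Pass pass;
    float viewDepth;  // distance from the camera, larger means farther away
};

// Orders items as: opaque, then alpha-test/discard, then alpha-blended.
void sortForTBDR(std::vector<RenderItem>& items) {
    std::stable_sort(items.begin(), items.end(),
        [](const RenderItem& a, const RenderItem& b) {
            if (a.pass != b.pass) return a.pass < b.pass;
            // Assumption: blended items are drawn back to front so that
            // blending composes correctly.
            if (a.pass == Pass::Blended) return a.viewDepth > b.viewDepth;
            return false;  // keep submission order in the other groups
        });
}
```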

Analyzing frame data

#0-61: draw calls for all the meshes.

GL state              Current value
GL_DEPTH_WRITEMASK    GL_TRUE
GL_DEPTH_TEST         GL_TRUE
GL_BLEND              GL_TRUE

Opaque objects should disable GL_BLEND.

Those draw calls should come first.

#62 & #66 & #68 & #70-74 & #81 & #88 & #92: draw the FX – shadow & effects.

Alpha blended objects

GL state              Current value
GL_DEPTH_WRITEMASK    GL_FALSE
GL_DEPTH_TEST         GL_TRUE
GL_BLEND              GL_TRUE

Those draw calls should come second.

#93-108: draw all the game UI.

Alpha blended objects

GL state              Current value
GL_DEPTH_WRITEMASK    GL_FALSE
GL_DEPTH_TEST         GL_FALSE
GL_BLEND              GL_TRUE

Those draw calls should come third.

Conclusion

To increase HSR efficiency, we should draw opaque objects with GL_BLEND == GL_FALSE and use alpha blending to replace discarding. We also need to find a way to reorder the render commands in Cocos2d-x to follow the render sequence rule.

Reduce Texture Load and Texture Overload

Reducing Texture Load and Texture Overload means:

  • Reduce overdraw
  • Use compressed textures

Analyzing frame data

#0 & #2-61: draw calls for all the meshes. We should optimize complex geometry draw calls to reduce overdraw.

#0: draw terrain using a 2K x 2K texture.

#2-61: draw calls for all the skinning characters. We should optimize complex geometry draw calls to reduce overdraw.

#62 & #66 & #68 & #70-74 & #81 & #88 & #92: draw the FX – shadow & effects. Lots of small quad objects sharing the same TextureAtlas should be grouped together.

Conclusion

To reduce texture overload, we must split terrain geometry into appropriate pieces, optimize the complex geometry draw process and group small quad objects.
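Grouping small quads that share one texture atlas can be sketched as merging them into a single vertex and index buffer, so the whole group is drawn with one call. This is a generic illustration, not the Cocos2d-x batching code; Vertex, Quad, QuadBatch and mergeQuads are hypothetical names, and 16-bit indices are an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <vector>

struct Vertex { float x, y, u, v; };

// Corner order: bottom-left, bottom-right, top-right, top-left.
struct Quad { Vertex corners[4]; };

struct QuadBatch {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Merges quads that use the same texture atlas into one indexed triangle list.
QuadBatch mergeQuads(const std::vector<Quad>& quads) {
    assert(quads.size() * 4 <= 65536);  // indices must fit in 16 bits
    QuadBatch batch;
    batch.vertices.reserve(quads.size() * 4);
    batch.indices.reserve(quads.size() * 6);
    for (const Quad& q : quads) {
        const auto base = static_cast<std::uint16_t>(batch.vertices.size());
        for (const Vertex& v : q.corners) batch.vertices.push_back(v);
        // Two triangles per quad: (0,1,2) and (0,2,3).
        for (int offset : {0, 1, 2, 0, 2, 3}) {
            batch.indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return batch;
}
```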

Reduce API calls

Reducing API calls means:

  • Reduce the number of glBufferSubData and glBufferData with GL_DYNAMIC_DRAW per frame
  • Split objects to reduce the matrix palette size
  • Remove outside frame buffer objects
  • Improve outline effects
  • Improve sprite draw

Analyzing data

API call                                Count in this frame
glBufferSubData                         4
glBufferData with GL_DYNAMIC_DRAW       40
glDrawElements                          100
glDrawArrays                            12

Conclusion

There are 112 draw calls in this frame (100 glDrawElements plus 12 glDrawArrays); we can reduce them to 46. The 40 glBufferData calls can be removed by drawing the varying quads with a static mesh (Draw Varying Quad With Static Mesh); this means removing half of the bands in the following graph:

Recommended application optimizations

Let’s list all the improvements that need to be made to Cocos2d-x:

  • Separate the OpenGL API call submission from all other CPU work
  • Increase the efficiency of vertex transforms
  • Split terrain to reduce out-of-screen vertices
  • Optimize view frustum culling
  • Reduce state changes
  • Disable blending for opaque objects
  • Split skinning object to reduce matrix palette size
  • Optimize the skinning vertex shader

Implementing the optimizations

After identifying a number of performance issues in Cocos2d-x, we implemented fixes for the simplest problems and reported the rest to Chukong Technologies to be considered for future revisions of the renderer.

Here are the pull requests for the following changes.

Move rendering to a dedicated thread to address the CPU limit

Target: Reduce the gaps between graphics API calls.

Solution: Split the OpenGL API call submission from all the other CPU work, i.e. create a thread dedicated to rendering tasks.

Result: TO DO
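As an illustration of the dedicated render thread idea, the sketch below queues render commands on the game thread and executes them on a worker thread. It is a minimal example under stated assumptions and not the change proposed for Cocos2d-x: the RenderThread class is hypothetical, commands are plain callables, and in a real renderer the OpenGL ES context would have to be made current on the worker thread before any GL call is issued.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical command queue drained by a dedicated thread.
class RenderThread {
public:
    RenderThread() : worker_([this] { run(); }) {}

    ~RenderThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();  // remaining commands are executed before exit
    }

    // Called from the game thread; returns without waiting for execution.
    void submit(std::function<void()> command) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            commands_.push(std::move(command));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> command;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !commands_.empty(); });
                if (commands_.empty()) return;  // stopping and nothing left
                command = std::move(commands_.front());
                commands_.pop();
            }
            command();  // a real renderer would issue GL calls here
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> commands_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the other members exist first
};
```

Commands run in submission order, so the game thread can keep preparing the next frame while the worker submits the current one.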

Increase the efficiency of vertex transform

Target: Reduce Vertices per triangle

Solution: Use our triangle sorting algorithm to optimize the meshes. Integrate this algorithm into fbx-conv.

Result: TO DO
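The triangle sorting algorithm itself is not reproduced here, but its effect can be estimated offline. The sketch below computes vertices per triangle for an indexed triangle list by simulating a FIFO post-transform vertex cache; the FIFO policy and the cache size are assumptions for illustration and do not describe the actual PowerVR hardware, and verticesPerTriangle is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simulates a FIFO vertex cache and returns the number of vertex shader
// invocations per triangle for an indexed triangle list (lower is better).
double verticesPerTriangle(const std::vector<std::uint32_t>& indices,
                           std::size_t cacheSize) {
    assert(cacheSize > 0 && !indices.empty() && indices.size() % 3 == 0);
    std::deque<std::uint32_t> cache;
    std::size_t misses = 0;
    for (std::uint32_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) == cache.end()) {
            ++misses;  // vertex must be transformed again
            cache.push_back(index);
            if (cache.size() > cacheSize) cache.pop_front();
        }
    }
    return static_cast<double>(misses) / static_cast<double>(indices.size() / 3);
}
```

Running this before and after reordering a mesh gives a quick check that the new triangle order actually reuses more transformed vertices.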

Split terrain to reduce out-of-screen vertices

Target: Reduce the number of culled triangles and reduce Texture Load

Solution: Use cocos2d::Terrain for the terrain object.

Result: TO DO

Optimize view frustum culling

Target: Improve CPU side culling

Solution: Add CPU-side culling for 2D billboards

Results

Before:

After:

Once we removed the 2D billboards that are outside the screen space, the Cocos2d-x batching mechanism automatically batched the draw calls rendered with the same effect. You can find the pull request here.

Reduce state changes

Target: Batch draw calls and reduce redundant GL API calls.

Solution: Reorder render commands by glProgram, gltexture, etc. and use ccGLStateCache.

Result: TO DO
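To show why reordering by state helps, the sketch below sorts render commands by a (program, texture) key and counts how many binds a cache that skips redundant binds would issue. It is an illustration only: RenderCommand, sortByState and countStateChanges are hypothetical names, and the reordering is assumed to be applied within a group whose draw order does not affect the final image (for example opaque objects).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

struct RenderCommand {
    std::uint32_t programId;
    std::uint32_t textureId;
    int id;
};

// Sorts so that commands sharing a program, then a texture, are adjacent.
void sortByState(std::vector<RenderCommand>& cmds) {
    std::stable_sort(cmds.begin(), cmds.end(),
        [](const RenderCommand& a, const RenderCommand& b) {
            return std::tie(a.programId, a.textureId) <
                   std::tie(b.programId, b.textureId);
        });
}

// Counts the program and texture binds issued when redundant binds are skipped.
int countStateChanges(const std::vector<RenderCommand>& cmds) {
    int changes = 0;
    bool first = true;
    std::uint32_t program = 0, texture = 0;
    for (const RenderCommand& c : cmds) {
        if (first || c.programId != program) { ++changes; program = c.programId; }
        if (first || c.textureId != texture) { ++changes; texture = c.textureId; }
        first = false;
    }
    return changes;
}
```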

Disable blending for opaque objects

Target: Reduce overdraw

Solution: Use cocos2d::RenderState to disable blending for the QUEUE_GROUP::OPAQUE_3D command queue.

Result: Reduced HSR usage from 20.5% to 9% and ISP overload from 79.5% to 8%

Before:

After:

You can find the pull request here.

Split skinning object to reduce matrix palette size

Target: Reduce bandwidth and increase vertex shader efficiency

Solution: Use fbx-conv to regenerate the skinning objects (8 bones for each part) and change the skinning shader.

Result: Reduced the Processing load: Vertex value from 14.4% to 2%

You can find the pull request here.
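The splitting step can be illustrated with a simple greedy partition that assigns triangles to mesh parts so that no part references more bones than the palette limit. This is a sketch of the general idea, not the fbx-conv change itself: SkinnedTriangle, MeshPart and splitByBoneLimit are hypothetical names, and remapping bone indices to palette-local indices is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

struct SkinnedTriangle {
    std::vector<int> bones;  // bone indices used by the triangle's vertices
};

struct MeshPart {
    std::set<int> bones;                 // matrix palette for this part
    std::vector<std::size_t> triangles;  // indices into the input list
};

// Greedy first-fit: put each triangle into the first part whose palette
// still fits within maxBones after adding the triangle's bones.
std::vector<MeshPart> splitByBoneLimit(const std::vector<SkinnedTriangle>& tris,
                                       std::size_t maxBones) {
    std::vector<MeshPart> parts;
    for (std::size_t i = 0; i < tris.size(); ++i) {
        const std::set<int> own(tris[i].bones.begin(), tris[i].bones.end());
        assert(own.size() <= maxBones);  // a single triangle must fit
        bool placed = false;
        for (MeshPart& part : parts) {
            std::set<int> merged = part.bones;
            merged.insert(own.begin(), own.end());
            if (merged.size() <= maxBones) {
                part.bones = std::move(merged);
                part.triangles.push_back(i);
                placed = true;
                break;
            }
        }
        if (!placed) {
            MeshPart part;
            part.bones = own;
            part.triangles.push_back(i);
            parts.push_back(std::move(part));
        }
    }
    return parts;
}
```

First-fit is not optimal, but it keeps each part's palette within the limit (8 bones in the change described above), which is what allows a smaller uniform array in the skinning shader.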

Optimize the skinning vertex shader

Target: Improve performance for the skinning vertex shader

Solution: Use sufficient precision for shader variables, remove unused uniforms and optimize shader code with PVRShaderEditor

Result: TO DO

Conclusion

In this series, I’ve shown how the PowerVR Tools can be used to analyze GPU performance, isolate bottlenecks and identify issues in an application’s OpenGL ES call stream.

Here is a short summary of the optimization process:

  1. Utilize PVRTune to identify bottlenecks
  2. Utilize PVRTrace to track and locate bottlenecks
  3. Optimize the bottlenecks using our PowerVR Tools kit or any other useful toolkit
  4. Test the optimization to check that we removed the bottlenecks without introducing any additional bugs.
  5. Profile the target once again to check whether it is good enough, or whether we need another iteration.

Further reading

Here is a list of recommended reading material:

You can find the PDF files above here, as well as many other great resources on our PowerVR documentation page.

