Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use fullscreen triangles for fullscreen passes. #21541

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

ACEfanatic02
Copy link

Instead of using a vertex buffer with a quad triangle strip, generate a single fullscreen triangle in the vertex shader. This has a number of advantages:

  • Avoids fragment quad overshading along the diagonal.

  • Avoids creating, filling, reading, and destroying a vertex buffer.

Instead of using a vertex buffer with a quad triangle strip, generate a
single fullscreen triangle in the vertex shader.  This has a number of
advantages:

- Avoids fragment quad overshading along the diagonal.

- Avoids creating, filling, reading, and destroying a vertex buffer.
@ZehMatt
Copy link
Member

ZehMatt commented Mar 7, 2024

Do you have any numbers to how it affects it?

@janisozaur
Copy link
Member

I want to test it with my raspberry pi

@ACEfanatic02
Copy link
Author

Do you have any numbers to how it affects it?

No. The game currently crashes under DLL injection (see Discord for a discussion), which makes it difficult to profile externally, and explicit timing queries have not yet been added. With the (lack of) complexity of these fragment shaders, I'd be shocked if there was a measurable GPU performance difference in either direction, though. The main win here in practice is less code and fewer allocations to track.

@ZehMatt
Copy link
Member

ZehMatt commented Mar 7, 2024

I'm not so sure about the allocations, the shader program is initialized once. I'm a bit hesitant in doing any changes here, the code we currently have is at least battle tested, if there is not a huge gain then I don't really see the justification at the moment.

@ZehMatt
Copy link
Member

ZehMatt commented Mar 7, 2024

Also gonna tag @LRFLEW, he is the original author for most of the code

@LRFLEW
Copy link
Contributor

LRFLEW commented Mar 7, 2024

It's been a while since I've been active in this project, but thank you for the ping.

Yes, the code for ApplyPaletteShader and ApplyTransparencyShader were my contributions as part of #6541. However, a lot of that code was based on even older code for the now defunct CopyFramebufferShader (which I replaced with glBlitFramebuffer).

Using the fullscreen triangle makes sense. It's not something I was taught (so wouldn't have thought of back in the day), but looking into the idea, it does seem legit, so I support that change.

It's the other part of this PR I'm unsure of. Pretty much all the vertex shaders used to use an index like this PR uses, though implemented in a more complicated way and still used a VBO. I opted to replace that system with a simpler VBO-based system in my rewrite, as it seemed like an improvement.

Avoids creating, filling, reading, and destroying a vertex buffer.

The creation, filling, and destruction of the VBO is only done once per process, so I don't see that as a major performance concern. Similarly, the data is only 16 floats, so I don't think it's a significant VRAM consideration. The only question, then, is whether having the GPU read the VBO is significantly slower than computing the values in the shader. I would assume not, since GPUs are optimized for processing VBO and Texture data, and having only 3 vertices reduces the GPU's ability to parallelize the computation. In terms of code quality, I guess the question is whether we'd rather more complexity in the shader or more complexity in the supporting C++ code. Currently, I'm not sold on the change, and would like to see some performance numbers or a reference for this idea if possible.

@ACEfanatic02
Copy link
Author

The creation, filling, and destruction of the VBO is only done once per process, so I don't see that as a major performance concern.

Removing the buffer isn't really about raw performance, but complexity reduction. Code that doesn't exist can't have bugs.

The only question, then, is whether having the GPU read the VBO is significantly slower than computing the values in the shader. I would assume not, since GPUs are optimized for processing VBO and Texture data, and having only 3 vertices reduces the GPU's ability to parallelize the computation.

There's two ways to look at this:

  • In isolation, looking at only the vertex shader, yes this will be faster. Three integer ops, two conversions, and two FMAs will be faster than loading a cache line to read the vertex buffer. The number of vertices is actually irrelevant in this case, because shaders run in a wide (32- or 64-thread) SIMT model. Rendering one triangle versus two takes exactly the same number of VS threadgroups.
  • In practice, this will not be faster or slower, because there's no hazard with this vertex shader and so it will (most likely) spawn its one threadgroup in amongst the tail end of the previous pass's FS work where there are unoccupied shader units. So either way the VS for fullscreen passes tends to be "free" unless it has an actual dependency (or if there is a manual barrier in place.)

On its own this is not a particularly inspiring change, I agree. I wanted to start with something small to learn the PR / review process.

@ZehMatt
Copy link
Member

ZehMatt commented Mar 10, 2024

We did a bit of house cleaning, this needs to rebased. I looked at the changes and I think they seem fine, when I had a brief glance initially I thought its doing something different. Should also wait for @LRFLEW, but I'm fine with this.

@LRFLEW
Copy link
Contributor

LRFLEW commented Mar 11, 2024

I've left review comments on things I'd like to see changed in the shaders before this gets merged. If those are all addressed, I have no qualms with this being merged.

If we're using the VertexID trick here, it might be worth considering it for the other shaders, which use a small VBO in conjunction with instancing for drawing. It doesn't have to be with this PR, but I did just want to make a note of that.

@@ -1,12 +1,14 @@
#version 150

in int gl_VertexID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this is unnecessary (and may cause complications), as the gl_* variables are built-in and don't need to be declared. (eg. There's no declaration for gl_Position before using it).

Ditto for the other shader.

@@ -1,12 +1,14 @@
#version 150

in int gl_VertexID;

in vec4 vPosition;
in vec2 vTextureCoordinate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If vPosition and vTextureCoordinate are no longer used, then the declarations here should be removed.

Ditto for the other shader.

fTextureCoordinate = vTextureCoordinate;
gl_Position = vPosition;
fTextureCoordinate = vec2((gl_VertexID << 1) & 2, gl_VertexID & 2);
gl_Position = vec4(fTextureCoordinate * 2.0 + -1.0, 0.0, 1.0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if reading from an output variable like this is well defined or well supported across the wide range of GLSL interpreters. Personally, I'd change this to be three lines, where the first creates a temporary variable that's read by the other two lines. Personally, I'd also make the temporary variable use only 0.0 and 1.0 values, and make the other two lines use * 2.0 and * 4.0 - 1.0 respectively, though I don't think that's a big deal either way. I think there should also be a comment here to help explain what this is doing (i.e. computing vertex data from vertex id).

All these ditto for the other shader.

Copy link
Member

@janisozaur janisozaur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same as my baseline on ARM SBC

@Gymnasiast
Copy link
Member

@ACEfanatic02 Could you address the line notes that @LRFLEW added?

Copy link

github-actions bot commented May 1, 2024

This pull request has been marked as stale and will be closed in 14 days if no action is taken. To keep it open, leave a comment or remove the stale-pr label. If you're awaiting feedback from a developer, please send us a reminder (either here or on Discord).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants