readme.md
+11-11
Original file line number
Diff line number
Diff line change
@@ -50,28 +50,21 @@ And you're good to go.
50
50
**Note: for the ultimate 5-10x performance increase, you'll need to let `fast_gauss`'s shader directly write to your desired framebuffer.**
51
51
52
52
Currently, we are trying to automatically detect whether you're managing your own OpenGL context (i.e. opening up a GUI) by checking for the module `OpenGL` during the import of `fast_gauss`.
53
-
54
-
If detected, all rendering command will return `None`s and we will directly write to the bound framebuffer at the time of the draw call.
55
-
53
+
If detected, all rendering commands will return `None`s and we will directly write to the bound framebuffer at the time of the draw call.
56
54
Thus if you're running in a GUI (OpenGL-based) environment, the output of our rasterizer will be `None`s and does not require further processing.
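A minimal caller-side sketch of what this implies, not `fast_gauss` code. It assumes the detection amounts to checking whether the `OpenGL` module has already been imported; `opengl_already_imported` and `postprocess` are hypothetical names used only for illustration.

```python
import sys


def opengl_already_imported() -> bool:
    # Assumption: "checking for the module OpenGL" means looking at the
    # modules that were imported before fast_gauss.
    return "OpenGL" in sys.modules


def postprocess(rendered, alpha):
    # In the GUI path the rasterizer output is None: the image has already
    # been written to the bound framebuffer, so there is nothing to process.
    if rendered is None:
        return None
    # Offline path: actual values came back and can be processed further.
    return {"image": rendered, "alpha": alpha}
```

Guarding on `None` keeps the same calling code usable in both the GUI and the offline path.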
57
55
58
56
-[ ] TODO: Improve offline rendering performance.
59
57
-[ ] TODO: Add a warning to the user if they're performing further processing on the returned values.
60
58
61
-
62
59
**Note: the speedup is the most visible when the pixel-to-point ratio is high.**
63
60
64
61
That is, when there are large Gaussians and very high-resolution rendering, the speedup is more visible.
65
-
66
62
The CUDA-based software implementation is more resolution sensitive and for some extremely dense point clouds (> 1 million points), the CUDA implementation might be faster.
67
-
68
63
This is because the typical rasterization-based pipeline on modern graphics hardware is [not well-optimized for small triangles](https://www.youtube.com/watch?v=hf27qsQPRLQ&list=WL).
69
64
70
-
71
65
**Note: for best performance, cache the persistent results (for example, the 6 elements of the covariance matrix).**
72
66
73
67
This is more of a general tip and not directly related to `fast_gauss`.
74
-
75
68
However, the impact is more observable here since we haven't implemented a fast 3D covariance computation (from scales and rotations) in the shader yet.
76
69
Only a PyTorch implementation is available for now.
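A minimal sketch of the caching tip, written in plain Python for a single Gaussian so that it runs without PyTorch. It assumes the usual 3D Gaussian Splatting convention: covariance `R S Sᵀ Rᵀ`, quaternions ordered `(w, x, y, z)`, and the six cached values being the upper triangle of the symmetric matrix. `quat_to_rotmat` and `covariance6` are hypothetical helper names, not part of `fast_gauss`.

```python
import math


def quat_to_rotmat(q):
    # Normalize the quaternion (w, x, y, z), then build the 3x3 rotation matrix.
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance6(scale, quat):
    R = quat_to_rotmat(quat)
    # M = R @ diag(scale): column j of R multiplied by scale[j].
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]

    def entry(i, j):
        # Entry (i, j) of Sigma = M @ M^T.
        return sum(M[i][k] * M[j][k] for k in range(3))

    # Upper triangle of the symmetric matrix: xx, xy, xz, yy, yz, zz.
    return (entry(0, 0), entry(0, 1), entry(0, 2),
            entry(1, 1), entry(1, 2), entry(2, 2))


# Compute once, outside the render loop, and reuse the result every frame.
scales = [(1.0, 2.0, 3.0)]
quats = [(1.0, 0.0, 0.0, 0.0)]
cov6_cache = [covariance6(s, q) for s, q in zip(scales, quats)]
```

As long as the scales and rotations do not change between frames, `cov6_cache` can be passed in as-is instead of being recomputed per draw call.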
77
70
@@ -83,12 +76,10 @@ Thus, store the concatenated tensors instead and avoid concatenating them in eve
83
76
-[ ] TODO: Warn users if they're not properly precomputing the covariance matrix.
84
77
-[ ] TODO: Implement a more optimized `OptimizedGaussians` for precomputing things and apply a cache. Similar to that of the vertex shader (see [Invocation frequency](https://www.khronos.org/opengl/wiki/Vertex_Shader)).
85
78
86
-
87
79
**Note: it's recommended to pass in a CPU tensor in the `GaussianRasterizationSettings` to avoid explicit synchronizations for even better performance.**
88
80
89
81
-[ ] TODO: Add a warning to the user if GPU tensors are detected.
90
82
91
-
92
83
**Note: the second output of the `GaussianRasterizer` is not radii anymore (since we're not gonna use it for the backward pass), but the alpha values of the rendered image instead.**
93
84
94
85
And the alpha channel content seems to be bugged currently, will debug.
@@ -107,7 +98,7 @@ And the alpha channel content seems to be bugged currently, will debug.
107
98
108
99
## Implementation
109
100
110
-
**Goal:**
101
+
**Guidelines**
111
102
112
103
- Let the professionals do the work.
113
104
- Let GPU do the large-scale sorting.
@@ -119,6 +110,15 @@ And the alpha channel content seems to be bugged currently, will debug.
119
110
- Enabled by using `non_blocking=True` data passing and moving sync points to as early as possible.
120
111
- Boosted by the fact that we're sorting on the GPU, thus no need to perform synchronized host-to-device copies.
121
112
113
+
**Why does a global sort work?**
114
+
115
+
The OpenGL specification is somewhat vague but there's this reference:
116
+
(in the 4th paragraph of section 2.1 of chapter 2 of this specification: https://registry.khronos.org/OpenGL/specs/gl/glspec44.core.pdf)
117
+
118
+
> Commands are always processed in the order in which they are received, although there may be an indeterminate delay before the effects of a command are realized. This means, for example, that one primitive must be drawn completely before any subsequent one can affect the framebuffer.
119
+
120
+
Thus, if the order of the data in the vertex buffer (or as specified by an index buffer) is back-to-front and alpha blending is enabled, you can count on OpenGL to update the framebuffer in the correct back-to-front order.
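The argument above can be illustrated on the CPU with a small sketch for a single pixel. It assumes the standard "over" blend, i.e. the equation that `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)` selects; the blend state `fast_gauss` actually uses is not stated here, and `blend_in_order` and `sort_back_to_front` are hypothetical names.

```python
def blend_in_order(fragments, background=0.0):
    # Blend fragments strictly in submission order, which is what the quoted
    # ordering guarantee implies. Each fragment is (depth, color, alpha).
    color = background
    for _depth, src, alpha in fragments:
        color = alpha * src + (1.0 - alpha) * color
    return color


def sort_back_to_front(fragments):
    # One global sort: larger depth means farther away, so it is drawn first.
    return sorted(fragments, key=lambda f: f[0], reverse=True)


near = (1.0, 1.0, 0.5)  # depth 1, white, half transparent
far = (2.0, 0.0, 0.5)   # depth 2, black, half transparent

sorted_result = blend_in_order(sort_back_to_front([near, far]))
unsorted_result = blend_in_order([near, far])  # front-to-back submission
```

Here `sorted_result` is `0.5` while `unsorted_result` is `0.25`: the blend is order-dependent, so the result is only correct because submission order is preserved.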