Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

dma-fence: Use kernel's sort for merging fences

One alternative to the fix Christian proposed in
https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@amd.com/
is to replace the rather complex open coded sorting loops with the kernel
standard sort followed by a context squashing pass.

The proposed advantage is readability, but one concern Christian raised
was that there could be many fences and that they are typically mostly
sorted, so the kernel's heap sort could perform much worse in that case.

I ran some games and vkcube to see what the typical number of input
fences is. Tested scenarios:

1) Hogwarts Legacy under Gamescope

450 calls per second to __dma_fence_unwrap_merge.

Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:

  N     Before    After
  0      0.91%
  1     69.40%
  2-3   28.72%    9.4% (90.6% resolved to one fence)
  4-5    0.93%
  6-9    0.03%
  10+

2) Cyberpunk 2077 under Gamescope

1050 calls per second, amounting to 0.01% CPU time according to perf top.

  N     Before    After
  0      1.13%
  1     52.30%
  2-3   40.34%    55.57%
  4-5    1.46%     0.50%
  6-9    2.44%
  10+    2.34%

3) vkcube under Plasma

90 calls per second.

  N     Before    After
  0
  1
  2-3   100%      0% (i.e. all resolved to a single fence)
  4-5
  6-9
  10+

In the case of vkcube all invocations in the 2-3 bucket were actually
just two input fences.

From these numbers it looks like the heap sort should not be a
disadvantage, given how the dominant case is <= 2 input fences which heap
sort solves with just one compare and swap. (And for the case of one input
fence we have a fast path in the previous patch.)

A complementary possibility is to implement a different sorting algorithm
under the same API as the kernel's sort() and so keep the simplicity,
potentially moving the new sort under lib/ if it proves more widely
useful.
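
As an illustration of that complementary possibility, here is a minimal
userspace sketch of an insertion sort that takes the same base/num/size/cmp
arguments as the kernel's sort(). The names insertion_sort, swap_bytes and
cmp_int are hypothetical and exist neither in this patch nor in lib/. The
sketch only shows that a sort which is cheap on mostly sorted input could sit
behind the same calling convention; it is not a proposal for the actual
implementation.

```c
#include <stddef.h>

/* Same comparator shape as the kernel's cmp_func_t. */
typedef int (*cmp_fn)(const void *a, const void *b);

/* Swap two elements of 'size' bytes, one byte at a time. */
static void swap_bytes(unsigned char *a, unsigned char *b, size_t size)
{
	while (size--) {
		unsigned char t = *a;

		*a++ = *b;
		*b++ = t;
	}
}

/*
 * Hypothetical helper: insertion sort over an array of 'num' elements of
 * 'size' bytes each. On already sorted input it performs num - 1 compares
 * and no swaps; the worst case is quadratic.
 */
static void insertion_sort(void *base, size_t num, size_t size, cmp_fn cmp)
{
	unsigned char *a = base;
	size_t i, j;

	for (i = 1; i < num; i++) {
		/* Bubble element i left until its left neighbour is not larger. */
		for (j = i; j > 0 && cmp(a + (j - 1) * size, a + j * size) > 0; j--)
			swap_bytes(a + (j - 1) * size, a + j * size, size);
	}
}

/* Example comparator for plain ints, ascending. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}
```

Calling insertion_sort(v, n, sizeof(v[0]), cmp_int) on an int array v sorts it
in place. Whether the quadratic worst case is acceptable depends on how large
the 10+ bucket can get in practice, which the numbers above do not bound.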

v2:
* Hold on to fence references and reduce commentary. (Christian)
* Record and use latest signaled timestamp in the 2nd loop too.
* Consolidate zero or one fences fast paths.

v3:
* Reverse the seqno sort order for a simpler squashing pass. (Christian)
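
To illustrate why the reversed seqno order simplifies the squashing pass, here
is a simplified userspace sketch. struct toy_fence, toy_fence_cmp and
toy_squash are hypothetical stand-ins, qsort() replaces the kernel's sort(),
and plain integer comparison replaces dma_fence_is_later(). Reference counting
(dma_fence_put() on dropped entries) is left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for struct dma_fence: only the fields the sort looks at. */
struct toy_fence {
	uint64_t context;
	uint64_t seqno;
};

/* Ascending by context, then descending by seqno (latest first). */
static int toy_fence_cmp(const void *_a, const void *_b)
{
	const struct toy_fence *a = *(const struct toy_fence * const *)_a;
	const struct toy_fence *b = *(const struct toy_fence * const *)_b;

	if (a->context != b->context)
		return a->context < b->context ? -1 : 1;
	if (a->seqno != b->seqno)
		return a->seqno > b->seqno ? -1 : 1;
	return 0;
}

/*
 * Sort, then keep only the first entry of each run of equal contexts.
 * Because the latest seqno sorts first within a context, the survivor is
 * the most recent fence and no look-ahead is needed.
 * Returns the new element count.
 */
static size_t toy_squash(struct toy_fence **array, size_t count)
{
	size_t i, j = 0;

	if (count < 2)
		return count;

	qsort(array, count, sizeof(*array), toy_fence_cmp);

	for (i = 1; i < count; i++) {
		/* A new context starts: keep its first (latest) fence. */
		if (array[i]->context != array[j]->context)
			array[++j] = array[i];
	}
	return j + 1;
}
```

With ascending seqno order the pass would instead have to keep the last entry
of each run, which means overwriting array[j] repeatedly rather than simply
skipping duplicates.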

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: 245a4a7b531c ("dma-buf: generalize dma_fence unwrap & merging v3")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3617
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: <stable@vger.kernel.org> # v6.0+
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115102153.1980-3-tursulin@igalia.com

Authored by Tvrtko Ursulin and committed by Christian König
fe52c649 949291c5

+61 -67
drivers/dma-buf/dma-fence-unwrap.c
···
 #include <linux/dma-fence-chain.h>
 #include <linux/dma-fence-unwrap.h>
 #include <linux/slab.h>
+#include <linux/sort.h>
 
 /* Internal helper to start new array iteration, don't use directly */
 static struct dma_fence *
···
 }
 EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
 
+
+static int fence_cmp(const void *_a, const void *_b)
+{
+	struct dma_fence *a = *(struct dma_fence **)_a;
+	struct dma_fence *b = *(struct dma_fence **)_b;
+
+	if (a->context < b->context)
+		return -1;
+	else if (a->context > b->context)
+		return 1;
+
+	if (dma_fence_is_later(b, a))
+		return 1;
+	else if (dma_fence_is_later(a, b))
+		return -1;
+
+	return 0;
+}
+
 /* Implementation for the dma_fence_merge() marco, don't use directly */
 struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 					   struct dma_fence **fences,
···
 	struct dma_fence_array *result;
 	struct dma_fence *tmp, **array;
 	ktime_t timestamp;
-	unsigned int i;
-	size_t count;
+	int i, j, count;
 
 	count = 0;
 	timestamp = ns_to_ktime(0);
···
 	if (!array)
 		return NULL;
 
-	/*
-	 * This trashes the input fence array and uses it as position for the
-	 * following merge loop. This works because the dma_fence_merge()
-	 * wrapper macro is creating this temporary array on the stack together
-	 * with the iterators.
-	 */
-	for (i = 0; i < num_fences; ++i)
-		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
-
 	count = 0;
-	do {
-		unsigned int sel;
-
-restart:
-		tmp = NULL;
-		for (i = 0; i < num_fences; ++i) {
-			struct dma_fence *next;
-
-			while (fences[i] && dma_fence_is_signaled(fences[i]))
-				fences[i] = dma_fence_unwrap_next(&iter[i]);
-
-			next = fences[i];
-			if (!next)
-				continue;
-
-			/*
-			 * We can't guarantee that inpute fences are ordered by
-			 * context, but it is still quite likely when this
-			 * function is used multiple times. So attempt to order
-			 * the fences by context as we pass over them and merge
-			 * fences with the same context.
-			 */
-			if (!tmp || tmp->context > next->context) {
-				tmp = next;
-				sel = i;
-
-			} else if (tmp->context < next->context) {
-				continue;
-
-			} else if (dma_fence_is_later(tmp, next)) {
-				fences[i] = dma_fence_unwrap_next(&iter[i]);
-				goto restart;
+	for (i = 0; i < num_fences; ++i) {
+		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
+			if (!dma_fence_is_signaled(tmp)) {
+				array[count++] = dma_fence_get(tmp);
 			} else {
-				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
-				goto restart;
+				ktime_t t = dma_fence_timestamp(tmp);
+
+				if (ktime_after(t, timestamp))
+					timestamp = t;
 			}
 		}
-
-		if (tmp) {
-			array[count++] = dma_fence_get(tmp);
-			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
-		}
-	} while (tmp);
-
-	if (count == 0) {
-		tmp = dma_fence_allocate_private_stub(ktime_get());
-		goto return_tmp;
 	}
 
-	if (count == 1) {
-		tmp = array[0];
-		goto return_tmp;
-	}
+	if (count == 0 || count == 1)
+		goto return_fastpath;
 
-	result = dma_fence_array_create(count, array,
-					dma_fence_context_alloc(1),
-					1, false);
-	if (!result) {
-		for (i = 0; i < count; i++)
+	sort(array, count, sizeof(*array), fence_cmp, NULL);
+
+	/*
+	 * Only keep the most recent fence for each context.
+	 */
+	j = 0;
+	for (i = 1; i < count; i++) {
+		if (array[i]->context == array[j]->context)
 			dma_fence_put(array[i]);
-		tmp = NULL;
-		goto return_tmp;
+		else
+			array[++j] = array[i];
 	}
-	return &result->base;
+	count = ++j;
+
+	if (count > 1) {
+		result = dma_fence_array_create(count, array,
+						dma_fence_context_alloc(1),
+						1, false);
+		if (!result) {
+			for (i = 0; i < count; i++)
+				dma_fence_put(array[i]);
+			tmp = NULL;
+			goto return_tmp;
+		}
+		return &result->base;
+	}
+
+return_fastpath:
+	if (count == 0)
+		tmp = dma_fence_allocate_private_stub(timestamp);
+	else
+		tmp = array[0];
 
 return_tmp:
 	kfree(array);
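
The new collection loop in the diff keeps unsignaled fences and, for signaled
ones, remembers the latest timestamp so that the zero-fence fast path can
create its stub with that time rather than ktime_get(). A simplified userspace
sketch of that pass follows. struct sim_fence and sim_collect are hypothetical
names, a plain int64_t nanosecond value stands in for ktime_t, and the
container unwrapping done by dma_fence_unwrap_for_each() is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the parts of struct dma_fence this pass looks at. */
struct sim_fence {
	bool signaled;
	int64_t timestamp_ns;	/* only meaningful when signaled */
};

/*
 * Copy unsignaled fences from in[] to out[] and return how many were kept.
 * *latest_ns is raised to the newest timestamp seen among signaled fences;
 * the caller initialises it (the patch starts from zero).
 */
static size_t sim_collect(struct sim_fence **in, size_t num,
			  struct sim_fence **out, int64_t *latest_ns)
{
	size_t i, count = 0;

	for (i = 0; i < num; i++) {
		if (!in[i]->signaled)
			out[count++] = in[i];
		else if (in[i]->timestamp_ns > *latest_ns)
			*latest_ns = in[i]->timestamp_ns;
	}
	return count;
}
```

When sim_collect() returns 0, every input was already signaled and *latest_ns
holds the time the last of them completed, which corresponds to the value the
patch passes to dma_fence_allocate_private_stub().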