@@ -122,7 +122,7 @@ specification we add four operations:
* **Vector-Vector Outer Product and Accumulate:** Compute the outerproduct of
two vectors and accumulate the result matrix atomically-elementwise in
memory.
- * **Reduce and Accumulate:** Accumulate elements of a vector
+ * **Vector Accumulate:** Accumulate elements of a vector
atomically-elementwise to corresponding elements in memory.

@@ -218,8 +218,10 @@ For optimal layouts, **matrix stride** is ignored.

Only non-packed interpretations are valid for matrices.

- The base address of **matrix resource** and **matrix offset** must be 64 byte
- aligned.
+ The base address of **matrix resource** and **matrix offset** must be 128 byte
+ aligned. Also note that the size of the underlying allocation is guaranteed to
+ be a multiple of 16 bytes ensuring that the 16 bytes access of the last
+ row/column of the matrix is valid memory.

The **matrix stride** is 16 byte aligned.

@@ -300,8 +302,10 @@ resource**, with **matrix offset**, **matrix stride**, **matrix
interpretation** and **matrix layout** behaving as described [above]
(#matrix-vector-multiply-and-multiply-add-operations).

- The base address of **matrix resource** and **matrix offset** must be 64 byte
- aligned.
+ The base address of **matrix resource** and **matrix offset** must be 128 byte
+ aligned. Also note that the size of the underlying allocation is guaranteed to
+ be a multiple of 16 bytes ensuring that the 16 bytes access of the last
+ row/column of the matrix is valid memory

The **matrix stride** is 16 byte aligned.

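As a reading aid for the two hunks above, here is a minimal sketch of how an application might sanity-check a matrix placement against the revised rules before issuing an operation. It assumes the rule means that the resource base address and the **matrix offset** are each 128-byte aligned (the diff text does not spell out whether the constraint applies to each value or to their sum), and the helper names `IsAligned` and `IsValidMatrixPlacement` are hypothetical, not part of the specification or of D3D12.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the spec.
// Returns true when value is a multiple of alignment (a power of two).
constexpr bool IsAligned(std::uint64_t value, std::uint64_t alignment) {
    return (value & (alignment - 1)) == 0;
}

// Hypothetical helper, not part of the spec.
// Assumption: base address and matrix offset must each be 128-byte aligned,
// and the matrix stride must be 16-byte aligned, as stated in the diff.
constexpr bool IsValidMatrixPlacement(std::uint64_t baseAddress,
                                      std::uint64_t matrixOffset,
                                      std::uint64_t matrixStride) {
    return IsAligned(baseAddress, 128) &&
           IsAligned(matrixOffset, 128) &&
           IsAligned(matrixStride, 16);
}

// A 256-byte offset with a 64-byte stride satisfies the new rule.
static_assert(IsValidMatrixPlacement(0x1000, 256, 64),
              "128-byte aligned placement should be accepted");

// A 64-byte offset satisfied the old rule but not the new one.
static_assert(!IsValidMatrixPlacement(0x1000, 64, 64),
              "64-byte aligned offset should now be rejected");
```

The second `static_assert` shows the practical effect of the change: an offset that was legal under the 64-byte rule is rejected under the 128-byte rule.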
@@ -318,12 +322,12 @@ guaranteed to be supported on all implementations can be found in
`I8`, `F8_E4M3`, `F8_E5M2`,


- ### Reduce Sum Accumulate
+ ### Vector Accumulate

#### Syntax

```llvm
- declare void @dx.op.vecreducesumacc.v[NUM][TY](
+ declare void @dx.op.vectoraccumulate.v[NUM][TY](
    immarg i32,        ; opcode
    <[NUM] x [TY]>,    ; input vector
    %dx.types.Handle,  ; output array resource
@@ -666,7 +670,7 @@ typedef struct D3D12_COOPERATIVE_VECTOR_PROPERTIES_INFERENCE
    BOOL TransposeSupported;
};

- // Used for OuterProductAccumulate and ReduceSumAccumulate intrinsics
+ // Used for OuterProductAccumulate and VectorAccumulate intrinsics
typedef struct D3D12_COOPERATIVE_VECTOR_PROPERTIES_TRAINING
{
    D3D12_COOPERATIVE_VECTOR_DATATYPE InputType;
@@ -679,8 +683,8 @@ typedef struct D3D12_FEATURE_DATA_COOPERATIVE_VECTOR
    Out D3D12_COOPERATIVE_VECTOR_PROPERTIES_INFERENCE* pMatrixVectorMulAddProperties;
    InOut UINT OuterProductAccPropCount;
    Out D3D12_COOPERATIVE_VECTOR_PROPERTIES_TRAINING* pOuterProductAccProperties;
-     InOut UINT ReduceSumAccPropCount;
-     Out D3D12_COOPERATIVE_VECTOR_PROPERTIES_TRAINING* pReduceSumAccProperties;
+     InOut UINT VectorAccumulatePropCount;
+     Out D3D12_COOPERATIVE_VECTOR_PROPERTIES_TRAINING* pVectorAccumulateProperties;
};

```
@@ -705,10 +709,10 @@ the operation fails and `E_INVALIDARG` is returned.

**D3D12_COOPERATIVE_VECTOR_TIER_1_0**: Device supports *MatrixVectorMul*
and *MatrixVectorMulAdd* intrinsics. `OuterProductAccPropCount` and
- `ReduceSumAccPropCount` are 0 in this case.
+ `VectorAccumulatePropCount` are 0 in this case.

**D3D12_COOPERATIVE_VECTOR_TIER_1_1**: Device supports previous
- tiers, *OuterProductAccumulate* and *ReduceSumAccumulate* functions.
+ tiers, *OuterProductAccumulate* and *VectorAccumulate* functions.

#### Minimum Support Set

@@ -739,7 +743,7 @@ explicitly checked for the combinations below.
| FP16 | FP16 |
| FP16 | FP32 |

- ##### For ReduceSumAccumulate
+ ##### For VectorAccumulate

| InputType | AccumulationType |
|-----------|------------------|
@@ -811,7 +815,8 @@ the inputs required to calculate the necessary size. The same descriptor,
updated with the calculated output size, is then passed to the conversion
API.

- The `DestStride` must be a multiple of 16 bytes.
+ The `DestSize` and `DestStride` must be a multiple of 16 bytes. The `DestVA`
+ must be 128B aligned.

```c++

@@ -987,22 +992,9 @@ Various combinations of enums for specifying interpretations were considered
with varying trade-offs of complexity versus typesafety and simplicity, before
deciding to extend the existing `ComponentType` enum.

- ## Open Issues
-
- * Q: Type interpretations to use HLSL conversion rules of ML best practices?
- * A: This spec uses the ML best practices like the SpirV spec. // TODO: get
- approval
- * Q: More details on formats and their precision requirements
- * A: Implementation Dependent
- * Q: How do you handle cases where different implementations may not produce bit
- identical results?
- * A: Some combination of exactly representable results/ epsilon ranges.
- * Q: Using MatrixView and VectorView as a wrapper for the BAB containing the
- matrix/bias vectors and their corresponding interpretations.
-
## Acknowledgments

- We would like to thank Jeff Bolz, Yury Uralsky and Patrick Neill for their
- contributions to this specification.
+ We would like to thank Jeff Bolz, Yury Uralsky, Patrick Neill, Tex Riddell and
+ Amar Patel for their contributions to this specification.

<!-- {% endraw %} -->