
Commit 8085cec

Add kernel build flag for prioritizing speed or size (#2408)
Adds a build flag that any kernel can use to provide a different implementation depending on the use case, and adds a first use case for the CMSIS-NN transpose conv kernel. The background for this PR is in #2345.

BUG=none
1 parent f5e498b commit 8085cec

4 files changed: +111 −17 lines changed


tensorflow/lite/micro/docs/optimized_kernel_implementations.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -169,6 +169,12 @@ support:
 * Build a static libtensorflow-microlite.a using the TFLM makefile with:
   `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
   OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite`
+* Optionally, build for size or speed. Translated to a valid make command, this is either of:
+  `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
+  OPTIMIZED_KERNEL_DIR=<optimize_dir> OPTIMIZE_KERNELS_FOR=KERNELS_OPTIMIZED_FOR_SIZE microlite`
+  `make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target>
+  OPTIMIZED_KERNEL_DIR=<optimize_dir> OPTIMIZE_KERNELS_FOR=KERNELS_OPTIMIZED_FOR_SPEED microlite`
+  Check the relevant README for the given optimization library to see whether this applies.
 * Use the static library and any TFLM headers as part of the overall
   application (with its own build system).
```
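How the flag takes effect: as the Makefile diff further down shows, OPTIMIZE_KERNELS_FOR is appended to ADDITIONAL_DEFINES, so the compiler sees -DKERNELS_OPTIMIZED_FOR_SIZE or -DKERNELS_OPTIMIZED_FOR_SPEED. A minimal, self-contained sketch of the resulting compile-time dispatch (illustrative only, not TFLM source; the real kernel handles the no-define case at runtime with MicroPrintf rather than at compile time):

```cpp
// Sketch: build with -DKERNELS_OPTIMIZED_FOR_SIZE or
// -DKERNELS_OPTIMIZED_FOR_SPEED, e.g.
//   g++ -DKERNELS_OPTIMIZED_FOR_SIZE variant.cc && ./a.out
#include <cstdio>

const char* SelectedVariant() {
#if defined(KERNELS_OPTIMIZED_FOR_SIZE)
  return "size: smaller scratch buffer, higher latency";
#elif defined(KERNELS_OPTIMIZED_FOR_SPEED)
  return "speed: larger scratch buffer, lower latency";
#else
  return "neither define set";  // reported at runtime in the real kernel
#endif
}

int main() { std::printf("%s\n", SelectedVariant()); }
```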

tensorflow/lite/micro/kernels/cmsis_nn/README.md

Lines changed: 24 additions & 3 deletions

```diff
@@ -1,20 +1,22 @@
 <!-- mdformat off(b/169948621#comment2) -->
 
-# Info
+# General Info
 CMSIS-NN is a library containing kernel optimizations for Arm(R) Cortex(R)-M
 processors. To use CMSIS-NN optimized kernels instead of reference kernels, add
 `OPTIMIZED_KERNEL_DIR=cmsis_nn` to the make command line. See examples below.
 
 For more information about the optimizations, check out
-[CMSIS-NN documentation](https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/README.md).
+[CMSIS-NN documentation](https://github.com/ARM-software/CMSIS-NN/blob/main/README.md).
+
+# Specifying path to CMSIS-NN
 
 By default CMSIS-NN is built from source code that is downloaded to the TFLM tree.
 It is also possible to build CMSIS-NN from an external path by specifying
 CMSIS_PATH=<../path> and CMSIS_NN_PATH=<../path>. Note that both CMSIS_PATH and CMSIS_NN_PATH are needed,
 since CMSIS-NN has a dependency on CMSIS-Core. As a third option, CMSIS-NN can be provided manually as an external library.
 The examples below illustrate this.
 
-# Example - FVP based on Arm Corstone-300 software.
+## Example - FVP based on Arm Corstone-300 software.
 In this example, the kernel conv unit test is built. For more information about
 this specific target, check out the [Corstone-300 readme](https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/cortex_m_corstone_300/README.md).
 
@@ -39,3 +41,22 @@ external CMSIS-NN library as different compiler options may have been used.
 Also note that if specifying CMSIS_NN_LIBS but not CMSIS_PATH and/or CMSIS_NN_PATH, headers and
 system/startup code from the default downloaded path of CMSIS will be used.
 So CMSIS_NN_LIBS, CMSIS_NN_PATH and CMSIS_PATH should have the same base path; if not, there will be a build error.
+
+# Build for speed or size
+It is possible to build for speed or size. The size option may be required for a large model on an embedded system with limited memory. Where applicable, building for size results in higher latency paired with a smaller scratch buffer, whereas building for speed results in lower latency with a larger scratch buffer. Currently only transpose conv supports this. See the examples below.
+
+## Example - building a static library with CMSIS-NN optimized kernels
+More info on the target used in this example: https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/cortex_m_generic/README.md
+
+Building for speed (the default; leaving out OPTIMIZE_KERNELS_FOR entirely gives the same result):
+```
+make -f tensorflow/lite/micro/tools/make/Makefile TARGET=cortex_m_generic TARGET_ARCH=cortex-m55 OPTIMIZED_KERNEL_DIR=cmsis_nn OPTIMIZE_KERNELS_FOR=KERNELS_OPTIMIZED_FOR_SPEED microlite
+```
+
+Building for size:
+```
+make -f tensorflow/lite/micro/tools/make/Makefile TARGET=cortex_m_generic TARGET_ARCH=cortex-m55 OPTIMIZED_KERNEL_DIR=cmsis_nn OPTIMIZE_KERNELS_FOR=KERNELS_OPTIMIZED_FOR_SIZE microlite
+```
```
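The mechanism behind the tradeoff described above is that only the speed build asks the arena for the optimized library's extra workspace at prepare time (via arm_transpose_conv_s8_get_buffer_size; see the transpose_conv.cc diff below). A toy sketch of that prepare-time pattern, where the names and byte counts are invented purely for illustration:

```cpp
// Toy illustration only; not TFLM code, and the sizes are made up.
#include <cstddef>
#include <cstdio>

// Scratch that the reference implementation always needs.
constexpr size_t kReferenceScratchBytes = 256;  // invented figure

// Mirrors the pattern in transpose_conv.cc Prepare(): only the speed build
// reserves the optimized library's extra workspace.
size_t ScratchBytesToRequest() {
  size_t total = kReferenceScratchBytes;
#if defined(KERNELS_OPTIMIZED_FOR_SPEED)
  constexpr size_t kOptimizedWorkspaceBytes = 4096;  // invented figure
  total += kOptimizedWorkspaceBytes;
#endif
  return total;
}

int main() {
  std::printf("scratch to request from arena: %zu bytes\n",
              ScratchBytesToRequest());
  return 0;
}
```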

tensorflow/lite/micro/kernels/cmsis_nn/transpose_conv.cc

Lines changed: 57 additions & 10 deletions

```diff
@@ -1,4 +1,4 @@
-/* Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+/* Copyright 2024 The TensorFlow Authors. All Rights Reserved.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -198,14 +198,22 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   if (input->type == kTfLiteInt8) {
     TFLITE_DCHECK(context->RequestScratchBufferInArena != nullptr);
 
-    RuntimeShape filter_shape = GetTensorShape(filter);
     RuntimeShape input_shape = GetTensorShape(input);
     RuntimeShape output_shape = GetTensorShape(output);
+    RuntimeShape filter_shape = GetTensorShape(filter);
 
     const int batch_size = MatchingDim(input_shape, 0, output_shape, 0);
-    const int input_depth = MatchingDim(input_shape, 3, filter_shape, 3);
     const int output_depth = MatchingDim(filter_shape, 0, output_shape, 3);
 
+    cmsis_nn_dims output_dims;
+    output_dims.n = batch_size;
+    output_dims.h = output_shape.Dims(1);
+    output_dims.w = output_shape.Dims(2);
+    output_dims.c = output_depth;
+
+#if defined(KERNELS_OPTIMIZED_FOR_SPEED)
+    const int input_depth = MatchingDim(input_shape, 3, filter_shape, 3);
+
     cmsis_nn_dims input_dims;
     input_dims.n = batch_size;
     input_dims.h = input_shape.Dims(1);
@@ -218,17 +226,12 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
     filter_dims.w = filter_shape.Dims(2);
     filter_dims.c = input_depth;
 
-    cmsis_nn_dims output_dims;
-    output_dims.n = batch_size;
-    output_dims.h = output_shape.Dims(1);
-    output_dims.w = output_shape.Dims(2);
-    output_dims.c = output_depth;
-
     const size_t buf_size = arm_transpose_conv_s8_get_buffer_size(
         &input_dims, &filter_dims, &output_dims);
     TFLITE_DCHECK(context->RequestScratchBufferInArena(
                       context, buf_size, &(data->scratch_buffer_index)) ==
                   kTfLiteOk);
+#endif
 
     // Quantized 8-bit kernels use an int32 scratch buffer.
     TFLITE_DCHECK(
@@ -285,6 +288,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   return kTfLiteOk;
 }
 
+#if defined(KERNELS_OPTIMIZED_FOR_SPEED)
 TfLiteStatus EvalQuantizedPerChannel(TfLiteContext* context, TfLiteNode* node,
                                      const TfLiteConvParams& params,
                                      const OpData& data,
@@ -376,6 +380,7 @@ TfLiteStatus EvalQuantizedPerChannel(TfLiteContext* context, TfLiteNode* node,
 
   return kTfLiteOk;
 }
+#endif
 
 TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
   const TfLiteEvalTensor* input =
@@ -416,8 +421,29 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
       break;
     }
     case kTfLiteInt8: {
+#if defined(KERNELS_OPTIMIZED_FOR_SIZE)
+      int32_t* scratch_buffer = static_cast<int32_t*>(
+          context->GetScratchBuffer(context, data.scratch_buffer_index));
+      reference_integer_ops::TransposeConv(
+          data.params, data.per_channel_output_multiplier,
+          data.per_channel_output_shift, tflite::micro::GetTensorShape(input),
+          tflite::micro::GetTensorData<int8_t>(input),
+          tflite::micro::GetTensorShape(filter),
+          tflite::micro::GetTensorData<int8_t>(filter),
+          tflite::micro::GetTensorShape(bias),
+          tflite::micro::GetOptionalTensorData<int32_t>(bias),
+          tflite::micro::GetTensorShape(output),
+          tflite::micro::GetTensorData<int8_t>(output),
+          tflite::micro::GetTensorShape(nullptr), nullptr, scratch_buffer);
+#elif defined(KERNELS_OPTIMIZED_FOR_SPEED)
       return EvalQuantizedPerChannel(context, node, params, data, input, filter,
                                      bias, output);
+#else
+      MicroPrintf(
+          "Either KERNELS_OPTIMIZED_FOR_SIZE or KERNELS_OPTIMIZED_FOR_SPEED "
+          "must be defined");
+      return kTfLiteError;
+#endif
       break;
     }
     case kTfLiteInt16: {
@@ -481,12 +507,33 @@ TfLiteStatus EvalInt8(TfLiteContext* context, TfLiteNode* node) {
   TFLITE_DCHECK(node->user_data != nullptr);
   const OpData& data = *(static_cast<const OpData*>(node->user_data));
 
-  TF_LITE_ENSURE_EQ(context, input->type, output->type);
+#if defined(KERNELS_OPTIMIZED_FOR_SIZE)
+  int32_t* scratch_buffer = static_cast<int32_t*>(
+      context->GetScratchBuffer(context, data.scratch_buffer_index));
+  reference_integer_ops::TransposeConv(
+      data.params, data.per_channel_output_multiplier,
+      data.per_channel_output_shift, tflite::micro::GetTensorShape(input),
+      tflite::micro::GetTensorData<int8_t>(input),
+      tflite::micro::GetTensorShape(filter),
+      tflite::micro::GetTensorData<int8_t>(filter),
+      tflite::micro::GetTensorShape(bias),
+      tflite::micro::GetOptionalTensorData<int32_t>(bias),
+      tflite::micro::GetTensorShape(output),
+      tflite::micro::GetTensorData<int8_t>(output),
+      tflite::micro::GetTensorShape(nullptr), nullptr, scratch_buffer);
+#elif defined(KERNELS_OPTIMIZED_FOR_SPEED)
   const auto& params =
       *(reinterpret_cast<TfLiteConvParams*>(node->builtin_data));
 
   return EvalQuantizedPerChannel(context, node, params, data, input, filter,
                                  bias, output);
+#else
+  MicroPrintf(
+      "Either KERNELS_OPTIMIZED_FOR_SIZE or KERNELS_OPTIMIZED_FOR_SPEED must "
+      "be defined");
+  return kTfLiteError;
+#endif
+  return kTfLiteOk;
 }
 
 }  // namespace
```
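Two things worth noting in the diff above. First, the KERNELS_OPTIMIZED_FOR_SIZE path simply calls reference_integer_ops::TransposeConv, so building for size gives the reference kernel's latency and scratch usage for this op while the rest of the CMSIS-NN kernels are unaffected. Second, the #else branch that returns kTfLiteError should be unreachable in Makefile-driven builds, since OPTIMIZE_KERNELS_FOR always injects one of the two defines; presumably it guards builds that bypass the Makefile and its ADDITIONAL_DEFINES logic, and it fails at runtime rather than at compile time.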

tensorflow/lite/micro/tools/make/Makefile

Lines changed: 24 additions & 4 deletions
(The array.cc/array.h hunk below is a whitespace-only indentation fix.)

```diff
@@ -1,4 +1,4 @@
-# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2024 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -60,6 +60,17 @@ endif
 # Specify which specialized kernel implementation should be pulled in.
 OPTIMIZED_KERNEL_DIR :=
 
+# Optimize kernels for speed or memory. This is similar to, but not the same
+# as, KERNEL_OPTIMIZATION_LEVEL and CORE_OPTIMIZATION_LEVEL, which set the
+# compiler optimization level. Instead, it lets a kernel provide multiple
+# implementations, selected at build time. For example, a kernel that needs a
+# bigger scratch buffer for some use cases would use a smaller scratch buffer
+# when building for size, and a larger one, with better performance, when
+# building for speed. This is optional; a kernel with a single implementation
+# need not do anything. OPTIMIZE_KERNELS_FOR has only two valid values,
+# KERNELS_OPTIMIZED_FOR_SIZE and KERNELS_OPTIMIZED_FOR_SPEED (the default).
+OPTIMIZE_KERNELS_FOR := KERNELS_OPTIMIZED_FOR_SPEED
+
 # Override this variable from the command line in case the optimized kernels are
 # in a different directory.
 OPTIMIZED_KERNEL_DIR_PREFIX := $(TENSORFLOW_ROOT)tensorflow/lite/micro/kernels
@@ -99,7 +110,7 @@ TEST_SCRIPT :=
 
 MICROLITE_LIBS := -lm
 
-# For the optimized_kernel_dir, and co-processor as specified on the
+# For the optimized_kernel_dir, co-processor and optimize_kernels_for as specified on the
 # command line we add -D<tag> to the cflags to allow for #ifdefs in the code.
 #
 # We apply the following transformations (via the tr command):
@@ -113,6 +124,10 @@ ifneq ($(CO_PROCESSOR),)
   ADDITIONAL_DEFINES += -D$(shell echo $(CO_PROCESSOR) | tr [a-z] [A-Z])
 endif
 
+ifneq ($(OPTIMIZE_KERNELS_FOR),)
+  ADDITIONAL_DEFINES += -D$(shell echo $(OPTIMIZE_KERNELS_FOR) | tr [a-z] [A-Z])
+endif
+
 ifeq ($(TOOLCHAIN), armclang)
   CORE_OPTIMIZATION_LEVEL := -Oz
 else
@@ -483,11 +498,11 @@ $(shell find $(TENSORFLOW_ROOT)tensorflow/lite -type d \( -path $(TENSORFLOW_ROO
 
 ifneq ($(BUILD_TYPE), no_tf_lite_static_memory)
   EXCLUDED_TFL_CC_SRCS := \
-	$(TENSORFLOW_ROOT)tensorflow/lite/array.cc
+    $(TENSORFLOW_ROOT)tensorflow/lite/array.cc
   TFL_CC_SRCS := $(filter-out $(EXCLUDED_TFL_CC_SRCS), $(TFL_CC_SRCS))
 
   EXCLUDED_TFL_CC_HDRS := \
-	$(TENSORFLOW_ROOT)tensorflow/lite/array.h
+    $(TENSORFLOW_ROOT)tensorflow/lite/array.h
   TFL_CC_HDRS := $(filter-out $(EXCLUDED_TFL_CC_HDRS), $(TFL_CC_HDRS))
 endif
 
@@ -614,6 +629,11 @@ ifeq ($(findstring $(TARGET),$(TARGETS_WITHOUT_MAKEFILES)),)
   include $(MAKEFILE_DIR)/targets/$(TARGET)_makefile.inc
 endif
 
+# Validate that OPTIMIZE_KERNELS_FOR has a valid value.
+ifeq (,$(filter $(OPTIMIZE_KERNELS_FOR),KERNELS_OPTIMIZED_FOR_SPEED KERNELS_OPTIMIZED_FOR_SIZE))
+  $(error Incorrect OPTIMIZE_KERNELS_FOR: $(OPTIMIZE_KERNELS_FOR))
+endif
+
 ifneq ($(OPTIMIZED_KERNEL_DIR),)
   PATH_TO_OPTIMIZED_KERNELS := $(OPTIMIZED_KERNEL_DIR_PREFIX)/$(OPTIMIZED_KERNEL_DIR)
   PATH_TO_SIGNAL_OPTIMIZED_KERNELS := $(OPTIMIZED_SIGNAL_KERNEL_DIR_PREFIX)/$(OPTIMIZED_KERNEL_DIR)
```
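With this validation, a misspelled value fails at make parse time rather than silently compiling the wrong variant. For example, a hypothetical invocation with `OPTIMIZE_KERNELS_FOR=KERNELS_OPTIMIZED_FOR_SPED` would stop immediately with an error along the lines of `*** Incorrect OPTIMIZE_KERNELS_FOR: KERNELS_OPTIMIZED_FOR_SPED. Stop.` (exact formatting depends on the make version).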
