From 81ff41f1e48c4489fc2188e00f279162b05eade9 Mon Sep 17 00:00:00 2001
From: Martin Hickey
Date: Mon, 16 Dec 2024 14:52:41 +0000
Subject: [PATCH 1/2] Add spell checker

This adds the pyspelling spell check automation tool. It is a wrapper
around the CLI of Aspell or Hunspell, which are spell checker tools.
The PR configures pyspelling to use Aspell, because spell checker
tools can differ in their output; specifying Aspell ensures consistent
results.

Closes #31

Signed-off-by: Martin Hickey
---
 .github/workflows/spellcheck.yml |  55 +++++++++++++++
 .gitignore                       |   7 +-
 .spellcheck-en-custom.txt        | 111 +++++++++++++++++++++++++++++++
 .spellcheck.yml                  |  26 ++++++++
 docs/fms_mo_design.md            |   6 +-
 examples/DQ_SQ/README.md         |   4 +-
 examples/FP8_QUANT/README.md     |   4 +-
 examples/GPTQ/README.md          |   4 +-
 examples/PTQ_INT8/README.md      |   4 +-
 examples/QAT_INT8/README.md      |   6 +-
 fms_mo/quant/README.md           |   2 +-
 tox.ini                          |  13 ++++
 12 files changed, 225 insertions(+), 17 deletions(-)
 create mode 100644 .github/workflows/spellcheck.yml
 create mode 100644 .spellcheck-en-custom.txt
 create mode 100644 .spellcheck.yml

diff --git a/.github/workflows/spellcheck.yml b/.github/workflows/spellcheck.yml
new file mode 100644
index 00000000..2d51234d
--- /dev/null
+++ b/.github/workflows/spellcheck.yml
@@ -0,0 +1,55 @@
+name: Spellcheck
+
+on:
+  pull_request:
+    branches:
+      - main
+      - "release-**"
+    paths:
+      - '**.md'
+      - 'tox.ini'
+      - '.spellcheck*'
+      - '.github/workflows/spellcheck.yml' # This workflow file
+
+env:
+  LC_ALL: en_US.UTF-8
+
+defaults:
+  run:
+    shell: bash
+
+permissions:
+  contents: read
+
+jobs:
+  spellcheck:
+    runs-on: ubuntu-latest
+    steps:
+      - name: "Harden Runner"
+        uses: step-security/harden-runner@0080882f6c36860b6ba35c610c98ce87d4e2f26f # v2.10.2
+        with:
+          egress-policy: audit # TODO: change to 'egress-policy: block' after a couple of runs
+
+      - name: Checkout Code
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+
+      - name: Install aspell
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y aspell aspell-en
+
+      - name: Setup Python 3.11
+        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
+        with:
+          python-version: 3.11
+          cache: pip
+          cache-dependency-path: |
+            **/pyproject.toml
+
+      - name: Install tox dependencies
+        run: python -m pip install --upgrade tox
+
+      - name: Run spellchecker
+        run: python -m tox -e spellcheck
diff --git a/.gitignore b/.gitignore
index 2f6112bb..0fd5d763 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,9 +34,12 @@ venv/
 
 # Build output
 /build/lib/
 
-# generated by setuptools_scm
+# Generated by setuptools_scm
 /fms_mo/_version.py
 
-#Generated by tests
+# Generated by tests
 qcfg.json
+
+# Generated by spelling check
+dictionary.dic
+
diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
new file mode 100644
index 00000000..0571e09a
--- /dev/null
+++ b/.spellcheck-en-custom.txt
@@ -0,0 +1,111 @@
+activations
+ADR
+Args
+AutoGPTQ
+autoregressive
+backpropagation
+bmm
+BMM
+BRECQ
+CLI
+Conda
+config
+Conv
+CUDA
+CUDAGRAPH
+dataset
+datautils
+Deployable
+dequant
+dequantize
+dequantization
+dq
+DQ
+dev
+eval
+fms
+fp
+FP
+frac
+gptq
+GPTQ
+GPTQArgs
+graphviz
+GPTQ
+hyperparameters
+Inductor
+inferenced
+inferencing
+isort
+Jupyter
+Kubernetes
+KV
+kvcache
+len
+lfloor
+llm
+LLM
+lm
+lossy
+LSTM
+matmul
+matmuls
+maxperCh
+maxpertoken
+Miniforge
+mins
+Mixtral
+MSE
+msec
+natively
+nbatch
+nbits
+NLP
+Nouterloop
+Nvidia
+Nvidia's
+orchestrator
+param
+pre
+ptq
+PTQ
+py
+pyenv
+pylint
+pygraphviz
+pyproject
+pytest
+QAT
+QAT'ed
+quant
+quantized
+quantizer
+quantizers
+quantizes
+Quantizing
+QW
+rceil
+repo
+representable
+runtime
+Runtime
+SAWB
+sexualized
+SmoothQuant
+socio
+sparsification
+SQuAD
+straightforward
+tokenization
+tokenized
+Tokenized
+tokenizer
+Tokenizer
+toml
+Unquantized
+vals
+venv
+vllm
+xs
+zp
+
diff --git a/.spellcheck.yml b/.spellcheck.yml
new file mode 100644
index 00000000..e7e712cd
--- /dev/null
+++ b/.spellcheck.yml
@@ -0,0 +1,26 @@
+matrix:
+- name: markdown
+  aspell:
+    lang: en
+    d: en_US
+  camel-case: true
+  mode: markdown
+  sources:
+  - "**/*.md|!CODEOWNERS.md|!build/**|!.tox/**|!venv/**"
+  dictionary:
+    wordlists:
+    - .spellcheck-en-custom.txt
+  pipeline:
+  - pyspelling.filters.context:
+      context_visible_first: true
+      escapes: '\\[\\`~]'
+      delimiters:
+      # Ignore multiline content between fences (fences can have 3 or more back ticks)
+      # ```language
+      # content
+      # ```
+      - open: '(?s)^(?P<open> *`{3,}).*?$'
+        close: '^(?P=open)$'
+      # Ignore text between inline back ticks
+      - open: '(?P<open>`+)'
+        close: '(?P=open)'
diff --git a/docs/fms_mo_design.md b/docs/fms_mo_design.md
index bd32fdf2..a803a359 100644
--- a/docs/fms_mo_design.md
+++ b/docs/fms_mo_design.md
@@ -37,9 +37,9 @@ The quantization process can be illustrated in the following plots:
 
 ### Quantization-aware training (QAT)
 
-In order to accommodate the quantization errors, one straightfoward technique is to take quantization/dequantization into account during the training process, hence the name quantization-aware training [(QAT)](https://arxiv.org/pdf/1712.05877), as illustrated by Step 1 of the following figure. The training optimizer will then adjust the parameters of the model, e.g. weights, accordingly so that the resulting accuracy will be comparable to the original FP32 model.
+In order to accommodate the quantization errors, one straightforward technique is to take quantization/dequantization into account during the training process, hence the name quantization-aware training [(QAT)](https://arxiv.org/pdf/1712.05877), as illustrated by Step 1 of the following figure. The training optimizer will then adjust the parameters of the model, e.g. weights, accordingly so that the resulting accuracy will be comparable to the original FP32 model.
 
-There are many other techniques, such as post-training quantization ([PTQ](https://arxiv.org/abs/2102.05426)), that can achieve similar outcome. Users will need to pick the proper method for their specific task based on model size, dataset size, resource available, and other consideraions.
+There are many other techniques, such as post-training quantization ([PTQ](https://arxiv.org/abs/2102.05426)), that can achieve a similar outcome. Users will need to pick the proper method for their specific task based on model size, dataset size, resources available, and other considerations.
 
 ![Quantize and deploy](./images/layer_swapping.png)
 
@@ -91,7 +91,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com
 
 The key architectural components are:
 1. **`model_analyzer`**, which traces the model and identifies the layers/operations to be quantized or to be skipped. It will try to recognize several well-known structures and configure based on best practice. However, users could also choose to bypass the tracing and manually specify the desired configuration with full flexibility.
-2. **A set of `wrappers`**. As shown in the figure above, the preparation for QAT and deployment can be viewed as a "layer swapping" process. One could identify a desired `torch.nn.Linear` layer to be quantized, e.g. Linear1 in the plot, and replace it with a `QLinear` wrapper, which contains a set of `quantizers` that can quantize/dequantize the inputs and weights before the Linear operation. Similarly, the `QLinear` wrapper for deployment stage will quantize the inputs, perform INT matmul, then dequantize the outcome. It is mathmatically equivalanet to the wrapper used in QAT, but it can utilize the INT compute engine.
+2. **A set of `wrappers`**. As shown in the figure above, the preparation for QAT and deployment can be viewed as a "layer swapping" process. One could identify a desired `torch.nn.Linear` layer to be quantized, e.g. Linear1 in the plot, and replace it with a `QLinear` wrapper, which contains a set of `quantizers` that can quantize/dequantize the inputs and weights before the Linear operation. Similarly, the `QLinear` wrapper for the deployment stage will quantize the inputs, perform INT matmul, then dequantize the outcome. It is mathematically equivalent to the wrapper used in QAT, but it can utilize the INT compute engine.
 
 ### Interfaces
 
diff --git a/examples/DQ_SQ/README.md b/examples/DQ_SQ/README.md
index a015fa92..9c1e50ba 100644
--- a/examples/DQ_SQ/README.md
+++ b/examples/DQ_SQ/README.md
@@ -6,7 +6,7 @@ Here, we provide an example of direct quantization. In this case, we demonstrate
 ## Requirements
 - [FMS Model Optimizer requirements](../../README.md#requirements)
 
-## Quickstart
+## QuickStart
 
 **1. Prepare Data** for calibration process by converting into its tokenized form. An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.
 
@@ -55,7 +55,7 @@ The perplexity of the INT8 and FP8 quantized models on the `wikitext` dataset is
 |`Llama3-8b`|INT8 |maxpertoken |maxperCh |yes |yes |6.21 |
 | |FP8 |fp8_e4m3_scale|fp8_e4m3_scale|yes |yes |6.19 |
 
-## Code Walkthrough
+## Code Walk-through
 
 **1. KV caching**
 
diff --git a/examples/FP8_QUANT/README.md b/examples/FP8_QUANT/README.md
index 9cd487d7..183969d3 100644
--- a/examples/FP8_QUANT/README.md
+++ b/examples/FP8_QUANT/README.md
@@ -24,7 +24,7 @@ This is an example of mature FP8, which under the hood leverages some functional
 > [!CAUTION]
 > `vllm` may require a specific PyTorch version that is different from what is installed in your current environment and it may force install without asking. Make sure it's compatible with your settings or create a new environment if needed.
 
-## Quickstart
+## QuickStart
 
 This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms with FP8 being the focus of this example. The steps involved are:
 1. **FP8 quantization through CLI**. Other arguments could be found here [FP8Args](../../fms_mo/training_args.py#L84).
@@ -88,7 +88,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 | | |none | 5|perplexity|↓ |3.8915|± |0.3727|
 ```
 
-## Code Walkthrough
+## Code Walk-through
 
 1. The non-quantized pre-trained model is loaded using model wrapper from `llm-compressor`. The corresponding tokenizer is constructed as well.
 
diff --git a/examples/GPTQ/README.md b/examples/GPTQ/README.md
index 1877e5db..b7f420ed 100644
--- a/examples/GPTQ/README.md
+++ b/examples/GPTQ/README.md
@@ -13,7 +13,7 @@ For generative LLMs, very often the bottleneck of inference is no longer the com
 ```
 
-## Quickstart
+## QuickStart
 
 This end-to-end example utilizes the common set of interfaces provided by `fms_mo` for easily applying multiple quantization algorithms with GPTQ being the focus of this example. The steps involved are:
 1. **Convert the dataset into its tokenized form.** An example of tokenization using `LLAMA-3-8B`'s tokenizer is below.
 
@@ -109,7 +109,7 @@ This end-to-end example utilizes the common set of interfaces provided by `fms_m
 
 > There is some randomness in generating the model and data, the resulting accuracy may vary ~$\pm$ 0.05.
 
-## Code Walkthrough
+## Code Walk-through
 
 1. Command line arguments will be used to create a GPTQ quantization config. Information about the required arguments and their default values can be found [here](../../fms_mo/training_args.py)
diff --git a/examples/PTQ_INT8/README.md b/examples/PTQ_INT8/README.md
index 0fcbc86a..4ee83eb5 100644
--- a/examples/PTQ_INT8/README.md
+++ b/examples/PTQ_INT8/README.md
@@ -15,7 +15,7 @@ This is an example of [block sequential PTQ](https://arxiv.org/abs/2102.05426).
 - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel)
 
-## Quickstart
+## QuickStart
 
 > [!NOTE]
 > This example is based on the HuggingFace [Transformers Question answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering). Unlike our [QAT example](../QAT_INT8/README.md), which utilizes the training loop of the original code, our PTQ function will control the loop and the program will end before entering the original loop. Make sure the model doesn't get "tuned" twice!
 
@@ -106,7 +106,7 @@ The table below shows results obtained for the conditions listed:
 
 `Nouterloop` and `ptq_nbatch` are PTQ specific hyper-parameter. Above experiments were run on v100 machine.
 
-## Code Walkthrough
+## Code Walk-through
 
 In this section, we will deep dive into what happens during the example steps.
 
diff --git a/examples/QAT_INT8/README.md b/examples/QAT_INT8/README.md
index f939f2ae..758d263b 100644
--- a/examples/QAT_INT8/README.md
+++ b/examples/QAT_INT8/README.md
@@ -23,7 +23,7 @@ In the following example, we will first create a fine-tuned FP16 model, and then
 - `PyTorch 2.3.1` (as newer version will cause issue for the custom CUDA kernel)
 
-## Quickstart
+## QuickStart
 
 > [!NOTE]
 > This example is based on the HuggingFace [Transformers Question answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering).
 
@@ -101,7 +101,7 @@ For comparison purposes, here are some of the results we found during testing wh
 > [!NOTE]
 > Accuracy could vary ~ +-0.2 from run to run.
 
-|model|batchsize|torch.compile|accuracy(F1)|inference speed (msec)|
+|model|batch size|torch.compile|accuracy(F1)|inference speed (msec)|
 |----|--:|---------:|----:|------------:|
 |fp16|128|eager |88.21 (as fine-tuned) |126.38|
 | |128|Inductor | |71.59|
@@ -116,7 +116,7 @@ For comparison purposes, here are some of the results we found during testing wh
 
 3 `CUDAGRAPH` is the most effective way to minimize job launching overheads and can achieve ~2X end-to-end speed-up in this case. However, there seem to be bugs associated with this option at the moment. Further investigation is still on-going.
 
-## Code Walkthrough
+## Code Walk-through
 
 In this section, we will deep dive into what happens during the example steps.
 
diff --git a/fms_mo/quant/README.md b/fms_mo/quant/README.md
index 6e2c1d58..12ad9b45 100644
--- a/fms_mo/quant/README.md
+++ b/fms_mo/quant/README.md
@@ -1,6 +1,6 @@
 # Notice
 
-In the `ptq.py` file in this folder, Class `StraightThrough`, function `_fold_bn`, `fold_bn_into_conv`, `reset_bn`, and `search_fold_and_remove_bn` are modified from `QDROP` reposotpry on GitHub.
+In the `ptq.py` file in this folder, Class `StraightThrough`, function `_fold_bn`, `fold_bn_into_conv`, `reset_bn`, and `search_fold_and_remove_bn` are modified from `QDROP` repository on GitHub.
 For the original code, see [QDrop](https://github.com/wimh966/QDrop/tree/qdrop/qdrop/quantization) which has no license stipulated.
 
 In the `quantizers.py` file in this folder, Class/function `MSEObserver`, `ObserverBase`, `fake_quantize_per_channel_affine`, `fake_quantize_per_tensor_affine`, `_transform_to_ch_axis`, `CyclicTempDecay`, `LinearTempDecay`, `AdaRoundSTE`, `AdaRoundQuantizerare` are modified from `BRECQ` repository on GitHub. For the original code, see [BRECQ](https://github.com/yhhhli/BRECQ) with the following license.
diff --git a/tox.ini b/tox.ini
index d9232c8d..6a64ea3b 100644
--- a/tox.ini
+++ b/tox.ini
@@ -55,6 +55,19 @@ commands =
     ruff format .
     isort --check .
 
+[testenv:spellcheck]
+description = spell check (needs 'aspell' command)
+basepython = {[testenv:py3]basepython}
+labels = fastverify
+skip_install = true
+skipsdist = true
+deps =
+    pyspelling
+commands =
+    sh -c 'command -v aspell || (echo "aspell is not installed. Please install it." && exit 1)'
+    {envpython} -m pyspelling --config {toxinidir}/.spellcheck.yml --spellchecker aspell
+allowlist_externals = sh
+
 [testenv:coverage]
 description = report unit test coverage
 deps =

From 2bd84b0002bbd790321f4a44b5f4ecac918a902d Mon Sep 17 00:00:00 2001
From: Martin Hickey
Date: Mon, 16 Dec 2024 17:48:53 +0000
Subject: [PATCH 2/2] Update the contributing guide

Add a sub-section about the spell checker.

Signed-off-by: Martin Hickey
---
 .spellcheck-en-custom.txt |  1 +
 CONTRIBUTING.md           | 12 +++++++++++-
 tox.ini                   |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 0571e09a..21dde169 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -74,6 +74,7 @@ pyenv
 pylint
 pygraphviz
 pyproject
+pyspelling
 pytest
 QAT
 QAT'ed
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 93598e61..15ffa4f4 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -88,7 +88,7 @@ pip install tox
 If you want to manage your own virtual environment instead of using `tox`, you can install the model optimizer and all dependencies. Check out [installation](./README.md#installation) for more details.
 
-Before pushing changes to GitHub, you need to run the tests and coding style as shown below. They can be run individually as shown in each sub-section or can be run with the one command:
+Before pushing changes to GitHub, you need to run the tests, coding style checks, and the spelling check as shown below. They can be run individually as shown in each sub-section or can be run with the one command:
 
 ```shell
 tox
 ```
@@ -137,6 +137,16 @@ You can invoke the linting with the following command
 tox -e lint
 ```
 
+### Spelling check
+
+Spelling check is enforced by the CI system. Run the checker before pushing the changes to avoid CI issues. We use the [pyspelling](https://github.com/facelessuser/pyspelling) spell check automation tool. It is a wrapper around the CLI of [Aspell](http://aspell.net/) and [Hunspell](https://hunspell.github.io), which are spell checker tools. We configure `pyspelling` to use `Aspell` as the spell checker tool of choice.
+
+Running the spelling check is as simple as:
+
+```sh
+tox -e spellcheck
+```
+
 ## Your First Code Contribution
 
 Unsure where to begin contributing? You can start by looking through these issues:
diff --git a/tox.ini b/tox.ini
index 6a64ea3b..c99019f7 100644
--- a/tox.ini
+++ b/tox.ini
@@ -1,5 +1,5 @@
 [tox]
-envlist = ruff, lint, unit
+envlist = ruff, lint, spellcheck, unit
 minversion = 4.4
 
 [testenv]