Terminology #448
Open: XapaJIaMnu wants to merge 45 commits into main from terminology
Changes from 37 commits
Commits
c43c57a Basic terminology API (XapaJIaMnu)
06ec79b reference to the code (XapaJIaMnu)
867fc6c Update marian with gcc 12 (kpu)
9153041 WiP python iface (XapaJIaMnu)
e5977e6 More WiP (XapaJIaMnu)
b1cb3bc Works except stdin (XapaJIaMnu)
7a93f7c Python interface (XapaJIaMnu)
5be7b96 Merge branch 'main' into terminology (XapaJIaMnu)
c1a659e Small fixes, removes pybind submodule (XapaJIaMnu)
1f8ba76 Allow dictionary maps. Work in progress (XapaJIaMnu)
cc44014 Convert the map to python map (XapaJIaMnu)
6c7fe75 Allow dictionary terminology set up (XapaJIaMnu)
c586e09 Attempt to install pybind11 for the wheel build (XapaJIaMnu)
26529dc Merge branch 'main' into terminology (XapaJIaMnu)
82cc687 Add support for different terminology format (XapaJIaMnu)
5c9161b Try to update the workflows. (XapaJIaMnu)
7d6f4e5 Refactor terminology replace (jelmervdl)
f53879d Fix formatting (jelmervdl)
a95001d Update marian dev which should allow for compilation on newer platforms (XapaJIaMnu)
316c5dd Fix for latest argparse (XapaJIaMnu)
58e5363 technology -> terminology (kpu)
0a6be45 Buffer input for efficiency (kpu)
ca37e8f Pass terminology_form from CLI to Translator (graemenail)
4011f88 Leave USE_STATIC_LIBS off by default (kpu)
19ca40d Enable cuda compilation (XapaJIaMnu)
1a8b90c Merge branch 'main' into terminology (XapaJIaMnu)
1e80e79 Working, except in python (XapaJIaMnu)
3d37edf Simplify invocation a bit (XapaJIaMnu)
e5d4ed0 Formatting fixes (XapaJIaMnu)
72ade1d Update the terminology format (XapaJIaMnu)
5f9858f Merge branch 'main' into terminology (XapaJIaMnu)
168d589 Use 0 GPU workers by default (XapaJIaMnu)
3eab045 Attempt to fix tests (XapaJIaMnu)
88e7f28 Fix error in workflow syntax (XapaJIaMnu)
1db9d09 Fix typing error (XapaJIaMnu)
537f4e1 I hate python linters (XapaJIaMnu)
042acc2 pytype can't access C++ modules (XapaJIaMnu)
e3b4a7c Small fixes (XapaJIaMnu)
05a7379 Merge branch 'main' into terminology (XapaJIaMnu)
5479c20 Merge with main (XapaJIaMnu)
d2356a6 Merge branch 'main' into terminology (kpu)
97c8da4 Pull in submodule fixing clang compilation (kpu)
095d602 Update marian-dev with newer fbgemm for clang (kpu)
007b578 Merge branch 'main' into terminology (kpu)
2417225 Merge branch 'main' into terminology (kpu)
Submodule pybind11 deleted from 9ec112
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,2 @@ | ||
| add_executable(bergamot bergamot.cpp) | ||
| target_link_libraries(bergamot PRIVATE bergamot-translator) | ||
| target_link_libraries(bergamot bergamot-translator) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,212 @@ | ||
| #!/usr/bin/env python3 | ||
| import argparse | ||
| from sys import stdin | ||
| from typing import Dict, List | ||
|
|
||
| import bergamot # type: ignore | ||
|
|
||
|
|
||
| class Translator: | ||
| """Bergamot translator interfacing with the C++ code. | ||
| Attributes: | ||
| _num_workers: Number of parallel CPU workers. | ||
| _gpu_workers: Indices of the GPU devices used. _num_workers must be set to zero! | ||
| _cache: Cache size. 0 to disable cache. | ||
| _logging: Log level: trace, debug, info, warn, err(or), critical, off. Default is off. | ||
| _terminology: Path to a TSV terminology file. | ||
| _force_terminology: Force the terminology to appear on the target side. May affect translation quality negatively. | ||
| _terminology_form: Format of the terminology string. | ||
| _config: Service configuration (bergamot.ServiceConfig). | ||
| _model: Translation model. | ||
| _responseOpts: What to include in the response (alignment, HTML restoration, etc.). | ||
| _service: The translation service. | ||
| """ | ||
|
|
||
| _num_workers: int | ||
| _gpu_workers: List[int] | ||
| _cache: int | ||
| _logging: str | ||
| _terminology: str | ||
| _force_terminology: bool | ||
| _terminology_form: str | ||
|
|
||
| _config: bergamot.ServiceConfig | ||
| _model: bergamot.TranslationModel | ||
| _responseOpts: bergamot.ResponseOptions | ||
| _service: bergamot.Service | ||
|
|
||
| def __init__( | ||
| self, | ||
| model_config_path: str, | ||
| num_workers: int = 1, | ||
| gpu_workers: List[int] = [], | ||
| cache: int = 0, | ||
| logging: str = "off", | ||
| terminology: str = "", | ||
| force_terminology: bool = False, | ||
| terminology_form: str = "%s __target__ %s __done__ ", | ||
| ): | ||
| """Initialises the translator class | ||
| :param model_config_path: Path to the configuration file for the translation model. | ||
| :param num_workers: Number of CPU workers. | ||
| :param gpu_workers: Indices of the GPU devices. num_workers must be zero if this is non-empty. | ||
| :param cache: Cache size. 0 means no cache. | ||
| :param logging: Log level: trace, debug, info, warn, err(or), critical, off. | ||
| :param terminology: Path to terminology file, TSV format. | ||
| :param force_terminology: Force terminology to appear on the target side. May impact translation quality. | ||
| :param terminology_form: Format of the terminology string. | ||
| """ | ||
| self._num_workers = num_workers | ||
| # Copy so the shared default list is never aliased between instances. | ||
| self._gpu_workers = list(gpu_workers) | ||
| self._cache = cache | ||
| self._logging = logging | ||
| self._terminology = terminology | ||
| self._force_terminology = force_terminology | ||
| self._terminology_form = terminology_form | ||
|
|
||
| self._config = bergamot.ServiceConfig( | ||
| self._num_workers, | ||
| bergamot.VectorSizeT(self._gpu_workers), | ||
| self._cache, | ||
| self._logging, | ||
| self._terminology, | ||
| self._force_terminology, | ||
| self._terminology_form, | ||
| ) | ||
| self._service = bergamot.Service(self._config) | ||
| # All response options default to False; enable HTML etc. here later if needed. | ||
| self._responseOpts = bergamot.ResponseOptions() | ||
| self._model = self._service.modelFromConfigPath(model_config_path) | ||
|
|
||
| def reset_terminology(self, terminology="", force_terminology: bool = False) -> None: | ||
| """Resets the terminology of the model | ||
| Python has no method overloading (a second definition would replace the first), so one method handles both input kinds. | ||
| :param terminology: Either the path to the terminology file (str), or a dictionary that maps source words to their target side terminology. | ||
| :param force_terminology: force terminology | ||
| :return: None | ||
| """ | ||
| if isinstance(terminology, dict): | ||
| # Dictionary: update the running service in place. | ||
| self._service.setTerminology(terminology, force_terminology) | ||
| else: | ||
| # Path: rebuild the service from a new config. | ||
| self._terminology = terminology | ||
| self._force_terminology = force_terminology | ||
| self._config = bergamot.ServiceConfig( | ||
| self._num_workers, | ||
| bergamot.VectorSizeT(self._gpu_workers), | ||
| self._cache, | ||
| self._logging, | ||
| self._terminology, | ||
| self._force_terminology, | ||
| self._terminology_form, | ||
| ) | ||
| self._service = bergamot.Service(self._config) | ||
|
|
||
| def reset_num_workers(self, num_workers: int) -> None: | ||
| """Resets the number of workers | ||
| :param num_workers: number of parallel CPU threads. | ||
| :return: None | ||
| """ | ||
| self._num_workers = num_workers | ||
| self._config = bergamot.ServiceConfig( | ||
| self._num_workers, | ||
| bergamot.VectorSizeT(self._gpu_workers), | ||
| self._cache, | ||
| self._logging, | ||
| self._terminology, | ||
| self._force_terminology, | ||
| self._terminology_form, | ||
| ) | ||
| self._service = bergamot.Service(self._config) | ||
|
|
||
| def reset_gpu_workers(self, gpu_workers: List[int]) -> None: | ||
| """Resets the GPU devices used | ||
| :param gpu_workers: Indices of the GPU devices to be used. | ||
| :return: None | ||
| """ | ||
| self._gpu_workers = gpu_workers | ||
| self._config = bergamot.ServiceConfig( | ||
| self._num_workers, | ||
| bergamot.VectorSizeT(self._gpu_workers), | ||
| self._cache, | ||
| self._logging, | ||
| self._terminology, | ||
| self._force_terminology, | ||
| self._terminology_form, | ||
| ) | ||
| self._service = bergamot.Service(self._config) | ||
|
|
||
| def translate(self, sentences: List[str]) -> List[str]: | ||
| """Translates a list of strings | ||
| :param sentences: A List of strings to be translated. | ||
| :return: A list of translation outputs. | ||
| """ | ||
| responses = self._service.translate( | ||
| self._model, bergamot.VectorString(sentences), self._responseOpts | ||
| ) | ||
| return [response.target.text for response in responses] | ||
|
|
||
| # @TODO add async translate with futures | ||
|
|
||
|
|
||
| def main(): | ||
| parser = argparse.ArgumentParser(description="bergamot-translator interface") | ||
| parser.add_argument("--config", '-c', required=True, type=str, help='Model YML configuration input.') | ||
| parser.add_argument("--num-workers", '-n', type=int, default=1, help='Number of CPU workers.') | ||
| parser.add_argument("--num-gpus", "-g", type=int, action='append', nargs='+', default=None, help='List of GPUs to use.') | ||
| parser.add_argument("--logging", '-l', type=str, default="off", help='Set verbosity level of logging: trace, debug, info, warn, err(or), critical, off. Default is off') | ||
| parser.add_argument("--cache-size", type=int, default=0, help='Cache size. 0 disables caching.') | ||
| parser.add_argument("--terminology-tsv", '-t', default="", type=str, help='Path to a terminology file TSV') | ||
| parser.add_argument("--force-terminology", '-f', action="store_true", help='Force terminology to appear on the target side.') | ||
| parser.add_argument("--terminology-form", type=str, default="%s __target__ %s __done__ ", help='Form for terminology. Default is "%%s __target__ %%s __done__ "') | ||
| parser.add_argument("--path-to-input", '-i', default=None, type=str, help="Path to input file. Uses stdin if empty") | ||
| parser.add_argument("--batch", '-b', default=32, type=int, help="Number of lines to process in a batch") | ||
| args = parser.parse_args() | ||
|
|
||
| if args.num_gpus is None: | ||
| num_gpus = [] | ||
| else: | ||
| num_gpus = args.num_gpus[0] | ||
| translator = Translator(args.config, args.num_workers, num_gpus, args.cache_size, args.logging, args.terminology_tsv, args.force_terminology, args.terminology_form) | ||
|
|
||
|
|
||
| if args.path_to_input is None: | ||
| infile = stdin | ||
| else: | ||
| infile = open(args.path_to_input, "r", encoding="utf-8") | ||
|
|
||
| # In this example, each block of input text (i.e. a document) is a line. | ||
| # If you're using the API directly, feel free to include newlines in the | ||
| # block of text. We aim to preserve whitespace at sentence boundaries. | ||
|
|
||
| # Buffer input text to allow the backend to parallelize. We recommend | ||
| # there be about 16 sentences per worker (thread). Note that blocks of | ||
| # text are internally split into sentences, so the number of sentences is | ||
| # typically larger than the length of the list of blocks provided. | ||
| buffer = [] | ||
| for line in infile: | ||
| buffer.append(line.strip()) | ||
| if len(buffer) >= args.batch: | ||
| print("\n".join(translator.translate(buffer))) | ||
| buffer = [] | ||
|
|
||
| # Flush buffer | ||
| if len(buffer) > 0: | ||
| print("\n".join(translator.translate(buffer))) | ||
|
|
||
| if args.path_to_input is not None: | ||
| infile.close() | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
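The buffer-and-flush loop in `main()` could be pulled out into a small generator so it can be tested without the C++ module. This is only a minimal sketch and not part of this PR: `translate_in_batches` and `fake_translate` are hypothetical names, and `fake_translate` merely stands in for `Translator.translate`.

```python
from typing import Callable, Iterable, Iterator, List


def translate_in_batches(
    lines: Iterable[str],
    translate: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Yield translations, calling `translate` on at most `batch_size` lines at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    buffer: List[str] = []
    for line in lines:
        buffer.append(line.strip())
        if len(buffer) >= batch_size:
            yield from translate(buffer)
            buffer = []
    # Flush whatever is left once the input is exhausted.
    if buffer:
        yield from translate(buffer)


def fake_translate(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for Translator.translate; it only upper-cases the text."""
    return [text.upper() for text in batch]


if __name__ == "__main__":
    for out in translate_in_batches(["one\n", "two\n", "three\n"], fake_translate, batch_size=2):
        print(out)
```

With a real `Translator`, the second argument would be `translator.translate`, and `lines` could be `stdin` or an open file, as in `main()`.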