CHANGES.txt

v<0.1.0>, <20140114> -- Initial release.
v<0.1.1>, <20140119> -- Treat filenames as byte streams, not unicode in analyser.py
v<0.1.2>, <20140120> -- Process files given on the commandline correctly in ccat
v<0.2.0>, <20140121> -- Add the corpus conversion tools to the package
v<0.2.1>, <20140124> -- Fix a bug in the conversion tools decoding, lots of changes to conform better to PEP8
v<0.2.2>, <20140128> -- Make convert2xml use multiprocessing the same way that analyse_corpus does. Update installation instructions.
v<0.2.3>, <20140129> -- ccat now understands how to turn analysed files into clean text.
v<0.3.0>, <20140129> -- Moved corpus-parallel.py to CorpusTools
v<0.3.1>, <20140204> -- Changed the way multiprocessing is done in convert2xml, re-add the disambiguation tag to the analysed docs
v<0.3.2>, <20140205> -- Use the old infra for sme.fst, abbr.txt and corr.txt
v<0.3.3>, <20140213> -- Don't write empty analysis files
v<0.3.4>, <20140218> -- Don't write empty converted files
v<0.3.5>, <20140305> -- New scheme for encoding guessing
v<0.3.6>, <20140329> -- Handle hyph tags
v<0.4.0>, <20140507> -- Convert .doc by using docbook-xsl-ns -> html -> our xml
v<0.5.0>, <20140507> -- Change the way Avvir xml files are converted
v<0.5.1>, <20140821> -- Change the way newstags are handled
v<0.7.0>, <20141008> -- Add add_files_to_corpus script, a script to add files in a given directory to a given corpus
v<0.7.1>, <20141008> -- Change the --debug option to --serial in analyse_corpus and remove -r option in ccat
v<0.7.2>, <20141010> -- The file adder is able to add both files and files inside directories. The files are also added to corpus by the file adder.
v<0.7.3>, <20141011> -- Add the script change_corpus_names, update the CorpusNameFixer to use the NameChangerBase as base class
v<0.7.4>, <20141011> -- When using --serial in convert2xml, tell which file is being converted
v<0.7.5>, <20141105> -- Add -v to print the version from all scripts
v<0.7.6>, <20141106> -- Convert filenames to unicode strings in ccat
v<0.7.7>, <20141110> -- Add -v/--version options in the proper way using argparse
v<0.7.8>, <20141110> -- Use the DRY principle for the version info
v<0.7.9>, <20141125> -- Set the correct requirements
v<0.8.0>, <20141127> -- Support for docx conversion
v<0.8.1>, <20141127> -- Depend on wvHtml for doc conversion
v<0.8.2>, <20141230> -- Countless improvements by Kevin Unhammer, added saami_crawler script
v<0.9.0alpha1>, <20141230> -- Remove dependency on pytidylib and beautifulsoup4, replace them by lxml.html.clean and lxml.html.html5parser
v<0.9.0alpha2>, <20150120> -- Make a ConverterManager, let it handle the conversion
v<0.9.0beta1>, <20150217> -- pdf conversion based on PDF2XMLConverter
v<0.9.0beta2>, <20150218> -- added parallel file handling to add_files_to_corpus
v<0.9.0beta3>, <20150304> -- catch error when trying to parse the xsl-file, telling what the error is
v<0.9.0b4>, <20150316> -- Paragraphs in pdf documents that are split by a page change will appear as one paragraph in the converted document.
v<0.9.8>, <20150909> -- Fix Š and Ž problems mentioned in http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=1954#c20
v<0.9.9>, <20150909> -- Fix ¹ and Ü problems mentioned in http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=1954#c20
v<0.9.10>, <20151022> -- Always convert ´ to ' when language is sms
v<0.9.11>, <20151029> -- Comment out the previous change, it did the wrong thing
v<0.9.12>, <20151029> -- Activate fix_sms in converter.py again
v<0.10.0>, <20151030> -- Add two new scripts, dupefinder and duperemover
v<0.11.0>, <20151112> -- Add the remove_corpus_file script
v<0.11.1>, <20151116> -- Let multiple pages with same margin be separated by semicolon. Fixes bug 2102
v<0.11.2>, <20151116> -- Allow use of odd and even in skip_pages. Fixes bug 2103
v<0.11.3>, <20151117> -- Fix more footnote superscript omissions. Related to bug 2105.
v<0.11.4>, <20151118> -- Added inner_margins. Fixes bug 2104.
v<0.11.5>, <20151118> -- Fixes bug 2115.
v<0.11.6>, <20151120> -- In pdf files, do not try to extract text from text elements only containing whitespace. Fixes bug 2111.
v<0.11.7>, <20151123> -- Make parallelize work correctly again.
v<0.12.0>, <20151123> -- Added support for sms in parallelize.
v<0.13.0>, <20151124> -- Add the script pick_parallel_docs
v<0.13.1>, <20151127> -- Fix bug 2107, fix unexpected line shifts
v<0.13.2>, <20151210> -- Fix bug 2101, unexpected space before and after saami characters
v<0.13.3>, <20151210> -- Fixed an error in the way footnote superscripts were handled
v<0.13.4>, <20160104> -- Fix bug 2106, headings appear at the end of the page
v<0.13.5>, <20160105> -- Don't crash if a page is empty
v<0.13.6>, <20160105> -- Add support for smn in parallelize and generate_anchor_list
v<0.13.7>, <20160115> -- Lots of improvements in paragraph handling
v<0.13.8>, <20160217> -- Handle bare text in body in HTMLContentConverter. Also handle more tags to remove from html.
v<0.13.9>, <20160220> -- More classes and ids to avoid. More files are converted.
v<0.13.10>, <20160227> -- Improved saami_crawler
v<0.13.11>, <20160304> -- Changed the user interface of add_files_to_corpus
v<0.13.12>, <20160320> -- Make add_files_to_corpus get names for downloaded documents from the http-header, if possible.
v<0.13.13>, <20160330> -- Handle both utf8 and latin1 encoded filenames when downloading pdf files.
v<0.14.0>, <20160504> -- CorpusTools is now both python2 and python3 compatible
v<0.14.1>, <20160510> -- CorpusTools now supports goldstandard files inside orig
v<0.14.2>, <20160531> -- Improved quote detection and more tags to remove from regjeringen.no files. Improves the situation for bug 1780.
v<0.14.3>, <20160901> -- Change paths for syntax analysis files after the gtcore -> giella-core, giella-shared transition
v<0.15.0>, <20160901> -- Add basic support for conversion of epub files.
v<0.15.1>, <20160902> -- Ensure that lang is unicode in adder.py. Fixes a bug in add_files_to_corpus
v<0.15.2>, <20160908> -- Add skip_elements in metadata. Makes it possible to set ranges of elements that should be skipped in epub files.
v<0.15.3>, <20160915> -- Fix bug in skip_elements handling, handle ranges of same depth.
v<0.15.4>, <20160916> -- Fix bug in skip_elements handling, stop removal at stop element, do not just avoid it.
v<0.15.5>, <20160921> -- Make sure that adder uses unicode filenames even in python2.
v<0.15.6>, <20160930> -- Remove all custom encoding guessing. This gives the correct encoding to www.samediggi.fi html documents, and fixes the disappearing ž's.
v<0.15.7>, <20161001> -- Sámi character encoding guessing must be done on the intermediate document, before the intermediate document and the metadata are merged. This is because some characters are lost in the merge.
v<0.15.8>, <20161009> -- Fix bug 2223, problems with the conversion of South Sámi documents (which also affected quite a lot of other documents).
v<0.15.9>, <20161012> -- Some converted smj docs contained ∫, that should have been ŋ. Turned out they used the mac-sami encoding internally. Fixed the decoding detection to cover these cases, too.
v<0.15.10>, <20161013> -- Fix parts of bug 1938, space in front of the letter ŋ. Font elements in html are removed, leaving only the text of these elements.
v<0.15.11>, <20161022> -- Fix error message which crashed convert2xml
v<0.15.12>, <20161024> -- Improve renaming of files: Add basic metadata file if it does not exist and handle non-ascii names.
v<0.16.0>, <20161026> -- Make parallelize accept directories as well as files as input.
v<0.16.1>, <20161026> -- Add linespacing to metadata and use it in pdf conversion.
v<0.16.2>, <20161107> -- Fix double utf-8'ing problems when converting html documents and other documents with html as a temporary format.
v<0.16.3>, <20161125> -- Remove all but the first story element with identical ids in avvirxml files.
v<0.16.4>, <20161130> -- Make it possible to skip lines in plain text document.
v<0.16.5>, <20161209> -- Fix sms documents again. This was initially fixed in v0.9.12, but in v0.15.7 this regressed.
v<0.16.6>, <20161209> -- Additions to mac-sami_to_latin1. Fixes: Bug 2301 - smj: text converted with many errors.
v<0.16.7>, <20161212> -- Fix a bug in convert2xml that did not write changed values in the metadata to the converted file. Make parallelize use the new dual language anchor files. Make pick_parallel_docs pick converted docs that aren't direct translations of the other.
v<0.16.8>, <20170302> -- Make analyse_corpus use Apertium modes.xml files, via the modes.py module. Make encoding conversion a little more robust.
v<0.16.9>, <20170303> -- Improve moving files around in svn.
v<0.17.0>, <20170314> -- Use hfst-tokenise as the engine for sentence division. Also divide sentences on ;. This improves the situation for Bug 2351 - Text disappears when parallelising files
v<0.17.1>, <20170316> -- Remove more cruft when converting html documents. Make the sentence divider use the nob fst when nno is specified. The sentence divider produces even more sentences to satisfy tca2.
v<0.18.0>, <20170320> -- Add a new tool, realign. This tool reconverts and realigns the sentences of a pair of parallel documents.
v<0.19.0>, <20170321> -- Add a new tool, update_metadata. This tool updates metadata files in a given directory with new variables and documentation from the XSL-template.xsl file.
v<0.19.1>, <20170322> -- Improve the realign tool, add --files and --convert options. --files will only show filenames, --convert will reconvert the .xml files and show filenames.
v<0.19.2>, <20170503> -- Improve saami_crawler.py, handle new list character in pdf conversion, add space in list elements when needed when doing pdf conversion, add new character to sms conversion
v<0.19.3>, <20170510> -- Fix a bug in versioncontrol.py which affects move_corpus_file
v<0.19.4>, <20170513> -- Fix a bug in the pdf conversion error handling in pdfconverter.py
v<0.20.0>, <20170522> -- Add the samas.no crawler
v<0.20.1>, <20170606> -- Only allow cp1251_cp1252 decoding for cyrillic languages. Fixes bug 2400 - conversion problems for sda 2007
v<0.20.2>, <20170609> -- realign did not work anymore, fix it
v<0.20.3>, <20170706> -- fixed the lang recognition bug that strongly affected the work with realigning parallel files
v<0.20.4>, <20170707> -- fixed DTD: if multilingual then at least one language element as child
v<0.20.5>, <20170711> -- relaxed DTD: if multilingual then it can be empty, too, because otherwise many files would not convert
v<0.20.6>, <20170718> -- Fix bug 2409 - Two Mari letters ӧ ӱ not converted correctly. Also only detect languages when there are language models for the requested languages.
v<0.20.7>, <20170728> -- Fix bug 2411 - Fix a bug introduced in langtech r155208 that rendered convert2xml unusable.
v<0.20.8>, <20170921> -- Set new paths and options for hfst-tokenisation.
v<0.20.9>, <20170927> -- Use --print-all for hfst-tokenise. Improves input for both parallelisation and analysis. Also remove irritating avvir.no input.
v<0.20.10>, <20170928> -- Remove script tags and their tails from html documents. This removes superfluous text from e.g. finnish sami parliament documents.
v<0.21.0>, <20171010> -- Add support for latex to giella-xml conversion.
v<0.21.1>, <20171010> -- Fix path problems in the latex conversion module.
v<0.21.2>, <20171010> -- Add the possibility to specify file name when adding it to the corpus.
v<0.21.3>, <20171025> -- Set the mimetype of xsl files to text/xml when adding files to a svn working copy.
v<0.21.4>, <20171205> -- Change the way html and html-dependent document types are converted, by parsing all html content as if it is xhtml. This fixes bugs, particularly in epub conversion.
v<0.21.5>, <20171211> -- Some html documents were not parsable by html5parser. Clean them up before parsing them.
v<0.21.6>, <20171212> -- Some .doc documents sent through wvHtml were not parsable by html5parser. Remove control characters from the wvHtml output.
v<0.21.7>, <20171214> -- parallelize no longer tries to parallelize non-existent parallel files …
v<0.21.8>, <20171217> -- Route all formats that have html as an intermediate step through htmlcontentconverter. Depend on html.document_fromstring instead of html5parser.document_fromstring.
v<0.22.0>, <20171219> -- Add helper program for setting epub specific metadata
v<0.22.1>, <20171229> -- Fix the epubchooser and epubconverter to also remove just one specified xpath, in addition to being able to remove a range of xpaths.
v<0.22.2>, <20180105> -- Add author search to nrk.no articles
v<0.22.3>, <20180109> -- Add latex2html to sanity_check
v<0.23.0>, <20180123> -- Add the utility make_training_corpus, that makes training corpus for pytextcat and langid
v<0.23.1>, <20180228> -- Analysed files from stallo are stored in analysed.YY-MM-DD, make it possible to count these
v<0.23.2>, <20180530> -- Import print_function, unicode_literals from __future__ to minimise python 2 and 3 differences
v<0.23.3>, <20180831> -- No weights in hfst analysis when analysing text
v<0.24.0>, <20181005> -- Depend on files from a make install in giella when analysing files
v<0.24.1>, <20181008> -- Insert modename (xfst or hfst) into the analysed pathname
v<0.24.2>, <20181008> -- Make vislcg3 --grammar mwe-dis.bin available, use the long version of vislcg3 -g option (--grammar) in modes.xml
v<0.24.3>, <20181206> -- Convert Windows style '\r\n' line endings to Unix style '\n' line endings before handling the content of plaintext files further.
v<0.24.4>, <20181214> -- Rename the realign script to reparallelize, making it
                         more obvious what it is used for. Also make small changes
                         to the default behaviour of parallelize. Update the
                         documentation of parallelize and reparallelize.
v<0.25.0>, <20190206> -- Add a new program, tmx2html, to convert individual .tmx files into html files.
v<0.25.1>, <20190516> -- Fix language handling errors in ccat
v<0.26.0>, <20190603> -- convert2xml has by default only reconverted files if metadata files are newer than
                         the converted file. Revert this behaviour, and add an option (lazy-conversion) to
                         emulate the previous behaviour.
v<0.26.1>, <20190611> -- set default arguments for ConverterManager, fixes error in reparallelize
v<0.26.2>, <20190614> -- make SamediggiNoCrawler work again
v<0.26.3>, <20190809> -- make the nrk.no news flash fetcher work again
v<0.27.0>, <20190809> -- add support for the errorformat/‰ errormarkup tag.
                         .tmx files are written to the prestable/tmx folder
v<0.27.1>, <20191015> -- Do not pretty_print xml-files anymore, only set tails
                         of p as '\n' instead.
v<0.27.2>, <20191107> -- Make reparallelize work again, it relied on converted
                         files being in prestable.
v<0.27.3>, <20200209> -- Make the -wbt option of pdftohtml available as a variable in the metadatafiles.
v<0.28.0>, <20200504> -- Support the conversion of udhr xml documents (https://unicode.org/udhr/).
v<0.28.1>, <20201105> -- Make sure that temporary directory exists when adding files
v<0.28.2>, <20201110> -- Decode garbled mrj documents. Thanks to Trond for the mapping.
v<0.29.0>, <20210101> -- Conversion of usx documents and improved pdf conversion
v<0.29.1>, <20210113> -- Catch IndexError when converting documents and turn it into a warning, so that it neither crashes nor stops the conversion
v<0.29.2>, <20210129> -- Add -tts option to ccat, print fixes for all errors, except foreign ones.
v<0.29.3>, <20210129> -- Change name of -tts to -withforeign.
v<0.29.4>, <20210224> -- Raise a real error when erring out in add_nested_error
v<0.29.5>, <20210308> -- The last step in the xml conversion is to normalise text to NFC before writing it to disk.
v<0.29.6>, <20210512> -- Add char fix for Mansi
v<0.30.0>, <20211005> -- Move codebase to Python 3.6+, add dropbox_adder script