Allow the `dictContentSize` to be any size. The finalized dictionary
content size must be at least as large as the maximum repcode (8). So we
add zero bytes to the dictionary to ensure that we meet that
requirement.
I've removed this restriction because its been causing us headaches when
people complain that dictionary training failed. It fails because there
isn't enough useful content to put in the dictionary. Either because
every sample is exactly the same and less than ZDICT_CONTENTSIZE_MIN bytes,
or there isn't enough content. Instead, we should succeed in creating
the dictionary, and it is up to the user to decide if it is worthwhile.
It is possible that the tables alone provide enough value.
NOTE: This allows us to produce dictionaries with finalized
`dictContentSize < ZDICT_CONTENTSIZE_MIN`. But, they are still valid
zstd dictionaries. We could remove the `ZDICT_CONTENTSIZE_MIN` macro,
but I've decided to leave that for now, so we don't break users.
`zstd_errors.h` and `zdict.h` are public headers, so they deserve to be
in the root `lib/` directory with `zstd.h`, not mixed in with our private
headers.
* Switch to yearless copyright per FB policy
* Fix up SPDX-License-Identifier lines in `contrib/linux-kernel` sources
* Add zstd copyright/license header to the `contrib/linux-kernel` sources
* Update the `tests/test-license.py` to check for yearless copyright
* Improvements to `tests/test-license.py`
* Check `contrib/linux-kernel` in `tests/test-license.py`
This ensures the symbols aren't redefined, which would result in a compiler
error.
I was getting redefined symbols for _LARGEFILE64_SOURCE when building for
32-bit x86 Linux on an older CentOS release in a CI environment. With this
change, I'm able to compile the single file library in this environment.
Closes#2443.
Allow compression to use dictionaries with missing symbols in their
entropy tables. We set the FSE repeat mode to check when there are
missing symbols, and set the FSE repeat mode to valid when all symbols
are present.
Note that when not all symbols are present, the heuristics which favor
dictionary tables for lower compression levels won't activate.
Tested by manually creating a dictionary with missing symbols of every
type, and validing that the compressor rejects it before this change,
and accepts it after this change. Also, I ran the `dictionary_loader`
fuzzer for >1 hour of CPU time without running into cases where
compression succeeds, but decompression fails.
Fixes#2174.
* All copyright lines now have -2020 instead of -present
* All copyright lines include "Facebook, Inc"
* All licenses are now standardized
The copyright in `threading.{h,c}` is not changed because it comes from
zstdmt.
The copyright and license of `divsufsort.{h,c}` is not changed.
as suggested in #1441.
generally U32 and unsigned are the same thing,
except when they are not ...
case : 32-bit compilation for MIPS (uint32_t == unsigned long)
A vast majority of transformation consists in transforming U32 into unsigned.
In rare cases, it's the other way around (typically for internal code, such as seeds).
Among a few issues this patches solves :
- some parameters were declared with type `unsigned` in *.h,
but with type `U32` in their implementation *.c .
- some parameters have type unsigned*,
but the caller user a pointer to U32 instead.
These fixes are useful.
However, the bulk of changes is about %u formating,
which requires unsigned type,
but generally receives U32 values instead,
often just for brevity (U32 is shorter than unsigned).
These changes are generally minor, or even annoying.
As a consequence, the amount of code changed is larger than I would expect for such a patch.
Testing is also a pain :
it requires manually modifying `mem.h`,
in order to lie about `U32`
and force it to be an `unsigned long` typically.
On a 64-bit system, this will break the equivalence unsigned == U32.
Unfortunately, it will also break a few static_assert(), controlling structure sizes.
So it also requires modifying `debug.h` to make `static_assert()` a noop.
And then reverting these changes.
So it's inconvenient, and as a consequence,
this property is currently not checked during CI tests.
Therefore, these problems can emerge again in the future.
I wonder if it is worth ensuring proper distinction of U32 != unsigned in CI tests.
It's another restriction for coding, adding more frustration during merge tests,
since most platforms don't need this distinction (hence contributor will not see it),
and while this can matter in theory, the number of platforms impacted seems minimal.
Thoughts ?
* Minor fix
* Run non-optimize FASTCOVER 5 times in benchmark
* Merge fastCover into dictBuilder
* Fix mixed declaration issue
* Add fastcover to symbol.c
* Add fastCover.c and cover.h to build
* Change fastCover.c to fastcover.c
* Update benchmark to run FASTCOVER in dictBuilder
* Undo spliting fastcover_param into cover_param and f
* Remove convert param functions
* Assign f to parameter
* Add zdict.h to Makefile in lib
* Add cover.h to BUCK
* Cast 1 to U64 before shifting
* Remove trimming of zero freq head and tail in selectSegment and rebenchmark
* Remove f as a separate parameter of tryParam
* Read 8 bytes when d is 6
* Add trimming off zero frequency head and tail
* Use best functions from COVER and remove trimming part(which leads to worse compression ratio after previous bugs were fixed)
* Add finalize= argument to FASTCOVER to specify percentage of training samples passed to ZDICT_finalizeDictionary
* Change nbDmer to always read 8 bytes even when d=6
* Add skip=# argument to allow skipping dmers in computeFrequency in FASTCOVER
* Update comments and benchmarking result
* Change default method of ZDICT_trainFromBuffer to ZDICT_optimizeTrainFromBuffer_fastCover
* Add dictType enum and fix bug about passing zParam when converting to coverParam
* Combine finalize and skip into a single parameter
* Update acceleration parameters and benchmark on 3 sample sets
* Change default splitPoint of FASTCOVER to 0.75 and benchmark first 3 sample sets
* Initialize variables outside of for loop in benchmark.c
* Update benchmark result for hg-manifest
* Remove cover.h from install-includes
* Add explanation of f
* Set default compression level for trainFromBuffer to 3
* Add assertion of fastCoverParams in DiB_trainFromFiles
* Add checkTotalCompressedSize function + some minor fixes
* Add test for multithreading fastCovr
* Initialize segmentFreqs in every FASTCOVER_selectSegment and move mutex_unnlock to end of COVER_best_finish
* Free segmentFreqs
* Initialize segmentFreqs before calling FASTCOVER_buildDictionary instead of in FASTCOVER_selectSegment
* Add FASTCOVER_MEMMULT
* Minor fix
* Update benchmarking result
There were 2 competing set of debug functions
within zstd_internal.h and bitstream.h.
They were mostly duplicate, and required care to avoid messing with each other.
There is now a single implementation, shared by both.
Significant change :
The macro variable ZSTD_DEBUG does no longer exist,
it has been replaced by DEBUGLEVEL,
which required modifying several source files.
Pathological samples may result in literal section being incompressible.
This case is now detected,
and literal distribution is replaced by one that can be written into the dictionary.