* Add a Huffman round trip fuzzer
* Fix two minor bugs in Huffman that aren't exposed in zstd
- Incorrect weight comparison (weights are allowed to be equal to
table log).
- HUF_compress1X_usingCTable_internal() can return compressed
size >= source size, so the assert that `cSize <= 65535` isn't
correct, and it needs to be checked instead.
This patch is supposed to improve compatibility with less featured tar variants
"when the tar program used does not support historical options (without hyphen) nor the '-z' option."
Patch proposed by Antonio Diaz Diaz
Compress the input twice in the `simple_round_trip` and
`dictionary_round_trip` fuzzers with exactly the same parameters, but
reusing the context. Then ensure that the compressed output is
identical.
This flag forces zstd to always load the prefix in ext-dict mode, even
if it happens to be contiguous, to force determinism. It also applies to
dictionaries that are re-processed.
A determinism test case is also added, which fails without
`ZSTD_c_deterministicRefPrefix` and passes with it set.
Question: Should this be the default behavior? It isn't in this PR.
`open()`'s mode bits are only applied to files that are created by the call.
If the output file already exists, but is not readable, the `fopen()` would
fail, preventing us from removing it, which would mean that the file would
not end up with the correct permission bits.
It's not clear to me why the `fopen()` is there at all. `UTIL_isRegularFile()`
should be sufficient, AFAICT.
previous lower limit was 1 MB.
Note : by default, the lowest job size is 2 MB, achieved at level 1.
Even lower job sizes can be achieved by manipulating this value directly,
or manually modifying window sizes to lower amounts.
Updated unit test to ensure that this new limit works fine
(test would fail with previous 1 MB limit).
Switch from `-T0` to the default `-T1` which significantly reduces
memory usage for level 19 when there are many cores. This fixes
32-bit issues of running out of address space.
Fixes#2603.
* Fix overflow correction when `windowLog < cycleLog`. Previously, we
got the correction wrong in this case, and our chain tables and binary
trees would be corrupted. Now, we work as long as `maxDist` is a power
of two, by adding `MAX(maxDist, cycleSize)` to our indices.
* When `ZSTD_WINDOW_OVERFLOW_CORRECT_FREQUENTLY` is defined to non-zero
run overflow correction as frequently as allowed without impacting
compression ratio.
* Enable `ZSTD_WINDOW_OVERFLOW_CORRECT_FREQUENTLY` in `fuzzer` and
`zstreamtest` as well as all the OSS-Fuzz fuzzers. This has a 5-10%
speed penalty at most, which seems reasonable.
* Switch to yearless copyright per FB policy
* Fix up SPDX-License-Identifier lines in `contrib/linux-kernel` sources
* Add zstd copyright/license header to the `contrib/linux-kernel` sources
* Update the `tests/test-license.py` to check for yearless copyright
* Improvements to `tests/test-license.py`
* Check `contrib/linux-kernel` in `tests/test-license.py`
Fixes the update from PR #2508. I had accidentally forgotten to rebuild
the library, and the regression test suite isn't hooked up to the new
fancy build system yet.
I've double checked that the results are deterministic.
9f327c02fd changed the compression method
for LDM, so the results are slightly different.
I've re-tested LDM on some larger inputs and everything seems fine.
These ratio changes just seem to be noise. There is generally a 0.01%
swing in ratio, sometimes better sometimes worse, but never large.
* Fix compiler version regex, which was broken for multi-digit
versions.
* Fix compiler detection for gcc.
* Disable `pointer-overflow` instead of `integer-overflow` for gcc
versions newer than 8.0.0.
The simple compression functions are intended to ignore the advanced
parameters, but they were accidentally using them. All the
`ZSTD_parameters` were set correctly, but any extra parameters were
used as-is. E.g. `ZSTD_c_format`.
This PR makes all the simple single-pass functions listed below ignore
the advanced parameters, as intended.
* `ZSTD_compressCCtx()`
* `ZSTD_compress_usingDict()`
* `ZSTD_compress_usingCDict()`
* `ZSTD_compress_advanced()`
* `ZSTD_compress_usingCDict_advanced()`
It also adds a test case that ensures that each of these functions
ignore the advanced parameters.
Treat ZSTD_getCParams() and ZSTD_adjustCParams() in the same way
we treat streaming compression. Choose parameters based on the
dictionary size + source size, and assume the source size is small
if unkown. But, don't shrink the window log down in
ZSTD_adjustCParams_internal().
Fixes#2442.
1. When creating a dictionary keep the same behavior as before.
Assume the source size is 513 bytes when adjusting parameters.
2. When calling ZSTD_getCParams() or ZSTD_adjustCParams() keep
the same behavior as before.
3. When attaching a dictionary keep the same behavior of ignoring
the dictionary size. When streaming this will select the
largest parameters and not adjust them down. But, the CDict
will use the correctly sized parameters, which seems like the
right tradeoff.
4. When not attaching a dictionary (either forced not to, or
using a prefix dictionary) we select parameters based on the
dictionary size + source size, and assume the source size is
small, which is the same behavior as before. But, now we don't
adjust the window log (and hash and chain log) down when the
source size is unknown.
When the source size is unknown all cdicts should attach, except
when the user disables attaching, or `forceWindow` is used. This
means that when streaming with a CDict we end up in the good case
where we get small CDict parameters, and large source parameters.
TODO: Add a streaming + dictionary regression test case.
* Add a test that runs without a pledgedSrcSize and with a dictionary.
* Add github.tar data with uses the github dictionary while compressing
github.tar, instead of each file individually.
this build works fine on all my systems,
but since to fail on CI environment.
Unclear why there is a difference.
This build test is not relevant anyway.
https://github.com/facebook/zstd/pull/2339 removes the single-pass zstdmt API.
This changes the compressed size, because we no longer take the # of threads into
account when deciding the job size.
When zstdmt cannot get a buffer and `ZSTD_e_end` is passed an empty
compression job can be created. Additionally, `mtctx->frameEnded` can be
set to 1, which could potentially cause problems like unterminated blocks.
The fix is to adjust to `ZSTD_e_flush` even when we can't get a buffer.
* Run compression twice and check the compressed data is byte-identical.
The compression loop had to be rewritten to ensure deteriminism. It is
guaranteed by always making maximal forward progress.
* When nbWorkers > 0, change the number of workers 1/8 of the time.
* Run in single-pass mode 1/4 of the time.
I've run a few hundred thousand iterations of zstreamtest and have seen
no deteriminism issues so far. Before the zstdmt fix that skips the
single-pass shortcut non-determinism showed up in a few hundred
iterations.
This commit leaves only the functions used by zstd_compress.c. All other
functions have been removed from the API. The ZSTDMT unit tests in
fuzzer.c and zstreamtest.c have been rewritten to use the ZSTD API. And
the --mt zstreamtest tests have been ripped out.
Pass in the `ZSTD_cParamMode_e` to select how we define our cparams.
Based on the mode we either take the `dictSize` into account or we set
it to `0`. See the documentation for `ZSTD_cParamMode_e`.
Some of the modes currently share the same behavior. But they have
distinct modes because they are drastically different cases. E.g.
compression + reprocessing the dictionary and creating a cdict.
Additionally, when downsizing the hashLog and chainLog take the
(adjusted) dictionary size into account, since the size of the
dictionary gets added onto the window size.
Adds a simple test to ensure that we aren't downsizing too far.
Conditions to trigger:
* CDict is loaded as raw content.
* CDict starts with the zstd dictionary magic number.
* The CDict is reprocessed (not attached or copied).
* The new API is used (streaming or `ZSTD_compress2()`).
Bug: The dictionary is loaded as a zstd dictionary, not a raw content
dictionary, because the dict content type is set to `ZSTD_dct_auto`.
Fix: Pass in the dictionary content type from cdict creation to the call
to `ZSTD_compress_insertDictionary()`.
Test: Added a test case that exposes the bug, and fixed the raw
content tests to not modify the `dictBuffer`, which makes all future
tests with the `dictBuffer` raw content, which doesn't seem intentional.
This is kind of hacky. And maybe this test doesn't need to be permanently as
exhaustive as it is now. But while we're actively developing the DDSS, we
should ensure it's compatible across many different modes.
* Fix bug introduced in PR #2271
* Fix long-standing bug that is impossible to trigger inside of zstd
* Add a fuzzer that makes sure the normalized count always round trips
correctly