Putting stack marking into every assembly files is required to indicate
that the stack does not need to be executable.
Executable flag on stack conflicts with some security measures, Systemd
MemoryDenyWriteExecute=yes for example.
Previously, if an index was equal to `reducerValue + 1`, it would get remapped
during index reduction to 1 i.e. `ZSTD_DUBT_UNSORTED_MARK`. This can affect the
parsing of the input slightly, by causing tree nodes to be nullified when they
otherwise wouldn't be. This hardly matters from a correctness or efficiency
perspective, but it does impact determinism.
So this commit changes index reduction to avoid mapping indices to collide with
`ZSTD_DUBT_UNSORTED_MARK`.
that's clearer than finding the tables somewhere in the middle of `compress.c`.
Also, down the line, it may potentially allows zstd to feature adjusted tables depending on target cpu.
Speed up compilation times by moving each specialized search function
into its own function. This is faster because compilers can handle many
smaller functions much faster than one gigantic function. The previous
approach generated one giant function with `switch` statements and
inlining to select the implementation.
| Compiler | Flags | Dev Time (s) | PR Time (s) | Delta |
|----------|-------------------------------------|--------------|-------------|-------|
| gcc | -O3 | 16.5 | 5.6 | -66% |
| gcc | -O3 -g -fsanitize=address,undefined | 158.9 | 38.2 | -75% |
| clang | -O3 | 36.5 | 5.5 | -85% |
| clang | -O3 -g -fsanitize=address,undefined | 27.8 | 17.5 | -37% |
This also reduces the binary size because the search functions are no
longer inlined into the main body.
| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc | 1563868 | 1308844 | -16% |
| clang | 1924372 | 1376020 | -28% |
Finally, the performance is not impacted significantly by this change,
in fact we generally see a small speed boost.
| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc | 5 | 110.6 | 110.0 | -0.5% |
| gcc | 7 | 70.4 | 72.2 | +2.5% |
| gcc | 9 | 53.2 | 53.5 | +0.5% |
| gcc | 13 | 12.7 | 12.9 | +1.5% |
| clang | 5 | 113.9 | 110.4 | -3.0% |
| clang | 7 | 67.7 | 70.6 | +4.2% |
| clang | 9 | 51.9 | 52.2 | +0.5% |
| clang | 13 | 12.4 | 13.3 | +7.2% |
The compression strategy is unmodified in this PR, so the compressed size
should be exactly the same. I may have a follow up PR to slightly improve
the compression ratio, if it doesn't cost too much speed.
Fix underflow of `nbCompares` by switching to an `int` and comparing
`nbCompares > 0`. This is a minimal fix, because I don't want to change
the logic. These loops seem to be doing `nbCompares + 1` comparisons.
The bug was reported by Dan Carpenter and found by Smatch static
checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/
There is no minimum value check, so the parameter could be negative.
Switch to the standard pattern of using `BOUNDCHECK()`.
The bug was reported by Dan Carpenter and found by Smatch static
checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/
Since we're now hashing the position ahead even if we find a long match and
don't search that next position, we can write it back into the hashtable even
in long matches. This seems to cost us no speed, and improves compression
ratio slightly!
Aside from maybe a latency win in the loop, this means that when we find a
short match, we've already done the hash we need to check the next long match.
* Limit training samples size to 2GB
* simplified DISPLAYLEVEL() macro to use global vqriable instead of local.
* refactored training samples loading
* fixed compiler warning
* addressed comments from the pull request
* addressed @terrelln comments
* missed some fixes
* fixed type mismatch
* Fixed bug passing estimated number of samples rather insted of the loaded number of samples.
Changed unit conversion not to use bit-shifts.
* fixed a declaration after code
* fixed type conversion compile errors
* fixed more type castting
* fixed more type mismatching
* changed sizes type to size_t
* move type casting
* more type cast fixes
PR #2784 introduced a bug in the decompressor that caused some valid
inputs to fail to decompress. The bitstream isn't reloaded after the 4X*
loop if the number of elements remaining is small enough, causing us to
read more bits than are available in the bitcontainer.
This was caught by the MSAN fuzzer in OSS-Fuzz because the assembly
implementation isn't used in the MSAN build.
Credit to OSS-Fuzz.
Multiple ZSTD_createDCtx* functions call other (public)
ZSTD_createDCtx* functions, this makes it harder for humans
and compilers to throw out code that is not used.
This farms out the logic into a static function, if a program
only uses a single ZSTD_createDCtx variant, all others can be easily
dropped and the remaining implementation can be specialized.
Commit d7ef97a013
("[build] Fix oss-fuzz build with the dataflow sanitizer") broke
build inside Linux-kernel after 'import', as it no longer can
conditionally remove ZSTD_MEMORY_SANITIZER definition from
the #if DEF_A || DEF_B block. This emits -Wundef warning which
can be treated as error.
Split this preprocessor condition into two separate conditions
to fix this.
Fixes: d7ef97a013 ("[build] Fix oss-fuzz build with the dataflow sanitizer")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Switch to a macro `ZSTD_FALLTHROUGH;` instead of a comment. On supported
compilers this uses an attribute, otherwise it becomes a comment.
This is necessary to be compatible with clang's `-Wfall-through`, and
gcc's `-Wfall-through=2` which don't support comments. Without this the
linux build emits a bunch of warnings.
Also add a test to CI to ensure that we don't regress.
turns out, it's possible to constify MatchState* parameter
in some parts of the binary tree algorithm,
making it a pure read-only parameter,
as opposed to a mutable state.
This is supposed to be helpful for both maintenance and the compiler.
* Extract out common portion of `lib/Makefile` into `lib/libzstd.mk`.
Most relevantly, the way we find library files.
* Use `lib/libzstd.mk` in the other Makefiles instead of repeating the
same code.
* Add a test `tests/test-variants.sh` that checks that the builds of
`make -C programs allVariants` are correct, and run it in Actions.
* Adds support for ASM files in the CMake build.
The Meson build is not updated because it lists every file in zstd,
and supports ASM off the bat, so the Huffman ASM commit will just add
the ASM file to the list.
The Visual Studios build is not updated because I'm not adding ASM
support to Visual Studios yet.
Test failures showed up on the daily cron job. They didn't show up
in CI because the condition is somewhat rare, and didn't trigger
during the CI tests.
This PR fixes up the logic in `findSynchronizationPoint()` to correctly
handle the edge case. It also un-comments an assert that helps catch the
issue, and verify that rsyncable mode is calculating the correct hash.
After the fix, the test that failed passes:
```
./zstreamtest --newapi -t1 --no-big-tests -s9680
```
In degenerate cases `--rsyncable` could create very small blocks (1
byte). This causes the compressed output to be larger than
`ZSTD_compressBound()`. Fix the issue by ensuring that rsyncable mode
never outputs blocks smaller than 128 KB.
The minimum job size is 512 KB, so we shouldn't lose many
synchronization points from skipping any that cause blocks smaller than
128 KB. And even if we do, that is fine, because we'll find the next
one.
This fixes the `raw_dictionary_round_trip` oss-fuzz assert.
Credit to OSS-Fuzz
better for large files, and sources with relatively "stable" entropy,
like silesia.tar.
slightly worse for files with rapidly changing entropy,
like Calgary.tar/.
Updated small files tests in fuzzer
used to be necessary to counter-balance the fixed-weight frequency update
which has been recently changed for an adaptive rate (targeting stable starting frequency stats).
As a library, the default shouldn't be to write anything on console.
`cover` and `fastcover` have a `g_displayLevel` variable to control this behavior.
It's now set to 0 (no display) by default.
Setting notification to a higher level should be an explicit operation by a console application.
This new setup is slighly better on `silesia.tar` :
Ratio : 3.649 -> 3.655
Speed : 11.9 MB/s -> 12.2 MB/s
At the cost of more memory : 24 MB -> 32 MB
The new memory budget is a reasonable interpolation between neighboring levels 12 and 14:
level 12 : 24 MB
level 13 : 32 MB (increased from 24 MB)
level 14 : 48 MB
Window size remains unaffected (4 MB)
This removes the old `ZSTD_compressBlock_fast_generic()` and renames the new
`ZSTD_compressBlock_fast_generic_pipelined()` to replace it. This is
functionally a no-op.
Unrolling the loop to handle 2 positions in each iteration allows us to reduce
the frequency of some operations that don't need to happen at every position.
One such operation is the step calculation, which is a very rough heuristic
anyways. It's fine if we do this a position later. The other operation is the
repcode check. But since the repcode check already tries expanding back one
position, we're really not missing much of importance by only trying it every
other position.
This commit also slightly reorders some operations.
Amusingly, it seems to be a non-trivial performance hit to add in final
searches or even hash table insertions during cleanup. So let's not. It seems
to not make any meaningful difference in compression ratio.
* Add a Huffman round trip fuzzer
* Fix two minor bugs in Huffman that aren't exposed in zstd
- Incorrect weight comparison (weights are allowed to be equal to
table log).
- HUF_compress1X_usingCTable_internal() can return compressed
size >= source size, so the assert that `cSize <= 65535` isn't
correct, and it needs to be checked instead.
This PR fixes an incorrect comparison in figuring out `minChain` in
`ZSTD_dedicatedDictSearch_lazy_loadDictionary()`. This incorrect comparison
had been masked by the fact that `idx` was always 1, until @terrelln changed
that in #2726.
Credit-to: OSS-Fuzz
The DUBT can be non-deterministic if an index is equal to
`ZSTD_DUBT_UNSORTED_MARK`. Ensure that never happens by starting the
indices at 2.
This bug was found by the OSS-Fuzz determinism fuzzer. With this change
the fuzzer test passes. And I've confirmed that this is the root cause,
not just hiding the problem.
Aside: This took me a long time to figure out, because I thought I had
tried this first thing. But, apparantly I messed it up, because when I
was going through it again with @felixhandte, I was pointing out that it
wasn't the case, but it turns out it was.
Credit to: OSS-Fuzz
Add the libzstd.pc target to the lib target in lib/Makefile, which makes
it inherit LDFLAGS_DYNLIB from the lib-mt target. This allows us to add
a Libs.private field to libzstd.pc which gets conditionally populated
with '-pthread'.
The 1.5.0 release notes mention that the static library isn't
multi-threaded by default, due to concern for people building static
binaries with libzstd:
Now the dynamic library supports multi-threaded compression by
default. Note that this property is not extended to the static
library because doing so would have impacted the build script of
existing client applications (requiring them to add -pthread to their
recipe), thus potentially breaking their build.
To get closer to being able to enable multi-threading for all library
builds by default, this commit makes it so that any libzstd consumer
using pkg-config gets the correct flags.
We also fix the indentation of the rule for libzstd.pc and move it
outside the if/endif block for install rules (which uses a list of OSs
where the rules were validated), so the rule is available for all users
of the 'lib*' targets.
* The block splitter missed a bounds check, so when the buffer is too small it
passes an erroneously large size to `ZSTD_entropyCompressSeqStore()`, which
can then write the compressed data past the end of the buffer. This is a new
regression in v1.5.0 when the block splitter is enabled. It is either enabled
explicitly, or implicitly when using the optimal parser and `ZSTD_compress2()`
or `ZSTD_compressStream*()`.
* `HUF_writeCTable_wksp()` omits a bounds check when calling
`HUF_compressWeights()`. If it is called with `dstCapacity == 0` it will pass
an erroneously large size to `HUF_compressWeights()`, which can then write
past the end of the buffer. This bug has been present for ages. However, I
believe that zstd cannot trigger the bug, because it never calls
`HUF_compress*()` with `dstCapacity == 0` because of [this check][1].
Credit to: Oss-Fuzz
[1]: 89127e5ee2/lib/compress/zstd_compress_literals.c (L100)
* Flatten ZSTD_row_getMatchMask
* Remove the SIMD abstraction layer.
* Add big endian support.
* Align `hashTags` within `tagRow` to a 16-byte boundary.
* Switch SSE2 to use aligned reads.
* Optimize scalar path using SWAR.
* Optimize neon path for `n == 32`
* Work around minor clang issue for NEON (https://bugs.llvm.org/show_bug.cgi?id=49577)
* replace memcpy with MEM_readST
* silence alignment warnings
* fix neon casts
* Update zstd_lazy.c
* unify simd preprocessor detection (#3)
* remove duplicate asserts
* tweak rotates
* improve endian detection
* add cast
there is a fun little catch-22 with gcc: result from pmovmskb has to be cast to uint32_t to avoid a zero-extension
but must be uint16_t to get gcc to generate a rotate instruction..
* more casts
* fix casts
better work-around for the (bogus) warning: unary minus on unsigned
Even with -fvisibility=hidden added to CFLAGS, any symbol which is
given a default visibility attribute ends up exported in the dynamic
library. This happens through zstd_internal.h which defines
..._STATIC_LINKING_ONLY before including various header files, and is
included for example in lib/common/pool.c.
To avoid this, this patch distinguishes static and non-static APIs, by
using ZSTDLIB_API only for the latter, and introducing
ZSTDLIB_STATIC_API for the former. For now, both are exported, but
non-static APIs can be hidden by overriding the definition
ZSTDLIB_STATIC_API. lib/Makefile is modified to allow this using
make CPPFLAGS_DYNLIB=-DZSTDLIB_STATIC_API=ZSTDLIB_HIDDEN
In addition, API declarations are dropped from zstd_compress.c (they
aren't needed there).
Signed-off-by: Stephen Kitt <steve@sk2.org>
Call `ZSTD_enforceMaxDist()` before each block with the beginning of the
block. This ensures that `lowLimit` is updated to `dictLimit` whenever
the ext-dict is out of range, so we can use prefix mode for speed.
This can cause non-determinism because prefix mode and ext-dict mode
match finders can return different results. It can also hurt speed
because ext-dict match finders are slower.
The scenario is:
1. Compress large data with a dictionary.
2. The dictionary goes out of bounds, so we invalidate it.
3. However, we still have `lowLimit < dictLimit`, since it is
never updated.
4. We will call the ext-dict match finder instead of the prefix one.
The repcode checks disallowed repcodes that are equal to `windowLow`.
This is slightly inefficient, but isn't a problem on its own. Together
with the next commit, it cause non-determinism.
`ZSTD_insertBt1()` has a speed optimization that skips the prefix of
very long matches.
40def70387/lib/compress/zstd_opt.c (L476)
This optimization is based off the length longest match found. However,
when indices are reset, we only ensure that we can reference the whole
window starting from `ip`. If the previous block ended with a long match
then `nextToUpdate` could be much less than `ip`. It might be far enough
back that `nextToUpdate < maxDist`, so it doesn't have a full window of
data to reference. This can cause non-determinism bugs, because we may
find a match that is beyond `ip - maxDist`, and may sometimes be
un-referencable, and that match triggers the speed optimization.
The fix is to base the `windowLow` off of the `target` of
`ZSTD_updateTree_internal()`, because anything below that value will be
obsolete by the time `ZSTD_updateTree_internal()` completes.
and restored limit to 256 when in 64-bit mode
(it was reduced to 200 to give more room for 32-bit).
This should fix test instability issues
using lot of threads in 32-bit environments.
When running armv6 userspace on armv8 hardware with a 64 bit Linux kernel,
the mode 2 caused SIGBUS (unaligned memory access).
Running all our arm builds in the build farm
only on armv8 simplifies administration a lot.
Depending on compiler and environment, this change might slow down
memory accesses (did not benchmark it). The original analysis is 6 years old.
Fixes#2632
the new alignment setting is better for gcc-9 and gcc-10
by about ~+5%.
Unfortunately, it's worse for essentially all other compilers.
Make the new alignment setting conditional to gcc-9+.
changed strategy,
now unconditionally prefetch the first 2 cache lines,
instead of cache lines corresponding to the first and last bytes of the match.
This better corresponds to cpu expectation,
which should auto-prefetch following cachelines on detecting the sequential nature of the read.
This is globally positive, by +5%,
though exact gains depend on compiler (from -2% to +15%).
The only negative counter-example is gcc-9.
Linearly back off the frequency of overflow correction based on the
number of times the `ZSTD_window_t` has been overflow corrected. This
will still allow the fuzzer to quickly find overflow correction bugs,
while also keeping good speed for larger inputs.
Additionally, the `nbOverflowCorrections` variable can be useful for
debugging coredumps, since we can inspect the `ZSTD_CCtx` to see if
overflow correction has happened yet.
I've verified this fixes the timeouts in OSS-Fuzz (176 seconds -> 6
seconds). I've also verified that fuzzers and `fuzzer` and `zstreamtest`
still catch the row-hash overflow correction bug.
The FAQ covers the questions asked in Issue #2566. It first covers why
you would want to use a dictionary, then what a dictionary is, and
finally it tells you how to train a dictionary, and clarifies some of
the parameters.
There is definitely more that could be said about some of the advanced
trainers, but this should be a good start.
This flag forces zstd to always load the prefix in ext-dict mode, even
if it happens to be contiguous, to force determinism. It also applies to
dictionaries that are re-processed.
A determinism test case is also added, which fails without
`ZSTD_c_deterministicRefPrefix` and passes with it set.
Question: Should this be the default behavior? It isn't in this PR.
* Take `params` by const reference in `ZSTD_resetCCtx_internal()`.
* Add `simpleApiParams` to the CCtx and use them in the simple API
functions, instead of creating those parameters on the stack.
I think this is a good direction to move in, because we shouldn't need
to worry about adding parameters to `ZSTD_CCtx_params`, since it should
always be on the heap (unless they become absoultely gigantic).
Some `ZSTD_CCtx_params` are still on the stack in the CDict functions,
but I've left them for now, because it was a little more complex, and we
don't use those functions in stack-constrained currently.
pipeline increased from 4 to 8 slots.
This change substantially improves decompression speed when there are long distance offsets.
example with enwik9 compressed at level 22 :
gcc-9 : 947 -> 1039 MB/s
clang-10: 884 -> 946 MB/s
I also checked the "cold dictionary" scenario,
and found a smaller benefit, around ~2%
(measurements are more noisy for this scenario).
Dictionaries larger than `ZSTD_CHUNKSIZE_MAX` used to have to be loaded
in multiple segments. Instead, when we detect large dictionaries, ensure
that we reset the context's indicies. Then, for dictionaries larger than
`ZSTD_CURRENT_MAX - 1`, only load the suffix of the dictionary. Finally,
enable DDS for large dictionaries, since we no longer load in multiple
segments.
This simplifes the dictionary loading code, and reduces opportunities
for non-determinism to slip in.
previous lower limit was 1 MB.
Note : by default, the lowest job size is 2 MB, achieved at level 1.
Even lower job sizes can be achieved by manipulating this value directly,
or manually modifying window sizes to lower amounts.
Updated unit test to ensure that this new limit works fine
(test would fail with previous 1 MB limit).
LDM does especially poorly on repetitive data when that data's hash happens
to have `(hash & stopMask) == 0`. Either because the `stopMask == 0` or
random chance. Optimize this case by skipping over repetitive patterns.
The detection is very simplistic, but should catch most of the offending
cases.
```
head -c 1G /dev/zero | perf stat -- ./zstd -1 -o /dev/null -v --zstd=ldmHashRateLog=1 --long
21.187881087 seconds time elapsed
head -c 1G /dev/zero | perf stat -- ./zstd -1 -o /dev/null -v --zstd=ldmHashRateLog=1 --long
1.149707921 seconds time elapsed
```
Any stable API entry point introduced after v1.0
should be documented with its minimum version number.
Since PR fixes this requirement
updating mostly new entry points since v1.4.0
and newly introduced ones for future v1.5.0.
* Fix overflow correction when `windowLog < cycleLog`. Previously, we
got the correction wrong in this case, and our chain tables and binary
trees would be corrupted. Now, we work as long as `maxDist` is a power
of two, by adding `MAX(maxDist, cycleSize)` to our indices.
* When `ZSTD_WINDOW_OVERFLOW_CORRECT_FREQUENTLY` is defined to non-zero
run overflow correction as frequently as allowed without impacting
compression ratio.
* Enable `ZSTD_WINDOW_OVERFLOW_CORRECT_FREQUENTLY` in `fuzzer` and
`zstreamtest` as well as all the OSS-Fuzz fuzzers. This has a 5-10%
speed penalty at most, which seems reasonable.
`zstd_errors.h` and `zdict.h` are public headers, so they deserve to be
in the root `lib/` directory with `zstd.h`, not mixed in with our private
headers.
Instead of providing a default no-op implementation, check the symbols
for `NULL` before accessing them. Providing a default implementation
doesn't reliably work with dynamic linking. Depending on link order the
default implementations may not be overridden. By skipping the default
implementation, all link order issues are resolved. If the symbols
aren't provided the weak function will be `NULL`.