Matches now use memcpy and memset when possible.
Block loops have been rewritten to be more optimizer friendly.
Reworks Symbol and HuffmanDecoder
* Symbol now only includes the value and number of code bits.
decodeSymbol returns only the value.
* HuffmanDecoder now takes the regular bits instead of the reversed.
* Code table construction now uses buckets instead of sorting.
* For linked codes, the value field of Symbol is now used as the next
index; the actual value is the element index (see the sketch after this
list).
* InvalidCode is now detected only once with a special linked index.
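A rough sketch of the reworked layout; the field names and widths below
are assumptions for illustration, not the actual std.compress.flate
code:

```zig
const Symbol = struct {
    // For ordinary codes this is the decoded value; for linked (long)
    // codes it is the index of the next table entry, and the decoded
    // value is the element's own index instead. A dedicated linked
    // index marks InvalidCode so it is detected only once.
    value: u16,
    // Number of bits this code consumes from the input stream.
    code_bits: u4,
};
```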
Performance is 39.7% faster than before and 1.1% faster than gzip using
a sample created from compressing a tar of the src directory.
-- On the standard library side:
The `input: []const u8` parameter of functions passed to `testing.fuzz`
has changed to `smith: *testing.Smith`. `Smith` is used to generate
values either directly from libfuzzer or from input bytes generated by
libfuzzer.
`Smith` contains the following base methods (illustrated in the example
after this list):
* `value` as a generic method for generating any type
* `eos` for generating end-of-stream markers. Provides the additional
guarantee that `true` will eventually be returned.
* `bytes` for filling a byte array.
* `slice` for filling part of a buffer and providing the length.
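For illustration, a fuzz test using the new interface might look
roughly like this; the method names follow the list above, but the
exact signatures and error handling are assumptions, not the finalized
API:

```zig
const std = @import("std");

test "fuzz with the smith interface" {
    try std.testing.fuzz({}, struct {
        fn run(_: void, smith: *std.testing.Smith) !void {
            var sum: u64 = 0;
            // `eos` is guaranteed to eventually return true.
            while (!smith.eos()) {
                // Generic value generation replaces slicing raw input bytes.
                sum +%= smith.value(u8);
                var buf: [8]u8 = undefined;
                // Fill a byte array directly from the fuzzer.
                smith.bytes(&buf);
            }
            if (sum == 0xDEAD) return error.FoundBug;
        }
    }.run, .{});
}
```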
`Smith.Weight` is used for giving value ranges a higher probability of
being selected. By default, every value has a weight of zero (i.e. it
will not be selected). Weights can only apply to values that fit within
a u64. The above functions have corresponding variants that accept
weights.
Additionally, the following functions are provided:
* `baselineWeights` which provides a set of weights containing every
possible value of a type.
* `eosSimpleWeighted` for giving `true` and `false` their own weights.
* `valueRangeAtMost` and `valueRangeLessThan` for weighting only a range
of values.
-- On the libfuzzer and abi side:
--- Uids
These are u32s which are used to classify requested values. This solves
the problem of a mutation causing a new value to be requested and
shifting all future values; for example:
1. An initial input contains the values 1, 2, 3 which are interpreted
as a, b, and c respectively by the test.
2. The 1 is mutated to a 4, which causes the test to request an extra
value interpreted as d. The input is now 4, 2, 3, 5 (new value), which
the test interprets as a, d, b, and c; however, b and c no longer
correspond to their original values.
Uids contain a hash component and type component. The hash component
is currently determined in `Smith` by taking a hash of the calling
`@returnAddress()` or via an argument in the corresponding `WithHash`
functions. The type component is used extensively in libfuzzer with its
hashmaps.
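As a sketch of the idea (the layout, widths, and hash below are
assumptions; the actual abi is not specified here):

```zig
const std = @import("std");

// A uid pairs a hash of the requesting call site with a tag for the
// kind of value requested, so inserting a new request does not shift
// the identity of later requests.
const Uid = packed struct(u32) {
    site_hash: u26,
    type_tag: u6,
};

fn makeUid(return_address: usize, type_tag: u6) Uid {
    // Hash the call site so the same request site keeps the same uid
    // even when other requests are added or removed by a mutation.
    const h = std.hash.Wyhash.hash(0, std.mem.asBytes(&return_address));
    return .{ .site_hash = @truncate(h), .type_tag = type_tag };
}
```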
--- Mutations
At the start of a cycle (a run), a random number of values to mutate is
selected, with smaller counts being exponentially more likely. The
indexes of the values to mutate are chosen within a selected uid, with a
logarithmic bias toward uids that have more values.
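Purely to illustrate the bias (this is not the libfuzzer code), picking
a count where smaller values are exponentially more likely can be as
simple as:

```zig
const std = @import("std");

fn pickMutationCount(rng: std.Random, max: u32) u32 {
    var n: u32 = 1;
    // Each additional mutation only happens with probability 1/2, so a
    // count of k is roughly 2^-k likely (capped at max).
    while (n < max and rng.boolean()) n += 1;
    return n;
}
```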
Mutations may change a single value, several consecutive values in a
uid, or several consecutive values in the uid-independent order they
were requested. They may generate random values, mutate from previous
ones, or copy from other values in the same uid, either from the same
input or spliced from another.
For integers, mutating from a previous value currently just generates a
random value. For bytes, mutating from previous values mixes new random
data with the previous bytes using a set number of mutations.
--- Passive Minimization
A different approach has been taken for minimizing inputs: instead of
trying a fixed set of mutations when a fresh input is found, the input
is simply added to the corpus and removed when it is no longer
valuable.
The quality of an input is measured based on how many unique pcs it hit
and how many values it needed from the fuzzer. For each pc, the inputs
with the best qualities are tracked: those hitting the minimum and
maximum number of unique pcs while needing the fewest values.
Once all of an input's qualities have been superseded for the pcs it
hit, it is removed from the corpus.
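A sketch of the bookkeeping this describes; the type and field names
are illustrative, not the actual fuzzer data structures:

```zig
const Quality = struct {
    unique_pcs: u32, // how many unique pcs the input hit
    values_needed: u32, // how many values it requested from the fuzzer
};

const PcBest = struct {
    // Corpus indexes of the inputs that hit this pc with the fewest and
    // the most unique pcs while needing the fewest values. An input no
    // longer referenced by any pc has been superseded and is removed.
    fewest_pcs_input: usize,
    most_pcs_input: usize,
};
```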
-- Comparison to byte-based smith
A byte-based smith would be far less efficient and more complex than
this solution. It would be unable to solve the shifting problem that
Uids do, and it would be unable to provide values from the fuzzer past
end-of-stream. Even with feedback, it would be unable to act on dynamic
weights, which have proven essential with the updated tests (e.g. to
constrain values to a range).
-- Test updates
All the standard library tests have been updated to use the new smith
interface. For `Deque`, an ad hoc allocator was written to improve
performance and remove reliance on heap allocation. `TokenSmith` has
been added to aid in testing Ast and help inform decisions on the smith
interface.
After fetching a package and applying the filter by deleting files that
are not part of the hash, a recompressed $GLOBAL_CACHE/p/$PKG_HASH.tar.gz
is now created.
Checking this cache before fetching network URLs is not yet implemented.
The call to `rebase` in `discardIndirect` and `discardDirect` was inappropriate. As `rebase` expects the `capacity` parameter to exclude the sliding window, this call was asking for ANOTHER `d.window_len` bytes. This was impossible to fulfill with a buffer smaller than 2*`d.window_len`, and caused [#25764](https://github.com/ziglang/zig/issues/25764).
This PR adds a basic test to do a discard (which does trigger [#25764](https://github.com/ziglang/zig/issues/25764)), and rebases only as much as is required to make the discard succeed ([or no rebase at all](https://github.com/ziglang/zig/issues/25764#issuecomment-3484716253)). That means: ideally rebase to fit `limit`, or if the buffer is too small, as much as possible.
I must say, `discardDirect` does not make much sense to me, but I replaced it anyway. `rebaseForDiscard` works fine with `d.reader.buffer.len == 0`. Let me know if anything should be changed.
Reviewed-on: https://codeberg.org/ziglang/zig/pulls/30891
Reviewed-by: Andrew Kelley <andrew@ziglang.org>
Co-authored-by: mercenary <mercenary@noreply.codeberg.org>
Co-committed-by: mercenary <mercenary@noreply.codeberg.org>
Implements deflate compression from scratch. A history window is kept in
the writer's buffer for matching and a chained hash table is used to
find matches. Tokens are accumulated until a threshold is reached and
then output as a block. Flush is used to indicate the end of the stream.
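A minimal sketch of the chained hash table idea, assuming a 32 KiB
window; the constants, hash, and structure below are illustrative
rather than the actual std.compress.flate implementation:

```zig
const std = @import("std");

const window_len = 32 * 1024;
const hash_bits = 15;

const Matcher = struct {
    // Newest window position recorded for each hash bucket.
    head: [1 << hash_bits]u32 = [_]u32{0} ** (1 << hash_bits),
    // Previous position that hashed to the same bucket, forming a chain.
    prev: [window_len]u32 = [_]u32{0} ** window_len,

    fn hash(bytes: *const [4]u8) u32 {
        const v = std.mem.readInt(u32, bytes, .little);
        return (v *% 0x9E3779B1) >> (32 - hash_bits);
    }

    // Record `pos` so later lookups can walk the chain of earlier
    // positions whose next four bytes hashed to the same bucket.
    fn insert(m: *Matcher, pos: u32, bytes: *const [4]u8) void {
        const h = hash(bytes);
        m.prev[pos % window_len] = m.head[h];
        m.head[h] = pos;
    }

    // Returns the most recent candidate position; follow `prev` from
    // there to enumerate older candidates within the window.
    fn newestCandidate(m: *const Matcher, bytes: *const [4]u8) u32 {
        return m.head[hash(bytes)];
    }
};
```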
Additionally, two other deflate writers are provided:
* `Raw` writes only in store blocks (the uncompressed bytes). It
utilizes data vectors to efficiently send block headers and data.
* `Huffman` only performs Huffman compression on data and no matching.
The above are also able to take advantage of writer semantics since they
do not need to keep a history.
Literal and distance code parameters in `token` have also been reworked.
Their parameters are now derived mathematically; however, the more
expensive ones are still obtained through a lookup table (except on
ReleaseSmall).
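For example, the deflate distance code can be derived from a base-2
logarithm instead of a table lookup. This is a sketch of the general
technique, not the actual `token` code:

```zig
const std = @import("std");

/// Distance code (0..29) for a deflate match distance in 1..32768,
/// computed from floor(log2(d - 1)) rather than a lookup table.
fn distanceCode(d: u16) u5 {
    std.debug.assert(d >= 1 and d <= 32768);
    if (d <= 4) return @intCast(d - 1);
    const x: u16 = d - 1;
    const log: u4 = @intCast(15 - @clz(x)); // floor(log2(d - 1))
    return @intCast(2 * @as(u5, log) + ((x >> (log - 1)) & 1));
}

test distanceCode {
    try std.testing.expectEqual(@as(u5, 0), distanceCode(1));
    try std.testing.expectEqual(@as(u5, 7), distanceCode(13));
    try std.testing.expectEqual(@as(u5, 29), distanceCode(32768));
}
```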
Decompression bit reading has been greatly simplified, taking advantage
of the ability to peek on the underlying reader. Additionally, a few
bugs with limit handling have been fixed.
The lzma2 Decoder already checks whether decoding is finished inside the
process function. `range_decoder` being finished does not mean the
decoder has finished; `ld.rep[0] == 0xFFFF_FFFF` also needs to be
checked, which was already done inside the process function. This fix
deletes the redundant `isFinish()` check for `range_decoder`.
missing these things:
- implementation of finish()
- detect packed bytes read for check and block padding
- implementation of discard()
- implementation of block stream checksum
Previously, an index out of bounds could occur when copying match_length bytes while decoding whichever sequence happened to overflow `dest`. Now, each sequence checks that there is enough room for the full sequence_length (literal_length + match_length) before doing any copying.
Fixes the failing inputs found here: https://github.com/ziglang/zig/issues/24817#issuecomment-3192927715
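In spirit, the added guard amounts to the following; the function and
error names here are illustrative, not the actual std.compress.zstd
identifiers:

```zig
// Verify the whole sequence fits before any bytes are copied, rather
// than overflowing `dest` partway through the match copy.
fn ensureSequenceFits(
    dest_remaining: usize,
    literal_length: usize,
    match_length: usize,
) error{MalformedBlock}!void {
    if (literal_length + match_length > dest_remaining)
        return error.MalformedBlock;
}
```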
* std.Io.Reader: appendRemaining no longer supports alignment and has
different rules about exceeding the limit. Fixed a bug where it would
return success instead of error.StreamTooLong as it was supposed to.
* std.Io.Reader: simplify appendRemaining and appendRemainingUnlimited
to be implemented based on std.Io.Writer.Allocating
* std.Io.Writer: introduce unreachableRebase
* std.Io.Writer: remove minimum_unused_capacity from Allocating. maybe
that flexibility could have been handy, but let's see if anyone
actually needs it. The field is redundant with the superlinear growth
of ArrayList capacity.
* std.Io.Writer: growingRebase also ensures total capacity on the
preserve parameter, making it no longer necessary to do
ensureTotalCapacity at the usage site of decompression streams.
* std.compress.flate.Decompress: fix rebase not taking into account seek
* std.compress.zstd.Decompress: split into "direct" and "indirect" usage
patterns depending on whether a buffer is provided to init, matching
how flate works. Remove some overzealous asserts that prevented buffer
expansion from within rebase implementation.
* std.zig: fix readSourceFileToAlloc returning an overaligned slice
which was difficult to free correctly.
fixes #24608
The previous code assumed that `initFrame` during the `new_frame` state would always result in the `in_frame` state, but that's not always the case. `initFrame` can also result in the `skippable_frame` state, which would lead to access of union field 'in_frame' while field 'skipping_frame' is active.
Now, the switch is re-entered with the updated state so either case is handled appropriately.
Fixes the crashes from https://github.com/ziglang/zig/issues/24817
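A minimal sketch of that control flow: the skippable-frame magic check
is standard zstd framing, but the types and functions below are
simplified stand-ins for the actual decompressor:

```zig
const std = @import("std");

const State = enum { new_frame, in_frame, skippable_frame };

fn initFrame(magic: u32) State {
    // Skippable-frame magic numbers begin a skippable frame, not a
    // regular data frame.
    return if (magic & 0xFFFF_FFF0 == 0x184D_2A50) .skippable_frame else .in_frame;
}

fn step(magic: u32) []const u8 {
    var state: State = .new_frame;
    // Re-enter the switch with whatever state initFrame produced
    // instead of assuming it is always .in_frame.
    while (true) switch (state) {
        .new_frame => state = initFrame(magic),
        .in_frame => return "decode frame",
        .skippable_frame => return "skip frame",
    };
}

test "skippable frames do not take the in_frame path" {
    try std.testing.expectEqualStrings("skip frame", step(0x184D2A50));
    try std.testing.expectEqualStrings("decode frame", step(0xFD2FB528));
}
```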
Previously, the "allow EndOfStream" part of this logic was too permissive. If there are a few dangling bytes at the end of the stream, that should be treated as a bad magic number. The only case where EndOfStream is allowed is when the stream is truly at the end, with exactly zero bytes available.
This "get" is useless noise and was copied from FixedBufferWriter.
Since this API has not yet landed in a release, now is a good time
to make the breaking change to fix this.
I don't see why the byte returned from specialPeek needs to be shifted
by remaining_needed_bits.
I believe the decision in specialPeek should be based on the number of
remaining bits, not on the content of those bits.
Some test results changed, but they are now consistent with the original
state as found in:
https://github.com/ziglang/zig/blame/5f790464b0d5da3c4c1a7252643e7cdd4c4b605e/lib/std/compress/flate/Decompress.zig
Changing Bits from usize to u32 or u64 now produces the same results.
* flate: simplify peekBitsEnding
`peekBits` returns at most the asked-for number of bits. It fails with
EndOfStream when there are no bits available. If fewer bits are
available than asked for, it still returns the available bits.
Hopefully this change better reflects the intention. On the first input
stream peek error, we break out of the loop.
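As a toy illustration of that contract (not the actual bit reader):
peeking returns fewer bits than requested near the end of input and
errors only when nothing at all is available.

```zig
const std = @import("std");

// Return up to `n` leading bits from `available`, failing only when no
// bits remain at all.
fn peekUpTo(available: []const u1, n: usize) error{EndOfStream}![]const u1 {
    if (available.len == 0) return error.EndOfStream;
    return available[0..@min(n, available.len)];
}

test "fewer bits than asked for near end of stream" {
    const bits = [_]u1{ 1, 0, 1 };
    try std.testing.expectEqual(@as(usize, 3), (try peekUpTo(&bits, 8)).len);
    try std.testing.expectError(error.EndOfStream, peekUpTo(bits[3..], 8));
}
```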