## Summary of changes
+ Make adjustments to the `allocator` field and ensure the below tests pass:
```sh
zig test lib/std/std.zig --zig-lib-dir lib
zig build test-std -Dno-matrix --summary all
```
+ Rename `add` to `push` and `remove` to `pop` in methods and tests
+ Incorporate the functionality of `pop` in `popOrNull`, then rename the `popOrNull` to `pop` and update tests
+ Use `.empty` to set default field values and rename the `init` method to `initContext`
+ Improve variable types in tests: min heap uses the less than context function and max heap uses greater than context function
+ Remove the `dump` method as its not being used anywhere
+ Document methods `clearRetainingCapacity`, `clearAndFree`, `update`, and `ensureTotalCapacityPrecise`
Closes https://codeberg.org/ziglang/zig/issues/31298
Reviewed-on: https://codeberg.org/ziglang/zig/pulls/31299
Reviewed-by: Andrew Kelley <andrew@ziglang.org>
Co-authored-by: Saurabh Mishra <saurabh.m@proton.me>
Co-committed-by: Saurabh Mishra <saurabh.m@proton.me>
Previously resetting with `retain_capacity < @sizeOf(Node)` would create
an invalid node. This is now fixed, plus `Node.size` now has its own `Size`
type that provides additional safety via assertions to prevent bugs like
this in the future.
This is achieved by bumping `end_index` by a large enough amount so that
a suitably aligned region of memory can always be provided. The potential
wasted space this creates is then recovered by a single cmpxchg. This is
always successful for single-threaded arenas which means that this version
still behaves exactly the same as the old single-threaded implementation
when only being accessed by one thread at a time. It can however fail when
another thread bumps `end_index` in the meantime. The observerd failure
rates under extreme load are:
2 Threads: 4-5%
3 Threads: 13-15%
4 Threads: 15-17%
5 Threads: 17-18%
6 Threads: 19-20%
7 Threads: 18-21%
This version offers ~25% faster performance under extreme load from 7 threads,
with diminishing speedups for less threads. The performance for 1 and 2
threads is nearly identical.
Quoting Vol. 2B 4-590 of [Intel SSA manual][1]:
> Bit 3 of the immediate byte controls processor behavior for a
precision exception <...>
The Direction struct is incorrectly sized to 4 bits, which pushes the
precision exception bit to the reserved bits, which helpfully crashes
under valgrind (but works on real hardware, since CPU ignores it):
``
const std = @import("std");
noinline fn ceil(x: f32) f32 {
return @ceil(x);
}
test "ceil" {
try std.testing.expectEqual(@as(f32, 2.0), ceil(1.5));
}
```
With Zig 0.15.1:
```
$ zig test -fno-llvm -mcpu x86_64_v3 -fvalgrind ceil_test.zig --test-cmd valgrind --test-cmd --quiet --test-cmd-bin
Illegal instruction at address 0x102d14d
/home/motiejus/code/ceil_test.zig:4:5: 0x102d14d in ceil (ceil_test.zig)
return @ceil(x);
^
/home/motiejus/code/ceil_test.zig:8:52: 0x102d097 in test.ceil (ceil_test.zig)
try std.testing.expectEqual(@as(f32, 2.0), ceil(1.5));
^
/nix/store/n81qdpd96d86ngfv5bqymy26b8kqm654-zig-0.15.1/lib/compiler/test_runner.zig:218:25: 0x1167e80 in mainTerminal (test_runner.zig)
if (test_fn.func()) |_| {
^
/nix/store/n81qdpd96d86ngfv5bqymy26b8kqm654-zig-0.15.1/lib/compiler/test_runner.zig:66:28: 0x11610a1 in main (test_runner.zig)
return mainTerminal();
^
/nix/store/n81qdpd96d86ngfv5bqymy26b8kqm654-zig-0.15.1/lib/std/start.zig:618:22: 0x115ae3d in posixCallMainAndExit (std.zig)
root.main();
^
/nix/store/n81qdpd96d86ngfv5bqymy26b8kqm654-zig-0.15.1/lib/std/start.zig:232:5: 0x115a6d1 in _start (std.zig)
asm volatile (switch (native_arch) {
^
???:?:?: 0x0 in ??? (???)
error: the following test command crashed:
valgrind --quiet /home/motiejus/.cache/zig/o/7cc564b5410fc3facdb6e8768643074e/test
```
Zig 0.15.1 with this patch:
```
zig/zig3 test -fno-llvm -mcpu x86_64_v3 -fvalgrind ceil_test.zig --test-cmd valgrind --test-cmd --quiet --test-cmd-bin
All 1 tests passed.
```
Looking through `CodeGen.zig` it _seems_ like the same RoundMode struct
is used for vroundss/vroundps/vcvtps2ph, so update the comment while
we're at it. But I am only 90% sure it's true.
[1]: https://cdrdv2.intel.com/v1/dl/getContent/671200
Matches now use memcpy and memset when possible.
Block loops have been rewritten to be more optimizer friendly.
Reworks Symbol and HuffmanDecoder
* Symbol now only includes the value and number of code bits.
decodeSymbol returns only the value.
* HuffmanDecoder now takes the regular bits instead of the reversed.
* Code table construction now uses buckets instead of sorting.
* For linked codes, the value field of Symbol is now used as the next
index. The actual value is the element index.
* InvalidCode is now detected only once with a special linked index.
Performance is 39.7% faster than before and 1.1% faster than gzip using
a sample created from compressing a tar of the src directory.
Modifies the `Allocator` implementation provided by `ArenaAllocator` to be
threadsafe using only atomics and no synchronization primitives locked
behind an `Io` implementation.
At its core this is a lock-free singly linked list which uses CAS loops to
exchange the head node. A nice property of `ArenaAllocator` is that the
only functions that can ever remove nodes from its linked list are `reset`
and `deinit`, both of which are not part of the `Allocator` interface and
thus aren't threadsafe, so node-related ABA problems are impossible.
There *are* some trade-offs: end index tracking is now per node instead of
per allocator instance. It's not possible to publish a head node and its
end index at the same time if the latter isn't part of the former.
Another compromise had to be made in regards to resizing existing nodes.
Annoyingly, `rawResize` of an arbitrary thread-safe child allocator can
of course never be guaranteed to be an atomic operation, so only one
`alloc` call can ever resize at the same time, other threads have to
consider any resizes they attempt during that time failed. This causes
slightly less optimal behavior than what could be achieved with a mutex.
The LSB of `Node.size` is used to signal that a node is being resized.
This means that all nodes have to have an even size.
Calls to `alloc` have to allocate new nodes optimistically as they can
only know whether any CAS on a head node will succeed after attempting it,
and to attempt the CAS they of course already need to know the address of
the freshly allocated node they are trying to make the new head.
The simplest solution to this would be to just free the new node again if
a CAS fails, however this can be expensive and would mean that in practice
arenas could only really be used with a GPA as their child allocator. To
work around this, this implementation keeps its own free list of nodes
which didn't make their CAS to be reused by a later `alloc` invocation.
To keep things simple and avoid ABA problems the free list is only ever
be accessed beyond its head by 'stealing' the head node (and thus the
entire list) with an atomic swap. This makes iteration and removal trivial
since there's only ever one thread doing it at a time which also owns all
nodes it's holding. When the thread is done it can just push its list onto
the free list again.
This implementation offers comparable performance to the previous one when
only being accessed by a single thread and a slight speedup compared to
the previous implementation wrapped into a `ThreadSafeAllocator` up to ~7
threads performing operations on it concurrently.
(measured on a base model MacBook Pro M1)
This is the same as for Edwards25519 without the y coordinate,
since it returns Montgomery coordinates, but it can be confusing
to call the Edwards25519 function while working on the
Curve25519 representation.
New protocols such as CPACE requires the map over Curve25519.
Linux's approach to mapping the main thread's stack is quite odd: it essentially
tries to select an mmap address (assuming unhinted mmap calls) which do not
cover the region of virtual address space into which the stack *would* grow
(based on the stack rlimit), but it doesn't actually *prevent* those pages from
being mapped. It also doesn't try particularly hard: it's been observed that the
first (unhinted) mmap call in a simple application is usually put at an address
which is within a gigabyte or two of the stack, which is close enough to make
issues somewhat likely. In particular, if we get an address which is close-ish
to the stack, and then `mremap` it without the MAY_MOVE flag, we are *very*
likely to map pages in this "theoretical stack region". This is particularly a
problem on loongarch64, where the initial mmap address is empirically only
around 200 megabytes from the stack (whereas on most other 64-bit targets it's
closer to a gigabyte).
To work around this, we just need to avoid mremap in some cases. Unfortunately,
this system call isn't used too heavily by musl or glibc, so design issues like
this can and do exist without being caught. So, when `PageAllocator.resize` is
called, let's not try to `mremap` to grow the pages. We can still call `mremap`
in the `PageAllocator.remap` path, because in that case we can set the
`MAY_MOVE` flag, which empirically appears to make the Linux kernel avoid the
problematic "theoretical stack region".
The old logic was fine for targets where the stack grows up (so, literally just
hppa), but problematic on targets where it grows down, because we could hint
that we wanted an allocation to happen in an area of the address space that the
kernel expects to be able to expand the stack into. The kernel is happy to
satisfy such a hint despite the obvious problems this leads to later down the
road.
Co-authored-by: rpkak <rpkak@noreply.codeberg.org>
This reverts commit e314dadb01.
This idea requires more consideration before committing to it.
At the very least let's not regress autodocs.
Closes#31230