Now that zig1.wasm is updated, apply the matching standard library
changes.
The main one is in `std.builtin.Type`, where `alignment` fields now have
type `?usize` rather than `comptime_int`.
Additionally, we need to add explicit backing integers to some packed
unions, because (due to https://github.com/ziglang/zig/issues/24714)
they need explicit backing integers to be used in `extern` contexts.
This change could not happen before now, because prior to this branch,
packed unions did not allow explicit backing integer types (that is,
this branch implemented https://github.com/ziglang/zig/issues/25350).
I previously wrote some weird code in the compiler frontend solely
because the LLVM backend has some weird requirements, but the better
solution is to avoid those requirements. This commit does that by
introducing "alignment forward references" to `std.zig.llvm.Builder`.
Much like debug forward references, they allow you to reference an
alignment value which will be populated at a later time (and which can
be updated many times, which is important for incremental compilation).
Then, when we want to reference a type's ABI alignment while the type is
not necessarily resolved (required for `@"align"` attributes on function
parameters and function call arguments), we create a forward reference
and use `link.ConstPool` to populate it when ready.
This allows us to remove from the compiler frontend some extremely
arbitrary calls to `Sema.ensureLayoutResolved`, so that the language
specification is not being built around the particular needs of our
compiler implementation's LLVM code generation backend.
The change in codegen/x86_64/CodeGen.zig was not strictly necessary (the
Sema change I did solves the error I was getting there); I just think
it's better style anyway.
The goal of these changes is to allow the C backend to support the new
lazier type resolution system implemented by the frontend. This required
a full rewrite of the `CType` abstraction, and major changes to the C
backend "linker".
The `DebugConstPool` abstraction introduced in a previous commit turns
out to be useful for the C backend to codegen types. Because this use
case is not debug information but rather general linking (albeit when
targeting an unusual object format), I have renamed the abstraction to
`ConstPool`. With it, the C linker is told when a type's layout becomes
known, and can at that point generate the corresponding C definitions,
rather than deferring this work until `flush`.
The work done in `flush` is now more-or-less *solely* focused on
collecting all of the buffers into a big array for a vectored write.
This does unfortunately involve a non-trivial graph traversal to emit
type definitions in an appropriate order, but it's still quite fast in
practice, and it operates on fairly compact dependency data. We don't
generate the actual type *definitions* in `flush`; that happens during
compilation using `ConstPool` as discussed above. (We do generate the
typedefs for underaligned types in `flush`, but that's a trivial amount
of work in most cases.)
`CType` is now an ephemeral type: it is created only when we render a
type (the logic for which has been pushed into just 2 or 3 functions in
`codegen.c`---most of the backend now operates on unmolested Zig `Type`s
instead). C types are no longer stored in a "pool", although the type
"dependencies" of generated C code (that is, the struct, unions, and
typedefs which the generated code references) are tracked (in some
simple hash sets) and given to the linker so it can codegen the types.
Most importantly, adds support for `DW_TAG_typedef` to `llvm.Builder`,
and uses it to define error sets and optional pointers/errors.
Also deletes some random dead code I found.
The LLVM backend can now run the behavior tests and standard library
tests, like the x86_64 backend can. This commit required me to make a
lot of changes to how the LLVM backend lowers debug information, and
while I was doing that, I improved a few things:
* `anyerror` is now an enum type (and other error sets just wrap it), so
error values appear by name in debuggers
* Fixed broken lowering for tagged unions with zero-width payloads
* Associate container types with source locations in all cases
* Avoid depending on the order of type resolution (using the new
`DebugConstPool` abstraction), so debug information will contain all
available type information rather than just the subset which happens
to be resolved when the backend lowers that debug type
`std.process.spawn`: remove the TODO for nonblocking file stdio and document the behavior.
Fix a bug in Io.Uring.dup2 where the function does not return on success.
Reviewed-on: https://codeberg.org/ziglang/zig/pulls/31379
Reviewed-by: Andrew Kelley <andrew@ziglang.org>
Co-authored-by: breakmit <breakmit@noreply.codeberg.org>
Co-committed-by: breakmit <breakmit@noreply.codeberg.org>
Replace the O(n) shift-subtract loop with a constant-time trial
quotient approach (Knuth Algorithm D, TAOCP Vol 2 Section 4.3.1).
The old code iterates clz(b_hi)-clz(a_hi)+1 times (up to 64
iterations of 128-bit arithmetic). The new code uses a single
divwide call to get a trial quotient, then verifies with two
native-width widening multiplies.
Benchmark (Apple M1, ReleaseFast):
- Large divisor, large shift: 87ns -> 7.5ns (11.5x faster)
- Small divisor / uniform: unchanged
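The trial-quotient idea can be sketched as follows. This is a minimal illustration, not the code from this commit: the function name `divLargeDivisor` is hypothetical, and it leans on Zig's built-in `u128` operators where the real implementation would use a single wide divide and native-width widening multiplies. It follows the well-known normalize, estimate, correct pattern: the estimate is made from the top 64 bits of the normalized divisor, then lowered by one and adjusted upward with a single comparison.

```zig
const std = @import("std");

/// Hypothetical helper (not the commit's code): computes u / v for a
/// divisor whose high 64 bits are nonzero.
fn divLargeDivisor(u: u128, v: u128) u128 {
    const v_hi: u64 = @truncate(v >> 64);
    std.debug.assert(v_hi != 0);

    // Normalize so that the top bit of the divisor is set.
    const n: u7 = @clz(v_hi); // in 0..63 because v_hi != 0
    const v1: u64 = @truncate((v << n) >> 64);

    // Halve the dividend so that the trial quotient fits in 64 bits.
    const u_half = u >> 1;
    const q1: u128 = u_half / v1;

    // Undo the normalization and the halving.
    var q: u128 = (q1 << n) >> 63;

    // Lower the estimate by one so it is exact or one too small, then
    // fix it up with one multiply and one comparison.
    if (q != 0) q -= 1;
    if (u - q * v >= v) q += 1;
    return q;
}
```

The number of steps does not depend on the distance between the leading bits of the operands, which is what removes the loop.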
The dependency on advapi32.dll actually silently brings along 3 other dlls at runtime (msvcrt.dll, sechost.dll, bcrypt.dll), even if no advapi32 APIs are called. So, this commit actually reduces the number of dlls loaded at runtime by 4 (but only when LLVM is not linked, since LLVM has its own dependency on advapi32.dll).
The data is not super conclusive, but the ntdll version of WindowsSdk appears to run slightly faster than the previous advapi32 version:
Benchmark 1: libc-ntdll.exe ..
Time (mean ± σ): 6.0 ms ± 0.6 ms [User: 3.9 ms, System: 7.1 ms]
Range (min … max): 4.8 ms … 7.9 ms 112 runs
Benchmark 2: libc-advapi32.exe ..
Time (mean ± σ): 7.2 ms ± 0.5 ms [User: 5.4 ms, System: 9.2 ms]
Range (min … max): 6.1 ms … 8.9 ms 103 runs
Summary
'libc-ntdll.exe ..' ran
1.21 ± 0.15 times faster than 'libc-advapi32.exe ..'
and this mostly seems to be due to changes in the implementation (the advapi32 APIs do a lot of NtQueryKey calls that the new implementation doesn't do) rather than due to the decrease in dll loading. LLVM-less zig binaries don't show the same reduction (the only difference here is the DLLs being loaded):
Benchmark 1: stage4-ntdll\bin\zig.exe version
Time (mean ± σ): 3.0 ms ± 0.6 ms [User: 5.3 ms, System: 4.8 ms]
Range (min … max): 1.3 ms … 4.2 ms 112 runs
Benchmark 2: stage4-advapi32\bin\zig.exe version
Time (mean ± σ): 3.5 ms ± 0.6 ms [User: 6.9 ms, System: 5.5 ms]
Range (min … max): 2.5 ms … 5.9 ms 111 runs
Summary
'stage4-ntdll\bin\zig.exe version' ran
1.16 ± 0.28 times faster than 'stage4-advapi32\bin\zig.exe version'
---
With the removal of the advapi32 dependency, the non-ntdll dependencies that remain in an LLVM-less Zig binary are ws2_32.dll (which brings along rpcrt4.dll at runtime), kernel32.dll (which brings along kernelbase.dll at runtime), and crypt32.dll (which brings along ucrtbase.dll at runtime).
The following assertions fail on non-Linux platforms after c0c2010535
which inserted padding based on musl definitions. This padding only
exists on musl to work around a discrepancy between the POSIX API and
Linux ABI, and is incorrect on other POSIX operating systems.
This change makes the padding musl-only, and documents the reason it
exists. With this change, the assertions pass on Linux and FreeBSD
targets. The corresponding definitions on other targets line up with the
POSIX and FreeBSD ones, so they should work there too.
```zig
const std = @import("std");
const assert = std.debug.assert;
const msghdr = std.c.msghdr;
const cmsghdr = std.c.cmsghdr;

const c = @cImport({
    @cInclude("sys/socket.h");
});

comptime {
    assert(@offsetOf(msghdr, "iovlen") == @offsetOf(c.msghdr, "msg_iovlen"));
    assert(@offsetOf(msghdr, "controllen") == @offsetOf(c.msghdr, "msg_controllen"));
    assert(@offsetOf(msghdr, "control") == @offsetOf(c.msghdr, "msg_control"));
    assert(@offsetOf(msghdr, "flags") == @offsetOf(c.msghdr, "msg_flags"));
    assert(@sizeOf(msghdr) == @sizeOf(c.msghdr));
    assert(@offsetOf(cmsghdr, "len") == @offsetOf(c.cmsghdr, "cmsg_len"));
    assert(@offsetOf(cmsghdr, "level") == @offsetOf(c.cmsghdr, "cmsg_level"));
    assert(@sizeOf(cmsghdr) == @sizeOf(c.cmsghdr));
}
```
There were good reviews made after #31365 was merged, so this commit
addresses them separately.
1. Assert that the number is greater than zero
2. Use `constants` instead of calculating constants manually
3. Use `Const.bitCountAbs` for log2
While the general guidance remains useful, it is not the case that
error.Canceled will always pass across the Group task function boundary.
Remove the too-aggressive assertions and add unit test coverage.
Closes #30096
Closes #31340
Closes #31358
Remove the `{D}` format specifier. It is moved into `std.Io.Duration` as
a format method.
Migration plan:
```diff
-writer.print("{D}", .{ns});
+writer.print("{f}", .{std.Io.Duration{ .nanoseconds = ns }});
```
All instances where `{D}` was used have been changed to use
`std.Io.Duration` and `{f}`.
Fixes #31281
and make the return value of `cancel` return queue items.
I don't think it's possible to make `cancel` not deadlock with an empty
queue buffer without introducing a new Group primitive.
This is the best I could come up with based on existing primitives.
Let's see if applications find these APIs palatable.
## Summary of changes
+ Make adjustments to the `allocator` field and ensure the below tests pass:
```sh
zig test lib/std/std.zig --zig-lib-dir lib
zig build test-std -Dno-matrix --summary all
```
+ Rename `add` to `push` and `remove` to `pop` in methods and tests
+ Incorporate the functionality of `pop` into `popOrNull`, then rename `popOrNull` to `pop` and update tests
+ Use `.empty` to set default field values and rename the `init` method to `initContext`
+ Improve variable types in tests: min heap uses the less than context function and max heap uses greater than context function
+ Remove the `dump` method as it's not used anywhere
+ Document methods `clearRetainingCapacity`, `clearAndFree`, `update`, and `ensureTotalCapacityPrecise`
Closes https://codeberg.org/ziglang/zig/issues/31298
Reviewed-on: https://codeberg.org/ziglang/zig/pulls/31299
Reviewed-by: Andrew Kelley <andrew@ziglang.org>
Co-authored-by: Saurabh Mishra <saurabh.m@proton.me>
Co-committed-by: Saurabh Mishra <saurabh.m@proton.me>
Previously resetting with `retain_capacity < @sizeOf(Node)` would create
an invalid node. This is now fixed, plus `Node.size` now has its own `Size`
type that provides additional safety via assertions to prevent bugs like
this in the future.
This is achieved by bumping `end_index` by a large enough amount so that
a suitably aligned region of memory can always be provided. The potential
wasted space this creates is then recovered by a single cmpxchg. This is
always successful for single-threaded arenas which means that this version
still behaves exactly the same as the old single-threaded implementation
when only being accessed by one thread at a time. It can however fail when
another thread bumps `end_index` in the meantime. The observed failure
rates under extreme load are:
2 Threads: 4-5%
3 Threads: 13-15%
4 Threads: 15-17%
5 Threads: 17-18%
6 Threads: 19-20%
7 Threads: 18-21%
This version offers ~25% faster performance under extreme load from 7 threads,
with diminishing speedups for fewer threads. The performance for 1 and 2
threads is nearly identical.
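A minimal sketch of the over-reserve-then-recover scheme described above, assuming a hypothetical `Bump` type that hands out offsets rather than pointers and omits capacity checks (the real `ArenaAllocator` node layout is not shown in this message). The worst-case padding is reserved with one atomic add, and the unused tail is handed back with one compare-exchange that only succeeds if no other thread has bumped `end_index` since.

```zig
const std = @import("std");

/// Hypothetical illustration type; not the ArenaAllocator implementation.
const Bump = struct {
    end_index: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),

    /// Returns the offset of an aligned region of `len` bytes.
    /// `alignment` must be a power of two.
    fn alloc(self: *Bump, len: usize, alignment: usize) usize {
        // Reserve enough for the worst-case alignment padding.
        const reserve = len + alignment - 1;
        const start = self.end_index.fetchAdd(reserve, .monotonic);

        const aligned = (start + alignment - 1) & ~(alignment - 1);
        const used_end = aligned + len;

        // Try to return the unused tail. This fails (harmlessly) if
        // another thread bumped `end_index` in the meantime.
        _ = self.end_index.cmpxchgStrong(start + reserve, used_end, .monotonic, .monotonic);
        return aligned;
    }
};
```

With a single thread the compare-exchange always succeeds, so no space is wasted; under contention a failed exchange only costs the padding bytes.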
Matches now use memcpy and memset when possible.
Block loops have been rewritten to be more optimizer friendly.
Reworks Symbol and HuffmanDecoder
* Symbol now only includes the value and number of code bits.
decodeSymbol returns only the value.
* HuffmanDecoder now takes the regular bits instead of the reversed.
* Code table construction now uses buckets instead of sorting.
* For linked codes, the value field of Symbol is now used as the next
index. The actual value is the element index.
* InvalidCode is now detected only once with a special linked index.
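For readers unfamiliar with bucket-based construction, here is a minimal sketch of assigning canonical Huffman codes by counting code lengths instead of sorting symbols, as described in RFC 1951. It is an assumption-laden illustration: `assignCodes` is a hypothetical name, and the decoder's actual table layout (including linked codes) is not reproduced here.

```zig
const std = @import("std");

/// Hypothetical helper: assigns canonical codes from code lengths using
/// one bucket per length. A length of 0 means the symbol is unused.
fn assignCodes(comptime n: usize, lens: [n]u4) [n]u16 {
    // Bucket pass: count how many symbols use each code length.
    var count = [_]u16{0} ** 16;
    for (lens) |l| count[l] += 1;
    count[0] = 0;

    // Compute the first code of each length from the bucket sizes.
    var next = [_]u16{0} ** 16;
    var code: u16 = 0;
    for (1..16) |bits| {
        code = (code + count[bits - 1]) << 1;
        next[bits] = code;
    }

    // Hand out consecutive codes within each length, in symbol order.
    var codes: [n]u16 = undefined;
    for (lens, 0..) |l, i| {
        if (l == 0) {
            codes[i] = 0;
            continue;
        }
        codes[i] = next[l];
        next[l] += 1;
    }
    return codes;
}
```

Both passes are linear in the number of symbols, so no comparison sort is needed.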
Performance is 39.7% faster than before and 1.1% faster than gzip using
a sample created from compressing a tar of the src directory.
Modifies the `Allocator` implementation provided by `ArenaAllocator` to be
threadsafe using only atomics and no synchronization primitives locked
behind an `Io` implementation.
At its core this is a lock-free singly linked list which uses CAS loops to
exchange the head node. A nice property of `ArenaAllocator` is that the
only functions that can ever remove nodes from its linked list are `reset`
and `deinit`, both of which are not part of the `Allocator` interface and
thus aren't threadsafe, so node-related ABA problems are impossible.
There *are* some trade-offs: end index tracking is now per node instead of
per allocator instance. It's not possible to publish a head node and its
end index at the same time if the latter isn't part of the former.
Another compromise had to be made in regards to resizing existing nodes.
Annoyingly, `rawResize` of an arbitrary thread-safe child allocator can
of course never be guaranteed to be an atomic operation, so only one
`alloc` call can ever resize at the same time; other threads have to
consider any resizes they attempt during that time failed. This causes
slightly less optimal behavior than what could be achieved with a mutex.
The LSB of `Node.size` is used to signal that a node is being resized.
This means that all nodes have to have an even size.
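A minimal sketch of how a low-bit flag on an always-even size can act as a try-lock, assuming hypothetical helper names `tryBeginResize` and `endResize` (the commit's actual `Size` type is not shown in this message):

```zig
const std = @import("std");

/// Hypothetical helper: sets the LSB and reports whether this caller
/// was the one to set it. Other callers treat their resize as failed.
fn tryBeginResize(size: *std.atomic.Value(usize)) bool {
    const prev = size.fetchOr(1, .acquire);
    return prev & 1 == 0;
}

/// Hypothetical helper: publishes the new (even) size, which also
/// clears the "being resized" bit.
fn endResize(size: *std.atomic.Value(usize), new_size: usize) void {
    std.debug.assert(new_size & 1 == 0);
    size.store(new_size, .release);
}
```

Because every valid size is even, the flag costs no extra storage and can be set and cleared with single atomic operations.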
Calls to `alloc` have to allocate new nodes optimistically as they can
only know whether any CAS on a head node will succeed after attempting it,
and to attempt the CAS they of course already need to know the address of
the freshly allocated node they are trying to make the new head.
The simplest solution to this would be to just free the new node again if
a CAS fails, however this can be expensive and would mean that in practice
arenas could only really be used with a GPA as their child allocator. To
work around this, this implementation keeps its own free list of nodes
which didn't make their CAS to be reused by a later `alloc` invocation.
To keep things simple and avoid ABA problems the free list is only ever
accessed beyond its head by 'stealing' the head node (and thus the
entire list) with an atomic swap. This makes iteration and removal trivial
since there's only ever one thread doing it at a time which also owns all
nodes it's holding. When the thread is done it can just push its list onto
the free list again.
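The two list operations described above can be sketched as follows, assuming hypothetical `Node` and `FreeList` types (the real node layout and memory orderings may differ). Pushing uses a CAS loop on the head; consuming takes the whole list at once with an atomic swap, so the consumer never dereferences nodes that another thread could still unlink.

```zig
const std = @import("std");

/// Hypothetical node type for illustration only.
const Node = struct {
    next: ?*Node = null,
    value: u32,
};

/// Hypothetical lock-free list: CAS to push, swap to steal everything.
const FreeList = struct {
    head: std.atomic.Value(?*Node) = std.atomic.Value(?*Node).init(null),

    fn push(self: *FreeList, node: *Node) void {
        var old = self.head.load(.monotonic);
        while (true) {
            node.next = old;
            // On failure, cmpxchgWeak returns the current head; retry with it.
            old = self.head.cmpxchgWeak(old, node, .release, .monotonic) orelse return;
        }
    }

    /// Takes ownership of the entire list, leaving it empty.
    fn stealAll(self: *FreeList) ?*Node {
        return self.head.swap(null, .acquire);
    }
};
```

After `stealAll`, the calling thread owns every node it received and can iterate or remove freely before pushing the remainder back.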
This implementation offers comparable performance to the previous one when
only being accessed by a single thread and a slight speedup compared to
the previous implementation wrapped into a `ThreadSafeAllocator` up to ~7
threads performing operations on it concurrently.
(measured on a base model MacBook Pro M1)