Matches now use memcpy and memset when possible.
Block loops have been rewritten to be more optimizer friendly.
Reworks Symbol and HuffmanDecoder
* Symbol now only includes the value and number of code bits.
decodeSymbol returns only the value.
* HuffmanDecoder now takes the regular bits instead of the reversed.
* Code table construction now uses buckets instead of sorting.
* For linked codes, the value field of Symbol is now used as the next
index. The actual value is the element index.
* InvalidCode is now detected only once with a special linked index.
Performance is 39.7% faster than before and 1.1% faster than gzip using
a sample created from compressing a tar of the src directory.
Modifies the `Allocator` implementation provided by `ArenaAllocator` to be
threadsafe using only atomics and no synchronization primitives locked
behind an `Io` implementation.
At its core this is a lock-free singly linked list which uses CAS loops to
exchange the head node. A nice property of `ArenaAllocator` is that the
only functions that can ever remove nodes from its linked list are `reset`
and `deinit`, both of which are not part of the `Allocator` interface and
thus aren't threadsafe, so node-related ABA problems are impossible.
There *are* some trade-offs: end index tracking is now per node instead of
per allocator instance. It's not possible to publish a head node and its
end index at the same time if the latter isn't part of the former.
Another compromise had to be made in regards to resizing existing nodes.
Annoyingly, `rawResize` of an arbitrary thread-safe child allocator can
of course never be guaranteed to be an atomic operation, so only one
`alloc` call can ever resize at the same time, other threads have to
consider any resizes they attempt during that time failed. This causes
slightly less optimal behavior than what could be achieved with a mutex.
The LSB of `Node.size` is used to signal that a node is being resized.
This means that all nodes have to have an even size.
Calls to `alloc` have to allocate new nodes optimistically as they can
only know whether any CAS on a head node will succeed after attempting it,
and to attempt the CAS they of course already need to know the address of
the freshly allocated node they are trying to make the new head.
The simplest solution to this would be to just free the new node again if
a CAS fails, however this can be expensive and would mean that in practice
arenas could only really be used with a GPA as their child allocator. To
work around this, this implementation keeps its own free list of nodes
which didn't make their CAS to be reused by a later `alloc` invocation.
To keep things simple and avoid ABA problems the free list is only ever
be accessed beyond its head by 'stealing' the head node (and thus the
entire list) with an atomic swap. This makes iteration and removal trivial
since there's only ever one thread doing it at a time which also owns all
nodes it's holding. When the thread is done it can just push its list onto
the free list again.
This implementation offers comparable performance to the previous one when
only being accessed by a single thread and a slight speedup compared to
the previous implementation wrapped into a `ThreadSafeAllocator` up to ~7
threads performing operations on it concurrently.
(measured on a base model MacBook Pro M1)
This is the same as for Edwards25519 without the y coordinate,
since it returns Montgomery coordinates, but it can be confusing
to call the Edwards25519 function while working on the
Curve25519 representation.
New protocols such as CPACE requires the map over Curve25519.
Linux's approach to mapping the main thread's stack is quite odd: it essentially
tries to select an mmap address (assuming unhinted mmap calls) which do not
cover the region of virtual address space into which the stack *would* grow
(based on the stack rlimit), but it doesn't actually *prevent* those pages from
being mapped. It also doesn't try particularly hard: it's been observed that the
first (unhinted) mmap call in a simple application is usually put at an address
which is within a gigabyte or two of the stack, which is close enough to make
issues somewhat likely. In particular, if we get an address which is close-ish
to the stack, and then `mremap` it without the MAY_MOVE flag, we are *very*
likely to map pages in this "theoretical stack region". This is particularly a
problem on loongarch64, where the initial mmap address is empirically only
around 200 megabytes from the stack (whereas on most other 64-bit targets it's
closer to a gigabyte).
To work around this, we just need to avoid mremap in some cases. Unfortunately,
this system call isn't used too heavily by musl or glibc, so design issues like
this can and do exist without being caught. So, when `PageAllocator.resize` is
called, let's not try to `mremap` to grow the pages. We can still call `mremap`
in the `PageAllocator.remap` path, because in that case we can set the
`MAY_MOVE` flag, which empirically appears to make the Linux kernel avoid the
problematic "theoretical stack region".
The old logic was fine for targets where the stack grows up (so, literally just
hppa), but problematic on targets where it grows down, because we could hint
that we wanted an allocation to happen in an area of the address space that the
kernel expects to be able to expand the stack into. The kernel is happy to
satisfy such a hint despite the obvious problems this leads to later down the
road.
Co-authored-by: rpkak <rpkak@noreply.codeberg.org>
This reverts commit e314dadb01.
This idea requires more consideration before committing to it.
At the very least let's not regress autodocs.
Closes#31230
This function works with a slice of futures and returns the index of a
completed one. This doesn't work very well in practice because it's
either too high level or too low level.
At the lower level we have Io.Batch for doing this kind of thing at the
Operation API layer.
At the higher level we have Io.Select which is a convenience wrapper
around an Io.Group and an Io.Queue.
Instead of padding, use static entropy to detect corrupted header.
* 64-bit, safe modes: 10 canary bits
* 64-bit, unsafe modes: 0 canary bits
* 32-bit: 27 canary bits
A further enhancement, not done here, would be to upgrade from canary to
parity.