Tokio

This page compares Tokio against knitting on Bun, Node.js, and Deno using the same batch-oriented echo benchmark.

Benchmark source: mimiMonads/knitting-vs-tokio-bench.

Whole-batch latency is reported for three payload shapes, plus a reference sweep:

  • f64
  • String / large UTF-8 text
  • Uint8Array / raw bytes
  • separate Arc<Vec<u8>> reference sweeps for tiny byte payloads

The “close to Tokio” claim on the homepage refers to the small end of this comparison: wakeups/signaling plus copying or cloning tiny payloads. The larger-payload sections below are a different cost regime.

The current summary was recorded on Ubuntu 23.10, x86_64, on an AMD Ryzen 7 4700U.

All runtimes use the same reporting setup:

  • batch sizes: 1, 10, 100
  • warmup: 200 iterations for n=1, 50 otherwise
  • measured iterations: 500
  • per-batch timing
  • sorted samples: avg, min, p75, p99, max
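The reporting loop above can be sketched in std-only Rust. The `run_batch` body and the way warmup iterations are folded into one loop are stand-ins for the real harness, and nearest-rank percentiles are one reasonable reading of the p75/p99 columns:

```rust
use std::time::Instant;

/// Nearest-rank percentile on an ascending-sorted slice.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let idx = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[idx.saturating_sub(1).min(sorted.len() - 1)]
}

fn run_batch() {
    // Placeholder for one whole batch of echo round trips.
    std::hint::black_box(0u64);
}

fn main() {
    let warmup = 200; // n = 1 uses 200 warmup iterations
    let measured = 500;
    let mut samples = Vec::with_capacity(measured);

    for i in 0..(warmup + measured) {
        let start = Instant::now();
        run_batch();
        let micros = start.elapsed().as_secs_f64() * 1e6;
        if i >= warmup {
            samples.push(micros); // per-batch timing, not per-call
        }
    }

    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let avg = samples.iter().sum::<f64>() / samples.len() as f64;
    println!(
        "avg {:.2}  min {:.2}  p75 {:.2}  p99 {:.2}  max {:.2}",
        avg,
        samples[0],
        percentile(&samples, 75.0),
        percentile(&samples, 99.0),
        samples[samples.len() - 1],
    );
}
```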

For small scalar payloads, the JavaScript runtimes stay ahead on average in this run:

  • At n=1, Node.js is lowest at 6.63 µs, followed by Bun at 7.35 µs, Tokio at 13.01 µs, and Deno at 21.54 µs.
  • At n=10, Deno is lowest at 11.97 µs, then Bun at 13.41 µs, Node.js at 17.28 µs, and Tokio at 27.50 µs.
  • At n=100, Node.js (62.33 µs) and Deno (63.28 µs) stay ahead of Tokio (89.55 µs), with Bun in between at 80.61 µs.
  • Tokio keeps the best p99 at n=1 (16.85 µs), but Bun has the best p99 at n=10 (36.58 µs) and n=100 (92.37 µs).
Figure: batch average latency for f64 payloads comparing Tokio, Bun, Node.js, and Deno.

Large payloads show a very different profile from scalar messages.

  • For the 1 MiB string payload, Tokio is clearly fastest at n=1 (221.35 µs) and n=100 (37.93 ms). At n=10, Bun edges it slightly on average (6.01 ms vs 6.20 ms) and has a near-identical p99 (8.41 ms vs 8.44 ms).
  • For the 1 MiB Uint8Array payload, Tokio leads at every batch size on average: 272.81 µs, 4.64 ms, and 37.83 ms.
  • Node.js and Deno fall further behind once payload materialization dominates the round trip, especially on the 1 MiB cases.
Figure: batch average latency for 1 MiB string payloads comparing Tokio, Bun, Node.js, and Deno.

This sweep fixes batch=100 and scales binary payload size from 8 B to 1 MiB:

  • Bun is fastest from 8 B through 512 B, so the default copy-based byte path is already competitive at the tiny end.
  • Tokio retakes the lead from 1 KiB through 16 KiB in this run.
  • Bun and Node.js pull slightly ahead again from 32 KiB through 512 KiB.
  • At 1 MiB, Tokio is fastest again at 29.48 ms, ahead of Bun (46.97 ms), Node.js (52.53 ms), and Deno (55.77 ms).
Figure: Uint8Array size sweep comparing Tokio, Bun, Node.js, and Deno.

This separate sweep also fixes batch=100, but only covers 8 B through 512 B. On the Tokio side it uses Arc<Vec<u8>>, which is the closest thing to “magic teleportation” in this setup: Arc::clone mostly just bumps a refcount instead of copying the bytes.
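A minimal std-only illustration of why the `Arc` path is so cheap (the payload size is illustrative): cloning the handle touches the reference count, not the bytes.

```rust
use std::sync::Arc;

fn main() {
    // A "1 MiB" payload allocated once.
    let payload: Arc<Vec<u8>> = Arc::new(vec![0u8; 1 << 20]);

    // Arc::clone bumps a reference count; the 1 MiB of bytes is not
    // copied, and both handles point at the same allocation.
    let handle = Arc::clone(&payload);
    assert_eq!(Arc::strong_count(&payload), 2);
    assert!(std::ptr::eq(payload.as_ptr(), handle.as_ptr()));

    drop(handle);
    assert_eq!(Arc::strong_count(&payload), 1);
}
```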

  • Bun still beats the Tokio Arc path from 8 B through 256 B, and is still close at 512 B (74.78 µs vs 79.51 µs).
  • Node.js is faster than the Arc baseline from 16 B through 64 B, and stays near parity at 128 B (82.51 µs vs 79.89 µs).
  • Deno is close in the 16-64 B band, but falls back more clearly by 256-512 B.

Treat this as an upper-bound shared-ownership reference, not the default apples-to-apples byte benchmark. The fair default comparison is still the normal Uint8Array copy path above.

Figure: Uint8Array size sweep comparing Tokio Arc Vec against Bun, Node.js, and Deno.

Fairness and the one intentional asymmetry

Three major sources of skew are already handled:

  • Dispatch shape is aligned. Rust fans out via spawned tasks and waits with join_all(...), matching knitting, which creates all pool.call.*(...) promises and awaits Promise.all(...).
  • Runtime width is aligned. Knitting uses threads: 1, and Rust uses #[tokio::main(worker_threads = 1)], so sender fan-out can’t spread across a bigger worker pool.
  • Round-trip work is aligned. The default String and Uint8Array paths pay payload work in both directions on both sides; Tokio explicitly clones on send and clones again on the worker reply so the return path is not a cheaper move-only shortcut.
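The round-trip symmetry in the last point can be sketched with std threads and channels standing in for Tokio's tasks and channels (the echo body is a stand-in; what matters is that the payload is cloned in both directions):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (to_worker, worker_rx) = mpsc::channel::<String>();
    let (to_host, host_rx) = mpsc::channel::<String>();

    let worker = thread::spawn(move || {
        for msg in worker_rx {
            // Clone on the reply path, mirroring the clone on send, so
            // the return trip is not a cheaper move-only shortcut.
            to_host.send(msg.clone()).unwrap();
        }
    });

    let payload = "x".repeat(1 << 20); // 1 MiB string
    // Clone on send so the sender keeps its copy in scope.
    to_worker.send(payload.clone()).unwrap();
    drop(to_worker); // close the channel so the worker loop ends

    let echoed = host_rx.recv().unwrap();
    assert_eq!(echoed.len(), payload.len());
    worker.join().unwrap();
}
```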

One asymmetry is kept on purpose: memory management.

This benchmark measures “total cost of the system as designed”, not “transport cost after normalizing allocation away”. Large payloads have to be copied or shared somehow, and that choice is part of the cost.

For large string and byte payloads:

  • Rust String / Vec<u8> pays clone() (heap allocation + memcpy) in the timed section.
  • Knitting copies into a preallocated shared-memory region managed by its own allocator-like bookkeeping.
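The two cost models can be put side by side in a small sketch, with a plain `Vec<u8>` standing in for knitting's shared-memory region:

```rust
fn main() {
    let payload = vec![7u8; 1 << 20]; // 1 MiB of bytes

    // Default Tokio-side path: clone() pays a fresh heap allocation
    // plus a memcpy on every timed send.
    let cloned: Vec<u8> = payload.clone();
    assert_eq!(cloned.len(), payload.len());

    // Knitting-style path (sketched): the region is allocated once up
    // front, and each send is only a memcpy into it.
    let mut shared_region = vec![0u8; 1 << 20]; // stand-in for the SAB
    shared_region[..payload.len()].copy_from_slice(&payload);
    assert_eq!(shared_region[0], 7);
}
```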

Avoiding general-purpose allocation in the hot path is part of what makes knitting interesting, so the benchmark keeps that cost in-bounds rather than hiding it.

The Arc<Vec<u8>> sweep is included separately for exactly that reason: it shows the shared-ownership upper bound for tiny payloads without pretending that it is the default fair byte path.

For the payload-heavy echo cases, treat the benchmark as measuring two different “systems”:

  • knitting: shared-buffer copies + allocator-style region management (JS values still get materialized when a worker reads/returns them)
  • tokio default: clone-driven allocation + payload copies on the channel path
  • tokio Arc reference: Arc::clone shared ownership for the byte buffer handle

The exact low-level behavior depends on payload type and runtime, but the high-level point is stable: knitting is buying speed by replacing repeated general-purpose allocation with preallocated shared-memory management.

A few concrete things knitting does that matter for this benchmark:

  • Fixed pool topology → simpler queues. The pool knows its workers up front, and each host↔worker lane is effectively single‑producer/single‑consumer. That’s cheaper than a fully general multi‑producer channel.
  • Low-garbage hot path. Most transport work happens inside typed-array-backed buffers and reused task objects, reducing allocation churn and GC pressure (and references get cleared quickly after each call settles).
  • Two-tier payload path. Small payloads encode inline in the per-call header slot (roughly 0.5 KiB per in-flight call, with ~544 bytes usable for inline data); larger payloads spill into the shared payload buffer (SAB/GSAB).
  • Shared payload buffer + mini allocator. Large payloads are copied into a preallocated SharedArrayBuffer and carved into 64‑byte‑aligned regions tracked by a small slot table/bitset (more complexity, less malloc in the hot path).
  • Primitives are “header-only”. Numbers/booleans/null/etc encode directly in header words (no payload buffer at all), keeping contention and copying low.
  • Optional “gc at idle boundaries”. When workers have gc() available, knitting may trigger a GC before going into longer spin/park waits, nudging collections away from the hot loop.
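The slot-table/bitset idea from the list above can be sketched as a tiny allocator over a preallocated buffer. Region size, slot count, and the `alloc`/`free` API here are illustrative, not knitting's actual layout:

```rust
// Bitset-backed slot allocator over a preallocated buffer: each bit in
// `used` marks one 64-byte slot as taken (up to 64 slots in this sketch).
struct SlotAllocator {
    buf: Vec<u8>, // stand-in for the SharedArrayBuffer
    used: u64,    // one bit per 64-byte slot
    slots: usize,
}

impl SlotAllocator {
    fn new(slots: usize) -> Self {
        assert!(slots < 64, "this sketch tracks at most 63 slots in one u64");
        Self { buf: vec![0; slots * 64], used: 0, slots }
    }

    /// Claim enough contiguous 64-byte slots for `len` bytes; returns
    /// the byte offset into the buffer, or None if the region is full.
    fn alloc(&mut self, len: usize) -> Option<usize> {
        let need = len.div_ceil(64);
        for start in 0..=self.slots.saturating_sub(need) {
            let mask = ((1u64 << need) - 1) << start;
            if self.used & mask == 0 {
                self.used |= mask; // mark the slots taken in the bitset
                return Some(start * 64);
            }
        }
        None
    }

    /// Release the slots backing an allocation of `len` bytes at `offset`.
    fn free(&mut self, offset: usize, len: usize) {
        let need = len.div_ceil(64);
        let mask = ((1u64 << need) - 1) << (offset / 64);
        self.used &= !mask;
    }
}

fn main() {
    let mut a = SlotAllocator::new(8); // 8 slots = 512 B region
    let off = a.alloc(100).unwrap();   // 100 B -> two 64 B slots
    assert_eq!(off, 0);
    a.buf[off] = 1;                    // payload bytes would be memcpy'd here
    let off2 = a.alloc(64).unwrap();   // lands in the next free slot
    assert_eq!(off2, 128);
    a.free(off, 100);
    assert_eq!(a.alloc(128).unwrap(), 0); // freed slots are reused
}
```

The point of the sketch is the trade named below: claiming and releasing slots is a couple of bitwise operations, with no general-purpose allocator call in the hot path, at the cost of owning that bookkeeping yourself.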

None of this is free: it trades simplicity for careful memory layout, extra bookkeeping, and more “allocator-like” engineering. That trade is exactly what this repo is trying to make visible.