How we run iSCSI over the internet

Hi HN - Tom here, I built scsipub.

The short version: it's iSCSI targets on the public internet. Pick an image, get a block device. The free tier doesn't need a signup at all - iscsiadm -m discovery -t sendtargets -p scsipub.com and --login to iqn.2025-01.pub.scsipub:blank lands you a 64 MB scratch disk. There's a small catalog of OS images you can mount the same way.

The paid tier is where it gets less hobby-shaped: sessions survive disconnects, a single target can expose multiple LUNs, and SCSI-3 Persistent Reservations work end-to-end (REGISTER / RESERVE / RELEASE round-trip clean against sg_persist). That last bit is the cluster-storage primitive — Pacemaker, ESXi HA, and Windows MSCS all use PR for fencing — so you can actually back a 2-node failover cluster off a target on the public internet.

The post linked in the submission is the architectural decision log: Ranch 2.x listeners, a BEAM process per session, COW overlays with per-sector bitmaps, Caddy-managed Let's Encrypt for the iSCSI-TLS port without restarting the listener, and the four open-iscsi quirks that each cost me few hours. There's a section on what we're deliberately not solving (multi-region, RDMA, etc.) so you know the scope.

Two companion projects ship as embedded sub-sites on the front page — one turns an ESP32-S3 into a wireless iSCSI-to-USB bridge, one lets a Raspberry Pi 3/4/5 netboot directly from a target. Both linked from the landing page under "Hardware initiators".

Happy to answer any questions about the protocol, the deployment, or the BEAM-side design choices.

iSCSI is a protocol from the era when “the network” meant a rack-scale fibre channel replacement. Initiators and targets trusted each other, CHAP was optional theatre, and a packet from an initiator carried the implicit assumption “we’re on the same L2 segment.”

scsipub serves iSCSI targets to arbitrary clients on the public internet. That’s a different set of assumptions. This post is the decision log — the small choices that add up to “this works and doesn’t break from day one.”

It started as the missing dependency for two adjacent projects of mine — a Raspberry Pi netboot shim and an ESP32-based USB-mass- storage bridge — both of which needed an iSCSI target out on the open internet to point demos at, and there wasn’t one. Building a target turned out to be the bigger of the three problems.

The listener

Both ports are Ranch 2.x listeners — plain TCP on 3260, TLS on 3261. Scsipub.Target.Listener returns a pair of child specs that the application supervisor adds at boot:

def child_specs(opts) do
  tcp_spec = tcp_child_spec(opts[:port] || 3260, protocol_opts)

  if opts[:tls_certfile] && opts[:tls_keyfile] &&
       File.exists?(opts[:tls_certfile]) && File.exists?(opts[:tls_keyfile]) do
    tls_spec = tls_child_spec(opts[:tls_port] || 3261, certfile, keyfile, protocol_opts)
    [tcp_spec, tls_spec]
  else
    [tcp_spec]
  end
end

Ranch runs a small acceptor pool in front of a :ranch_protocol callback. When a connection arrives, Ranch spawns a fresh BEAM process and hands it the socket. For iSCSI that’s the unit we want: one process per TCP connection, one TCP connection per initiator session, one initiator session per user-visible mountable disk.

“One BEAM process per connection” only works because processes here aren’t OS threads. A BEAM process is ~2.5 KB of initial heap and some bookkeeping — the scheduler happily runs tens of thousands of them on a single core. iSCSI sessions sit idle waiting for SCSI PDUs most of the time, which is the ideal shape for green threads: cheap to park, cheap to wake.

Contrast with the C implementations: target_core_iblock and friends carry a thread pool and a queue, and tuning the pool size is an ongoing concern. We don’t tune anything and the BEAM happily handled 446 req/s in our web-side load test before latency started climbing — and that’s the Phoenix surface with its DB hops, not the iSCSI listener, which has smaller payloads and no SQL in the hot path at all.

One process per session

The protocol module is Scsipub.Target.Session, a plain GenServer. Its state machine walks through three phases:

phase: :security_negotiation  # csg=0, CHAP challenge/response
phase: :operational           # csg=1, negotiate parameters
phase: :full_feature           # csg=1 transit done, handling SCSI PDUs

Each PDU comes in on the socket, gets parsed into a struct, and routed to a handler. If a handler raises — malformed PDU, unexpected state transition, disk error — the process dies. That’s on purpose. The supervisor doesn’t restart it, because there’s no meaningful recovery; the initiator will notice the TCP close and try to log in again. State doesn’t leak between sessions because state doesn’t leave the process.

This is the standard Erlang story (“let it crash”), but it’s more than a platitude for iSCSI. The real-world alternative — carefully defending every parser branch against every attacker-shaped PDU — is how RFC 7143’s more colourful edge cases turn into CVEs in other implementations. We don’t defend; we fence. One bad PDU kills one session.

The Registry (Scsipub.Sessions.Registry, ETS-backed) is how a session announces itself once it reaches Full Feature Phase:

Registry.set_pid(iqn, self())

The Registry monitors the pid and auto-cleans the entry on :DOWN. The admin dashboard reads from the same ETS table to show live connections.

COW overlays

The base image is a regular file — .img, .iso, or .qcow2 decompressed to raw on fetch. It’s read-only. Every concurrent session gets its own overlay file, sparse-allocated to the same size as the base:

/var/lib/scsipub/overlays/
  71a61232479cc467.img          ← overlay, sparse
  71a61232479cc467.img.bitmap   ← 1 bit per sector

The bitmap tracks which 512-byte sectors have been written. Reads check the bit: if set, the overlay has the sector; if clear, fall through to the base image. Writes set the bit and write to the overlay.

The layout means:

The base image is never touched. CI verifies this — we SHA-256 the base before and after an integration run.
The overlay file is sparse. A session that only writes the MBR costs ~512 bytes on disk, not “the full virtual size of the disk.” Filesystem holes do the work.
Disconnecting is cheap. Non-persistent tiers delete the overlay on the TCP close; persistent tiers keep it until the session’s TTL elapses or the user destroys it explicitly.
Writes are counted. Each overlay write bumps a counter against write_limit from the user’s tier config. Hit the limit and the target responds WRITE_PROTECT until the session ends.

The Janitor, a GenServer on a 10-minute tick, sweeps the overlay directory and deletes files that don’t match any live session in the database. That’s how we clean up from the rare case where a process dies before its terminate callback runs.

Caddy in front, TLS everywhere

Caddy terminates HTTPS on port 443 and reverse-proxies to the Phoenix app on port 4000. The same Let’s Encrypt certificate also protects the iSCSI-TLS listener on port 3261 — which is the interesting part, because the iSCSI listener isn’t behind Caddy. It binds :ranch_ssl directly.

Caddy writes the ACME-obtained cert to its internal storage (/var/lib/caddy/.local/share/caddy/...), which the app user can’t read. The bridge is a tiny systemd service running inotifywait against that directory and copying the cert into /var/lib/scsipub/tls/ — owned by a shared group both users can read — whenever the bytes change.

The iSCSI listener picks up rotations without a restart because its sni_fun re-reads the PEM on every TLS handshake, with guardrails:

# lib/scsipub/target/tls_certs.ex
def sni_opts(certfile, keyfile) do
  now = System.monotonic_time(:second)

  case :persistent_term.get(cache_key, nil) do
    {_cert_mtime, _key_mtime, loaded_at, opts}
    when now - loaded_at < @min_reload_interval ->
      opts                           # 60s cooldown — serve cache unconditionally

    {cert_mtime, key_mtime, _loaded_at, opts} ->
      if stat_unchanged?(certfile, keyfile, cert_mtime, key_mtime) do
        opts                         # mtime unchanged — still fresh
      else
        reload_and_cache(...)        # rotation happened — re-read PEM
      end

    nil ->
      reload_and_cache(...)          # cold cache — first load
  end
end

Two guards, in order: a 60-second cooldown that serves the cached opts without any syscall (absorbs a thundering-herd handshake burst), and an mtime check after the cooldown that only pays for a fresh PEM read when the files have actually changed. Both matter — sni_fun is on the hot path for every TLS handshake, and without them a rotation every few months would still cost two stat syscalls per mount.

Things open-iscsi cares about

If you’re building against the open-iscsi initiator that ships in every Linux distro, the protocol is less “what’s on the wire” and more “what iscsiadm does with what’s on the wire.” Three concrete examples that each cost us a day.

`/` in the IQN type-name separator

Our first cut of anonymous target names was iqn.2025-01.pub.scsipub:image/ubuntu. That parses fine as an IQN. iscsiadm even does discovery against it happily. What it can’t do is log in:

iscsiadm: Could not make /etc/iscsi/nodes/iqn.2025-01.pub.scsipub:image/ubuntu

open-iscsi stores its persistent state in /etc/iscsi/nodes/<iqn>/... — it uses the IQN verbatim as a filesystem path. Any / in the name becomes a subdirectory boundary, and the create-if-missing path walk fails. We switched to . as the type/name separator (iqn.2025-01.pub.scsipub:image.ubuntu), which parses the same way and sidesteps the whole problem.

SendTargets has to advertise an address the client can reach

When an initiator does discovery, the target replies with a list of TargetName + TargetAddress records. The initiator saves that address as the portal for future logins — even if the discovery request itself went through a different IP.

In our CI, the target runs inside a CI container and the initiator inside a QEMU VM. QEMU’s user-mode networking NATs to 10.0.2.2 from the VM’s perspective. If we let the server advertise whatever sockname() returns — 127.0.0.1:3260 — iscsiadm dutifully saves that as the portal, and every subsequent login attempt tries to reach the runner’s loopback from inside the VM and fails forever.

# lib/scsipub/target/session.ex
defp advertise_address(socket, transport) do
  case Application.get_env(:scsipub, :public_host) do
    host when is_binary(host) -> "#{host}:#{port(socket, transport)}"
    _ -> sockname_string(socket, transport)
  end
end

Pin :public_host (we ship this as PHX_HOST in deploy env) and SendTargets returns something the client can actually get back to.

The `-o new` dance for static logins

Once you’ve been bitten by the SendTargets-saves-the-portal behaviour enough times, you learn to skip discovery for anything that needs a non-default portal. For example: iSCSI-over-TLS via stunnel. The natural flow would be “discover via the tunnel, then log in.” But the discovery response names the server’s public portal, not 127.0.0.1:3260 where stunnel is terminating, so iscsiadm saves the wrong portal and logs in plain instead of through the tunnel.

The fix is static login:

IQN=iqn.2025-01.pub.scsipub:blank

iscsiadm -m node -T $IQN -p 127.0.0.1:3260 -o new
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 \
  -o update -n node.session.auth.authmethod -v None
iscsiadm -m node -T $IQN -p 127.0.0.1:3260 --login

-o new creates a fresh node record at the portal you specify instead of using whatever the discovery step saved. Our landing page renders exactly that command sequence for the TLS path, because the alternative is an infuriating 30 minutes with iscsiadm --debug=6.

Bonus: stale records retry forever

Once a node record exists under /etc/iscsi/nodes/, iscsid retries the login indefinitely if the session drops. If the target has been destroyed server-side, that manifests as a steady 1-every-3-second stream of “unknown target” login attempts in our server logs. The cure is on the client:

iscsiadm -m node -T <iqn> -o delete

On the server we throttle the log line (once per (ip, target) per 5 minutes at warning level, debug after that) so a stale initiator doesn’t bury real warnings under 17,280 lines of the same complaint per day. See Scsipub.Target.Session.log_unknown_target/2.

Cluster primitives: PR and multi-LUN

What turns this from “a fancy iSCSI sandbox” into “a target real cluster software can drive” is two SAM-5 / SPC-4 features — multi-LUN sessions and SCSI-3 Persistent Reservations. The wire protocol already supports both; the work is on our side, plumbing them into the Session and into something that survives a BEAM restart.

Multiple LUNs per session

A SCSI Logical Unit Number is the byte in each CDB that selects which device behind a target the initiator is addressing. Real storage products expose one target with N LUNs all the time; our Session struct holds a map keyed by LUN number, and the SCSI dispatcher routes by pdu.lun:

case Map.get(state.lun_backends, pdu.lun) do
  nil -> {:error, :logical_unit_not_supported}
  cow -> Handler.dispatch(pdu.cdb, pdu.data, cow, ...)
end

There’s an anonymous demo target wired up — iqn.2025-01.pub.scsipub:multi exposes two LUNs, each backed by a different image — and the session-creation API on the paid side takes an images: [...] array. The unglamorous half of the work was cleanup: multi-LUN sessions write to <sid>.lun0.img, <sid>.lun1.img, etc., and a terminator that only knew about state.overlay_path (the single-LUN field) leaked overlays on disconnect. The fix is a separate cleanup_multi_lun_overlays/1 walker, gated on state.overlay_path == nil so the single-LUN path’s own File.rm doesn’t double-close the same fd.

Persistent reservations

SCSI-3 PR is the primitive cluster software uses to fence a node out of shared storage. The per-LUN state is small: a set of registered initiator keys, plus an optional “reservation” naming one of them as holder along with a type (Write Exclusive, Exclusive Access, and four flavours combining “Registrants Only” and “All Registrants”). Pacemaker, ESXi HA, and Windows MSCS all drive this via sg_persist.

The state machine is Scsipub.Sessions.PR — a pure module, no DB or process baggage, so it’s tested as a struct. The runtime layer (SharedLU, one GenServer per (session_id, lun)) wraps it with write-through to the persistent_reservations Postgres table on every successful PR OUT. SPC-4 says PR state must survive a target reboot, and the table is the only honest way to honour that. A BEAM-restart unit test cycles the SharedLU through stop+restart and asserts the registrations and reservation reload identical.

Two subtle bits of plumbing.

The I_T nexus identifier is the iSCSI InitiatorName, not the CHAP user. Two initiators behind the same CHAP credential are distinct nexuses by design, and trusting CHAP_N would let a second client write under the first’s reservation. The Session struct keeps both:

:initiator_name,        # CHAP_N for paid sessions
:iscsi_initiator_name,  # the InitiatorName from the first login PDU
                        # — what PR identifies by

The other surprise was that Linux’s open-iscsi doesn’t send PR OUT parameter lists as immediate data. It uses the R2T (Ready To Transfer) flow, the same way it does WRITE — which makes sense, the spec lets it, but the original implementation only handled the immediate path. sg_persist --register returned Invalid opcode until R2T-driven PR OUT joined the existing two-phase command machinery that SCSI WRITE already used.

Two-initiator scenario, end to end:

# Initiator A: register a key, reserve Write Exclusive
sg_persist --out --register --param-sark=$KEY_A $DEV_A
sg_persist --out --reserve --param-rk=$KEY_A --prout-type=1 $DEV_A

# Initiator B (different InitiatorName, may share CHAP user):
# READ is allowed, WRITE returns RESERVATION CONFLICT.
dd if=$DEV_B bs=512 count=1 iflag=direct >/dev/null   # ok
dd if=/dev/zero of=$DEV_B bs=512 count=1 oflag=direct # EBUSY

# A releases; B's write now succeeds.
sg_persist --out --release --param-rk=$KEY_A --prout-type=1 $DEV_A

The CI integration suite runs that exact sequence. Combined with the restart-resume contract above, that’s enough to back a 2-node failover cluster off a target on the public internet — the BEAM deploy ritual (SIGTERM, wait for sessions to checkpoint, SIGKILL, restart, Resumer wakes the suspended LUs) doesn’t lose reservations along the way.

What we’re not solving

Deliberate omissions, for the record:

Multi-region. Everything runs in a single datacenter. A multi-region story would need per-session persistence to be a distributed system problem; it currently isn’t, and we like that.
S3- or NBD-backed base images. Images are local sparse files. Upload via the admin UI or an ecto run script; that’s the whole ingestion story. Cloud-backed storage changes the read-path latency distribution meaningfully enough that we’d want to think about it rather than bolt it on.
iSER / RDMA. No. scsipub is a public-internet service; RDMA is a rack-scale protocol. If you need 40 Gbit/s into a block device, the physics say you aren’t on the public internet anyway.
MPIO. Not yet. The initiator side of multipath works fine, but until we have multi-region there’s nowhere to failover to.
Per-session encryption above TLS. The iSCSI protocol has IPsec and a few other approaches for payload secrecy; none are widely deployed, and adding our own on top of TLS would just be framing for framing’s sake.

What comes next

The two projects scsipub originally existed to serve are now both shipped and have their own posts — the Pi netboot shim and how it killed the SD-card shuffle is at Netboot a Pi fleet from iSCSI; the ESP32 USB-mass-storage bridge for lab equipment is at An ESP32 as a network-attached USB stick.

Past that, the interesting question is what happens when a Phoenix app serving iSCSI meets someone who really wants to use it — tens of thousands of sessions, sustained writes, a pathological initiator. We’ve done a load test up to a few hundred concurrent web requests; we haven’t yet found the shape of the BEAM’s failure mode under actual iSCSI load. That’s the next thing to measure.

Hacker Times

Hacker Times

How we run iSCSI over the internet

Discussion

Discussion

The listener

One process per session

COW overlays

Caddy in front, TLS everywhere

Things open-iscsi cares about

`/` in the IQN type-name separator

SendTargets has to advertise an address the client can reach

The `-o new` dance for static logins

Bonus: stale records retry forever

Cluster primitives: PR and multi-LUN

Multiple LUNs per session

Persistent reservations

What we’re not solving

What comes next

Hacker Times

Hacker Times

How we run iSCSI over the internet

Discussion

Discussion

The listener

One process per session

COW overlays

Caddy in front, TLS everywhere

Things open-iscsi cares about

/ in the IQN type-name separator

SendTargets has to advertise an address the client can reach

The -o new dance for static logins

Bonus: stale records retry forever

Cluster primitives: PR and multi-LUN

Multiple LUNs per session

Persistent reservations

What we’re not solving

What comes next

`/` in the IQN type-name separator

The `-o new` dance for static logins