Monitoring protocol for Oyster Instances

Isolated Instances’ Monitoring Protocol

The monitoring protocol establishes the fault tolerance guarantees for Oyster Isolated Instances and involves a network of Auditors who ensure compliance through the Isolated Instance protocol.

Protocol guarantees

When registering, enclave providers stake POND tokens as security for the enclaves they operate. If these enclaves are found to be non-operational during audits, the staked tokens are slashed. Similarly, auditors involved in the auditing process stake POND tokens to guarantee their active participation and the accuracy of their auditing. If auditors are found to be inactive or if they submit inaccurate audit information, their stake can be slashed.

Protocol trust assumptions

Secure Communication

The protocol assumes that communication between Auditors and the enclaves is secure, and that HTTPS or other secure channels (such as TLS for the audit requests) effectively conceal the nature of the request (audit or user) from the host. It also assumes there is no vulnerability in the communication protocol or its implementation that could be exploited to distinguish between user and auditor traffic or to intercept and manipulate data.

Endpoint Security and Reliability

It is assumed that each enclave exposes a standard endpoint through a secure channel and that this endpoint is reliably available for auditing requests. In practice, however, a DoS attack on the enclave or other network-level disruptions could prevent auditors from reaching it.

Randomness and Assignment Integrity

The assignment of Auditors to enclaves is based on a random seed (Sri), which is assumed to be generated fairly and securely, and to be resistant to manipulation. This seed determines the distribution of auditor subsets to enclaves, so the integrity of this randomness is critical to the proper functioning of the auditing process. If this process is compromised, e.g. if the randomness seed is unevenly or predictably distributed, it becomes an attack vector for collusion and targeted attacks.

High-level overview

Protocol structure

  • Epochs: Time is organized into periods called Epochs.
  • Slots: Each Epoch is further divided into Slots.
  • Ages: Each Slot is subdivided into Ages.
  • Auditors: A random number generated each Epoch assigns Auditors specific Jobs to audit.

Audit Process

Auditors send requests through secure channels established with Instances, which must respond within a set time limit. For each Job within a Slot, Auditors take a majority vote to determine the liveness of the assigned Instance.

Technical Specifications

Enclaves (T): A set of enclaves, totaling t in size, is audited.

Auditors (A): A set of auditors, totaling a in size, conducts the audits.

Epochs (E): Run for a length of LE, divided into n Slots (EiS) each of length e.

Ages: Every Slot is broken into m ages, each lasting p seconds.

Formulas

  • Total Epoch length, LE = n * e.
  • Slot length, e = m * p.
  • SlotId for the s-th slot of epoch Ei is calculated as: SlotId = i * n + s
  • AgeId for the a-th age of a slot with SlotId is: AgeId = SlotId * m + a
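
As a sanity check, the identifier formulas above can be evaluated directly. The parameter values below (n, m) are illustrative, not protocol constants:

```javascript
// Illustrative parameters (not protocol constants).
const n = 24; // Slots per Epoch
const m = 10; // Ages per Slot

// SlotId for the s-th slot of epoch i: SlotId = i * n + s
const slotIdOf = (i, s) => i * n + s;

// AgeId for the a-th age of a slot: AgeId = SlotId * m + a
const ageIdOf = (slotId, a) => slotId * m + a;

const slotId = slotIdOf(5, 3);    // 5 * 24 + 3 = 123
const ageId = ageIdOf(slotId, 7); // 123 * 10 + 7 = 1237
console.log(slotId, ageId);
```

Because Slot and Age identifiers are globally sequential, each (Epoch, Slot, Age) triple maps to a unique AgeId.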

Protocol stages

The auditing protocol, at a high level, consists of the following stages:

  1. Enclave setup
  2. Auditor assignment
  3. Enclave monitoring phase
  4. Aggregating audit data
  5. Challenging audit data
  6. Verifying the audit

Stage 1. Enclave setup

At the beginning of every Epoch, each Enclave internally generates a random seed Sei, which is not exposed to anyone until a seconds after the end of the Epoch. The host machine on which the Enclave is running can query the random seed Sei, along with an attestation signed by the enclave, a seconds after the end of the Epoch. The host machine has to submit the random seed Sei on-chain within k seconds of it becoming queryable (that is, within a+k seconds after the end of the Epoch).
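
The reveal-and-submit deadline above can be sketched as a simple timeline check; the timing values below are illustrative, not protocol constants:

```javascript
// Illustrative timing values (not protocol constants), all in seconds.
const a = 30;          // delay after epoch end before the seed becomes queryable
const k = 60;          // window the host has to submit the seed once queryable
const epochEnd = 1000; // time at which the epoch ends

// The seed Sei can be queried from t = epochEnd + a and must be submitted
// on-chain by t = epochEnd + a + k.
const canQuery = (t) => t >= epochEnd + a;
const submissionValid = (t) => canQuery(t) && t <= epochEnd + a + k;

console.log(submissionValid(1050)); // → true: inside the a+k window
console.log(submissionValid(1200)); // → false: too late, enclave treated as offline
```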

Slashing condition(s)

  1. If the host machine fails to submit the random seed Sei, the Enclave is considered to be offline and POND tokens staked against the enclave are slashed.

Stage 2. Auditor assignment

At this stage, each enclave Ti in the set T is assigned a subset of auditors TAi from the set A, such that |TAi| = k (k is a constant that determines the number of auditors assigned per enclave). The auditor subsets TAi generally differ across Slots of an Epoch.

The assignment from A to T for epoch Ei is randomized using a seed Sri which is generated at the start of the epoch Ei-1. Sri is currently considered to be the blockhash of the block at which the epoch Ei-1 starts.

Auditors can only enter or leave the auditor set A at the end of an Epoch, which ensures that the auditor subset for a Slot does not change within an Epoch. The enclave set T can grow or shrink as new jobs are added or existing jobs are closed or run out of funds.

Each enclave, for a given Slot of an Epoch, is assigned an auditor subset TAi of length k using the following algorithm:

let AuditorSubsetIndices = [];
let count = 0;
let iter = 0;
while (count < k) {
  // Reduce the hash modulo the auditor set size a to get a candidate index into A.
  const index = keccak256(`${iter}-${SlotId}-${EnclaveJobId}`) % a;
  if (!AuditorSubsetIndices.includes(index)) {
    AuditorSubsetIndices.push(index);
    count++;
  }
  iter++;
}

This algorithm ensures that the auditor subset for every enclave in a Slot consists of unique Auditors. It also ensures that every new enclave created during the Epoch will have Auditors assigned to it. Auditing of new enclaves takes effect only after c slots from the Slot at which the Isolated Instances Job was created, where c is a constant that determines the delay for auditing new enclaves (JobStartup period).

Slashing condition(s)

  1. If neither the response nor the online status is reported on-chain by the auditor, the auditor is assumed to be offline and its staked POND tokens are slashed for inactivity.

Stage 3. Enclave monitoring phase

During each Age, the auditor subset TAi assigned to the enclave Ti for the Slot sends audit requests to the enclave to check its availability. The audit requests are sent through the Tor network, making them effectively anonymous so that hosts cannot distinguish between an audit and a user request. The enclave Ti uses the random seed Sei to generate a response to an audit request. The response is a single bit, computed as follows:

const response = keccak256(AuditorAddress + AgeId + Sei) / 2^255 // integer division: the most significant bit of the 256-bit hash (0 or 1)

Slashing condition(s)

The above response by the enclave Ti to the audit request is required to be submitted on-chain by the Auditor within q seconds to prove that the audit was actually done.

  1. If the enclave Ti does not respond to the audit request, it is considered to be offline, reported as such on-chain by the auditor, and the POND tokens staked against the enclave are slashed.
  2. If neither the response nor the offline status is reported on-chain by the auditor, the auditor is assumed to be offline and its staked POND tokens are slashed.

Stage 4. Aggregating audit data

This stage involves submitting the audit data for the epoch on-chain to ensure that the monitoring is effective. The audit data consists of the following:

  1. The audit responses collected by the auditors during the epoch, for each enclave they were assigned in a slot, for each age in that slot. The audit responses are the 1-bit data generated by the enclaves. These are submitted on-chain by Auditors within q seconds of the Age for which they are generated.
  2. The random seeds Sei used by the enclaves to generate the audit responses, along with the attestations signed by the enclaves. The random seeds with attestation are submitted by the enclave hosts within a+k seconds of the end of Epoch.

Slashing condition(s)

  1. If the seeds are not submitted within the deadline, the enclave hosts are considered to be offline and the POND tokens staked by the enclave provider towards the enclave are slashed.
  2. If the audit responses are not submitted within the deadline, the corresponding auditors are considered to be offline and the POND tokens staked by the auditors are slashed.

Stage 5. Challenging audit data

Given the seed information and the auditor subsets assigned to each enclave during the epoch, anyone can verify the correctness of the audit responses submitted by the auditors. If any audit data is wrong or missing, the audit response can be challenged within f seconds after the end of the epoch. The challenger has to stake POND tokens to create a challenge, and the disputed response is recomputed on-chain.

Slashing condition(s)

  1. If the challenge is valid, the POND tokens staked by the auditor are slashed and the challenger receives a portion of the penalty. The challenge is considered invalid if on-chain verification finds the auditor's response to be correct, in which case the POND tokens staked by the challenger are slashed.

Stage 6. Verifying the audit

Recall that in Stage 3 the auditors send multiple requests to the enclaves. After the challenge period for audit data correctness is over, anyone can penalize an enclave provider if a majority of the auditors report that the enclave was offline during an Age. The penalty increases with the duration of the enclave's unavailability during the Epoch. Any challenges against the enclave provider's stake have to be made within g seconds after the end of the audit data correctness phase (within g + f seconds after the end of the epoch). The challenger has to stake POND tokens to create a challenge and provide the Age ID to be checked on-chain.
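
The "majority of auditors report offline" condition reduces to a simple vote over the per-age reports. A minimal sketch (the boolean report encoding is an assumption for illustration):

```javascript
// Each auditor's report for a given Age: true = enclave responded, false = no
// response. (This boolean encoding is an illustration; on-chain the inputs are
// the submitted audit bits and offline reports.)
const enclaveOfflineInAge = (reports) => {
  const offline = reports.filter((responded) => !responded).length;
  return offline * 2 > reports.length; // strict majority reported offline
};

console.log(enclaveOfflineInAge([false, false, true, false, true])); // → true: 3 of 5 reported offline
console.log(enclaveOfflineInAge([true, true, false, true, true]));   // → false: only 1 of 5 reported offline
```

Requiring a strict majority means a minority of faulty or malicious auditors cannot, by themselves, get an available enclave penalized.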

Slashing condition(s)

  1. The challenge is valid if the enclave was inactive according to the auditors' audit responses; in that case, the POND tokens staked by the enclave provider are slashed and the challenger receives a portion of the slashed POND tokens. If the challenge is invalid, the POND tokens staked by the challenger are slashed.

Some FAQs based on https://x.com/saxenism/status/2002311606752215253

  1. Do you assume auditor-operator separation? If an operator can also register as an auditor, how do you prevent self-auditing or “friendly auditor” capture?

We assume an auditor set with a >=2/3 honesty assumption. This is reasonable once the auditor set is large, but bootstrapping is hard and smaller sets are easier to capture. For each enclave, auditors are selected as a subcommittee from the larger auditor set, which gives a high probability of an honest majority in each subcommittee (inspired by Eth2 committees). With a dishonest majority, auditors can get someone else slashed, but they cannot prove an enclave is up while it isn't.

  2. What’s the Sybil resistance model for auditors? What stops an attacker from spinning many auditor identities to bias assignments or local majorities?

Auditors must stake to register, and the stake is slashed if their reports are incorrect. An attacker can create Sybil identities, but unless they control more than 1/3 of the auditor set, they can’t practically influence the subcommittee for an enclave. For a sufficiently large auditor set, acquiring that much stake is very hard.

(One potential attack scenario is “auditors for hire”, where auditors behave honestly most of the time but collude on specific jobs to slash an enclave (possibly to drive specific operators out of competition). We are exploring subcommittee selection designs where proving committee membership is impossible, forcing attackers to bribe the entire auditor set, which significantly increases the cost of attack.)

  3. For incorrect audit bits, delayed reveal makes challenges possible once R_{T_i} is posted. But for “offline” reports, does the chain get a cryptographic proof that the audit request was delivered and the response arrived? Absence claims are asymmetric. If an auditor says “no response”, what evidence exists either way? Have you considered receipt-style proofs (enclave-signed challenge response or transcript hash) so “I responded” is provable on-chain?

If an auditor reports “no response”, we can’t reliably distinguish whether the enclave never received the request or the auditor never sent it. So receipt-style proofs might not help if the auditor is malicious. Instead, we rely on auditor randomness and the honesty of the subcommittee. Also, “no response” is not rewarded; rewards are only given for proving availability, so auditors have an incentive to prove availability rather than withhold responses.

  4. The host/provider controls scheduling and networking. Even with encrypted payloads, they can shape traffic and selectively drop. How do you model:
  • healthy-to-auditors but degraded-to-users
  • blocked auditor traffic but normal user traffic
  • or the inverse (users blocked, auditors allowed)

This is an open problem: hosts can selectively allow or block traffic, as you point out.

Possible mitigations include traffic obfuscation using shared proxies or convergence of auditor and user access paths, but this remains an open design question. Happy to hear any ideas here.

  5. Do you rely on audit traffic being indistinguishable from user traffic? In practice, hosts can still use metadata, routing and timing.

Yes, this is an assumption right now but a weak one. Removing this assumption is important and ties into the mitigations mentioned above.

  6. Monitoring is periodic (per Age/Slot). Does that create “between-check” windows for strategic micro-downtime, censorship, or MEV-timed unavailability? Do you randomize challenges to reduce predictability?

Checks are randomized in time (within an Age/Slot), so it is hard to identify auditors purely based on timing. The operator risks slashing with any micro-downtime or censorship.

  7. What happens if the instance crashes near epoch end and cannot reveal R_{T_i}? Is there a commitment/receipt mechanism earlier in the epoch that avoids ambiguous states?

If the instance can’t reveal R_{T_i} due to a crash, the provider is considered at fault and can be slashed, since availability is part of the guarantees they provide. We also separate the base platform tooling from application code via blue images that enable Docker-based workflows, so an application crash shouldn’t take down the entire enclave.

  8. Slashing prices unavailability, but does not prevent it if the payoff is higher than the penalty. How do you think about strategic unavailability where the attacker’s profit dominates slashing?

Yup, it won’t. As mentioned in question 2, we are looking to increase the cost of an attack so that it is very hard for an enclave operator to be incorrectly slashed.

  9. This protocol seems great for detection and penalties, but what’s the story for safe recovery when downtime happens anyway? In privacy/TEE-heavy systems, graceful fallback often means leaking what you were protecting. How do you handle continuity without unsafe fail-open paths?

This protocol isn’t intended to replace application design for fault tolerance. It provides economic accountability and pressure to improve reliability, but applications should still assume enclaves can fail and design accordingly.

  10. Beyond incentives, do you enforce redundancy requirements or recommend operational patterns (replicas, failover, multi-provider) to reduce correlated downtime?

We strongly recommend designing applications for enclave crashes and downtime. Replication and redundancy are important design patterns here.