Every Boolean Permission Is a Race Condition

# architecture# distributedsystems# security# systemdesign

MATT ROSE

Most production systems do not fail because they forgot to check permission. They fail because the...

Most production systems do not fail because they forgot to check permission.

They fail because the permission was true too early.

That is the bug.

The system checked whether an action was allowed. The answer was yes. Then time passed. State changed. The account changed. The provider changed. The resource changed. The policy changed. The user clicked again. Another worker acted first. A webhook arrived late. A payment expired. A record was cancelled. A session went stale.

Then the system performed the action anyway because it was still holding an old boolean.

canProceed: true

That is not authorization.

That is a stale opinion.
Booleans Are Not Decisions

A boolean tells you almost nothing.

{
"allowed": true
}

Allowed by whom?

When?

Against what source?

For which action?

For which resource?

For how long?

Under which conditions?

Was the answer live or cached?

Was the provider available?

Was this a real decision or a fallback?

Can the decision expire?

Can it be revoked?

A boolean cannot answer any of that.

A boolean is a compression artifact. It takes a messy, time-bound, conditional decision and crushes it into true or false.

Then the rest of the system pretends that nothing important was lost.

Something important was lost.

Time.
The UI Is Not the Authority

A green button is not permission.

A checked box is not permission.

A dashboard badge is not permission.

A frontend state is not permission.

A user opened a page at 10:01. The backend said the action was available. The frontend rendered the button. The user clicked it at 10:04.

What happened between 10:01 and 10:04?

Maybe nothing.

Maybe everything.

The account could have been suspended. The payment could have failed. The resource could have been reassigned. The policy window could have closed. The external provider could have revoked the state. Another process could have already consumed the opportunity.

The UI does not know.

The UI is a photograph of an old decision.

Production systems should treat every frontend permission as stale by default.
This Is TOCTOU for Product Logic

Security people know this bug.

Time of check. Time of use.

You check a file, then use it later. Between the check and the use, the file changes.

The same bug exists in ordinary business logic.

Except instead of files, the objects are:

accounts
orders
bookings
payments
subscriptions
appointments
permissions
devices
contracts
inventory
external provider state

The pattern is the same:

check state
state changes
use stale check

This is not an edge case. This is the default condition of distributed software.

The world changes between reads and writes.

If your system cannot represent that, it will eventually act on fiction.
Permission Should Be a Lease

Stop treating permission as a boolean.

Treat it as a lease.

A permission lease says:

This subject may perform this action
against this resource
under these conditions
from this authority
until this time
unless revoked earlier.

That is authorization.

This is not:

{
"allowed": true
}

A real permission object should look more like this:

{
"decision": "allowed",
"decision_id": "dec_123",
"subject_id": "user_456",
"resource_id": "res_789",
"action": "execute_transfer",
"authority": "billing_service",
"source": "live",
"issued_at": "2026-06-22T18:45:00Z",
"expires_at": "2026-06-22T18:46:00Z",
"conditions": [
"account_active",
"payment_method_valid",
"policy_window_open"
]
}

Now the system has something it can reason about.

Not just “yes.”

A scoped, time-bound claim.
The Action Boundary Must Recheck

The place where the action happens is the only place that matters.

Not the page load.

Not the eligibility screen.

Not the preview step.

Not the prior API response.

The action boundary.

That is where permission must be validated.

Bad:

async function executeAction(input) {
if (!input.allowed) {
throw new Error("Not allowed");
}

return performAction(input);
}

This is not secure. It is theater.

Better:

async function executeAction(input) {
const lease = await getPermissionLease(input.decision_id);

if (!lease) {
return {
success: false,
status: "permission_missing"
};
}

if (new Date(lease.expires_at) < new Date()) {
return {
success: false,
status: "permission_expired"
};
}

if (lease.subject_id !== input.subject_id) {
return {
success: false,
status: "subject_mismatch"
};
}

if (lease.resource_id !== input.resource_id) {
return {
success: false,
status: "resource_mismatch"
};
}

if (lease.action !== input.action) {
return {
success: false,
status: "action_mismatch"
};
}

return performAction(input);
}

For irreversible actions, even that is not enough.

You recheck the authority live.

async function executeIrreversibleAction(input) {
const lease = await validatePermissionLease(input);

if (!lease.valid) {
return lease.refusal;
}

const live = await fetchAuthoritativeState(input.resource_id);

if (!live.allowed) {
return {
success: false,
status: "state_changed",
reason: live.reason
};
}

return performAction(input);
}

The more serious the action, the closer the check must be to the use.
Cached Permission Needs a Warning Label

Caching is not the enemy.

Unlabeled caching is.

Some workflows can safely use cached state. Many cannot. The problem is when the system hides the difference.

This is not enough:

{
"allowed": true
}

This is useful:

{
"decision": "allowed",
"source": "cached",
"cached_at": "2026-06-22T18:40:00Z",
"max_age_ms": 30000,
"authority": "account_service"
}

Now downstream code can refuse cached authority when live authority is required.

if (permission.source === "cached" && action.requires_live_authority) {
return {
success: false,
status: "live_authority_required"
};
}

That is the difference between an intentional fallback and an accidental lie.
“Success” Should Not Hide the State That Made It Possible

A lot of systems store the final action but not the decision that allowed it.

That is a weak audit trail.

After something goes wrong, the question is not only:

What happened?

The question is:

What did the system believe when it acted?

You need the decision snapshot.

create table action_decision_ledger (
id uuid primary key default gen_random_uuid(),

subject_id text not null,
resource_id text not null,
action text not null,

decision text not null,
authority text not null,
decision_source text not null,

issued_at timestamptz not null,
expires_at timestamptz,
used_at timestamptz,

conditions jsonb,
final_status text,
refusal_reason text,

created_at timestamptz default now()
);

Logs are not enough.

Logs tell you what the code said.

A decision ledger tells you what the system believed.

That is what matters.
Refusal States Are Product Features

A serious system knows how to say no.

Not just with a generic error.

With a precise refusal state.

permission_expired
live_authority_required
state_changed
authority_unavailable
authority_rejected
policy_window_closed
duplicate_action_in_progress
manual_review_required
provider_unconfigured
scope_mismatch

These are not error messages.

They are control flow.

Each one tells the system what to do next.

Retry.

Refresh.

Recheck.

Escalate.

Route to review.

Ask for new authorization.

Stop completely.

Generic failure is useless. Named refusal is operational intelligence.
The Pattern

The pattern is simple.

Ask the authoritative system for permission.
Issue a short-lived decision lease.
Bind the lease to subject, resource, and action.
Revalidate the lease at execution.
Recheck live authority for irreversible actions.
Refuse with named states when stale.
Store the decision snapshot.

That is not complicated.

It is just stricter than most systems are willing to be.

Most systems prefer the comfort of true.

Production does not care about comfort.

Production cares whether the thing is still true when you act.
The Engineering Law

Here is the law:

Never treat permission as a boolean.
Treat permission as a time-bound claim.

A boolean says:

yes

A permission lease says:

yes, for this actor, against this resource, for this action, from this authority, under these conditions, until this moment

That is the difference between a UI state and a production decision.

Final Thought

The most dangerous bugs are not always wrong at the moment they are created.

Sometimes they begin as truth.

The user really was eligible.

The resource really was available.

The policy really did allow it.

The provider really did agree.

The button really did belong on the page.

Then the world changed.

And the system acted as if it had not.

That is why every boolean permission is a race condition.

The fix is not another checkmark.

The fix is time.