implement-queue-using-stacks

Difficulty: Easy · Topic: Stack · Practice it on parikshan: /practice/implement-queue-using-stacks/start

The story

An embedded systems team at NVIDIA writes the request-routing layer of a GPU driver. The driver runs on a constrained kernel environment that exposes only a stack abstraction (LIFO) in its lock-free primitives. The team needs queue (FIFO) semantics for the request pipeline: oldest request out first. They cannot import a queue library; they must build queue semantics on top of two stacks. The trick must be O(1) amortised per operation so they can quote a stable per-request latency budget.

That is implement-queue-using-stacks. The problem is small but the amortised-analysis insight it teaches is the kind of thing that ends up in your toolbox for the rest of your career.

What's actually being asked

Implement a class MyQueue with:

push(x): enqueue x at the back.
pop(): dequeue and return the front element.
peek(): return the front element without removing it.
empty(): return whether the queue is empty (true/false).

The implementation is allowed to use stacks only, push, pop, peek on stacks. The interesting constraint: each operation must be O(1) amortised even though individual pop calls may do O(n) work.

The naive approach

Use a single stack for storage. To pop the front, you have to remove every element to get to the bottom; do that with a second temporary stack, take the bottom, then move everything back. O(n) per pop and per peek. For 10^5 operations all of which are pop, that is 10 billion. Times out.

Symmetric version: do the work on push (insert at the bottom by spilling to a second stack and back). O(n) per push, but O(1) per pop. Same total cost.

The point of the problem is to do better.

The key insight

Use two stacks: in for incoming pushes and out for outgoing pops/peeks.

push(x): push to in. O(1).
pop() / peek(): if out is empty, transfer everything from in to out (which reverses the order, putting the oldest element on top of out). Then pop or peek out.
empty(): both stacks are empty.

The transfer is O(k) where k is the size of in at transfer time. But, and this is the amortised part, each element is transferred exactly once in its lifetime: it is pushed onto in, transferred to out, popped from out. Three constant-time operations per element. Total cost for n operations is O(n). Amortised: O(1) per operation.

This is the canonical amortised-analysis result and it is on every interviewer's "do they understand amortisation?" checklist.

Amortised analysis, in plain language

Worst-case bound on a single operation: O(k) for pop when a transfer happens. But every element pays exactly three operations across its entire lifetime (push to in, pop from in, push to out, pop from out, that is four, but the last only happens if the user actually pops it; let's call it three to four). So across n user operations, the total internal work is at most 4n, giving an average of 4 operations per user operation. That is what "amortised O(1)" means.

The bank-account intuition: every push deposits one coin (cheap operation). When a transfer happens, withdraw the coins to pay for it. Push deposits enough to cover its own work and the future transfer; the transfer pays from the bank.

Why the lazy transfer matters

If you transferred on every push, you would pay O(n) per push and O(1) per pop. Same total, but the worst-case push is now O(n), which is bad for a latency-sensitive system. With lazy transfer, the worst case happens at most once per element, and the average is O(1).

In a real driver, you may even pre-warm by triggering a transfer in idle moments to avoid the rare O(n) jitter at peak load. The algorithm is identical; the scheduling is implementation policy.

Step-by-step approach

in = [], out = [].
push(x): append x to in.
_transfer_if_needed(): if out is empty, while in is non-empty, pop from in and push to out.
pop(): call _transfer_if_needed; return out.pop().
peek(): call _transfer_if_needed; return out[-1].
empty(): return len(in) == 0 and len(out) == 0.

Worked example

Sequence: push 1, push 2, peek, pop, empty, push 3, pop, pop, empty.

op	in	out	result
push 1	[1]	[]	null
push 2	[1, 2]	[]	null
peek	[]	[2, 1] (after transfer)	1
pop	[]	[2]	1
empty	[]	[2]	false
push 3	[3]	[2]	null
pop	[3]	[] (popped 2 from `out`)	2
pop	[]	[3] (after transfer)	3
empty	[]	[]	true

Six operations did O(1) work, two did "more" but with the amortised guarantee.

Complexity

push: O(1) worst-case and amortised.
pop / peek: O(1) amortised, O(n) worst-case at a transfer.
empty: O(1) worst-case.
Space: O(n) total across both stacks.

Common pitfalls

Transferring on every push: correct but O(n) per push, defeating amortisation.
Transferring on every pop, even when out is non-empty: wrong! That reverses the order again, breaking FIFO semantics. Transfer only when out is empty.
Forgetting the empty check in _transfer_if_needed: leads to wrong order.
Using a single stack and a "front pointer": defeats the spirit of the exercise; you are no longer "two stacks".
Returning False capitalised in Python: the spec wants lowercase true/false. (This problem's I/O format requires that, per the spec.)

Where this shows up in the enterprise

GPU drivers (NVIDIA, AMD): FIFO request pipelines built on lock-free stack primitives.
Networking (DPDK, RDMA stacks): zero-copy ring buffers implemented as paired stacks for cache-line discipline.
High-frequency trading: order pipelines where two cache-aligned arrays act as in and out stacks for ultra-low-latency FIFO.
Game engine event loops (Unity, Unreal): deferred-event queues built from two stacks for predictable allocation patterns.
Operating system schedulers: scheduler classes that maintain ready queues using paired LIFO arenas.
Database write-ahead logs: log replay sometimes uses paired stacks for in-order reconstruction of out-of-order pages.
Algorithmic trading risk monitors: position queues built as paired stacks for deterministic peak-time behaviour.

The pattern generalises: two LIFO structures simulate a FIFO when individual operations can be amortised. Wherever LIFO primitives are cheaper or the only thing available, this trick is the bridge.

Variants you'll see elsewhere

implement-stack-using-queues: the dual; harder because queues alone cannot reverse without O(n) per push or pop.
min-stack: see the article in this cluster.
design-circular-deque: a richer data-structure design problem.
monotonic-stack and monotonic-queue: same building blocks, different invariants.
lru-cache: another design problem requiring O(1) amortised operations.

How parikshan turns this into a learning loop

The dynamic private tests on parikshan probe:

Many alternating push/pop pairs (tests that the transfer happens exactly when needed).
A long push streak followed by a long pop streak (tests amortised correctness and FIFO order across a single transfer).
A test sequence that forces multiple transfers (tests that you do not over-transfer).
The peek operation interleaved between pops (must not consume elements).
empty immediately after construction (returns true).
Repeated empty calls during operation (constant time, no side effects).

The integrity-scored AI sparring partner is the place, in practice mode, to verbalise the amortised argument: "each element is pushed once, transferred once, popped once; three operations per element across its life". Once you can say that sentence, you understand amortisation as a concept, not just this problem.

In the AI-integrated workspace

When an agent designs a "queue-on-stacks" class, your review:

Is the transfer lazy (only when out is empty), or eager (on every push)?
Does the agent state the amortised complexity explicitly in a comment? If asked, can it explain why?
Are pop and peek correctly delegating through _transfer_if_needed?
Does empty check both stacks, not just one?

In production, the moment a teammate proposes "let me write a queue using two stacks because our environment only has stacks", you should immediately know: (a) the amortised cost is O(1), (b) the worst-case pop is O(n) and that may need addressing for hard real-time constraints, (c) the trick is lazy transfer. That triad is what implement-queue-using-stacks teaches.

This problem is the easiest article in the stack cluster but the one where amortised analysis first earns its keep. Treat the parikshan practice run as the moment you internalise "three operations per element" as a way of seeing algorithm cost. That seeing is what separates engineers who can build queues from engineers who can also explain why their queue is fast.