implement-queue-using-stacks
Difficulty: Easy · Topic: Stack · Practice it on parikshan: /practice/implement-queue-using-stacks/start
The story
An embedded systems team at NVIDIA writes the request-routing layer of a GPU driver. The driver runs on a constrained kernel environment that exposes only a stack abstraction (LIFO) in its lock-free primitives. The team needs queue (FIFO) semantics for the request pipeline: oldest request out first. They cannot import a queue library; they must build queue semantics on top of two stacks. The trick must be O(1) amortised per operation so they can quote a stable per-request latency budget.
That is implement-queue-using-stacks. The problem is small but the amortised-analysis insight it teaches is the kind of thing that ends up in your toolbox for the rest of your career.
What's actually being asked
Implement a class MyQueue with:
push(x): enqueuexat the back.pop(): dequeue and return the front element.peek(): return the front element without removing it.empty(): return whether the queue is empty (true/false).
The implementation is allowed to use stacks only, push, pop, peek on stacks. The interesting constraint: each operation must be O(1) amortised even though individual pop calls may do O(n) work.
The naive approach
Use a single stack for storage. To pop the front, you have to remove every element to get to the bottom; do that with a second temporary stack, take the bottom, then move everything back. O(n) per pop and per peek. For 10^5 operations all of which are pop, that is 10 billion. Times out.
Symmetric version: do the work on push (insert at the bottom by spilling to a second stack and back). O(n) per push, but O(1) per pop. Same total cost.
The point of the problem is to do better.
The key insight
Use two stacks: in for incoming pushes and out for outgoing pops/peeks.
push(x): push toin.O(1).pop()/peek(): ifoutis empty, transfer everything fromintoout(which reverses the order, putting the oldest element on top ofout). Then pop or peekout.empty(): both stacks are empty.
The transfer is O(k) where k is the size of in at transfer time. But, and this is the amortised part, each element is transferred exactly once in its lifetime: it is pushed onto in, transferred to out, popped from out. Three constant-time operations per element. Total cost for n operations is O(n). Amortised: O(1) per operation.
This is the canonical amortised-analysis result and it is on every interviewer's "do they understand amortisation?" checklist.
Amortised analysis, in plain language
Worst-case bound on a single operation: O(k) for pop when a transfer happens. But every element pays exactly three operations across its entire lifetime (push to in, pop from in, push to out, pop from out, that is four, but the last only happens if the user actually pops it; let's call it three to four). So across n user operations, the total internal work is at most 4n, giving an average of 4 operations per user operation. That is what "amortised O(1)" means.
The bank-account intuition: every push deposits one coin (cheap operation). When a transfer happens, withdraw the coins to pay for it. Push deposits enough to cover its own work and the future transfer; the transfer pays from the bank.
Why the lazy transfer matters
If you transferred on every push, you would pay O(n) per push and O(1) per pop. Same total, but the worst-case push is now O(n), which is bad for a latency-sensitive system. With lazy transfer, the worst case happens at most once per element, and the average is O(1).
In a real driver, you may even pre-warm by triggering a transfer in idle moments to avoid the rare O(n) jitter at peak load. The algorithm is identical; the scheduling is implementation policy.
Step-by-step approach
in = [],out = [].push(x): appendxtoin._transfer_if_needed(): ifoutis empty, whileinis non-empty, pop frominand push toout.pop(): call_transfer_if_needed; returnout.pop().peek(): call_transfer_if_needed; returnout[-1].empty(): returnlen(in) == 0 and len(out) == 0.
Worked example
Sequence: push 1, push 2, peek, pop, empty, push 3, pop, pop, empty.
| op | in | out | result |
|---|---|---|---|
| push 1 | [1] | [] | null |
| push 2 | [1, 2] | [] | null |
| peek | [] | [2, 1] (after transfer) | 1 |
| pop | [] | [2] | 1 |
| empty | [] | [2] | false |
| push 3 | [3] | [2] | null |
| pop | [3] | [] (popped 2 from out) | 2 |
| pop | [] | [3] (after transfer) | 3 |
| empty | [] | [] | true |
Six operations did O(1) work, two did "more" but with the amortised guarantee.
Complexity
push: O(1) worst-case and amortised.pop/peek: O(1) amortised, O(n) worst-case at a transfer.empty: O(1) worst-case.- Space: O(n) total across both stacks.
Common pitfalls
- Transferring on every push: correct but
O(n)per push, defeating amortisation. - Transferring on every pop, even when
outis non-empty: wrong! That reverses the order again, breaking FIFO semantics. Transfer only whenoutis empty. - Forgetting the empty check in
_transfer_if_needed: leads to wrong order. - Using a single stack and a "front pointer": defeats the spirit of the exercise; you are no longer "two stacks".
- Returning
Falsecapitalised in Python: the spec wants lowercasetrue/false. (This problem's I/O format requires that, per the spec.)
Where this shows up in the enterprise
- GPU drivers (NVIDIA, AMD): FIFO request pipelines built on lock-free stack primitives.
- Networking (DPDK, RDMA stacks): zero-copy ring buffers implemented as paired stacks for cache-line discipline.
- High-frequency trading: order pipelines where two cache-aligned arrays act as
inandoutstacks for ultra-low-latency FIFO. - Game engine event loops (Unity, Unreal): deferred-event queues built from two stacks for predictable allocation patterns.
- Operating system schedulers: scheduler classes that maintain ready queues using paired LIFO arenas.
- Database write-ahead logs: log replay sometimes uses paired stacks for in-order reconstruction of out-of-order pages.
- Algorithmic trading risk monitors: position queues built as paired stacks for deterministic peak-time behaviour.
The pattern generalises: two LIFO structures simulate a FIFO when individual operations can be amortised. Wherever LIFO primitives are cheaper or the only thing available, this trick is the bridge.
Variants you'll see elsewhere
implement-stack-using-queues: the dual; harder because queues alone cannot reverse withoutO(n)per push or pop.min-stack: see the article in this cluster.design-circular-deque: a richer data-structure design problem.monotonic-stackandmonotonic-queue: same building blocks, different invariants.lru-cache: another design problem requiringO(1)amortised operations.
How parikshan turns this into a learning loop
The dynamic private tests on parikshan probe:
- Many alternating push/pop pairs (tests that the transfer happens exactly when needed).
- A long push streak followed by a long pop streak (tests amortised correctness and FIFO order across a single transfer).
- A test sequence that forces multiple transfers (tests that you do not over-transfer).
- The
peekoperation interleaved between pops (must not consume elements). emptyimmediately after construction (returns true).- Repeated
emptycalls during operation (constant time, no side effects).
The integrity-scored AI sparring partner is the place, in practice mode, to verbalise the amortised argument: "each element is pushed once, transferred once, popped once; three operations per element across its life". Once you can say that sentence, you understand amortisation as a concept, not just this problem.
In the AI-integrated workspace
When an agent designs a "queue-on-stacks" class, your review:
- Is the transfer lazy (only when
outis empty), or eager (on every push)? - Does the agent state the amortised complexity explicitly in a comment? If asked, can it explain why?
- Are
popandpeekcorrectly delegating through_transfer_if_needed? - Does
emptycheck both stacks, not just one?
In production, the moment a teammate proposes "let me write a queue using two stacks because our environment only has stacks", you should immediately know: (a) the amortised cost is O(1), (b) the worst-case pop is O(n) and that may need addressing for hard real-time constraints, (c) the trick is lazy transfer. That triad is what implement-queue-using-stacks teaches.
This problem is the easiest article in the stack cluster but the one where amortised analysis first earns its keep. Treat the parikshan practice run as the moment you internalise "three operations per element" as a way of seeing algorithm cost. That seeing is what separates engineers who can build queues from engineers who can also explain why their queue is fast.