I was reading http://thislongrun.blogspot.com/2015/03/dead-nodes-dont-bite.html and simply had to respond to some confusion. Since I’m not sure what the comment publishing timeline looks like, I’m copying my reply here.
The intuition you’ve quoted is right and I’m not quite sure why you think it’s wrong.
In particular:
“Now let us turn to single node failures. […] You simply failover to a replica in a transactionally consistent way. Notably, at least Tandem and Vertica have been doing exactly this for years.” There is nothing more to add. There are real-world systems that are both available and consistent when there is a node failure.
That’s true *in this failure mode*, but the choice of “Availability” in the CAP theorem is much stronger: being available requires a response as long as you have *any* working nodes in the system. The fact that some systems have consistent replica pairs that they can fail over between does not make them both Consistent and Available in the CAP sense — it just makes them Consistent but with more uptime than the naive system.
Looking at [1], which I assume you’re drawing that quote from, it’s clear that Stonebraker’s main point is not that CAP is wrong, but that it encourages the wrong sort of engineering tradeoffs: yes, you might need to choose between C and A eventually, but your systems should be designed to delay the decision as long as possible — and in many applications, users won’t be able to tell the difference. But none of that makes partition-tolerance any less equivalent to fault-tolerance.
In particular, Stonebraker assumes that double failures simply won’t happen, so you shouldn’t worry about them in your system design. As a developer on a CP system (Ceph) with users who report issues to a mailing list, I am sad to report that for systems at scale he is wrong about this assumption. 🙁
“In other words, in a strongly consistent system, the intuition is right, or, more precisely, actions are taken to make it right. However, in an AP system the intuition is just wrong.”
In an AP system you are tolerating a partition, by choosing to *not care* whether the nodes are partitioned or not. But a partition and a node failure are still indistinguishable to any monitoring nodes you have.
In a CP system you are choosing to be consistent and partition-tolerant, by guaranteeing that one side will win. But again, down versus partitioned is indistinguishable to any monitoring nodes you have.
And that makes a failure and a partition indistinguishable from anybody else’s point of view.
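To make that concrete, here’s a minimal monitoring sketch (my own, not from either post; the timeout value and function name are invented for illustration). All a monitor ever observes is whether a heartbeat arrived in time, so a dead node and a partitioned node produce exactly the same input:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; an arbitrary illustrative value

def node_status(last_heartbeat_at: float, now: float) -> str:
    """Classify a peer from the monitor's point of view.

    Whether the peer crashed, lost power, or is on the far side of a
    network partition, the observable evidence is identical: silence.
    """
    if now - last_heartbeat_at <= HEARTBEAT_TIMEOUT:
        return "alive"
    # The only honest answer: "dead" and "partitioned" produce the same input.
    return "unreachable"

# A peer that last checked in 10 seconds ago is simply "unreachable";
# nothing in the inputs can refine that into "failed" versus "partitioned".
print(node_status(last_heartbeat_at=time.time() - 10, now=time.time()))
```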
Indeed, you circle back to this in your penultimate section, so I’m not sure why your lede and conclusion say otherwise.
Speaking more generally, even if partition tolerance and failure are not equivalent, you still have to design your system (in CAP terms) as if they are, choosing to give up either C or A:
Let’s say you did try to design a “CA” system. Let it consist of a set of nodes {N}. Make some writes {W}. Let a subset {M} of those nodes go down. Make some more writes {X} to the remaining {N}-{M} nodes.
Now let the {N}-{M} nodes die, and the {M} nodes turn back on. What happens to your system? Do the {M} functioning nodes return data based only on {W}, or do they refuse service? Does this simply count as every node being failed until one of {N}-{M} is back?
How do those answers differ if you run the same scenario as a partition instead of a failure?
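For concreteness, here’s a toy simulation of that thought experiment (my own sketch; the node names and write labels are made up). It replays the scenario above and shows the corner a “CA” system gets backed into:

```python
# Toy replay of the scenario: nodes {N}, writes {W}, subset {M} fails,
# writes {X} land on {N}-{M}, then the survivors die and {M} comes back.

nodes = {"n1", "n2", "n3", "n4"}   # {N}
state = {n: [] for n in nodes}     # each node's local write log

def write(alive: set, entry: str) -> None:
    # Replicate the write to every currently reachable node.
    for n in alive:
        state[n].append(entry)

alive = set(nodes)
write(alive, "W")                  # writes {W} reach every node

M = {"n1", "n2"}                   # {M} goes down (or is it partitioned?)
alive -= M
write(alive, "X")                  # writes {X} reach only {N}-{M}

alive = set(M)                     # now {N}-{M} dies and {M} turns back on

for n in sorted(alive):
    print(n, state[n])             # n1 ['W'] / n2 ['W'] -- {X} is gone

# Serve this data and you have given up C: {X} was acknowledged but is
# invisible. Refuse service until an {N}-{M} node returns and you have
# given up A. No sequencing of plain node failures lets you keep both.
```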