Recently, blockchain-powered network Stellar stopped confirming transactions for more than one hour, effectively going offline.
Although no money was reportedly lost as a result, the incident has publicly highlighted a major issue for Stellar: The project is not decentralized, at least not to the extent expected at this point. Notably, researchers had predicted the offline scenario last month.
Brief introduction to Stellar and its network
Stellar is a platform for money remittance. It was launched in 2014 by Jed McCaleb, founder of Mt. Gox and co-founder of Ripple, and former lawyer Joyce Kim. Stellar’s native asset is the lumen (XLM), currently the ninth-largest cryptocurrency by market cap.
The Stellar network, in turn, is designed as a decentralized peer-to-peer network of validator nodes. Stellar Core software is used by the nodes to confirm transactions.
To reach global consensus with other nodes, Stellar Core runs the Stellar Consensus Protocol (SCP). As per the SCP’s white paper, it has “modest computing and financial requirements” compared to more popular decentralized schemes of proof-of-work (PoW) and proof-of-stake (PoS).
In other words, instead of using an entire network to validate a transaction like bitcoin does, Stellar relies on the so-called quorum slices — sets of nodes that each validator node chooses to agree with. This system allegedly allows Stellar to unburden the network and host as many as 1,000 operations per second, compared to a much more modest rate showcased by bitcoin (up to seven transactions per second) and Ethereum (up to 15 transactions per second).
Together, all quorum slices that make up the validator nodes form a global network, where voting is used to ensure consensus on which transactions are recorded to the ledger. According to Stellar, this process “occurs approximately every 2-5 seconds.”
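To make the idea of quorum slices more concrete, here is a minimal sketch in Python. It is a toy model rather than Stellar Core's actual implementation: the node names and slices are invented for illustration, and the only rule it encodes is that a set of nodes counts as a quorum when every member has at least one of its slices fully inside that set.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy network: each node lists the slices (sets of nodes) it trusts.
Slices = Dict[str, List[FrozenSet[str]]]


def is_quorum(candidate: Set[str], slices: Slices) -> bool:
    """A non-empty set is a quorum if every member has a slice inside it."""
    if not candidate:
        return False
    return all(
        any(node_slice <= candidate for node_slice in slices[node])
        for node in candidate
    )


# Invented example configuration (not real Stellar validators).
slices: Slices = {
    "A": [frozenset({"A", "B"})],
    "B": [frozenset({"B", "C"})],
    "C": [frozenset({"C", "A"})],
    "D": [frozenset({"D", "A", "B"})],
}

print(is_quorum({"A", "B", "C"}, slices))  # True
print(is_quorum({"A", "B"}, slices))       # False: B's only slice needs C
```

In this toy setup no single node sees the whole network, yet the overlapping slices are enough for A, B and C to agree among themselves.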
So why did the Stellar network go offline?
The Stellar Development Foundation (SDF) — a nonprofit organization committed to the development and adoption of Stellar — believes that the network collapsed because “new nodes took on too much consensus responsibility too soon.” Alternatively, as Nicolas Barry, chief technology officer of Stellar, put it, “it was caused by being too decentralized too fast.”
More specifically, the outage seems to be directly related to earlier claims that Stellar’s network is too centralized. Last month, three researchers from the Korea Advanced Institute of Science and Technology (KAIST) published a paper titled “Is Stellar As Secure As You Think?” concluding that the analysis of the Stellar network proves that it “is significantly centralized.”
Specifically, the researchers stressed that the entire Stellar network rested upon a limited number of nodes, primarily the ones controlled by SDF itself:
“We show that all of the nodes in Stellar cannot run Stellar consensus protocol if only two nodes fail,” the research claims. “To make matters worse, these two nodes are run and controlled by a single organization, the Stellar foundation.”
Later that month, David Mazières, the chief scientist at SDF and a professor of computer science at Stanford University, penned a response. In it, he confirmed that the configuration of Stellar’s federated Byzantine agreement (FBA), which is a consensus model based on quorum slices, is highly centralized, and said that Stellar developers were “in the process of improving” it. Mazières continued:
“We […] are glad the authors drew attention to this fact. Things have already improved considerably from the configuration analyzed in the paper — for instance the Stellar Development Foundation (SDF) can no longer halt the network, and no two nodes can affect liveness.”
Nevertheless, on May 15, at 1:14 p.m. PST, the Stellar network failed to reach consensus and went offline for 67 minutes, according to SDF (some other reports mentioned “approximately two hours”). In a post-mortem analysis, SDF explained that the network froze because too many new nodes were being added in a bid to make it more decentralized:
“We’ve seen claims that Stellar is ‘over-centralized’ and that somehow a failure with SDF’s nodes dragged down the whole network. Ironically, the opposite is true. Stellar has added many new nodes recently. In retrospect, some new nodes took on too much consensus responsibility too soon.”
Specifically, a node of Keybase — a blockchain startup that SDF has invested in — was taken offline for maintenance. At that time, other nodes were reportedly “shaky or down,” which is allegedly why Stellar came to a halt.
Furthermore, SDF claimed that stopping the network is in fact a preferable scenario for Stellar over operating in a faulty state, since the network accommodates financial institutions that supposedly chose it because they “prefer downtime over inconsistent data.” That is why the Stellar protocol didn’t fail, but actually worked as intended, the nonprofit organization argued.
“As a fundamental design choice, Stellar prefers consistency and partition resilience over liveness,” the statement reads. “This is different from other blockchains, in which ‘the chain must go on’ even at the price of soft forks.”
Additionally, SDF has highlighted that no funds were lost as a result of the incident, and the network is currently “healthy.”
KAIST warns that the fundamental problem has not been solved
According to Yongdae Kim, one of the KAIST researchers who authored the April research on the Stellar network, the collapse happened after some changes were made to its structure.
Specifically, Kim told Cointelegraph that, at the time the paper was submitted, if two out of three SDF validator nodes went offline, the Stellar network would collapse.
After researchers reported on the vulnerability, SDF allegedly tried to decentralize the network by removing SDF validators from quorum sets. As a result, Stellar became robust against a two-node failure, but was still vulnerable to a three-node failure, according to Kim.
However, right before the halt on May 15, the network had somehow become unstable in the face of a two-node failure once more, Kim said, stressing that none of those node pairs belonged to SDF, given that they had been removed at the time. Eventually, a pair of those nodes went offline, which apparently brought the whole network down.
To deal with the aftermath of the network failure and bring it back online, SDF added all three of its validators back into quorum sets, according to Kim, and hence has returned “back to step 1,” in which if two out of three SDF validator nodes go down, the Stellar network will collapse.
“After we reported it [the cascade failure problem] to them, they manually adjusted validator sets for a long time,” Kim explained to Cointelegraph. Nevertheless, he said the fact that network failure did occur at some point later on “shows that the design makes it difficult to maintain robust network structure against cascade failure.”
Outlining the fundamental reasons for why the network is vulnerable to a cascade failure problem, Kim described how node hosts have to manually choose their quorum sets, which is difficult, given the complexity of the network’s design. Moreover, the KAIST researcher stressed that not all nodes are equally robust. “SDF are more robust, but they could be a good target,” he told Cointelegraph.
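The cascade failure Kim describes can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the KAIST researchers' method or Stellar's real configuration: the node names, the "listen to three core validators" layout and the threshold of two are all hypothetical. A node stalls when fewer than its threshold of validators are still making progress, and a stalled node then counts as unavailable for everyone who depends on it.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical config: node -> (validators it listens to, how many it needs).
QuorumSets = Dict[str, Tuple[List[str], int]]


def surviving_nodes(config: QuorumSets, failed: Set[str]) -> Set[str]:
    """Return the nodes that can still make progress once `failed` go offline.

    Stalled nodes are removed repeatedly until nothing changes, so one
    stall can trigger further stalls (the cascade).
    """
    alive = set(config) - failed
    changed = True
    while changed:
        changed = False
        for node in sorted(alive):
            validators, threshold = config[node]
            if sum(v in alive for v in validators) < threshold:
                alive.discard(node)
                changed = True
    return alive


# Invented layout: everyone depends on the same three core validators.
core = ["core1", "core2", "core3"]
config: QuorumSets = {name: (core, 2) for name in core + ["edge1", "edge2"]}

print(sorted(surviving_nodes(config, {"core1"})))           # four nodes keep running
print(sorted(surviving_nodes(config, {"core1", "core2"})))  # [] -> whole network halts
```

In this toy layout, losing one core validator is harmless, while losing two stalls the third core node and, through it, every other node.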
The community’s general reaction was that Stellar is largely centralized, despite SDF actively pushing the opposite opinion. Emin Gün Sirer, co-director of IC3, tweeted:
If your entire network is going down because a single entity had a problem, exactly how decentralized can your system be? That’s right: not at all. https://t.co/6cEWPTPXYc
— Emin Gün Sirer (@el33th4xor) May 16, 2019
In response, Kyle McCollom, product manager at SDF, argued that several nodes were unavailable, while Keybase’s node going down for maintenance pushed the network past the threshold:
Several nodes were unavailable (“In the past few weeks we saw, repeatedly, misconfigured validators hampering consensus.”), and Keybase’s node shutdown pushed the network past the threshold. This happened bc several nodes had a problem, not bc “a single entity had a problem”.
— Kyle McCollom (@kylemccollom) May 16, 2019
Similarly, a user post on Stellar’s subreddit originally implied that the network couldn’t reach consensus because SDF nodes went down, which was denied by McCaleb in the comment section: Stellar’s co-founder wrote that “the SDF nodes and in fact the majority of validators in the network were still up,” but “couldn’t close ledgers safely because they weren’t hearing from enough nodes in their quorums.”
When asked whether the Stellar network could be called a decentralized one after the incident, Hartej Sawhney, a blockchain expert and co-founder of Hosho, replied negatively, but clarified that no project is decentralized today, as the concept has yet to be properly implemented. “Seems like the issue is less to do with centralization, but more to do with consensus responsibility of new nodes,” he told Cointelegraph.
“At this point of time, Stellar is definitely a centralized network, especially in terms of the liveness aspects, as it was demonstrated in a research done at KAIST,” Eyal Shani, a blockchain researcher at Aykesubir, agreed. “However, this should be no surprise since even the great Bitcoin network can be considered as centralized by many.”
XLM’s price has been experiencing a flat base over the past few days, while the market continues to recover from a major correction that happened earlier this week.
Moreover, on May 16, soon after the network failure was reported, XLM experienced a solid 15% growth, suggesting that the news didn’t affect the asset’s value.
How will Stellar fix this?
SDF has outlined a number of ways to make the network more decentralized and stable at the same time, as part of its response to the incident.
First, the nonprofit aims to introduce better onboarding for new validators by providing users with published standards and explorers to help them create “good” quorum sets — presumably meaning that SDF will advise hosts on which nodes should be included in their quorum slices to avoid similar incidents.
SDF also hopes to achieve better operational standards. “We will increase operator coordination so that maintenance schedules are publicly communicated,” the organization wrote in the blog post. “We will also help operators keep their nodes and their quorum choices up-to-date.”
Moreover, SDF aims to improve monitoring and alerting to warn node hosts about which crucial nodes are missing from the network, as well as to arrange bot-created announcements in the public validators channel anytime a node goes offline. Improved communication will also ensure that the network can be brought back online much more quickly, the nonprofit suggests.
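As a rough illustration of what such alerting might look like, the sketch below checks each node's quorum set against the validators currently online and produces a message when the node has halted or has no spare validators left. It is not SDF's tooling: the function name, message wording, node names and thresholds are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical config: node -> (validators it listens to, how many it needs).
QuorumSets = Dict[str, Tuple[List[str], int]]


def liveness_alerts(config: QuorumSets, online: Set[str]) -> List[str]:
    """Return one message per node that is halted or has no spare validators."""
    alerts = []
    for node, (validators, threshold) in sorted(config.items()):
        missing = sorted(v for v in validators if v not in online)
        # Spare validators beyond the threshold; negative means the node halts.
        margin = len(validators) - len(missing) - threshold
        if margin < 0:
            alerts.append(f"{node}: halted, missing {missing}")
        elif margin == 0:
            alerts.append(f"{node}: no spare validators left, missing {missing}")
    return alerts


# Invented example: v2 is down for maintenance.
config: QuorumSets = {
    "node_a": (["v1", "v2", "v3"], 2),
    "node_b": (["v1", "v2", "v3", "v4"], 2),
}
for message in liveness_alerts(config, online={"v1", "v3", "v4"}):
    print(message)  # node_a: no spare validators left, missing ['v2']
```

A bot posting these messages to an operators' channel would give hosts a chance to postpone maintenance before a second node drops out.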
Kim thinks that none of SDF’s proposals tackles the cascade failure problem directly, which could lead to further incidents. “Overall, these are good set of mitigations. However, it does not fundamentally solve the problem of Stellar,” he told Cointelegraph. “Without a design change, it would be difficult to improve liveness of Stellar.”
Considering that SDF seems to prioritize consistency and partition resilience over network liveness, Stellar moving from the safety of trusted SDF nodes to a more decentralized scenario could result in new system collapses, Shani of Aykesubir said. “Until they onboard enough serious validators who promise to behave (i.e be up and run the protocol) we could be seeing more halts in the near future,” he told Cointelegraph.
Time will tell if the nonprofit manages to reinforce its network to prevent further outages, but for now, Stellar could be joining the ranks of other major crypto projects that are criticized for a lack of decentralization.