What is Distributed Storage?
A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
Distributed storage is the basis for massively scalable cloud storage systems like Amazon S3 and Microsoft Azure Blob Storage, as well as on-premise distributed storage systems like Cloudian Hyperstore.
Distributed storage, here, collectively refers to “distributed data stores” (also called “distributed databases”) and “distributed file systems”. It is used in many forms, from the NoSQL trend to, most famously, AWS S3 storage.
The core concept is to build redundancy into the storage of data by splitting data into multiple parts and ensuring there are replicas across multiple physical servers (often with varying storage capacities).
Replicas can be stored either as exact whole copies or encoded into multiple fragments using erasure coding.
Regardless of the storage method used, this ensures that there are X copies of the data stored across Y servers, where X <= Y (for example, 3 replicas across 5 servers).
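As a rough illustration of the X-replicas-across-Y-servers idea, here is a minimal sketch; the round-robin placement and the node names are purely illustrative, not any particular system’s algorithm:

```python
# Minimal sketch of placing X replicas of each data chunk onto Y servers (X <= Y).
# The round-robin placement below is illustrative only, not a real system's algorithm.

def place_replicas(chunk_id: int, servers: list, replicas: int = 3) -> list:
    """Pick `replicas` distinct servers for a chunk, starting at a slot derived from its id."""
    assert replicas <= len(servers), "cannot have more replicas than servers (X <= Y)"
    start = chunk_id % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]

servers = ["node-a", "node-b", "node-c", "node-d", "node-e"]
for chunk_id in range(3):
    print(chunk_id, place_replicas(chunk_id, servers))
# 0 ['node-a', 'node-b', 'node-c']
# 1 ['node-b', 'node-c', 'node-d']
# 2 ['node-c', 'node-d', 'node-e']
```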
Distributed storage systems can store several types of data:
1. Files — a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
2. Block storage — a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
3. Objects — a distributed object storage system wraps data into objects, identified by a unique ID or hash.
Distributed storage systems have several advantages:
1. Scalability — the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
2. Redundancy — distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
3. Cost — distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
4. Performance — distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.
Distributed Storage Features and Limitations
Most distributed storage systems have some or all of the following features:
1. Partitioning — the ability to distribute data between cluster nodes and enable clients to seamlessly retrieve the data from multiple nodes.
2. Replication — the ability to replicate the same data item across multiple cluster nodes and maintain consistency of the data as clients update it.
3. Fault tolerance — the ability to retain availability to data even when one or more nodes in the distributed storage cluster goes down.
4. Elastic scalability — enabling data users to receive more storage space if needed, and enabling storage system operators to scale the storage system up and down by adding or removing storage units to the cluster.
An inherent limitation of distributed storage systems is described by the CAP theorem. The theorem states that a distributed system cannot simultaneously guarantee Consistency, Availability and Partition tolerance (the ability to keep operating when the network splits the cluster into isolated groups of nodes). It has to give up at least one of these three properties, and since network partitions cannot realistically be avoided, the practical choice during a partition is between consistency and availability. Many distributed storage systems give up strong consistency while guaranteeing availability and partition tolerance.
Amazon S3: An Example of a Distributed Storage System
What is Amazon S3 and how does it work?
Amazon S3 is a distributed object storage system. In S3, objects consist of data and metadata. The metadata is a set of name-value pairs that provides information about the object, such as date last modified. S3 supports standard metadata fields and custom metadata defined by the user.
Objects are organized into buckets. Amazon S3 users need to create buckets and specify which bucket to store objects to, or retrieve objects from. Buckets are logical structures that allow users to organize their data. Behind the scenes, the actual data may be distributed across a large number of storage nodes across multiple Amazon Availability Zones (AZ) within the same region. An Amazon S3 bucket is always tied to a specific geographical region — for example, US-East 1 (North Virginia) — and objects cannot leave the region.
Each object in S3 is identified by a bucket, a key, and a version ID. The key is a unique identifier of each object within its bucket. If versioning is enabled on the bucket, S3 tracks multiple versions of each object, distinguished by the version ID.
Due to the CAP theorem, Amazon S3 provides high availability and partition tolerance, but cannot guarantee consistency. Instead, it offers an eventual consistency model:
1. When you PUT or DELETE data in S3, data is safely stored, but it may take time for the change to replicate across Amazon S3.
2. When a change occurs, clients reading the data immediately afterwards may still see an old version of the data, until the change is propagated.
3. S3 guarantees atomicity — when a client reads the object, they might view the old version of the object, or the new version, but never a corrupted or partial version.
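To make this concrete, here is a minimal sketch using the boto3 client; the bucket name and key are placeholders, and credentials and region are assumed to come from your local AWS configuration:

```python
# Minimal boto3 sketch: write an object, then read it back.
# "my-example-bucket" and the key are placeholders for illustration only.
import boto3

s3 = boto3.client("s3")

# PUT: the object is durably stored; under the eventual-consistency model described
# above, the change may take time to propagate to every replica.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/summary.txt",
    Body=b"hello distributed storage",
)

# GET: the object is addressed by bucket + key (and optionally a version ID).
# A reader sees either the old version or the new one, never a partial object.
response = s3.get_object(Bucket="my-example-bucket", Key="reports/summary.txt")
print(response["Body"].read())
```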
Why do we do this? (When do we use distributed storage?)
After all, with the amount of hardware involved, individual server failures are pretty much a statistical certainty. And we really do not want such a large-scale system to go down whenever a single server sneezes.
This contrasts heavily with what used to be more commonplace: redundancy of drives on a single server (such as RAID 1), which only protects data against hard-drive failure.
When done correctly, a distributed storage system helps the overall system survive downtime caused by a complete server failure, an entire server rack being thrown away, or even the extreme of a nuclear explosion taking out a data center.
A nuclear-proof database? Wouldn’t such a system be extremely slow?
Well, that depends… part of the reason why there are so many different distributed systems out there is that every system makes some sort of compromise that favors one attribute over another.
And one of the most common compromises for distributed systems is accepting the significant overheads involved in coordinating data across multiple servers.
A common example that many developers might have experienced: uploading a single large file to cloud storage can be somewhat slow. On the flip side, they can upload as many files concurrently as their wallet will allow, because the load is distributed across multiple servers.
So yes, a single transfer to a single server is “slow”. The “fast” in a multi-server system comes from parallelism, with replicas distributed among the servers.
How do these replicas get created in the first place?
When a piece of data is written to the system by a client program, replicas of it are synchronously created on various other nodes as part of the write.
When a piece of data is read, the client either gets the data from one of the replicas, or from multiple replicas with the final result decided by majority vote.
If any replica is found to be corrupted or missing due to a crash, it is removed from the system and a new replica is created and placed on another server, if one is available. This happens either on read or through a background checking process.
Deciding which data is valid or corrupted, and where each replica should live, is generally done by a “master node” or via a “majority vote”. Sometimes replica placement is simply decided “randomly”.
The most important recurring point to note, however, is that many operations require a majority vote among the replicas for the system to work.
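To make the majority-vote idea concrete, here is a minimal sketch of a quorum read; it is illustrative only and not any specific system’s protocol:

```python
# Minimal sketch of a majority-vote (quorum) read across replicas.
# Illustrative only -- real systems also deal with timeouts, versions/timestamps and repair.
from collections import Counter

def quorum_read(replica_values):
    """Return the value reported by a strict majority of replicas, or fail if there is no quorum."""
    needed = len(replica_values) // 2 + 1            # e.g. 3 replicas -> 2 must agree
    value, votes = Counter(replica_values).most_common(1)[0]
    if votes < needed:
        raise RuntimeError("no majority -- cannot serve a consistent read")
    return value

print(quorum_read(["v2", "v2", "v1"]))   # one stale replica is outvoted -> "v2"
```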
With so many redundancies in place, how does it still fail?
First off… that depends on your definition of failure…
Let’s look at the various definitions of failure from different perspectives:
One of the big benefits of a distributed system is that in many cases, when a single node or replica fails, a site reliability engineer (or sysadmin) will quietly replace the affected server behind the scenes without any users noticing.
Another angle is the definition of failure from a business perspective. Permanently losing all of your customer data is a lot worse than entering, let’s say, read-only mode, or even crashing the system.
And finally… Murphy’s law means that sometimes multiple nodes or segments of your network infrastructure will fail, and you will face situations where the majority vote is lost.
How is lost redundancy fixed then?
With cloud systems, this typically means replacing a replica with a new instance.
However, depending on one’s network infrastructure and data set size, time is a major obstacle.
For example, to replace a single 8TB node over a gigabit connection (or 800 megabits/second effectively), it would take approximately 22.2 hours, or 1 whole day rounded up. And that’s assuming optimal transfer rates that saturate the link. To keep the system up and running without noticeable slowdown, we may halve the transfer rate, doubling the time required to 2 whole days.
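The arithmetic behind that estimate, as a small sketch you can adapt to your own node size and link speed:

```python
# Back-of-the-envelope rebuild time for a replica, assuming a sustained effective rate.
node_size_bits = 8e12 * 8      # 8 TB expressed in bits
effective_rate = 800e6         # ~800 megabits/second of usable gigabit bandwidth

hours = node_size_bits / effective_rate / 3600
print(f"full-speed rebuild: ~{hours:.1f} hours")      # ~22.2 hours
print(f"throttled to half:  ~{hours * 2:.1f} hours")  # ~44.4 hours, about 2 whole days
```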
Naturally, more complex factors will come into play, such as 10-gigabit connections, or hard-drive speed.
However, the main point stays the same: replacing a damaged replica on a “large” storage node is hardly instantaneous.
And it’s sometimes during these long 48-hour windows that things get tense for the sysadmin. With a 3-replica configuration, there is no room for mistakes if the majority of 2 healthy nodes is to be maintained. With a 5-replica configuration, there is breathing room of 1.
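That “breathing room” is simply the number of healthy replicas beyond the required majority, as this small sketch shows:

```python
# How many additional replicas can fail while one is already lost and being rebuilt?
def breathing_room(replicas: int, already_lost: int = 1) -> int:
    majority = replicas // 2 + 1                     # replicas needed for a majority vote
    return (replicas - already_lost) - majority      # healthy replicas beyond that majority

print(breathing_room(3))   # 0 -> no room for mistakes during the rebuild
print(breathing_room(5))   # 1 -> one more failure can still be tolerated
```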
And when the majority vote is lost, one of three things will occur:
1. Split brain: you end up with a confused cluster holding two diverging views of the data.
2. Read-only mode: to prevent the system from ending up with 2 different datasets (and hence a split brain).
3. Hard system failure: some systems prefer inducing a hard failure rather than risking a split brain.
What is the split brain problem?
A split brain occurs when your cluster splits into 2 segments and your system starts seeing two different versions of the same data as the cluster goes out of sync.
This happens most commonly when a network failure causes half of the cluster to be isolated from the other half.
What happens subsequently, if data changes occur, is that one half of the system gets updated while the other half stays outdated.
This is “alright” until the isolated half comes back online, at which point its data may be out of sync with, or even diverged from, the other half of the cluster due to ongoing usage.
Both halves would start claiming they are the “real” data and vote against the “other half”. And as with any voting system that is stuck in a gridlock, no work gets done.
So, how do we prevent this in the first place?
Fortunately, most distributed systems, when configured properly, are designed to prevent split brain from happening in the first place. The safeguards typically come in 3 forms.
1. A master node which gets to make all final decisions (this, however, can make the master node a single point of failure; some systems fall back to electing a new master node if it goes down).
2. Hard system failure to prevent a split brain, until the cluster is fully synced back up; this ensures that no “outdated” data is ever shown.
3. Locking the system into read-only mode; the most common sign of this is certain nodes showing outdated data in read-only mode.
The latter is the most common approach for distributed systems, and was also seen in the recent GitHub downtime.
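As a rough sketch of that safeguard (illustrative only, not any specific system’s election or fencing protocol), a node can simply refuse writes unless it can still reach a majority of the cluster:

```python
# Minimal sketch of the "majority or read-only" safeguard against split brain.
# Illustrative only: `reachable_peers` would come from real heartbeats/health checks.
def can_accept_writes(cluster_size: int, reachable_peers: int) -> bool:
    """A node (counting itself) only accepts writes if it sits in the majority partition."""
    return (reachable_peers + 1) > cluster_size // 2

# A 6-node cluster split 3/3: neither half has a majority, so both drop to read-only.
print(can_accept_writes(cluster_size=6, reachable_peers=2))   # False -> read-only
# A 5-node cluster split 3/2: only the 3-node side keeps accepting writes.
print(can_accept_writes(cluster_size=5, reachable_peers=2))   # True
```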
How do I use one of these?
These days, most entry-level AWS or GCP instances use some form of “block storage” backend for their hard drives, which is itself a distributed storage system.
More famously, there is object storage such as S3 itself, and pretty much any cloud storage service.
Beyond cloud services, there are many open-source deployments that use distributed storage: pretty much every NoSQL database, and even Kubernetes itself, because it comes deployed with etcd, a distributed key-value store.
For notable specific applications, there are CockroachDB (SQL), GlusterFS (file storage), Elasticsearch (NoSQL), and Hadoop (big data). Even MySQL can be deployed in such a setup, known as group replication, or alternatively through Galera.
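As a small taste of using one of these directly, here is a sketch that talks to a local etcd node via the third-party python-etcd3 client; the host, port, and key are assumptions based on a default local install:

```python
# Minimal sketch: talking to etcd (the distributed key-value store behind Kubernetes)
# using the third-party python-etcd3 client. Host/port assume a default local etcd.
import etcd3

client = etcd3.client(host="localhost", port=2379)

client.put("/config/feature-flag", "enabled")        # the write is replicated across the cluster
value, metadata = client.get("/config/feature-flag")
print(value)                                          # b'enabled'
```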
Hadoop History
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch — the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept — storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided — the web crawler portion remained as Nutch and the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Why is Hadoop important?
1. Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
2. Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
3. Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
4. Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
5. Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
6. Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
What are the challenges of using Hadoop?
1. MapReduce programming is not a good match for all problems. It’s good for simple information requests and problems that can be divided into independent units, but it’s not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing (a minimal word-count sketch of this map-shuffle-reduce model follows after this list).
2. There’s a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That’s one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And, Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.
3. Data security. Another challenge centers around the fragmented data security issues, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.
4. Full-fledged data management and governance. Hadoop does not have easy-to-use, full-feature tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.
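To illustrate the map-shuffle/sort-reduce model referenced in the first challenge above, here is a minimal, self-contained word-count sketch in the spirit of MapReduce; it simulates the phases in plain Python and is not actual Hadoop code:

```python
# Minimal word-count sketch of the MapReduce model: the mapper and reducer only
# exchange data through sorted key/value pairs (the "shuffle/sort"), which is why
# iterative algorithms need repeated map-shuffle/sort-reduce passes.
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) pairs; the shuffle/sort phase groups them by word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    """Sum the counts for each word after the shuffle/sort."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

pairs = sorted(mapper(["big data is big", "data is distributed"]))   # the "shuffle/sort"
print(dict(reducer(pairs)))   # {'big': 2, 'data': 2, 'distributed': 1, 'is': 2}
```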