1 What is CoprHD?
The term Software Defined Storage (SDS) is often overused and may mean different things to different people.
CoprHD is a software platform for building a storage cloud. It is designed with two key goals in mind:
- Make an enterprise or a service provider datacenter full of storage resources (block arrays, filers, SAN and NAS switches) look and feel like a single massive virtual storage array, automating and hiding its internal structure and complexity.
- Extend a full set of essential provisioning operations to end-users. This can range from simple operations such a block volume creation, to fairly complex operations such as creating disaster-recovery-protected volumes in different data centers. A key tenet of this capability is to enable them to do it in a truly multi-user, multi-tenant fashion by providing tenant separation, access control, auditing, usage monitoring, and other features.
CoprHD addresses these goals in a vendor-agnostic and extensible way, supporting a variety storage devices from several vendors, without resorting to a "least common denominator" approach in regard to features offered by these devices.
CoprHD itself does not provide storage. It holds an inventory of all storage devices in the data center and understands their connectivity. It allows the storage administrator to group these resources into either:
- virtual pools, with specific performance and data protection characteristics or,
- virtual arrays, segmenting the data center infrastructure along lines of fault tolerance, network isolation, or tenant isolation.
Storage administrators have full control over which users or tenants can access these virtual pools and virtual arrays.
End-users use a single uniform and intuitive API to provision block volumes or file shares on virtual pools or virtual arrays in a fully-automated fashion.
CoprHD automates all hidden infrastructural tasks, such as finding optimal placement for a volume or necessary SAN fabric configuration. Also,
CoprHD adds intelligence of its own with features like the ability to apply metrics-based algorithms to port selection when exporting volumes. Finally,
CoprHD provisioning goes beyond creating volumes. It has the ability to perform complex orchestrations that cover the end-to-end provisioning process - from volume creation, to SAN configuration, to mounting the newly created volume on a host. And, it has built in support for environments that include VMware and Vblock technologies.
From the architectural perspective,
CoprHD is an operating system for a storage cloud.
- It provides a well-structured and fairly intuitive set of “system calls” for storage provisioning in the form of REST API.
- It is a multi-user, multi tenant system. As such, it enforces access control (authentication, authorization, and separation).
- The “system calls” have transactional semantics: they seem atomic from the user perspective, and when they fail on one of the operations, they either retry the failed operation or they roll back the partially-completed work.
- It is highly concurrent, allowing multiple "system calls" to execute in parallel, but internally enforcing fine-grained synchronization to ensure internal consistency.
- It implements its own distributed, redundant and fault-tolerant data store which holds all of the provisioning data.
- It allows common business logic to remain device-agnostic by using a set of “drivers” capable of interacting with a bunch of storage devices.
2 CoprHD from Inside Out
In early 2012, when we started working on this project, we made a few fundamental decisions that explain some architectural choices:
CoprHDmust be scalable and highly available from the very beginning (it is not an add-on feature).
- It must use Java on Linux - a good server platform with good support for concurrency and scalability - and be delivered as an ensemble of virtual appliances.
- We must use solid and popular open source components extensively to focus on the real issues without reinventing the wheel.
- We must avoid redundant technologies as much as possible, as adding new technology often adds new problems.
- The system console must be for troubleshooting only. All supported operations must have corresponding REST calls.
2.1 CoprHD Cluster
To provide high availability (HA),
CoprHD is a cluster comprising 3 or 5 identical nodes, operating in an active-active mode.
CorpHD itself is a cloud application, and the nodes do not share any resources and do not rely on any hardware support to implement HA. In fact, in a typical deployment, the nodes are simply virtual machines.
Each node runs the same collection of services. In general, the
CoprHD software is divided between services along functional boundaries. Here is the list of services:
Provides the authentication APIs
Provides all public APIs for storage provisioning and storage management
Performs asynchronous operations on storage devices and comprises all of the device-specific code
Offers automation services on top of the provisioning APIs
|syssvc||Provides various system management and monitoring interfaces, access to internal logs, etc.|
Offers distributed shared memory used for cluster coordination
|dbsvc||Comprises a distributed column-based database used as a persistence layer for all provisioning data|
In addition, by dividing
CoprHD software into a collection of specialized services we can:
- Separate stateless web services and provisioning logic from stateful distributed databases.
- Control the memory footprint of auxiliary services and reserve the majority of node memory to services that benefit from aggressive caching.
2.2 Cluster Network
CoprHD nodes communicate through a public IP network. Since a sizable amount of data is replicated to all nodes, we expect that it is a high-bandwidth and low-latency network (basically a Gigabit or 10 Gigabit fabric).
Each node is assigned an IP address on a common subnet. In addition, a
CoprHD cluster uses a virtual IP address (VIP) through which external clients can access
CoprHD REST APIs and the
CoprHD GUI. We use keepalived daemons (which in turn use the VRRP protocol) to ensure that at any point in time exactly one
CoprHD node handles network traffic addressed to a VIP. The keepalived daemons also monitor "live-ness" of nginx servers on each node. The role of nginx is two-fold:
- It acts as a so-called reverse proxy. The nginx servers listen to HTTPS requests arriving at port 4443, and, depending on the path component in the request’s URL, forward the request to the corresponding
- It also acts as an embedded load-balancer. The nginx servers forward the incoming requests not only to the web services on the same node, but to other nodes as well (the destination node is selected by hashing the source IP address of the request). Therefore, even though only one nginx server at a time handles all requests directed to VIP, all nodes participate in request processing.
Similarly, the nginx servers listen to port 443 and distribute the GUI sessions between all available portalsvc instances on all nodes.
2.3 Persistent Backend
As we mentioned before, most of the
CoprHD services are stateless - they do not persist any data. Aside from the system and service logs, all
CoprHD data is stored either in coordinatorsvc or in dbsvc. Both of these services maintain a full copy of data on each node.
The coordinatorsvc should be seen as a strongly-consistent, persistent shared memory for the small amount of data essential for coordinating operations in the
CoprHD cluster. Examples include service beacons (locators), distributed locks, distributed queues, etc. Internally, the coordinatorsvc uses Apache ZooKeeper.
CoprHD services use the coordinator client library to access data in coordinatorsvc. The coordinatorclient is a wrapper around Netflix Curator, a popular ZooKeeper client. The coordinatorclient library abstracts away the distributed nature of ZooKeeper and all of the configuration details, giving the programmer the perception of a local service magically mirrored on all nodes.
The dbsvc is a column-based database. It offers good scalability, high insert rate, and indexing, but only eventual consistency. Internally, the dbsvc is merely a wrapper around Apache Cassandra. However, to prevent temporary inconsistencies, we require quorum-based consensus both for reads and writes to dbsvc. Consequently, a cluster of 2N+1 nodes can tolerate only up to N node failures.
Other services access the dbsvc using the dbclient library. It is a wrapper for Netflix Astynax, a Cassandra client. Similarly to the coordinatorclient, it abstracts away complexity related to the distributed nature of Cassandra. Also, note that dbsvc relies on the coordinatorsvc.
All in all, users of coordinatorclient and dbclient
CoprHD services have the same view of persistent data regardless of the node on which they run. This way
CoprHD stateless services get an abstraction of crash-resistant distributed memory, and can share and hand off work items without explicit RPC calls. This implies, however, that the stateless services have to record all state changes in this persistent shared memory. Otherwise, in the event of a crash, the state would be lost.
Finally, note that since all the
CoprHD internal data, including its own configuration, is stored in the persistent backend, all local configuration files are either immutable or they are generated from the information stored in the databases during system boot.
3 From a Request to the Response
To understand better how this architecture works in practice, we will follow some simple REST calls through the system. In general, REST calls can be synchronous (if the response can be readily produced) or asynchronous (typically, when the request requires an external action which may take time to complete). An asynchronous call returns a URN that identifies the task or workflow. The caller has to poll for the status of the task.
3.1 Example 1: Synchronous REST Call: “GET /login”
CoprHD is a multiuser and multitenant system, most REST calls require an authentication token or an authentication cookie. To obtain such a token (or a cookie) the user has to perform “login” REST call.
CoprHD does not maintain its own user database. Instead it relies on external entity: an Active Directory or an LDAP server to verify user credentials and to map a user to a tenant.
- First, the GET request sent to https://<vip>:4443/login will be handled by nginx on node2, which owns the VIP.
- Since the path component of the URL starts with /login, the request belongs to authsvc. In our example, we assume that the hash of the external client IP address falls onto node1. Therefore nginx on node2 will forward the request to authsvc on node1.
- The authsvc will send the user’s credentials to the external Active Directory (AD) server. If the credentials are correct, the AD response will include the user’s tenant information.
- The authsvc will then generate a
CoprHDauthentication token and store the token with the user tenant info via dbsvc.
- Finally, the authsvc on node1 will send a response containing the authentication token back to nginx on node2, and nginx will forward it to the external client.
3.2 Example 2: Asynchronous REST Call: “POST /block/volumes”
Having an authentication token, we can issue a request to create a block volume:
- A POST request sent to https://<vip>:4443/block/volumes is handled by nginx on node2, which, in our example, owns the VIP.
- The nginx server forwards the request to apisvc on node1.
- Before even considering the request, apisvc has to extract the authentication token from the request header and retrieve the user information from dbsvc. Upon success apisvc knows the user identity and tenant association. This information will be relevant to determining volume placement.
- Then, apisvc analyzes information included in the request, and it determines the number of physical volumes that need to be created and their optimal placement with storage resources available to the user. For each volume, apisvc creates a volume descriptor in dbsvc (that is, in Cassandra).
- Since the physical volume creation can take quite some time, it will be performed asynchronously by controllersvc. To do so, apisvc creates an asynchronous task object in coordinatorsvc (that is, on ZooKeeper), and registers it in a relevant storage controller queue (also in coordinatorsvc).
- After handing off the remaining work to controllersvc, apisvc returns the task identifier to the external client. The client can use the task identifier to poll for the task status.
- In our example, controllersvc on node3 picks the request from the queue and orchestrates the volume creation, updating volume descriptors in dbsvc (Cassandra) as necessary.
- Once the volume creation completes, the controllersvc updates the task object in the coordinatorsvc (Zookeeper) and removes it from the queue.
3.3 Example 3: Accessing the UI to check volume creation status
Finally, we can try using the
CoprHD web-based GUI to monitor creation of the volume from Example 2:
- The web browser will connect to
- Similar to the previous examples, nginx on node2 will forward us to portalsvc on node1.
- Although portalsvc is embedded inside of
CoprHD, it uses the same REST APIs as all external clients. Thus, portalsvc will make a call to
- nginx on node2, which owns VIP, will forward it to apisvc (we assume that the hash of node1 IP address calls onto node2)
- apisvc will use the authentication cookie to retrieve the user's identity and tenant association.
- Finally, the apisvc will retrieve task status and pass it back to portalsvc, and portalsvc will generate a web page and pass it back to the user’s browser.