
Ceph uniquely delivers object, block, and file storage in one unified system. It is highly reliable, easy to manage, and free. Ceph is powerful enough to transform your company's IT infrastructure and to handle massive amounts of data. Ceph delivers extraordinary scalability: thousands of clients accessing petabytes, even exabytes, of data. A Ceph Node leverages commodity hardware and intelligent daemons, and a Ceph Storage Cluster accommodates large numbers of nodes, which communicate with each other to replicate and redistribute data dynamically.


The Ceph storage cluster

Ceph provides an infinitely scalable Ceph storage cluster based on RADOS; see the paper RADOS – A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.

A Ceph storage cluster consists of two types of daemons:

  • Ceph monitor
  • Ceph OSD Daemon

A Ceph monitor maintains the master copy of the cluster map. A cluster of monitors ensures high availability should one monitor fail. Storage cluster clients retrieve the latest copy of the cluster map from a Ceph monitor.

A Ceph OSD daemon checks its own state and the state of other OSDs and reports back to the monitors.

Storage cluster clients and individual Ceph OSD daemons use the CRUSH algorithm to compute data locations efficiently, instead of relying on a central lookup table. Ceph's high-level features include a native storage interface based on librados, and a number of service interfaces built on top of librados.

Data storage

The Ceph storage cluster receives data from Ceph clients – whether through a Ceph block device, Ceph object storage, the Ceph file system, or a custom implementation built on librados – and stores it as objects. Each object corresponds to a file in a filesystem, which is stored on an object storage device. Ceph OSD daemons handle the read/write operations on the storage devices.

Ceph OSDs store all data as objects in a flat namespace (i.e., there is no hierarchy of directories). An object has an identifier, binary data, and metadata consisting of a set of name/value pairs; the semantics of the metadata are entirely up to the Ceph client. For example, CephFS uses metadata to store file attributes such as the file owner, creation date, and last-modified date.
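The object model above can be sketched as a small data structure. This is an illustrative sketch only, not Ceph's on-disk format; the class and field names are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class RadosObject:
    """Illustrative model of a Ceph object: an identifier, opaque binary
    data, and metadata made of name/value pairs whose semantics are left
    entirely to the client."""
    oid: str                                    # unique across the whole cluster
    data: bytes = b""                           # opaque binary payload
    xattrs: dict = field(default_factory=dict)  # client-defined name/value metadata

obj = RadosObject(oid="john", data=b"hello")
obj.xattrs["owner"] = "paul"        # e.g. CephFS stores file ownership this way
print(obj.oid, obj.xattrs["owner"])
```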



An object ID is unique not just locally, but across the entire cluster.

Scalability and high availability

In a traditional architecture, clients talk to a centralized component (a gateway, middleware, API, facade, etc.) that acts as the single point of entry to a complex subsystem. This imposes a single point of failure and limits both performance and scalability (i.e., if the centralized component goes down, the whole system goes down).

Ceph eliminates the centralized gateway and enables clients to interact with Ceph OSD daemons directly. Ceph OSD daemons create object replicas on other Ceph nodes to ensure data safety and high availability; Ceph also uses a cluster of monitors to ensure high availability. To eliminate the central node, Ceph uses the CRUSH algorithm.

A brief introduction to CRUSH

Both Ceph clients and Ceph OSD daemons use the CRUSH algorithm to compute object location information, instead of depending on a central lookup table. CRUSH provides a better data management mechanism than older approaches, and it enables massive scale by cleanly distributing the work to all of the clients and OSD daemons in the cluster. CRUSH uses intelligent data replication to ensure resiliency, which is better suited to hyper-scale storage. The following sections describe how CRUSH works; for a detailed discussion, see the paper CRUSH – Controlled, Scalable, Decentralized Placement of Replicated Data.

The cluster map

Ceph depends on Ceph clients and OSD daemons having knowledge of the cluster topology, which is described by five maps collectively referred to as the "cluster map":

  1. Monitor Map: contains the cluster fsid, and the position, name, address, and port of each monitor. It also records the current epoch, when the map was created, and when it last changed. To view a monitor map, run ceph mon dump.
  2. OSD Map: contains the cluster fsid, when the map was created and last modified, a list of pools, the number of replicas, the number of placement groups, and a list of OSDs and their status (e.g., up, in). To view an OSD map, run ceph osd dump.
  3. PG Map: contains the PG version, its timestamp, the last OSD map epoch, the full ratios, and details on each placement group, such as the PG ID, the up set, the acting set, the state of the PG (e.g., active+clean), and data usage statistics for each pool.
  4. CRUSH Map: contains a list of storage devices, the failure-domain hierarchy (e.g., device, host, rack, row, room, and so on), and rules for traversing the hierarchy when storing data. To view a CRUSH map, run ceph osd getcrushmap -o {comp-crushmap-filename}; then decompile it with crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}; you can then view the decompiled map with cat or a text editor.
  5. MDS Map: contains the current MDS map epoch, when the map was created, and when it last changed. It also contains the pool for storing metadata, a list of metadata servers, and which metadata servers are up and in. To view an MDS map, run ceph mds dump.

Each map maintains a history of its operational state changes, and the Ceph monitors maintain the master copy of the cluster map, including the cluster members, their states, the changes, and the overall health of the Ceph storage cluster.

High availability monitors

Before Ceph clients can read or write data, they must first contact a Ceph monitor and obtain the most recent copy of the cluster map. A Ceph storage cluster can operate with a single monitor, but this introduces a single point of failure (i.e., if the monitor goes down, Ceph clients cannot read or write data).

For added reliability and fault tolerance, Ceph supports a cluster of monitors. Within a cluster of monitors, latency and other faults can cause one or more monitors to fall behind the current state of the cluster, so Ceph's monitor instances must reach consensus about the state of the cluster. Ceph always uses a majority of monitors (e.g., 1; 2 out of 3; 3 out of 5; 4 out of 6; etc.) and the Paxos algorithm to agree on the current state of the cluster.

For details on configuring monitors, see the monitor configuration reference.
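The majority rule above (1 of 1; 2 of 3; 3 of 5; 4 of 6) is simply a quorum count. A minimal sketch, with function names invented for illustration:

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest majority of a monitor cluster: Paxos needs more than half
    of the monitors to agree on the current state of the cluster."""
    return num_monitors // 2 + 1

def has_quorum(monitors_reachable: int, num_monitors: int) -> bool:
    return monitors_reachable >= quorum_size(num_monitors)

# The majorities from the text: 1 of 1, 2 of 3, 3 of 5, 4 of 6.
print([quorum_size(n) for n in (1, 3, 5, 6)])   # [1, 2, 3, 4]
print(has_quorum(2, 3), has_quorum(1, 3))       # True False
```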

High availability authentication

To identify users and protect against man-in-the-middle attacks, Ceph provides the cephx authentication system to authenticate users and daemons.



The cephx protocol does not address encryption of data in transport (e.g., SSL/TLS) or encryption at rest.

cephx uses shared secret keys for authentication, meaning both the client and the monitor cluster have a copy of the client's secret key. The protocol allows both parties to prove to each other that they have a copy of the key without actually revealing it: the cluster is sure the user possesses the secret key, and the user is sure the cluster has a copy of it.

A key scalability feature of Ceph is avoiding a centralized interface to the object store, which requires Ceph clients to interact with OSDs directly. Ceph protects data through the cephx authentication system, which also authenticates the users operating Ceph clients. The cephx protocol operates in a manner similar to Kerberos.

A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each monitor can authenticate users and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure similar to a Kerberos ticket that contains a session key for use in obtaining Ceph services. The session key is itself encrypted with the user's permanent secret key, so only that user can request services from the Ceph monitors. The client then uses the session key to request services from the monitors, and the monitors provide the client with a ticket that authenticates the client to the OSDs that actually handle the data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. Like Kerberos tickets, cephx tickets expire, so an attacker cannot use an expired ticket or a surreptitiously obtained session key. As long as the user's secret key is not divulged before it expires, this form of authentication prevents attackers with access to the communications medium from either creating bogus messages under another user's identity or altering another user's legitimate messages.

To use cephx, an administrator must set up users first. In the following diagram, the client.admin user invokes ceph auth get-or-create-key from the command line to generate a username and secret key. Ceph's auth subsystem generates the username and key, stores a copy with the monitors, and transmits the user's secret back to the client.admin user. This means that the client and the monitors share a secret key.



The client.admin user must provide the user ID and secret key to the new user in a secure manner.

To authenticate with a monitor, the client passes the username to the monitor. The monitor generates a session key and encrypts it with the secret key associated with that username, then transmits the encrypted ticket back to the client. The client decrypts the payload with the shared secret key to obtain the session key. The session key identifies the user for the current session. The client then requests a ticket on behalf of the user, signed by the session key; the monitor generates a ticket, encrypts it with the user's secret key, and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.

The cephx protocol authenticates ongoing communications between the client machine and the Ceph servers. Each message sent between a client and a server after the initial authentication is signed using a ticket that the monitors, OSDs, and metadata servers can verify with their shared secret.
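The idea of authenticating messages with a shared secret, without ever transmitting the secret itself, can be sketched as follows. Note that this is only an analogy: cephx actually uses encrypted tickets and session keys rather than the HMAC scheme shown here, and all names are invented for the example:

```python
import hashlib
import hmac
import os

# In cephx, the user and the monitor cluster each hold a copy of this key;
# the key itself is never sent over the wire.
shared_key = os.urandom(32)

def sign(key: bytes, message: bytes) -> bytes:
    """Sign a message with the shared secret (HMAC stands in for cephx's
    AES-based ticket machinery)."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(key, message), signature)

message = b"write object john to pg 4.58"
signature = sign(shared_key, message)
print(verify(shared_key, message, signature))           # holder of the key accepts
print(verify(shared_key, b"bogus message", signature))  # altered message rejected
```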

The protection offered by this authentication is between the Ceph client and the Ceph cluster hosts; it is not extended beyond the Ceph client. If a user accesses the Ceph client from a remote host, Ceph authentication does not apply to the connection between the user's host and the client host.

For configuration details, see the cephx configuration guide; for user management details, see the user management documentation.

Smart daemons enable hyperscale

In many clustered architectures, the primary purpose of cluster membership is so that a centralized interface knows which nodes it can access, and this central interface then provides services to the client through a two-level dispatch, which inevitably becomes an enormous bottleneck at the petabyte-to-exabyte scale.

Ceph eliminates this bottleneck: both Ceph's OSD daemons and its clients are cluster-aware. Like Ceph clients, each Ceph OSD daemon knows about the other OSD daemons in the cluster, which enables OSD daemons to interact directly with other OSD daemons and with monitors. Additionally, Ceph clients can interact directly with OSD daemons.

Because Ceph clients, monitors, and OSD daemons can interact with each other directly, OSD daemons can use the CPU and RAM of their local nodes to perform tasks that would otherwise bog down a central server. This design distributes the computing power and brings several benefits:

  1. OSDs service clients directly: since any network device has a limit on the number of concurrent connections it can support, a centralized system hits a low physical limit at scale. By enabling clients to contact OSD nodes directly, Ceph increases both performance and total system capacity while eliminating the single point of failure. Ceph clients can maintain a session with a particular OSD as needed, instead of with a central server.

  2. OSD membership and status: after joining a cluster, a Ceph OSD daemon reports its status continuously. At the lowest level, the OSD status is up or down, reflecting whether the daemon is running and able to service requests. If an OSD is down and in, it indicates that the OSD daemon may have failed; an OSD daemon that is not running (e.g., it has crashed) cannot notify the monitor itself that it is down. Ceph monitors can ping OSD daemons periodically to ensure that they are running, but Ceph also empowers OSD daemons to determine whether a neighboring OSD is down, update the cluster map, and report it to the monitors. This means the monitors can remain lightweight processes. For details, see Monitoring OSDs and Heartbeats.

  3. Data scrubbing: as part of maintaining data consistency and cleanliness, OSD daemons can scrub objects within placement groups. That is, Ceph OSD daemons can compare object metadata against the replica metadata stored on other OSDs to catch OSD bugs or filesystem errors (typically daily). OSDs can also perform deep scrubbing (typically weekly), comparing data in objects bit-for-bit, to find bad sectors on a drive that were not apparent in a light scrub. For details on configuring scrubbing, see Data Scrubbing.

  4. Replication: like Ceph clients, Ceph OSD daemons use the CRUSH algorithm, but they use it to compute where replicas of objects should be stored (and for rebalancing). In a typical write scenario, a client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool and placement group, and then looks at the CRUSH map to identify the primary OSD for that placement group.

    The client writes the object to the primary OSD of the target placement group. Then the primary OSD, with its own copy of the CRUSH map, identifies the secondary and tertiary OSDs for replication purposes, replicates the object to the appropriate placement groups on those OSDs (as many OSDs as there are replicas), and finally responds to the client once it has confirmed that the object was stored successfully.

With the ability to perform data replication, Ceph OSD daemons relieve Ceph clients of that duty while ensuring high data reliability and safety.
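The primary-copy write path described above can be sketched in a few lines. This is a toy model, not Ceph code; the acting set is given directly instead of being computed by CRUSH, and all names are invented:

```python
def replicate_write(obj_id, data, acting_set, storage):
    """Primary-copy replication: the first OSD in the acting set is the
    primary; it stores the object, copies it to each secondary, and only
    acknowledges once every replica is in place."""
    primary, *secondaries = acting_set
    storage.setdefault(primary, {})[obj_id] = data
    for osd in secondaries:                      # the primary fans the copies out
        storage.setdefault(osd, {})[obj_id] = data
    # acknowledge only once every replica holds the object
    return all(storage[osd][obj_id] == data for osd in acting_set)

cluster = {}
acked = replicate_write("john", b"payload", ["osd.1", "osd.2", "osd.3"], cluster)
print(acked, sorted(cluster))   # True ['osd.1', 'osd.2', 'osd.3']
```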

Dynamic cluster management

In the Scalability and High Availability section, we explained how Ceph uses CRUSH, cluster awareness, and intelligent OSD daemons to scale and maintain high availability. Key to Ceph's design is the autonomous, self-healing, intelligent OSD daemon. Let's take a deeper look at how CRUSH works to enable modern cloud storage infrastructures to place data, rebalance the cluster, and recover from faults dynamically.

About the storage pool

Ceph The storage system supports the concept of “pool”, which is a logical partition for storing objects.

Ceph clients retrieve a cluster map from a monitor and write objects to pools. The pool's size (the number of replicas), the CRUSH rule set, and the number of placement groups determine how Ceph places the data.

A storage pool sets at least the following parameters:

  • Ownership of/access to the objects;
  • The number of placement groups; and
  • The CRUSH rule set to use.

For details, see the documentation on adjusting pool settings.

Mapping PGs to OSDs

Each storage pool has a number of placement groups, and CRUSH maps them to OSDs dynamically. When a Ceph client wants to store objects, CRUSH maps each object to a placement group.

Mapping objects to placement groups creates a layer of indirection between the OSDs and the client. The Ceph cluster must be able to grow or shrink and rebalance dynamically. If the client "knew" which OSD had which object, that would create a tight coupling between the client and the OSD. Instead, the CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more OSDs, which allows Ceph to rebalance dynamically as OSD daemons and their underlying devices come and go. The following diagram depicts how CRUSH maps objects to placement groups, and placement groups to OSDs.

With a copy of the cluster map and the CRUSH algorithm, the client can compute exactly which OSD to use when reading or writing a particular object.

Computing PG ID

When a Ceph client binds to a monitor, it requests the latest copy of the cluster map. With the cluster map, the client knows about all of the monitors, OSDs, and metadata servers in the cluster. However, it knows nothing about object locations.

Object locations are computed.

The only inputs required by the client are the object ID and the pool. It's that simple: Ceph stores data in named pools (e.g., "liverpool"). When a client wants to store a named object (e.g., "john", "paul", "george", "ringo", etc.), it calculates a placement group using the object name, a hash code, the number of PGs in the pool, and the pool name. Ceph clients use the following steps to compute PG IDs:

  1. The client inputs the pool ID and the object ID (e.g., pool = "liverpool" and object-id = "john").
  2. Ceph takes the object ID and hashes it.
  3. Ceph calculates the hash modulo the number of PGs (e.g., 58) to get a PG ID.
  4. Ceph gets the pool ID given the pool name (e.g., "liverpool" = 4).
  5. Ceph prepends the pool ID to the PG ID (e.g., 4.58).

Computing object locations is much faster than querying for them. The CRUSH algorithm allows a client to compute where objects should be stored, and enables the client to contact the primary OSD to store or retrieve the objects.
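The five steps above can be sketched as follows. Note that Ceph itself uses the rjenkins hash and a stable variant of the modulo operation; the crc32 used here is only a stand-in, and the function name is invented for the example:

```python
import zlib

def compute_pg_id(pool_id: int, object_id: str, pg_num: int) -> str:
    """Hash the object ID, take the hash modulo the number of PGs,
    and prepend the pool ID (crc32 stands in for Ceph's rjenkins hash)."""
    pg = zlib.crc32(object_id.encode()) % pg_num
    return f"{pool_id}.{pg:x}"      # the PG part is conventionally shown in hex

# pool "liverpool" has ID 4; assume 128 PGs for this sketch
print(compute_pg_id(4, "john", 128))
```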

Peering and sets

In previous sections, we noted that OSD daemons check each other's heartbeats and report back to the monitors. Another thing OSD daemons do is "peering": the process of bringing all of the OSDs that store a placement group into agreement about the state of all of the objects (and their metadata) in that placement group. In fact, OSD daemons report peering failures to the monitors. Peering issues usually resolve themselves; however, if the problem persists, you may need to refer to the troubleshooting documentation on peering failures.



Agreeing on the state does not mean that the PGs hold the latest contents.

The Ceph storage cluster was designed to store at least two copies of an object (i.e., size = 2), which is the minimum requirement for data safety. For high availability, a Ceph storage cluster should store more than two copies of an object (e.g., size = 3 and min_size = 2), so that it can continue to run in a degraded state while maintaining data safety.

Referring back to the diagram in Smart daemons enable hyperscale, we did not name the OSD daemons specifically (e.g., osd.0, osd.1, etc.), but rather referred to them as primary, secondary, and so forth. By convention, the primary OSD is the first OSD in the acting set, and it is responsible for coordinating the peering process for each placement group where it acts as the primary. It is the ONLY OSD that accepts client-initiated writes to objects in a placement group where it acts as the primary.

When a series of OSDs is responsible for a placement group, that series of OSDs is referred to as an acting set. An acting set may refer to the OSD daemons currently responsible for the placement group, or to the OSD daemons that were responsible for a particular placement group as of some epoch.

The OSD daemons that are part of an acting set may not always be up. When an OSD in the acting set is up, it is part of the up set. The up set is an important distinction, because Ceph can remap PGs to other OSDs when an OSD fails.



In an acting set for a PG containing osd.25, osd.32 and osd.61, the first OSD, osd.25, is the primary. If it fails, the secondary, osd.32, becomes the primary, and osd.25 is removed from the up set.
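The relationship between the acting set and the up set in the example above can be expressed directly. A minimal sketch, with invented names:

```python
def up_set(acting_set, osd_state):
    """The up set is simply the members of the acting set whose OSDs are
    currently up; the first of them acts as the primary."""
    return [osd for osd in acting_set if osd_state.get(osd) == "up"]

state = {"osd.25": "down", "osd.32": "up", "osd.61": "up"}
acting = ["osd.25", "osd.32", "osd.61"]

print(up_set(acting, state))      # ['osd.32', 'osd.61'] -- osd.25 dropped out
print(up_set(acting, state)[0])   # osd.32 now acts as the primary
```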


When you add an OSD daemon to a Ceph storage cluster, the cluster map gets updated with the new OSD. Referring back to Computing PG ID, this changes the cluster map and therefore changes object placement, because the inputs to the calculation change. The following diagram depicts the rebalancing process (albeit rather crudely, since the change is substantially smaller in a large cluster), where some, but not all, of the PGs migrate from the existing OSDs (OSD 1 and OSD 2) to the new OSD (OSD 3). Even while rebalancing, CRUSH is stable: many of the placement groups remain in their original configuration, and each OSD gives up some capacity, so there is no sudden load spike on the new OSD after rebalancing is complete.

Data consistency

As part of maintaining data consistency and cleanliness, OSDs can also scrub objects within placement groups. That is, OSDs compare the metadata of each object's replicas held on different OSDs. Scrubbing (usually performed daily) catches OSD bugs and filesystem errors. OSDs can also perform deep scrubbing, comparing data in objects bit-for-bit; deep scrubbing (usually performed weekly) finds bad sectors on a drive that were not apparent in a light scrub.

For details on configuring scrubbing, see Data Scrubbing.

Erasure coding

An erasure-coded pool stores each object as K+M chunks: K data chunks and M coding chunks. The pool is configured to have a size of K+M, so that each chunk is stored on an OSD in the acting set. The rank of each chunk is also stored as an attribute of the object.

For example, an erasure-coded pool can be created to use five OSDs (K+M = 5) and sustain the loss of two of them (M = 2).

Reading and writing encoded chunks

When an object NYAN containing ABCDEFGHI is written to the pool, the erasure-encoding function splits the content into three data chunks simply by dividing the content in three: the first contains ABC, the second DEF, and the last GHI. The content is padded if its length is not a multiple of K. The function also creates two coding chunks: the fourth with YXY and the fifth with GQC. Each chunk is stored on an OSD in the acting set. The chunks are stored in objects that have the same name (NYAN) but reside on different OSDs. The order in which the chunks were created must be preserved, and is stored as an attribute of the object (shard_t) in addition to its name. Chunk 1, containing ABC, is stored on OSD5, while chunk 4, containing YXY, is stored on OSD3.

When the object NYAN is read from the erasure-coded pool, the decoding function reads three chunks: chunk 1 containing ABC, chunk 3 containing GHI, and chunk 4 containing YXY, and rebuilds the original content of the object, ABCDEFGHI. The decoding function is informed that chunks 2 and 5 are missing (they are called "erasures"); chunk 5 could not be read because OSD4 is out. The decoding function can be invoked as soon as three chunks are read: OSD2 was the slowest, so its data was not taken into account.
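The K = 3 split above can be sketched with the simplest possible erasure code: XOR parity, which corresponds to M = 1 (Ceph's erasure-code plugins, such as jerasure, support larger M via Reed-Solomon-style codes). All names here are illustrative:

```python
from functools import reduce
from operator import xor

def xor_bytes(parts):
    """XOR byte strings of equal length, column by column."""
    return bytes(reduce(xor, column) for column in zip(*parts))

def encode(payload: bytes, k: int = 3):
    """Split the payload into K data chunks (padded with NULs if the length
    is not a multiple of K) plus one XOR coding chunk (M = 1)."""
    if len(payload) % k:
        payload += b"\0" * (k - len(payload) % k)
    size = len(payload) // k
    chunks = [payload[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [xor_bytes(chunks)]

def decode(chunks, lost: int):
    """Rebuild a single erased chunk by XOR-ing all surviving chunks."""
    rebuilt = xor_bytes([c for i, c in enumerate(chunks) if i != lost])
    return chunks[:lost] + [rebuilt] + chunks[lost + 1:]

chunks = encode(b"ABCDEFGHI")         # -> [b"ABC", b"DEF", b"GHI", parity]
restored = decode(chunks, lost=1)     # pretend the OSD holding b"DEF" is out
print(chunks[:3], restored[1])        # [b'ABC', b'DEF', b'GHI'] b'DEF'
```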

Interrupted full writes

In an erasure-coded pool, the primary OSD in the up set receives all write operations. It is responsible for encoding the payload into K+M chunks and sending them to the other OSDs. It is also responsible for maintaining an authoritative version of the placement group log.

In the following diagram, an erasure-coded placement group has been created with K = 2 + M = 1, supported by three OSDs: two storing K and one storing M. The acting set of the placement group is made up of OSD 1, OSD 2, and OSD 3. An object has been encoded and stored on the OSDs: the chunk D1v1 (i.e., data chunk number 1, version 1) is on OSD 1, D2v1 on OSD 2, and C1v1 (i.e., coding chunk number 1, version 1) on OSD 3. The placement group log on each OSD is identical (i.e., 1,1: epoch 1, version 1).

OSD 1 is the primary and receives a WRITE FULL from a client, which means the payload is to replace the object entirely instead of overwriting a portion of it. Version 2 (v2) of the object is created to override version 1 (v1). OSD 1 encodes the payload into three chunks: D1v2 (i.e., data chunk number 1, version 2) will be on OSD 1, D2v2 on OSD 2, and C1v2 (i.e., coding chunk number 1, version 2) on OSD 3. Each chunk is sent to the target OSD, including the primary OSD, which, in addition to storing its chunk, is also responsible for handling the write operation and maintaining the authoritative version of the placement group log. When an OSD receives the message instructing it to write a chunk, it also creates a new entry in its placement group log to reflect the change. For instance, as soon as OSD 3 stores C1v2, it adds the entry 1,2 (i.e., epoch 1, version 2) to its log. Because the OSDs work asynchronously, some chunks may still be in flight (such as D2v2) while others have already been acknowledged and persisted to disk (such as C1v1 and D1v1).

If all goes well, the chunks are acknowledged on each OSD in the acting set, and the last_complete pointer of the log moves from 1,1 to 1,2.

Finally, the files used to store the chunks of the previous version of the object can be removed: D1v1 on OSD 1, D2v1 on OSD 2, and C1v1 on OSD 3.

But accidents happen. If OSD 1 goes down while D2v2 is still in flight, the object's version 2 is only partially written: OSD 3 has one chunk, but that is not enough to recover. Two chunks are lost, D1v2 and D2v2, and the erasure-coding parameters K = 2, M = 1 require that at least two chunks be available to rebuild the third. OSD 4 becomes the new primary and finds that the last_complete log entry (i.e., the entry before which all objects were known to be available on all OSDs in the previous acting set) is 1,1, and that becomes the head of the new authoritative log.

The log entry 1,2 found on OSD 3 is divergent from the new authoritative log provided by OSD 4: it is ignored, and the file containing the C1v2 chunk is removed. The D1v1 chunk is rebuilt with the decode function of the erasure-coding library during scrubbing and stored on the new primary, OSD 4.

For additional details, see the erasure-code notes.

Cache tiering

A cache tier provides Ceph clients with better I/O performance for a subset of hot data stored in a backing storage tier. Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g., solid-state drives) configured to act as a cache tier, and a backing pool of either erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier. The Ceph objecter decides where to place the objects, and the tiering agent determines when to flush objects from the cache tier back to the backing storage tier. So the cache tier and the backing storage tier are completely transparent to Ceph clients.

For details, see the cache tiering documentation.

Extending Ceph

You can extend Ceph by creating shared object classes called "Ceph classes". Ceph dynamically loads .so class files stored in the osd class dir directory (i.e., $libdir/rados-classes by default). When you implement a class, you can create new object methods that have the ability to call the native methods in the Ceph object store, or other class methods incorporated via libraries or created yourself.

On writes, Ceph classes can call native or class methods, perform any series of operations on the inbound data, and generate a resulting write transaction that Ceph applies atomically.

On reads, Ceph classes can call native or class methods, perform any series of operations on the outbound data, and return the data to the client.

A Ceph class example

A Ceph class for a content management system that presents pictures of a particular size and aspect ratio could take an inbound bitmap image, crop it to a particular aspect ratio, resize it, and embed an invisible copyright or watermark to help protect the intellectual property; it could then save the resulting bitmap image to the object store.

Typical implementations can be found in src/objclass/objclass.h, src/, and src/barclass.


A Ceph storage cluster is dynamic, like a living organism. Although many storage appliances do not fully utilize the CPU and RAM of a typical commodity server, Ceph does. From heartbeats to peering, to rebalancing the cluster, to recovering from faults, Ceph offloads work from clients (and from a centralized gateway, which does not exist in the Ceph architecture) and uses the computing power of the OSDs to perform the work. When you consult the hardware recommendations and the network configuration reference alongside the concepts above, it is easy to see how Ceph utilizes its computing resources.

The Ceph protocol

Ceph clients use a native protocol to interact with the Ceph storage cluster. Ceph packages this functionality into the librados library, so that you can create your own custom Ceph clients. The following diagram depicts the basic architecture.

The native protocol and librados

Modern applications need a simple object storage interface with asynchronous communication capability. The Ceph storage cluster provides exactly that: a simple object storage interface with asynchronous communication, offering direct, parallel access to objects throughout the cluster, including:

  • Pool operations;
  • Snapshots and copy-on-write cloning;
  • Read/write objects (create or remove; entire object or byte range; append or truncate);
  • Create/set/get/remove extended attributes;
  • Create/set/get/remove key/value pairs;
  • Compound operations and dual-ack semantics;
  • Object classes.

Object watch/notify

A client can register a persistent interest in an object and keep a session to the primary OSD open. The client can send a notification message and payload to all watchers of the object and collect the responses from the watchers. This capability enables a client to use any object as a synchronization/communication channel.

Data striping

Storage devices have throughput limitations, which impact performance and scalability, so storage systems generally support striping – storing sequential pieces of information across multiple storage devices – to increase throughput and performance. The most common form of data striping comes from RAID; the RAID type most similar to Ceph's striping is RAID 0, or a "striped volume". Ceph's striping offers the throughput of RAID 0 striping, the reliability of n-way RAID mirroring, and faster recovery.

Ceph provides three types of clients: block device, file system, and object storage. A Ceph client converts the data format it presents to its users (a block device image, RESTful objects, CephFS filesystem directories) into objects that can be stored in the Ceph storage cluster.



The objects Ceph stores in the Ceph storage cluster are not themselves striped. Ceph object storage, the Ceph block device, and the Ceph file system stripe their data over multiple Ceph storage cluster objects. Ceph clients that write directly to the Ceph storage cluster via librados must perform the striping (and parallel I/O) themselves to obtain these benefits.

The simplest form of Ceph striping is a stripe of one object: the Ceph client writes stripe units to a Ceph storage cluster object until the object reaches its maximum capacity, and then creates another object for the additional stripes of data. This simplest form of striping may be sufficient for small block device images, S3 or Swift objects, or CephFS files; however, it does not take maximum advantage of Ceph's ability to distribute data across placement groups, and consequently does not maximize performance. The following diagram depicts the simplest form of striping:

If you anticipate large image sizes, large S3 or Swift objects (e.g., video), or large CephFS directories, you may see considerable read/write performance improvements by striping client data over multiple objects within an object set. Significant write performance occurs when the client writes the stripe units to their corresponding objects in parallel: because the objects are mapped to different placement groups, and further to different OSDs, each write occurs in parallel at the maximum write speed. A write to a single disk is limited by head movement (e.g., a 6ms seek time) and the bandwidth of that one device (e.g., 100MB/s). By spreading a write over multiple objects (which map to different placement groups and OSDs), Ceph reduces the number of seeks per drive and combines the throughput of multiple drives to achieve much faster write (or read) speeds.



Striping is independent of object replication. Since CRUSH replicates objects across OSDs, stripes get replicated automatically.

In the following figure, client data is striped across an object set (object set 1 in the figure) containing four objects, where the first stripe unit is stripe unit 0 in object 0, and the fourth stripe unit is stripe unit 3 in object 3. After writing the fourth stripe, the client checks whether the object set is full. If it is not, the client writes the next stripe starting again from the first object (object 0); if the object set is full, the client creates a new object set (object set 2) and writes the first stripe (stripe unit 16) to the first object of the new set (object 4).

Three important variables determine how Ceph can strip data:

  • Object size: objects in the Ceph storage cluster have a configurable maximum size (e.g. 2 MB, 4 MB). The object size must be large enough to hold many stripe units, and should be an integer multiple of the stripe unit size.
  • Stripe width: stripes have a configurable unit size (e.g. 64 KB). The Ceph client divides the data to be written into stripe-unit-sized pieces, except for the last one. The stripe width should be a fraction of the object size so that an object can contain many stripe units.
  • Stripe count: the Ceph client writes a sequence of stripe units over a series of objects determined by the stripe count; that series of objects is called an object set. After the client writes to the last object in the object set, it returns to the first one.
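The interaction of these three variables can be sketched as a layout function: given a byte offset into a striped image, compute which object set, which object, and which stripe unit within that object the data lands in. This is an illustrative reconstruction, not Ceph's actual layout code:

```python
def locate(offset, object_size=4 * 2**20, stripe_unit=64 * 2**10, stripe_count=4):
    """Map a byte offset in a striped image to
    (object set, object index within the image, stripe unit within that object).
    Illustrative only; the real layout logic lives inside the Ceph clients."""
    units_per_object = object_size // stripe_unit
    set_size = object_size * stripe_count        # bytes held by one object set
    unit_index = offset // stripe_unit           # global stripe unit number
    object_set = offset // set_size
    # stripe units rotate round-robin across the stripe_count objects of a set
    object_in_set = unit_index % stripe_count
    obj = object_set * stripe_count + object_in_set
    unit_in_object = (unit_index // stripe_count) % units_per_object
    return object_set, obj, unit_in_object
```

With an object size of four stripe units and a stripe count of 4 (as in the figure above), stripe units 0 through 3 land in objects 0 through 3 of object set 1, and stripe unit 16 is the first unit of object 4 in object set 2.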



Before you put the cluster into production, test the performance of your striping configuration, because these parameters must not be changed after data has been striped to objects.

After the Ceph client stripes data into stripe units and maps them to objects, the CRUSH algorithm maps the objects to placement groups and the placement groups to OSDs, and the objects are then stored as files on disk.
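The two mapping steps can be conveyed with a toy placement function: object name → placement group via a stable hash, then placement group → an ordered list of OSDs. The second step stands in for CRUSH, which is far more sophisticated (hierarchy- and weight-aware); treat this purely as illustration:

```python
import hashlib

def place(object_name, pg_num=128, osds=(0, 1, 2, 3, 4, 5), replicas=3):
    """Toy model of Ceph's two placement steps.
    Step 1: hash the object name into one of pg_num placement groups.
    Step 2: deterministically pick `replicas` distinct OSDs for that PG
    (a stand-in for CRUSH; the real algorithm respects failure domains
    and device weights)."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    pg = h % pg_num
    order = sorted(osds, key=lambda o: hashlib.md5(f"{pg}.{o}".encode()).digest())
    return pg, order[:replicas]
```

The essential property both steps share with CRUSH is that placement is computed, not looked up: any client with the cluster map can derive the same PG and OSD list without consulting a central table.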



Because a client writes to a single storage pool, all the data it stripes into objects is mapped to placement groups within that same pool, so they use the same CRUSH map and the same access controls.

CEPH Client

Ceph clients include several service interfaces:

  • Block device: the Ceph Block Device (also called RBD) service provides resizable, thin-provisioned block devices with snapshotting and cloning. For high performance, Ceph stripes a block device across the entire cluster. Ceph supports both kernel objects (KO) and a QEMU hypervisor that uses librbd directly, avoiding kernel-object overhead for virtualized systems.
  • Object storage: the Ceph Object Storage (also called RGW) service provides RESTful APIs compatible with Amazon S3 and OpenStack Swift.
  • File system: the Ceph File System (CephFS) service provides a POSIX-compliant file system that can be mounted directly with mount or as a filesystem in user space (FUSE).

Ceph can run additional OSD, MDS, and monitor instances for scalability and high availability. The following diagram depicts the high-level architecture.

CEPH Object storage

The Ceph object storage daemon, radosgw, is a FastCGI service that provides a RESTful HTTP API for storing objects and metadata. It sits on top of the Ceph storage cluster with its own data format, and maintains its own user database, authentication, and access control. The RADOS Gateway uses a unified namespace, which means you can use either the OpenStack Swift-compatible API or the Amazon S3-compatible API; for example, you can write data with one program through the S3-compatible API and then read it with another program through the Swift-compatible API.

S3/Swift Object and storage cluster object comparison

Ceph Object Storage uses the term object to describe the data it stores. S3 and Swift objects are not the same thing as the objects Ceph writes to the storage cluster: objects stored through the Ceph object storage system are mapped onto Ceph storage cluster objects, but S3 and Swift objects are not mapped 1:1; a single S3 or Swift object may be mapped to multiple Ceph objects.
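One way to picture this non-1:1 mapping: a large S3/Swift object can be backed by a "head" object (holding the first bytes plus metadata) and a series of "tail" objects for the rest. The split sizes below are illustrative assumptions, not RGW's exact on-disk layout:

```python
import math

def backing_object_count(size, head_max=512 * 1024, tail_chunk=4 * 2**20):
    """How many RADOS objects might back one S3/Swift object, assuming a
    head object capped at head_max bytes and fixed-size tail chunks.
    Sizes and the head/tail split are illustrative, not RGW's real layout."""
    head = min(size, head_max)
    tails = math.ceil((size - head) / tail_chunk) if size > head else 0
    return 1 + tails

# A tiny S3 object fits in one cluster object; a 10 MB object does not.
```

The point is only that the object-store client, like the block and file clients, stripes user-visible data over multiple storage cluster objects.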

For more details, see Ceph Object Storage.

CEPH Block device

A Ceph block device stripes a device image over multiple objects within the cluster; each object maps to a placement group and is distributed, and these placement groups are spread across ceph-osd daemons throughout the cluster.
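Concretely, each fixed-size region of the image is backed by one named cluster object, so finding the object for a given byte offset is simple arithmetic. The `rbd_data.<id>.<index>` naming below mirrors the modern RBD format, but treat the details as illustrative:

```python
def rbd_object_name(image_prefix, offset, object_size=4 * 2**20):
    """Which backing object holds a given byte offset of an RBD image,
    assuming 4 MB objects and 'rbd_data.<id>.<index>' naming.
    Illustrative sketch; exact naming details belong to RBD internals."""
    index = offset // object_size
    return f"rbd_data.{image_prefix}.{index:016x}"
```

Because each such object lands (via CRUSH) on a different set of OSDs, I/O against different regions of the image proceeds against different disks in parallel.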



Striping lets an RBD block device deliver better performance than a single server could.

Thin-provisioned, snapshot-capable Ceph block devices are very attractive for virtualization and cloud computing. In virtual machine scenarios, people typically deploy Ceph block devices with the rbd network storage driver in QEMU/KVM, where the host uses librbd to provide block device services to guests; many cloud computing stacks use libvirt to integrate with the hypervisor. You can combine thin-provisioned Ceph block devices with QEMU and libvirt to support OpenStack and CloudStack as a complete solution.

Currently librbd does not support other hypervisors, but you can also expose Ceph block devices to clients via kernel objects. Other virtualization technologies, such as Xen, can access a Ceph block device kernel object, configured with the command-line tool rbd.

CEPH file system

The Ceph File System (CephFS) provides a POSIX-compliant file system service layered on top of the object-based Ceph storageage cluster; its files are mapped to objects in the Ceph storage cluster. Clients can mount the file system via a kernel object or as a filesystem in user space (FUSE).

The Ceph file system service includes metadata servers (MDS) deployed alongside the Ceph storage cluster. The purpose of the MDS (so named after the ceph-mds daemon) is to store all file system metadata (directories, file ownership, access modes, and so on) on highly reliable metadata servers, so that simple file system operations like listing a directory (ls) or entering one (cd) do not needlessly burden the OSDs. Separating metadata from data in this way lets the Ceph file system deliver high-performance service while reducing storage cluster load.

CephFS separates metadata from data: metadata is stored on the MDS, while file data is stored in one or more objects in the storage cluster. Ceph strives for POSIX compatibility. ceph-mds can run as a single process, or be distributed across multiple physical machines for high availability or scalability.

  • High availability: redundant ceph-mds instances can be in standby state, ready to take over for any failed active ceph-mds. This is straightforward because all data, including the journal, is stored on RADOS, and the failover is triggered automatically by ceph-mon.
  • Scalability: multiple ceph-mds instances can be active at the same time; they split the directory tree into subtrees (and shard single hot directories), effectively balancing load across all active servers.
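The scalability idea above can be conveyed with a toy partitioning function that assigns directory subtrees to active MDS ranks. Real ceph-mds migrates subtrees dynamically based on observed load; this static hash only conveys the concept:

```python
import hashlib

def mds_for_subtree(path, active_ranks=3):
    """Toy illustration of spreading directory subtrees over several
    active MDS ranks by hashing the top-level directory. Real ceph-mds
    rebalances subtrees dynamically; this static scheme is only a sketch."""
    top = path.strip("/").split("/")[0] if path.strip("/") else ""
    h = int(hashlib.md5(top.encode()).hexdigest(), 16)
    return h % active_ranks
```

Every path under the same top-level directory resolves to the same rank, so metadata operations within one subtree stay on one server while different subtrees spread across the active set.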



Translator's note: although the documentation says so, this is not recommended in practice. MDS stability is not yet ideal, and multiple active MDSes are far from stable. Even so, you should configure several standby MDSes first.

Standby and active MDSes can be combined; for example, you can run three active ceph-mds instances for scalability, plus one standby instance for high availability.

Link of this Article: Ceph architecture
