Ceph FS – a drop-in replacement for HDFS

Hadoop is a programming framework that supports the processing and storage of large data sets in a distributed computing environment. The Hadoop core includes the MapReduce analytics engine and a distributed file system known as the Hadoop Distributed File System (HDFS). HDFS has several weaknesses, which are listed as follows:

  • It had a single point of failure until the recent versions of HDFS

  • It isn't POSIX compliant

  • It stores three copies of data by default

  • It has a centralized name server resulting in scalability challenges

The Apache Hadoop project and other software vendors are working independently to fix these gaps in HDFS.

The Ceph community has done some development in this space and provides a filesystem plugin for Hadoop that aims to overcome the limitations of HDFS and can be used as a drop-in replacement for it. There are three requirements for using Ceph FS with Hadoop; they are as follows:

  • Running the Ceph cluster

  • Running the Hadoop cluster

  • Installing the Ceph FS Hadoop plugin

The implementation of Hadoop and HDFS is beyond the scope of this book; however, in this section, we will briefly discuss how Ceph FS can be used in conjunction with Hadoop. Hadoop clients can access Ceph FS through a Java-based plugin named hadoop-cephfs.jar. The following two files, the Ceph FS Java bindings and their native JNI library, are required to support Hadoop connectivity to Ceph FS:

  • libcephfs.jar: This file should be placed in /usr/share/java/, and its path should be added to HADOOP_CLASSPATH in the hadoop-env.sh file.

  • libcephfs_jni.so: This file should be placed in /usr/lib/hadoop/lib, and that directory should be added to the LD_LIBRARY_PATH environment variable. You should also create a symbolic link to it at /usr/lib/hadoop/lib/native/Linux-amd64-64/libcephfs_jni.so.
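The file placement described above can be checked on a Hadoop node with a short script. The following is only a minimal sketch in Python, not part of Ceph or Hadoop: the function name check_cephfs_bindings and its root and env parameters are hypothetical helpers introduced for illustration, and the paths are the ones given in the list above. It assumes that HADOOP_CLASSPATH and LD_LIBRARY_PATH are visible in the environment in which the script runs (for example, after sourcing hadoop-env.sh), and it only looks for exact path entries, so a wildcard classpath entry would be reported as missing.

```python
import os
from pathlib import Path

# Locations taken from the list above, written relative to the filesystem root.
JAR = Path("usr/share/java/libcephfs.jar")
JNI = Path("usr/lib/hadoop/lib/libcephfs_jni.so")
JNI_LINK = Path("usr/lib/hadoop/lib/native/Linux-amd64-64/libcephfs_jni.so")


def check_cephfs_bindings(root="/", env=None):
    """Return a list of problems found; an empty list means every check passed.

    root and env are hypothetical parameters that make the check easy to try
    against a scratch directory instead of the live system.
    """
    env = os.environ if env is None else env
    root = Path(root)
    jar, jni, link = root / JAR, root / JNI, root / JNI_LINK
    problems = []

    if not jar.is_file():
        problems.append(f"missing {jar}")
    if not jni.is_file():
        problems.append(f"missing {jni}")

    # The native library must also be reachable through the symbolic link.
    if not link.is_symlink():
        problems.append(f"{link} is not a symbolic link")
    elif link.resolve() != jni.resolve():
        problems.append(f"{link} does not point to {jni}")

    # HADOOP_CLASSPATH should name the jar; LD_LIBRARY_PATH its library directory.
    if str(jar) not in env.get("HADOOP_CLASSPATH", "").split(os.pathsep):
        problems.append(f"{jar} is not listed in HADOOP_CLASSPATH")
    if str(jni.parent) not in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        problems.append(f"{jni.parent} is not listed in LD_LIBRARY_PATH")

    return problems


if __name__ == "__main__":
    for problem in check_cephfs_bindings():
        print("WARNING:", problem)
```

Running the script on each Hadoop node prints one warning per missing file, broken link, or unset path entry, and prints nothing when the layout matches the description above.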

In addition to this, the native Ceph FS client must be installed on each node of the Hadoop cluster. For the latest information on using Ceph FS with Hadoop, please visit the official Ceph documentation at http://ceph.com/docs/master/cephfs/hadoop and the Ceph GitHub repository at https://github.com/ceph/cephfs-hadoop.