Scaling & Efficiency of a Hadoop Cluster

Sibasish Brahma
Nov 12, 2018
Hadoop Optimization

Efficiently scaling HDFS/Hadoop while keeping performance intact is a difficult task that architects and developers have long faced. Several approaches are used to tackle scaling and efficiency problems, such as:

  1. View File System (ViewFs)
  2. Frequent HDFS version upgrades
  3. NameNode garbage collection tuning
  4. Limiting the number of small files generated
  5. HDFS load management service

Scaling reliably without slowing down data analysis for thousands of users issuing millions of queries simultaneously is overwhelming work. If the HDFS infrastructure is overloaded, queries accumulate and completion times suffer. A data lake is generally used for extract, transform, and load (ETL) work, and all of these activities happen on a single cluster to reduce replication latency. The dual duties of such clusters result in the generation of small files to accommodate frequent writes and updates, which further clogs the queue. On top of these challenges, multiple teams require a large percentage of the stored data, making it impossible to split clusters by use case or organization without duplication, which decreases efficiency while increasing cost.

The root cause of these slowdowns, and the main bottleneck to scaling HDFS without compromising performance, is the throughput of the NameNode, which holds the directory tree of all files in the system and tracks where data files are kept. Since all metadata is stored in the NameNode, every client request to an HDFS cluster must first pass through it. To complicate matters further, a single read/write lock on the NameNode namespace limits the maximum throughput the NameNode can support, because any write request holds the write lock exclusively and forces all other requests to wait in the queue.
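
To see why this hurts under write-heavy load, consider a minimal sketch of the same pattern using java.util.concurrent's ReentrantReadWriteLock, the primitive the NameNode's namespace lock is built on. This is an illustration of the locking pattern only, not NameNode source code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration only: many concurrent reads can share the namespace lock,
// but a single write request excludes everything else, so heavy write
// traffic serializes the whole request stream behind the write lock.
public class NamespaceLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair ordering

    public String getFileInfo(String path) {
        lock.readLock().lock();              // many readers may hold this at once
        try {
            return "metadata for " + path;   // e.g. getFileInfo, listStatus
        } finally {
            lock.readLock().unlock();
        }
    }

    public void createFile(String path) {
        lock.writeLock().lock();             // exclusive: blocks all readers and writers
        try {
            // mutate the in-memory directory tree, e.g. create, mkdir, delete
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Many reads can proceed in parallel, but every metadata mutation serializes the entire request stream behind the write lock.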

Organizations experience high NameNode remote procedure call (RPC) queue times as a result. At times the NameNode queue time climbs to 500–600 ms per request.
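
Queue time is exposed through the NameNode's RPC metrics. A quick way to spot-check it is to query the NameNode's built-in /jmx servlet for the RpcActivity bean of the client RPC port (the hostname and ports below are placeholders for your cluster):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Dumps the NameNode RPC metrics (RpcQueueTimeAvgTime, RpcProcessingTimeAvgTime, ...)
// from the built-in /jmx servlet. Host and ports are cluster-specific placeholders.
public class RpcQueueTimeCheck {
    public static void main(String[] args) throws Exception {
        String url = "http://namenode.example.com:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Print only the queue/processing time entries from the JSON response
                if (line.contains("RpcQueueTime") || line.contains("RpcProcessingTime")) {
                    System.out.println(line.trim());
                }
            }
        }
    }
}
```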

Scaling & Performance Improvements:

Scaling out using ViewFs: use the View File System to split the existing HDFS into multiple physical namespaces, and use ViewFs mount points to present a single virtual namespace to users.
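
As a sketch of what this looks like in practice, the ViewFs mount table can be declared in core-site.xml or built programmatically through the Hadoop Configuration API; the mount table name, paths, and NameNode addresses below are assumptions, not a prescription:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.viewfs.ConfigUtil;

// Sketch: one virtual namespace ("clusterX") stitched together from two
// physical HDFS namespaces. Clients keep using /user and /tmp paths while
// the underlying NameNodes are split.
public class ViewFsMountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "viewfs://clusterX/");

        // Equivalent to fs.viewfs.mounttable.clusterX.link./user etc. in core-site.xml
        ConfigUtil.addLink(conf, "clusterX", "/user", new URI("hdfs://nn-main.example.com:8020/user"));
        ConfigUtil.addLink(conf, "clusterX", "/tmp",  new URI("hdfs://nn-tmp.example.com:8020/tmp"));

        FileSystem viewFs = FileSystem.get(conf);
        System.out.println(viewFs.getFileStatus(new Path("/user")));  // resolved via the mount table
    }
}
```

Because clients only ever see the viewfs:// namespace, directories can later be moved between physical clusters by editing the mount table rather than the applications.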

Separate HBase onto its own HDFS cluster, away from the main cluster that serves YARN and query workloads. This greatly reduces the load on the main cluster, makes HBase more stable, and cuts HBase cluster restart time from hours to minutes.
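
In configuration terms this largely comes down to pointing HBase's root directory at the dedicated cluster's NameNode, normally in hbase-site.xml; a minimal sketch with placeholder hostnames:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: HBase stores its data on a dedicated HDFS cluster instead of the
// main analytics cluster. Only the NameNode address changes from HBase's
// point of view; region servers and clients are otherwise unaffected.
public class HBaseOnDedicatedHdfs {
    public static Configuration hbaseConf() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://nn-hbase.example.com:8020/hbase");  // dedicated cluster
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        return conf;
    }
}
```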

Create a dedicated HDFS cluster for YARN application logs and log aggregation, mounted behind ViewFs. The new cluster absorbs around 40% of the total write requests, and since most of the files on it are small, it also relieves file-count pressure on the main cluster. Because the new cluster sits behind a ViewFs mount point, no client-side changes are needed for existing user applications.
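
The relevant knob is YARN's remote application log directory; if that path is a ViewFs mount backed by the dedicated log cluster, applications keep writing to the same logical path. A sketch, with the mount point and cluster names assumed:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: YARN aggregates container logs into a path that ViewFs maps onto
// the dedicated log cluster, so existing applications need no changes.
public class LogAggregationConfigSketch {
    public static Configuration yarnConf() {
        Configuration conf = new Configuration();
        conf.setBoolean("yarn.log-aggregation-enable", true);
        // /app-logs is assumed to be a ViewFs mount point backed by the log cluster,
        // e.g. fs.viewfs.mounttable.clusterX.link./app-logs -> hdfs://nn-logs.example.com:8020/app-logs
        conf.set("yarn.nodemanager.remote-app-log-dir", "/app-logs");
        return conf;
    }
}
```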

HDFS version upgrades: upgrades can be rolled out gradually to minimize the risk of outages, although maintaining multiple separate clusters does incur an increase in operational cost.

Upgrading from HDFS 2.6 through 2.7.3, 2.8.2, and 2.9.1 brings scalability improvements such as HDFS-9710, HDFS-9198, HDFS-9412, and HDFS-13123.

The volume of incremental block reports decreases, leading to a reduction in the NameNode's load.

NameNode garbage collection tuning: GC tuning plays an important role in optimization and scaling. To prevent long GC pauses, the concurrent mark-sweep (CMS) collector can be made to collect the old generation more aggressively by tuning CMS parameters such as:

CMSInitiatingOccupancyFraction, UseCMSInitiatingOccupancyOnly, CMSParallelRemarkEnabled.

Sufficient heap space and CPU cycles are needed to support this additional collection work.

During heavy remote procedure call (RPC) loads, a large number of short-lived objects are created in the young generation, which forces the young-generation collector to perform frequent stop-the-world collections. By increasing the young generation size from 1.5 GB to 16 GB and tuning the ParGCCardsPerStrideChunk value (set to 32,768), the total time the production NameNode spent in GC pauses decreased from 13 percent to 1.7 percent, increasing throughput by more than 10 percent.
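
These JVM flags (for example -XX:CMSInitiatingOccupancyFraction, -Xmn for the young generation size, and -XX:ParGCCardsPerStrideChunk) are typically added to HADOOP_NAMENODE_OPTS in hadoop-env.sh. To verify whether tuning is actually reducing pause time, one option is to measure the fraction of JVM uptime spent in collections via the standard GarbageCollectorMXBean; the sketch below is generic JVM monitoring code, not part of the NameNode:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints cumulative GC time per collector and the fraction of JVM uptime
// spent in GC. Run inside the JVM you are tuning, or adapt it to read the
// same beans remotely over JMX.
public class GcPauseFraction {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        long totalGcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-25s collections=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            totalGcMs += gc.getCollectionTime();
        }
        System.out.printf("GC pause fraction: %.2f%%%n", 100.0 * totalGcMs / uptimeMs);
    }
}
```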

Limiting the number of small files: the NameNode keeps all file metadata in memory, so storing many small files increases memory pressure on it. In addition, small files lead to more read RPC calls for accessing the same amount of data when clients read the files, as well as more RPC calls when the files are generated.

Build new ingestion pipelines that generate much bigger files than those created by older data pipelines, and merge small files into larger ones, mostly greater than 1 GB in size. Set a strict namespace quota on Hive databases and application directories; the quota is allocated at a ratio of 256 MB per file to encourage users to optimize their output file size. Turning on auto-merge in Hive and tuning the number of reducers can greatly reduce the number of files generated by Hive insert-overwrite queries.
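
Before merging, it helps to know where the small files actually live. A sketch using the standard FileSystem API to count files under a directory that fall below a size threshold (the path and the 256 MB threshold are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Recursively counts files under a directory that fall below a size threshold,
// a quick way to find candidate directories for compaction/merging.
public class SmallFileScan {
    public static void main(String[] args) throws Exception {
        Path root = new Path(args.length > 0 ? args[0] : "/user/hive/warehouse");
        long thresholdBytes = 256L * 1024 * 1024;  // matches the 256MB-per-file quota ratio

        FileSystem fs = FileSystem.get(new Configuration());
        long small = 0, total = 0;
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            total++;
            if (status.getLen() < thresholdBytes) {
                small++;
            }
        }
        System.out.printf("%d of %d files under %s are smaller than %d bytes%n",
                small, total, root, thresholdBytes);
    }
}
```

The namespace quota itself is applied administratively, for example with hdfs dfsadmin -setQuota, while the Hive side is governed by its merge settings and reducer count.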

HDFS load management: one challenge of running a large multi-tenant infrastructure like HDFS is detecting which applications are causing unusually large loads and, from there, taking quick action to fix them. Several tools are available to monitor and control load on HDFS and can be used in production.
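
One common approach is to mine the HDFS audit log, which records every namespace operation along with the calling user and command. The sketch below tallies operations per user from a local copy of hdfs-audit.log; it assumes the default audit line layout (ugi=, ip=, cmd=, ...):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Tallies HDFS audit-log operations per user to spot applications that are
// hammering the NameNode. Expects lines containing "ugi=<user> ... cmd=<op> ...".
public class AuditLogTopUsers {
    public static void main(String[] args) throws IOException {
        String logFile = args.length > 0 ? args[0] : "hdfs-audit.log";
        Map<String, Long> opsPerUser = new HashMap<>();

        try (Stream<String> lines = Files.lines(Paths.get(logFile))) {
            lines.forEach(line -> {
                int ugi = line.indexOf("ugi=");
                if (ugi >= 0) {
                    String user = line.substring(ugi + 4).split("[\\s(]")[0];  // user name up to space or '('
                    opsPerUser.merge(user, 1L, Long::sum);
                }
            });
        }

        opsPerUser.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```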
