HDFS is a distributed file system that stores data across multiple machines in a cluster. Because it spreads both storage and I/O load across the cluster, it is well suited to large-scale storage.
The NameNode is the master node that manages the blocks held on the DataNodes. It controls and monitors DataNode instances, regulates client access to files, and keeps a record of every block and the DataNodes on which its replicas are stored.
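To make the NameNode's role concrete, here is a minimal sketch using the standard Hadoop FileSystem API to ask which DataNodes hold each block of a file. The path /data/sample.txt is a hypothetical example, and the cluster address is assumed to come from the usual core-site.xml configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        // Connects to the NameNode named in core-site.xml (fs.defaultFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used for illustration.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query from its block records:
        // for each block of the file, which DataNodes hold a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```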
The Secondary NameNode periodically generates checkpoints of the namespace: it downloads the fsimage and edit logs from the active NameNode, merges them locally into a new fsimage, and uploads the result back to the NameNode.
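Checkpoint frequency is tunable. As a hedged sketch, the snippet below sets the two standard properties that govern it: dfs.namenode.checkpoint.period (seconds between checkpoints, default 3600) and dfs.namenode.checkpoint.txns (uncheckpointed edit-log transactions that force an earlier one, default one million). In a real deployment these would normally be set in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Trigger a checkpoint at least every hour (value in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or sooner, once one million transactions have accumulated
        // in the edit log since the last checkpoint.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);
        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
    }
}
```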
The Standby (backup) NameNode ensures high availability. When the active NameNode fails, the standby is promoted to active so the cluster keeps serving requests, and the failed node can later rejoin as the new standby.
HDFS splits each file into blocks of data. By default, blocks are 128 MB in size, but the size is configurable (via the dfs.blocksize property) and is commonly raised, for example to 256 MB, for workloads dominated by very large files; see the sketch below.
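Because block size is a per-file parameter chosen at creation time, a client can also override the default directly. A minimal sketch, using a standard FileSystem.create overload and a hypothetical output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output path for illustration.
        Path file = new Path("/data/large-output.bin");

        // Block size is chosen per file at creation time; here 256 MB
        // instead of the 128 MB default. Buffer size and replication
        // factor are also passed explicitly by this overload.
        long blockSize = 256L * 1024 * 1024;
        FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, blockSize);
        out.writeUTF("example payload");
        out.close();
        fs.close();
    }
}
```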
The NameNode also maintains replication: HDFS stores copies of each block on multiple DataNodes (three replicas by default). The NameNode keeps track of under- and over-replicated blocks and schedules the creation or deletion of replicas accordingly.
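This machinery is visible from the client side: FileSystem.setReplication only records a new target factor in the NameNode's metadata, and the NameNode then brings the actual block copies into line. A minimal sketch, again with a hypothetical path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path for illustration.
        Path file = new Path("/data/sample.txt");

        // Request five replicas instead of the default three. The call
        // only updates the target factor in the NameNode's metadata;
        // the NameNode then sees the file as under-replicated and
        // schedules DataNodes to copy the missing replicas.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}
```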