UNIX Administration to Hadoop Administration

Friends,

The most mind-boggling question every UNIX admin has when looking to take up Hadoop administration ---

" Do I need to know Java? , Is it mandatory to know Java for Hadoop administration? "

The simple answer is 'NO'. It is good to know Java; for that matter, no knowledge is ever a waste. But is it mandatory? No, it isn't. Knowing Java can give you an upper hand in debugging users' queries when they fail for some reason (will discuss that in upcoming blogs).

A brief background of my journey toward Hadoop:


I started my IT career as a UNIX system engineer and worked intensively on Solaris and Red Hat before I started working on Hadoop as an admin/architect. Trust me, system admin experience helps me at every corner. One needs a good hold of the platform in order to set up Hadoop clusters, especially if you are working on an Apache Hadoop or Hortonworks cluster: the GUI or admin console for these variants is very basic compared to Cloudera Manager. Hortonworks has its own advantages, as it is truly open-source software; one can set it up as per requirements and customize it much more than Cloudera. For doing so you will need a solid hold of OS (platform) knowledge.

What else can help you become a Hadoop administrator? I think having the skill set below will be an advantage in getting going with the Hadoop admin role.

Skills required, or an advantage to have, as an admin:

1) OS / platform knowledge, preferably Red Hat (RHEL) or CentOS
2) Scripting; Python is an added advantage along with shell (Bash), awk, and sed
3) MySQL basics
4) Networking basics
5) Storage basics (JBOD); not high-end storage, just the basics

Scripting can help you perform your admin work effectively. Chef and Puppet are tools system admins are normally aware of, and those skills can be used for Hadoop infrastructure administration as well.
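As a small illustration of how shell and awk carry over, here is a minimal sketch; it assumes the hdfs CLI is on your PATH, and the report's exact layout varies a little by Hadoop version:

    #!/usr/bin/env bash
    # Reuse familiar shell/awk skills for Hadoop admin work:
    # print the "Dead datanodes" section of the dfsadmin report.
    hdfs dfsadmin -report 2>/dev/null | awk '/^Dead datanodes/,/^$/'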

What are the tasks of a Hadoop admin?

Broadly, the tasks can be classified into two sets.

a) Daily business as usual (BAU)
  • Creating databases in Hive or HBase (both will be discussed in upcoming blogs). Also creating a filesystem in HDFS. When I say filesystem, I don't mean a physical device or mount point; HDFS is a filesystem in itself, precisely a distributed filesystem (will discuss in later blogs).
The two commands most helpful for basic Hadoop filesystem administration are hdfs and hadoop.

The link below can help you get a better hold of both commands; a small taste of everyday usage follows after it.
Link 1.0
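A minimal sketch of that everyday filesystem work (the paths here are illustrative examples):

    # Everyday HDFS filesystem work (paths are illustrative).
    hdfs dfs -mkdir -p /data/projects                # create a directory tree
    hdfs dfs -put report.csv /data/projects/         # copy a local file into HDFS
    hdfs dfs -ls /data/projects                      # list it
    hdfs dfs -chown alice:analysts /data/projects    # change ownership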

  • Setting ACLs for Hadoop files and directories (ref. link 1.0)
  • Setting quotas for directories
  • Setting replication factors and replicating to DR (ref. distcp, link 1.0)
  • Housekeeping, i.e. deleting trash (ref. delete and expunge in link 1.0); one-line commands for all four items are sketched below
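Each of those BAU items maps to a one-liner. A minimal sketch, assuming ACLs are enabled on the cluster; the paths, quota size, and DR NameNode address are illustrative:

    # ACLs, quotas, replication, and trash housekeeping (illustrative values).
    hdfs dfs -setfacl -m user:bob:r-x /data/projects    # grant bob read/execute (needs dfs.namenode.acls.enabled)
    hdfs dfsadmin -setSpaceQuota 1t /data/projects      # cap the directory's disk usage at 1 TB
    hdfs dfs -setrep -w 3 /data/projects/report.csv     # set the replication factor and wait for it
    hadoop distcp /data/projects hdfs://dr-nn:8020/data/projects   # copy to the DR cluster
    hdfs dfs -expunge                                   # purge trash older than the retention window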
b) Advanced admin tasks
  • Adding or removing nodes in a Hadoop cluster (ref. link 1.1, Cloudera example)
Link 1.1
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_dn_storage_directories.html
  • Backing up the fsimage of HDFS, i.e. backing up the HDFS metadata (ref. link 1.2; see the sketch after this list)
Link 1.2
https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_mc_hdfs_metadata_backup.html
  • And so on and so forth.
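For the metadata backup above, a minimal manual sketch; it assumes a recent Apache Hadoop and an existing local /backup directory (Cloudera Manager automates this, per link 1.2):

    # Manual HDFS metadata (fsimage) backup -- paths are illustrative.
    hdfs dfsadmin -safemode enter        # pause namespace modifications
    hdfs dfsadmin -saveNamespace         # checkpoint the edit log into a fresh fsimage
    hdfs dfsadmin -safemode leave
    hdfs dfsadmin -fetchImage /backup/namenode/   # download the latest fsimage locally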
I will cover more error and issue handling in my next blog (debugging HDFS).


In a nutshell, I would like to tell my UNIX friends: don't be reluctant about Hadoop administration; it is really a very interesting technology to get into. Cloud and HDFS integration these days makes the technology even more interesting.

Where to start?

Nothing helps more than self-learning. All that matters is your thirst for technology. You need to be ready to learn, and learn, and learn. Hadoop is an evolving area and a lot of research is going on; we get to know new things daily. If you are lazy about learning, then Hadoop is not your cup of tea.

To start with, I would suggest reading the books below.
  • Hadoop: The Definitive Guide
  • Hadoop Operations
  • The Apache Hadoop wiki (ref. link 2.0)
Link 2.0

Some YouTube videos also help. Once you get some idea of what Hadoop is (my upcoming blogs can be helpful there), you can also go for online courses or classes; I suggest online sessions from reputed firms or Cloudera webinars. Please bear in mind: don't carry misconceptions; clear them up in the first place, or they will lead to trouble understanding the technology.

Set up a practice cluster with VMware on your system. Apache Hadoop is freely available, or you can get the Cloudera VM image to practice Hadoop; you can download it from link 2.1. A single-node setup is sketched below.
Link 2.1
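A minimal sketch of a single-node (pseudo-distributed) Apache Hadoop setup on Linux; the version number and install path are illustrative, and the usual core-site.xml / hdfs-site.xml edits from the Apache setup docs are assumed to be done first:

    # Single-node practice cluster -- version and paths are illustrative.
    tar -xzf hadoop-3.3.6.tar.gz -C /opt
    export HADOOP_HOME=/opt/hadoop-3.3.6
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    hdfs namenode -format              # one-time format of the practice NameNode
    start-dfs.sh                       # start the NameNode and DataNode daemons
    hdfs dfs -mkdir -p /user/$USER     # quick smoke test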

There is no substitute for self-learning. Two hours daily, even with average learning skills, can get you a good hold of Hadoop admin skills within two months. Consistency is the key, and develop the habit of questioning yourself.

The main component of the Hadoop ecosystem is the HDFS filesystem.

What is the HDFS filesystem?

HDFS is a distributed filesystem, meaning the filesystem spans more than one system. The data, i.e. each file, is divided into small parts and stored on different nodes. Hadoop also keeps 3 copies of the same block by default. This gives data redundancy, so in case any node goes down we can still read the data/file.
The systems which hold the data are called DataNodes.

Now think of it this way. The data is divided into parts called blocks. We need someone to tell us where the blocks for a specific file live, or where to write a new block (to which DataNode). This job is done by the system (process) called the NameNode.
It is the master which governs all read and write operations on files, which are stored as blocks (parts). The block information is stored by the NameNode in a file called the fsimage.
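You can watch this file-to-block mapping yourself with fsck; a minimal sketch (the path is an illustrative example):

    # Show how a file is split into blocks and where each replica lives.
    hdfs fsck /data/projects/report.csv -files -blocks -locations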

The fsimage is also called the metadata of HDFS (will explain the fsimage and edit log in later blogs). When a file is changed, that is, appended to, its owner changed, or deleted, the metadata is modified; these transactions on files and/or directories are recorded in a file called the edit log.
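Both files can be inspected offline with viewers that ship with Hadoop; a minimal sketch (the input file names are illustrative, as the real ones carry transaction-ID suffixes):

    # Dump the fsimage and edit log to readable XML (file names are illustrative).
    hdfs oiv -p XML -i fsimage_0000000000000012345 -o fsimage.xml    # offline image viewer
    hdfs oev -i edits_inprogress_0000000000000012346 -o edits.xml    # offline edits viewer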

Further, in enterprise solutions we do not want any user (or application) to access HDFS from a DataNode or the NameNode directly. So, to keep things more secure, an access point is created in the form of a node. That node is called the edge node.



Below is the Hadoop ecosystem as published by Cloudera:

[Cloudera Hadoop ecosystem diagram]

We just discussed one of the most important building blocks of Hadoop, i.e. HDFS.

How is the big data problem solved by HDFS?


I will try to explain this term in layman's words. Big data is nothing but a huge amount of data, and not just that, a variety of data: structured or unstructured, flowing in from all directions. Nowadays in the digital world, just think about internet browsers: across the world we have so many phones, systems, laptops, and other devices using browsers and generating data. If we have to analyse such huge data, we need something capable of handling big data.

Why HDFS?

So in simple terms: if one man can do X amount of work, 10 people can do 10*X work. Having a distributed filesystem and distributed databases yields the same result. Otherwise, think about a traditional database, which could take days to return a query run over huge data. As a back-of-envelope example, scanning 10 TB from a single disk at roughly 100 MB/s takes over a day, while the same scan spread across 100 disks in parallel finishes in under 20 minutes.


*** In the next blog we will delve deeper into HDFS and its concepts, along with some daily issues and their handling.




