Subjects
- Big Data Management
- What is Hadoop?
- Why is Hadoop important?
- What is the MapR Hadoop Distribution?
- Why should you choose the MapR Hadoop distribution?
- What are Services?
- Steps to Deploy a MapR Cluster
- MapR Components
- MapR Enterprise Edition or Community Edition
- Migrating HDFS to MapR-FS
- Advice before installing and configuring MapR
- Architectural Advice
- Configure everything for installation
- Prerequisites
- Preparing
- MapR Installation on the Admin Console
- Licensing Process
- Local installation
- Known Issues
- MapR Forum
- References
Big Data Management
There are a few popular Hadoop distributions in the market such as Cloudera, MapR, and Hortonworks.
What is Hadoop?
Apache Hadoop is an open source framework for efficiently storing and processing large datasets ranging in size from gigabytes to petabytes of data. Instead of using a single large computer to store and process data, Hadoop allows multiple computers to be clustered together to analyze large datasets more quickly in parallel.
Hadoop consists of four main modules:
HDFS, YARN, MapReduce, Hadoop Common.
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. This is an important consideration, especially with the ever-increasing volumes and variety of data from social media and the Internet of Things (IoT).
- Computing power. Hadoop’s distributed computing model processes big data quickly.
- Fault tolerance. If a node goes down, jobs are automatically routed to other nodes to ensure distributed computation doesn’t fail. Multiple copies of all data are automatically stored.
- Flexibility. You can store as much data as you want and then decide how to use it.
- Low cost. The open source framework is free and uses commodity hardware to store large amounts of data.
- Scalability. You can easily grow your system to handle more data by simply adding nodes.
What is the MapR Hadoop Distribution?
The MapR Hadoop distribution is 100% plug-and-play compatible with Apache Hadoop, so you don’t need to make any changes to run your applications on a MapR cluster. MapR Hadoop automatically configures compression and memory settings, task heap sizes, and local volumes for mixed data.
The MapR Hadoop distribution is built on the idea that a market-driven company can respond to market needs faster.
Unlike Cloudera and Hortonworks, the MapR distribution takes a more distributed approach to storing metadata on the processing nodes, because it depends on a different file system, the MapR File System (MapR-FS), and has no NameNode architecture. The MapR distribution is also not layered on top of the Linux file system.
MapR is considered one of the fastest Hadoop distributions.
MapR is also the only Hadoop distribution that includes Pig, Hive, and Sqoop without Java dependencies, since it relies on MapR-FS.
Why should you choose the MapR Hadoop distribution?
Although MapR is still third in terms of number of installations, it is one of the easiest and fastest Hadoop distributions compared to the others.
MapR provides its own distributed file system and comes in two models: Community Edition and Enterprise Edition.
Hadoop benchmark:
For more, read:
https://www.altoros.com/research-papers/hadoop-distributions-cloudera-vs-hortonworks-vs-mapr/
What are Services?
A MapR cluster is a full Hadoop distribution. Hadoop itself consists of a storage layer and a MapReduce layer. In addition, MapR provides cluster management tools, data access via NFS, and a few behind-the-scenes services that keep everything running. Some applications and Hadoop components, such as HBase, are implemented as services; others, such as Pig, are simply applications that you run as needed. We will lump them together here, but the distinction is worth making.
- MapReduce services: JobTracker, TaskTracker
- Storage services: CLDB, FileServer, HBase RegionServer, NFS
- Management services: HBase Master, Webserver, ZooKeeper
A daemon called the warden runs on every node to make sure that the proper services are running (and to allocate resources for them). The only service that the warden doesn’t control is the ZooKeeper. Part of the ZooKeeper’s job is to have knowledge of the whole cluster; in the event that a service fails on one node, it is the ZooKeeper that tells the warden to start the service on another node.
MapR Direct Access NFS offers usability and interoperability benefits and makes big data easier and cheaper to handle.
MapR allows for files to be modified and overwritten at high speeds in real time from remote servers via an NFS connection and provides multiple simultaneous reads and writes on any file.
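As a minimal sketch of Direct Access NFS usage (the gateway host nosql2 and the default /mapr export are assumptions; adjust for your cluster):
#Mount MapR-FS on a client machine through the NFS gateway
mkdir -p /mapr
mount -o hard,nolock nosql2:/mapr /mapr
ls /mapr    #standard Linux tools can now read and write cluster data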
MapR File System does not have a NameNode.
MapR supports high availability, real-time streaming, easy data integration, and true multi-tenancy in YARN.
Steps to Deploy a MapR Cluster
On very small clusters of just a few nodes, it’s impractical to isolate services on dedicated nodes. One layout approach is to run one CLDB and one ZooKeeper on the same node, leaving the other nodes free to run the TaskTracker. All nodes should run the FileServer. If you need HA in a small cluster, you will end up running the CLDB and ZooKeeper on additional nodes. Here is a sample layout:
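An illustrative three-node layout (a sketch only, not the layout from the tutorials linked below):
Node 1: ZooKeeper, CLDB, FileServer, JobTracker, Webserver
Node 2: FileServer, TaskTracker, NFS
Node 3: FileServer, TaskTracker, NFS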
https://mapr.com/developer-portal/mapr-tutorials/steps-deploy-mapr-cluster-part-1-2/
https://mapr.com/developer-portal/mapr-tutorials/steps-deploy-mapr-cluster-part-2-2/
MapR Components
Hue, Impala, Webserver, Drill, Elasticsearch, HBase, Hive, HTTPFS, Kafka, Oozie, OpenTSDB, YARN, Spark, ZooKeeper, Flume, Object Store, Pig, Sqoop, Tez
MapR Enterprise Edition or Community Edition
MapR comes in two models: the Community Edition and the Enterprise Edition.
If you will use it in production, you should buy MapR Enterprise.
Migrating HDFS to MapR-FS
Before you copy data from an HDFS cluster to a MapR cluster using the hdfs:// protocol, you must configure the MapR cluster to access the HDFS cluster. To do this, complete the steps listed in Configure a MapR Cluster to Access an HDFS Cluster for the security scenario that best describes your HDFS and MapR clusters, and then complete the steps listed in Verifying Access to an HDFS Cluster.
If the MapR cluster can read the contents of the file, run the distcp command to copy the data from the HDFS cluster to the MapR cluster:
hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> maprfs://<MapR-FS path>
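For example, with a hypothetical NameNode host, port, and paths:
hadoop distcp hdfs://namenode01:8020/user/data/sales maprfs:///migration/sales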
Advice before installing and configuring MapR
You should use Red Hat 8.
The root password will be reset for direct access.
You should use at least 2 servers.
You should install the EPEL repository.
You need direct Internet access, or you must download the related MapR packages in advance.
Minimum resources: 16 GB RAM (64 GB advised) and 8 CPUs (16 CPUs advised).
You need generous disk sizes (64 GB /tmp, 128 GB /opt, 32 GB /home) as LVM.
You need at least 3 raw disks of at least 15 GB each (not formatted). Do not use RAID. Do not use LVM (Logical Volume Manager) on these disks.
All IPs and hostnames should be defined in DNS.
SELinux should be disabled.
The firewall should be stopped and disabled.
You should install and configure NTP/chronyd.
You should install and configure Java JDK 8.
You should permit port 9443 (MapR installation web interface) and port 8443 (MapR services web interface).
You should configure passwordless SSH between nodes.
You should set swappiness to 1 and transparent_hugepage to never.
You should stop and disable nfs, nfs-server, and nfs-lock.
You should set soft nofile, hard nofile, soft nproc, and hard nproc to 64000 in limits.conf.
You should set ulimit -n 64000 in .bash_profile.
You should set umask to 0022.
Set PermitRootLogin to yes in sshd_config.
You should set vm.overcommit_memory to 0 in sysctl.conf.
You should configure /etc/pam.d/su for the mapr user.
You should configure resolv.conf for the network.
A quick pre-flight check covering a subset of these items is sketched below.
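A minimal pre-flight check sketch (assumes RHEL 8 paths; illustrative only):
#Quick sanity checks before running the MapR installer
getenforce                                         #expect Disabled
systemctl is-active firewalld                      #expect inactive
cat /proc/sys/vm/swappiness                        #expect 1
cat /sys/kernel/mm/transparent_hugepage/enabled    #expect [never]
ulimit -n                                          #expect 64000
java -version                                      #expect 1.8.x
lsblk -dn -o NAME,SIZE,TYPE                        #raw data disks present and unformatted
chronyc sources                                    #time synchronization working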
Architectural Advice
CLDB and ZooKeeper: on small clusters, you may need to run CLDB and ZooKeeper on the same node.
CLDB and ZooKeeper: on medium clusters, assign them to separate nodes.
CLDB and ZooKeeper: on large clusters, put them on separate, dedicated control nodes.
ResourceManager and ZooKeeper: avoid running ResourceManager and ZooKeeper together.
ResourceManager: with more than 200 nodes, run ResourceManager on dedicated nodes.
Large clusters: avoid running MySQL Server or the webserver on a CLDB node.
Configure everything for installation
1-) You should install and configure the required repositories.
yum repolist all | grep enabled
yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
dnf install http://mirror.centos.org/centos/8/BaseOS/x86_64/os/Packages/centos-linux-repos-8-2.el8.noarch.rpm http://mirror.centos.org/centos/8/BaseOS/x86_64/os/Packages/centos-gpg-keys-8-2.el8.noarch.rpm
yum install https://dev.mysql.com/get/mysql80-community-release-el8-1.noarch.rpm
Etc…
2-) Example /opt disk configuration:
fdisk -l | grep '^Disk'
#Disk /dev/sde: 130 GiB, 139586437120 bytes, 272629760 sectors
fdisk /dev/sde
n p 1 (Enter) (Enter) w   #new primary partition using the whole disk, then write
mkfs.ext4 /dev/sde1
mkdir /opt/
mount /dev/sde1 /opt
df -Ph | grep opt
vi /etc/fstab
/dev/sde1 /opt ext4 defaults 1 2
Other disks should be as below.
64G /tmp
128G /opt
32G /home
3-) Raw disks
fdisk -l
MapR disks (added as raw, unformatted devices):
/dev/sdb: 15 GiB
/dev/sdc: 15 GiB
/dev/sdd: 15 GiB
4-) Hosts file for the cluster. (We use only nosql2.)
cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.22 nosql1.localdomain nosql1
192.168.1.27 nosql2.localdomain nosql2
5-) The hostnames should be added to DNS.
We used just one node: nosql2.
If there are multiple nodes, you must add them all to /etc/hosts:
192.168.1.22 nosql1.localdomain nosql1
192.168.1.27 nosql2.localdomain nosql2
#If you use a virtual test machine
#C:\Windows\System32\drivers\etc\hosts
#192.168.1.22 nosql1.localdomain
#192.168.1.27 nosql2.localdomain
#192.168.1.61 mapr.gencali.com
#
#vi /etc/hosts
#192.168.1.61 mapr.gencali.com
6-1) SELinux should be disabled
cat /etc/selinux/config | grep SELINUX
SELINUX=disabled
getenforce
#If the output is not Disabled, apply the steps below; coordinate with İrem Hanım for the reboot.
----!!!!!!!!!!!!!
vi /etc/selinux/config
SELINUX=disabled
sestatus
getenforce
shutdown -r now
getenforce
sestatus
----!!!!!!!!!!!!!
6-2) Firewall should be stopped
systemctl stop firewalld
systemctl disable firewalld
systemctl stop iptables
systemctl disable iptables
7-) chrony install and configure
yum -y install chrony
vi /etc/chrony.conf
server 0.tr.pool.ntp.org
server 1.tr.pool.ntp.org
server 2.tr.pool.ntp.org
server 3.tr.pool.ntp.org
systemctl restart chronyd
chronyc sources
date
8-) You should install and configure Java
https://www.oracle.com/tr/java/technologies/javase/javase-jdk8-downloads.html
Download Oracle Java JDK/JRE 8 and configure it on the servers:
mkdir /usr/java
cd /usr/java
tar -xvzf /tmp/jdk-8u281-linux-x64.tar.gz
chmod 755 /usr/java
cd /usr/java
mv latest latest_old
ln -s /usr/java/jdk1.8.0_281 latest
ls -lrt
#default -> /usr/java/latest
#latest -> /usr/java/jdk1.8.0_281
java -version
vi /etc/profile
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
echo $JAVA_HOME
sudo update-alternatives --list
sudo update-alternatives --config java
sudo update-alternatives --config javac
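If the manually extracted JDK is not listed by update-alternatives, it can be registered first (the priority value 2 is arbitrary; the /usr/java/latest symlink created above is assumed):
sudo update-alternatives --install /usr/bin/java java /usr/java/latest/bin/java 2
sudo update-alternatives --install /usr/bin/javac javac /usr/java/latest/bin/javac 2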
9-) umask should be configured
vi /etc/profile
umask 0022
10-) SSH connectivity should be configured between nodes.
We used just one node for the installation.
If there are multiple nodes, follow the steps below:
- ssh-keygen -t rsa
Press Enter at each prompt.
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- chmod og-wx ~/.ssh/authorized_keys
#Copy the contents of /root/.ssh/id_rsa.pub on each node into /root/.ssh/authorized_keys on the other node, either manually with vi or with ssh-copy-id:
ssh-copy-id root@nosql2
ssh-copy-id root@nosql1
#Verify the keys:
cat /root/.ssh/id_rsa.pub
cat /root/.ssh/authorized_keys
#Test SSH to each node:
ssh root@localhost
ssh root@nosql1
ssh root@nosql2
11-) Swappiness and transparent huge pages should be configured
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
sysctl vm.swappiness=1
cat /proc/sys/vm/swappiness
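The commands above do not survive a reboot. One way to persist them (a sketch; assumes /etc/sysctl.conf and the rc-local mechanism are acceptable on your systems):
echo "vm.swappiness=1" >> /etc/sysctl.conf
sysctl -p
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.d/rc.local
echo 'echo never > /sys/kernel/mm/transparent_hugepage/defrag' >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local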
12-) NFS should be stopped
https://docs.datafabric.hpe.com/62/AdministratorGuide/c_POSIX_loopbacknfs_client.html
systemctl stop nfs
systemctl disable nfs-server
systemctl disable nfs-lock
systemctl stop nfs-server
systemctl stop nfs-lock
#Control
service mapr-loopbacknfs status
#Active: active (running)
13-) Ulimit in bash_profile
vim ~/.bash_profile
ulimit -n 64000
14-) Tcp_retries2 configure
Change the value from the default of 15 to 5.
cat /proc/sys/net/ipv4/tcp_retries2
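One way to apply and persist this (assuming /etc/sysctl.conf is used for persistence):
sysctl -w net.ipv4.tcp_retries2=5
echo "net.ipv4.tcp_retries2=5" >> /etc/sysctl.conf
cat /proc/sys/net/ipv4/tcp_retries2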
15-) Limits.conf configure
vi /etc/security/limits.conf
* soft nofile 64000
* hard nofile 64000
* soft nproc 64000
* hard nproc 64000
16-) Create home directories
Create the MapR home directory:
mkdir -p /opt/mapr
chmod -R 777 /opt/mapr
Create the Elasticsearch home directory:
mkdir -p /opt/mapr/es_db
chmod -R 777 /opt/mapr/es_db
17-) Sysctl.conf configure
vi /etc/sysctl.conf
vm.overcommit_memory=0
sysctl -p
18-) Sshd_config
cat /etc/ssh/sshd_config
PermitRootLogin yes
19-) Pam configure
cp /etc/pam.d/su /etc/pam.d/su_bck
> /etc/pam.d/su
vi /etc/pam.d/su
#Check that the /etc/pam.d/su file contains the following settings:
#%PAM-1.0
# Uncomment the following line to implicitly trust users in the “wheel” group.
#auth sufficient pam_wheel.so trust use_uid
# Uncomment the following line to require a user to be in the “wheel” group.
#auth required pam_wheel.so use_uid
auth sufficient pam_rootok.so
auth include system-auth
account sufficient pam_succeed_if.so uid = 0 use_uid quiet
account include system-auth
password include system-auth
session include system-auth
session required pam_limits.so
session optional pam_xauth.so
20-) Sample resolv.conf configure
cat /etc/resolv.conf
# Generated by NetworkManager
search home localdomain
nameserver 192.168.1.1
Prerequisites
The prerequisites must be met on all nodes.
mapr-setup.sh must be run on all nodes. (Otherwise, it throws an error for the second node because the mapr user has not been created.)
Ref:
https://docs.datafabric.hpe.com/61/MapRInstaller.html
Download mapr-setup.sh.
Preparing
Install mapr-setup.sh on all nodes, but configure the installation for all nodes from the first node only.
mkdir -p /tmp/mapr/
chmod -R 777 /tmp/mapr/
wget https://package.mapr.com/releases/installer/mapr-setup.sh -P /tmp/mapr/
chmod +x /tmp/mapr/mapr-setup.sh
sudo bash /tmp/mapr/mapr-setup.sh
Connect with a browser:
https://<Installer Node hostname/IP address>:9443
https://nosql2.localdomain:9443
Default credentials: mapr/mapr
You can change the prerequisite checks, but this is not advised:
vim /opt/mapr/installer/ansible/playbooks/library/prereq/
For example, prereq_check_ram.py.
You can also set environment overrides in:
/opt/mapr/conf/env_override.sh
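For example (illustrative content only; set whichever variables your environment actually needs):
#/opt/mapr/conf/env_override.sh - example override
export JAVA_HOME=/usr/java/latest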
Installation web interface:
https://nosql2.localdomain:9443
Installer
Cluster=mapr.gencali.com
Version 6.2.0
MEP 7.0.1
MapR Installation on the Admin Console
1-) MapR Installation
2.1-) Version & Services
You can choose to install the Community or Enterprise option.
You can choose to install custom services and versions.
You can choose the license option: “after installation” or “log in to MapR HPE”.
2.2-) You can choose to install custom services and versions.
3-) Database Setup
If you choose to install Hue, Hive, or Oozie, MySQL will be installed and you will choose the related users.
4-) Monitoring
Grafana will be installed.
5-) Set Up Cluster
The mapr user was created by mapr-setup.
The cluster name must be defined in DNS.
6-) Node configuration.
Nodes can be entered one per line. A minimum of 2 servers is advised. Nodes must be defined in DNS.
Disks must be raw and separated with commas. A minimum of 3 raw disks is advised.
The SSH user should be mapr; you can also use root.
All the data transfer NICs should be the same speed; if you use different speeds, all the NICs will operate at the lowest speed. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. For more information, see Designating NICs for MapR.
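A hedged sketch of restricting MapR to one subnet (the subnet value is illustrative; /opt/mapr/conf/env_override.sh is assumed as the place to set it):
#Limit MapR traffic to a single subnet (example value)
export MAPR_SUBNETS=192.168.1.0/24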
7.1-) Verify pre-checks for the nodes.
All nodes will be checked to confirm that they satisfy the minimum requirements.
Status codes during verification:
White – Verification in progress
Green – Ready for installation
Yellow – Warnings but can be installed
Red – Node cannot be part of cluster
7.2-) Critical and failed statuses should be fixed. Warnings can be passed over, but it is helpful to fix them later.
8-) Progress Confirmation
9.1-) Configuration Service Layout
9.2-) Node Layout
9.3-) Advanced Component Configuration
10-) Licensing
11-) Installation Step
12-) Installation Complete
13-) Mapr Login
14-) License Accept
15-) Overview
16-) Overview
17-) Services interface
18-) Node interface
19-) Volume interface
20-) Tables interface
21-) Cluster setting interface
22-) Running processes after installation
23-) Hadoop all applications
24-) All components are as below.
Links to UI Pages
Service Name - Browser URL
Drill - http://nosql2.localdomain:8047
Grafana - http://nosql2.localdomain:3000
HBase Master - http://nosql2.localdomain:16010
History Server - http://nosql2.localdomain:19888
Hue - http://nosql2.localdomain:8888
Impala Catalog - http://nosql2.localdomain:25020
Impala Server - http://nosql2.localdomain:25000
Impala Statestore - http://nosql2.localdomain:25010
Kibana - http://nosql2.localdomain:5601
Spark History Server - http://nosql2.localdomain:18080
Spark Thrift Server - http://nosql2.localdomain:4040
Webserver - https://nosql2.localdomain:8443
YARN Node Manager - http://nosql2.localdomain:8042
YARN Resource Manager - http://nosql2.localdomain:8088
API Services
Service Name - Service Ports
Apiserver - https://nosql2.localdomain:8443
Drill - nosql2.localdomain:31010
Elasticsearch - nosql2.localdomain:9300
HBase Master - nosql2.localdomain:16000
HBase Region Server - nosql2.localdomain:16030
HBase REST - nosql2.localdomain:8080
HBase Thrift - nosql2.localdomain:9090
Hive Metastore - nosql2.localdomain:9083
Hive Server 2 - nosql2.localdomain:10000
Hive WebHCat - nosql2.localdomain:50111
HTTPFS - nosql2.localdomain:14000
Impala Server - nosql2.localdomain:21050
Apache Kafka REST API - nosql2.localdomain:8082
Mastgateway - nosql2.localdomain:8660
YARN Node Manager - nosql2.localdomain:8041
Oozie - nosql2.localdomain:11000
OpenTSDB - nosql2.localdomain:4242
YARN Resource Manager - nosql2.localdomain:8033
Spark Thrift Server - nosql2.localdomain:2304
Zookeeper - nosql2.localdomain:5181
Licensing Process
You can add the license after installation.
Go to the HPE Ezmeral Data Fabric Control System:
https://nosql2.localdomain:8443/
By doing so, you default to agreeing to the community license agreement:
https://mapr.com/legal/eula/
1-) Install the license.
Log in to the HPE Ezmeral Data Fabric Control System and accept the community license. (One license activation per user account.)
2-) Add the license after the installation completes.
https://webservices_server:8443/
(Upload or copy-paste the licenses and click Apply Licenses. Then restart all services.)
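Alternatively (a sketch; the license file path is hypothetical), the license can be applied from the command line with maprcli on a cluster node:
#Apply and verify a license file from the command line
maprcli license add -license /tmp/mapr_license.txt -is_file true
maprcli license list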
Free software criteria:
https://mapr.com/legal/eula/
For the Community Edition:
There can be one activation per account.
https://community.datafabric.hpe.com/s/question/0D50L00006BIskHSAT/how-to-add-license-to-the-cluster
https://docs.datafabric.hpe.com/62/ClusterAdministration/admin/cluster/AddLicense.html#ManagingLicenses-Toaddali_26982459-d3e102check
Enterprise license agreement:
https://docs.datafabric.hpe.com/62/additional-license-authorization.html
Community license agreement:
https://mapr.com/download/
The MapR Data Platform – Community Edition* is available for free per restrictions specified in the MapR End User License Agreement (EULA).
https://mapr.com/legal/eula/
Local installation
If you cannot download from the MapR repository, you can proceed as below.
https://docs.datafabric.hpe.com/61/AdvancedInstallation/c_local_repo_install.html
Currently, you can download and install the Ezmeral Data Fabric install packages from
http://package.mapr.com/
Known Issues
Known issues are documented here, in case you need them:
https://docs.datafabric.hpe.com/62/MapRInstallerReleaseNotes/mapr_installer_known_issues.html
https://community.datafabric.hpe.com/s/question/0D50L00006d2sfRSAQ/i-just-installed-mapr-and-rebooted-now-i-can-no-longer-access-any-of-the-services
MapR Forum
You can search for or post problems on the forum.
https://community.datafabric.hpe.com/
References
https://docs.datafabric.hpe.com/
https://www.altoros.com/research-papers/hadoop-distributions-cloudera-vs-hortonworks-vs-mapr/