MapR Hadoop Distribution for Managing Big Data and a Step-by-Step MapR Cluster Install

Subjects

  • Managing Big Data
  • What is Hadoop?
  • Why is Hadoop important?
  • What is the MapR Hadoop Distribution?
  • Why should you choose the MapR Hadoop distribution?
  • What are Services?
  • Steps to Deploy a MapR Cluster
  • MapR Components
  • MapR Enterprise Edition or Community Edition
  • Migrating from HDFS to MapR-FS
  • Advice Before Installing and Configuring MapR
  • Architectural Advice
  • Configure Everything for Installation
  • Prerequisites
  • Preparing
  • MapR Installation on the Admin Console
  • Licensing Process
  • Local Installation
  • Known Issues
  • MapR Forum
  • References

 

Managing Big Data

There are a few popular Hadoop distributions on the market, such as Cloudera, MapR, and Hortonworks.

What is Hadoop?

Apache Hadoop is an open source framework for efficiently storing and processing large datasets ranging in size from gigabytes to petabytes of data. Instead of using a single large computer to store and process data, Hadoop allows multiple computers to be clustered together to analyze large datasets more quickly in parallel.

Hadoop consists of four main modules: HDFS (distributed storage), YARN (resource management), MapReduce (parallel processing), and Hadoop Common (shared utilities and libraries).

Why is Hadoop important?

  • Ability to store and process huge amounts of any kind of data, quickly. This is an important consideration, especially with the ever-increasing volumes and variety of data from social media and the Internet of Things (IoT).
  • Computing power. Hadoop's distributed computing model processes big data quickly.
  • Fault tolerance. If a node goes down, jobs are automatically routed to other nodes so distributed computation doesn't fail, and multiple copies of all data are stored automatically.
  • Flexibility. You can store as much data as you want and then decide how to use it.
  • Low cost. The open source framework is free and uses commodity hardware to store large amounts of data.
  • Scalability. You can easily grow your system to handle more data by simply adding nodes.

What is the MapR Hadoop Distribution?

The MapR Hadoop distribution is 100% plug-and-play compatible with Apache Hadoop, so you don't need to make any changes to run your applications on a MapR cluster. MapR automatically configures compression and memory settings, task heap sizes, and local volumes for shuffle data.

 

The MapR Hadoop offering works on the premise that a market-driven entity should support market needs faster.

Unlike Cloudera and Hortonworks, the MapR distribution takes a more distributed approach to storing metadata across the processing nodes, because it depends on a different file system, the MapR File System (MapR-FS), and has no NameNode architecture. MapR-FS is also not layered on the Linux file system; it manages raw disks directly.

MapR is considered one of the fastest Hadoop distributions.

MapR is the only Hadoop distribution that includes Pig, Hive, and Sqoop without any Java dependencies, since it relies on MapR-FS.

 

Why should you choose the MapR Hadoop distribution?

Though MapR is still number 3 in terms of installations, it is one of the easiest and fastest Hadoop distributions compared to the others.

MapR is a great distributed file system that comes in two models: Community Edition and Enterprise Edition.

Hadoop benchmark:

 

To read more:

https://www.altoros.com/research-papers/hadoop-distributions-cloudera-vs-hortonworks-vs-mapr/

 

What are Services?

A MapR cluster is a full Hadoop distribution. Hadoop itself consists of a storage layer and a MapReduce layer. In addition, MapR provides cluster management tools, data access via NFS, and a few behind-the-scenes services that keep everything running. Some applications and Hadoop components, such as HBase, are implemented as services; others, such as Pig, are simply applications that you run as needed. We will lump them together here, but the distinction is worth making.

 

 

  • MapReduce services: JobTracker, TaskTracker
  • Storage services: CLDB, FileServer, HBase RegionServer, NFS
  • Management services: HBase Master, Webserver, ZooKeeper

A daemon called the warden runs on every node to make sure that the proper services are running (and to allocate resources for them). The only service that the warden doesn’t control is the ZooKeeper. Part of the ZooKeeper’s job is to have knowledge of the whole cluster; in the event that a service fails on one node, it is the ZooKeeper that tells the warden to start the service on another node.
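Once a cluster is up, you can see which services the warden is managing on a node with maprcli (a quick check; "nosql2" is the node used later in this guide):

maprcli service list -node nosql2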

MapR Direct Access NFS offers usability and interoperability benefits and makes big data easier and cheaper to handle.

MapR allows for files to be modified and overwritten at high speeds in real time from remote servers via an NFS connection and provides multiple simultaneous reads and writes on any file.
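For example, you can mount the cluster over NFS and use ordinary POSIX tools on it (a sketch; the mount point and cluster name follow this guide's setup):

mkdir -p /mapr

mount -o hard,nolock nosql2:/mapr /mapr

ls /mapr/mapr.gencali.com    #the cluster root appears as a normal directory

cp /var/log/messages /mapr/mapr.gencali.com/tmp/    #a plain file copy into the cluster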

MapR File System does not have a NameNode.

MapR supports high availability, real-time streaming, easy data integration, and true multi-tenancy in YARN.

 

Steps to Deploy a MapR Cluster

On very small clusters of just a few nodes, it’s impractical to isolate services on dedicated nodes. One layout approach is to run one CLDB and one ZooKeeper on the same node, leaving the other nodes free to run the TaskTracker. All nodes should run the FileServer. If you need HA in a small cluster, you will end up running the CLDB and ZooKeeper on additional nodes. Here is a sample layout:
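One possible three-node layout following this guidance (a sketch consistent with the advice above, not taken from the linked tutorial):

Node 1: CLDB, ZooKeeper, FileServer, TaskTracker

Node 2: JobTracker, Webserver, FileServer, TaskTracker

Node 3: FileServer, TaskTracker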

https://mapr.com/developer-portal/mapr-tutorials/steps-deploy-mapr-cluster-part-1-2/

https://mapr.com/developer-portal/mapr-tutorials/steps-deploy-mapr-cluster-part-2-2/

 

MapR Components

 

Hue, Impala, Webserver, Drill, Elasticsearch, HBase, Hive, HTTPFS, Kafka, Oozie, OpenTSDB, YARN, Spark, ZooKeeper, Flume, Object Store, Pig, Sqoop, Tez

 

 

MapR Enterprise Edition or Community Edition

MapR comes in two models: the free Community Edition and the paid Enterprise Edition.

If you will use the cluster in production, you should buy MapR Enterprise Edition.

 

 

 

Migrating from HDFS to MapR-FS

Before you copy data from an HDFS cluster to a MapR cluster using the hdfs:// protocol, you must configure the MapR cluster to access the HDFS cluster. To do this, complete the steps listed in Configure a MapR Cluster to Access an HDFS Cluster for the security scenario that best describes your HDFS and MapR clusters, and then complete the steps listed in Verifying Access to an HDFS Cluster.

If the MapR cluster can read the contents of the file, run the distcp command to copy the data from the HDFS cluster to the MapR cluster:

hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> maprfs://<MapR-FS path>
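For example, with concrete values filled in (the hostnames and paths here are hypothetical):

hadoop distcp hdfs://hdfs-nn1:8020/user/data maprfs:///migrated/data    #8020 is the common HDFS NameNode port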

 

Advice Before Installing and Configuring MapR

  • You should use RedHat 8.
  • The root password will be reset for direct access.
  • You should use at least 2 servers.
  • You should install the EPEL repo.
  • You need direct Internet access, or you must download the related MapR packages in advance.
  • Minimum resources: 16 GB RAM (64 GB recommended) and 8 CPUs (16 recommended).
  • You need generously sized disks (64 GB /tmp, 128 GB /opt, 32 GB /home) as LVM.
  • You need at least 3 raw disks of at least 15 GB each (not formatted). Do not use RAID or LVM (Logical Volume Manager) on these disks.
  • All IPs and hostnames should be defined in DNS.
  • SELinux should be disabled.
  • The firewall should be stopped and disabled.
  • You should install and configure ntp/chronyd.
  • You should install and configure Java JDK 8.
  • You should permit port 9443 (MapR installation web interface) and port 8443 (MapR services web interface).
  • You should configure passwordless SSH between the nodes.
  • You should set vm.swappiness to 1 and transparent_hugepage to never.
  • You should stop and disable nfs, nfs-server, and nfs-lock.
  • You should set the limits.conf values (soft nofile, hard nofile, soft nproc, hard nproc) to 64000.
  • You should set ulimit -n 64000 in bash_profile.
  • You should set umask 0022.
  • You should set PermitRootLogin to yes in sshd_config.
  • You should set vm.overcommit_memory to 0 in sysctl.conf.
  • You should configure /etc/pam.d/su for the mapr user.
  • You should configure resolv.conf for the network.
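Several of these settings can be sanity-checked at once with a short script (a sketch; paths assume the RHEL 8 setup used in this guide):

#!/bin/bash
#Quick prerequisite check for a MapR node
echo "SELinux:    $(getenforce)"                                        #expect Disabled
echo "Firewalld:  $(systemctl is-active firewalld)"                     #expect inactive
echo "Swappiness: $(cat /proc/sys/vm/swappiness)"                       #expect 1
echo "THP:        $(cat /sys/kernel/mm/transparent_hugepage/enabled)"   #expect [never]
echo "Open files: $(ulimit -n)"                                         #expect 64000
echo "Java:       $(java -version 2>&1 | head -1)"                      #expect version 1.8.x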

 

Architectural Advice

  • CLDB and ZooKeeper: on small clusters, you may need to run them on the same node.
  • CLDB and ZooKeeper: on medium clusters, assign them to separate nodes.
  • CLDB and ZooKeeper: on large clusters, put them on separate, dedicated control nodes.
  • ResourceManager and ZooKeeper: avoid running them together.
  • ResourceManager: with more than 200 nodes, run it on dedicated nodes.
  • Large clusters: avoid running MySQL Server or the webserver on a CLDB node.

 

 

Configure Everything for Installation

1-) You should install and configure the repositories.

yum repolist all | grep enabled

 

 

 

yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

dnf install http://mirror.centos.org/centos/8/BaseOS/x86_64/os/Packages/centos-linux-repos-8-2.el8.noarch.rpm http://mirror.centos.org/centos/8/BaseOS/x86_64/os/Packages/centos-gpg-keys-8-2.el8.noarch.rpm

yum install https://dev.mysql.com/get/mysql80-community-release-el8-1.noarch.rpm

Etc…

 

2-) Example /opt disk configure;

fdisk -l | grep '^Disk'

#Disk /dev/sde: 130 GiB, 139586437120 bytes, 272629760 sectors

 

fdisk /dev/sde

n p 1 (Enter) (Enter) w    #new primary partition 1, accept the default sectors, then write

mkfs.ext4 /dev/sde1

 

mkdir -p /opt

mount /dev/sde1 /opt

df -Ph | grep opt

 

vi /etc/fstab

/dev/sde1               /opt           ext4    defaults        1 2
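To make the fstab entry robust against device renaming across reboots, you can reference the filesystem UUID instead of the device path (the UUID below is a placeholder; take the real one from blkid):

blkid /dev/sde1

#/dev/sde1: UUID="1c24b1f5-..." TYPE="ext4"    <- placeholder output

#then in /etc/fstab:

#UUID=1c24b1f5-...      /opt           ext4    defaults        1 2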

 

Other disks should be as below.

64G   /tmp

128G  /opt

32G   /home

 

3-) Raw disks

fdisk -l

MapR disks (added as raw, unformatted devices):

/dev/sdb: 15 GiB

/dev/sdc: 15 GiB

/dev/sdd: 15 GiB
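To confirm these disks are truly raw (no partition table or filesystem signature), a quick check:

lsblk -f /dev/sdb /dev/sdc /dev/sdd    #the FSTYPE column should be empty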

 

4-) Hosts file for the cluster. (We use just nosql2.)

cat /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.1.22 nosql1.localdomain nosql1

192.168.1.27 nosql2.localdomain nosql2

 

5-) It should be added to DNS

We used just one node: nosql2.

 

If there are multiple nodes, you must add them to /etc/hosts:

192.168.1.22 nosql1.localdomain nosql1

192.168.1.27 nosql2.localdomain nosql2

 

#If you use a virtual test machine

#C:\Windows\System32\drivers\etc\hosts

#192.168.1.22 nosql1.localdomain

#192.168.1.27 nosql2.localdomain

#192.168.1.61 mapr.gencali.com

#

#vi /etc/hosts

#192.168.1.61 mapr.gencali.com

 

6-1) SELinux should be disabled

cat /etc/selinux/config | grep SELINUX

SELINUX=disabled

 

getenforce

#If the output is not Disabled, apply the steps below; coordinate with İrem Hanım for the reboot.

vi /etc/selinux/config

SELINUX=disabled

sestatus

getenforce

shutdown -r now

getenforce

sestatus

 

 

6-2) Firewall should be stopped

systemctl stop firewalld

systemctl disable firewalld

systemctl stop iptables

systemctl disable iptables

 

7-) Install and configure chrony

yum -y install chrony

vi /etc/chrony.conf

server 0.tr.pool.ntp.org

server 1.tr.pool.ntp.org

server 2.tr.pool.ntp.org

server 3.tr.pool.ntp.org

 

systemctl restart chronyd

chronyc sources

date

 

8-) You should install and configure Java

https://www.oracle.com/tr/java/technologies/javase/javase-jdk8-downloads.html

Download Oracle Java JDK/JRE 8 and configure it on the servers:

 

mkdir /usr/java

cd  /usr/java

tar -xvzf /tmp/jdk-8u281-linux-x64.tar.gz

chmod 755 /usr/java

 

cd  /usr/java

mv latest latest_old

ln -s /usr/java/jdk1.8.0_281 latest

ls -lrt

#default -> /usr/java/latest

#latest -> /usr/java/jdk1.8.0_281

 

java -version

 

vi /etc/profile

export JAVA_HOME=/usr/java/latest

export PATH=$JAVA_HOME/bin:$PATH

 

 

source /etc/profile

echo $JAVA_HOME

 

sudo update-alternatives --list

sudo update-alternatives --config java

sudo update-alternatives --config javac

 

9-) umask should be configured

vi /etc/profile

umask 0022

 

 

10-) SSH connectivity should be configured between nodes

We used just one node for the installation.

If there are multiple nodes, follow the steps below:

  1. ssh-keygen -t rsa (press Enter at each prompt)
  2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  3. chmod og-wx ~/.ssh/authorized_keys

 

#The following copies each node's public key into the other node's authorized_keys; you can also do it manually with vi

cat /root/.ssh/id_rsa.pub | ssh root@nosql2 'cat >> /root/.ssh/authorized_keys'

cat /root/.ssh/id_rsa.pub | ssh root@nosql1 'cat >> /root/.ssh/authorized_keys'

 

#You should check the keys

cat /root/.ssh/id_rsa.pub

cat /root/.ssh/authorized_keys

 

# You should test connectivity to each node

ssh root@localhost

ssh root@nosql1

ssh root@nosql2

 

11-) Swappiness and transparent hugepages should be configured

echo never > /sys/kernel/mm/transparent_hugepage/enabled

echo never > /sys/kernel/mm/transparent_hugepage/defrag

 

sysctl vm.swappiness=1

cat /proc/sys/vm/swappiness
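Note that the echo and sysctl commands above do not survive a reboot. One way to persist them (a sketch; systems managed by tuned or kernel boot parameters may differ):

echo "vm.swappiness=1" >> /etc/sysctl.conf

cat >> /etc/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF

chmod +x /etc/rc.local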

 

12-) NFS should be stopped

https://docs.datafabric.hpe.com/62/AdministratorGuide/c_POSIX_loopbacknfs_client.html

 

systemctl stop nfs

systemctl disable nfs-server

systemctl disable nfs-lock

systemctl stop nfs-server

systemctl stop nfs-lock

 

#Check (the mapr-loopbacknfs service exists once MapR has been installed)

service mapr-loopbacknfs status

#Active: active (running)

 

13-) Ulimit in bash_profile

vim ~/.bash_profile

ulimit -n 64000

 

14-) tcp_retries2 should be configured (reduce the default 15 to 5)

cat /proc/sys/net/ipv4/tcp_retries2

echo "net.ipv4.tcp_retries2=5" >> /etc/sysctl.conf

sysctl -p

 

15-) Configure limits.conf

vi /etc/security/limits.conf

*               soft     nofile          64000

*               hard     nofile          64000

*               soft     nproc           64000

*               hard     nproc           64000

 

16-) Create home directories

MapR home:

mkdir -p /opt/mapr

chmod -R 777 /opt/mapr

Elasticsearch home:

mkdir -p /opt/mapr/es_db

chmod -R 777 /opt/mapr/es_db

 

17-) Configure sysctl.conf

vi /etc/sysctl.conf

vm.overcommit_memory=0

sysctl -p

 

18-) sshd_config

vi /etc/ssh/sshd_config

PermitRootLogin yes

systemctl restart sshd

 

19-) Configure PAM

cp /etc/pam.d/su /etc/pam.d/su_bck

> /etc/pam.d/su

vi /etc/pam.d/su

#Check that the /etc/pam.d/su file contains the following settings:

#%PAM-1.0

# Uncomment the following line to implicitly trust users in the "wheel" group.

#auth           sufficient      pam_wheel.so trust use_uid

# Uncomment the following line to require a user to be in the "wheel" group.

#auth           required        pam_wheel.so use_uid

auth            sufficient      pam_rootok.so

auth            include         system-auth

account         sufficient      pam_succeed_if.so uid = 0 use_uid quiet

account         include         system-auth

password        include         system-auth

session         include         system-auth

session         required        pam_limits.so

session         optional        pam_xauth.so

 

20-) Sample resolv.conf configuration

cat /etc/resolv.conf

# Generated by NetworkManager

search home localdomain

nameserver 192.168.1.1
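To verify that DNS resolution works for the cluster nodes (hostnames and IPs from this guide):

getent hosts nosql1 nosql2    #forward lookup

nslookup 192.168.1.27    #reverse lookup of nosql2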

 

 

 

Prerequisites

Prerequisites must be satisfied on all nodes.

The installer must be run on all nodes. (Otherwise, it throws an error for the second node because the mapr user has not been created.)

 

Ref:

https://docs.datafabric.hpe.com/61/MapRInstaller.html

 

Download mapr-setup.sh:

https://mapr.com/download/

 

Preparing

You install mapr-setup.sh on all nodes, but we configure the installation from just the first node for all nodes.

mkdir -p /tmp/mapr/

chmod -R 777 /tmp/mapr/

wget https://package.mapr.com/releases/installer/mapr-setup.sh -P /tmp/mapr/

chmod +x /tmp/mapr/mapr-setup.sh

sudo bash /tmp/mapr/mapr-setup.sh

Connect with a browser:

https://<Installer Node hostname/IPaddress>:9443

 

 

https://nosql2.localdomain:9443

Default login: mapr / mapr

 

You can change the prerequisite checks, but this is not advised.

vim /opt/mapr/installer/ansible/playbooks/library/prereq/

Example: prereq_check_ram.py

 

You can set new environment variables in:

/opt/mapr/conf/env_override.sh

 

Installation web interface:

https://nosql2.localdomain:9443

 

 

Installer settings:

Cluster=mapr.gencali.com

Version 6.2.0

MEP 7.0.1

 

MapR Installation on the Admin Console

 

1-) MapR Installation

 

 

2.1-) Version & Services

You can choose to install the Community or Enterprise option.

You can choose to install custom services and versions.

You can choose the license option "after installation" or "login MapR HPE".

 

 

2.2-) You can choose to install custom services and versions.

 

 

3-) Database Setup

If you choose to install Hue, Hive, or Oozie, you will install MySQL and choose the related users.

 

 

4-) Monitoring

Grafana will be installed.

 

 

 

 

 

5-) Set Up Cluster

The mapr user was created by mapr-setup.sh.

The cluster name must be defined in DNS.

 

 

 

 

6-) Node configuration.

Nodes can be entered one per line. At least 2 servers are advised. Nodes must be defined in DNS.

Disks must be raw and separated by commas. At least 3 raw disks are advised.

The SSH user should be mapr. You can use root too.

All the data-transfer NICs should be the same speed; if you mix speeds, all the NICs will operate at the lowest speed. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. For more information, see Designating NICs for MapR.
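For example, to pin MapR to the 192.168.1.0/24 subnet used in this guide, you could set MAPR_SUBNETS in the env_override.sh file mentioned earlier (a sketch, not a required step):

#/opt/mapr/conf/env_override.sh
export MAPR_SUBNETS=192.168.1.0/24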

 

 

 

 

7.1-) Verify Pre-checks for nodes.

All nodes will be checked to verify that they satisfy the minimum requirements.

In-progress status codes:

White – verification in progress

Green – ready for installation

Yellow – warnings, but installation can proceed

Red – node cannot be part of the cluster

 

 

 

 

7.2-) Critical and failed statuses should be fixed. Warnings can be passed over, but it is helpful to fix them later.

 

8-) Progress Confirmation

 

 

 

9.1-) Configuration Service Layout

 

 

9.2-) Node Layout

 

 

 

9.3-) Advanced Component Configuration

 

 

 

10-) Licensing

 

 

 

11-) Installation Step

 

 

 

12-) Installation Complete

 

 

 

 

13-) MapR Login

 

 

14-) License Accept

 

 

15-)  Overview

 

 

16-) Overview

 

 

17-) Services interface

 

 

18-) Node interface

 

 

19-) Volume interface

 

 

20-) Tables interface

 

 

21-) Cluster setting interface

 

 

 

22-) Running processes after installation

 

 

 

 

23-) Hadoop all applications

 

 

 

24-) All components are listed below.

Links to UI Pages

Drill: http://nosql2.localdomain:8047
Grafana: http://nosql2.localdomain:3000
HBase Master: http://nosql2.localdomain:16010
History Server: http://nosql2.localdomain:19888
Hue: http://nosql2.localdomain:8888
Impala Catalog: http://nosql2.localdomain:25020
Impala Server: http://nosql2.localdomain:25000
Impala Statestore: http://nosql2.localdomain:25010
Kibana: http://nosql2.localdomain:5601
Spark History Server: http://nosql2.localdomain:18080
Spark Thrift Server: http://nosql2.localdomain:4040
Webserver: https://nosql2.localdomain:8443
YARN Node Manager: http://nosql2.localdomain:8042
YARN Resource Manager: http://nosql2.localdomain:8088

API Services

Apiserver: https://nosql2.localdomain:8443
Drill: nosql2.localdomain:31010
Elasticsearch: nosql2.localdomain:9300
HBase Master: nosql2.localdomain:16000
HBase Region Server: nosql2.localdomain:16030
HBase REST: nosql2.localdomain:8080
HBase Thrift: nosql2.localdomain:9090
Hive Metastore: nosql2.localdomain:9083
Hive Server 2: nosql2.localdomain:10000
Hive WebHCat: nosql2.localdomain:50111
HTTPFS: nosql2.localdomain:14000
Impala Server: nosql2.localdomain:21050
Apache Kafka REST API: nosql2.localdomain:8082
Mastgateway: nosql2.localdomain:8660
YARN Node Manager: nosql2.localdomain:8041
Oozie: nosql2.localdomain:11000
OpenTSDB: nosql2.localdomain:4242
YARN Resource Manager: nosql2.localdomain:8033
Spark Thrift Server: nosql2.localdomain:2304
Zookeeper: nosql2.localdomain:5181
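To quickly confirm from the node that a service is listening on its port (ports taken from the lists above):

ss -tlnp | grep -E ':8443|:8047|:5181'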

 

 

Licensing Process

You can add a license after installation.

Click here to go to HPE Ezmeral Data Fabric Control System

https://nosql2.localdomain:8443/

 

 

By doing so, you agree by default to the community license agreement:

https://mapr.com/legal/eula/

1-) Install the license.

Log in to the "HPE Ezmeral Data Fabric Control System" and accept the community license. (One license activation per account.)

2-) Add the license after the installation completes.

https://webservices_server:8443/

(Upload or copy-paste the licenses and click Apply Licenses. Then you should restart all services.)
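After applying the license, you can verify it from the command line as well (assuming maprcli is on the PATH):

maprcli license list

maprcli license apps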

 

Free software criteria:

https://mapr.com/legal/eula/

 

For the Community Edition:

There can be one activation per account.

 

https://community.datafabric.hpe.com/s/question/0D50L00006BIskHSAT/how-to-add-license-to-the-cluster

https://docs.datafabric.hpe.com/62/ClusterAdministration/admin/cluster/AddLicense.html#ManagingLicenses-Toaddali_26982459-d3e102check

 

License agreement, enterprise:

https://docs.datafabric.hpe.com/62/additional-license-authorization.html

 

License agreement, community:

https://mapr.com/download/

The MapR Data Platform – Community Edition* is available for free per restrictions specified in the MapR End User License Agreement (EULA).

https://mapr.com/legal/eula/

 

Local Installation

If you cannot download from the MapR repo, you can proceed as below.

https://docs.datafabric.hpe.com/61/AdvancedInstallation/c_local_repo_install.html

Currently, you can download and install the Ezmeral Data Fabric install packages from:

http://package.mapr.com/

 

Known Issues

Known issues are listed below if you need them:

https://docs.datafabric.hpe.com/62/MapRInstallerReleaseNotes/mapr_installer_known_issues.html

https://community.datafabric.hpe.com/s/question/0D50L00006d2sfRSAQ/i-just-installed-mapr-and-rebooted-now-i-can-no-longer-access-any-of-the-services

 

MapR Forum

You can search for or post problems on the forum.

https://community.datafabric.hpe.com/

 

References

https://docs.datafabric.hpe.com/

https://www.altoros.com/research-papers/hadoop-distributions-cloudera-vs-hortonworks-vs-mapr/

 

 

 

 
