Subjects
- Big Data Management
- What is Hadoop?
- Why is Hadoop important?
- What is the MapR Hadoop Distribution?
- Why should you choose the MapR Hadoop distribution?
- What are Services?
- Steps to Deploy a MapR Cluster
- MapR Components
- MapR Enterprise Edition or Community Edition
- Migrating HDFS to MapR-FS
- Advice before installing and configuring MapR
- Architectural Advice
- Configure everything for installation
- Prerequisites
- Preparing
- MapR Installation on the Admin Console
- Licensing Process
- Local installation
- Known Issues
- MapR Forum
- References
Big Data Management
There are a few popular Hadoop distributions in the market such as Cloudera, MapR, and Hortonworks.
What is Hadoop?
Apache Hadoop is an open source framework for efficiently storing and processing large datasets ranging in size from gigabytes to petabytes of data. Instead of using a single large computer to store and process data, Hadoop allows multiple computers to be clustered together to analyze large datasets more quickly in parallel.
Hadoop consists of four main modules:
HDFS, YARN, MapReduce, Hadoop Common.
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. This is an important consideration, especially with the ever-increasing volumes and variety of data from social media and the Internet of Things (IoT).
- Computing power. Hadoop’s distributed computing model processes big data quickly.
- Fault tolerance. If a node goes down, jobs are automatically routed to other nodes to ensure distributed computation doesn’t fail. Multiple copies of all data are automatically stored.
- Flexibility. You can store as much data as you want and then decide how to use it.
- Low cost. The open source framework is free and uses commodity hardware to store large amounts of data.
- Scalability. You can easily grow your system to handle more data by simply adding nodes.
What is the MapR Hadoop Distribution?
The MapR Hadoop distribution is 100% plug-and-play compatible with Apache Hadoop, so you don’t need to make any changes to run your applications on a MapR cluster. MapR Hadoop automatically configures compression and memory settings, task heap sizes, and local volumes for mixed data.
The MapR Hadoop distribution is built on the idea that a market-driven company can respond to market needs faster.
Unlike Cloudera and Hortonworks, the MapR distribution takes a more distributed approach to storing metadata on the processing nodes, because it depends on a different file system, the MapR File System (MapR-FS), and has no NameNode architecture. The MapR distribution is also not layered on top of the Linux file system.
MapR is considered one of the fastest Hadoop distributions.
MapR is also the only Hadoop distribution that includes Pig, Hive, and Sqoop without Java dependencies, since it relies on MapR-FS.
Why should you choose the MapR Hadoop distribution?
Although MapR is still third in terms of number of installations, it is one of the easiest and fastest Hadoop distributions compared to the others.
MapR provides its own distributed file system and comes in two models: Community Edition and Enterprise Edition.
Hadoop benchmark:
For more, read:
https://www.altoros.com/research-papers/hadoop-distributions-cloudera-vs-hortonworks-vs-mapr/
What are Services?
A MapR cluster is a full Hadoop distribution. Hadoop itself consists of a storage layer and a MapReduce layer. In addition, MapR provides cluster management tools, data access via NFS, and a few behind-the-scenes services that keep everything running. Some applications and Hadoop components, such as HBase, are implemented as services; others, such as Pig, are simply applications that you run as needed. We will lump them together here, but the distinction is worth making.
- MapReduce services: JobTracker, TaskTracker
- Storage services: CLDB, FileServer, HBase RegionServer, NFS
- Management services: HBase Master, Webserver, ZooKeeper
A daemon called the warden runs on every node to make sure that the proper services are running (and to allocate resources for them). The only service that the warden doesn’t control is the ZooKeeper. Part of the ZooKeeper’s job is to have knowledge of the whole cluster; in the event that a service fails on one node, it is the ZooKeeper that tells the warden to start the service on another node.
MapR Direct Access NFS offers usability and interoperability benefits and makes big data easier and cheaper to handle.
MapR allows for files to be modified and overwritten at high speeds in real time from remote servers via an NFS connection and provides multiple simultaneous reads and writes on any file.
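As a minimal sketch of Direct Access NFS usage (the gateway host nosql2 and the default /mapr export are assumptions; adjust for your cluster):
#Mount MapR-FS on a client machine through the NFS gateway
mkdir -p /mapr
mount -o hard,nolock nosql2:/mapr /mapr
ls /mapr    #standard Linux tools can now read and write cluster data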
MapR File System does not have a NameNode.
MapR supports high availability, real-time streaming, easy data integration, and true multi-tenancy in YARN.
Steps to Deploy a MapR Cluster
On very small clusters of just a few nodes, it’s impractical to isolate services on dedicated nodes. One layout approach is to run one CLDB and one ZooKeeper on the same node, leaving the other nodes free to run the TaskTracker. All nodes should run the FileServer. If you need HA in a small cluster, you will end up running the CLDB and ZooKeeper on additional nodes. Here is a sample layout:
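An illustrative three-node layout (a sketch only, not the layout from the tutorials linked below):
Node 1: ZooKeeper, CLDB, FileServer, JobTracker, Webserver
Node 2: FileServer, TaskTracker, NFS
Node 3: FileServer, TaskTracker, NFS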
https://mapr.com/developer-portal/mapr-tutorials/steps-deploy-mapr-cluster-part-1-2/
https://mapr.com/developer-portal/mapr-tutorials/steps-deploy-mapr-cluster-part-2-2/
MapR Components
Hue, Impala, Webserver, Drill, Elasticsearch, HBase, Hive, HTTPFS, Kafka, Oozie, OpenTSDB, YARN, Spark, ZooKeeper, Flume, Object Store, Pig, Sqoop, Tez
MapR Enterprise Edition or Community Edition
MapR comes in two models: the Community Edition and the Enterprise Edition.
If you will use it in production, you should buy MapR Enterprise.
Migrating HDFS to MapR-FS
Before you copy data from an HDFS cluster to a MapR cluster using the hdfs:// protocol, you must configure the MapR cluster to access the HDFS cluster. To do this, complete the steps listed in Configure a MapR Cluster to Access an HDFS Cluster for the security scenario that best describes your HDFS and MapR clusters, and then complete the steps listed in Verifying Access to an HDFS Cluster.
If the MapR cluster can read the contents of the file, run the distcp command to copy the data from the HDFS cluster to the MapR cluster:
hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> maprfs://<MapR-FS path>
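For example, with a hypothetical NameNode host, port, and paths:
hadoop distcp hdfs://namenode01:8020/user/data/sales maprfs:///migration/sales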
Advice before installing and configuring MapR
You should use Red Hat 8.
The root password will be reset for direct access.
You should use at least 2 servers.
You should install the EPEL repository.
You need direct Internet access, or you must download the related MapR packages in advance.
Minimum resources: 16 GB RAM (64 GB advised) and 8 CPUs (16 CPUs advised).
You need generous disk sizes (64 GB /tmp, 128 GB /opt, 32 GB /home) as LVM.
You need at least 3 raw disks of at least 15 GB each (not formatted). Do not use RAID. Do not use LVM (Logical Volume Manager) on these disks.
All IPs and hostnames should be defined in DNS.
SELinux should be disabled.
The firewall should be stopped and disabled.
You should install and configure NTP/chronyd.
You should install and configure Java JDK 8.
You should permit port 9443 (MapR installation web interface) and port 8443 (MapR services web interface).
You should configure passwordless SSH between nodes.
You should set swappiness to 1 and transparent_hugepage to never.
You should stop and disable nfs, nfs-server, and nfs-lock.
You should set soft nofile, hard nofile, soft nproc, and hard nproc to 64000 in limits.conf.
You should set ulimit -n 64000 in .bash_profile.
You should set umask to 0022.
Set PermitRootLogin to yes in sshd_config.
You should set vm.overcommit_memory to 0 in sysctl.conf.
You should configure /etc/pam.d/su for the mapr user.
You should configure resolv.conf for the network.
A quick pre-flight check covering a subset of these items is sketched below.
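A minimal pre-flight check sketch (assumes RHEL 8 paths; illustrative only):
#Quick sanity checks before running the MapR installer
getenforce                                         #expect Disabled
systemctl is-active firewalld                      #expect inactive
cat /proc/sys/vm/swappiness                        #expect 1
cat /sys/kernel/mm/transparent_hugepage/enabled    #expect [never]
ulimit -n                                          #expect 64000
java -version                                      #expect 1.8.x
lsblk -dn -o NAME,SIZE,TYPE                        #raw data disks present and unformatted
chronyc sources                                    #time synchronization working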
Architectural Advice
CLDB and ZooKeeper: on small clusters, you may need to run CLDB and ZooKeeper on the same node.
CLDB and ZooKeeper: on medium clusters, assign them to separate nodes.
CLDB and ZooKeeper: on large clusters, put them on separate, dedicated control nodes.
ResourceManager and ZooKeeper: avoid running ResourceManager and ZooKeeper together.
ResourceManager: with more than 200 nodes, run ResourceManager on dedicated nodes.
Large clusters: avoid running MySQL Server or the webserver on a CLDB node.
Configure everything for installation
1-) You should install and configure the required repositories.
yum repolist all | grep enabled
yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
dnf install http://mirror.centos.org/centos/8/BaseOS/x86_64/os/Packages/centos-linux-repos-8-2.el8.noarch.rpm http://mirror.centos.org/centos/8/BaseOS/x86_64/os/Packages/centos-gpg-keys-8-2.el8.noarch.rpm
yum install https://dev.mysql.com/get/mysql80-community-release-el8-1.noarch.rpm
Etc…
2-) Example /opt disk configuration:
fdisk -l | grep '^Disk'
#Disk /dev/sde: 130 GiB, 139586437120 bytes, 272629760 sectors
fdisk /dev/sde
n p 1 (Enter) (Enter) w   #new primary partition using the whole disk, then write
mkfs.ext4 /dev/sde1
mkdir /opt/
mount /dev/sde1 /opt
df -Ph | grep opt
vi /etc/fstab
/dev/sde1 /opt ext4 defaults 1 2
Other disks should be as below.
64G /tmp
128G /opt
32G /home
3-) Raw disks
fdisk -l
MapR disks (added as raw, unformatted devices):
/dev/sdb: 15 GiB
/dev/sdc: 15 GiB
/dev/sdd: 15 GiB
4-) Hosts file for the cluster. (We use only nosql2.)
cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.22 nosql1.localdomain nosql1
192.168.1.27 nosql2.localdomain nosql2
5-) The hostnames should be added to DNS.
We used just one node: nosql2.
If there are multiple nodes, you must add them all to /etc/hosts:
192.168.1.22 nosql1.localdomain nosql1
192.168.1.27 nosql2.localdomain nosql2
#If you use a virtual test machine
#C:\Windows\System32\drivers\etc\hosts
#192.168.1.22 nosql1.localdomain
#192.168.1.27 nosql2.localdomain
#192.168.1.61 mapr.gencali.com
#
#vi /etc/hosts
#192.168.1.61 mapr.gencali.com
6-1) SELinux should be disabled
cat /etc/selinux/config | grep SELINUX
SELINUX=disabled
getenforce
#If the output is not Disabled, apply the steps below; coordinate with İrem Hanım for the reboot.
----!!!!!!!!!!!!!
vi /etc/selinux/config
SELINUX=disabled
sestatus
getenforce
shutdown -r now
getenforce
sestatus
----!!!!!!!!!!!!!
6-2) Firewall should be stopped
systemctl stop firewalld
systemctl disable firewalld
systemctl stop iptables
systemctl disable iptables
7-) chrony install and configure
yum -y install chrony
vi /etc/chrony.conf
server 0.tr.pool.ntp.org
server 1.tr.pool.ntp.org
server 2.tr.pool.ntp.org
server 3.tr.pool.ntp.org
systemctl restart chronyd
chronyc sources
date
8-) You should install and configure Java
https://www.oracle.com/tr/java/technologies/javase/javase-jdk8-downloads.html
Download Oracle Java JDK/JRE 8 and configure it on the servers:
mkdir /usr/java
cd /usr/java
tar -xvzf /tmp/jdk-8u281-linux-x64.tar.gz
chmod 755 /usr/java
cd /usr/java
mv latest latest_old
ln -s /usr/java/jdk1.8.0_281 latest
ls -lrt
#default -> /usr/java/latest
#latest -> /usr/java/jdk1.8.0_281
java -version
vi /etc/profile
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
echo $JAVA_HOME
sudo update-alternatives --list
sudo update-alternatives --config java
sudo update-alternatives --config javac
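If the manually extracted JDK is not listed by update-alternatives, it can be registered first (the priority value 2 is arbitrary; the /usr/java/latest symlink created above is assumed):
sudo update-alternatives --install /usr/bin/java java /usr/java/latest/bin/java 2
sudo update-alternatives --install /usr/bin/javac javac /usr/java/latest/bin/javac 2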
9-) umask should be configured
vi /etc/profile
umask 0022
10-) SSH connectivity should be configured between nodes.
We used just one node for the installation.
If there are multiple nodes, follow the steps below:
- ssh-keygen -t rsa
Press Enter at each prompt.
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- chmod og-wx ~/.ssh/authorized_keys
#Copy the contents of /root/.ssh/id_rsa.pub on each node into /root/.ssh/authorized_keys on the other node, either manually with vi or with ssh-copy-id:
ssh-copy-id root@nosql2
ssh-copy-id root@nosql1
#Verify the keys:
cat /root/.ssh/id_rsa.pub
cat /root/.ssh/authorized_keys
#Test SSH to each node:
ssh root@localhost
ssh root@nosql1
ssh root@nosql2
11-) Swappiness and transparent huge pages should be configured
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
sysctl vm.swappiness=1
cat /proc/sys/vm/swappiness
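The commands above do not survive a reboot. One way to persist them (a sketch; assumes /etc/sysctl.conf and the rc-local mechanism are acceptable on your systems):
echo "vm.swappiness=1" >> /etc/sysctl.conf
sysctl -p
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.d/rc.local
echo 'echo never > /sys/kernel/mm/transparent_hugepage/defrag' >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local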
12-) NFS should be stopped
https://docs.datafabric.hpe.com/62/AdministratorGuide/c_POSIX_loopbacknfs_client.html
systemctl stop nfs
systemctl disable nfs-server
systemctl disable nfs-lock
systemctl stop nfs-server
systemctl stop nfs-lock
#Control
service mapr-loopbacknfs status
#Active: active (running)
13-) Ulimit in bash_profile
vim ~/.bash_profile
ulimit -n 64000
14-) Tcp_retries2 configure
Change the value from the default of 15 to 5.
cat /proc/sys/net/ipv4/tcp_retries2
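One way to apply and persist this (assuming /etc/sysctl.conf is used for persistence):
sysctl -w net.ipv4.tcp_retries2=5
echo "net.ipv4.tcp_retries2=5" >> /etc/sysctl.conf
cat /proc/sys/net/ipv4/tcp_retries2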
15-) Limits.conf configure
vi /etc/security/limits.conf
* soft nofile 64000
* hard nofile 64000
* soft nproc 64000
* hard nproc 64000
16-) Create home directories
Create the MapR home directory:
mkdir -p /opt/mapr
chmod -R 777 /opt/mapr
Create the Elasticsearch home directory:
mkdir -p /opt/mapr/es_db
chmod -R 777 /opt/mapr/es_db
17-) Sysctl.conf configure
vi /etc/sysctl.conf
vm.overcommit_memory=0
sysctl -p
18-) Sshd_config
cat /etc/ssh/sshd_config
PermitRootLogin yes
19-) Pam configure
cp /etc/pam.d/su /etc/pam.d/su_bck
> /etc/pam.d/su
vi /etc/pam.d/su
#Check that the /etc/pam.d/su file contains the following settings:
#%PAM-1.0
# Uncomment the following line to implicitly trust users in the “wheel” group.
#auth sufficient pam_wheel.so trust use_uid
# Uncomment the following line to require a user to be in the “wheel” group.
#auth required pam_wheel.so use_uid
auth sufficient pam_rootok.so
auth include system-auth
account sufficient pam_succeed_if.so uid = 0 use_uid quiet
account include system-auth
password include system-auth
session include system-auth
session required pam_limits.so
session optional pam_xauth.so
20-) Sample resolv.conf configure
cat /etc/resolv.conf
# Generated by NetworkManager
search home localdomain
nameserver 192.168.1.1
Prerequisites
The prerequisites must be met on all nodes.
mapr-setup.sh must be run on all nodes. (Otherwise, it throws an error for the second node because the mapr user has not been created.)
Ref:
https://docs.datafabric.hpe.com/61/MapRInstaller.html
Download mapr-setup.sh.
Preparing
Install mapr-setup.sh on all nodes, but configure the installation for all nodes from the first node only.
mkdir -p /tmp/mapr/
chmod -R 777 /tmp/mapr/
wget https://package.mapr.com/releases/installer/mapr-setup.sh -P /tmp/mapr/
chmod +x /tmp/mapr/mapr-setup.sh
sudo bash /tmp/mapr/mapr-setup.sh
Connect with a browser:
https://<Installer Node hostname/IP address>:9443
https://nosql2.localdomain:9443
Default credentials: mapr/mapr
You can change the prerequisite checks, but this is not advised:
vim /opt/mapr/installer/ansible/playbooks/library/prereq/
For example, prereq_check_ram.py.
You can also set environment overrides in:
/opt/mapr/conf/env_override.sh
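For example (illustrative content only; set whichever variables your environment actually needs):
#/opt/mapr/conf/env_override.sh - example override
export JAVA_HOME=/usr/java/latest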
Installation web interface:
https://nosql2.localdomain:9443
Installer
Cluster=mapr.gencali.com
Version 6.2.0
MEP 7.0.1
MapR Installation on the Admin Console
1-) MapR Installation
2.1-) Version & Services
You can choose to install the Community or Enterprise option.
You can choose to install custom services and versions.
You can choose the license option: “after installation” or “log in to MapR HPE”.
2.2-) You can choose to install custom services and versions.
3-) Database Setup
If you choose to install Hue, Hive, or Oozie, MySQL will be installed and you will choose the related users.
4-) Monitoring
Grafana will be installed.
5-) Set Up Cluster
The mapr user was created by mapr-setup.
The cluster name must be defined in DNS.
6-) Node configuration.
Nodes can be entered one per line. A minimum of 2 servers is advised. Nodes must be defined in DNS.
Disks must be raw and separated with commas. A minimum of 3 raw disks is advised.
The SSH user should be mapr; you can also use root.
All the data transfer NICs should be the same speed; if you use different speeds, all the NICs will operate at the lowest speed. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. For more information, see Designating NICs for MapR.
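A hedged sketch of restricting MapR to one subnet (the subnet value is illustrative; /opt/mapr/conf/env_override.sh is assumed as the place to set it):
#Limit MapR traffic to a single subnet (example value)
export MAPR_SUBNETS=192.168.1.0/24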
7.1-) Verify pre-checks for the nodes.
All nodes will be checked to confirm that they satisfy the minimum requirements.
Status codes during verification:
White – Verification in progress
Green – Ready for installation
Yellow – Warnings but can be installed
Red – Node cannot be part of cluster
7.2-) Critical and failed statuses should be fixed. Warnings can be passed over, but it is helpful to fix them later.
8-) Progress Confirmation
9.1-) Configuration Service Layout
9.2-) Node Layout
9.3-) Advanced Component Configuration
10-) Licensing
11-) Installation Step
12-) Installation Complete
13-) Mapr Login
14-) License Accept
15-) Overview
16-) Overview
17-) Services interface
18-) Node interface
19-) Volume interface
20-) Tables interface
21-) Cluster setting interface
22-) Running processes after installation
23-) Hadoop all applications
24-) All components are as below.
Links to UI Pages
Service Name - Browser URL
Drill - http://nosql2.localdomain:8047
Grafana - http://nosql2.localdomain:3000
HBase Master - http://nosql2.localdomain:16010
History Server - http://nosql2.localdomain:19888
Hue - http://nosql2.localdomain:8888
Impala Catalog - http://nosql2.localdomain:25020
Impala Server - http://nosql2.localdomain:25000
Impala Statestore - http://nosql2.localdomain:25010
Kibana - http://nosql2.localdomain:5601
Spark History Server - http://nosql2.localdomain:18080
Spark Thrift Server - http://nosql2.localdomain:4040
Webserver - https://nosql2.localdomain:8443
YARN Node Manager - http://nosql2.localdomain:8042
YARN Resource Manager - http://nosql2.localdomain:8088
API Services
Service Name - Service Ports
Apiserver - https://nosql2.localdomain:8443
Drill - nosql2.localdomain:31010
Elasticsearch - nosql2.localdomain:9300
HBase Master - nosql2.localdomain:16000
HBase Region Server - nosql2.localdomain:16030
HBase REST - nosql2.localdomain:8080
HBase Thrift - nosql2.localdomain:9090
Hive Metastore - nosql2.localdomain:9083
Hive Server 2 - nosql2.localdomain:10000
Hive WebHCat - nosql2.localdomain:50111
HTTPFS - nosql2.localdomain:14000
Impala Server - nosql2.localdomain:21050
Apache Kafka REST API - nosql2.localdomain:8082
Mastgateway - nosql2.localdomain:8660
YARN Node Manager - nosql2.localdomain:8041
Oozie - nosql2.localdomain:11000
OpenTSDB - nosql2.localdomain:4242
YARN Resource Manager - nosql2.localdomain:8033
Spark Thrift Server - nosql2.localdomain:2304
Zookeeper - nosql2.localdomain:5181
Licensing Process
You can add the license after installation.
Go to the HPE Ezmeral Data Fabric Control System:
https://nosql2.localdomain:8443/
By doing so, you default to agreeing to the community license agreement:
https://mapr.com/legal/eula/
1-) Install the license.
Log in to the HPE Ezmeral Data Fabric Control System and accept the community license. (One license activation per user account.)
2-) Add the license after the installation completes.
https://webservices_server:8443/
(Upload or copy-paste the licenses and click Apply Licenses. Then restart all services.)
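Alternatively (a sketch; the license file path is hypothetical), the license can be applied from the command line with maprcli on a cluster node:
#Apply and verify a license file from the command line
maprcli license add -license /tmp/mapr_license.txt -is_file true
maprcli license list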
Free software criteria:
https://mapr.com/legal/eula/
For the Community Edition:
There can be one activation per account.
https://community.datafabric.hpe.com/s/question/0D50L00006BIskHSAT/how-to-add-license-to-the-cluster
https://docs.datafabric.hpe.com/62/ClusterAdministration/admin/cluster/AddLicense.html#ManagingLicenses-Toaddali_26982459-d3e102check
Enterprise license agreement:
https://docs.datafabric.hpe.com/62/additional-license-authorization.html
Community license agreement:
https://mapr.com/download/
The MapR Data Platform – Community Edition* is available for free per restrictions specified in the MapR End User License Agreement (EULA).
https://mapr.com/legal/eula/
Local installation
If you cannot download from the MapR repository, you can proceed as below.
https://docs.datafabric.hpe.com/61/AdvancedInstallation/c_local_repo_install.html
Currently, you can download and install the Ezmeral Data Fabric install packages from
http://package.mapr.com/
Known Issues
Known issues are documented here, in case you need them:
https://docs.datafabric.hpe.com/62/MapRInstallerReleaseNotes/mapr_installer_known_issues.html
https://community.datafabric.hpe.com/s/question/0D50L00006d2sfRSAQ/i-just-installed-mapr-and-rebooted-now-i-can-no-longer-access-any-of-the-services
MapR Forum
You can search for or post problems on the forum.
https://community.datafabric.hpe.com/
References
https://docs.datafabric.hpe.com/
https://www.altoros.com/research-papers/hadoop-distributions-cloudera-vs-hortonworks-vs-mapr/