Basic components:
  ZooKeeper: distributed coordination framework.
    Number of nodes:
      Test cluster: 3
      Production cluster: 7 or so
      Small cluster: 3 or 5
      Medium cluster: 5 or 7
      Large cluster: more, always an odd number
  HDFS: storage for massive data.
  YARN: cluster resource management and resource scheduling.
  MapReduce: parallel computing framework. Core idea: divide and conquer.

Cluster sizing: estimate from peak load (e.g. the "Double Eleven" shopping peak), with headroom for handling downtime.
  Test cluster (sized from the test data volume per second): 5-10 machines.
    Machine configuration:
      Memory: 24G/32G or more in total.
        NameNode: 8G-12G, the more the better (NameNode metadata takes roughly 1G per million files).
        DataNode: 6G-8G.
        RegionServer (RS): 4G.
        A map task defaults to 1G of memory.
      Hard disk: 4T / 10T.
      CPU: 6 cores (i5/i7 class).
      Network card / cabling (for data transfer, reads and writes): gigabit or 10-gigabit.
  Production cluster: more than ten to twenty machines; 128G memory, 15T disk, 16 cores, 10-gigabit NIC (e.g. Inspur servers).
  Small cluster: fewer than 20 machines.
  Medium cluster: fewer than 50 machines.
  Large cluster: more than 50 machines.

Hadoop distributions:
  Apache.
  CDH (Cloudera's Distribution including Apache Hadoop): open source and free; support services are charged. Cloudera's releases support only 64-bit operating systems.
    Installation formats:
      Tar package.
      RPM package: http:// … /6/x86_64/cdh (per Linux distribution); companies such as JD compile their own RPM packages.
      Parcel package (the whole ecosystem compressed into one archive — the best option): available only since CDH 4.1.2 (2013); officially recommended by Cloudera for Cloudera Manager installations.
  HDP (Hortonworks Data Platform), from Hortonworks.
  Version lineage: Apache -> CDH, HDP.

Interview question: Apache vs. CDH — why choose CDH?
  CDH saves time and effort: it automatically detects hosts, selects versions, and has simple, almost foolproof one-click installation. MapR is slightly less convenient, but still far more convenient than Apache. However, the MapR distribution has a fatal drawback: instead of Hadoop's HDFS, it uses its own MapRFS.
This forces every system in the Hadoop ecosystem that touches the file system to use MapR's distribution, whose source code has been modified for MapRFS compatibility — you can see this on MapR's GitHub account. Apache, for its part, genuinely requires full-time operations staff, and managing several clusters requires configuration-management tooling; doing it all by hand is hopeless. As for the claim that CDH does not develop YARN much, that is not too worrying: CDH regularly releases versions tracking the latest stable Apache release, so you are never stuck far behind the newest version. CDH also now includes parcel management, so you can switch Hadoop versions simply and conveniently without reinstalling the cluster — a very tempting feature. One consideration: CDH starts charging for clusters of more than 50 nodes, so the free tier is effectively limited to 50 nodes.

CDH's versioning is very clear. There are only two series, CDH3 and CDH4, corresponding to Hadoop 1 and Hadoop 2 respectively; by comparison, Apache's version numbering is a mess.

CDH offers better compatibility, security, and stability than Apache:
  The CDH3 series is based on Apache Hadoop 0.20.2 with the latest patches merged in; CDH4 is based on Apache Hadoop 2.x. CDH always incorporates the latest bug fixes and feature patches, often shipping a fix earlier than Apache itself, and updates faster than the official Apache releases.
  CDH supports Kerberos security authentication; Apache uses simple username-matching authentication.
  CDH's documentation is clear.
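Returning to the ZooKeeper sizing note at the top: the odd node counts (3, 5, 7) follow from majority quorum. An ensemble of n nodes stays available only while a strict majority is alive, so it tolerates floor((n-1)/2) failures. A minimal sketch:

```shell
#!/bin/sh
# ZooKeeper quorum arithmetic: an ensemble of n nodes needs a strict
# majority to operate, so it tolerates floor((n - 1) / 2) failed nodes.
tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5 6 7; do
  echo "$n nodes -> tolerates $(tolerated "$n") failure(s)"
done
```

Note that 4 nodes tolerate no more failures than 3, and 6 no more than 5 — the extra even node buys nothing, which is why ensemble sizes are kept odd.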
Many users of the Apache version end up reading the documentation provided by CDH, including the installation and upgrade documents.

CDH supports four installation methods — Yum/Apt packages, tar packages, RPM packages, and Cloudera Manager — while Apache supports only tar package installation. The recommended Yum/Apt packages give CDH the following advantages:
  1. Networked installation and upgrade — very convenient.
  2. Automatic download of dependency packages.
  3. Automatic matching of Hadoop ecosystem packages: you do not need to hunt for the HBase, Flume, Hive, etc. versions that match your current Hadoop; Yum/Apt picks compatible package versions based on the installed Hadoop version and guarantees compatibility.
  4. Automatic creation of the related directories and symlinks in the appropriate places (such as conf and logs), and automatic creation of the hdfs and mapred users: hdfs is the highest-privileged user of HDFS, and mapred owns the permissions of the directories used during MapReduce execution.

Cluster environment preparation:
  Machines:
    Disk arrays: RAID0, RAID1, JBOD.
      RAID1: two disks mirror each other as one logical disk. With CentOS 6.4 installed, losing one disk does not affect the system, because each disk backs up the other; RAID 0+1 is more secure still.
      JBOD: a plain disk enclosure — the recommended way for DataNodes to store data, as a tuning choice.
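The JBOD layout described above is expressed in HDFS by listing one data directory per physical disk. A sketch of the relevant hdfs-site.xml fragment (Hadoop 2.x property name; the mount-point paths are the illustrative ones used in these notes):

```xml
<!-- hdfs-site.xml: sketch, assuming JBOD disks mounted at /dfs/data01..03 -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- one comma-separated entry per physical disk; the DataNode
       spreads new block replicas across these directories -->
  <value>/dfs/data01,/dfs/data02,/dfs/data03</value>
</property>
```

With no RAID layer underneath, each disk serves reads independently, which is where the read-speed benefit below comes from.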
Reads are fast. Each DataNode storage location can be a separately mounted disk, usually /dfs/data01, /dfs/data02, /dfs/data03.

  Operating system: RedHat 5/6 (6 recommended), CentOS 6.x 64-bit (version 6.4), or SLES 11. Check the Cloudera documentation to see which versions are supported.

  System setup:
    IP addresses: keep them in the same network segment and, as far as possible, on one switch (rack; the default rack is /default).
    Set the host names:,, (no underscores in host names).
      Change a host name:
        hostname bigdata-cdh02.ibeifeng.com
        vi /etc/sysconfig/network
          HOSTNAME=bigdata-cdh02.ibeifeng.com
    Disable IPv6 (all machines):
      sudo echo "alias net-pf-10 off" >> /etc/modprobe.d/dist.conf
      sudo echo "alias ipv6 off" >> /etc/modprobe.d/dist.conf
      tail -f /etc/modprobe.d/dist.conf
    IP and host name mapping (all machines), for BigData CDH 5.x:
      vi /etc/hosts
        172.16.200.11 bigdata-cdh01.ibeifeng.com bigdata-cdh01
        172.16.200.12 bigdata-cdh02.ibeifeng.com bigdata-cdh02
        172.16.200.13 bigdata-cdh03.ibeifeng.com bigdata-cdh03
      Under Windows, add the same entries to C:/Windows/System32/drivers/etc/hosts.
    Ordinary user (all machines; the user name must be identical across the cluster), used for installing software:
      adduser beifeng
      passwd beifeng        (e.g. 123456)
      su - beifeng          switch to the user; use sudo for convenience
    Configure sudo permission for the ordinary user (all machines):
      su
      Grant write permission: chmod u+w /etc/sudoers
      Add with vi /etc/sudoers:
        beifeng ALL=(root) NOPASSWD:ALL
      Revoke write permission: chmod u-w /etc/sudoers
    Close the firewall (all machines; the service name differs between OS versions):
      sudo service iptables stop
      Permanently: sudo chkconfig iptables off
      Check whether it is off: sudo chkconfig --list | grep iptables
        iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off   (all off)
      more /etc/inittab: the default runlevel is 5; use 3 to save memory.
    SELinux (all machines): disable it — without professional operations staff it is troublesome, rarely needed here, incompatible with some system services and drivers, and a security-sensitive component.
      sudo vi /etc/sysconfig/selinux
        SELINUX=disabled
    Uninstall the bundled JDK:
      Check the version: sudo rpm -qa | grep java
      Force the uninstall, because of package dependencies: sudo rpm -e --nodeps XXX
    Set the maximum number of open files and processes (all machines):
      View open files: ulimit -a
      View max user processes: ulimit -u
      Set them: sudo vi /etc/security/limits.conf
        * soft nofile 65535
        * hard nofile 65535
        * soft nproc 32000
        * hard nproc 32000
        (* means any user)
    Time synchronization:
      Configure NTP on CentOS. Why use ntpd instead of ntpdate? The reason is simple: ntpd adjusts the time gradually, while ntpdate jumps in one step. For example, if the server time is 9:18 and the standard time is 9:28, ntpd will slew the clock to the correct time over a period, while ntpdate immediately sets it to 9:28. A sudden jump can have serious consequences for databases or anything else with strict time requirements. (Note: when local time differs from standard time by more than 30 minutes, ntpd stops working.)
      Pick one cluster machine as the time server: bigdata-cdh01.ibeifeng.com.
        sudo rpm -qa | grep ntp
        sudo vim /etc/ntp.conf
          restrict … mask … nomodify notrap    (fill in the cluster's network segment and netmask)
          #server 2.centos.pool.ntp.org        (comment out the external servers)
          server 127.127.1.0                   (local clock)
          fudge 127.127.1.0 stratum 10
        sudo vi /etc/sysconfig/ntpd
          SYNC_HWCLOCK=yes
        sudo service ntpd status
        sudo service ntpd start
        sudo chkconfig ntpd on
        sudo chkconfig --list | grep ntpd
      Client synchronization (the remaining machines) — ZooKeeper in particular depends on closely synchronized clocks:
        Use a Linux crontab timed task:
          su
          crontab -l    list the timed tasks
          crontab -e    create/edit a timed task
        Sync the cluster time every 10 minutes:
          0-59/10 * * * * /usr/sbin/ntpdate bigdata-cdh01.ibeifeng.com
      sudo reboot
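The client-side sync step can be scripted across the remaining nodes. A minimal sketch, assuming the time server chosen above (bigdata-cdh01.ibeifeng.com); the cron_line helper name is illustrative:

```shell
#!/bin/sh
# Build the 10-minute time-sync crontab entry for a client node.
# The argument is the cluster's NTP server host name.
cron_line() { echo "*/10 * * * * /usr/sbin/ntpdate $1"; }

LINE="$(cron_line bigdata-cdh01.ibeifeng.com)"
echo "$LINE"
# To install it while preserving existing entries (run as root):
#   ( crontab -l 2>/dev/null; echo "$LINE" ) | crontab -
```

The `*/10` schedule is equivalent to the `0-59/10` form used in the notes above.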


Link of this Article: Large data platform cluster
