Article From:https://www.cnblogs.com/areyouready/p/9906485.html

I. Introduction to Linux

01. Brief introduction
    Linux is a free, open-source, UNIX-like operating system. Its kernel was created by Linus Torvalds and first released on 5 October 1991; adding user-space applications on top of the kernel produces a complete Linux operating system.
    Applications: a program written once can run on all kinds of hardware. Phones, tablets, and routers, for example, run Linux underneath (Android sits on top of the Linux kernel).

02. Linux classification and versions
    1-> By market demand there are basically two directions:
        1) Graphical-interface editions: focus on user experience but are not yet fully mature (graphics rendering makes performance slightly lower), e.g. Ubuntu (commonly used for Python work).
        2) Server editions: no polished interface; the system is driven by commands typed into a console window (higher performance), e.g. CentOS and RedHat (both can also install a graphical interface).
    2-> By degree of origin (secondary development):
        1) Kernel version: the system developed and maintained by the team led by Linus (the original version).
        2) Distributions: releases (jokingly, "pirated" versions) produced by organizations or companies through secondary development on top of the kernel version.

03. Commonly used distributions
    CentOS, Ubuntu, RedHat

04. Linux directory structure
    bin  : binary executables
    sbin : binary executables accessible only by root
    etc  : system configuration files
    usr  : shared system resources
    home : root directory for users' files
    root : the superuser's home directory
    dev  : device files
    lib  : shared libraries and kernel modules needed by programs in the base file system
    mnt  : mount point where the system administrator mounts temporary file systems
    boot : files used during boot
    tmp  : temporary files of all kinds
    var  : data files that change while the system runs

05. Common commands
    ll / ls       : list the files in the current directory
    cd /          : enter the root directory
    cd /usr/local : enter a subdirectory
    cd ..         : go up to the parent directory
    pwd           : print the current path
    cd -          : switch back to the previous directory

06. Firewall commands
    firewall-cmd --state        : show the CentOS 7 firewall state
    systemctl stop firewalld    : stop the firewall
    systemctl disable firewalld : prevent the firewall from starting at boot

II. Linux Advanced

1. Configure a static IP
Modify the configuration file:
    vi /etc/sysconfig/network-scripts/ifcfg-eno16777736
    Comment out BOOTPROTO="dhcp"  (i.e. # BOOTPROTO="dhcp")
    Add IPADDR=192.168.80.11
         NETMASK=255.255.255.0
         GATEWAY=192.168.80.1
         DNS1=192.168.124.1        (the four essentials: IP, netmask, gateway, DNS)
Restart the network service so the change takes effect:
    service network restart

2. Common basic commands
    ll / ls       : list the files in the current directory
    cd /          : enter the root directory
    cd /usr/games : enter a subdirectory
    cd ..         : go up to the parent directory
    pwd           : print the current path
    cd -          : switch back to the previous directory

3. Creating, deleting, and renaming folders
    mkdir folder       : create a folder
    mv oldname newname : rename a folder
    rm file            : delete a file
    rm -f file         : force-delete a file
    rm -r folder       : remove a folder recursively
    rm -rf folder      : force recursive deletion
    cp file path       : copy a file
    cp -r folder path  : copy a folder

4. Commands for operating on files
    touch filename : create a file
    cat filename   : print a file
    more filename  : page through a file showing a percentage; Enter advances a line, Space a page, q quits
    less filename  : page through a file; the arrow keys scroll, q quits
    tail -10 file  : view the last 10 lines of a file
    tail -f file   : follow a file for changes, e.g. a log file
    vi filename    : press i to insert content;
                     press Esc to leave edit mode;
                     type :wq to save and quit;
                     type :q! to quit without saving
    rm -rf file    : delete files

5. Compression and decompression commands
    Decompress: tar -zxvf package
        z: filter the archive through gzip
        x: extract files
        v: show progress verbosely
        f: specify the file name
    Compress: tar -zcvf archive-name files-to-pack
        c: create an archive
    For example: tar -zcvf test.tar.gz a.txt b.txt

6. Other common commands
    pwd                  : show the current location
    find / -name "a.txt" : search the root directory for a file named a.txt
    whereis date         : find the location of the date command
    ps -ef               : view processes
    ps -ef | grep name   : the pipe (|) makes the output of one command the input of the next
    yum search software  : search with the yum package manager
    yum install software : install software
    chmod u+x filename   : change file permissions

7. User operations
    useradd username : add a user
    passwd username  : change a user's password
    su username      : switch users
    vi /etc/sudoers  : give an ordinary user root privileges
        root ALL=(ALL)   ALL
    test  ALL=(ALL)   ALL
    
III. HDFS Distributed Cluster Installation

1. Preparation
    A virtual machine host (8 GB memory, 500 GB disk) running three Linux systems (one namenode and two datanodes).
    (1) Close the firewall
        firewall-cmd --state                : view the firewall state
        systemctl stop firewalld.service    : stop the firewall
        systemctl disable firewalld.service : prevent it from starting at boot
    (2) Connect remotely (SecureCRT)
    (3) Permanently set the hostname
        vi /etc/hostname
        Note: a reboot is required for the hostname change to take effect.
    (4) Configure the hosts mapping file
        vi /etc/hosts
        
        #127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
        #::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
        192.168.146.132 hd09-1
        192.168.146.133 hd09-2
        192.168.146.134 hd09-3

2. Install the JDK
    (1) Upload the tar package
        In SecureCRT press Alt+P to enter SFTP mode, then drag and drop the file to upload it.
    (2) Unpack the tar package
        tar -zxvf jdk-8u144-linux-x64.tar.gz
    
    (3) Configure environment variables
        vi /etc/profile
        
        export JAVA_HOME=/root/hd/jdk1.8.0_144
        export PATH=$PATH:$JAVA_HOME/bin
        
        source /etc/profile   : load the environment variables
    (4) Send it to the other machines
        scp -r hd/jdk1.8.0_144/ hd09-2:hd/jdk1.8.0_144
        scp -r hd/jdk1.8.0_144/ hd09-3:hd/jdk1.8.0_144
        scp -r /etc/profile hd09-2:/etc
        scp -r /etc/profile hd09-3:/etc
        
        Note: load the environment variables on each machine with: source /etc/profile
        
3. Configure passwordless SSH login
        ssh-keygen         : generate a key pair
        ssh-copy-id hd09-1
        ssh-copy-id hd09-2
        ssh-copy-id hd09-3
        
4. Install the HDFS cluster
    (1) Unpack the tar package
        tar -zxvf hadoop-2.8.4.tar.gz
        
    (2)Modify hadoop-env.sh
        export JAVA_HOME=/root/hd/jdk1.8.0_144
    
    (3) Modify core-site.xml
        <configuration>
            <!-- Configure the HDFS namenode address -->
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hd09-1:9000</value>
            </property>
        </configuration>
    (4) Modify hdfs-site.xml
        <configuration>
            <!-- Metadata storage location -->
            <property>
                <name>dfs.namenode.name.dir</name>
                <value>/root/hd/dfs/name</value>
            </property>
            <!-- Data storage location -->
            <property>
                <name>dfs.datanode.data.dir</name>
                <value>/root/hd/dfs/data</value>
            </property>
        </configuration>
    (5) Configure the Hadoop environment variables
        vi /etc/profile
        
        export JAVA_HOME=/root/hd/jdk1.8.0_144
        export HADOOP_HOME=/root/hd/hadoop-2.8.4
        export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
        
        source /etc/profile   : load the environment variables
    (6) Format the namenode
        hadoop namenode -format
        
    (7) Distribute Hadoop to the other servers
        scp -r ~/hd/hadoop-2.8.4/ hd09-2:/root/hd/
        scp -r ~/hd/hadoop-2.8.4/ hd09-3:/root/hd/
    (8) Distribute the Hadoop environment variables
        scp -r /etc/profile hd09-2:/etc
        scp -r /etc/profile hd09-3:/etc
    
        Note: load the environment variables on each machine with: source /etc/profile
        
    (9) Start the namenode
        hadoop-daemon.sh start namenode
    
    (10) Start the datanodes
        hadoop-daemon.sh start datanode
    
    (11) Access the web UI that the namenode serves on port 50070:
        hd09-1:50070
    (12) If visiting hd09-1 fails, edit C:\Windows\System32\drivers\etc\hosts on the Windows machine and add:
        192.168.146.132 hd09-1
        192.168.146.133 hd09-2
        192.168.146.134 hd09-3
        (that is all that is needed)

5. Automatic batch startup script
    (1) Modify the slaves configuration file to include:
        hd09-2
        hd09-3
    (2) Run the startup command
        start-dfs.sh
        
IV. Moving the Secondary NameNode and Operating HDFS from the Command Line

1. Move the cluster's Secondary NameNode to hd09-2
    (1) Modify hdfs-site.xml
        <configuration>
            <!-- Metadata storage location -->
            <property>
                <name>dfs.namenode.name.dir</name>
                <value>/root/hd/dfs/name</value>
            </property>
            <!-- Data storage location -->
            <property>
                <name>dfs.datanode.data.dir</name>
                <value>/root/hd/dfs/data</value>
            </property>
            
            <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>hd09-2:50090</value>
            </property>    
        </configuration>
        
        Note that the third <property> above is dfs.namenode.secondary.http-address, NOT:
            <property>
                <name>dfs.namenode.secondary.https-address</name>
                <value>hd09-2:50090</value>
            </property>
    (2) Distribute hdfs-site.xml to the other servers
        cd /root/hd/hadoop-2.8.4/etc/hadoop
        
        scp hdfs-site.xml hd09-2:$PWD
        scp hdfs-site.xml hd09-3:$PWD
    
    (3) HDFS start command
        start-dfs.sh
    
    (4) HDFS stop command
        stop-dfs.sh

2. Modify the cluster replication factor (number of replicas)
    Inside the <configuration> element of hdfs-site.xml add:
            <property>
                <name>dfs.replication</name>
                <value>3</value>
            </property>
        The value inside <value> is the number of replicas.
3. Modify the cluster blocksize (block size)
    Inside the <configuration> element of hdfs-site.xml add:
            <property>
                <name>dfs.blocksize</name>
                <value>134217728</value>
            </property>
        The value inside <value> is the block size in bytes (134217728 bytes = 128 MB).
        
4. The hdfs command line
    (1) View help
        hdfs dfs -help
        
    (2) View the contents of a directory
        hdfs dfs -ls /
    (3) Upload a file
        hdfs dfs -put /local/path /hdfs/path
    (4) Move (cut) a local file into HDFS
        hdfs dfs -moveFromLocal a.txt /aa.txt
        
    (5) Download a file locally
        hdfs dfs -get /hdfs/path /local/path
    (6) Merge-download a folder into a single file
        hdfs dfs -getmerge /hdfs/folder /local/merged-file
    (7) Create a folder
        hdfs dfs -mkdir /hello
        
    (8) Create nested folders
        hdfs dfs -mkdir -p /hello/world
        
    (9) Move an HDFS file
        hdfs dfs -mv /hdfs/path /hdfs/path
    (10) Copy an HDFS file
        hdfs dfs -cp /hdfs/path /hdfs/path
    (11) Delete an HDFS file
        hdfs dfs -rm /aa.txt
        
    (12) Delete an HDFS folder
        hdfs dfs -rm -r /hello
        
    (13) View a file in HDFS
        hdfs dfs -cat /file
        hdfs dfs -tail -f /file
    (14) Count how many files a folder holds
        hdfs dfs -count /folder
    (15) View the total space of HDFS
        hdfs dfs -df /
        hdfs dfs -df -h /
    (16) Change the replica count of a file
        hdfs dfs -setrep 1 /a.txt
        
V. The MapReduce Distributed Programming Framework and Building a Yarn Cluster

1. What problems does big data solve?
    Storage of massive data: Hadoop -> the distributed file system, HDFS
    Computation over massive data: Hadoop -> the distributed computing framework, MapReduce

2. What is MapReduce?
    A distributed programming framework (playing a role like SSH/SSM frameworks in Java web development) whose goal is to simplify development. It is the core framework for data-analysis applications built on Hadoop. MapReduce integrates the user's business-logic code with default components into one complete distributed program that runs concurrently on a Hadoop cluster.

3. MapReduce advantages and disadvantages
    Advantages:
    (1) Easy to program
    (2) Good scalability
    (3) Fault-tolerant
    (4) Suited to offline processing of PB-scale data and above
    Disadvantages:
    (1) Not good at real-time computing
    (2) Not good at stream computing (MR data sources are static)
    (3) No support for DAG (directed acyclic graph) computing (use Spark for that)

4. Yarn (the platform MapReduce programs run on)
    A MapReduce program must start on many machines: the map tasks run first, and only after every map task finishes can the reduce tasks start. Expecting the user to launch and schedule all of these by hand is unrealistic, so an automated task-scheduling platform is needed. Hadoop 2.x provides that distributed scheduling platform: YARN.
    
5. Building the yarn cluster
    (1) Modify the configuration file yarn-site.xml
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hd09-1</value>
    </property>

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    (2) Copy it into the same directory ($PWD) on each machine
    scp yarn-site.xml root@hd09-2:$PWD
    scp yarn-site.xml root@hd09-3:$PWD
    
    (3) Modify the slaves file
        On hd09-1, edit Hadoop's slaves file to list the machines that should start a nodemanager, and configure passwordless login from hd09-1 to all of them.
    (4) Start the yarn cluster with the scripts:
        Start: start-yarn.sh
        Stop:  stop-yarn.sh

    (5) Access the web port
        After startup, the ResourceManager web UI can be reached from a browser on Windows: http://hd09-1:8088
    
VI. MapReduce Programming Specification
A user-written MR program has three main parts: Mapper, Reducer, and Driver.
1. Mapper stage
    (1) The user-defined Mapper class extends the Mapper parent class
    (2) Mapper input is in key/value form (the kv types can be customized)
    (3) The map method is overridden to add the business logic
    (4) Mapper output is in key/value form (the kv types can be customized)
    (5) The map() method (in the maptask process) is called once for each <k, v> pair
2. Reducer stage
    (1) The user-defined Reducer class extends the Reducer parent class
    (2) The Reducer input types correspond to the Mapper output types, also key/value pairs
    (3) The reduce method is overridden to add the business logic
    (4) The ReduceTask process calls the reduce method once per group of <k, v> pairs sharing a key
3. Driver stage
    The MR program needs a Driver to submit the task: a job object that describes all the important information. (A complete sketch follows below.)
4. Modify mapred-site.xml: inside <configuration> add
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
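
To make the three parts concrete, here is a minimal WordCount sketch against the Hadoop 2.x MapReduce API. It is an illustration, not code from the original article; the class names WordCount, WcMapper, and WcReducer are made up for the example.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper stage: input <line offset, line>, output <word, 1>.
        // map() is called once per input <k, v>, i.e. once per line.
        public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String w : value.toString().split("\\s+")) {
                    if (w.isEmpty()) continue;
                    word.set(w);
                    context.write(word, ONE);
                }
            }
        }

        // Reducer stage: reduce() is called once per group of values sharing a key.
        public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver stage: the job object that describes and submits the program.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WcMapper.class);
            job.setReducerClass(WcReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would be submitted with something like: hadoop jar wc.jar WordCount /input /output (the output directory must not already exist).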

VII. Common Data Serialization Types

    1. Java type -> Hadoop type
        int      -> IntWritable
        float    -> FloatWritable
        long     -> LongWritable
        double   -> DoubleWritable
        string   -> Text
        boolean  -> BooleanWritable
        byte     -> ByteWritable
        map      -> MapWritable
        array    -> ArrayWritable

    2. Why serialize?
        To store "live" objects.
    3. What is serialization?
        Serialization converts objects in memory into byte sequences for storage and network transmission. Deserialization converts byte sequences, or data persisted on disk, back into objects in memory.
        Java's serialization -> Serializable

    4. Why not use the serialization interface Java provides?
        Java serialization is a heavyweight framework: a serialized object carries a lot of extra information (validation data, headers, the inheritance hierarchy, and so on), which makes efficient network transmission difficult. So Hadoop developed its own streamlined, efficient mechanism: Writable (a sketch follows below).
    5. Why does serialization matter in Hadoop?
        Hadoop communication is implemented with remote procedure calls (RPC), which require serialization.
    6. Characteristics:
        (1) Compact
        (2) Fast
        (3) Extensible
        (4) Interoperable
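
As a sketch of what implementing Writable looks like (the class FlowBean and its two traffic fields are illustrative, not from the original article):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // A bean carrying two traffic counters. write() and readFields()
    // must handle the fields in exactly the same order.
    public class FlowBean implements Writable {
        private long upFlow;
        private long downFlow;

        public FlowBean() {}   // Hadoop instantiates it reflectively, so keep a no-arg constructor

        public void set(long upFlow, long downFlow) {
            this.upFlow = upFlow;
            this.downFlow = downFlow;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(upFlow);
            out.writeLong(downFlow);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            upFlow = in.readLong();     // read back in the order written
            downFlow = in.readLong();
        }

        @Override
        public String toString() {
            return upFlow + "\t" + downFlow;   // what TextOutputFormat writes out
        }
    }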

VIII. Sorting

    Requirement: rank users by the amount of traffic they use each month.
    Interface -> WritableComparable
    Sorting is a default behavior in Hadoop: by default, keys are sorted in lexicographic (dictionary) order. A sorting sketch follows the combiner example below.
    Categories of sorting:
        (1) Partial sort
        (2) Total sort
        (3) Grouping (auxiliary) sort
        (4) Secondary sort
    Combiner merging
        A Combiner extends Reducer and performs local aggregation on the map side, reducing network traffic and optimizing the program.
        Caution: it must not change the final result. Consider averaging 3 5 7 2 6:
    
    mapper: (3 + 5 + 7)/3 = 5
            (2 + 6)/2 = 4
            
    reducer: (5 + 4)/2 = 4.5, but the true average of the five numbers is 23/5 = 4.6
    A combiner can therefore only be applied where it does not affect the final business logic.
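
Here is the sorting sketch mentioned above, under the same caveat that the class FlowSortBean is illustrative: making the map output key a WritableComparable lets the framework sort records by it.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Used as the map output key, so the shuffle sorts records by compareTo().
    public class FlowSortBean implements WritableComparable<FlowSortBean> {
        private long totalFlow;

        public FlowSortBean() {}

        public void setTotalFlow(long totalFlow) { this.totalFlow = totalFlow; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(totalFlow);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            totalFlow = in.readLong();
        }

        // Descending order: the heaviest-traffic users come first.
        @Override
        public int compareTo(FlowSortBean other) {
            return Long.compare(other.totalFlow, this.totalFlow);
        }

        @Override
        public String toString() { return String.valueOf(totalFlow); }
    }

A combiner, when it is safe, is enabled in the Driver with job.setCombinerClass(...); for WordCount the reducer itself can serve as the combiner because summing is associative, which is exactly the property the average example above lacks.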
    
IX. Compression

1. Hadoop's three components
    (1) The distributed file system, HDFS
    (2) The distributed programming framework, MapReduce
    (3) The yarn framework
2. Hadoop data compression
    An MR job transfers a large amount of data. Compression effectively reduces the read/write volume on the underlying storage (HDFS) and makes better use of network bandwidth and disk space. Data compression saves resources; it is an optimization strategy for MR programs. Compression codecs are applied to mapper or reducer data transfer to reduce disk IO.
3. Basic principles of compression
    (1) Compute-intensive tasks: use less compression
    (2) IO-intensive tasks: use more compression
4. Compression codings supported by MR
    Format  | Bundled with Hadoop? | Extension | Splittable?
    DEFLATE | yes                  | .deflate  | no
    Gzip    | yes                  | .gz       | no
    bzip2   | yes                  | .bz2      | yes
    LZO     | no                   | .lzo      | yes
    Snappy  | no                   | .snappy   | no
5. Encoders/decoders
    DEFLATE | org.apache.hadoop.io.compress.DefaultCodec
    Gzip    | org.apache.hadoop.io.compress.GzipCodec
    bzip2   | org.apache.hadoop.io.compress.BZip2Codec
    LZO     | com.hadoop.compression.lzo.LzoCodec
    Snappy  | org.apache.hadoop.io.compress.SnappyCodec

6. Compression performance
    Algorithm | Original size | Compressed size | Compression speed | Decompression speed
    Gzip      | 8.3GB         | 1.8GB           | 17.5MB/s          | 58MB/s
    bzip2     | 8.3GB         | 1.1GB           | 2.4MB/s           | 9.5MB/s
    LZO       | 8.3GB         | 2.9GB           | 49.3MB/s          | 74.6MB/s

7. Usage
    (1) Compress map-side output
        // enable map output compression
        conf.setBoolean("mapreduce.map.output.compress", true);
        // set the compression codec
        // conf.setClass("mapreduce.map.output.compress.codec", DefaultCodec.class, CompressionCodec.class);
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
    (2) Compress reduce-side output
        // enable output compression on the reduce end
        FileOutputFormat.setCompressOutput(job, true);
        // set the compression codec
        // FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        // FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
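
These fragments would sit in a Driver like the WordCount sketch above; the codec classes they reference come from Hadoop's org.apache.hadoop.io.compress package:

    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.io.compress.GzipCodec;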
    
X. Hadoop Optimization

1. Efficiency bottlenecks of an MR program
    Role: distributed offline computing.
    Computer performance: CPU, memory, disk, network.
    I/O problems to optimize:
    (1) Data skew (code optimization)
    (2) Unreasonable numbers of map and reduce tasks
    (3) Maps that run too long, making the reducers wait too long
    (4) Many small files (merge them with CombineTextInputFormat)
    (5) Huge unsplittable files (continuous spilling)
    (6) Many small spill files that require repeated merging

2. MR optimization methods
    Consider six aspects: data input, the Map phase, the Reduce phase, IO transfer, data skew, and parameter tuning.
    1-> Data input
        (1) Merge small files before running the MR task.
        (2) Use CombineTextInputFormat as the input format when the input consists of many small files; MR is not suited to handling large numbers of small files.
    2-> Map phase
        (1) Reduce the number of spills (raise the sort buffer, default 100 MB, and the spill threshold, default 80%):
        <property>
            <name>mapreduce.task.io.sort.mb</name>
            <value>100</value>
        </property>
        <property>
            <name>mapreduce.map.sort.spill.percent</name>
            <value>0.80</value>
        </property>
        (2) Reduce the number of merge passes (raise the merge factor):
        <property>
            <name>mapreduce.task.io.sort.factor</name>
            <value>10</value>
        </property>
        (3) Apply a Combiner after the map, provided it does not affect the business logic.
    3-> Reduce phase
        (1) Set the numbers of maps and reduces reasonably.
        (2) Let map and reduce coexist: starting the reducers once the maps have run to a certain point shortens the wait.
        <property>
            <name>mapreduce.job.reduce.slowstart.completedmaps</name>
            <value>0.05</value>
        </property>
        (3) Set the reduce-side buffer reasonably:
        <property>
            <name>mapreduce.reduce.markreset.buffer.percent</name>
            <value>0.0</value>
        </property>
    4-> IO transfer
        (1) Use data compression.
        (2) Use SequenceFile.
    5-> Data skew
        (1) Range partitioning
        (2) Custom partitioning
        (3) Combine
        (4) Use a map-side join instead of a reduce-side join.
    6-> Parameter tuning
        Set core counts.
        Map core count:
        <property>
            <name>mapreduce.map.cpu.vcores</name>
            <value>1</value>
        </property>
        Reduce core count:
        <property>
            <name>mapreduce.reduce.cpu.vcores</name>
            <value>1</value>
        </property>
        Set memory.
        Maptask memory:
        <property>
            <name>mapreduce.map.memory.mb</name>
            <value>1024</value>
        </property>
        Reducetask memory:
        <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>1024</value>
        </property>
        Parallelism of the reduce side fetching data from the map side:
        <property>
            <name>mapreduce.reduce.shuffle.parallelcopies</name>
            <value>5</value>
        </property>
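
If preferred, the same parameters can be set per job in the Driver instead of in the XML files; a minimal sketch using Hadoop's Configuration API (values mirror the defaults shown above):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.cpu.vcores", 1);              // map core count
    conf.setInt("mapreduce.reduce.cpu.vcores", 1);           // reduce core count
    conf.setInt("mapreduce.map.memory.mb", 1024);            // maptask memory
    conf.setInt("mapreduce.reduce.memory.mb", 1024);         // reducetask memory
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5); // shuffle fetch parallelism
    // then build the job from this conf: Job.getInstance(conf, ...)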
    
    

 
