Article from: https://www.cnblogs.com/octopuszy/p/9061544.html
#!/usr/bin/python
# -*- coding: utf_8 -*-
from pyspark import SparkConf, SparkContext
import os, time


if __name__ == '__main__':
    # Create SparkConf
    # sparkConf: a SparkConf object that configures the application and passes its initial parameters.
    # setAppName: sets the application name (shown in the web UI).
    # setMaster: sets the master URL; "local" runs Spark locally with a single worker thread.
    sparkConf = SparkConf().setAppName("Python Spark WordCount") \
        .setMaster("local")

    # Create SparkContext
    # sc: a SparkContext object that represents the connection to the computing cluster;
    # it is used to access Spark and to create RDDs.
    # sparkConf is passed to SparkContext as a parameter.
    sc = SparkContext(conf=sparkConf)

    '''
    Create an RDD:
    Method 1: parallelize a local collection
    Method 2: read data from an external file system (e.g. HDFS)
    '''
    # ---------- Method 1: create an RDD by parallelizing a collection ----------
    datas = ["hadoop spark", "spark hive spark sql", "spark hadoop sql spark"]
    # parallelize: takes a list as its parameter and quickly creates an RDD.
    # Drawback: the whole data set must first be loaded into memory.
    rdd1 = sc.parallelize(datas)

    # ---------- Method 2: read data from external storage and create an RDD ----------
    # textFile: reads a text file as an RDD of strings, one element per line.
    rdd2 = sc.textFile('T_CAR_20150401.csv')

    # Transformations
    car_rdd = rdd2.filter(lambda x: 'FB5018' in x)
    car2_rdd = rdd2.filter(lambda x: 'FB5093' in x)
    total_rdd = car_rdd.union(car2_rdd)

    # Actions
    # count: returns the number of elements
    # first: returns the first element
    print(rdd1.count())
    print(rdd1.first())
    print(total_rdd.count())
    print(total_rdd.first())

    # Sleep for a while so the web UI can be monitored
    time.sleep(100000)

    # Stop the SparkContext
    sc.stop()
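
To make the semantics of the filter, union, count and first steps in the script concrete, here is a minimal plain-Python sketch that mimics them on an in-memory list. It does not use Spark at all, and the sample rows are invented for illustration; they are not taken from T_CAR_20150401.csv, whose real layout is not shown in this post.

```python
# Plain-Python stand-in for the filter / union / count / first steps.
# NOTE: these rows are hypothetical sample data, not the real CSV contents.
rows = [
    "FB5018,2015-04-01 08:00:00,A",
    "FB7777,2015-04-01 08:01:00,B",
    "FB5093,2015-04-01 08:02:00,C",
    "FB5018,FB5093,2015-04-01 08:03:00,D",  # mentions both IDs
]

# filter: keep only the elements for which the predicate is true
car = [x for x in rows if 'FB5018' in x]
car2 = [x for x in rows if 'FB5093' in x]

# union: concatenate both results; like RDD.union, duplicates are kept
total = car + car2

# count and first: the number of elements and the first element
print(len(total))
print(total[0])
```

Two differences from the Spark version are worth keeping in mind. The list comprehensions run immediately, whereas RDD transformations such as filter and union are lazy and only execute when an action such as count or first is called. And because union keeps duplicates, a line containing both IDs is counted twice; calling distinct() on the result would remove the repeats if that is not wanted.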

 
