  The purpose of this article is to help engineers who want to learn more about Spark understand the overall layout of the Spark source code, set up an environment for reading the source, and compile and debug it, laying a foundation for further study.

1. Project structure

        A large project usually involves many functional modules, and managing them with Maven as a project and sub-projects (modules) saves a great deal of development and communication cost. The whole Spark project is one large Maven project containing many sub-projects, and both the Spark parent project and each sub-project can be managed independently as a Maven project. core is the most central functional module of Spark, providing the core implementations of the RPC framework, the metrics system, the Spark UI, the storage system, the scheduling system, the computing engine, the deployment modes, and so on. The main sub-projects (modules) of Spark are as follows:

  • spark-catalyst: Spark's lexical analysis, syntax analysis, abstract syntax tree (AST) generation, optimizer, logical execution plan generation, physical execution plan generation, and so on.
  • spark-core: the most basic and central functional module of Spark.
  • spark-examples: example applications, written in several languages, for Spark learners.
  • spark-sql: a general query engine implemented on the basis of the SQL standard.
  • spark-hive: support for Hive metadata and data, built on top of Spark SQL.
  • spark-mesos: the support module for Mesos.
  • spark-mllib: Spark's machine learning module.
  • spark-streaming: the support module for stream computing.
  • spark-unsafe: a module that operates directly on system memory to improve performance.
  • spark-yarn: the support module for YARN.
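As a hedged sketch of how such a multi-module Maven layout works: the parent pom.xml declares its sub-projects in a `<modules>` section. The pom.xml below is a trimmed-down, hypothetical illustration (not Spark's actual file), and the pipeline after it lists the declared modules:

```shell
# Write a trimmed-down, hypothetical pom.xml (illustration only; not
# Spark's real file) and list the sub-modules it declares.
cd "$(mktemp -d)"
cat > pom.xml <<'EOF'
<project>
  <artifactId>spark-parent</artifactId>
  <modules>
    <module>core</module>
    <module>sql/catalyst</module>
    <module>sql/core</module>
    <module>mllib</module>
    <module>streaming</module>
  </modules>
</project>
EOF

# Extract just the module names from between the <module> tags.
grep -o '<module>[^<]*</module>' pom.xml | sed 's/<[^>]*>//g'
```

Each listed directory contains its own pom.xml and can be built on its own, which is what allows every sub-project to be managed as an independent Maven project.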

2. Preparing the reading environment

        Preparing an environment for reading the Spark source requires a reasonably capable machine; the one I used for debugging has 8 GB of memory. The prerequisite for reading the source is to first build and compile it in an IDE. The commonly used IDEs are IntelliJ IDEA and Eclipse; I chose Eclipse to compile and read the Spark source for two reasons: first, years of use have made me more familiar with it, and second, there is very little community material on compiling Spark with Eclipse, so this can serve as a supplement. I compiled the Spark source on Mac OS. Besides JDK and Scala, the following tools also need to be installed.

1. Install SBT

        Because Scala projects use SBT as a build tool, you need to download SBT. From the SBT download page, download the latest installation package, sbt-0.13.12.tgz, and install it.

Move to a selected installation directory, for example:

mv sbt-0.13.12.tgz ~/install/

Enter the installation directory and execute the following commands:

chmod 755 sbt-0.13.12.tgz

tar -xzvf sbt-0.13.12.tgz

Configuration environment:

cd ~

vim .bash_profile

Add the following configuration:

export SBT_HOME=$HOME/install/sbt

export PATH=$SBT_HOME/bin:$PATH

Enter the following command to enable the environment variable to take effect quickly:

source .bash_profile

After installation, run the `sbt about` command to confirm the installation works, as shown in Figure 1.

Figure 1 Checking whether SBT is installed correctly
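Taken together, the installation steps above amount to the following script. To keep it safe to run, the sketch below rehearses the steps in a throwaway directory with a stub archive; in practice the real sbt-0.13.12.tgz from the download page takes the stub's place, and the two `export` lines go into ~/.bash_profile as described:

```shell
# Rehearse the install steps in a throwaway directory; the real
# sbt-0.13.12.tgz from the download page replaces the stub built here.
cd "$(mktemp -d)"
mkdir -p download/sbt/bin install
printf '#!/bin/sh\necho sbt stub\n' > download/sbt/bin/sbt
chmod +x download/sbt/bin/sbt
tar -C download -czf sbt-0.13.12.tgz sbt

# The steps from the article: move, set permissions, unpack.
mv sbt-0.13.12.tgz install/
cd install
chmod 755 sbt-0.13.12.tgz
tar -xzf sbt-0.13.12.tgz

# The variables that ~/.bash_profile would export.
export SBT_HOME="$PWD/sbt"
export PATH="$SBT_HOME/bin:$PATH"
sbt   # runs the stub here; the real binary would start SBT
```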

2. Install Git

        Since the Spark source uses Git as its version control tool, you need the Git client tools. Download the latest version from the Git website and install it.

    After installation, run the `git --version` command to check whether the installation succeeded, as shown in Figure 2.

Figure 2 Checking whether Git is installed successfully
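The same kind of sanity check works for every tool in this section. A small script (the tool names below are just examples) can verify them all in one pass:

```shell
# Report each required tool: its location if installed, or NOT FOUND.
cd "$(mktemp -d)"
for tool in git sbt scala; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT FOUND"
  fi
done > tool-report.txt
cat tool-report.txt
```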

3. Install the Eclipse Scala IDE plug-in

        Eclipse supports integration with all kinds of tools through its powerful plug-in mechanism. To compile, debug, and run Scala programs in Eclipse, you need to install the Eclipse Scala IDE plug-in, which can be downloaded from the Scala IDE website.

Since my local Eclipse version is Eclipse Mars.2 Release (4.5.2), I chose the matching plug-in build to install, as shown in Figure 3:

Figure 3 Eclipse Scala IDE plug-in installation address

        In Eclipse, open the “Help” menu and select the “Install New Software…” option to open the Install dialog, as shown in Figure 4:

Figure 4. Install the Scala IDE plug-in

Click the “Add…” button to open the “Add Repository” dialog box and enter the plug-in address, as shown in Figure 5:

Figure 5 Adding the Scala IDE plug-in address

Select the plug-in contents and install them, as shown in Figure 6.

Figure 6. Install the Scala IDE plug-in

3. Spark source code compilation and debugging

1. Download the Spark source code

First, access the Spark official website, as shown in Figure 7.

Figure 7 The Spark official website

Click the “Download Spark” button and find the Git address on the next page, as shown in Figure 8.

Figure 8 the official Git address of Spark

I created a Source folder in the current user's home directory as the place for the Spark source code, entered the folder, and ran the `git clone git://` command to download the source locally, as shown in Figure 9.

Figure 9 download Spark source code
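For readers unfamiliar with `git clone`, its mechanics can be tried locally before pulling the large Spark repository. The sketch below builds a tiny throwaway repository standing in for the remote and clones it into a Source folder; all paths are temporary and for illustration only:

```shell
# Build a tiny local repository standing in for the remote, then clone it.
cd "$(mktemp -d)"
git init -q upstream
cd upstream
git config user.email reader@example.com
git config user.name reader
echo 'stand-in for the Spark source tree' > README.md
git add README.md
git commit -qm 'initial commit'
cd ..

# Same command shape as `git clone git://...`, against the local repo.
git clone -q upstream Source
ls Source
```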

2. Build the Scala application

    Go into the Spark root directory and run the `sbt` command. Downloading and resolving the many jar packages takes a long time; it took me more than an hour to complete, as shown in Figure 10.

Figure 10 Building the Scala application

As you can see from Figure 10, a `>` prompt appears when SBT has finished.

3. Use SBT to generate the Eclipse project files

    At SBT's `>` prompt, enter the `eclipse` command. Generating the Eclipse project files also takes a long time; it took about 40 minutes on my machine. The state at completion is shown in Figure 11.

Figure 11 The SBT generation process
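A note in case the `eclipse` command is not recognized at the SBT prompt: it is provided by the sbteclipse plugin rather than by SBT itself. If it is missing from your build, registering the plugin in project/plugins.sbt should make it available; the plugin version below is an assumption from the sbt 0.13 era, so check the plugin's page for the current one:

```shell
# In the Spark root directory (a temp dir stands in for it here),
# register the sbteclipse plugin in project/plugins.sbt.
# The version number is an assumption; adjust to the current release.
cd "$(mktemp -d)"
mkdir -p project
cat >> project/plugins.sbt <<'EOF'
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
EOF
cat project/plugins.sbt
```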

Looking at the subfolders under Spark now, we find that .project and .classpath files have been generated in each of them. For example, the .project and .classpath files generated under the mllib project are shown in Figure 12.

Figure 12 Project files generated by SBT

4. Compile the Spark source code

    Because Spark uses Maven as its project management tool, the Spark project must be imported into Eclipse as a Maven project, as shown in Figure 13:

Figure 13 imports into the Maven project

Click the Next button to enter the next dialog box, as shown in Figure 14:

Figure 14. Select Maven project

Select all the projects and click the “Finish” button to complete the import, as shown in Figure 15.

Figure 15 Import of the Maven projects completed

After importing, you need to set the build path for each sub-project. Right-click each project and select “Build Path” then “Configure Build Path…” to open the build path dialog, as shown in Figure 16:

Figure 16 Java build path

        Eclipse may report many errors when compiling the project. If you analyze the causes carefully, you can eliminate them one by one. After all the errors are resolved, run `mvn clean install`, as shown in Figure 17:

Figure 17 successful compilation
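One error class worth calling out: on an 8 GB machine the build can fail with compiler out-of-memory errors. Spark's build documentation recommends raising Maven's memory limits first; a typical setting (the exact values are a starting point and may need tuning for your machine) is:

```shell
# Raise Maven's heap and JIT code cache before building Spark.
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
echo "$MAVEN_OPTS"

# Then rebuild; skipping tests shortens the first full build considerably:
# mvn clean install -DskipTests
```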

5. Debug the Spark source code

    Let us take the JavaWordCount example from the Spark source to show how to debug the Spark source code. Right-click it and select “Debug As” then “Java Application”. To modify the configuration parameters, right-click and select “Debug As” then “Debug Configurations…”, select JavaWordCount in the dialog that opens, and adjust the Java execution parameters, JRE, classpath, and environment variables on the tabs on the right, as shown in Figure 18:

Figure 18 source code debugging

Readers can also set breakpoints in the Spark source code for tracking and debugging.
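When launching JavaWordCount from the IDE, two settings usually have to be supplied by hand in “Debug Configurations…” (the concrete values below are illustrative assumptions, not mandated by the source): a master URL, since no cluster is configured, and an input file as the program argument:

```shell
# Prepare a small input file for JavaWordCount (the path is an example).
printf 'hello spark\nhello world\n' > /tmp/words.txt

# Assumed IDE settings in "Debug Configurations...":
#   VM arguments:      -Dspark.master=local[*]
#   Program arguments: /tmp/words.txt
cat /tmp/words.txt
```

With `local[*]`, Spark runs inside the IDE's JVM, so breakpoints set in the source are hit directly by the debugger.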

