Installing Spark on a Linux machine can be done in a few steps. The following is a detailed guide to installing Spark in standalone mode.
- Install Java: Spark requires Java to be installed on the machine. You can check whether Java is already installed by running `java -version`. If it is not, you can install it with `sudo apt-get install openjdk-8-jdk` or `sudo yum install java-1.8.0-openjdk-devel`, depending on your Linux distribution.
- Download Spark: Go to the Spark downloads page (https://spark.apache.org/downloads.html) and download the latest version of Spark pre-built for Hadoop. The pre-built packages are distributed as compressed tar archives (`.tgz`); alternatively, you can download the source package and build Spark yourself. A consolidated sketch of the download, extraction, and startup commands appears after this list.
- Extract the package: Extract the package you downloaded in the previous step using the `tar` command: `tar -xvf spark-x.y.z-bin-hadoopx.y.z.tgz` (replace `x.y.z` with the version numbers of the package you downloaded). This creates a directory called `spark-x.y.z-bin-hadoopx.y.z`.
- Set environment variables: You need to set a couple of environment variables so that Spark can be found. You can do this by adding the following lines to your `.bashrc` file:

  ```bash
  export SPARK_HOME=/path/to/spark-x.y.z-bin-hadoopx.y.z
  export PATH=$PATH:$SPARK_HOME/bin
  ```

  (replace `/path/to/` with the path to the directory where you extracted the Spark package). Run `source ~/.bashrc` afterwards so the changes take effect in your current shell.
- Start the Spark Master: You can start the Spark Master by running `start-master.sh` from the `sbin` directory of your Spark installation. You can then access the Spark Master web UI by going to `http://<master-url>:8080` in your web browser; the UI shows the master's `spark://` URL, which you will need in the next step.
- Start the Spark Worker: You can start the Spark Worker by running `start-worker.sh <master-url>` from the `sbin` directory of your Spark installation, replacing `<master-url>` with the URL of the master node (for example, `spark://<hostname>:7077`). See the end-to-end sketch after this list for the full sequence of commands.
- Verify the installation: You can verify the installation by running the `pyspark` command in your terminal, which starts the PySpark shell. You can run Spark commands there and check the status of the cluster by visiting the Master web UI; you can also submit one of the bundled example jobs, as shown in the smoke-test example below.
- Optional: configure Spark: You can configure Spark by editing the `conf/spark-defaults.conf` file (the distribution ships a `conf/spark-defaults.conf.template` you can copy to create it); a sample configuration is sketched after this list.
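Putting the download, extraction, environment, and startup steps together, a minimal end-to-end sketch might look like the following. It assumes Spark 3.5.0 pre-built for Hadoop 3, an installation under your home directory, and a single machine acting as both master and worker; adjust the version, mirror URL, and paths to match your own download.

```bash
# Assumptions: Spark 3.5.0 pre-built for Hadoop 3, installed under $HOME,
# with the same machine running both the master and one worker.
SPARK_PKG=spark-3.5.0-bin-hadoop3

# Download and extract the pre-built package.
cd ~
wget "https://archive.apache.org/dist/spark/spark-3.5.0/${SPARK_PKG}.tgz"
tar -xvf "${SPARK_PKG}.tgz"

# Point SPARK_HOME at the extracted directory and put its bin/ on the PATH.
echo "export SPARK_HOME=$HOME/${SPARK_PKG}" >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
source ~/.bashrc

# Start the standalone master, then a worker that registers with it.
# The master listens on port 7077 by default; its web UI is on port 8080.
"$SPARK_HOME/sbin/start-master.sh"
"$SPARK_HOME/sbin/start-worker.sh" "spark://$(hostname):7077"
```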
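For the optional configuration step, the sketch below creates `conf/spark-defaults.conf` from the shipped template and appends a few illustrative settings. The property names (`spark.master`, `spark.executor.memory`, `spark.driver.memory`, `spark.serializer`) are standard Spark configuration keys, but the values shown are only examples to adapt to your machine.

```bash
# Create spark-defaults.conf from the shipped template, then append
# example settings; the values below are placeholders, not recommendations.
cp "$SPARK_HOME/conf/spark-defaults.conf.template" "$SPARK_HOME/conf/spark-defaults.conf"
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.master            spark://your-hostname:7077
spark.executor.memory   2g
spark.driver.memory     1g
spark.serializer        org.apache.spark.serializer.KryoSerializer
EOF
```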
You have now installed Spark in standalone mode on your Linux machine and can use it to run big data processing and analytics tasks.
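As a quick smoke test of the standalone cluster, you can submit one of the example applications that ship with Spark and then open an interactive PySpark shell against the same master. Replace `your-hostname` with the host shown in the `spark://` URL on the master web UI.

```bash
# Submit the bundled Pi-estimation example to the standalone master.
spark-submit --master spark://your-hostname:7077 \
  "$SPARK_HOME/examples/src/main/python/pi.py" 10

# Open an interactive PySpark shell attached to the same master;
# the running application will also appear on the master web UI.
pyspark --master spark://your-hostname:7077
```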
Make sure that the version of Hadoop you are running (if any) is compatible with the pre-built package you chose, and check Spark's hardware and software requirements before installing, since it needs sufficient memory and disk space for your workloads. The snippet below shows a few quick checks.
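These are standard commands for checking versions and resources; note that `hadoop version` is only relevant if you run Spark alongside an existing Hadoop installation.

```bash
java -version           # confirm a supported Java version is installed
spark-submit --version  # prints the installed Spark (and Scala) version
hadoop version          # compare with the Hadoop profile of your Spark package, if Hadoop is installed
free -h                 # available memory
df -h                   # available disk space
```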
Important Spark URLs to refer to: