Pro tip: if you're only looking for how to configure Hadoop and Spark to run on a cluster, start here.
Hi Andrew, thanks for sharing this with us!
Following your instructions, I have managed to run HDFS and YARN on my cluster using OpenJDK 8 instead of Oracle Java (which is not pre-installed on Raspbian Buster and can't be installed via apt).
However, when running the wordcount example, the following error occurs and the job fails.
Do you have any idea what the reason might be?
For a more colourful clustercmd prompt, try putting this in your ~/.bashrc:
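(The snippet itself isn't reproduced here; as a rough, hypothetical sketch, assuming the otherpis and clustercmd helpers defined earlier in the article, a colourised version might look something like the following.)

# hypothetical sketch -- prints each node's hostname in bold green before its output
function clustercmd {
  for pi in $(otherpis); do
    printf "\e[1;32m%s:\e[0m\n" "$pi"   # remote node's name, coloured
    ssh "$pi" "$@"
  done
  printf "\e[1;32m%s:\e[0m\n" "$(hostname)"
  "$@"                                  # run the command on this Pi as well
}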
then do a
Looks like this:
Looks nice! Thanks, Razvan!
Hi. I am not able to get the hadoop version or java version for the nodes over ssh using clustercmd. I have set up a cluster of 5 Pis (model B+) with 32 GB micro SD cards in all of them. On running
I get the following error:
I am attaching the .bashrc here. Please help. Thanks.
Did you follow these steps?
Create the Directories
Create the required directories on all other Pis using:
Copy the Configuration
Copy the files in /opt/hadoop to each other Pi using:
This will take quite a long time, so go grab lunch.
When you're back, verify that the files copied correctly by querying the Hadoop version on each node with the following command:
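The exact commands are given in the main article; as a rough sketch, assuming the clustercmd helper defined there and example worker hostnames pi2, pi3, pi4, those steps look roughly like this:

# create the Hadoop directories on every Pi and make them writable by the pi user
clustercmd sudo mkdir -p /opt/hadoop /opt/hadoop_tmp/hdfs
clustercmd sudo chown pi:pi -R /opt/hadoop /opt/hadoop_tmp
# copy the Hadoop installation from Pi #1 to each worker (hostnames are examples)
for pi in pi2 pi3 pi4; do rsync -avxP /opt/hadoop/ $pi:/opt/hadoop/; done
# verify: every node should report the same Hadoop version
clustercmd hadoop version | grep Hadoop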
You can't run the hadoop command on the other Pis until you've copied over those hadoop directories. If you have done that, you also need to make sure that that directory is on the $PATH of the other Pis by including the following lines in each of their .bashrc files (sorry, I don't think I included this step in the instructions):
You could also simply clusterscp the .bashrc file from Pi #1 to each of the other Pis.
Hi. Thanks for the reply. Yes, I did the steps you mentioned. Since Java wasn't pre-installed, I installed it manually in each Pi, and checked them individually to see if they are working. As you can see below, the env variables are configured as you have suggested.
Thanks. I resolved the issue by putting the PATH exports above the following part in .bashrc:
I also put the export PATH commands in /etc/profile of each Pi. Thanks.
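For anyone hitting the same thing: Raspbian's default ~/.bashrc returns early for non-interactive shells (which is what ssh <pi> hadoop version and clustercmd use), so any exports placed below that check are never read. A rough sketch of the fix, using the paths from the article (adjust to your install):

# put the exports ABOVE the non-interactive early-return in ~/.bashrc
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin

# this is the part of the default Raspbian .bashrc that bails out for
# non-interactive shells; anything exported below it is invisible to
# "ssh <pi> <command>"
case $- in
    *i*) ;;
      *) return;;
esac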
Thank you for this superb article. I have been following it to deploy a Hadoop/Spark cluster using the latest Raspberry Pi 4 (4GB). I encountered one problem: after completing the tutorial, the Spark job was not being assigned. I got a warning,
INFO yarn.Client: Requesting a new application from cluster with 0 NodeManagers
and then it sort of got stuck on
INFO yarn.Client: Application report for application_1564655698306_0001 (state: ACCEPTED)
I will describe later how I solved this.
First, I want to note that the latest Raspbian version (Buster) does not include Oracle Java 8, which is required by Hadoop 3.2.0. There is no easy way to get it set up, but it can be done. You first need to manually download the tar.gz file from Oracle's site (this requires registration). I put it up on a personal webserver so it can be easily downloaded from the Pis. Then, on each Pi:
# download java package
cd ~/Downloads
wget /jdk8.tar.gz
# extract package contents
sudo mkdir /usr/java
cd /usr/java
sudo tar xf ~/Downloads/jdk8.tar.gz
# update alternative configurations
sudo update-alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_221/bin/java 1000
sudo update-alternatives --install /usr/bin/javac javac /usr/java/jdk1.8.0_221/bin/javac 1000
# select desired java version
sudo update-alternatives --config java
# check that the java version changes
java -version
Next, here is how I solved the YARN problem. In your tutorial section "Configuring Hadoop on the Cluster", after the modifications to the xml files have been made on Pi1, two files need to be copied across to the other Pis: these are yarn-site.xml and mapred-site.xml. After copying, YARN needs to be restarted on Pi1.
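In case it helps, a rough sketch of that copy-and-restart step (plain scp with example hostnames; the clusterscp helper from the article works just as well):

# copy the two YARN-related config files from Pi #1 to each worker
for pi in pi2 pi3 pi4; do
  scp /opt/hadoop/etc/hadoop/yarn-site.xml /opt/hadoop/etc/hadoop/mapred-site.xml "$pi":/opt/hadoop/etc/hadoop/
done
# restart YARN on Pi #1 so the new settings take effect
stop-yarn.sh
start-yarn.sh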
To set appropriate values for the memory settings, I found a useful tool which is described on this thread stackoverflow.com/questions/495791...
Copy-pasting the instructions:
# get the tool
wget public-repo-1.hortonworks.com/HDP/...
tar zxvf hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
rm hdp_manual_install_rpm_helper_files-2.6.0.3.8.tar.gz
mv hdp_manual_install_rpm_helper_files-2.6.0.3.8/ hdp_conf_files
# run the tool
python hdp_conf_files/scripts/yarn-utils.py -c 4 -m 8 -d 1 false
-c number of cores you have for each node
-m amount of memory you have for each node (GB)
-d number of disks you have for each node
-bool "True" if HBase is installed; "False" if not
This should provide appropriate settings to use. After the xml files have been edited and YARN has been restarted, you can try this command to check that all the worker nodes are active.
yarn node -list
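For reference, these are the kinds of properties the tool suggests values for. The numbers below are purely illustrative placeholders, not output from the tool; substitute whatever yarn-utils.py reports for your hardware.

In yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3072</value>  <!-- illustrative -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>   <!-- illustrative -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>3072</value>  <!-- illustrative -->
</property>

In mapred-site.xml:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>   <!-- illustrative -->
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>   <!-- illustrative -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>   <!-- illustrative -->
</property>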
Hi Andreas,
I am running Raspbian Buster on my Pis, too. I downloaded the "Linux ARM 64 Hard Float ABI" package (jdk-8u231-linux-arm64-vfp-hflt.tar.gz) and followed your instructions, but I get the following error when running java -version:
I guess this Java product is not compatible with the Pi. Which exact file did you download from the Oracle site?
First of all I'd like to thank Andrew for a superb tutorial. Apart from some minor alterations I had to make, I was able to set up HDFS etc., but I am now running into the same problem as you, Andreas.
The first thing I'd like to add to your recommendations is that downloading Java is easier this way:
sudo apt-get install openjdk-8-jdk
and then change the default (as you suggested already):
sudo update-alternatives --config java
sudo update-alternatives --config javac
Then change export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") to export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf both in ~/.bashrc and in /opt/hadoop/etc/hadoop/hadoop-env.sh.
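So in both files the line ends up pointing at the path where the openjdk-8-jdk package installs on Raspbian; a quick sanity check afterwards doesn't hurt:

# in both ~/.bashrc and /opt/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
# sanity check: should report an openjdk 1.8.0 build
$JAVA_HOME/bin/java -version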
The part I have been stuck on for a while, though, is that the yarn node -list command hangs,
and if I try to run a Spark job I also get stuck on the ACCEPTED part.
I haven't yet tried your suggestion.
PS: I know it is a year-old article, but it is still the best I've seen so far in my research.
Many thanks to you both!
Hi Andrew,
This is an excellent tutorial so THANK YOU very much!
I had a quite strange problem though related to the clustercmd function. At the moment, I have installed Java and Hadoop on both workers (which I call rpi102 and rpi103). As you can see from the terminal on the right, I can SSH from the master (rpi101) into rpi102, run the hadoop version command, and get what I expected.
However, as you can see from the terminal on the left, when I did ssh rpi102 hadoop version, the hadoop command is not found. But if I try something else, like ssh rpi102 whoami, it worked fine.
This seems so odd and really puzzles me. Do you have any idea what the issue could be? Thanks in advance!
Never mind. This answer helped me out.
superuser.com/a/896454/955273
Did you solve it with this command?
Hello Andrew
Fantastic article/tutorial that made my deployment of a 4-node Pi 4 cluster almost trivial! I also think that your use of bash shell scripting along the way is brilliantly applied to the project; a model I will seek to emulate.
It was a great tutorial! I successfully built a Hadoop system with four Raspberry Pi 4s. There were a few hiccups while I was following this tutorial, but they could be solved by following other people's postings. Thanks a lot!
It was a bit early for me to declare my cluster all set. The problem I'm facing is that the worker nodes are not showing up. I have a master node and three worker nodes in the cluster, but "hdfs dfsadmin -report" only shows the master node. I double-checked the XML files for typos and confirmed they had been copied to the worker nodes correctly, but found nothing. Do you have any suggestions to figure this out?
Hi Andrew,
First off, great stuff. Thank you so much for being as thorough as you are. This was a well-put-together #howto. Secondly, I'd like to highlight some issues I came across while following your post.
This is to be expected, but the links to the Hadoop and Spark installers are no longer valid. I simply followed your link down the rabbit hole to find the most up-to-date available installer.
Next, I ran into the same issue as Rohit Das, which he noted below. I simply followed his fix and it worked perfectly. Thank you, Rohit, for figuring this out.
Thirdly, I ran into an issue where the namenodes and datanodes started, but the datanodes weren't able to connect to pi1. Looking in the UI, everything was 0 and 0 datanodes were registered. After chasing SO post after post, I finally got it to work by changing, in core-site.xml,
hdfs://pi1:9000
to
hdfs://{ip address}:9000
where {ip address} is the fully typed ip address of pi 1.
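In other words (fs.defaultFS is the standard Hadoop property; 192.168.0.101 is just a placeholder for pi1's real address), core-site.xml ends up with something like:

<property>
  <name>fs.defaultFS</name>
  <!-- placeholder: use pi1's actual IP address here -->
  <value>hdfs://192.168.0.101:9000</value>
</property>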
Lastly, I mounted 250 GB SSDs on my workers. Once I get the optimal resource limits for this worked out in the configuration XMLs, I'll be sure to post here what they were.
I should note my name node is a raspberry pi 3 but my workers are raspberry pi 4 with 4GB ram.
Thanks for the article.
Thanks for checking it out!
Excellent article! I have a Pi 3 B+ and a Pi Model B (the oldest, 256 MB version). Is it possible to run a Spark cluster, just for study?
My Pis are struggling as-is. I wouldn't recommend any lower specs than the Model 3 B+ (1 GB RAM)
I do want to make use of the old Pi, even if only as a NameNode; I think that doesn't need much computational resource. I'm new to Spark, so the question might be silly, sorry about that.
Even though NameNodes aren't processing data, they still have some CPU and memory requirements (they have to orchestrate the data processing, maintain records of the filesystem, etc.). I saw somewhere that 4GB per node was the recommended minimum. All I know from experience is that 1GB seems to barely work.
Spark sets minimum memory limits and I don't think 256MB is enough to do anything.
Okay, then the only thing the 256 MB Pi can do may be running an Nginx reverse proxy in my private cloud, or R.I.P. Thanks for that.
Maybe you could turn it into a Pi-Hole?
Unfortunately, the Pi-hole project requires at least 512 MB of memory. My old Pi should R.I.P. right now; I'll leave it to my children as the first gift from an elder generation.
Thanks for this excellent guide! Haven't tried it yet, but looks remarkably complete and pedagogical.
Some questions:
Amazing! Now I kind of want to make one, although I'm not sure what I could do with such a cluster.
Impress your friends!
Crush your enemies!
Excellent article! You have covered the whole process in a very detailed manner. Thanks.
Hi, it looks like you've just tried this with computational tasks (calculating pi). We're trying this with Spark SQL and facing out-of-memory errors on our 3B+ cluster. Have you tried this on a memory-intensive job?
Yes, we were doing some benchmarking on another machine, running random forests with X number of trees. I tried to run the same script on the Pi cluster and got OutOfMemoryErrors with minuscule datasets.
Thank you very much for this article! I am definitely going to recreate this at home.