by Udaybhaskar Sarma Seetamraju
ToSarma@gmail.com
Dec 31 2023
Highest-level Context
If you are into “Shift-Left” (whether re: Testing, Security, or Replicating-problems-on-developer-laptop, etc ..), then this article is for you.
The very first time you switch to an M1-chipset based MacBook (from an Intel-chip based MacBook), productivity is significantly impacted when doing development/testing/troubleshooting “locally” on your laptop. Supporting those switching from Windows is out of scope for this article.
Towards enabling up to 5x developer-productivity by allowing developers to robustly SIMULATE the Cloud-environment on a laptop — I have the following series of articles re: M1-chipset based MacBooks:
- Running AWS CodeBuild locally on MacBook-M1.
  - Running Containers based on the older Ubuntu 20.04 (released in the year 2020) as well as on the newer Ubuntu 22.04 (released in the year 2022)
  - Running Containers based on arm64-based Linux
- (This) Running AWS Glue locally on MacBook-M1. Various scenarios covered like: you do Not have “aws credentials” on your Laptop (forcing you to mock all the AWS API calls like S3 GET, Glue-Catalog queries, etc..)
- New Security-related Best-Practices when creating arm64/aarch64 Docker-Images on a MacBook-M1.
Quick Summary
Aiming for a very simple single command, based on bash-shell scripts --> to execute your python-code as a Glue Job’s build locally on your MacBook-M1.
In addition, I have a section on how to significantly raise your productivity, in debugging/developing your python-code, even if your company denies your AWS CLI credentials.
To state the obvious, everything here is 100% Python + Bash-Scripts.
Note: You should aim to have your software work on arm64 containers, which are typically the cheapest compute on cloud. More below.
Problem Statements
- Using Git for capturing ALL code-changes while simultaneously copying-n-pasting into AWS Glue-Studio (for testing/troubleshooting) is painful, error prone and frustrating.
- As time progresses, your Glue Job will become complicated and require more than one Python-script.
Worse, you have one or more folder-hierarchies, all of which contain PY files that you need to import!
- Many developers prefer to develop/test/troubleshoot their code as “plain python”, and Not as a Glue-Job.
There is No good reason to deny such developers from doing just that .. while ensuring that code will work without any issues inside AWS Glue running locally on MacBook-M1, and eventually work without issues inside Glue on AWS.
  - When running as “plain python”, there should be No runtime dependencies (like “import awsglue”).
  - When running as “plain python”, there should be No spark-dependency.
  - When running as “plain python”, all inputs/files should be on local laptop’s filesystem. All output should be written to local filesystem only.
  - When running as “plain python”, all information from Glue-Catalog should be available OFFline (as a Python Dict object)
- How to proactively ensure the Glue Job will work on all chip-architectures - without having to scramble later? How to explicitly utilize all x86_64/amd64/arm64/aarch64 architectures locally on laptop?
- If the Enterprise does Not allow Laptops to have AWS-Credentials (in the ~/.aws/credentials file), how can I still EFFICIENTLY test/debug the python-code file locally on my laptop, even as it needs access to Glue Catalog and/or S3 buckets?
Get started!
export BUILDPLATFORM="linux/aarch64"
Use the above ONLY when running on a MacBook-M1 laptop, if you'd like to take advantage of the native-performance boost!
Otherwise, based on your needs on AWS, choose between these 2:
export BUILDPLATFORM="linux/amd64"
export BUILDPLATFORM="linux/arm64"
Then, in either case:
export DOCKER_DEFAULT_PLATFORM="${BUILDPLATFORM}"
export TARGETPLATFORM="${DOCKER_DEFAULT_PLATFORM}"
WORK_AREA=~
cd ${WORK_AREA}
git clone https://gitlab.com/tosarma/macbook-m1.git
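If you would rather not hard-code the platform string, it can be derived from the host CPU. Below is a minimal Python sketch of that idea; the function name `docker_platform_for_host` is hypothetical and not part of the repository. It returns `linux/arm64` on an M1, which Docker normally treats as the same platform as `linux/aarch64`.

```python
import platform

# Map the CPU names reported by platform.machine() onto Docker platform strings.
_ARCH_TO_DOCKER = {
    "arm64": "linux/arm64",    # macOS on Apple Silicon reports "arm64"
    "aarch64": "linux/arm64",  # Linux on ARM reports "aarch64"
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
}


def docker_platform_for_host(machine=None):
    """Return the Docker platform string that matches the host CPU."""
    machine = (machine or platform.machine()).lower()
    try:
        return _ARCH_TO_DOCKER[machine]
    except KeyError:
        raise ValueError(f"Unrecognized CPU architecture: {machine!r}")


if __name__ == "__main__":
    # Print a line that can be pasted into (or eval'd by) the shell.
    print(f'export BUILDPLATFORM="{docker_platform_for_host()}"')
```

Passing an explicit value (for example `docker_platform_for_host("x86_64")`) lets you pick the AWS target architecture instead of the laptop's own.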
To try out a sample ..
cd macbook-m1
cd AWS-Glue/src
${WORK_AREA}/macbook-m1/AWS-Glue/bin/run-glue-job-LOCALLY.sh sample-glue-job.py
No Bash? Want Python instead?
Just replace the “.sh” with “.py” in the script name “run-glue-job-LOCALLY” (as shown above).
And, of course, you must insert “python3” at the very beginning of the CLI (this is platform-independent advice).
WARNING: Without the benefit of the “docker” CLI, you get ZERO visibility into the progress of docker-activity. This is due to the use of Docker’s unfriendly Python APIs, because of which the python-code WILL HANG for a long time!
To repeat, “run-glue-job-LOCALLY.py” will hang with NO output for roughly 2-to-5 minutes (depending on how much CPU and MEMORY you have allocated to Docker-Desktop, as well as the speed of your internet connection).
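For readers curious how progress could be surfaced from Python, here is a minimal sketch using the Docker SDK's low-level `pull(..., stream=True, decode=True)` call, which yields one dict per progress event. This is an illustration only and is not what “run-glue-job-LOCALLY.py” does; the function names are hypothetical, and the image repository and tag are left as parameters because they depend on which Glue image you use.

```python
def format_pull_event(event):
    """Turn one decoded pull event (a dict) into a single printable line."""
    parts = [event.get("id", ""), event.get("status", ""), event.get("progress", "")]
    return " ".join(p for p in parts if p)


def pull_with_progress(repository, tag, platform=None):
    """Pull an image, printing each progress event as it arrives."""
    import docker  # installed via: pip3 install docker

    # The low-level API client behind the usual docker.from_env() client.
    api = docker.from_env().api
    for event in api.pull(repository, tag=tag, stream=True,
                          decode=True, platform=platform):
        print(format_pull_event(event), flush=True)
```

The formatting helper is kept separate from the network call so that it can be checked without a running Docker daemon.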
Important - Note these:
- I only tested using Python3.11; No other Python version tested.
- PRE-REQUISITES:
  - pip3 install docker
Ready to use it for your own Glue-Script?
- First, read the full details within the macbook-m1/AWS-Glue/README.md file.
- Copy the *.py files in the macbook-m1/AWS-Glue/src/common subfolder into --> YOUR project’s TOPMOST-folder.
  - ATTENTION: the files ./src/common/*.py must exist in your project, after you are done copying.
- Make sure to edit your PY file, to look like the example file provided (sample-glue-job.py)
- From your project root, run:
cd <your-project's-root-folder>
${WORK_AREA}/macbook-m1/AWS-Glue/bin/run-glue-job-LOCALLY.sh \
path/to/your/file.py
If you are having problems importing the new files under ./src/common, then try adding the command below and then retry the above command.
export PYTHONPATH="your-project's-root-folder"
Does your Python code-base consist of multiple files across multiple folder-hierarchies?
See section below titled “Complex folder-hierarchies?”
Want to change the CLI-arguments?
Three simple steps:
1. Edit the file macbook-m1/AWS-Glue/src/common/cli_utils.py
2. Look inside “process_all_argparse_cli_args()” and make changes in that function.
3. Look inside “process_std_glue_cli_args()” and make changes inside that function.
   - Note: make sure to make similar changes in above steps 2 & 3.
Example:
You would like to support a new cli-arg as: --JOB_NAME 123_ABC
- Insert a new line at (say) line # 125 for JOB_NAME.
  - This will ensure your code will get the value 123_ABC when running INSIDE AWS-Glue!
- Insert a new line at (say) line # 78 for --JOB_NAME
  - This will allow you to run your python-code as a PLAIN python-command and read this CLI-arg.
  - See more re: this in the following section titled “Running as a plain python command”
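To make the two steps concrete: inside Glue, job arguments are normally read with `getResolvedOptions` from `awsglue.utils`, while a plain Python run can use `argparse`. The sketch below shows one way to keep both paths in sync. It is an illustration and not the contents of cli_utils.py; the names `ARG_NAMES`, `parse_plain_args` and `parse_args` are hypothetical.

```python
import argparse
import sys

# One shared list of argument names, so the two parsers cannot drift apart.
ARG_NAMES = ["JOB_NAME", "ENV"]


def parse_plain_args(argv):
    """Parse '--NAME value' pairs with argparse (plain Python run)."""
    parser = argparse.ArgumentParser()
    for name in ARG_NAMES:
        parser.add_argument(f"--{name}", required=True)
    return vars(parser.parse_args(argv))


def parse_args(argv=None):
    """Use Glue's resolver when awsglue is importable, else fall back to argparse."""
    argv = sys.argv if argv is None else argv
    try:
        from awsglue.utils import getResolvedOptions
    except ImportError:
        return parse_plain_args(argv[1:])  # skip the script name
    return getResolvedOptions(argv, ARG_NAMES)
```

With this shape, adding a new argument means appending one name to the shared list rather than editing two places.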
Never edit the file: macbook-m1/AWS-Glue/src/common/glue_utils.py
The files “common.py” and “names.py” in that same folder can be edited. Feel free to play around with them.
Tips, Issues & Errors
See Appendix sections, for tips on configuring Docker-DESKTOP.
Question: Want to automatically cleanup/delete the Docker-containers - after they exit?
Answer: INSERT the cli-arg “--cleanup” BEFORE the python-filename, when invoking the “run-glue-job-LOCALLY.sh” script.
Advanced User - Complex folder-hierarchies?
Does your Python code-base consist of multiple files across multiple folder-hierarchies?
Are you aware that Glue requires you to ZIP up all those OTHER python-files into a single Zip-file?
FYI only - this requirement is driven by Spark!
The script “run-glue-job-LOCALLY.sh” will automatically do that for you --> that is, it will automatically look UNDER the current-working-directory, find all **/*.py files and put them in a temporary ZIP-file.
The script will then automatically pass it on to Glue-inside-Docker (running on your laptop).
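Conceptually, the packaging step looks like the following. This is a minimal Python sketch of the idea, not the script's actual implementation; the function name `zip_python_files` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_python_files(root, zip_path):
    """Zip every *.py file under root, keeping paths relative to root."""
    root = Path(root).resolve()
    zip_path = Path(zip_path).resolve()
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in sorted(root.rglob("*.py")):
            rel = py_file.relative_to(root).as_posix()
            zf.write(py_file, arcname=rel)  # store as e.g. "pkg/util.py"
            added.append(rel)
    return added
```

Keeping the paths relative to the root is what allows imports such as `pkg.util` to resolve once the archive is on the Python path. On AWS, an archive like this is what you would typically hand to a Glue job through its `--extra-py-files` parameter.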
If you need to import PY files in parent/ancestor levels, I recommend that you add a “symlink” (Linux command “ln -s”) to those files, and put that symlink in your current-working-folder.
Git will preserve these “symlinks” exactly as symlinks. It will NOT convert them into files. So, feel better already!
No AWS Credentials on your laptop?
For security-reasons, many companies are denying developers the AWS-credentials for AWS-CLI use.
That means you have a showstopper -> re: locally testing/debugging your python-code, for scenarios like:
- COPY INPUT-files from S3-buckets --> into the “current-working directory”.
- COPY OUTPUT-files from the “current-working directory” --> into S3-buckets.
- Lookup Glue Catalog.
- .. etc ..
To work around this restriction ..
You need to write code that detects whether it is running on a MacBook-M1 -versus- actually running inside AWS-Cloud.
In other words, you need to “Short-Circuit” all the code that interacts with AWS-APIs (Glue-Catalog, S3, ..) and mock the expected response from those AWS-APIs.
If you use my script “run-glue-job-LOCALLY.sh”, it automatically sets an environment-variable called “running_on_LAPTOP” when running your python-code inside a Docker-Glue container on your laptop!!
How-To “short-circuit”:
import os

if os.environ.get('running_on_LAPTOP'):
    print("!!!!!!!!!!!!!!!!! running on laptop !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    ...  # assume S3-get is already done and the file is available in current-directory
    ...  # assume Glue-Catalog-Query is already done and
         # the "JSON-Response" is available in current-directory as a JSON-file
else:
    ...  # running inside AWS-Cloud: call the real AWS APIs (S3, Glue-Catalog, ..)
To state the obvious, on AWS-Cloud, AWS-GLUE does NOT support environment-variables.
So, this environment-variable called “running_on_LAPTOP” will be UN-defined when running inside AWS-Cloud.
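The short-circuit can also be wrapped in small helpers, so that the rest of the job never needs to know where it runs. Below is a minimal sketch, assuming the mocked inputs were placed in a local folder beforehand. The helper names and the `<table>.json` file-naming convention are hypothetical; `download_file` and `get_table` are standard boto3 client calls.

```python
import json
import os
from pathlib import Path


def running_on_laptop():
    """True when the marker environment-variable is set (laptop run)."""
    return bool(os.environ.get("running_on_LAPTOP"))


def fetch_input(bucket, key, local_dir="."):
    """Return a local path for the S3 object, downloading only on AWS."""
    local_path = Path(local_dir) / Path(key).name
    if running_on_laptop():
        # Laptop: assume the file was copied here by hand beforehand.
        if not local_path.exists():
            raise FileNotFoundError(f"Place a copy of {key} at {local_path}")
        return local_path
    import boto3  # only needed when running on AWS
    boto3.client("s3").download_file(bucket, key, str(local_path))
    return local_path


def get_table_metadata(database, table, local_dir="."):
    """Return the Glue-Catalog table description as a Python dict."""
    if running_on_laptop():
        # Laptop: read a previously saved JSON response instead of calling Glue.
        return json.loads((Path(local_dir) / f"{table}.json").read_text())
    import boto3
    return boto3.client("glue").get_table(DatabaseName=database, Name=table)
```

Because `boto3` is imported only inside the AWS branch, the laptop path works without AWS credentials and without `boto3` installed.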
Running as a plain python command
EXAMPLE:
I’m going to use the same “macbook-m1/AWS-Glue/src/sample-glue-job.py” file, to show how to run as a PLAIN Python-program.
FYI: My python-code in that sample-glue-job.py expects the following 6 CLI-arguments (with their values).
If you do NOT like this list of CLI-args, see the section above titled “Want to change the CLI-arguments?”
python3 sample-glue-job.py --ENV sandbox \
--commonDatabaseName MyCOMMONDATABASENAME \
--glueCatalogEntryName MyDICT \
--rawBucket MyRAWBUCKET \
--processedBucket MyPROCESSEDBUCKET \
--finalBucket MyFINALBUCKET
If you want the ability to run both as a plain python-script as well as run it inside AWS-Glue, you MUST replicate the structure and code within this “macbook-m1/AWS-Glue/src/sample-glue-job.py”.
APPENDIX
Docker-Desktop settings for aarch64-chipset
See screenshot below.
Turn ON the setting titled “Use containerd for pulling and storing images”!
Note: for other scenarios, you may have to turn it OFF.
I can’t explain these conflicting instructions.
As of 2023, this is a Docker-on-MacBook issue, resolvable only by Docker + Apple Corp.
Running out of Disk-space or Memory?
Screenshot below shows the recommended “high” settings.
After building images, you can reduce:
- “CPU” can be lowered to “2”.
- “Memory” can be lowered to “4GB”.
FYI only - To run on a MacBook-M1, many amd64 emulated containers like Neo4j v4.x will frequently fail, unless you provide Docker with a minimum of 5 cpus and 8GB of RAM!
End of Article.