DEV Community

Cover image for Analyze Sydney property prices with Amazon Redshift Serverless in GA
Wendy Wong for AWS Community Builders

Posted on • Edited on

Analyze Sydney property prices with Amazon Redshift Serverless in GA

Are you concerned about Sydney property prices?

With Covid-19, tech workers have the freedom to work remotely permanently or as they return to the office in a hybrid environment.

Thus employees have opted for a lifestyle that includes the beach, country-side, city, suburbia or with more cultural amenities.

With low unemployment figures in Sydney, should upwardly mobile professionals stay in Sydney or move to a different city?

Lesson Outcomes

  • You will learn how to create a data warehouse with Amazon Redshift Serverless (Generally Available)

  • Discover data with AWS Glue Data Catalog

  • Query your data with Amazon Redshift Serverless (Generally Available)

What is Amazon Redshift Serverless?

Amazon Redshift Serverless is generally available since 12 July 2022, you may view the announcement here and this allows you to build a data warehouse without having to provision a Redshift cluster and it can easily scale to your usage needs.

As a data scientist or data analyst you can use Query Editor v2 to query the data and answer analytic questions.

You may use with Amazon Redshift Severless in GA and access $300 free AWS credits for 90 days in the following AWS regions as at 12 July 2022:

  • US East (Ohio, N. Virginia)
  • US West (N. California, Oregon)
  • Europe (Frankfurt, Ireland)
  • Asia Pacific (Tokyo)
  • Europe (London)
  • Europe (Stockholm)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)

You may refer to the pricing of Amazon Redshift Serverless here. You only pay for the compute resources that you use with the pay as you go model.

What are the use cases?

You don't need a data engineer to help you provision data warehouse resources, whether you work for a startup or enterprise you can access:

  • Self-service analytics
  • Generate insights that can auto scale

Solution Architecture

This is my summary of the solution architecture for Amazon Redshift Serverless and Amazon Redshift ML

serverk

redshift Ml

Dataset

This open source dataset set 'Sydney House Prices' is provided here by Kaggle.com from 2000 to 2019.

Data Dictionary

Column Data Type

Date, Date
Id, String
suburb, String
postalCode, String
sellPrice, Integer
bed, Integer
bath, Integer
car, Integer
propType, Integer

Tutorial: Getting started with Amazon Redshift Serverless

This tutorial follows the instructions provided by the documentation here.

Step 1: Sign into the AWS Management Console as an IAM user.

IAM

Step 2: Change the AWS region to US-East-2 (Ohio)

Step 3. Upload the Sydney house prices dataset into an Amazon S3 bucket.

bucket

bucket uploaded

Create two folders in the S3 bucket for data transformations name them 'raw data' and 'processed'.

folders

Step 4: Create a crawler from the AWS Glue console by selecting a data source.

crawler

Step 5: I used an existing IAM role for AWS Glue

IAM

Create a database.

database

Step 6: Review the Crawler and click Run Crawler.

 review crawler

The processing job of the crawler is completed.

job completed

Refresh your web browser and navigate to Database -> Table

Review the schema created from the raw data.

created schema

Edit the schema to change the data type for the variables Date and postalCode to Date and String respectively.

edit schemea

You may review the updated schema changes.

schema update

Step 7: Navigate to Amazon Redshift in the search bar and click Try Amazon Redshift Serverless

ohion

Step 8: Select default settings to create a Redshift cluster.
Because I nominated the AWS region Ohio, I will receive $300 free AWS credits over the 90-day free trial for AWS Redshift Serverless service.

cluster

Under the default settings, a namespace and workgroup will be automatically created by Amazon Redshift Serverless.

namespace

Step 9: Create a default IAM role which can access any Amazon S3 bucket.

Iam rileon

Refresh your web browser, you will be able to see that the IAM policy 'Amazon Redshift Full Access' is successfully associated with the IAM user role. The VPC and security groups have been created.

Iassociate

Step 10: Click Save Configuration.

The Redshift cluster will take a few minutes to process.

dashboard

Note: The 90-day free trial of Amazon Redshift Serverless will expire on 2 April 2023, please take note of your free trial expiration date.

Step 11: Click on Query Editor to access Query Editor v2

query edit

Step 12: Double-click on Serverless:default and dev database.

Create a table.

CREATE TABLE sydney_houseprices (
    date DATE,
    id VARCHAR(40),
    suburb VARCHAR(100),
    postalcode VARCHAR(40),
    sellprice bigint,
    bed bigint,
    bath bigint,
    car bigint,
    proptype VARCHAR(40))
    ;
Enter fullscreen mode Exit fullscreen mode

Step 14: Load data into a new table using the COPY command for Amazon Redshift.

COPY sydney_house_prices 
FROM 's3://sydney-house-price/processed/SydneyHousePrices1.csv' 
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXX:role/service-role/AmazonRedshift-CommandsAccessRole-XXXXXXXXXXXX' 
delimiter ','
dateformat 'YYYY-MM-DD'
IGNOREHEADER 1;
Enter fullscreen mode Exit fullscreen mode

load data

Note: If you have trouble loading data into the table. Try these tips:

  1. Change your input csv file date format from Australian to American date format i.e. 'YYYY-MM-DD'

  2. Include the SQL statement 'IGNOREHEADER 1' to skip loading the header

  3. Check for load errors:

SELECT * 
FROM stl_load_errors
Enter fullscreen mode Exit fullscreen mode

Step 15: Check that data has successfully loaded into the table. Preview the first 10 rows.

SELECT * 
FROM sydney_house_prices 
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Step 16: Query your new table with your own analytical questions.

Q1. What was the most expensive suburb in the year 2000?

SELECT MAX(sellprice), suburb, DATEPART(year,date) AS year, postalcode AS postcode, date, id
FROM sydney_house_prices
WHERE year = '2000'
GROUP BY 2, 3, 4, 5,6
;
Enter fullscreen mode Exit fullscreen mode

The suburb Bondi Beach had the most expensive property sold in the year 2000 at $3.5 million.

Bondi Beach

Q2. What was the most expensive suburb in the year 2019?

SELECT MAX(sellprice), 
suburb, 
DATEPART(year,date) AS year, 
postalcode AS postcode, 
date, 
id,
proptype,
FROM sydney_house_prices
WHERE year = '2019' AND proptype = 'house'
GROUP BY 2, 3, 4, 5,6,7
ORDER BY 1 DESC
;
Enter fullscreen mode Exit fullscreen mode

Rose Bay was the most expensive suburb in the year 2019 with a property sold for $13.4 million.

redshift serverless

Q3. What was the average price of a house in Concord in 2019 with two car park spots and two bathrooms?

SELECT AVG(sellprice), 
suburb, 
DATEPART(year,date) AS year, 
proptype,
bed, 
car,
bath,
FROM sydney_house_prices
WHERE year = '2019' AND 
proptype = 'house' 
AND suburb = 'Concord'
AND bath =2
AND car = 2
GROUP BY 2, 3,4,5,6
;
Enter fullscreen mode Exit fullscreen mode

The average price of a house in Concord in the year 2019 with two car park spots and two bathrooms is $1,968,000.

Q3

Q4. Which suburb sold the most properties in the year 2019 and what was the average property price?

SELECT COUNT(proptype) AS propertycount, 
AVG(sellprice) AS avg_pricesold,
suburb, 
DATEPART(year,date) AS year, 
postalcode AS postcode, 
date,
FROM sydney_house_prices
WHERE year = '2019' 
GROUP BY 3, 4,5,6
ORDER BY 1 DESC
;
Enter fullscreen mode Exit fullscreen mode

The suburb Warriewood sold the most properties in 2019 with an average selling price of $1,038,136.

Q4

Q5. Where can I live with a budget of between $1,200,000 and $1,500,000 in the year 2019 with three bedrooms, two bathrooms and one car park spot?

SELECT sellprice, 
suburb, 
DATEPART(year,date) AS year, 
car,
bath,
bed,
FROM sydney_house_prices
WHERE year = '2019' AND 
sellprice BETWEEN 1200000 AND 1500000
AND bed = 3
AND bath = 2
AND car = 1
;
Enter fullscreen mode Exit fullscreen mode

Within this budget, in 2019 I can afford to live in the suburbs
Maroubra, Turramurra, Cremorne, Ryde, Sans Souci and Allambie Heights.

budget

Amazon Redshift Serverless (Generally Available) - Clean Up Resources

If you have finished using Amazon Redshift Serverless, you should follow best practice and clean up your resources by following the steps in the AWS documentation at this link.

I encourage you to make the most of your Redshift Serverless free-trial in GA which is 90 days and keep exploring with your $300 AWS credits.

AWS Innovate - Coming Soon!

If you would like to learn more about AWS data or AI/ML please join one of the FREE online conferences in your city.

Register at this link

  • Asia Pacific and Japan on 22 February 2023

APJ

  • EMEA on 9 March 2023

EMEA

  • Americas on 14 March 2023 (this includes Pi Day celebrations)

march 14

References

Until the next lesson, Happy Learning! 😁

Top comments (4)

Collapse
 
valaug profile image
Augusto Valdivia

Interesting used case :)

Collapse
 
abc_wendsss profile image
Wendy Wong

Thanks Augusto for your kinds words :)

Collapse
 
lynnlangit profile image
Lynn Langit

well-written - thanks!

Collapse
 
abc_wendsss profile image
Wendy Wong

Thanks Lynn for your kind words :)