Compact small files
- What does it meant to compact small files?
- Combine a lot of small files into one file
- Improves speed of read queries
- Can be done from a Copy job in ADF/Synapse or incremental load
- Also available in a Delta Lake feature
- Using a Copy job
- Source is the directory with all of the small files
- Select using a wildcard (/directory/*)
- Use the Copy behavior to merge the files
- Using Delta Lake
- Use OPTIMIZE feature
- Done via query in Spark SQL:
OPTIMIZE delta.`/data/events`
OPTIMIZE delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/path-to-data
Handle skew in data
- Skew
- An uneven distribution of data
- Data skew can unbalance compute nodes, lowering performance
- Avoid by balancing parallel processing with correct table distribution (hash or round-robin)
-
Detect skew in distributed table (database consistency check)
- DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
- Resolve data skew
- Research
- Monitor query impact
- Weigh the cost of minimizing
- Solution
- Re-create table with a new distribution column set
- CREATE TABLE AS SELECT (CTAS)
- Research
Handle data spill
- Data spill is when compute engine is unable to hold data in memory and writes ("spills") data to disk
- Impact is expensive disk reads/writes and longer execution times
- Occurs when
- Partition size is too big
- Compute resource size is small
- Data size during merges, unions, etc exceeds memory limit of the compute node
- Identifying data spill
- Synapse SQL - TempDB runs out of space and throws error (monitor with DMVs)
- Spark - view task summary screen under spill column
- Handling the spill
- Increase compute capacity
- Reduce partition size
- Remove skews in data
Optimize resource management
- Optimize Synapse SQL Pools
- Pause when not in use
- Use the right compute unit (DWU) for workload
- Leverage Azure Functions to scale out workload
- Optimize Spark
- Select autoscale option in cluster setup
- Select auto-terminate
- Use spot instances
- Right-size cluster nodes based on memory, CPU intensive, etc
Tune queries by using indexers
- Types of indexes
- Clustered columnstore index
- Default in SQL pool table
- Use for tables > 100 million rows
- Good performance and data compression
- Clustered index
- Good for specific filter conditions
- Use for tables between 100 and 100 million rows
- Heap index
- Use for staging tables
- Clustered columnstore index
- Maintain by rebuilding indexes when seeing performance degradation in existing indexes
- Indexes in Spark Pool
- Spark does not have an inbuilt index
- Uses Hyperspace (or Hyperscale) - ability to create indexes on datasets (CSV, JSON, parquet)
- Works via API
- Criteria
- Contains filter on predicates
- Contains a join that requires heavy shuffles
Tune queries by using cache
- Caching stores frequently accessed data in memory or disk for faster retrieval
- Caching in Synapse SQL
- Result set caching
- Off by default
-
Enabled at database or session level
- DB: ALTER DATABASE SET RESULT_SET_CACHING ON
- Session: SET RESULT_SET_CACHING { ON | OFF }
- Faster query performance
- Max size of 1 TB per database
- Requirements
- User running the query has access to tables used in the query
- Cached query and new query have to be an exact match
- No changes to the table's data or schema where cache was generated from
- Result set caching
- Caching in Spark
- RDD (resilient distributed dataset
- DataFrame
- DataSets
-
Cache methods
- .persist()
- .cache()
- CACHE TABLE
Troubleshoot a failed Spark job
- Debug the issue within the environment and within the job
- Environment
- Confirm the region the cluster is in is not down (status.azure.com)
- Use HDInsight Ambari Dashboard to view cluster health
- Are clusters using high CPU or memory?
- Jobs
- Driver logs
- Task logs
- Executor logs
Troubleshoot a failed pipeline run, including activities executed in external services
- Use Output section of pipeline details to see job status
- To the right of the failed message there are more error details
- Examine the detailed error message for failed activities
Top comments (0)