mahesh_attarde

Posted on Apr 5, 2023

Essense of Machine Learning Hardware

#mlhardware

Interesting Notes from my work.

Here are few design considerations for machine learning asic. Design traits will achieve efficiency/effectiveness per watts.

Feed-forward ANN are faster since no data feedback simplfied desgin
LSTM require accumlation of partial results, increases complexity
Type of Neural Network affects simplifications to hardware, Usage of domain knowledge.

Common Sense Idea : Inference and Training hardware are different (most time)

Hardware for training includes input with large dataset, cost function , optimizations function and model
- Large dataset implies more memory for features, bag of features
- cost and optimzation function use are feedback to improve model with indentifying error. implies "latency of feedback".
- Support for debugging model to look for training errors, generalization errors implies "profiling interfaces"
- Should have support for workflows
Hardware for inference
- It includes input as dataset, not as large as training dataset and model.
- Model can be compressed based on domain of use.

Common Sense Idea: Use known information to minimize data set

Use of Encoding.
Use of Precise data storage. Less Register Size implies better energy/storage efficiency E.g. ML Model with human Age as input parameter. Ideally is 32 bit int. but age of human is < 100 implies 7 bit size. After understanding use cases, model only understand age ranges with widht of 10. implies 4 bit data.
Use of Data quantization. e.g. bfloat16 or MSFP (great common sense idea)

For Vision, audio signals as input, FFT has saved computation significantly!!! ( WOOWWWW!)

List All Unit Operations and minimize to bare minimum required. implies minimum area
For each operation minimize register size -> aguments datatype
For each operation minimize microcode with basics e.g. MAC Design (Multiply and Add) MAC for N bit x 2 regs, with result if 2N muls and add to N again Less Register Size, less microcode implies better energy/storage efficiency
Unit Operations in ALU
- Granularity of Unit Operation designed in hardware should be Maximizing paralization, at cost of space, later in time.
- Examples of Unit Operation
- Convolution aka (dot product) , Matrix Multiply. (Learnings from CPU based MatMul) lead to systolic execution of MatMul (MAC) over General MatMul (GMEM).
- Pooling aka ( Normalization) , thresholding (Learning from Comparator design in HDL) Creating minimized Comparator logic ckt. (Spatial Reduction Case)

More Cycles to get data, implies More stalls in ALU, implies high Energy
More frequent access to to distant memory in Memory heirarchy, Explosively increases energy consumption
Always Pipeline data, Move data in Spatial and Temporal Ways.
- Machine Learning Programming Langauge needs Tensor Loops for Time and Space maximization
Decoding Cost of MOV
- Reg < Cache < Buffer < DRAM
- Large Register size better
- Double Buffering is good
- RAM technology. SRAM, DRAM, HBM
- Buffer Specialization, Read only buffers and write only buffer perform better than R/W
- Quantifiable Design Concept

CPU/GPU Decoders are complex for ML Ops.inefficiency into Pipelining Operations CKT
Short Decoder sequences better, small, efficient for only Unit Ops.
SIMD decoder <> VLIW decoder ckt

Input (Tensor) Slicing at programming lang and backend level
Re-organize Loop order, polyhydrals for better memory movements, split loops into more loops.
Optimize for Area x Energy

High Frequency, high throughput results in heating
Low frequency, high thougput, optimal area serves better.
Diaelectric Silicon technology in nm
Photonic Technology
Calculate Compute Density , Compute to data movement ratio to classify IO dominant or Compute Dominant

DEV Community