huimin liao
Computing-in-memory technology utilized in multimodal applications: ISSCC 2023 16.1 MulTCIM in detail

Multimodal models, i.e., neural network models able to understand mixed signals from different modalities (e.g., vision, natural language, speech), are one of the most important directions in the development of today's AI models. The paper presented here, "16.1 MulTCIM: A 28nm 2.24μJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers," by Dr. Fengbin Tu of the School of Integrated Circuits at Tsinghua University and the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology (HKUST), proposes a digital computing-in-memory core design that supports the computation of multimodal Transformer models.
1. Basic Information about the Article [1]
The ultimate goal of neural network models is to achieve human-like perception and processing capabilities, and multimodal models have been proposed for this purpose; the most representative of them is the multimodal Transformer. However, current multimodal Transformer models face the following three sparsity challenges when executed on hardware:
(1) Attention sparsity: the attention matrix, an essential part of the Transformer model, exhibits irregular sparsity, which can lead to long reuse distances. For example, in the ViLBERT-base model, the sparse attention pattern covers 78.6% to 81.7% of the tokens. To support such operations, a large number of weights must be kept in the CIM cores for a long time, and the utilization of these weights is extremely low;
(2) Token sparsity: although the computation can be reduced by token pruning, the different token lengths of different modalities lead to computational idleness or pipeline delays in cross-modal attention layers;
(3) Bit sparsity: activation functions such as Softmax and GELU generate many values close to zero, which increases the sparsity of the data to be processed, so the effective bitwidth of a given set of inputs to the CIM cores varies constantly. The bit-serial multiply-accumulate scheme of traditional CIM therefore makes the computation time bound by the longest bitwidth (a toy illustration of this effect follows this list).
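To make the bit-sparsity point concrete, here is a minimal Python sketch (not code from the paper) that quantizes one row of Softmax outputs to INT8 and measures how many bits each value actually needs; the quantization scheme, sizes, and random data are illustrative assumptions.

```python
import numpy as np

def effective_bitwidth(x_int):
    """Bits needed to represent |x| (0 for a zero value)."""
    x_abs = np.abs(x_int).astype(np.int64)
    return np.where(x_abs == 0, 0,
                    np.floor(np.log2(np.maximum(x_abs, 1))).astype(np.int64) + 1)

rng = np.random.default_rng(0)
scores = rng.normal(size=64)                              # one row of attention scores
probs = np.exp(scores) / np.exp(scores).sum()             # Softmax: most outputs are near zero
q = np.round(probs / probs.max() * 127).astype(np.int8)   # toy INT8 quantization

eb = effective_bitwidth(q)
# A bit-serial CIM still spends 8 cycles per element, because it is bound by
# the longest bitwidth even though the average effective bitwidth is small.
print(f"mean effective bitwidth: {eb.mean():.2f} bits, max: {eb.max()} bits")
```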


Fig. 1 Challenges presented in the article and the corresponding solutions[2]
In response to the above problems, this paper proposes three targeted solutions:
(1) To address the long reuse distances caused by the irregular sparsity of the attention matrix, the paper proposes a Long Reuse Elimination Scheduler (LRES). LRES splits the attention matrix into a global + local sparsity pattern: globally shared attention weight vectors are kept in the CIM for a longer time, while locally used weight vectors are consumed and updated frequently, instead of generating the Q, K, and V tokens sequentially as in a traditional Transformer. This eliminates unnecessary long reuse distances and improves the utilization of the CIM cores;
(2) To address the computational idleness and pipeline delays caused by the different token lengths of different modalities, the paper proposes a Runtime Token Pruner (RTP) and a Modal-Adaptive CIM Network (MACN). The RTP removes unimportant tokens, while the MACN dynamically switches between modalities in the attention layers, reducing the idle time of the CIM and the latency of generating Q and K tokens;
(3) To address the variation of the longest effective bitwidth caused by the sparsity introduced by the activation functions, the paper introduces an Effective Bitwidth Balanced CIM (EBB-CIM) macro. EBB-CIM detects the effective bitwidth of each element in the input vector and performs bit balancing in the in-memory MACs: bits of elements with longer effective bitwidths are reallocated to elements with shorter effective bitwidths, which makes the overall input bitwidth more uniform and reduces the computation time.
2. Analysis of the Content of the Paper [1]
Below, the innovations of the article are detailed with respect to the three sparsity challenges identified by the authors:
(1) LRES
LRES contains three parts that work in sequence:
1) Attention sparsity manager: stores the initial sparse attention pattern and updates it at runtime based on the token pruning information. In this step, the manager identifies the Q and K vectors that generate extensive attention, since these vectors need to stay in the CIM cores for a longer time to improve CIM utilization (a toy sketch of this global/local split follows the list);
2) Local attention sequencer: reorders the remaining attention matrix and the corresponding Q and K vectors, where K serves as the weights and Q as the input vectors that are frequently consumed and switched in the CIM. In other words, K vectors are frequently replaced by newly generated ones, which reduces CIM idleness;
3) Reshape Attention Generator: generates configuration information based on the outputs of the first two steps, which is used to optimize the workflow of the CIM core.
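As a rough illustration of the global/local idea in steps 1) and 2), the sketch below ranks K vectors by how many queries attend to them under a sparse attention mask and keeps the most widely used ones resident while streaming the rest; the threshold, names, and random mask are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def split_global_local(attn_mask, global_ratio=0.25):
    """attn_mask: (num_q, num_k) boolean sparse attention pattern."""
    usage = attn_mask.sum(axis=0)                  # how many queries touch each K vector
    num_global = max(1, int(global_ratio * attn_mask.shape[1]))
    order = np.argsort(-usage)                     # most widely used K vectors first
    global_k = np.sort(order[:num_global])         # keep resident in CIM: long reuse is acceptable
    local_k = np.sort(order[num_global:])          # consume back-to-back to cut reuse distance
    return global_k, local_k

rng = np.random.default_rng(1)
mask = rng.random((32, 32)) < 0.2                  # toy irregular sparse attention pattern
resident, streamed = split_global_local(mask)
print("resident (global) K columns:", resident)
print("streamed (local) K columns :", streamed)
```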

Fig. 2 Schematic structure of LRES

(2) RTP and MACN
The RTP and MACN modules, which target token sparsity, are shown in Fig. 3. The RTP module is mainly responsible for removing irrelevant tokens, while the MACN dynamically divides all CIM cores into two pipeline stages: StageS for the static matrix multiplications (MM) in Q, K, and V token generation, and StageD for the dynamic MM in the attention computation. The two modules are analyzed in detail below.
First, since the class (CLS) token characterizes the importance of the other tokens, the RTP receives the CLS scores of the previous layer and selects the top-n most important tokens for the current layer. The MACN consists of a Modal Workload Allocator (MWA), 16 CIM cores, and a pipeline bus. At runtime, the MWA divides the CIM cores into StageS and StageD and pre-assigns the StageS weights according to an allocation table. For cross-modal switching, the traditional approach computes the modalities in turn, and the different modal parameters leave many CIM macros idle during the switch; the MACN instead exploits modal symmetry to overlap the generation of multimodal Q and K tokens and reduce latency. Concretely, the 4:1 activation structure of the CIM stores the weights of multiple modalities in one macro and switches modalities by time multiplexing: during time 1 ~ N_X, the MACN is in the Phase1 state and, in the example, Core1 stores W_QX and W_QY; during time N_X ~ N_Y, the MACN switches to the Phase2 state and Core1 activates W_QY to generate Q_Y; during time N_Y ~ N_X + N_Y, the MACN switches to the Phase3 state and Core1 activates W_QY and W_KX to generate Q_Y and K_X. Modal symmetry thus allows the generation of Q_Y and K_X to complete simultaneously with better CIM utilization.
The final results show that the RTP reduces the latency of unimodal and cross-modal attention by factors of 2.13 and 1.58, respectively, and that modal symmetry provides an additional 1.69x speedup for cross-modal attention.
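The top-n selection performed by the RTP can be sketched in a few lines; the shapes, keep_n, and random CLS scores below are illustrative assumptions rather than the accelerator's actual datapath.

```python
import numpy as np

def prune_tokens(tokens, cls_scores, keep_n):
    """tokens: (num_tokens, dim); cls_scores: (num_tokens,) CLS-to-token attention."""
    keep = np.sort(np.argsort(-cls_scores)[:keep_n])   # top-n tokens, original order preserved
    return tokens[keep], keep

rng = np.random.default_rng(2)
x = rng.normal(size=(196, 64))       # e.g. image-patch tokens entering the current layer
cls_attn = rng.random(196)           # CLS scores taken from the previous layer
x_kept, kept_idx = prune_tokens(x, cls_attn, keep_n=64)
print(x_kept.shape)                  # (64, 64): fewer tokens feed the next attention layer
```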


Fig. 3 Schematic diagram of the structure of RTP and MACN

(3) EBB-CIM
The EBB-CIM macro, which targets bit sparsity, is shown in Fig. 4. It consists of 32 EBB-CIM arrays, an effective bitwidth detector, a bit equalizer, and a bit-balanced feeder. Each EBB-CIM array contains 4 × 64 6T-SRAM bit cells (8 groups) and a cross-shift multiply-accumulate tree (Cross-Shift MAC Tree). EBB-CIM adopts an all-digital CIM architecture with 4:1 activation, which achieves high computational accuracy while maintaining memory density. At runtime, the detector receives the inputs and detects their effective bitwidths (EB); the bit equalizer computes the average EB, allocates bits from long-EB data to short-EB data, and generates a bit-balanced input sequence; the bit-balanced feeder takes this sequence and generates the cross-shift configuration. In addition, the EBB-CIM can be reconfigured for INT16 by fusing every two INT8 operations.
The final results show that, compared with conventional bit-serial CIM, EBB-CIM reduces the latency of Softmax-MM, GELU-MM, and the whole encoder by factors of 2.38, 2.20, and 1.58, respectively, with only 5.1% power overhead and 4.6% area overhead.
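A simplified model of the effective-bitwidth balancing idea is shown below: a bit-serial MAC group is limited by the longest effective bitwidth of its inputs, while an ideally balanced schedule is limited only by the average. The comparison here is a toy model, not the EBB-CIM circuit, and the activation distribution is an assumption.

```python
import numpy as np

def effective_bitwidth(x):
    x = np.abs(x.astype(np.int64))
    return np.where(x == 0, 0, np.floor(np.log2(np.maximum(x, 1))) + 1).astype(int)

def bit_serial_cycles(eb):
    return int(eb.max())              # bit-serial CIM: bound by the longest element

def bit_balanced_cycles(eb):
    return int(np.ceil(eb.mean()))    # ideal balancing: bound by the average bitwidth

rng = np.random.default_rng(3)
acts = np.clip(rng.exponential(scale=6.0, size=64), 0, 127).astype(np.int8)  # mostly small values
eb = effective_bitwidth(acts)
print("bit-serial cycles  :", bit_serial_cycles(eb))
print("bit-balanced bound :", bit_balanced_cycles(eb))
```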


Fig. 4 Schematic structure of EBB-CIM[1]
3. Multimodal Models
(1) Concepts and principles
Multimodal models refer to models that are capable of processing and understanding multiple types of data, such as text, images, audio, and video. Compared to single-modal models, multimodal models are able to fuse information from different modalities, thus improving the accuracy and comprehensiveness of information understanding and task processing.
The core principle of multimodal models lies in cross-modal information fusion and collaborative processing; the main steps are as follows (a toy end-to-end sketch follows this list):
1) Data representation: converting data from different modalities into a form that can be processed by the model. Usually a specific encoder is used to represent the data of each modality as vectors or embeddings;
2) Feature extraction: extracting meaningful features from the data of each modality, for example using Convolutional Neural Networks (CNNs) for images and Recurrent Neural Networks (RNNs) or Transformer architectures for text;
3) Cross-modal alignment: establishing associations between different modalities, e.g., by aligning timestamps or utilizing shared attention mechanisms to ensure that information from different modalities can be effectively fused;
4) Information fusion: the aligned multimodal features are fused, and commonly used methods include simple splicing, weighted summation, and the use of more complex fusion networks;
5) Decision making and output: task processing and decision making output, such as classification, generation or retrieval, through the fused features.
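The five steps above can be sketched end to end with toy linear encoders; in practice the encoders would be a CNN/ViT for images and a Transformer for text, and all dimensions, weights, and the concatenation-based fusion below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 128                                               # shared embedding width

def encode(x, w):                                     # steps 1-2: representation + feature extraction
    return np.tanh(x @ w)

W_img = rng.normal(size=(2048, d)) * 0.01             # stand-in for an image encoder
W_txt = rng.normal(size=(768, d)) * 0.01              # stand-in for a text encoder
img_feat = encode(rng.normal(size=(1, 2048)), W_img)  # pooled image features
txt_feat = encode(rng.normal(size=(1, 768)), W_txt)   # pooled text features

# Step 3: alignment by projecting both modalities into the same d-dimensional space.
# Step 4: fusion by simple concatenation, one of the options mentioned above.
fused = np.concatenate([img_feat, txt_feat], axis=-1)

# Step 5: decision and output, e.g. a linear classifier over the fused features.
W_cls = rng.normal(size=(2 * d, 10)) * 0.01
logits = fused @ W_cls
print("predicted class:", int(logits.argmax()))
```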
(2) Applications and prospects
Multimodal models have a wide range of applications in many domains. The most typical examples are Visual Question Answering and Image Captioning, e.g., giving ChatGPT a picture and asking it to explain its meaning (Fig. 5), or giving it a passage of text and asking it to generate an image (Fig. 6).


Fig. 5 The visual question answering function of ChatGPT


Fig. 6 Image description generation function of ChatGPT
In addition, video generation models such as Sora (released in February) and Vidu (released in April) provide video captioning, and ChatGPT-4o (released last week) adds multimodal sentiment analysis, cross-modal retrieval, multimodal translation, and more. All of these capabilities rely on large multimodal models.

Fig. 7 Sentiment analysis function demonstrated at the launch of ChatGPT-4o

Multimodal models bring larger networks, a dramatic increase in parameters, and higher training costs, all of which challenge traditional chip architectures; computing-in-memory technology is well suited to these problems. It offers higher energy efficiency, computational efficiency, and data-processing parallelism, together with lower transfer latency and compute power consumption. These features give computing-in-memory chips an advantage in scenarios such as multimodal model training and inference and make them a promising replacement for the traditional von Neumann architecture as the architecture of choice for the next generation of AI chips. China's ZhiCun Technology has been working in the computing-in-memory chip field for many years: since releasing the world's first computing-in-memory chip product, WTM1001, in November 2019, it has within five years brought the WTM1001 to mass production, completed validation and small-batch trial production of the world's first computing-in-memory SoC chip, WTM2101, and moved the new-generation WTM-8 series of computer-vision chips into mass production. In the future, computing-in-memory chips are expected to play an even greater role in the multimodal model field and provide strong support for the wide application of multimodal models.

References:
[1] Tu F, Wu Z, Wang Y, et al. 16.1 MulTCIM: A 28nm 2.24μJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers[C]//2023 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2023: 248-250.
[2] Tu F, Wu Z, Wang Y, et al. MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity[J]. IEEE Journal of Solid-State Circuits, 2023.
