I am currently a PhD candidate in the CSE Department at UC San Diego, advised by Prof. Tajana Rosing. I am a member of the
System Energy Efficiency Lab (SEELab)
at UCSD, where I research novel big data acceleration techniques. I am specifically interested in the circuits, architecture, and systems aspects of processing in memory.
PhD in Computer Science and Engineering (Expected June 2021)
University of California, San Diego
MS in Electrical and Computer Engineering, 2018
University of California, San Diego
BE (Hons) in Electrical and Electronics Engineering, 2016
Birla Institute of Technology and Science (BITS) - Pilani, K.K. Birla Goa Campus
Publications
Conference Papers:
[C30] Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana Rosing, "Digital-based Processing In-Memory for Acceleration of Unsupervised Learning," GOMACTech Conference, 2021.
Today's applications generate a large amount of data that need to be processed by learning algorithms. In practice, the majority of the data are not associated with any labels. Unsupervised learning, i.e., clustering methods, are the most commonly used algorithms for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to a large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way. DUAL supports a wide range of essential operations and enables in-place computations, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms for a wide range of large-scale datasets. Our evaluation shows that DUAL provides a comparable quality to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides 58.8x speedup and 251.2x energy efficiency improvement as compared to the state-of-the-art solution running on GPU.
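As a rough illustration of the transformation DUAL relies on, the sketch below (Python/NumPy, using an illustrative random-projection encoder and dimensionality rather than DUAL's actual design) maps data points to binary hypervectors and then clusters them with Hamming distance and bitwise majority voting, the memory-friendly counterparts of Euclidean distance and centroid averaging.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 4096          # hypervector dimensionality (illustrative choice)

    def encode(x, proj):
        """Map a real-valued vector to a binary hypervector via random projection."""
        return (proj @ x > 0).astype(np.uint8)

    def hamming(a, b):
        """Bitwise distance metric that maps well onto in-memory XOR + popcount."""
        return int(np.count_nonzero(a ^ b))

    def kmeans_hd(points, k, iters=10):
        """k-means-style clustering in HD space: Hamming distance + bitwise majority."""
        centers = [points[i] for i in rng.choice(len(points), k, replace=False)]
        for _ in range(iters):
            labels = [min(range(k), key=lambda c: hamming(p, centers[c])) for p in points]
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:  # bitwise majority vote replaces centroid averaging
                    centers[c] = (np.sum(members, axis=0) * 2 >= len(members)).astype(np.uint8)
        return labels, centers

    # toy data: two Gaussian blobs in 16 features
    X = np.vstack([rng.normal(0, 1, (50, 16)), rng.normal(4, 1, (50, 16))])
    proj = rng.normal(size=(D, 16))
    hv = [encode(x, proj) for x in X]
    labels, _ = kmeans_hd(hv, k=2)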
Digital processing in-memory (DPIM) is a promising technology that significantly reduces data movements while providing high parallelism. In this work, we design and implement the first full-stack DPIM simulation infrastructure, DP-Sim, which evaluates a comprehensive range of DPIM-specific design space with respect to both software and hardware. DP-Sim provides a C++ library to enable DPIM acceleration in general programs while supporting several aspects of software-level exploration by a convenient interface. The DP-Sim software front-end generates specialized instructions that can be processed by a hardware simulator based on a new DPIM-enabled architecture model which is 10.3% faster than conventional memory simulation models. We use DP-Sim to explore the DPIM-specific design space of acceleration for various emerging applications. Our experiments show that bank-level control is 11.3x faster than conventional channel-level control because of higher computing parallelism. Furthermore, cost-aware memory allocation can provide at least 2.2x speedup vs. heuristic methods, showing the importance of data layout in DPIM acceleration.
Today's applications generate a large amount of data that need to be processed by learning algorithms. In practice, majority of the data are not associated with any labels. Unsupervised learning, i.e., clustering methods, are the most commonly used algorithms for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to a large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way. DUAL supports a wide range of essential operations and enables in-place computations, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms for a wide range of large-scale datasets. Our evaluation shows that DUAL provides a comparable quality to existing clustering algorithms while using a binary representation and a simplified distance metric.
[C27] Saransh Gupta, Mohsen Imani, Behnam Khaleghi, Niema Moshiri, and Tajana S. Rosing, "RAPIDx: A ReRAM Processing in-Memory Architecture for DNA Short Read Alignment," 70th Virtual Meeting of The American Society of Human Genetics (ASHG), 2020.
Sequence alignment is a core component of many biological applications. As the advancement in sequencing technologies produces a tremendous amount of data on an hourly basis, this alignment is becoming the critical bottleneck in bioinformatics analysis. Even though large clusters and highly-parallel processing nodes can carry out sequence alignment, in addition to the exacerbated power consumption, they cannot afford to concurrently process the massive amount of data generated by sequencing machines. We propose a novel processing in-memory (PIM) architecture suited for DNA short-read sequence alignment, called RAPIDx. We revise the state-of-the-art alignment algorithm to make it compatible with in-memory parallel computations, and process DNA data completely inside memory without requiring additional processing units. The main advantage of RAPIDx over other alignment accelerators is a dramatic reduction in internal data movement while maintaining a remarkable degree of parallelism provided by PIM. The proposed architecture is also highly scalable, facilitating the precise alignment of a large number (hundreds) of short sequences in parallel. We also propose a pre-processing step, which allows us to practically parallelize alignment of 55% more reads than the state-of-the-art approach. We evaluated the efficiency of the proposed architecture over a dataset with up to 50M reads of 150bp each, aligned with the human genome as the reference. Our accelerator is 32x faster than DRAGEN running on a 12-core CPU server.
Hyperdimensional computing (HDC) is a brain-inspired computing paradigm that works with high-dimensional vectors, hypervectors, instead of numbers. HDC replaces several complex learning computations with bitwise and simpler arithmetic operations, resulting in a faster and more energy-efficient learning algorithm. However, it comes at the cost of an increased amount of data to process due to mapping the data into high-dimensional space. While some datasets may nearly fit in memory, the resulting hypervectors more often than not cannot be stored in memory, resulting in long data transfers from storage. In this paper, we propose THRIFTY, an in-storage computing (ISC) solution that performs HDC encoding and training across the flash hierarchy. To hide the latency of training and enable efficient computation, we introduce the concept of batching in HDC. It allows us to split HDC training into sub-components and process them independently. We also present, for the first time, on-chip acceleration for HDC, which uses simple low-power digital circuits to implement HDC encoding in Flash planes. This enables us to exploit the high internal parallelism provided by the flash hierarchy and encode multiple data points in parallel with negligible latency overhead. THRIFTY also implements a single top-level FPGA accelerator, which further processes the data obtained from the chips. We exploit the state-of-the-art INSIDER ISC infrastructure to implement the top-level accelerator and provide software support for THRIFTY. THRIFTY runs HDC training completely in storage while almost entirely hiding the latency of computation. Our evaluation over five popular classification datasets shows that THRIFTY is on average 1612x faster than a CPU server and 14.4x faster than the state-of-the-art ISC solution, INSIDER, for HDC encoding and training.
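The batching idea can be sketched independently of the flash hardware: each batch accumulates partial class hypervectors on its own, and the partial models are then merged (here by element-wise addition, the usual way HD training accumulates; an assumption, since the abstract does not spell out the merge step). The encoder and sizes below are illustrative placeholders, not THRIFTY's actual encoder or on-chip pipeline.

    import numpy as np

    rng = np.random.default_rng(1)
    D, F, C = 2048, 32, 4          # hypervector size, features, classes (illustrative)
    proj = rng.normal(size=(D, F)) # stand-in encoder; THRIFTY's encoder runs in flash planes

    def encode(x):
        return np.where(proj @ x > 0, 1, -1).astype(np.int32)

    def train_batch(batch):
        """Train on one batch independently: accumulate encodings per class."""
        partial = np.zeros((C, D), dtype=np.int64)
        for x, y in batch:
            partial[y] += encode(x)
        return partial

    # split the training set into batches that could be processed in parallel
    data = [(rng.normal(size=F), rng.integers(C)) for _ in range(1000)]
    batches = [data[i:i + 250] for i in range(0, len(data), 250)]
    partials = [train_batch(b) for b in batches]       # independent sub-components
    model = np.sum(partials, axis=0)                   # merge: element-wise addition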
[C25] Rosario Cammarota, Matthias Schunter, Anand Rajan, Fabian Boemer, Ágnes Kiss, Amos Treiber, Christian Weinert, Thomas Schneider, Emmanuel Stapf, Ahmad-Reza Sadeghi, Daniel Demmler, Huili Chen, Siam Umar Hussain, Sadegh Riazi, Farinaz Koushanfar, Saransh Gupta, Tajana Simunic Rosing, Kamalika Chaudhuri, Hamid Nejatollahi, Nikil Dutt, Mohsen Imani, Kim Laine, Anuj Dubey, Aydin Aysu, Fateme Sadat Hosseini, Chengmo Yang, Eric Wallace, Pamela Norton, "Trustworthy AI Inference Systems: An Industry Research View," arXiv preprint arXiv:2008.04449, 2020.
In this work, we provide an industry research view for approaching the design, deployment, and operation of trustworthy Artificial Intelligence (AI) inference systems. Such systems provide customers with timely, informed, and customized inferences to aid their decision, while at the same time utilizing appropriate security protection mechanisms for AI models. Additionally, such systems should also use Privacy-Enhancing Technologies (PETs) to protect customers' data at any time. To approach the subject, we start by introducing trends in AI inference systems. We continue by elaborating on the relationship between Intellectual Property (IP) and private data protection in such systems. Regarding the protection mechanisms, we survey the security and privacy building blocks instrumental in designing, building, deploying, and operating private AI inference systems. For example, we highlight opportunities and challenges in AI systems using trusted execution environments combined with more recent advances in cryptographic techniques to protect data in use. Finally, we outline areas of further development that require the global collective attention of industry, academia, and government researchers to sustain the operation of trustworthy AI inference systems.
[C24] Justin Morris, Yilun Hao, Saransh Gupta, Ranganathan Ramkumar, Jeffrey Yu, Mohsen Imani, Baris Aksanli, and Tajana Rosing, "Multi-label HD Classification in 3D Flash," IFIP/IEEE International Conference on VLSI and System-on-Chip (VLSI-SoC), 2020. (invited paper)
Many classification problems in practice map each sample to more than one label - this is known as multi-label classification. In this work, we present Multi-label HD, a multi-label classification system in 3D storage that uses Hyperdimensional Computing (HD). Multi-label HD is the first HD system to support multi-label classification. We propose two different mappings of multi-label classification onto HD. The first, Power Set HD, transforms the multi-label problem into single-label classification by creating a new class for each label combination. The second, Multi-Model HD, creates a binary classification model for each possible label. Our evaluation shows that Multi-Model HD achieves, on average, 47.8x higher energy efficiency and 47.1x faster execution time while achieving 5% higher classification accuracy than state-of-the-art light-weight multi-label classifiers. Power Set HD achieves 13% higher accuracy than Multi-Model HD, but is 2x slower. Our 3D-flash acceleration further improves the energy efficiency of Multi-label HD training by 228x and reduces the latency by 610x vs. training on a CPU.
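The two mappings are easy to state independently of the underlying HD classifier. The toy sketch below (hypothetical data; any single-label or binary HD trainer could consume its outputs) builds the Power Set training set, where each distinct label combination becomes one class, and the Multi-Model training sets, one binary problem per label.

    # toy multi-label samples: (features, set of active labels); values are hypothetical
    samples = [([0.1, 0.9], {0, 2}), ([0.8, 0.2], {1}), ([0.5, 0.4], {0, 2})]
    LABELS = 3

    # Power Set HD: one single-label class per distinct label combination
    combo_to_class = {}
    powerset_train = []
    for x, labels in samples:
        key = frozenset(labels)
        cls = combo_to_class.setdefault(key, len(combo_to_class))
        powerset_train.append((x, cls))          # feed to any single-label HD trainer

    # Multi-Model HD: one binary (present / absent) training set per label
    binary_train = {l: [(x, l in labels) for x, labels in samples] for l in range(LABELS)}

This mirrors the trade-off reported above: Power Set HD treats label combinations jointly but multiplies the number of classes, while Multi-Model HD keeps one small binary model per label.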
Processing in-memory (PIM) has shown great potential to accelerate the inference tasks of binarized neural networks (BNNs) by reducing data movement between processing units and memory. However, existing PIM architectures require analog/mixed-signal circuits that do not scale with the CMOS technology. In contrast, we propose BitNAP, binarized neural network acceleration with in-memory thresholding, which performs optimization at the operation, peripheral, and architecture levels for an efficient BNN accelerator. BitNAP supports row-parallel bitwise operations in crossbar memory by exploiting the switching of 1-bit bipolar resistive devices and a unique hybrid tunable thresholding operation. Our thresholding uses both switching and sensing-based operations to achieve maximum performance. BitNAP takes advantage of efficient row-parallel PIM operations. In order to reduce the area overhead of sensing-based operations, BitNAP presents a memory sense amplifier sharing scheme, along with a novel operation pipelining scheme to reduce the latency overhead of sharing. We evaluate the efficiency of BitNAP on the MNIST and ImageNet datasets using popular neural networks. BitNAP is on average 1.24x (10.7x) faster and 185.6x (10.5x) more energy-efficient as compared to the state-of-the-art PIM accelerator for simple (complex) networks.
Quantum computers promise to solve hard mathematical problems such as integer factorization and discrete logarithms in polynomial time, making standardized public-key cryptography (such as digital signature and key agreement) insecure. Lattice-Based Cryptography (LBC) is a promising post-quantum public-key cryptographic protocol that could replace standardized public-key cryptography, thanks to its inherent post-quantum resistance, efficiency, and versatility. A key mathematical tool in LBC is the Number Theoretic Transform (NTT), a common method for computing polynomial multiplication that is the most compute-intensive routine and requires acceleration for practical deployment of LBC protocols. In this paper, we propose CryptoPIM, a high-throughput Processing In-Memory (PIM) accelerator for NTT-based polynomial multipliers, supporting polynomials of degree up to 32k. Compared to the fastest FPGA implementation of an NTT-based multiplier, CryptoPIM achieves on average 31x throughput improvement with the same energy and only 28% performance reduction, thereby showing promise for practical deployment of LBC.
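For reference, the routine being accelerated is the standard radix-2 NTT over a prime modulus; the sketch below is a textbook iterative implementation with an illustrative modulus (P = 998244353, primitive root 3), not CryptoPIM's parameters or its in-memory mapping.

    P, G = 998244353, 3            # NTT-friendly prime (2^23 divides P-1) and primitive root

    def ntt(a, invert=False):
        """Iterative radix-2 number theoretic transform. len(a) must be a power of two."""
        a = list(a)
        n = len(a)
        j = 0
        for i in range(1, n):          # bit-reversal permutation
            bit = n >> 1
            while j & bit:
                j ^= bit
                bit >>= 1
            j |= bit
            if i < j:
                a[i], a[j] = a[j], a[i]
        length = 2
        while length <= n:
            w_len = pow(G, (P - 1) // length, P)
            if invert:
                w_len = pow(w_len, P - 2, P)   # inverse root for the inverse transform
            for start in range(0, n, length):
                w = 1
                for k in range(start, start + length // 2):
                    u, v = a[k], a[k + length // 2] * w % P
                    a[k], a[k + length // 2] = (u + v) % P, (u - v) % P
                    w = w * w_len % P
            length <<= 1
        if invert:
            n_inv = pow(n, P - 2, P)
            a = [x * n_inv % P for x in a]
        return a

    def poly_mul(f, g):
        """Multiply two polynomials (coefficient lists) via pointwise products in the NTT domain."""
        n = 1
        while n < len(f) + len(g) - 1:
            n <<= 1
        fa, ga = ntt(f + [0] * (n - len(f))), ntt(g + [0] * (n - len(g)))
        return ntt([x * y % P for x, y in zip(fa, ga)], invert=True)[:len(f) + len(g) - 1]

    assert poly_mul([1, 2], [3, 4]) == [3, 10, 8]   # (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2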
Stochastic computing (SC) reduces the complexity of computation by representing numbers with long independent bit-streams. However, increasing performance in SC comes with an increase in area and loss in accuracy. Processing in memory (PIM) with non-volatile memories (NVMs) computes data in-place, while having high memory density and supporting bit-parallel operations with low energy. In this paper, we propose SCRIMP for stochastic computing acceleration with resistive RAM (ReRAM) in-memory processing, which enables SC in memory. SCRIMP can be used for a wide range of applications. It supports all SC encodings and operations in memory. It maximizes the performance and energy efficiency of implementing SC by introducing novel in-memory parallel stochastic number generation and efficient implication-based logic in memory. To show the efficiency of our stochastic architecture, we implement image processing on the proposed hardware.
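The operations SCRIMP implements can be previewed in software: in unipolar stochastic computing, a value p in [0, 1] becomes a bit-stream whose fraction of ones is p, multiplication reduces to a bitwise AND of independent streams, and scaled addition to a 2:1 multiplexer. The sketch below uses a software pseudo-random generator, whereas SCRIMP generates the streams inside the ReRAM array.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 4096                       # stream length; longer streams give lower error

    def to_stream(p):
        """Unipolar encoding: each bit is 1 with probability p, for p in [0, 1]."""
        return (rng.random(N) < p).astype(np.uint8)

    def from_stream(s):
        return s.mean()

    a, b = to_stream(0.75), to_stream(0.40)
    prod = a & b                                   # multiplication = bitwise AND
    sel = to_stream(0.5)
    add = np.where(sel == 1, a, b)                 # scaled addition = 2:1 MUX

    print(from_stream(prod))   # approximately 0.75 * 0.40 = 0.30
    print(from_stream(add))    # approximately (0.75 + 0.40) / 2 = 0.575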
Deep neural networks (DNN) have demonstrated effectiveness for various applications such as image processing, video segmentation, and speech recognition. Running state-of-the-art DNNs on current systems mostly relies on either general-purpose processors, ASIC designs, or FPGA accelerators, all of which suffer from data movements due to the limited on-chip memory and data transfer bandwidth. In this work, we propose a novel framework, called RAPIDNN, which performs neuron-to-memory transformation in order to accelerate DNNs in a highly parallel architecture. RAPIDNN reinterprets a DNN model and maps it into a specialized accelerator, which is designed using non-volatile memory blocks that model four fundamental DNN operations, i.e., multiplication, addition, activation functions, and pooling. The framework extracts representative operands of a DNN model, e.g., weights and input values, using clustering methods to optimize the model for in-memory processing. Then, it maps the extracted operands and their pre-computed results into the accelerator memory blocks. At runtime, the accelerator identifies computation results based on efficient in-memory search capability which also provides tunability of approximation to improve computation efficiency further. Our evaluation shows that RAPIDNN achieves 68.4x, 49.5x energy efficiency improvement and 48.1x, 10.9x speedup as compared to ISAAC and PipeLayer, the state-of-the-art DNN accelerators, while ensuring less than 0.5% quality loss.
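The neuron-to-memory idea of replacing arithmetic with lookups can be approximated in a few lines: cluster weights and inputs into representative levels, precompute all pairwise products, and at runtime fetch the product of the nearest representatives. The quantile-based quantizer below is only a stand-in for RAPIDNN's clustering, and the table lookup stands in for its in-memory search.

    import numpy as np

    rng = np.random.default_rng(3)

    def representatives(values, k):
        """Pick k representative levels (simple quantile-based stand-in for clustering)."""
        return np.quantile(values, np.linspace(0, 1, k))

    def nearest_index(x, levels):
        """Index of the closest representative level for each value."""
        return np.abs(levels[None, :] - x[:, None]).argmin(axis=1)

    weights = rng.normal(size=1000)
    inputs = rng.normal(size=1000)
    W = representatives(weights, 16)     # representative weights
    X = representatives(inputs, 16)      # representative inputs
    table = np.outer(W, X)               # pre-computed products stored in memory blocks

    # runtime: multiplication becomes an index lookup instead of arithmetic
    wi = nearest_index(weights, W)
    xi = nearest_index(inputs, X)
    approx_products = table[wi, xi]
    exact_products = weights * inputs
    print(np.mean(np.abs(approx_products - exact_products)))   # average approximation error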
Processing In-Memory (PIM) has shown a great potential to accelerate inference tasks of Convolutional Neural Network (CNN). However, existing PIM architectures do not support high precision computation, e.g., in floating point precision, which is essential for training accurate CNN models. In addition, most of the existing PIM approaches require analog/mixed-signal circuits, which do not scale, and exploit insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully-digital scalable PIM architecture that accelerates CNN in both training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement of the PIM architecture. We evaluate the efficiency of FloatPIM on the ImageNet dataset using popular large-scale neural networks. Our evaluation shows that FloatPIM supporting floating point precision can achieve up to 5.1% higher classification accuracy as compared to existing PIM architectures with limited fixed-point precision. FloatPIM training is on average 303.2x and 48.6x (4.3x and 15.8x) faster and more energy efficient as compared to GTX 1080 GPU (PipeLayer PIM accelerator). For testing, FloatPIM also provides 324.8x and 297.9x (6.3x and 21.6x) speedup and energy efficiency as compared to GPU (ISAAC PIM accelerator), respectively.
The Internet of Things has increased the rate of data generation. Clustering is one of the most important tasks in this domain to find the latent correlation between data. However, performing today's clustering tasks is often inefficient due to the data movement cost between cores and memory. We propose HDCluster, a brain-inspired unsupervised learning algorithm which clusters input data in a high-dimensional space by fully mapping and processing in memory. Instead of clustering input data in either fixed-point or floating-point representation, HDCluster maps data to vectors with dimensionality in the thousands, called hypervectors, and clusters them in that space. Our evaluation shows that HDCluster provides better clustering quality for tasks that involve a large amount of data while providing the potential for acceleration in a memory-centric architecture.
Recently, Processing-In-Memory (PIM) techniques exploiting resistive RAM (ReRAM) have been used to accelerate various big data applications. ReRAM-based in-memory search is a powerful operation which efficiently finds required data in a large data set. However, such operations result in a large amount of current which may create serious thermal issues, especially in state-of-the-art 3D stacking chips. Therefore, designing PIM accelerators based on in-memory searches requires a careful consideration of temperature. In this work, we propose static and dynamic techniques to optimize the thermal behavior of PIM architectures running intensive in-memory search operations. Our experiments show the proposed design significantly reduces the peak chip temperature and dynamic management overhead. We test our proposed design in two important categories of applications which benefit from search-based PIM acceleration - hyper-dimensional computing and database query processing. Validated experiments show that the proposed method can reduce the steady-state temperature by at least 15.3°C, which extends the lifetime of the ReRAM device by 57.2% on average. Furthermore, the proposed fine-grained dynamic thermal management provides 17.6% performance improvement over state-of-the-art methods.
[C16] Mohsen Imani, Yeseong Kim, Saransh Gupta, Daniel Peroni, and Tajana S. Rosing, "In-Memory Acceleration of Deep Neural Network," GOMACTech Conference, 2019.
Deep neural networks (DNNs) demonstrate superior effectiveness for diverse classification problems, image processing, video segmentation, speech recognition, computer vision, and gaming. Although many DNN models are implemented on high-performance computing architectures such as GPGPUs by parallelizing tasks, running neural networks on general-purpose processors is still slow, energy hungry, and prohibitively expensive. Processing in-memory (PIM) is a promising solution to address the data movement issue by implementing logic within the memory. Instead of sending a large amount of data to the processing cores for computation, PIM performs a part of the computation tasks, e.g., bit-wise computations, inside the memory, thus the application performance can be accelerated significantly by avoiding the memory access bottleneck. Several existing works have proposed PIM-based neural network accelerators which keep the input data and trained weights inside memory. Although these approaches are the first step towards employing PIM for DNN acceleration, they have three major downsides: (i) They utilize Analog to Digital Converters (ADCs) and Digital to Analog Converters (DACs) which take the majority of the chip area and power consumption. In addition, the mixed-signal ADC/DAC blocks do not scale as fast as the memory device technology does. (ii) The existing PIM approaches use multi-level memristor devices that are not sufficiently reliable for commercialization, unlike commonly-used single-level NVMs. (iii) Finally, they only support matrix multiplication in analog memory while other operations such as activation functions are implemented using CMOS-based digital logic. This makes the design non-generic and increases the expense of fabrication.
Sequence alignment is a core component of many biological applications. As the advancement in sequencing technologies produces a tremendous amount of data on an hourly basis, this alignment is becoming the critical bottleneck in bioinformatics analysis. Even though large clusters and highly-parallel processing nodes can carry out sequence alignment, in addition to the exacerbated power consumption, they cannot afford to concurrently process the massive amount of data generated by sequencing machines. In this paper, we propose a novel processing in-memory (PIM) architecture suited for DNA sequence alignment, called RAPID. We revise the state-of-the-art alignment algorithm to make it compatible with in-memory parallel computations, and process DNA data completely inside memory without requiring additional processing units. The main advantage of RAPID over the other alignment accelerators is a dramatic reduction in internal data movement while maintaining a remarkable degree of parallelism provided by PIM. The proposed architecture is also highly scalable, facilitating precise alignment of lengthy sequences. We evaluated the efficiency of the proposed architecture by aligning chromosome sequences from human and chimpanzee genomes. The results show that RAPID is at least 2x faster and 7x more power efficient than BioSEAL, the best DNA sequence alignment accelerator.
Recently, Processing In-Memory (PIM) has been shown as a promising solution to address the data movement issue in current processors. However, today's PIM technologies are mostly analog-based, which involve both scalability and efficiency issues. In this paper, we propose a novel digital-based PIM which accelerates fundamental operations and diverse data analytic procedures using processing in-memory technology. Instead of sending a large amount of data to the processing cores for computation, our design performs a large part of the computation tasks inside the memory; thus the application performance can be accelerated significantly by avoiding the memory access bottleneck. Digital-based PIM supports bit-wise operations between two selected bit-lines of the memory block and then extends them to support row-parallel arithmetic operations.
Brain-inspired hyperdimensional (HD) computing explores computing with hypervectors for the emulation of cognition as an alternative to computing with numbers. In HD, input symbols are mapped to hypervectors and an associative search is performed for reasoning and classification. An associative memory, which finds the closest match between a set of learned hypervectors and a query hypervector, uses a simple Hamming distance metric for similarity checks. However, we observe that, in order to provide acceptable classification accuracy, HD needs to store a non-binarized model in associative memory and use costly similarity metrics such as cosine to perform a reasoning task. This makes HD computationally expensive when it is used for realistic classification problems. In this paper, we propose an FPGA-based acceleration of HD (FACH) which significantly improves the computation efficiency by removing the majority of multiplications during the reasoning task. FACH identifies representative values in each class hypervector using a clustering algorithm. Then, it creates a new HD model with hardware-friendly operations and accordingly proposes an FPGA-based implementation to accelerate such tasks. Our evaluations on several classification problems show that FACH can provide 5.9x energy efficiency improvement and 5.1x speedup as compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.
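The multiplication-removal trick can be sketched directly: if the elements of a class hypervector are quantized to K representative values, the dot product with a query collapses into K group-sums of query elements followed by K scalar multiplies instead of D. The quantile quantizer below is an illustrative stand-in for FACH's clustering, and the code only checks the arithmetic identity, not the FPGA mapping.

    import numpy as np

    rng = np.random.default_rng(4)
    D, K = 4096, 8                         # dimensionality, number of representative values

    class_hv = rng.normal(size=D)          # non-binarized class hypervector
    query = rng.normal(size=D)

    # offline: quantize class elements to K representative values
    levels = np.quantile(class_hv, np.linspace(0, 1, K))
    assign = np.abs(levels[None, :] - class_hv[:, None]).argmin(axis=1)

    # online: D multiplications collapse into K group-sums and K multiplications
    group_sums = np.array([query[assign == k].sum() for k in range(K)])
    approx_score = float(group_sums @ levels)

    exact_score = float(query @ class_hv)
    print(approx_score, exact_score)       # the two scores stay close for modest K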
The performance of graph processing for real-world graphs is limited by inefficient memory behaviours in traditional systems because of random memory access patterns. Offloading computations to the memory is a promising strategy to overcome such challenges. In this paper, we exploit resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, efficiently executes the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in computational memory. The hardware-software co-design used in GRAM maximizes the computation parallelism while minimizing the number of data movements. Based on our experiments with three important graph kernels on seven real-world graphs, GRAM provides 122.5x and 11.1x speedup compared with an in-memory graph system and optimized multithreaded algorithms running on a multi-core CPU, respectively. Compared to a GPU-based graph acceleration library and a recently proposed PIM accelerator, GRAM improves performance by 7.1x and 3.8x, respectively.
The recent emergence of IoT has led to a substantial increase in the amount of data processed. Today, a large number of applications are data intensive, involving massive data transfers between the processing core and memory. These transfers act as a bottleneck mainly due to the limited data bandwidth between memory and the processing core. Processing in memory (PIM) avoids this latency problem by doing computations at the source of the data. In this paper, we propose designs which enable PIM in the three major memory technologies, i.e., SRAM, DRAM, and the newly emerging non-volatile memories (NVMs). We exploit the analog properties of different memories to implement simple logic functions, namely OR, AND, and majority, inside memory. We then extend them further to implement in-memory addition and multiplication. We compare the three memory technologies with a GPU by running general applications on them. Our evaluations show that SRAM, NVM, and DRAM are 29.8x (36.3x), 17.6x (20.3x), and 1.7x (2.7x) better in performance (energy consumption) as compared to an AMD GPU.
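As a sanity check on how addition can be composed from the listed primitives, the sketch below builds a ripple-carry adder in which the carry is a single majority operation and the sum is composed from majority plus inversion (assuming inversion is also available, which it typically is via the sense circuitry). This is one possible composition for illustration, not necessarily the mapping used in the paper.

    def MAJ(a, b, c):              # majority of three bits
        return 1 if a + b + c >= 2 else 0

    def NOT(a):
        return 1 - a

    def full_adder(a, b, cin):
        """Carry is one majority; sum uses two more majorities plus inversions."""
        cout = MAJ(a, b, cin)
        s = MAJ(NOT(cout), cin, MAJ(a, b, NOT(cin)))
        return s, cout

    def add(x, y, bits=8):
        """Ripple-carry addition composed entirely from the gate-level primitives above."""
        carry, out = 0, 0
        for i in range(bits):
            s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
            out |= s << i
        return out

    # exhaustive check of the composition against ordinary integer addition (mod 2^8)
    assert all(add(x, y) == (x + y) & 0xFF for x in range(256) for y in range(256))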
The Internet of Things (IoT) has built a network with billions of connected devices which generate massive volumes of data. Processing large data on existing systems requires significant costs for data movements between processors and memory due to limited cache capacity and memory bandwidth. Processing-In-Memory (PIM) is a promising solution to address the issue. Prior techniques that enable computation in non-volatile memory (NVM) are designed around a bipolar switching mode, which suffers from a high sneak current in a crossbar array (CBA) structure. In this paper, we propose a unipolar-switching logic for high-density PIM applications, called UPIM. Our design exploits a unipolar-switching mode of memristor devices which can be operated in a 1D1R structure and hence suppresses the sneak current that exists in prior PIM technologies. Moreover, UPIM takes advantage of a 3D vertical crossbar array (CBA) structure to increase memory utilization per unit area for high-density applications. Our evaluation on a wide range of applications shows that UPIM achieves up to 31.3x energy saving and 113.8x energy-delay product (EDP) improvement as compared to a recent GPGPU architecture. As compared to the state-of-the-art PIM design based on the bipolar switching mode, our design achieves 3.1x lower energy consumption.
In this work, we design DigitalPIM, a Digital-based Processing In-Memory platform capable of accelerating fundamental big data algorithms in real time with orders-of-magnitude more energy-efficient operation. Unlike existing near-data processing approaches such as HMC 2.0, which utilize additional low-power processing cores next to memory blocks, the proposed platform implements the entire algorithm directly in memory blocks without using extra processing units. In our platform, each memory block supports the essential operations, including bitwise operations, addition/multiplication, and search, internally in memory without reading any values out of the block. This significantly mitigates the processing costs of the new architecture, while providing high scalability and parallelism for performing the extensive computations. We exploit these essential operations to accelerate popular big data applications, such as machine learning algorithms, query processing, and graph processing, entirely in memory. Our evaluations show that for all tested applications, the performance can be accelerated significantly by eliminating the memory access bottleneck.
In the Internet of Things (IoT) era, data movement between processing units and memory is a critical factor in the overall system performance. Processing-in-Memory (PIM) is a promising solution to address this bandwidth bottleneck by performing a portion of computation inside the memory. Many prior studies have enabled various PIM operations on nonvolatile memory (NVM) by modifying sense amplifiers (SA). They exploit a single sense amplifier to handle multiple bitlines with a multiplexer (MUX) since a single SA circuit takes a much larger area than an NVM 1-bit cell. This limits the potential parallelism that PIM techniques can ideally achieve. In this paper, we propose MAPIM, mat parallelism for high-performance processing in a non-volatile memory architecture. Our design carries out multiple bit-line (BL) requests under a MUX in parallel with two novel design components, multi-column/row latch (MCRL) and shared SA routing (SSR). The MCRL allows the address decoder to activate multiple addresses in both column and row directions by buffering the consecutively-requested addresses. The activated bits are simultaneously sensed by the multiple SAs across a MUX based on the SSR technique. The experimental results show that MAPIM is up to 339x faster and 221x more energy efficient than a GPGPU. As compared to the state-of-the-art PIM designs, our design is 16x faster and 1.8x more energy efficient with insignificant area overhead.
The Internet of Things (IoT) has led to the emergence of big data. Processing this amount of data poses a challenge for current computing systems. Processing in-memory (PIM) enables in-place computation, which reduces data movement, a major latency bottleneck in conventional systems. In this paper, we propose an in-memory implementation of fast and energy-efficient logic (FELIX) which combines the functionality of PIM with memories. To the best of the authors' knowledge, FELIX is the first PIM logic to enable single-cycle NOR, NOT, NAND, minority, and OR operations directly in crossbar memory. We exploit voltage threshold-based memristors to enable single-cycle operations. Execution is purely in-memory: it neither reads out data nor changes sense amplifiers, while preserving data in memory. We extend these single-cycle operations to implement more complex functions like XOR and addition in memory with 2x lower latency than the fastest published PIM technique. We also increase the amount of in-memory parallelism in our design by segmenting bitlines using switches. To evaluate the efficiency of our design at the system level, we design a FELIX-based HyperDimensional (HD) computing accelerator. Our evaluation shows that for all applications tested using HD, FELIX provides on average 128.8x speedup and 5,589.3x lower energy consumption as compared to an AMD GPU. FELIX HD also achieves on average 2.21x higher energy efficiency, 1.86x speedup, and 1.68x less memory as compared to the fastest PIM technique.
Big data has become a serious problem as data volumes have been skyrocketing for the past few years. Storage and CPU technologies are overwhelmed by the amount of data they have to handle. Traditional computer architectures show poor performance when processing such huge data. Processing in-memory (PIM) is a promising technique to address the data movement issue by locally processing data inside memory. However, there are two main issues with stand-alone PIM designs: (i) PIM is not always computationally faster than CMOS logic, and (ii) PIM cannot process all operations in many applications. Thus, not many applications can benefit from PIM. To generalize the use of PIM, we designed GenPIM, a general processing in-memory architecture consisting of conventional processors as well as PIM accelerators. GenPIM supports basic PIM functionalities in specialized non-volatile memory, including bitwise operations, search, addition, and multiplication. For each application, GenPIM identifies the parts which use PIM operations, and processes the remaining non-PIM operations, as well as the parts of the application that are not data-intensive, on general-purpose cores. GenPIM also enables configurable PIM approximation by relaxing in-memory computation. We test the efficiency of the proposed design on two learning applications. Our experimental evaluation shows that our design can achieve 10.9x improvement in energy efficiency and 6.4x speedup as compared to processing data in conventional cores.
Graph processing has become important for various applications in today's big data era. However, most graph processing applications suffer from large memory overhead due to random memory accesses. Such a random memory access pattern provides little temporal and spatial locality and cannot be accelerated by the conventional hierarchical memory system. In this work, we propose GAS, a heterogeneous memory architecture, to accelerate graph applications implemented in the message-based vertex program model, which is widely used in various graph processing systems. GAS utilizes specialized content-addressable memory (CAM) to store randomly accessed data, and determines exact access patterns through a series of associative searches. Thus, GAS not only removes the inefficiency of random accesses but also reduces the memory access latency through accurate prefetching. We test the efficiency of GAS with three important graph processing kernels on five well-known graphs. Our experimental results show that GAS can significantly reduce the cache miss rate and improve bandwidth utilization as compared to a conventional system with a state-of-the-art graph-specific prefetching mechanism. These enhancements result in 34% and 27% reductions in energy consumption and execution time, respectively.
Approximate computing is a way to build fast and energy-efficient systems, which provide responses of good-enough quality tailored for different purposes. In this paper, we propose RMAC, a novel approximate floating-point multiplier which efficiently multiplies two floating-point numbers and yields a high-precision product. RMAC approximates the costly mantissa multiplication with a simple addition of the input operands' mantissas. To tune the level of accuracy, RMAC looks at the first bit of the input mantissas as well as the first N bits of the result of the addition to dynamically estimate the maximum multiplication error rate. Then, RMAC decides to either accept the approximate result or re-execute the exact multiplication. Depending on the value of N, the proposed RMAC can be configured to achieve different levels of accuracy. We integrate the proposed RMAC into the AMD Southern Islands GPU by replacing the existing floating-point units with RMAC. We test the efficiency and accuracy of the enhanced GPU on a wide range of applications, including multimedia and machine learning applications. Our evaluations show that a GPU enhanced by the proposed RMAC can achieve 5.2x energy-delay product improvement compared to a GPU using conventional FPUs while ensuring less than 2% quality loss. Comparing our approach with other state-of-the-art approximate multipliers shows that RMAC can achieve 3.1x faster and 1.8x more energy-efficient computation while providing the same quality of service.
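The core approximation is easy to state: writing each operand as (1 + f) * 2^e with f in [0, 1), the cross term f1*f2 is dropped so that the mantissa multiplication becomes an addition of mantissas. The sketch below models that idea with a simple relative-error threshold as the fallback test; RMAC's actual decision rule inspects the leading mantissa bits in hardware, so the threshold here is only a stand-in.

    import math

    def approx_mul(x, y, max_rel_err=0.02):
        """Approximate x*y by adding mantissas instead of multiplying them.

        x = s1 * 2**e1 * (1 + f1), y = s2 * 2**e2 * (1 + f2), with f in [0, 1):
        x*y = s1*s2 * 2**(e1+e2) * (1 + f1 + f2 + f1*f2); the f1*f2 term is dropped.
        """
        if x == 0.0 or y == 0.0:
            return 0.0
        m1, e1 = math.frexp(abs(x))     # abs(x) = m1 * 2**e1, with m1 in [0.5, 1)
        m2, e2 = math.frexp(abs(y))
        f1, f2 = 2.0 * m1 - 1.0, 2.0 * m2 - 1.0          # mantissa fractions in [0, 1)
        sign = math.copysign(1.0, x) * math.copysign(1.0, y)

        # the dropped term relative to the approximate product estimates the error
        if f1 * f2 / (1.0 + f1 + f2) > max_rel_err:
            return x * y                                  # fall back to exact multiplication
        return sign * math.ldexp(1.0 + f1 + f2, e1 + e2 - 2)

    print(approx_mul(3.0, 5.0), 3.0 * 5.0)   # this pair exceeds the threshold and falls back to exact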
We live in a world where technological advances are continually creating more data than what we can deal with. Machine learning algorithms, in particular Deep Neural Networks (DNNs), are essential to process such large data. Computation of DNNs requires loading the trained network onto the processing element and storing the result in memory. Therefore, running these applications needs a high memory bandwidth. Traditional cores are limited in terms of memory bandwidth. Hence, running DNNs on traditional cores results in high energy consumption and slow processing speed due to a large amount of data movement between memory and processing units. Several prior works tried to address the data movement issue by enabling Processing In-Memory (PIM) using crossbar analog multiplication. However, these designs suffer from the large overhead of data conversion between the analog and digital domains. In this work, we propose RNSnet, which uses the Residue Number System (RNS) to execute neural networks completely in the digital domain, in memory. RNSnet simplifies the fundamental neural network operations and maps them to in-memory addition and data access. We test the efficiency of the proposed design on several popular neural network applications. Our experimental results show that RNSnet consumes 145.5x less energy and obtains 35.4x speedup as compared to an NVIDIA GTX 1080 GPU. In addition, our results show that RNSnet achieves 8.5x better energy-delay product as compared to state-of-the-art neural network accelerators.
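A residue number system represents an integer by its residues modulo pairwise co-prime moduli, so addition and multiplication act on each residue independently (which is what lets them map to simple in-memory operations), and the conventional value is recovered with the Chinese Remainder Theorem. The moduli below are illustrative, not the ones used in RNSnet.

    from math import prod

    MODULI = (251, 253, 255, 256)          # pairwise co-prime; dynamic range = their product
    M = prod(MODULI)

    def to_rns(x):
        return tuple(x % m for m in MODULI)

    def rns_add(a, b):
        return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

    def rns_mul(a, b):
        return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

    def from_rns(r):
        """Chinese Remainder Theorem reconstruction of the conventional value."""
        x = 0
        for ri, m in zip(r, MODULI):
            Mi = M // m
            x += ri * Mi * pow(Mi, -1, m)
        return x % M

    a, b = 123456, 789
    assert from_rns(rns_add(to_rns(a), to_rns(b))) == (a + b) % M
    assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % M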
Recent years have witnessed a rapid growth in the domain of the Internet of Things (IoT). This network of billions of devices generates and exchanges huge amounts of data. The limited cache capacity and memory bandwidth make transferring and processing such data on traditional CPUs and GPUs highly inefficient, both in terms of energy consumption and delay. However, many IoT applications are statistical at heart and can accept some inaccuracy in their computation. This enables designers to reduce the complexity of processing by approximating the results for a desired accuracy. In this paper, we propose an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data. The proposed design eliminates the overhead involved in transferring data to the processor by virtually bringing the processor inside memory. APIM dynamically configures the precision of computation for each application in order to tune the level of accuracy during runtime. Our experimental evaluation running six general OpenCL applications shows that the proposed design achieves up to 20x performance improvement and provides 480x improvement in energy-delay product, ensuring acceptable quality of service. In exact mode, it achieves 28x energy savings and 4.8x speedup compared to state-of-the-art GPU cores.
Today's computing systems use a huge amount of energy and time to process basic queries in databases. A large part of it is spent on data movement between the memory and processing cores, owing to the limited cache capacity and memory bandwidth of traditional computers. In this paper, we propose a non-volatile memory-based query accelerator, called NVQuery, which performs several basic query functions in memory, including aggregation, prediction, and bit-wise operations, as well as exact and nearest distance search queries. NVQuery is implemented on a content addressable memory (CAM) and exploits the analog characteristics of non-volatile memory in order to enable in-memory processing. To implement nearest distance search in memory, we introduce a novel bitline driving scheme to give weights to the indices of the bits during the search operation. Our experimental evaluation shows that NVQuery can provide 49.3x performance speedup and 32.9x energy savings as compared to running the same query on a traditional processor. In addition, compared to state-of-the-art query accelerators, NVQuery can achieve 26.2x energy-delay product improvement while providing similar accuracy.
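A software model of the weighted search helps show why it returns a nearest match: weighting a mismatch at bit position i by 2^i makes the search score equal the integer value of stored XOR query, which tracks (though does not always equal) the absolute numeric distance. The bitline driving mechanism itself is analog and is not modeled here.

    BITS = 16

    def weighted_mismatch(stored, query, bits=BITS):
        """Mismatch at bit i costs 2**i, mimicking weighted bitline driving in the CAM."""
        diff = stored ^ query
        return sum((1 << i) for i in range(bits) if (diff >> i) & 1)

    def nearest(rows, query):
        """Return the stored row minimizing the weighted mismatch to the query."""
        return min(rows, key=lambda r: weighted_mismatch(r, query))

    rows = [10, 97, 400, 1023, 5000]
    print(nearest(rows, 420))   # 400: the weighted-Hamming nearest matches the numeric nearest here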
Neural networks (NNs) have shown great ability to process emerging applications such as speech recognition, language recognition, image classification, video segmentation, and gaming. It is therefore important to make NNs efficient. Although attempts have been made to reduce NNs' computation cost, the data movement between memory and processing cores is the main bottleneck for NNs' energy consumption and execution time. This makes the implementation of NNs significantly slower on traditional CPU/GPU cores. In this paper, we propose a novel processing in-memory architecture, called NNPIM, that significantly accelerates the neural network inference phase inside the memory. First, we design a crossbar memory architecture that supports fast addition, multiplication, and search operations inside the memory. Second, we introduce simple optimization techniques which significantly improve NNs' performance and reduce the overall energy consumption. We also map all NN functionalities using parallel in-memory components. To further improve the efficiency, our design supports weight sharing to reduce the number of computations in memory and consequently speed up NNPIM computation. We compare the efficiency of our proposed NNPIM with GPU and state-of-the-art PIM architectures. Our evaluation shows that our design can achieve 131.5x higher energy efficiency and is 48.2x faster as compared to the NVIDIA GTX 1080 GPU architecture. Compared to state-of-the-art neural network accelerators, NNPIM achieves on average 3.6x higher energy efficiency and is 4.6x faster, while providing the same classification accuracy.
Many applications, such as machine learning and data sensing, are statistical in nature and can tolerate some level of inaccuracy in their computation. A variety of designs have been put forward exploiting the statistical nature of machine learning through approximate computing, with approximate multipliers being the main focus due to their high usage in machine-learning designs. In this article, we propose a novel approximate floating-point multiplier, called CMUL, which significantly reduces energy and improves the performance of multiplication while allowing for a controllable amount of error. Our design approximately models multiplication by replacing the most costly step of the operation with a lower-energy alternative. To tune the level of approximation, CMUL dynamically identifies the inputs that produce the largest approximation error and processes them in precise mode. To use CMUL for deep neural network (DNN) acceleration, we propose a framework that modifies the trained DNN model to make it suitable for approximate hardware. Our framework adjusts the DNN weights to a set of potential weights that are suitable for approximate hardware. Then, it compensates for the possible quality loss by iteratively retraining the network. Our evaluation with four DNN applications shows that CMUL can achieve 60.3% energy efficiency improvement and 3.2x energy-delay product (EDP) improvement as compared to the baseline GPU, while ensuring less than 0.2% quality loss. These results are 38.7% and 2.0x higher than the energy efficiency and EDP improvements of CMUL without the proposed framework.
Brain-inspired HyperDimensional (HD) computing emulates cognitive tasks by computing with long binary vectors, a.k.a. hypervectors, as opposed to computing with numbers. However, we observed that in order to provide acceptable classification accuracy on practical applications, HD algorithms need to be trained and tested on non-binary hypervectors. In this paper, we propose SearcHD, a fully binarized HD computing algorithm with fully binary training. SearcHD maps every data point to a high-dimensional space with binary elements. Instead of training an HD model with non-binary elements, SearcHD implements a full binary training method which generates multiple binary hypervectors for each class. We also use the analog characteristics of non-volatile memories (NVMs) to perform all encoding, training, and inference computations in memory. We evaluate the efficiency and accuracy of SearcHD on a wide range of classification applications. Our evaluation shows that SearcHD can provide on average 31.1x higher energy efficiency and 12.8x faster training as compared to state-of-the-art HD computing algorithms.
Today's computing systems use a huge amount of energy and time to process basic queries in databases. A large part of it is spent on data movement between the memory and processing cores, owing to the limited cache capacity and memory bandwidth of traditional computers. In this paper, we propose a nonvolatile memory-based query accelerator, called NVQuery, which performs several basic query functions in memory, including aggregation, prediction, bit-wise operations, and join operations, as well as exact and nearest distance search queries. NVQuery is implemented on a content addressable memory and exploits the analog characteristics of nonvolatile memory in order to enable in-memory processing. To implement nearest distance search in memory, we introduce a novel bitline driving scheme to give weights to the indices of the bits during the search operation. To further improve the energy efficiency, our design supports configurable approximation by adaptively putting memory blocks under voltage overscaling. Our experimental evaluation shows that NVQuery can provide 49.3x performance speedup and 32.9x energy savings as compared to running the same query on a traditional processor. Approximation improves the energy-delay product (EDP) of NVQuery by 7.3x, while providing acceptable accuracy. In addition, NVQuery can achieve 30.1x EDP improvement as compared to state-of-the-art query accelerators.
Realizing logic operations within passive crossbar memory arrays is a promising approach to enable novel computer architectures, different from the conventional von Neumann architecture. Attractive candidates to enable such architectures are memristors, nonvolatile memory elements commonly used within a crossbar, that can also perform logic operations. In such novel architectures, data are stored and processed within the same entity, which we term a memristive memory processing unit (MPU). In this paper, the Memristor-Aided loGIC (MAGIC) family is discussed with various design considerations and novel techniques to execute logic within an MPU. We present a novel resistive memory, the transpose memory, which adds functionality to the memristive memory, and compare it with a conventional memristive memory. A case study of an adder is presented to demonstrate the design issues discussed in this paper. We compare the proposed design techniques with memristive IMPLY logic in terms of speed, area, and energy. Our evaluation shows that the proposed MAGIC design is 2.4x faster and consumes 66.3% less energy as compared with IMPLY-based computing for N-bit addition within memristive crossbar memory. Additionally, we compare the proposed design with the IMPLY logic family on ISCAS-85 benchmarks, which shows significant improvements in speed (2x) and energy (10x), with similar area.
Recent years have witnessed a rapid growth in the amount of generated data, owing to the emergence of the Internet of Things (IoT). Processing such huge data on traditional computing systems is highly inefficient, mainly due to the limited cache capacity and memory bandwidth. Processing in-memory (PIM) is an emerging paradigm which tries to address this issue. It uses memories as computing units, hence reducing the data transfers between memory and processing cores. However, the application of present PIM techniques is restricted by their limited functionality and inability to process large amounts of data efficiently. In this thesis, we propose novel techniques which exploit the analog properties of emerging memory technologies. Not only do these support more complex functions such as addition, multiplication, and search, but they also manage and process large data more efficiently. We present a new blocked PIM architecture which uses inter-block interconnects to accelerate data-intensive processing. We also introduce a heterogeneous architecture with general-purpose cores and PIM-enabled memory, along with a data-dependent task allocation scheme for it. We also apply application-specific optimizations and approximation techniques to design accelerators for neural networks and database query systems. While we design multiplication-by-constant hardware for neural networks, query processing is accelerated by a novel in-memory nearest search technique. Our neural network accelerator achieves 113.9x higher energy efficiency and 56.3x speedup as compared to an AMD GPU. Also, the query accelerator provides 49.3x performance speedup and 32.9x energy savings as compared to a recent Intel CPU.
University of California San Diego, 2018
Advisor: Prof. Tajana S. Rosing
Contact
sgupta at ucsd.edu
Room 2148, Computer Science and Engineering Building (EBU3B), La Jolla, CA 92093