FET: Small: AlignMEM: Fast and Efficient DNA Sequence Alignment in Non-Volatile Magnetic RAM
Funding source
Principal investigator
Awardee organization
Fiscal year
Award date
Award number
Project period
Award level
Award amount
Discipline
Discipline code
Funding category
Keywords
Participants
Participating organizations
Personnel information
Organization information
Managing agency
Program officer
1. SAFER: Sparsity Integrated Compute-in-Memory AI Accelerator with a Fused Dot-Product Engine and a RISC-V CPU
- Keywords:
- Energy efficiency;Engines;Indium alloys;Program processors;Reduced instruction set computing;Static random access storage;Data movements;Digital in-memory computing;Floating points;Floating-point and integer acceleration;Memory circuits;Memory footprint;Memory macro;Multiply-and-accumulate;Peak energy;RISC-V
- Sridharan, Amitesh;Ali, Asmer Hamid;Lee, Yongjae;Anupreetham, Anupreetham;Liu, Yaotian;Zhang, Jeff;Seo, Jae-Sun;Fan, Deliang
- 《51st IEEE European Solid-State Electronics Research Conference, ESSERC 2025》
- 2025
- September 8, 2025 - September 11, 2025
- Munich, Germany
- Conference
We present a sparsity-aware in-SRAM multiply-and-accumulate (MAC) accelerator with a fused dot-product engine (SAFE) and a RISC-V CPU (SAFER). For the first time, we implement a unified dot-product compute methodology in Compute-in-Memory (CIM) circuits, vastly reducing the hardware footprint for simultaneously supporting both floating-point (FP) and integer (INT) MACs. Additionally, we integrate various N:M sparsity formats, allowing the CIM macro to store and operate exclusively on compressed non-zero weights. We also tightly integrate a 32-bit RISC-V CPU with SAFE for efficient data movement across the chip. The 28 nm SAFER prototype achieves peak energy efficiencies of 105.7 TOPS/W (78.9 TOPS/W) and 79.9 TOPS/W (63 TOPS/W) at the macro (chip) level for FP8 and INT8 workloads, respectively. SAFER also achieves a memory-footprint reduction proportional to sparsity through compressed storage, vastly reducing the macro count required for large AI models. Under our proposed figure of merit (FoM), which accounts for PPA along with memory footprint, SAFER improves on current SoTA CIMs by 13.8× for FP8 workloads. © 2025 IEEE.
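The compressed N:M storage described in the abstract above can be sketched in a few lines: for every group of M weights only the N largest-magnitude values and their positions are kept, and the MAC then runs over non-zeros only. This is a minimal software illustration with hypothetical helper names, not the paper's circuit implementation.

```python
def compress_n_m(weights, n=2, m=4):
    """N:M structured sparsity: keep the n largest-magnitude entries in
    every group of m, storing only non-zero values and their indices."""
    vals, idxs = [], []
    for g in range(0, len(weights), m):
        group = weights[g:g + m]
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]))[-n:]
        for k in sorted(keep):          # preserve original ordering
            vals.append(group[k])
            idxs.append(g + k)
    return vals, idxs

def sparse_dot(vals, idxs, x):
    """MAC over the compressed weights only; zeros are never touched."""
    return sum(v * x[i] for v, i in zip(vals, idxs))
```

With n=2, m=4 the stored weight footprint halves, which is the source of the macro-count reduction the abstract claims for large models.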
2. Efficient Self-Supervised Continual Learning with Progressive Task-Correlated Layer Freezing
- Keywords:
- Semi-supervised learning;Support vector machines;% reductions;Catastrophic forgetting;Continual learning;Layer freezing;Learning methods;Multiple tasks;Training process;Training time;Unlabeled data;Visual representations
- Yang, Li;Lin, Sen;Zhang, Fan;Zhang, Junshan;Fan, Deliang
- 《26th International Symposium on Quality Electronic Design, ISQED 2025》
- 2025
- April 23, 2025 - April 25, 2025
- Hybrid, San Francisco, CA, United States
- Conference
Inspired by the success of Self-Supervised Learning (SSL) in learning visual representations from unlabeled data, a few recent works have studied SSL in the context of Continual Learning (CL), where multiple tasks are learned sequentially, giving rise to a new paradigm, namely Self-Supervised Continual Learning (SSCL). It has been shown that SSCL outperforms Supervised Continual Learning (SCL), as the learned representations are more informative and robust to catastrophic forgetting. However, building upon the training process of SSL, prior SSCL studies train all the parameters for each task, resulting in prohibitively high training cost. In this work, we first analyze the training time and memory consumption and reveal that the backward gradient calculation is the bottleneck. Moreover, by investigating the task correlations in SSCL, we discover an interesting phenomenon: with the SSL-learned backbone model, the intermediate features are highly correlated between tasks. Based on these new findings, we propose a new SSCL method with layer-wise freezing, which progressively freezes the partial layers with the highest correlation ratios for each task to improve training computation and memory efficiency. Extensive experiments across multiple datasets are performed, where our proposed method shows superior performance against SoTA SSCL methods under various SSL frameworks. For example, compared to LUMP, our method achieves 1.18×, 1.15×, and 1.2× GPU training-time reduction, 1.65×, 1.61×, and 1.6× memory reduction, 1.46×, 1.44×, and 1.46× backward-FLOPs reduction, and 1.31%/1.98%/1.21% forgetting reduction without accuracy degradation on three datasets, respectively. © 2025 IEEE.
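The freezing policy sketched in the abstract above can be illustrated with a toy model: rank layers by their cross-task feature-correlation ratio, freeze the top-ranked ones, and skip their backward-gradient work. The function names and the unit-cost FLOPs model are assumptions for illustration; the paper's actual criterion is more elaborate.

```python
def progressive_freeze(corr_per_layer, k):
    """Pick the k layers whose intermediate features correlate most
    strongly with the previous task; these are frozen for the new task."""
    ranked = sorted(range(len(corr_per_layer)),
                    key=lambda l: corr_per_layer[l], reverse=True)
    return sorted(ranked[:k])

def backward_flops(depth, frozen):
    """Toy cost model: each unfrozen layer contributes one unit of
    backward-gradient work, so freezing cuts backward FLOPs directly."""
    return depth - len(frozen)
```

Because the backward pass is the measured bottleneck, reducing `backward_flops` translates into the training-time and memory savings reported against LUMP.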
3. Dichotomous intronic polyadenylation profiles reveal multifaceted gene functions in the pan-cancer transcriptome.
- Sun, Jiao;Kim, Jin-Young;Jun, Semo;Park, Meeyeon;de Jong, Ebbing;Chang, Jae-Woong;Cheng, Sze;Fan, Deliang;Chen, Yue;Griffin, Timothy J;Lee, Jung-Hee;You, Ho Jin;Zhang, Wei;Yong, Jeongsik
- 《Experimental & Molecular Medicine》
- 2024
- Journal
Alternative cleavage and polyadenylation within introns (intronic APA) generate shorter mRNA isoforms; however, their physiological significance remains elusive. In this study, we developed a comprehensive workflow to analyze intronic APA profiles using the mammalian target of rapamycin (mTOR)-regulated transcriptome as a model system. Our investigation revealed two contrasting effects within the transcriptome in response to fluctuations in cellular mTOR activity: an increase in intronic APA for a subset of genes and a decrease for another subset of genes. The application of this workflow to RNA-seq data from The Cancer Genome Atlas demonstrated that this dichotomous intronic APA pattern is a consistent feature in transcriptomes across both normal tissues and various cancer types. Notably, our analyses of protein length changes resulting from intronic APA events revealed two distinct phenomena in proteome programming: a loss of functional domains due to significant changes in protein length or minimal alterations in C-terminal protein sequences within unstructured regions. Focusing on conserved intronic APA events across 10 different cancer types highlighted the prevalence of the latter cases in cancer transcriptomes, whereas the former cases were relatively enriched in normal tissue transcriptomes. These observations suggest potential, yet distinct, roles for intronic APA events during pathogenic processes and emphasize the abundance of protein isoforms with similar lengths in the cancer proteome. Furthermore, our investigation into the isoform-specific functions of JMJD6 intronic APA events supported the hypothesis that alterations in unstructured C-terminal protein regions lead to functional differences. Collectively, our findings underscore intronic APA events as a discrete molecular signature present in both normal tissues and cancer transcriptomes, highlighting the contribution of APA to the multifaceted functionality of the cancer proteome. © 2024. 
4. Aligner-D: Leveraging In-DRAM Computing to Accelerate DNA Short Read Alignment
- Keywords:
- DNA; Random access memory; Task analysis; Genomics; Bioinformatics; Throughput; Sequential analysis; DNA short read alignment; Processing-in-memory; DRAM; Accelerator
- Zhang, Fan;Angizi, Shaahin;Sun, Jiao;Zhang, Wei;Fan, Deliang
- 《IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS》
- 2023
- Vol. 13
- Issue 1
- Journal
The DNA short-read alignment task has become a major sequential bottleneck in processing the enormous volumes of data generated by next-generation sequencing platforms. In this paper, an energy-efficient and high-throughput Processing-in-Memory (PIM) accelerator based on DRAM (named Aligner-D) is presented to execute DNA short-read alignment with the state-of-the-art BWT alignment algorithm. We first present the PIM design, which exploits DRAM's internal parallelism and high throughput, converting each DRAM array into a potent processing unit for alignment tasks. The proposed Aligner-D can efficiently execute the bulk bit-wise XNOR-based matching operation required by the alignment task with only a 3-transistor-per-column overhead. We then introduce a highly parallel and customized read-alignment algorithm based on BWT that supports both exact- and inexact-match tasks. Next, we present how to map the correlated data of the alignment task so as to maximally exploit the parallelism of both the new hardware and the algorithm. The experimental results demonstrate that Aligner-D obtains ~4×, ~2.45×, ~3.26×, and ~1.65× improvement, respectively, over other in-memory computing platforms: Ambit (Seshadri et al., 2017), DRISA-1T1C (Li et al., 2017), DRISA-3T1C (Li et al., 2017), and ReDRAM (Angizi and Fan, 2019). For DNA short-read alignment, Aligner-D boosts the alignment throughput per watt by ~20104×, ~3522×, ~927×, ~88×, ~5.28×, and ~2.34× over ReCAM, CPU, GPU, FPGA, Ambit, and DRISA, respectively.
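The BWT exact-match search that Aligner-D parallelizes in DRAM can be sketched in plain Python. These are illustrative reference implementations of the textbook FM-index backward search, not the paper's hardware mapping; the naive rotation sort and `str.count` rank query would be replaced by sampled occurrence tables in a real aligner.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' terminates)."""
    text += "$"
    rots = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rots)

def backward_search(bwt_str, pattern):
    """FM-index exact match: walk the pattern right-to-left, narrowing a
    suffix-array interval [lo, hi) with rank queries; returns match count."""
    chars = sorted(set(bwt_str))
    # C[c] = number of characters in the text lexicographically smaller than c
    C = {c: sum(1 for x in bwt_str if x < c) for c in chars}

    def occ(c, i):                      # rank: occurrences of c in bwt_str[:i]
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:                    # interval emptied: no match
            return 0
    return hi - lo
```

Each step of the loop is one "match + count" round, which is exactly the primitive the in-DRAM XNOR/count logic accelerates across many reads at once.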
5. DSPIMM: A Fully Digital SParse In-Memory Matrix Vector Multiplier for Communication Applications
- Keywords:
- Backpropagation;Channel coding;Decoding;Energy efficiency;Matrix algebra;Static random access storage;Belief propagation;Channel decoder;Communication application;Hardware performance;In-memory-computing;MAC;Matrix-vector multipliers;Memory matrix;Neural decoder;Sparsity
- Sridharan, Amitesh;Zhang, Fan;Sui, Yang;Yuan, Bo;Fan, Deliang
- 《60th ACM/IEEE Design Automation Conference, DAC 2023》
- 2023
- July 9, 2023 - July 13, 2023
- San Francisco, CA, United States
- Conference
Channel decoders are key computing modules in wired/wireless communication systems. Recently, neural network (NN)-based decoders have shown promising error-correcting performance because of their end-to-end learning capability. However, compared with traditional approaches, the emerging neural belief propagation (NBP) solution suffers from higher storage and computational complexity, limiting its hardware performance. To address this challenge and develop a channel decoder that achieves high decoding performance and hardware efficiency simultaneously, in this paper we take a first step towards exploring SRAM-based in-memory computing for efficient NBP channel decoding. We first analyze the unique sparsity pattern in NBP processing, and then propose an efficient and fully Digital Sparse In-Memory Matrix vector Multiplier (DSPIMM) computing platform. Extensive experiments demonstrate that our proposed DSPIMM achieves significantly higher energy efficiency and throughput than state-of-the-art counterparts. © 2023 IEEE.
6. A 65nm RRAM Compute-in-Memory Macro for Genome Sequencing Alignment
- Keywords:
- Energy efficiency;Genes;Hafnium oxides;RRAM;Alignment algorithms;Compute-in-memory;Genome sequencing;Genome sequencing alignment;Genomics analysis;Macro design;Memory macro;Memory wall;Short-read alignments;State of the art
- Zhang, Fan;He, Wangxin;Yeo, Injune;Liehr, Maximilian;Cady, Nathaniel;Cao, Yu;Seo, Jae-Sun;Fan, Deliang
- 《49th IEEE European Solid State Circuits Conference, ESSCIRC 2023》
- 2023
- September 11, 2023 - September 14, 2023
- Lisbon, Portugal
- Conference
In genomic analysis, the major computation bottleneck is the memory- and compute-intensive DNA short-read alignment, due to the memory-wall challenge. This work presents the first Resistive RAM (RRAM) based Compute-in-Memory (CIM) macro design for accelerating state-of-the-art BWT-based genome sequencing alignment. Our design supports all the core instructions, i.e., XNOR-based match, count, and addition, required by the alignment algorithm. The proposed CIM macro, implemented in an integration of HfO2 RRAM and 65nm CMOS, demonstrates the best energy efficiency to date, with 2.07 TOPS/W and 2.12 Gsuffixes/J at 1.0 V. © 2023 IEEE.
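The XNOR-based match instruction named above can be mimicked in software: with a 2-bit base encoding (an assumed encoding, for illustration only), two bases match exactly when the bitwise XNOR of their codes is all ones, and the per-position hits feed the count/addition instructions.

```python
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit encoding

def xnor_match(ref, query):
    """Bulk bit-wise XNOR compare of two equal-length DNA strings, the
    core in-array primitive: a position matches when both encoded bits
    agree, i.e., the 2-bit XNOR result is 0b11."""
    hits = 0
    for r, q in zip(ref, query):
        xnor = ~(ENC[r] ^ ENC[q]) & 0b11   # bitwise XNOR on the 2-bit codes
        hits += (xnor == 0b11)             # both bits equal -> base match
    return hits
```

In the macro this comparison happens across an entire row in parallel; the loop here stands in for that row-wide operation.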
7. MeF-RAM: A New Non-Volatile Cache Memory Based on Magneto-Electric FET
- Keywords:
- Magneto-electric FETs; non-volatile memory; memory bit-cell; cache design; PERFORMANCE; BENCHMARKING; OPTIMIZATION; CIRCUIT; ENERGY; WSE2
- Angizi, Shaahin;Khoshavi, Navid;Marshall, Andrew;Dowben, Peter;Fan, Deliang
- 《ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS》
- 2022
- Vol. 27
- Issue 2
- Journal
Magneto-Electric FET (MEFET) is a recently developed post-CMOS FET, which offers intriguing characteristics for high-speed and low-power design in both logic and memory applications. In this article, we present MeF-RAM, a non-volatile cache memory design based on a 2-Transistor-1-MEFET (2T1M) memory bit-cell with separate read and write paths. We show that with proper co-design across the MEFET device, memory cell circuit, and array architecture, MeF-RAM is a promising candidate for fast non-volatile memory (NVM). To evaluate its cache performance in the memory system, we, for the first time, build a device-to-architecture cross-layer evaluation framework to quantitatively analyze and benchmark the MeF-RAM design against other memory technologies, including both volatile memory (i.e., SRAM, eDRAM) and popular emerging non-volatile memory (i.e., ReRAM, STT-MRAM, and SOT-MRAM). The experimental results for the PARSEC benchmark suite indicate that, as an L2 cache memory, MeF-RAM reduces the Energy-Area-Latency (EAT) product on average by ~98% and ~70% compared with typical 6T-SRAM and 2T1R SOT-MRAM counterparts, respectively.
8. APA-Scan: detection and visualization of 3'-UTR alternative polyadenylation with RNA-seq and 3'-end-seq data.
- Keywords:
- 3' Untranslated Regions; MicroRNAs; Protein Isoforms; RNA Precursors; RNA, Messenger; 3'-End-seq; Alternative polyadenylation; RNA-seq; Transcriptome
- Fahmi, Naima Ahmed;Ahmed, Khandakar Tanvir;Chang, Jae-Woong;Nassereddeen, Heba;Fan, Deliang;Yong, Jeongsik;Zhang, Wei
- 《BMC Bioinformatics》
- 2022
- Vol. 23
- Issue Suppl 3
- Journal
BACKGROUND: The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3'-untranslated region (3'-UTR) of mRNA produces transcripts with shorter or longer 3'-UTRs. Often, the 3'-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3'-UTR APA is known to modulate translation and provides a means to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3'-UTR APA events due to incomplete annotations and low-resolution analysis: widely available pipelines do not reference actionable polyadenylation (cleavage) sites but infer 3'-UTR APA solely from RNA-seq read coverage, causing false-positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3'-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations. METHODS: APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3'-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3'-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events between two biological conditions; (iii) graphically represent user-specified events with 3'-UTR annotation and read coverage. APA-Scan is implemented in Python 3. Source code and a comprehensive user's manual are freely available at https://github.com/compbiolabucf/APA-Scan . RESULTS: APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines, DaPars and APAtrap.
In simulations, APA-Scan significantly improved the accuracy of 3'-UTR APA identification compared to the other baselines. Its performance was also validated by 3'-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3'-UTR APA events and improve genome annotation. CONCLUSION: APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3'-UTR APA events. The pipeline integrates both RNA-seq and 3'-end-seq data and can efficiently identify significant events with high-resolution short-read coverage plots. © 2022. The Author(s).
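The coverage-based quantification in step (ii) can be sketched as follows: reads mapping downstream of a candidate cleavage site come only from the long isoform, so the downstream/upstream coverage ratio estimates long-3'-UTR usage, and its shift between conditions flags a candidate APA event. This is a simplified illustration with hypothetical names; APA-Scan's actual statistics are more involved.

```python
def utr_usage(up_cov, down_cov):
    """Fraction of transcripts keeping the long 3'-UTR, from mean read
    coverage upstream (both isoforms) vs downstream (long isoform only)
    of a candidate polyadenylation site."""
    up = sum(up_cov) / len(up_cov)
    down = sum(down_cov) / len(down_cov)
    return down / up if up else 0.0

def apa_shift(cond_a, cond_b):
    """Difference in long-isoform usage between two biological conditions;
    a large absolute shift marks a candidate 3'-UTR APA event."""
    return utr_usage(*cond_a) - utr_usage(*cond_b)
```

A real pipeline would add a significance test on the read counts before calling an event; the ratio alone only ranks candidates.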
9. A 1.23-GHz 16-kb Programmable and Generic Processing-in-SRAM Accelerator in 65nm
- Keywords:
- Computation theory; Cryptography; Energy efficiency; Integrated circuit design; Boolean logic operations; Chip design; Complete sets; Computing platform; Full adders; In-memory computing; Parallel vectors; Programmability; Single cycle; Vector operations
- Sridharan, Amitesh;Angizi, Shaahin;Cherupally, Sai Kiran;Zhang, Fan;Seo, Jae-Sun;Fan, Deliang
- 《48th IEEE European Solid State Circuits Conference, ESSCIRC 2022》
- 2022
- September 19, 2022 - September 22, 2022
- Milan, Italy
- Conference
We present a generic and programmable Processing-in-SRAM (PSRAM) accelerator chip design based on an 8T-SRAM array that accommodates, for the first time, a complete set of Boolean logic operations (e.g., NOR/NAND/XOR, both 2- and 3-input), majority, and full-adder operations, all in a single cycle. PSRAM provides the programmability required for in-memory computing platforms used in applications such as parallel vector operations, neural networks, and data encryption. The prototype is implemented in a 16-kb SRAM macro and is one of the fastest programmable in-memory computing systems to date, operating at 1.23 GHz. The 65nm prototype chip achieves a system-level peak throughput of 1.2 TOPS and an energy efficiency of 34.98 TOPS/W at 1.2 V. © 2022 IEEE.
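The majority and full-adder primitives above map directly onto word-wide bitwise logic. A software analogue, operating on Python ints as bit vectors, shows the relationship the macro exploits: the full adder's carry is exactly the 3-input majority.

```python
def majority(a, b, c):
    """Bit-parallel 3-input majority across a whole word:
    each output bit is 1 when at least two input bits are 1."""
    return (a & b) | (b & c) | (a & c)

def full_add(a, b, cin):
    """Word-wide full adder from the same primitives:
    sum = a XOR b XOR cin, carry = majority(a, b, cin)."""
    return a ^ b ^ cin, majority(a, b, cin)
```

In the chip both expressions are evaluated in one array cycle over an entire row; here each call stands in for that single-cycle, word-parallel operation.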
10. XST: A Crossbar Column-wise Sparse Training for Efficient Continual Learning
- Keywords:
- Continual Learning; In-Memory-Computing; Sparse Learning
- Zhang, Fan;Yang, Li;Meng, Jian;Seo, Jae-sun;Cao, Yu;Fan, Deliang
- 《25th Design, Automation and Test in Europe Conference and Exhibition》
- 2022
- March 14-23, 2022
- Virtual (online)
- Conference
Leveraging ReRAM crossbar-based In-Memory-Computing (IMC) to accelerate single-task DNN inference has been widely studied; however, using the ReRAM crossbar for continual learning remains unexplored. In this work, we propose XST, a novel crossbar column-wise sparse training framework for continual learning. XST significantly reduces the training cost and saves inference energy. More importantly, it is friendly to existing crossbar-based convolution engines, with almost no hardware overhead. The experiments show that XST achieves 4.95% higher accuracy than the state-of-the-art CPG method, along with a ~5.59× training speedup and ~1.5× inference energy saving.
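Column-wise sparsity can be illustrated with a small masking routine: scoring and zeroing whole crossbar columns (here by L1 norm, an assumed criterion for this sketch) keeps the crossbar's column-parallel MAC dataflow intact, unlike element-wise pruning, which would need per-element index bookkeeping.

```python
def column_sparse_mask(weight, keep_cols):
    """Zero entire columns of a weight matrix (crossbar-friendly sparsity):
    score each column by its L1 norm and keep the strongest `keep_cols`."""
    cols = len(weight[0])
    norms = [sum(abs(row[c]) for row in weight) for c in range(cols)]
    keep = set(sorted(range(cols), key=lambda c: norms[c],
                      reverse=True)[:keep_cols])
    return [[w if c in keep else 0.0 for c, w in enumerate(row)]
            for row in weight]
```

Because whole bitline columns are switched off rather than scattered cells, the masked matrix maps onto an unmodified crossbar convolution engine, which is the "almost no hardware overhead" claim above.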
