FPGA Data Ingest Processing for NARA Electronic Records
The Innovative Systems Lab at NCSA (University of Illinois) is exploring FPGA technology for NARA's electronic records management. The project aims to address challenges in data storage, retrieval, and integrity for long-term archiving. FPGAs are software-configurable chips with unique capabilities for on-the-fly routing and signal processing, though they present difficulties in clock speed and processor integration. The focus is on leveraging FPGAs for efficient data management within NARA's Electronic Records Archive.
Presentation Transcript
FPGA Data Ingest Processing for NARA Electronic Records
Craig Steffen, Innovative Systems Lab, NCSA
csteffen@ncsa.uiuc.edu
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
NARA/NSF OCI Grant: Innovative Systems and Software: Applications to NARA Research Problems
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Funded through National Science Foundation Cooperative Agreement NSF OCI 05-25308 and Cooperative Support Agreements NSF OCI 04-38712 and NSF OCI 05-04064 by the National Archives and Records Administration
Peter Bajcsy, PI
NCSA Innovative Systems Lab
Rob Pennington and Mike Showerman at NCSA work with innovative systems for HPC use.
Expertise in: FPGAs, Cell BE, NVIDIA CUDA GPUs
Genesis of the FPGA-at-NARA Project
NARA is creating the Electronic Records Archive (ERA); all federal records are to be digital by 2012. The ERA needs to ingest records and store them with metadata.
Boundary Conditions of the Problem
For records stored for 100 years: you might be able to count on ASCII, but no programs can be assumed to be available.
Other people's problems: how is the data stored (physical medium, format)? How is it retrieved? No compression.
What NARA Cares About
Data storage, data retrieval, data integrity.
Field Programmable Gate Arrays
A software-configurable chip: lower clock speed but higher specialization than a general-purpose processor, and reconfigurable where an ASIC is not.
Typical uses: cores of routers (on-the-fly routing), prototype electronics, front-end signal processing.
What's Difficult About Using FPGAs?
Slow clock speed (100 MHz typical), roughly a 30-to-1 DISadvantage against a scalar CPU.
They are slave processors: they must be activated by a CPU program, have no interrupts, and depend on the CPU for data transfer (or at least data configuration).
Typical programming uses hardware design languages: the user configures buffers, clocks, and wires in VHDL.
Programming is particular to the target chip; different families of chips require re-design, and the basic logic blocks can differ.
FPGAs typically exist (philosophically and logically) in device space; to use one as a processor, you must generate glue routines so that A talks to B.
FPGAs: Deep Parallelism
Execution units can be arranged arbitrarily, including feeding each other. If 1-clock operations A, then B, then C need to be performed on data items 1, 2, 3, 4, 5 (and items 1, 2, 3, 4, ... are independent):
clock 1: item 1 in A
clock 2: item 2 in A, item 1 in B
clock 3: item 3 in A, item 2 in B, item 1 in C
clock 4: item 4 in A, item 3 in B, item 2 in C, item 1 output
clock 5: item 5 in A, item 4 in B, item 3 in C, item 2 output
One output per clock, and this continues to be true for arbitrarily long instruction sequences (add one instruction per clock, still one output per clock). Later on, when tools talk about pipelining, this is what they're talking about.
Limited by: sophistication of programming, data independence, logic complexity.
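As an illustration (not from the original slides), here is a minimal plain-C sketch of the kind of loop this describes; the names abc_pipeline, in, out, and the three one-clock stages are hypothetical. Because every iteration is independent, a Dime-C-style compiler can overlap them so that stage A works on item i+2 while B works on item i+1 and C on item i, producing one output per clock once the pipeline fills.

#include <stdint.h>
#include <stddef.h>

/* Three 1-clock stages, A then B then C, applied to independent data items. */
void abc_pipeline(const uint32_t *in, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = in[i] + 7;   /* stage A */
        uint32_t b = a << 1;      /* stage B */
        uint32_t c = b ^ 0x55;    /* stage C */
        out[i] = c;               /* one result per clock once the pipeline is full */
    }
}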
FPGAs: Wide Pipelining
A, B, and C performed on data objects 1 through 10.
Deep parallel only: item 5 in A, 4 in B, 3 in C, items 2 and 1 already output.
Wide parallel (two lanes): item 9 in A, 7 in B, 5 in C, items 3 and 1 output; item 10 in A, 8 in B, 6 in C, items 4 and 2 output.
Limited by: input and output throughput.
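Continuing the hypothetical sketch above, the same loop written two-wide in plain C: two copies of the A-B-C chain work on adjacent items in the same clock, so two results emerge per clock, limited by input and output throughput. (On the Nallatech stack described later, Dime-C performs this widening automatically; the hand-unrolled form is only for illustration.)

#include <stdint.h>
#include <stddef.h>

void abc_two_wide(const uint32_t *in, uint32_t *out, size_t n)
{
    /* two lanes: each clock advances both A-B-C chains */
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t a0 = in[i] + 7,  a1 = in[i + 1] + 7;    /* stage A, lanes 0 and 1 */
        uint32_t b0 = a0 << 1,    b1 = a1 << 1;          /* stage B */
        out[i]     = b0 ^ 0x55;                          /* stage C, lane 0 */
        out[i + 1] = b1 ^ 0x55;                          /* stage C, lane 1 */
    }
    if (n & 1)                                           /* odd leftover item */
        out[n - 1] = ((in[n - 1] + 7) << 1) ^ 0x55;
}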
Tools: How Do We Design Algorithms?
ISL's philosophy: scientific/algorithm practitioners must be part of the design process. We believe that algorithm design must not require hardware expertise (coding is OK, but we never want to see a clock or a buffer).
Compilers translate a high-level structured language into a machine design (VHDL); low-level libraries and glue logic take the VHDL and create a machine design. Practitioners should know the compiler and basic hardware design, but possibly not the libraries and glue logic.
Computational FPGA Vendors
SRC Computers Inc.: hardware is the MAP processor, in an integrated system; in-house Carte-C compiler language; glue logic is all integrated and invisible.
Nallatech: hardware is FPGA accelerator cards (coming out with an Intel FSB-socket design); in-house Dime-C compiler (we'll see this later); glue logic is the DimeTalk board design, for which we have written improvements; the CPU-socket system comes with an improved version.
Hardware-only: Xtreme Data (AMD CPU-socket FPGA hardware), DRC (AMD CPU-socket FPGA hardware), AlphaData (embedded FPGA designs), SGI (integrated modular FPGA systems).
Software-only: Impulse (streaming FPGA compiler and support software), Mitrionics (Mitrion-C parallel language and the Mitrion virtual soft-processor).
What Can We Contribute?
FPGAs take a long time to activate and are good for data-intensive algorithms; they are best used when every piece of data needs the same processing. So: what information do we need to extract from every file that comes into the ERA?
Problem: there are many different file types; file types change and software versions change (remember: tens of years), so unpacking files would be a major problem. How do we deal with this diverging difficulty? Ignore it.
Nallatech Software Stack
Dime-C compiler: Dime-C is a mostly-ANSI-C language; arrays are implemented as block RAMs and scalar variables as registers; the compiler indicates deep-pipelining states; wide pipelining is done automatically.
DimeTalk hardware network design tool: a hardware/bus/card design tool that implements an internal switching network for data movement (saving you from designing explicit hardware buffers for moving data); the DimeTalk network is 32 bits wide by definition; it is too complex for everyday programming use and requires wiring up components.
Improvement is on the way: the new FSB system being ordered has the Nallatech Abstraction Layer, which promises to do away with the role of DimeTalk for the FSB system. Its status with regard to older cards is unclear.
Debugging via Trace Functions
There is no debugger for Dime-C, no behavioral simulation, and no printf (the Dime-C code runs on a separate hardware device).
Poor man's printf: a diagnostic array, a block-RAM array that the controlling program reads out after the run. The array holds copies of the input parameters and the final values of loop variables (to diagnose whether the code ran at all, or hit an alarm condition). Be sure to initialize it, to prevent stale data from being read.
Screenshot example of trace code:
i = 0;
output_count = 0;
loop_count = 0;
/* copy the input parameters into the diagnostic array */
DataOut[0] = HostSetPacketSize;
DataOut[1] = SndRcv;
DataOut[2] = delay;
DataOut[3] = wordstosend;
DataOut[4] = direction;
if (SndRcv == 0) {  /* we're receiving */
    /* [deleted for clarity] */
    my_count = DataOut[6] = (int)GetPipeCount(ChannelA.datain);
    i = 0;
    while ((i < my_count) && (output_count < INT_RAMSIZE_0)) {
        DataReceived[output_count] = readFIFO(ChannelA.datain);
        /* ... rest of the loop not shown in the screenshot ... */
nnalli Library
We created the NCSA/Nallatech Interface (nnalli) library in 2008. It holds state information for the running cards and provides an API layer over the hardware-oriented FUSE software library from Nallatech.
nnalli library functions:
nnalli_init(int my_NUMDEVICES, char **my_devicefiles, DWORD **my_deviceinfo, int my_memory_map_0, DWORD default_timeout);
int nnalli_run(nnalli_accelerator_t *my_accel);
int nnalli_wait(nnalli_accelerator_t *my_accel);
int nnalli_run_wait(nnalli_accelerator_t *my_accel);
int nnalli_write_4byte_values(nnalli_accelerator_t *my_accel, void *source_pointer, int destination_address, int destination_node, int num_values);
int nnalli_read_4byte_values(nnalli_accelerator_t *my_accel, void *destination_pointer, int source_address, int source_node, int num_values);
Teaching Students at the CHPC in Cape Town
Craig Steffen was invited by Mike Inggs of the University of Cape Town to give a class in reconfigurable computing in December 2008 at the now-forming Centre for High Performance Computing. The students were about 16 advanced undergraduates and graduate students. The CHPC has Nallatech hardware and software, so the class used Dime-C, DimeTalk, and the nnalli abstraction library. We realized some of the limitations of the library for everyday use; more on this later.
Nallatech H101 Card
Nallatech's first card in its High-Performance Computing product line. Announced in 2007, based on the Xilinx Virtex-4 FPGA family. PCI-X 133 bus (obsolete now). Each chip has 0.6 MB of internal block RAM. Each card has 4 banks of SRAM and 1 bank of SDRAM at 512 MB, plus 4 high-speed serial links to communicate with other nearby cards.
Nallatech H101
H101 Limitations
Dime-C/DimeTalk designs run at 100 MHz. The bus width for the DimeTalk network is 32 bits (the hardware bus is 64 bits). Block RAM can be arbitrarily configured, up to 0.6 MB (effectively infinite FPGA bandwidth for small data sets). SRAM is dual-ported and pipelined, so up to 8 32-bit data references per clock for collections from 0.6 MB up to 16 MB. SDRAM is quad-ported but NOT pipelined: up to 4 non-pipelined memory references per clock.
H101 Timing Measurements
Loading the H101 Virtex-4 configuration: ~600 ms. Run-stop sequence (running a function of zero length): 200 µs. Memory transfer to/from SRAM and block RAM: 23 µs latency, 290 MB/s throughput.
FPGA Design Process
Code is written in the Dime-C dialect using the SDK. The compile takes seconds, but the DimeTalk build includes the place-and-route process for the chip, which can take many hours, or many tens of hours for a fairly full design. Place-and-route time is not a problem for a closed design, but place-and-route can fail (after 10 or 20 hours). So all software is developed and tested first in C, and parts of it are then ported over to the FPGA processor.
Variable Re-use Rule: Breaking Pipelining
Deep pipelining requires strict data flow through the inner loop. C is a sequential language; Dime-C is a parallel language, and statement execution can happen in ANY ORDER to satisfy pipelining requirements. Statements are put in parallel if they don't depend on each other. Therefore a variable can be read multiple times but only assigned once; otherwise the order of the assignments matters and pipelining breaks.
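A small plain-C illustration of the rule, with hypothetical names (combine_bad, combine_good, a, b, out): the first loop assigns sum twice per iteration, so statement order matters and the loop will not deep-pipeline; the second gives each intermediate value its own single-assignment name, leaving the compiler free to reorder.

#include <stdint.h>
#include <stddef.h>

void combine_bad(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t sum;
        sum = a[i];        /* first assignment to sum */
        sum = sum + b[i];  /* second assignment: order now matters, pipelining breaks */
        out[i] = sum;
    }
}

void combine_good(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t t0 = a[i];       /* each name is assigned exactly once... */
        uint32_t t1 = t0 + b[i];  /* ...and may be read as often as needed */
        out[i] = t1;
    }
}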
Text Parsing for Indexing
Ignore the file format: blindly search for strings of ASCII text characters. The character map is arbitrarily configurable (via a 255-character input array). Blocks of text surrounded by non-text are logged as words; the non-text is ignored. The raw product is an unsorted list of words (or word hashes) as 64-bit unsigned integers.
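A minimal plain-C sketch of the scanner just described; the names (scan_words, hash_word, is_text) and the FNV-1a hash are assumptions for illustration, not the production Dime-C kernel. A configurable table marks which bytes count as text; runs of text bytes bounded by non-text are emitted as 64-bit word hashes.

#include <stdint.h>
#include <stddef.h>

/* One plausible 64-bit word hash (FNV-1a); the project's actual hash is not specified here. */
static uint64_t hash_word(const unsigned char *w, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) { h ^= w[i]; h *= 1099511628211ULL; }
    return h;
}

/* is_text[c] is nonzero if byte c counts as a text character (configurable map). */
size_t scan_words(const unsigned char *buf, size_t n, const unsigned char is_text[256],
                  uint64_t *hashes, size_t max_hashes)
{
    size_t count = 0, start = 0;
    int in_word = 0;
    for (size_t i = 0; i <= n; i++) {
        int text = (i < n) && is_text[buf[i]];
        if (text && !in_word) { in_word = 1; start = i; }          /* word begins */
        else if (!text && in_word) {                               /* word ends   */
            if (count < max_hashes)
                hashes[count++] = hash_word(buf + start, i - start);
            in_word = 0;
        }
    }
    return count;   /* an unsorted list of 64-bit word hashes */
}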
Raw Text Indexing Scheme
Everything must be trivially parseable, so we created an index format using ASCII information. An index file lists the files it indexes (or other index files); the rest of the file is an ordered list of the words encountered. The result is a hierarchical file format with one master file at the top. A prototype was deployed on the NARA demo system.
Text Search Infrastructure
The text-finding scheme searches through the file hierarchy starting at a top index. It follows branches of the index tree where at least one file contains each search term. Common words are contained in all branches, so they have no effect (but are no detriment).
Index File Example 1
0
./109/h1607_ih.xml ./109/h1609_ih.xml ./109/h1610_ih.xml ./109/h1611_ih.xml ./109/h1612_ih.xml ./109/h1613_ih.xml ./109/h1614_ih.xml ./109/h1615_ih.xml ./109/h1616_ih.xml ./109/h1617_ih.xml ./109/h1618_ih.xml
00151664 00160447 00200442 00301762 00500445 00700667 00714002
A level-0 file: the leading 0 indicates the index level (level 0 references individual files), the file-name list shows the files referred to, and the rest of the file is the word-hash list.
Index File Example 2
1
.index_00_0000.ind .index_00_0001.ind .index_00_0002.ind .index_00_0003.ind .index_00_0004.ind .index_00_0005.ind .index_00_0006.ind .index_00_0007.ind .index_00_0008.ind .index_00_0009.ind .index_00_0010.ind
00001452 00006235 00014005 00054001 00064002 00070676
A level-1 file is an index of level-0 index files: after the list of referenced index files, the rest of the file is the word-hash list for all of the referenced original files.
Index File Example 3
4
.index_03_0000.ind .index_03_0001.ind
00000550 00001452 00006235 00014005 00014043 00014044 00014046 00014047 00015415 00016234 00017234
A level-4, top-level index file: just 2 child index files, followed by the hash list for the entire collection (the file is very large, but it is one monolithic file).
Explanation of the Index Philosophy
Assume that files and data will one day sit on archive volumes whose indexing information has been LOST. With only the information on and about a data storage volume, determine whether the volume might contain a certain set of files. The master index of the volume can determine that the data is not on the volume, in which case the entire volume does not need to be mounted and read. This is only a crude indexing system; it is not meant as a fine-grained search tool.
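A minimal sketch of the pruning check this philosophy implies, in plain C and under stated assumptions: the index layout matches the examples above (a level number, then child file names, then decimal word hashes, all whitespace-separated), and the function and variable names are hypothetical. It answers only the question "could this index possibly contain every query term?"; returning 0 means the branch (or whole volume) can be skipped without being read further.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Return 1 if every query-term hash appears in the index file's hash list. */
int index_may_contain(const char *index_path, const uint64_t *query, size_t nquery)
{
    if (nquery > 64) nquery = 64;            /* this sketch supports up to 64 terms */
    FILE *f = fopen(index_path, "r");
    if (!f) return 0;

    int level;
    if (fscanf(f, "%d", &level) != 1) { fclose(f); return 0; }   /* index level */

    char tok[4096];
    unsigned char seen[64] = {0};
    size_t found = 0;
    while (found < nquery && fscanf(f, "%4095s", tok) == 1) {
        char *end;
        uint64_t h = strtoull(tok, &end, 10);
        if (*end != '\0') continue;          /* token is a file name, not a hash */
        for (size_t q = 0; q < nquery; q++)
            if (!seen[q] && h == query[q]) { seen[q] = 1; found++; }
    }
    fclose(f);
    return found == nquery;   /* 0 => this branch/volume cannot hold the file set */
}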
Search Appliance Screen Shot 1
Search Appliance 2; Index Searchable
Search Appliance Timing Examples
Timing examples for the search appliance, using a NARA-generated file-hierarchy index. Large file set: SEC documents from 2005; 686,000 XML-annotated files totaling 146 GB on disk (we could only put part of the collection on the machine, as it mostly fills up the hard drive).
Search Appliance: More Obscure Words = Less Time
The less common a word is, the less of the tree has to be searched. If a word does NOT exist at all in the top-level index file, no file contains that word and the search ends; very obscure words generate zero-time searches. The more words, and the more obscure the words, the faster the search runs.
Text Index: Dropping Undesired Keywords
Some words are not wanted (formatting tags, the XML tags in XML documents, and XML keywords), and the raw text parser outputs these words along with all the others. Unwanted words can be eliminated at index-storage time, which is after sorting. XML tags are strictly grouped and are easy to eliminate (this exists in the hardware version but not yet in software).
Sorting on the FPGA
We have not implemented a sorting algorithm in the FPGA; data is passed back to the CPU unsorted, in the order it was encountered. Sorting is a problem that scalar CPUs (fast, instruction-based processors) are very good at, so it is not likely to be a win for the FPGA.
Memory Bank Multiple Access
Arrays in Dime-C are directly implemented as literal banks of internal memory, and that memory is single-ported from the Dime-C process: you may only make ONE reference to a memory bank and still be pipelineable.
A = a[i];
B = a[j]; // legal
C = a[k]; // will not pipeline
Temporary Variables to Help the Compiler
for (i = 0; ...) {
    A1 = a[i];
    A2 = a[i]; // will not pipeline
}
Instead:
for (i = 0; ...) {
    temp = a[i];
    A1 = temp;
    A2 = temp;
}
temp is a register, and can be fanned out into various virtual variables that can be read multiple times.
Bit Shifting Operations
Bit-shifts and bit-masks have very low gate cost; they are implemented as wires inside the FPGA. All other numerical operations are implemented as fully general-purpose operators (taking lots of gates), and the lower the bit width, the lower the gate count. Avoid divides if possible; divide and multiply by powers of 2 with bit-shifts.
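A tiny illustration in plain C (values and the helper name are made up): each of these operations reduces to rewiring or dropping bits on the FPGA, whereas a general multiply or divide would consume many gates.

#include <stdint.h>

uint32_t shift_and_mask_examples(uint32_t x)
{
    uint32_t times16 = x << 4;     /* x * 16                                */
    uint32_t over8   = x >> 3;     /* x / 8 (for unsigned values)           */
    uint32_t mod32   = x & 31;     /* x % 32 via a mask instead of a divide */
    uint32_t lowbyte = x & 0xFF;   /* extract the low 8 bits                */
    return times16 + over8 + mod32 + lowbyte;   /* combined so the results are used */
}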
Dime-C Code Screenshot
Dime-C Pipeline Display
DimeTalk System Design Screenshot
Histogramming Discussion
Histogramming is a real computer-science problem on an FPGA. Each update of the histogram array consists of a read of a location, an increment, and a write to the same location, which can only be pipelined if you can guarantee that subsequent reads and writes won't overlap. Histogramming in the FPGA is therefore non-pipelined, sequential-only code (each loop iteration takes X clocks before the next one starts).
Histogram Solution: Wide Pipelining Only
Pipelining can still be done with non-deep-pipelined loops by using many different block RAMs to hold sub-histograms, with a second loop at the end to add the sub-histograms together. Another solution: a small number of bins (dozens or perhaps hundreds) can be histogrammed via hardware accumulators. This can be done in hardware; we used to be able to trick the compiler into it, but the compiler was updated and that trick is no longer allowed, so you have to convince Dime-C to do it. A sketch of the sub-histogram approach follows.
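A plain-C sketch of the sub-histogram idea under stated assumptions (LANES, BINS, and the function name are illustrative): on the FPGA each lane's histogram would live in its own block RAM so the lanes can update in the same clock, and the second loop adds the sub-histograms together exactly as described above.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LANES 4     /* number of sub-histograms (one block RAM each on the FPGA) */
#define BINS  256

void histogram_lanes(const uint8_t *data, size_t n, uint32_t out[BINS])
{
    uint32_t hist[LANES][BINS];
    memset(hist, 0, sizeof hist);

    /* each lane handles every LANES-th byte: separate memories, no read/write collisions */
    for (size_t i = 0; i + LANES <= n; i += LANES)
        for (int k = 0; k < LANES; k++)
            hist[k][data[i + k]]++;
    for (size_t i = n - (n % LANES); i < n; i++)   /* leftover bytes */
        hist[0][data[i]]++;

    /* second loop: add the sub-histograms together */
    for (int b = 0; b < BINS; b++) {
        uint32_t sum = 0;
        for (int k = 0; k < LANES; k++)
            sum += hist[k][b];
        out[b] = sum;
    }
}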
Bit Shifting for Input
The input file for text parsing is stored in a 4-byte-wide array, so one read per clock yields four bytes (4 characters). The algorithm must step through the file one character at a time to check for text boundaries, so we split each input word and analyze the file 4-wide in parallel (the file is processed in one clock per 4-byte word rather than one clock per byte).
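A sketch in plain C of the splitting step (names hypothetical; the little-endian byte packing is an assumption): one 32-bit read is broken into four characters with shifts and masks, so four character-map lookups happen per loop iteration.

#include <stdint.h>
#include <stddef.h>

/* For each 32-bit input word, classify all four bytes against the text map. */
void classify_4wide(const uint32_t *words, size_t nwords,
                    const unsigned char is_text[256], unsigned char flags[])
{
    for (size_t w = 0; w < nwords; w++) {
        uint32_t v = words[w];
        unsigned char c0 = (unsigned char)( v        & 0xFF);   /* byte 0 */
        unsigned char c1 = (unsigned char)((v >> 8)  & 0xFF);   /* byte 1 */
        unsigned char c2 = (unsigned char)((v >> 16) & 0xFF);   /* byte 2 */
        unsigned char c3 = (unsigned char)((v >> 24) & 0xFF);   /* byte 3 */
        flags[4 * w]     = is_text[c0];   /* four boundary checks per word read */
        flags[4 * w + 1] = is_text[c1];
        flags[4 * w + 2] = is_text[c2];
        flags[4 * w + 3] = is_text[c3];
    }
}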
Eliminating If-Then-Else
An if-then structure is a control-change statement and cannot be executed in a pipelined fashion (it causes halts, or bubbles, in the pipeline); pipelined control must be static. Numerical and logical operations can be pipelined, so it is better to AND two values together than to check whether one of them is zero.
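An illustration in plain C (hypothetical example and names) of trading a branch for logic that stays in the pipeline: both functions sum the b[i] whose mask[i] is nonzero, but the second does it with a mask and an AND instead of an if.

#include <stdint.h>
#include <stddef.h>

uint32_t sum_selected_branchy(const uint32_t *b, const uint32_t *mask, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (mask[i] != 0)      /* control change: stalls a hardware pipeline */
            sum += b[i];
    }
    return sum;
}

uint32_t sum_selected_branchless(const uint32_t *b, const uint32_t *mask, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t keep = 0u - (uint32_t)(mask[i] != 0);  /* all-ones or all-zeros */
        sum += b[i] & keep;    /* pure logic and arithmetic: pipelines cleanly */
    }
    return sum;
}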
Input Validation: File Characterization
A file's bytes can be histogrammed to create a basic imprint of the file. Different file types exhibit different groupings of bytes (text characters vs. control characters), so a file can be tested for validation against a known character-frequency map for files of its claimed type. This can be done as part of ingest processing.
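A minimal plain-C sketch of the validation idea under stated assumptions: histogram the file's bytes, normalize to frequencies, and compare against a reference frequency map for the claimed file type. The function names, the squared-error distance, and the acceptance threshold are all illustrative choices, not the project's actual method.

#include <stdint.h>
#include <stddef.h>
#include <math.h>

/* Distance between a file's byte frequencies and a reference profile; smaller = closer match. */
double profile_distance(const uint8_t *data, size_t n, const double ref[256])
{
    uint32_t hist[256] = {0};
    for (size_t i = 0; i < n; i++)
        hist[data[i]]++;

    double dist = 0.0;
    for (int b = 0; b < 256; b++) {
        double freq = (n > 0) ? (double)hist[b] / (double)n : 0.0;
        double d = freq - ref[b];
        dist += d * d;                    /* simple squared-error comparison */
    }
    return sqrt(dist);
}

/* Example policy: accept the file as plausibly of the claimed type below a tuned threshold. */
int looks_like_type(const uint8_t *data, size_t n, const double ref[256])
{
    return profile_distance(data, n, ref) < 0.25;   /* threshold is illustrative only */
}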
File Characterization: doc file