1/4/2011 5:00 PM EST
According to their paper from last year, each RC (reconfigurable cell) in the MORA (multimedia-oriented reconfigurable architecture) SIMD processing array is an 8-bit PIM (processor-in-memory) with 256 bytes of block RAM, two input ports, and two output ports. Each RC has a PE (processing element) that performs 8-bit fixed-point arithmetic as well as logical, shifting, and comparison operations. A controller handles asynchronous handshaking between upstream and downstream RCs.

They created a C++ DSL (domain-specific language) to program the RCs at a high level of abstraction. For example, an 18-line implementation written in their C++ DSL expands to 16,803 lines of VHDL.

In the same paper they implemented an 8-bit DCT on the Virtex-4 LX200. It used 22 RCs and 3,368 Virtex logic slices and executed in 200 cycles at 100 MHz. They could squeeze 25 copies of this DCT into the LX200, which by my calculations would yield 12.5 MB/s (11.92 MiB/s): 25 copies × 100 MHz / 200 cycles = 12.5 million DCTs per second, counted at one byte each. I think they’ve since switched to the Virtex-4 SX (the signal processing model) and possibly optimized for its greater number of XtremeDSP slices rather than implementing every PE with logic cells. That said, the reported throughput of 5 GiB/s makes me wonder which algorithm “central to MPEG decoding” they actually implemented.
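For anyone who wants to check my numbers, here is a minimal Python sketch of the throughput estimate. The copy count, clock rate, and cycle count come from the paper as quoted above; the function name `dct_throughput` and the `bytes_per_dct` parameter are my own, and the default of one byte per DCT pass is an assumption on my part that the paper may not share.

```python
def dct_throughput(copies=25, clock_hz=100_000_000,
                   cycles_per_dct=200, bytes_per_dct=1):
    """Estimate aggregate throughput in bytes per second.

    bytes_per_dct is an assumption (one byte per DCT pass), not a
    figure taken from the paper.
    """
    # Each copy finishes one DCT every cycles_per_dct clock cycles.
    dcts_per_second = copies * clock_hz / cycles_per_dct
    return dcts_per_second * bytes_per_dct


bytes_per_s = dct_throughput()
print(f"{bytes_per_s / 1e6:.2f} MB/s")      # decimal megabytes
print(f"{bytes_per_s / 2**20:.2f} MiB/s")   # binary mebibytes

# How far is the estimate from the reported 5 GiB/s?
reported = 5 * 2**30
print(f"gap: about {reported / bytes_per_s:.0f}x")
```

With the defaults this reproduces the 12.5 MB/s (11.92 MiB/s) figure. Raising `bytes_per_dct` scales the estimate linearly, so the gap to 5 GiB/s depends heavily on how much data each DCT pass is assumed to process.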