This document proposes a novel flexible accelerator architecture built from flexible computational units (FCUs) that perform DSP operations efficiently using carry-save arithmetic. Each FCU operates directly on carry-save operands and can be configured to implement templates of common DSP operations, such as multiplication and addition/subtraction. Because operands remain in carry-save format throughout the FCU, intermediate conversions to and from conventional binary representation are avoided, which removes carry-propagation delay from chained operations and improves performance over prior approaches. The proposed architecture aims to achieve high computational density while reducing area and power compared to existing inflexible accelerator designs.
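To illustrate why staying in carry-save format avoids intermediate conversions, the following is a minimal software sketch of carry-save addition on non-negative integers. It is not a model of the proposed FCU: the function names `carry_save_add`, `cs_accumulate`, and `cs_resolve` are hypothetical and chosen only for this example. Each step reduces three operands to a (sum, carry) pair using bitwise logic alone, and a carry-propagating addition happens only once, at the very end.

```python
def carry_save_add(a, b, c):
    """3:2 compression: reduce three operands to a (sum, carry) pair.

    Every bit position is computed independently, so no carry ripples
    across the word. Invariant: sum + carry == a + b + c.
    Assumes non-negative integers (illustrative simplification).
    """
    partial_sum = a ^ b ^ c                       # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, shifted left
    return partial_sum, carry


def cs_accumulate(operands):
    """Accumulate operands while keeping the running total in carry-save form."""
    s, c = 0, 0
    for x in operands:
        # The running (s, c) pair is fed straight back in: no conversion
        # to a single binary number between steps.
        s, c = carry_save_add(s, c, x)
    return s, c


def cs_resolve(s, c):
    """Final conversion: the only carry-propagating addition in the chain."""
    return s + c


if __name__ == "__main__":
    s, c = cs_accumulate([13, 7, 22, 5])
    print(cs_resolve(s, c))  # 47
```

In hardware terms, `carry_save_add` corresponds to a row of independent full adders, while `cs_resolve` corresponds to the single carry-propagate adder that would otherwise be needed after every operation.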