Bardess brings years of experience with large servers. When you combine large servers with large BI applications and want peak performance, everything about your processors, memory, and memory bus becomes an issue.
For example, as you increase the number of DIMMs per channel (there are 4 channels per processor), the memory controllers cannot operate at peak memory speed: 1600MHz memory can drop down to 1333MHz or even 1066MHz. At that point the peak throughput advertised for your processors (e.g. 8.0 GT/s) has just dropped by up to 33%. But if you try to fix this by reducing your DIMMs per channel and using larger sticks instead, your channel throughput drops by 25-50%: memory controllers interleave data across multiple sticks so it can be accessed simultaneously, and with fewer sticks you need more requests to move the same amount of data.
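To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The down-clock points and the 4-channel, 64-bit-channel assumptions are illustrative; the exact values depend on your specific CPU, board, and DIMM ranks.

```python
# Rough sketch of per-socket memory bandwidth as channels are populated more
# heavily. The down-clock points (1600 -> 1333 -> 1066 MT/s at 1, 2 and 3
# DIMMs per channel) are illustrative, not tied to a specific platform.

CHANNELS_PER_SOCKET = 4
BYTES_PER_TRANSFER = 8          # 64-bit DDR channel

# Assumed effective speed (MT/s) at each DIMMs-per-channel population level.
speed_at_population = {1: 1600, 2: 1333, 3: 1066}

for dimms_per_channel, mt_per_sec in speed_at_population.items():
    gb_per_sec = mt_per_sec * 1e6 * BYTES_PER_TRANSFER * CHANNELS_PER_SOCKET / 1e9
    drop = 1 - mt_per_sec / 1600
    print(f"{dimms_per_channel} DIMM(s)/channel: {mt_per_sec} MT/s "
          f"-> ~{gb_per_sec:.1f} GB/s per socket ({drop:.0%} below peak)")
```

Running this shows roughly 51 GB/s per socket at one DIMM per channel falling to about 34 GB/s at three, which is where the "up to 33%" figure comes from.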
You cannot improve the situation by simply adding processors, because inter-processor communication in a non-NUMA-aware product, which describes most BI backends, will rapidly erase any memory performance gain. Four processors is the real-world maximum unless the product is NUMA aware. Another factor is CAS latency: a poor choice in memory latency can mean a 33% decrease in output from the memory sticks, and you can't make that up somewhere else.
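To see why CAS latency matters that much for small, latency-bound reads, here is a hypothetical comparison. The CL8 and CL12 figures below are chosen purely to illustrate a roughly one-third penalty; they are not taken from any particular product.

```python
# Hypothetical illustration of how CAS latency affects latency-bound access.
# True latency in nanoseconds = CAS cycles / memory clock, where the memory
# clock is half the transfer rate.

def true_latency_ns(transfer_rate_mt_s, cas_cycles):
    memory_clock_mhz = transfer_rate_mt_s / 2
    return cas_cycles / memory_clock_mhz * 1000

fast = true_latency_ns(1600, 8)    # e.g. a low-latency DDR3-1600 stick -> 10.0 ns
slow = true_latency_ns(1600, 12)   # e.g. a slow-timing DDR3-1600 stick -> 15.0 ns

# For small, random, latency-bound reads, throughput scales roughly with the
# inverse of latency, so the slower choice delivers about two-thirds the output.
print(f"fast: {fast:.1f} ns, slow: {slow:.1f} ns, "
      f"relative output: {fast / slow:.0%}")
```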
Overall, you can start with an 8.0 GT/s QPI link rated at up to 32GB/s of bandwidth and easily end up with effective throughput closer to 10GB/s.
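Here is a rough sketch of where that 32GB/s headline number comes from and how a few poor configuration choices can compound it down to roughly 10GB/s. The penalty factors are illustrative assumptions, not measurements.

```python
# Where the headline 32 GB/s comes from, and how it erodes. QPI at 8.0 GT/s
# moves 2 bytes per transfer in each direction:
qpi_gt_s = 8.0
link_gb_s = qpi_gt_s * 2 * 2          # 2 bytes/transfer x 2 directions = 32 GB/s

# Illustrative (not measured) penalty factors for a poorly chosen config:
penalties = {
    "memory down-clock (3 DIMMs/channel)": 0.67,
    "high CAS latency":                    0.75,
    "cross-socket (non-NUMA) traffic":     0.65,
}

effective = link_gb_s
for reason, factor in penalties.items():
    effective *= factor
    print(f"after {reason}: {effective:.1f} GB/s")
```

Run it and the 32GB/s starting point lands at about 10.5GB/s once the three penalties stack.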
Hopefully this gives you an idea of the subtle complexity of large-scale systems. If you want peak performance under these conditions, you really need to work with a high-quality vendor that understands all of these tradeoffs and has intimate knowledge of their hardware. Buying whatever IBM, Dell and HP are selling on their websites will not get you to peak performance.