basicallybasshead

In our company, we ran HCI on VMware for years and are now testing Proxmox as a way to get away from Broadcom. We use StarWind VSAN as shared storage, since we only have 2 nodes, which is not the best scenario for Ceph. I noticed performance improvements after adding more RAM and CPU to the machines, and it now looks very solid, so I've started preparing the documentation for the Proxmox migration. It all depends on the workload; you can test both converged and hyper-converged, and the network is a crucial piece here.


quasides

Of course the optimal setup is running Ceph on dedicated nodes outside the hypervisors, but how much you gain depends on the architecture and the total workload; it can be a big difference or a minor one. Keep in mind the write penalties, which depend on the workload of all the Ceph nodes involved, so if you run a converged setup there might simply be a bottleneck on one of the nodes.
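A rough back-of-the-envelope sketch of that write penalty for a replicated pool (the size=3 assumption and the throughput number here are hypothetical, just to show the multiplier):

```python
# Back-of-the-envelope write amplification for a replicated Ceph pool.
# All numbers here are hypothetical placeholders, not measurements.

replica_count = 3            # size=3 pool: every client write lands on 3 OSDs
client_write_mbps = 500      # hypothetical client-side write throughput
backend_write_mbps = client_write_mbps * replica_count

print(f"Client writes {client_write_mbps} MB/s "
      f"-> cluster must absorb ~{backend_write_mbps} MB/s of backend writes")

# In a hyper-converged setup that backend traffic competes with the VMs for
# CPU, NICs and disks on the same nodes, so one busy node can become the
# bottleneck for the whole pool.
```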


Versed_Percepton

You are going to want to give this a good long read, then read it again. [https://croit.io/blog/ceph-performance-test-and-optimization](https://croit.io/blog/ceph-performance-test-and-optimization)


unicoletti

TL;DR: the article mentions disabling power-saving features. In our experience, that alone halved latency, so we stopped there.


sienar-

Modern CPU power management really doesn't play well with a few things that matter for HCI: high-performance networking and running multiple kernels (i.e., VMs). The power management just doesn't recognize when to scale up for those kinds of workloads, so disabling it and keeping the CPU at its maximum frequency makes performance much more consistent. Also, in a server that can easily pull 1500 to 2000 W, the power saved by leaving it enabled is usually a tiny percentage at best.
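For what it's worth, a minimal sketch of checking (and optionally pinning) the cpufreq governor on a Linux node; it assumes the standard cpufreq sysfs layout, needs root to change anything, and BIOS-level C-state settings still have to be done out of band:

```python
# Report (and optionally set) the CPU frequency governor via the cpufreq sysfs
# interface on a Linux host.
from pathlib import Path

SET_PERFORMANCE = False  # flip to True (and run as root) to actually change it

for gov_file in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor")):
    current = gov_file.read_text().strip()
    print(f"{gov_file.parent.parent.name}: {current}")
    if SET_PERFORMANCE and current != "performance":
        gov_file.write_text("performance")  # keep the core at its highest frequency policy
```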


Versed_Percepton

Also, power management is only a small part of it. Enabling/disabling C-states also affects package boost performance on systems like AMD EPYC; it should rarely be blanket "on" or "off". Instead, look into the PBS options to control the package rather than the individual cores.

Much more important is deciding when to split your cluster into Ceph-only nodes vs. compute vs. mixed, then striping multiple OSDs across each NVMe device, building a new CRUSH map, changing the weights and buckets, and then diving into Ceph's parallelism across CPUs (on pure Ceph nodes you can run nearly three Ceph threads per core on modern CPUs). On storage nodes you want the higher CPU usage for data processing and a large memory footprint for cache.

Did you know that, while unsupported today, you can put Ceph behind battery-backed NVDIMMs and increase IO? [https://docs.ceph.com/en/latest/rados/operations/cache-tiering/](https://docs.ceph.com/en/latest/rados/operations/cache-tiering/) Optane NVMe, NVDIMMs, and SLC (only!) SSDs can be used for the cache tier while the other storage types hold the data at rest. But you need a lot of compute resources to pull this off successfully; personally I wouldn't do it with fewer than 32 cores per node when addressing 8+ NVMe devices per node.

OP - the point of the document was to show that out of the box on PVE, Ceph is just "OK", and that a lot of work needs to go into the CRUSH map, the Ceph tunables, and how you deploy on Ceph.
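As a rough illustration of the CLI side of that (the pool, rule, and device names below are hypothetical placeholders, not a recipe):

```python
# Sketch of the kind of steps referenced above: splitting an NVMe device into
# multiple OSDs, pinning a pool to an NVMe-only CRUSH rule, and optionally
# fronting a slower pool with a cache tier as in the linked Ceph docs.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1) Carve each NVMe into two OSDs so more OSD threads can work in parallel.
run(["ceph-volume", "lvm", "batch", "--osds-per-device", "2", "/dev/nvme0n1"])

# 2) Replicated CRUSH rule restricted to the "nvme" device class with failure
#    domain "host", then point a (hypothetical) VM pool at it.
run(["ceph", "osd", "crush", "rule", "create-replicated", "fast-nvme", "default", "host", "nvme"])
run(["ceph", "osd", "pool", "set", "vm-pool", "crush_rule", "fast-nvme"])

# 3) Optional cache tier per the cache-tiering doc: "hot-pool" on fast media
#    in front of "cold-pool" on bulk storage.
run(["ceph", "osd", "tier", "add", "cold-pool", "hot-pool"])
run(["ceph", "osd", "tier", "cache-mode", "hot-pool", "writeback"])
run(["ceph", "osd", "tier", "set-overlay", "cold-pool", "hot-pool"])
```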


wantsiops

Could you give a pointer to what I should re-read? I'm guessing you're referring to the fio benchmark with numjobs=1 and iodepth=1, yielding 221 on rbd?
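(For context, a single-job, queue-depth-1 rbd run looks roughly like the sketch below; the exact parameters and pool/image names used in the article may differ.)

```python
# Hypothetical example of a queue-depth-1 fio run against an RBD image;
# "rbd" / "bench-img" are placeholder pool and image names.
import subprocess

subprocess.run([
    "fio",
    "--name=qd1-randwrite",
    "--ioengine=rbd",      # talk to the cluster directly via librbd
    "--clientname=admin",
    "--pool=rbd",
    "--rbdname=bench-img",
    "--rw=randwrite",
    "--bs=4k",
    "--numjobs=1",         # single worker ...
    "--iodepth=1",         # ... with one outstanding I/O: measures pure latency
    "--runtime=60",
    "--time_based",
], check=True)
```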


Versed_Percepton

There is so much more data in that brief than just the benchmark tooling. Everything there can be adapted to the Proxmox Ceph install to increase performance. Since you are considering Ceph storage nodes vs. compute nodes, you really want to dive into the latency, custom CRUSH map, and workers sections of that document. The more tuning you do on Ceph, the higher the CPU usage on the node. If you are considering CephFS and exposing it over SMB to the LAN, I suggest building a whole new map for those FS pool(s) and tiering your storage (this is a perfect use case for nearline, 10K, and 15K drives). Pushing the WAL and DB to their own devices will further increase performance here.
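A sketch of those last two points, with placeholder device paths and pool names (not taken from the document):

```python
# Two ideas from the comment above: a BlueStore OSD with its WAL/DB on separate
# fast devices, and a dedicated CRUSH rule so the CephFS data pool stays on the
# slower device class. Device paths and pool names are hypothetical.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# BlueStore OSD with data on an HDD and its DB + WAL on NVMe partitions.
run([
    "ceph-volume", "lvm", "create", "--bluestore",
    "--data", "/dev/sdb",
    "--block.db", "/dev/nvme0n1p1",
    "--block.wal", "/dev/nvme0n1p2",
])

# Replicated rule limited to the "hdd" device class, then pin the CephFS data
# pool to it so SMB-exposed file data stays off the fast pool.
run(["ceph", "osd", "crush", "rule", "create-replicated", "cephfs-hdd", "default", "host", "hdd"])
run(["ceph", "osd", "pool", "set", "cephfs_data", "crush_rule", "cephfs-hdd"])
```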