@ wrote... (4 years, 7 months ago)

We recently upgraded our network to 10 Gbit and were really hoping to see monumental speed increases in our ceph cluster.

One of our benchmarks was pgbench, and to say we were sad would be an understatement…

I created a database and ran tests like so…

createdb -U postgres bench
pgbench -U postgres -i -d bench
pgbench -U postgres -d bench -c 70 2> /dev/null

And here's a typical run. Latency varied a lot, but never dropped below 300 ms. (Aside: pgbench's -d flag enables debug output, the database name is taken positionally, hence the 2> /dev/null.)

latency average = 951.736 ms
tps = 73.549842 (including connections establishing)

Yeah, so a four-node ceph cluster with 12 OSDs was getting 73 TPS with nearly a second of latency. Did somebody swap out my drives with a floppy disk?!?

This was so horrifically bad it put an entire 5-year forecast of our tech stack in jeopardy…

Now, I don't claim to be a ceph or postgres expert, but here's what I tried.

# fsync = off
latency average = 50.641 ms
tps = 1382.270643 (including connections establishing)

So that's groovy and all, but it's totally unsafe and not appropriate for production. The good news was that this proved the ceph cluster itself wasn't utterly broken.
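
If you want to reproduce that test, something like this should work (a minimal sketch, assuming superuser access; fsync only needs a config reload, not a restart):

-- in psql, as a superuser; for testing only, never in production
ALTER SYSTEM SET fsync = off;
SELECT pg_reload_conf();
SHOW fsync;    -- confirm the new value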

So then I noticed the next line in postgresql.conf:

# synchronous_commit = on # synchronization level;

So I tried turning that off and…

latency average = 48.036 ms
tps = 1457.230219 (including connections establishing)

wut! wut!

Basically equivalent speeds to fsync = off.

Pretty strange when there's no replication…
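
As an aside, synchronous_commit can be changed per session, so you can A/B test this without touching postgresql.conf at all. A minimal sketch, from any psql session:

-- no superuser needed; affects only the current session
SET synchronous_commit = off;
SHOW synchronous_commit;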

I also tried with synchronous_commit = local and got

latency average = 225.714 ms
tps = 310.127180 (including connections establishing)

Which is great compared to the original results but still dismal.

For our use case, maybe losing a couple of transactions is worth a 10x speed improvement.

The really strange bit is that we aren't running replication on our test machine, so per my understanding synchronous_commit shouldn't have affected anything…

But as I said, I'm not a postgresql expert. All of this raises the question: wtf is ceph doing that makes fsyncs so slow? Too bad I'm not a ceph expert either…

Here's a good page explaining synchronous_commit: https://www.tutorialdba.com/2018/04/how-to-improve-performance-of.html

tl;dr - change synchronous_commit = off
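
To make that stick (a minimal sketch, assuming superuser access), either set it in postgresql.conf:

synchronous_commit = off

or do it from psql and reload the config:

ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();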

Category: tech, Tags: ceph, postgresql
Comments
1.
Feng Pan @ March 9, 2020 wrote... (3 years, 9 months ago)

Hi Kurt, nice article! We are evaluating ceph as our product's underlying file system. Does it work well with postgres? What's the minimum hardware requirement (10Gbit/s network)? What do you use to deploy ceph and postgres?

Thank you in advance for your help!

2.
Kurt @ March 9, 2020 wrote... (3 years, 9 months ago)

We ended up putting our main postgres databases on dedicated nvme drives in a zfs pool.

I really want to like Ceph but it is very complicated so make sure you actually need it before trying to deploy it.

As for requirements: a dedicated 10 gbit network for ceph replication and another 10 gbit NIC for the public network.

The more drives and the more nodes the better. If I were to do it all again I'd insist on SSDs for the pool; spinners are too slow when there are 50+ virtual machines all trying to write their logs and whatnot.

For deploying, I use Ansible for everything.

3.
Fan Guogang @ May 25, 2020 wrote... (3 years, 6 months ago)

My test: postgres:12.3 in a kubernetes cluster with ceph:

postgres@postgresql-68788cf76c-j6q7r:~$ pgbench -c 10 -j 2 -t 10000 postgres
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 12.819 ms
tps = 780.093743 (including connections establishing)
tps = 780.123685 (excluding connections establishing)

while postgres:9.6 on the host filesystem:

-bash-4.2$ pgbench -c 10 -j 2 -t 10000 postgres
bash: pgbench: command not found...
-bash-4.2$ ./pgbench -c 10 -j 2 -t 10000 postgres
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 2.476 ms
tps = 4039.558263 (including connections establishing)
tps = 4039.925093 (excluding connections establishing)
4.
Kurt @ May 25, 2020 wrote... (3 years, 6 months ago)

Interesting. I've rerun the benchmarks and am now seeing significantly better results than a year ago. Many changes have happened since then, but I can't recall exactly what…

Anyhow, here's what I got:

postgres 11 on ceph

pgbench -c 10 -j 2 -t 10000 bench
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 2.803 ms
tps = 3567.563409 (including connections establishing)
tps = 3567.966234 (excluding connections establishing)

postgres 11 on zfs

pgbench -c 10 -j 2 -t 10000 bench
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 2.193 ms
tps = 4559.085564 (including connections establishing)
tps = 4559.731460 (excluding connections establishing)
5.
Christophe @ November 12, 2020 wrote... (3 years ago)

'synchronous_commit' is not (only) about replication, but about synchronous writes of the transaction log (WAL); i.e. when you COMMIT, PostgreSQL asks for an immediate write to disk of the log file in 'pg_wal/', to enable recovery in case of a crash.

'fsync' is system-wide (all DB files) and must be left 'on', or you may end up with corrupted data. 'synchronous_commit' may be set to 'off', sometimes only in a session or a transaction, to avoid most syncs of the log files, reducing latency and increasing speed without risk of corruption, IF you're willing to lose the last few transactions (~600 ms worth) in case of a crash.
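
For example, to relax it for a single transaction only (a minimal sketch; the 'events' table is a hypothetical example):

CREATE TABLE IF NOT EXISTS events (payload text);  -- hypothetical example table

BEGIN;
SET LOCAL synchronous_commit = off;  -- this transaction only; reverts at COMMIT/ROLLBACK
INSERT INTO events (payload) VALUES ('low-value write');
COMMIT;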

6.
dario_1tech @ February 3, 2021 wrote... (2 years, 10 months ago)

“I can't recall exactly what…”

That should be in this post, and it would be 100000x awesome
