Google Bare Metal in Numbers

By: Gleb Otochkin
Topic: Google Cloud | Posted: Jun 08, 2021 at 9:51 AM EST

In the previous posts, I shared my first impressions of the Google Bare Metal Solution (BMS) and how to start using it. In this post, I will show some performance numbers for the solution so that you can compare them with your existing environment.

Let me start with the box characteristics. For my tests, I was using an “o2-standard-32-metal” box located in the us-west2 region (Los Angeles). The solution was configured with a 2 Gbps interconnect and had a couple of storage resources attached to it. The first consisted of two 512 GB disks on HDD-based storage, where I placed my binaries and a recovery ASM disk group; the second was a 2 TB “all-flash” volume I used for data. Here is a summary table:

Characteristic	Value
BMS box type	o2-standard-32-metal
CPU	Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
CPU sockets	2
CPU cores	16 (8 per socket)
Memory	384 GB
Disk 1	512 GB – Standard disk
Disk 2	512 GB – Standard disk
Disk 3	2048 GB – All-flash
Network	4 NICs, speed: 25000 Mb/s
OS	Oracle Linux 7.9

BMS box characteristics.

Before starting the tests, I updated Oracle Linux and installed the packages required for my Oracle database, along with tools for testing IO and network performance such as fio and iperf3. Here is a summary table of the software and tools used to test performance.

Package	Testing scope
fio	IO performance
stress-ng	CPU, memory
swingbench	Oracle database performance
SLOB	Oracle database IO
iperf3	Network
oratcptest	Network

The first tests were done to verify raw IO performance for the attached storage. I didn’t test the root volume and instead paid more attention to the 2 TB and 512 GB volumes used in ASM. I found that the 512 GB volume (HDD based) showed different performance depending on how many times the test had already been run. The first run would be slow, the following runs would get progressively faster, and in the end the volume could deliver twice the 4 KB random read performance of the 2 TB volume. It looked like a direct effect of a cache layer on the storage side.

For the first iteration of the tests, I used the same fio parameters that Oracle uses in its blogs and documentation, which may help when comparing the results.

Here are the results for the 2 TB (“all-flash”) disk. It showed about 14.5k IOPS and about 56 MiB/s of read throughput. The numbers didn’t change much when I varied the iodepth parameter.

[customeradmin@at-2881200-svr003 ~]$ sudo fio --filename=/dev/asmdiskdata01 --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.7
Starting 4 processes
Jobs: 4 (f=4): [r(4)][2.5%][r=56.3MiB/s,w=0KiB/s][r=14.4k,w=0 IOPS][eta 01m:57s]
Jobs: 4 (f=4): [r(4)][4.2%][r=56.3MiB/s,w=0KiB/s][r=14.4k,w=0 IOPS][eta 01m:55s] 
...
redacted
...
iops-test-job: (groupid=0, jobs=4): err= 0: pid=9531: Fri Apr  2 23:07:00 2021
   read: IOPS=14.5k, BW=56.5MiB/s (59.2MB/s)(6784MiB/120071msec)
    slat (nsec): min=843, max=951532, avg=6506.15, stdev=15775.56
    clat (usec): min=412, max=768900, avg=70784.44, stdev=95724.31
     lat (usec): min=423, max=768903, avg=70791.09, stdev=95724.28
    clat percentiles (usec):
     |  1.00th=[  1090],  5.00th=[  1450], 10.00th=[  9896], 20.00th=[ 11469],
     | 30.00th=[ 12649], 40.00th=[ 22414], 50.00th=[ 24773], 60.00th=[ 36439],
     | 70.00th=[ 58983], 80.00th=[131597], 90.00th=[229639], 95.00th=[291505],
     | 99.00th=[387974], 99.50th=[425722], 99.90th=[557843], 99.95th=[583009],
     | 99.99th=[633340]
   bw (  KiB/s): min= 5840, max=35056, per=25.00%, avg=14462.12, stdev=3297.70, samples=960
   iops        : min= 1460, max= 8764, avg=3615.51, stdev=824.43, samples=960
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.31%
  lat (msec)   : 2=7.09%, 4=2.19%, 10=0.75%, 20=27.74%, 50=30.71%
  lat (msec)   : 100=8.67%, 250=14.29%, 500=8.06%, 750=0.19%, 1000=0.01%
  cpu          : usr=0.51%, sys=1.95%, ctx=645903, majf=0, minf=35085
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1736660,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256
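
One way to read the high average latency in this run is through Little's Law: with a fixed number of requests in flight, the average latency is roughly the number of outstanding requests divided by the achieved IOPS. The following is a minimal sketch using the values from the fio output above (4 jobs with iodepth=256, 1,736,660 reads in 120.071 s). The function name is only illustrative and is not part of fio.

```python
# Little's Law sanity check for the fio run above:
# average latency ~= outstanding requests / throughput (IOPS).

def expected_latency_ms(numjobs: int, iodepth: int, iops: float) -> float:
    """Average per-request latency implied by a fixed number of requests in flight."""
    outstanding = numjobs * iodepth
    return outstanding / iops * 1000.0

# Values copied from the fio summary: total reads issued / runtime in seconds.
iops = 1_736_660 / 120.071
latency_ms = expected_latency_ms(numjobs=4, iodepth=256, iops=iops)

print(f"IOPS: {iops:,.0f}")
print(f"Implied average latency: {latency_ms:.1f} ms")
```

The implied value lands very close to the average "lat" that fio reported (about 70.8 ms). Since the volume appears to top out near 14.5k IOPS, a deeper queue mostly adds waiting time rather than throughput.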

If we change the iodepth to 1 and look at the latency, we can see that most of the calls (about 88.6%) completed in under 100 usec.

[customeradmin@at-2881200-svr003 ~]$ sudo fio --filename=/dev/asmdiskdata01  --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --numjobs=1 --time_based --group_reporting --name=readlatency-test-job --runtime=120 --eta-newline=1 --readonly
readlatency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [r(1)][2.5%][r=43.8MiB/s,w=0KiB/s][r=11.2k,w=0 IOPS][eta 01m:57s]
Jobs: 1 (f=1): [r(1)][4.2%][r=43.6MiB/s,w=0KiB/s][r=11.2k,w=0 IOPS][eta 01m:55s]
Jobs: 1 (f=1): [r(1)][5.8%][r=44.0MiB/s,w=0KiB/s][r=11.5k,w=0 IOPS][eta 01m:53s]
...
redacted
...
Jobs: 1 (f=1): [r(1)][100.0%][r=36.0MiB/s,w=0KiB/s][r=9460,w=0 IOPS][eta 00m:00s]
readlatency-test-job: (groupid=0, jobs=1): err= 0: pid=16428: Sun May  2 15:40:22 2021
   read: IOPS=10.4k, BW=40.5MiB/s (42.4MB/s)(4858MiB/120001msec)
    slat (nsec): min=1472, max=826689, avg=6066.59, stdev=2247.50
    clat (nsec): min=1832, max=88773k, avg=88658.73, stdev=111986.42
     lat (usec): min=53, max=88780, avg=94.90, stdev=112.04
    clat percentiles (usec):
     |  1.00th=[   62],  5.00th=[   69], 10.00th=[   71], 20.00th=[   73],
     | 30.00th=[   76], 40.00th=[   82], 50.00th=[   85], 60.00th=[   89],
     | 70.00th=[   91], 80.00th=[   95], 90.00th=[  102], 95.00th=[  117],
     | 99.00th=[  229], 99.50th=[  375], 99.90th=[  441], 99.95th=[  465],
     | 99.99th=[  914]
   bw (  KiB/s): min=32024, max=47288, per=100.00%, avg=41479.86, stdev=3179.03, samples=239
   iops        : min= 8006, max=11822, avg=10369.94, stdev=794.75, samples=239
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 50=0.01%, 100=88.63%
  lat (usec)   : 250=10.53%, 500=0.80%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 100=0.01%
  cpu          : usr=2.77%, sys=7.20%, ctx=1243683, majf=0, minf=49982
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1243592,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=40.5MiB/s (42.4MB/s), 40.5MiB/s-40.5MiB/s (42.4MB/s-42.4MB/s), io=4858MiB (5094MB), run=120001-120001msec
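
The "lat" lines in the fio output are histogram buckets: each value is the share of IOs that completed under that bound but not under the previous one. To see what fraction completed under a given latency, the buckets have to be accumulated. Here is a small sketch with the values copied from the run above; the helper name is my own. fio seems to print very small non-empty buckets as 0.01%, so the running total can end slightly above 100%.

```python
# Turn fio's per-bucket latency histogram into cumulative percentages.
# Bucket values are copied from the "lat (usec)" lines of the iodepth=1 run.

buckets_usec = [
    (2, 0.01), (4, 0.01), (10, 0.01), (50, 0.01), (100, 88.63),
    (250, 10.53), (500, 0.80), (750, 0.01), (1000, 0.01),
]

def cumulative(buckets):
    """Return (upper_bound, cumulative_percent) pairs in ascending order."""
    total = 0.0
    result = []
    for upper, pct in sorted(buckets):
        total += pct
        result.append((upper, round(total, 2)))
    return result

for upper, pct in cumulative(buckets_usec):
    print(f"< {upper:>4} usec: {pct:6.2f}%")
```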

If we increase the block size to 8 KB, we get exactly the same number of IOPS and, since the block size doubled, the throughput doubles too. I tried bigger block sizes, but 113 MiB/s seemed to be the limit: with 16k blocks, I got only 7233 IOPS.

[customeradmin@at-2881200-svr003 ~]$ sudo fio --filename=/dev/asmdiskdata01 --direct=1 --rw=randread --bs=8k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
throughput-test-job: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=256
...
fio-3.7
Starting 4 processes
Jobs: 4 (f=4): [r(4)][2.5%][r=113MiB/s,w=0KiB/s][r=14.5k,w=0 IOPS][eta 01m:57s]
Jobs: 4 (f=4): [r(4)][4.2%][r=112MiB/s,w=0KiB/s][r=14.4k,w=0 IOPS][eta 01m:55s]
...
redacted
...
Jobs: 4 (f=4): [r(4)][100.0%][r=113MiB/s,w=0KiB/s][r=14.4k,w=0 IOPS][eta 00m:00s]
throughput-test-job: (groupid=0, jobs=4): err= 0: pid=26804: Thu May  6 11:56:29 2021
   read: IOPS=14.5k, BW=113MiB/s (119MB/s)(13.3GiB/120070msec)
    slat (nsec): min=883, max=12194k, avg=7801.26, stdev=105635.10
    clat (usec): min=389, max=543527, avg=70768.61, stdev=82727.74
     lat (usec): min=500, max=543534, avg=70776.56, stdev=82728.57
    clat percentiles (usec):
     |  1.00th=[  1139],  5.00th=[  1483], 10.00th=[  9503], 20.00th=[ 11731],
     | 30.00th=[ 13173], 40.00th=[ 23725], 50.00th=[ 35390], 60.00th=[ 49021],
     | 70.00th=[ 84411], 80.00th=[133694], 90.00th=[198181], 95.00th=[248513],
     | 99.00th=[341836], 99.50th=[371196], 99.90th=[429917], 99.95th=[455082],
     | 99.99th=[501220]
   bw (  KiB/s): min=10352, max=65488, per=25.00%, avg=28929.36, stdev=6911.83, samples=960
   iops        : min= 1294, max= 8186, avg=3616.14, stdev=863.98, samples=960
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.17%
  lat (msec)   : 2=6.82%, 4=2.85%, 10=0.83%, 20=24.49%, 50=25.75%
  lat (msec)   : 100=13.49%, 250=20.65%, 500=4.95%, 750=0.01%
  cpu          : usr=0.45%, sys=2.08%, ctx=840032, majf=0, minf=37878
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1736983,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256
 
Run status group 0 (all jobs):
   READ: bw=113MiB/s (119MB/s), 113MiB/s-113MiB/s (119MB/s-119MB/s), io=13.3GiB (14.2GB), run=120070-120070msec
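
The relationship between IOPS, block size, and throughput can be checked with simple arithmetic. The sketch below uses the totals from the 4k and 8k fio runs and the 7233 IOPS figure quoted for 16k blocks; the function and variable names are only for illustration.

```python
# Throughput implied by IOPS and block size: MiB/s = IOPS * block size (KiB) / 1024.

def throughput_mib_s(iops: float, block_size_kib: int) -> float:
    """MiB/s delivered at a given IOPS rate and block size in KiB."""
    return iops * block_size_kib / 1024.0

runs = {
    "4k": (1_736_660 / 120.071, 4),   # total reads / runtime from the 4k run
    "8k": (1_736_983 / 120.070, 8),   # total reads / runtime from the 8k run
    "16k": (7_233, 16),               # IOPS figure quoted in the text
}

results = {name: throughput_mib_s(iops, bs) for name, (iops, bs) in runs.items()}

for name, mib_s in results.items():
    print(f"{name:>3}: {mib_s:6.1f} MiB/s")
```

The 8k and 16k cases both come out at about 113 MiB/s, which is consistent with a throughput cap: once the cap is reached, doubling the block size halves the IOPS.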

I was told that IO performance depends on the size of the storage and that we should get 6k IOPS per 1 TB. In my tests, the all-flash storage delivered a bit more than 7k IOPS per TB, which exceeds the expected 6k IOPS per TB. The performance was stable and predictable throughout.
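
As a quick worked check of the per-terabyte figure, here is the arithmetic using the rounded 14.5k IOPS measured on the 2 TB volume and the 6k IOPS per TB I was quoted. The variable names are only illustrative.

```python
# IOPS per TB for the all-flash volume versus the quoted expectation.

measured_iops = 14_500        # rounded result of the 4k random read test
volume_size_tb = 2            # size of the all-flash volume
expected_iops_per_tb = 6_000  # figure I was quoted for all-flash storage

iops_per_tb = measured_iops / volume_size_tb
headroom_pct = (iops_per_tb / expected_iops_per_tb - 1) * 100

print(f"Measured: {iops_per_tb:,.0f} IOPS per TB")
print(f"Above expectation by: {headroom_pct:.1f}%")
```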

The results for all-flash are not bad at all, but I got even better results for my 512 GB disks. The devices were the “standard disk” type but returned much better results with the same fio parameters: 50k IOPS and 197 MiB/s with iodepth=256. At the same time, we need to keep in mind that this was most likely driven by the storage cache and could fluctuate depending on the load on the storage layer.

[customeradmin@at-2881200-svr003 ~]$ sudo fio --filename=/dev/asmdiskreco01 --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
throughput-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.7
Starting 4 processes
Jobs: 4 (f=4): [r(4)][2.5%][r=363MiB/s,w=0KiB/s][r=92.0k,w=0 IOPS][eta 01m:57s]
Jobs: 4 (f=4): [r(4)][4.2%][r=353MiB/s,w=0KiB/s][r=90.3k,w=0 IOPS][eta 01m:55s]
...
redacted
...
Jobs: 4 (f=4): [r(4)][100.0%][r=69.4MiB/s,w=0KiB/s][r=17.8k,w=0 IOPS][eta 00m:00s]
throughput-test-job: (groupid=0, jobs=4): err= 0: pid=21453: Sun May  2 16:09:52 2021
   read: IOPS=50.4k, BW=197MiB/s (206MB/s)(23.1GiB/120146msec)
    slat (nsec): min=823, max=598316, avg=4054.59, stdev=5727.26
    clat (usec): min=55, max=820318, avg=20323.18, stdev=42225.70
     lat (usec): min=60, max=820320, avg=20327.31, stdev=42225.91
    clat percentiles (usec):
     |  1.00th=[   469],  5.00th=[   938], 10.00th=[  1467], 20.00th=[  3097],
     | 30.00th=[  5014], 40.00th=[  7308], 50.00th=[  9896], 60.00th=[ 13173],
     | 70.00th=[ 17433], 80.00th=[ 23725], 90.00th=[ 34866], 95.00th=[ 60031],
     | 99.00th=[256902], 99.50th=[312476], 99.90th=[425722], 99.95th=[463471],
     | 99.99th=[549454]
   bw (  KiB/s): min= 2048, max=124176, per=25.02%, avg=50414.06, stdev=35598.91, samples=960
   iops        : min=  512, max=31044, avg=12603.50, stdev=8899.73, samples=960
  lat (usec)   : 100=0.09%, 250=0.46%, 500=0.50%, 750=1.71%, 1000=3.47%
  lat (msec)   : 2=7.38%, 4=11.20%, 10=25.46%, 20=24.21%, 50=19.48%
  lat (msec)   : 100=2.88%, 250=2.12%, 500=1.03%, 750=0.03%, 1000=0.01%
  cpu          : usr=1.18%, sys=4.78%, ctx=3902766, majf=0, minf=32824
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=6051085,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256
 
Run status group 0 (all jobs):
   READ: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=23.1GiB (24.8GB), run=120146-120146msec

The latency for that disk is a bit higher: about 64% of the reads completed in under 100 usec, and nearly all of the remaining 36% completed in under 250 usec. Note that this run used four jobs rather than one.

[customeradmin@at-2881200-svr003 ~]$ sudo fio --filename=/dev/asmdiskreco01 --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 --readonly
throughput-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.7
Starting 4 processes
Jobs: 4 (f=4): [r(4)][2.5%][r=128MiB/s,w=0KiB/s][r=32.8k,w=0 IOPS][eta 01m:57s]
...
redacted
...
Jobs: 4 (f=4): [r(4)][100.0%][r=148MiB/s,w=0KiB/s][r=37.8k,w=0 IOPS][eta 00m:00s]
throughput-test-job: (groupid=0, jobs=4): err= 0: pid=20940: Mon May  3 12:36:55 2021
   read: IOPS=37.6k, BW=147MiB/s (154MB/s)(17.2GiB/120001msec)
    slat (nsec): min=1298, max=615102, avg=6398.65, stdev=2543.78
    clat (nsec): min=1754, max=12104k, avg=98303.45, stdev=56720.31
     lat (usec): min=53, max=12108, avg=104.87, stdev=56.72
    clat percentiles (usec):
     |  1.00th=[   74],  5.00th=[   81], 10.00th=[   85], 20.00th=[   89],
     | 30.00th=[   92], 40.00th=[   94], 50.00th=[   96], 60.00th=[   99],
     | 70.00th=[  102], 80.00th=[  106], 90.00th=[  115], 95.00th=[  123],
     | 99.00th=[  141], 99.50th=[  149], 99.90th=[  169], 99.95th=[  186],
     | 99.99th=[  334]
   bw (  KiB/s): min=31632, max=39992, per=25.00%, avg=37614.30, stdev=1579.31, samples=956
   iops        : min= 7908, max= 9998, avg=9403.55, stdev=394.82, samples=956
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=64.25%, 250=35.73%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=2.27%, sys=6.52%, ctx=4515860, majf=0, minf=27386
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4514208,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
 
Run status group 0 (all jobs):
   READ: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=17.2GiB (18.5GB), run=120001-120001msec

The standard disks may show better results in the tests, but the speed is not guaranteed there and the performance is much more volatile. Still, considering the price tag, the standard disks can be a good choice for development and test databases, recovery disk groups, binaries, and local backup storage.
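
One rough way to put a number on that volatility is the coefficient of variation (standard deviation divided by mean) of the per-job IOPS samples that fio reports. The sketch below uses the "iops" lines from the two iodepth=256 runs above. Treat it as an indicative comparison only, since each figure comes from a single run.

```python
# Coefficient of variation of per-job IOPS samples, from the fio "iops" lines
# of the iodepth=256 runs (avg and stdev copied from the output above).

def coefficient_of_variation(avg: float, stdev: float) -> float:
    """Relative spread of the samples; higher means less stable."""
    return stdev / avg

cv_all_flash = coefficient_of_variation(avg=3615.51, stdev=824.43)
cv_standard = coefficient_of_variation(avg=12603.50, stdev=8899.73)

print(f"all-flash: {cv_all_flash:.2f}")
print(f"standard : {cv_standard:.2f}")
```

By this measure, the standard disk was roughly three times more variable than the all-flash volume in these runs, even though its average was higher.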

With a bare metal box and a known CPU model, it didn’t make much sense to test CPU performance. As I mentioned above, I had a box with 16 cores of Intel(R) Xeon(R) Gold 6234 CPU and it was not virtualized, so I was getting the full performance of the CPU layer.

[customeradmin@at-2881200-svr003 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
Stepping:              7
CPU MHz:               1260.743
CPU max MHz:           4000.0000
CPU min MHz:           1200.0000
BogoMIPS:              6600.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15

Nevertheless, I ran a couple of tests using the “stress-ng” tool to verify the stability of the system and how it handles a stress load. For the CPU, it showed 4494 bogo ops/s (bogus operations per second). The value might not be fully comparable with other environments and CPU models, but it provides a rough estimate of the CPU performance.

[customeradmin@at-2881200-svr003 ~]$ stress-ng --cpu 2000 --timeout 15m --verbose --metrics-brief
stress-ng: debug: [28502] 16 processors online, 16 processors configured
stress-ng: info:  [28502] dispatching hogs: 2000 cpu
stress-ng: debug: [28502] cache allocate: default cache size: 25344K
stress-ng: debug: [28502] starting stressors
stress-ng: debug: [28503] stress-ng-cpu: started [28503] (instance 0)
stress-ng: debug: [28504] stress-ng-cpu: started [28504] (instance 1)
stress-ng: debug: [28505] stress-ng-cpu: started [28505] (instance 2)
stress-ng: debug: [28506] stress-ng-cpu: started [28506] (instance 3)
stress-ng: debug: [28507] stress-ng-cpu: started [28507] (instance 4)
stress-ng: debug: [28508] stress-ng-cpu: started [28508] (instance 5)
... 
redacted
...
stress-ng: debug: [30562] stress-ng-cpu: started [30562] (instance 1908)
stress-ng: debug: [30578] stress-ng-cpu: started [30578] (instance 1924)
stress-ng: debug: [30626] stress-ng-cpu: started [30626] (instance 1972)
stress-ng: debug: [30640] stress-ng-cpu: started [30640] (instance 1986)
stress-ng: debug: [30594] stress-ng-cpu: started [30594] (instance 1940)
stress-ng: debug: [30610] stress-ng-cpu: started [30610] (instance 1956)
...
redacted
...
stress-ng: debug: [28502] process [30653] terminated
stress-ng: info:  [28502] successful run completed in 912.25s (15 mins, 12.25 secs)
stress-ng: info:  [28502] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [28502]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [28502] cpu             4050114    901.13  14491.63      4.09      4494.50       279.40
[customeradmin@at-2881200-svr003 ~]$
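
The two "bogo ops/s" columns are simply the bogo ops count divided by two different time bases: wall-clock time, and the CPU time (usr + sys) summed over all cores. The sketch below reproduces them from the summary line above. The small difference from the printed real-time value comes from the rounded timings in the output.

```python
# Recompute the stress-ng throughput columns from the summary line above.

bogo_ops = 4_050_114
real_time_s = 901.13
usr_time_s = 14_491.63
sys_time_s = 4.09

# Rate against wall-clock time: how much work the whole box did per second.
ops_per_s_real = bogo_ops / real_time_s

# Rate against consumed CPU time: roughly the work done per busy core-second.
ops_per_s_cpu = bogo_ops / (usr_time_s + sys_time_s)

print(f"bogo ops/s (real time):    {ops_per_s_real:.2f}")
print(f"bogo ops/s (usr+sys time): {ops_per_s_cpu:.2f}")
```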

The stress-ng test for virtual memory produced values in the expected range too. I didn’t expect anything out of the ordinary. This is a bare metal box with standard components, built by a reputable vendor, and the test just confirms the expectations.

[customeradmin@at-2881200-svr003 ~]$ stress-ng --vm 8 --vm-bytes 6G --timeout 15m --metrics-brief
stress-ng: info:  [7064] dispatching hogs: 8 vm
stress-ng: info:  [7064] successful run completed in 900.67s (15 mins, 0.67 secs)
stress-ng: info:  [7064] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [7064]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [7064] vm            205119392    900.64   6781.40    393.31    227749.05     28589.22
[customeradmin@at-2881200-svr003 ~]$

After the basic IO and CPU tests, I ran multiple tests with Dominic Giles’s swingbench tool. I am not planning to publish all the results, but here are a few examples.

The first one is deploying the SOE standard test schema using scale 5.

[oracle@at-2881200-svr003 bin]$ time ./oewizard -cs //localhost:1521/pdbora -ts DATA -dba "sys as sysdba" -dbap Welcome1 -u soe -p Welcome1 -async_off -scale 5 -hashpart -create -cl -v
SwingBench Wizard
Author  :        Dominic Giles
Version :        2.6.0.1137
 
Running in Lights Out Mode using config file : ../wizardconfigs/oewizard.xml
Connecting to : jdbc:oracle:thin:@//localhost:1521/pdbora                  
Connected                                                                  
Starting run                                                               
Starting script ../sql/soedgdrop2.sql                                      
Script completed in 0 hour(s) 0 minute(s) 23 second(s) 726 millisecond(s)  
...
redacted
...
============================================
|           Datagenerator Run Stats        |
============================================
Connection Time                        0:00:00.002
Data Generation Time                   0:02:58.752
DDL Creation Time                      0:03:23.974
Total Run Time                         0:06:22.730
Rows Inserted per sec                      423,557
Data Generated (MB) per sec                   33.9
Actual Rows Generated                   75,779,407
Commits Completed                            3,932
Batch Updates Completed                    379,032
 
Connecting to : jdbc:oracle:thin:@//localhost:1521/pdbora                  
Connected                                                                  
 
Post Creation Validation Report
===============================
The schema appears to have been created successfully.

With a scale factor of 10, the total run time increased from about 6.5 to about 11 minutes (0:06:22 to 0:10:47). Overall, under 11 minutes at scale 10 can be considered a good result.

[oracle@at-2881200-svr003 bin]$ time ./oewizard -cs //localhost:1521/pdbora -ts DATA -dba "sys as sysdba" -dbap Welcome1 -u soe -p Welcome1 -async_off -scale 10 -hashpart -create -cl -v
SwingBench Wizard
Author  :        Dominic Giles
Version :        2.6.0.1137
 
Running in Lights Out Mode using config file : ../wizardconfigs/oewizard.xml
...
redacted
...
============================================
|           Datagenerator Run Stats        |
============================================
Connection Time                        0:00:00.003
Data Generation Time                   0:05:54.187
DDL Creation Time                      0:04:53.505
Total Run Time                         0:10:47.696
Rows Inserted per sec                      424,974
Data Generated (MB) per sec                   34.1
Actual Rows Generated                  150,707,124
Commits Completed                            7,660
Batch Updates Completed                    753,665
 
Connecting to : jdbc:oracle:thin:@//localhost:1521/pdbora                  
Connected                                                                  
 
Post Creation Validation Report
===============================
The schema appears to have been created successfully.
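
To see how the build time scales between the two runs, the timings from the "Datagenerator Run Stats" blocks can be compared directly. This is a small sketch; the `parse_hms` helper is my own and is not part of swingbench.

```python
# Compare the scale 5 and scale 10 schema builds using the reported timings.

def parse_hms(text: str) -> float:
    """Convert an 'H:MM:SS.mmm' string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

scale5 = {"data": "0:02:58.752", "ddl": "0:03:23.974", "total": "0:06:22.730"}
scale10 = {"data": "0:05:54.187", "ddl": "0:04:53.505", "total": "0:10:47.696"}

# Ratio of scale 10 time to scale 5 time for each phase.
ratios = {phase: parse_hms(scale10[phase]) / parse_hms(scale5[phase]) for phase in scale5}

for phase, ratio in ratios.items():
    print(f"{phase:>5}: x{ratio:.2f}")
```

Data generation time roughly doubles with the data volume, while the DDL phase grows more slowly, so the total comes out at about 1.7 times rather than 2 times.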

The next stage was to test the performance using the charbench utility from swingbench. It was run against the scale 10 deployment using the following parameters.

./charbench -c ../configs/SOE_Server_Side_V2.xml  -cs //bmshost:1521/pdbora  -u soe -p ******** -v users,tpm,tps,vresp -intermin 0 -intermax 0 -min 0 -max 0 -uc 128 -di SQ,WQ,WA -rt 0:10
... 
redacted
...
16:34:08 [128/128]   458302   11679   7     5     4     19    10    6     0     0     0
16:34:09 [128/128]   461219   11654   26    3     5     21    6     6     0     0     0
16:34:10 [128/128]   461968   11687   7     11    7     19    9     6     0     0     0
16:34:11 [128/128]   462255   11564   14    12    5     16    9     1     0     0     0
16:34:12 [128/128]   459930   9168    4     5     1     7     4     0     0     0     0
Saved results to results00009.xml
16:34:13 [0/128]     449871   148     4     5     1     7     4     0     0     0     0

If we parse the results we can get more granular information:

$ python3 ../utils/parse_results.py -r results00009.xml
+-------------------------------------------+---------------------------+
| Attribute                                 |      results00009.xml     |
+-------------------------------------------+---------------------------+
| Benchmark Name                            |  "Order Entry (PLSQL) V2" |
| Connect String                            | //10.168.16.2:1521/pdbora |
| Time of run                               |   Jun 2, 2021 4:24:13 PM  |
| Minimum Inter TX Think Time               |             0             |
| Maximum Inter TX Think Time               |             0             |
| Maximum Intra TX Think Time               |             0             |
| Maximum Intra TX Think Time               |             0             |
| No of Users                               |            128            |
| Total Run Time                            |          0:10:00          |
| Average Tx/Sec                            |          7796.54          |
| Maximum Tx/Min                            |           544518          |
| Total Completed Transactions              |          4677922          |
|                                           |                           |
| Average Transaction Response Time         |                           |
| Customer Registration                     |           20.19           |
| Browse Products                           |           10.86           |
| Browse Orders                             |           18.19           |
| Update Customer Details                   |            8.81           |
| Order Products                            |           28.24           |
| Process Orders                            |           17.24           |
|                                           |                           |
| 10th Percentile Transaction Response Time |                           |
| Customer Registration                     |           10.00           |
| Browse Products                           |            7.00           |
| Browse Orders                             |            6.00           |
| Update Customer Details                   |            5.00           |
| Order Products                            |           16.00           |
| Process Orders                            |            9.00           |
|                                           |                           |
| 50th Percentile Transaction Response Time |                           |
| Customer Registration                     |           10.00           |
| Browse Products                           |            7.00           |
| Browse Orders                             |            6.00           |
| Update Customer Details                   |            5.00           |
| Order Products                            |           16.00           |
| Process Orders                            |            9.00           |
|                                           |                           |
| 90th Percentile Transaction Response Time |                           |
| Customer Registration                     |           10.00           |
| Browse Products                           |            7.00           |
| Browse Orders                             |            6.00           |
| Update Customer Details                   |            5.00           |
| Order Products                            |           16.00           |
| Process Orders                            |            9.00           |
+-------------------------------------------+---------------------------+

We were getting about 7.8k transactions per second on average, with reported 90th percentile response times between 5 and 16 ms depending on the transaction type. The system was using about 50% CPU during the peak load and none of the transactions failed.
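
The average rate follows directly from the totals in the report. With zero think time, it also gives a rough idea of how long each of the 128 sessions spent per transaction, including time spent waiting on the client side. This is a minimal sketch using the report values.

```python
# Derive the average transaction rate and per-session cycle time from the report.

total_transactions = 4_677_922
run_time_s = 10 * 60   # Total Run Time 0:10:00
users = 128            # No of Users

avg_tps = total_transactions / run_time_s

# With no think time, every session is always busy, so the average time one
# session spends per transaction is users / throughput.
cycle_time_ms = users / avg_tps * 1000

print(f"Average Tx/Sec: {avg_tps:.2f}")
print(f"Average time per transaction per session: {cycle_time_ms:.1f} ms")
```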

I repeated the same tests from the database server itself and from a VM in the us-west2 region, and the results were the same. Network throughput and latency were sufficient to sustain exactly the same transaction rate. The only requirement was to provide sufficient CPU power on the swingbench host.

I also did several tests using Kevin Closson’s SLOB tool to test database IO and here are some results.

The setup for 64 test schemas was completed in 169 sec.

[oracle@at-2881200-svr003 SLOB]$ ./setup.sh IOPS 64
NOTIFY  : 2021.05.24-14:09:36 : Begin SLOB 2.5.4.0 setup.
NOTIFY  : 2021.05.24-14:09:36 : ADMIN_CONNECT_STRING: "system/******"
NOTIFY  : 2021.05.24-14:09:36 : Load parameters from slob.conf:
 
SCALE: 10000 (10000 blocks)
SCAN_TABLE_SZ: 1M (128 blocks)
LOAD_PARALLEL_DEGREE: 16
...
redacted
...
NOTIFY  : 2021.05.24-14:12:25 : Please examine ./slob_data_load_summary.txt for any possible errors
NOTIFY  : 2021.05.24-14:12:25 :
NOTIFY  : 2021.05.24-14:12:25 : NOTE: No errors detected but if ./slob_data_load_summary.txt shows errors then
NOTIFY  : 2021.05.24-14:12:25 : examine /u02/app/oracle/SLOB/cr_tab_and_load.out
 
NOTIFY  : 2021.05.24-14:12:25 : SLOB setup complete. Total setup time:  (169 seconds)

Here are the parameters for my execution. The database SGA size has been reduced to 4G to enforce more physical IO:

UPDATE_PCT: 30
SCAN_PCT: 0
RUN_TIME: 300
WORK_LOOP: 0
SCALE: 10000 (10000 blocks)
WORK_UNIT: 256
REDO_STRESS: LITE
HOT_SCHEMA_FREQUENCY: 2
HOTSPOT_MB: 8
HOTSPOT_OFFSET_MB: 16
HOTSPOT_FREQUENCY: 3
THINK_TM_FREQUENCY: 0
THINK_TM_MIN: .1
THINK_TM_MAX: .5
DATABASE_STATISTICS_TYPE: awr
SYSDBA_PASSWD: "Welcome1"
DBA_PRIV_USER: "system"
ADMIN_SQLNET_SERVICE: ""
SQLNET_SERVICE_BASE: ""
SQLNET_SERVICE_MAX: ""

And here are some results from the AWR:

fn=awr.txt; tail -n +"$(grep -n -m1 "physical read IO requests" "$fn" | cut -f1 -d":")" "$fn" | head -n 35 | grep physical
physical read IO requests                 1,501,209        4,951.7          20.8
physical read bytes                  13,971,111,936   46,083,424.9     193,572.7
physical read total IO requests           1,513,820        4,993.3          21.0
physical read total bytes            14,172,278,784   46,746,969.6     196,359.9
physical read total multi block                   6            0.0           0.0
physical reads                            1,705,458        5,625.4          23.6
physical reads cache                      1,705,113        5,624.3          23.6
physical reads cache prefetch             1,190,884        3,928.1          16.5
physical reads direct                           345            1.1           0.0
physical reads direct (lob)                       0            0.0           0.0
physical reads direct temporary                   0            0.0           0.0
physical reads prefetch warmup              203,065          669.8           2.8
physical write IO requests                1,467,625        4,840.9          20.3
physical write bytes                 19,986,317,312   65,924,456.0     276,914.7
physical write total IO requests          1,578,377        5,206.2          21.9
physical write total bytes           35,234,857,984  116,221,453.3     488,186.5
physical write total multi block             83,577          275.7           1.2
physical writes                           2,439,736        8,047.4          33.8
physical writes direct                          442            1.5           0.0
physical writes direct (lob)                      4            0.0           0.0
physical writes from cache                2,439,294        8,046.0          33.8
physical writes non checkpoint            1,366,860        4,508.6          18.9

In total we had more than 10k IOPS, with read and write requests split more or less equally, and the majority of the read requests (93%) were single-block reads. The test put the system under stress, increasing the average multiblock read time to 43 ms and the single-block read time to 32 ms. The writing was better since the test did much more reading than writing to the database. SLOB was configured with 30% updates, which is probably a bit higher than you might see on an average OLTP system. In my opinion, the system showed good performance and could use all available IO on the storage layer.
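
The IOPS total and the read/write split can be reproduced from the per-second column of the AWR excerpt above. The sketch below uses the "total IO requests" and "total bytes" rows; the variable names are only illustrative.

```python
# Sum the per-second IO request rates from the AWR excerpt.

read_total_iops = 4_993.3      # physical read total IO requests, per second
write_total_iops = 5_206.2     # physical write total IO requests, per second
read_total_bytes_s = 46_746_969.6
write_total_bytes_s = 116_221_453.3

total_iops = read_total_iops + write_total_iops
read_share = read_total_iops / total_iops

MIB = 1024 * 1024
read_mib_s = read_total_bytes_s / MIB
write_mib_s = write_total_bytes_s / MIB

print(f"Total IOPS: {total_iops:,.1f} ({read_share:.0%} reads)")
print(f"Read: {read_mib_s:.1f} MiB/s, write: {write_mib_s:.1f} MiB/s")
```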

One more test was done to check raw network performance using the iperf3 tool. The tests showed 1.44 Gbits/sec bandwidth for the direct stream and 1.47 Gbits/sec for the reverse stream, which was lower than the expected 2 Gbits/sec. It was not immediately clear where the bottleneck was. I tried different numbers of CPUs on the test boxes, but the results were the same. I tried the parallel option, but it was even slower.

Here are the results for the direct iperf3 test:

iperf3 -c 10.168.16.2
Connecting to host 10.168.16.2, port 5201
[  4] local 10.168.0.5 port 47626 connected to 10.168.16.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   177 MBytes  1.49 Gbits/sec  116    126 KBytes       
[  4]   1.00-2.00   sec   167 MBytes  1.40 Gbits/sec  118    134 KBytes       
[  4]   2.00-3.00   sec   171 MBytes  1.44 Gbits/sec   46    133 KBytes       
[  4]   3.00-4.00   sec   183 MBytes  1.53 Gbits/sec  125    119 KBytes       
[  4]   4.00-5.00   sec   181 MBytes  1.52 Gbits/sec  118    133 KBytes       
[  4]   5.00-6.00   sec   169 MBytes  1.42 Gbits/sec   34    111 KBytes       
[  4]   6.00-7.00   sec   175 MBytes  1.47 Gbits/sec   77    103 KBytes       
[  4]   7.00-8.00   sec   178 MBytes  1.49 Gbits/sec   19    155 KBytes       
[  4]   8.00-9.00   sec   173 MBytes  1.45 Gbits/sec   40    130 KBytes       
[  4]   9.00-10.00  sec   144 MBytes  1.20 Gbits/sec   54   93.5 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.68 GBytes  1.44 Gbits/sec  747             sender
[  4]   0.00-10.00  sec  1.68 GBytes  1.44 Gbits/sec                  receiver
 
iperf Done.
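If you want to compare several runs, the per-interval lines can be summarized with a small script. The sketch below is not part of the original test. It assumes the text layout shown above (MBytes transfer, Gbits/sec bandwidth, a Retr column), and the `summarize` function and regular expression are my own illustration.

```python
import re

# Per-interval lines copied from the iperf3 output above.
IPERF_OUTPUT = """\
[  4]   0.00-1.00   sec   177 MBytes  1.49 Gbits/sec  116    126 KBytes
[  4]   1.00-2.00   sec   167 MBytes  1.40 Gbits/sec  118    134 KBytes
[  4]   2.00-3.00   sec   171 MBytes  1.44 Gbits/sec   46    133 KBytes
[  4]   3.00-4.00   sec   183 MBytes  1.53 Gbits/sec  125    119 KBytes
[  4]   4.00-5.00   sec   181 MBytes  1.52 Gbits/sec  118    133 KBytes
[  4]   5.00-6.00   sec   169 MBytes  1.42 Gbits/sec   34    111 KBytes
[  4]   6.00-7.00   sec   175 MBytes  1.47 Gbits/sec   77    103 KBytes
[  4]   7.00-8.00   sec   178 MBytes  1.49 Gbits/sec   19    155 KBytes
[  4]   8.00-9.00   sec   173 MBytes  1.45 Gbits/sec   40    130 KBytes
[  4]   9.00-10.00  sec   144 MBytes  1.20 Gbits/sec   54   93.5 KBytes
"""

# Captures the bandwidth (Gbits/sec) and the retransmit count of an interval line.
LINE_RE = re.compile(
    r"\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+ MBytes"
    r"\s+([\d.]+) Gbits/sec\s+(\d+)\s+[\d.]+ KBytes"
)

def summarize(text, link_gbits=2.0):
    """Return basic statistics for the interval lines found in text."""
    rows = [(float(m.group(1)), int(m.group(2))) for m in LINE_RE.finditer(text)]
    rates = [rate for rate, _ in rows]
    mean = sum(rates) / len(rates)
    return {
        "intervals": len(rows),
        "mean_gbits": mean,
        "min_gbits": min(rates),
        "max_gbits": max(rates),
        "retransmits": sum(retr for _, retr in rows),
        "share_of_link": mean / link_gbits,  # fraction of the 2Gbps interconnect
    }

summary = summarize(IPERF_OUTPUT)
for key, value in summary.items():
    print(f"{key:14s} {value}")
```

The retransmit total and the mean bandwidth computed this way agree with the sender summary line in the output above.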

Finally, I would like to present one more test, related to network latency for database connections. Oracle has a tool, oratcptest.jar (Doc ID 2064368.1), which is designed to evaluate network bandwidth and latency for Data Guard. I tested it with two different record sizes, 8k and 4k. You can find your average record size by dividing the volume of redo written by the number of redo writes over a certain period of time. In my tests, it was close to 4k.
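The average record size calculation can be sketched as follows. I assume here that the two counters are the "redo size" and "redo writes" statistics sampled at the start and end of an interval (for example from two AWR snapshots). The function name and the sample values are hypothetical and chosen only to illustrate the arithmetic.

```python
def avg_redo_write_bytes(size_start, size_end, writes_start, writes_end):
    """Average bytes per redo write between two samples of cumulative counters."""
    writes = writes_end - writes_start
    if writes <= 0:
        raise ValueError("no redo writes recorded in the interval")
    return (size_end - size_start) / writes

# Hypothetical cumulative counter values, NOT taken from the tests in this post.
start = {"redo size": 1_000_000_000, "redo writes": 200_000}
end = {"redo size": 1_409_600_000, "redo writes": 300_000}

avg = avg_redo_write_bytes(
    start["redo size"], end["redo size"],
    start["redo writes"], end["redo writes"],
)
print(f"average redo write: {avg:.0f} bytes")
```

The result is the value to pass to the -length option of oratcptest so that the test messages resemble your real redo writes.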

Here are the results for the test with a 4k record length (SYNC).

java -jar oratcptest.jar 10.168.0.3 -port=5555 -write -mode=SYNC -length=4096 -duration=60s -interval=1s
[Requesting a test]
	Message payload        = 4 kbytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = YES
	Statistics interval    = 1 second
	Test duration          = 1 minute
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)
 
(12:11:41) The server is ready.
                    Throughput             Latency
(12:11:42)      2.294 Mbytes/s            1.708 ms   (disk-write 1.025 ms)
(12:11:43)      2.499 Mbytes/s            1.568 ms   (disk-write 0.961 ms)
(12:11:44)      2.495 Mbytes/s            1.570 ms   (disk-write 0.946 ms)
(12:11:45)      2.719 Mbytes/s            1.441 ms   (disk-write 0.886 ms)
(12:11:46)      2.613 Mbytes/s            1.499 ms   (disk-write 0.929 ms)
(12:11:47)      2.569 Mbytes/s            1.525 ms   (disk-write 0.939 ms)
(12:11:48)      2.500 Mbytes/s            1.567 ms   (disk-write 0.937 ms)
(12:11:49)      2.618 Mbytes/s            1.496 ms   (disk-write 0.919 ms)
...
redacted
...
(12:12:39)      1.176 Mbytes/s            3.331 ms   (disk-write 2.644 ms)
(12:12:40)      1.176 Mbytes/s            3.331 ms   (disk-write 2.644 ms)
(12:12:41)      1.175 Mbytes/s            3.333 ms   (disk-write 2.661 ms)
(12:12:41) Test finished.
	       Socket send buffer = 23040 bytes
	          Avg. throughput = 1.547 Mbytes/s
	             Avg. latency = 2.532 ms (disk-write 1.888 ms)

The average latency is 2.5ms, of which 1.9ms is the disk write on the other side, leaving less than 1ms for the network itself. The average throughput is 1.5 Mbytes/sec. With the 8k record size, the throughput is higher:

java -jar oratcptest.jar 10.168.0.3 -port=5555 -write -mode=SYNC -length=8192 -duration=60s -interval=1s
[Requesting a test]
	Message payload        = 8 kbytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = SYNC
	Disk write             = YES
	Statistics interval    = 1 second
	Test duration          = 1 minute
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)
 
(21:29:23) The server is ready.
                    Throughput             Latency
(21:29:24)      4.110 Mbytes/s            1.904 ms   (disk-write 1.051 ms)
(21:29:25)      4.539 Mbytes/s            1.724 ms   (disk-write 0.982 ms)
(21:29:26)      4.499 Mbytes/s            1.739 ms   (disk-write 0.991 ms)
(21:29:27)      4.465 Mbytes/s            1.752 ms   (disk-write 1.020 ms)
(21:29:28)      4.489 Mbytes/s            1.743 ms   (disk-write 1.017 ms)
(21:29:29)      4.467 Mbytes/s            1.752 ms   (disk-write 1.048 ms)
(21:29:30)      4.275 Mbytes/s            1.830 ms   (disk-write 1.085 ms)
...
redacted
...
(21:30:21)      2.348 Mbytes/s            3.332 ms   (disk-write 2.620 ms)
(21:30:22)      2.348 Mbytes/s            3.332 ms   (disk-write 2.632 ms)
(21:30:23)      2.317 Mbytes/s            3.377 ms   (disk-write 2.666 ms)
(21:30:23) Test finished.
	       Socket send buffer = 23040 bytes
	          Avg. throughput = 4.280 Mbytes/s
	             Avg. latency = 1.828 ms (disk-write 1.155 ms)

The network latency is roughly the same but, because of the bigger record size, we get better throughput. With ASYNC we get the best throughput, although the average is only marginally higher:

java -jar oratcptest.jar 10.168.0.3 -port=5555 -write -mode=ASYNC -length=8192 -duration=60s -interval=1s
[Requesting a test]
	Message payload        = 8 kbytes
	Payload content type   = RANDOM
	Delay between messages = NO
	Number of connections  = 1
	Socket send buffer     = (system default)
	Transport mode         = ASYNC
	Disk write             = YES
	Statistics interval    = 1 second
	Test duration          = 1 minute
	Test frequency         = NO
	Network Timeout        = NO
	(1 Mbyte = 1024x1024 bytes)
 
(21:37:36) The server is ready.
                    Throughput
(21:37:37)      8.105 Mbytes/s
(21:37:38)      8.407 Mbytes/s
(21:37:39)      8.055 Mbytes/s
...
redacted
...
(21:38:34)      2.349 Mbytes/s
(21:38:35)      2.348 Mbytes/s
(21:38:36)      2.348 Mbytes/s
(21:38:36) Test finished.
	       Socket send buffer = 78336 bytes
	          Avg. throughput = 4.301 Mbytes/s

In my case, it seemed that the limiting factor for throughput was writing to the disk. I used the OS root filesystem as the file destination, and it was based not on fast flash storage but on HDD-backed storage. In any case, I think the results are sufficient to demonstrate that the network layer between GCP and BMS compartments will probably not be a limiting factor for BMS deployments.
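The two SYNC summaries above can be checked with simple arithmetic. The sketch below assumes that in SYNC mode with a single connection only one message is in flight at a time, so throughput is roughly the payload divided by the average latency. That model is my assumption rather than something stated in the tool output. The figures are copied from the "Avg." lines of the 4k and 8k SYNC runs.

```python
MBYTE = 1024 * 1024  # oratcptest reports 1 Mbyte = 1024x1024 bytes

def network_part_ms(avg_latency_ms, disk_write_ms):
    """Latency left for the network after subtracting the remote disk write."""
    return avg_latency_ms - disk_write_ms

def sync_throughput_mbytes(payload_bytes, avg_latency_ms):
    """Estimated throughput when each message waits for the previous one."""
    return payload_bytes / (avg_latency_ms / 1000.0) / MBYTE

# (payload bytes, avg latency ms, avg disk-write ms, reported Mbytes/s)
runs = {
    "4k SYNC": (4096, 2.532, 1.888, 1.547),
    "8k SYNC": (8192, 1.828, 1.155, 4.280),
}

results = {}
for name, (payload, latency, disk, reported) in runs.items():
    net = network_part_ms(latency, disk)
    estimate = sync_throughput_mbytes(payload, latency)
    results[name] = (net, estimate)
    print(f"{name}: network {net:.3f} ms, "
          f"estimated {estimate:.3f} Mbytes/s, reported {reported:.3f} Mbytes/s")
```

The estimates land within about 0.01 Mbytes/s of the reported averages, and the network part stays below 0.7ms in both runs, which supports the reading that the remote disk write, not the network, dominated the latency.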

This is a long post, so if you don’t need all the technical detail, here is a short summary of my findings:

The system showed good IO, memory, and CPU performance, meeting and even exceeding expectations on the IO side. IO was about 15-20% higher than expected according to the specifications and volume sizes. The HDD-based storage also showed really good potential for certain types of workload. For example, you can use it for DEV and FIT environments, or as a recovery and flashback log destination.
I didn’t encounter any stability problems during my tests, even when the CPU was completely exhausted.
The network layer provides a stable and predictable connection with sub-millisecond latency and close to the promised throughput.
I can conclude that the BMS solution is a good platform for a serious production workload and is mature enough to be a good choice if you plan to migrate your Oracle database to the cloud.
