Discussion:
Which recordsize when compression is enabled?
Carlo Pradissitto via illumos-zfs
2014-04-29 19:47:52 UTC
Hi,
I'm doing some performance tests with a database on ZFS.
This database uses a directory for the Write Ahead Log (WAL) files, and
another directory for the datafiles, so I created a dedicated dataset for
each destination.

In both cases (WAL and datafiles) the database writes pages of 64K, using
the write() syscall for the WAL files, and the pwrite() syscall for the
datafiles.
I get the best result setting the recordsize to 64K in both filesystems.

With the compression property turned on, ZFS chooses the recordsize based
on the byte size after compression, so how can I choose the best
recordsize setting?
I verified that, even in this case, the best performance is with
recordsize=64K for the WAL files (write() syscall), but it seems impossible
to find a clear relation between recordsize and performance when the
database uses the pwrite() syscall and ZFS compression is enabled.
Any idea?
Thanks
Carlo



Matthew Ahrens via illumos-zfs
2014-04-29 20:50:45 UTC
Post by Carlo Pradissitto via illumos-zfs
Hi,
I'm doing some performance tests with a database on ZFS.
This database uses a directory for the Write Ahead Log (WAL) files, and
another directory for the datafiles, so I created a dedicated dataset for
each destination.
In both cases (WAL and datafiles) the database writes pages of 64K, using
the write() syscall for the WAL files, and the pwrite() syscall for the
datafiles.
I get the best result setting the recordsize to 64K in both filesystems.
With the compression property turned on, ZFS chooses the recordsize based
on the byte size after compression,
No, the recordsize controls the "logical block size" - the unit of space
that can be independently read or written. Compression is done on
individual (64K in your case) blocks. We then allocate a smaller space on
disk (e.g. 27.5K) for that block.
Post by Carlo Pradissitto via illumos-zfs
so how can I choose the best recordsize setting?
You don't need to consider compression when choosing recordsize. Since
your application is doing 64K-sized, 64K-aligned writes, you should use
recordsize=64k.

--matt
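
For reference, a minimal sketch of how these properties might be set (the
dataset names are only examples; recordsize and compression affect only
blocks written after the change):

  zfs set recordsize=64k databases/wal
  zfs set recordsize=64k rpool/datafiles
  zfs set compression=lzjb rpool/datafiles
  # verify the settings and, once data is loaded, the achieved compression ratio
  zfs get recordsize,compression,compressratio rpool/datafiles
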
Carlo Pradissitto via illumos-zfs
2014-04-30 07:25:52 UTC
Hi Matt,
thanks for the answer.
What you said explains the results for the write() syscall (time in
nanoseconds; rows marked "(2)" are repeat runs):

recordsize   WAL write() time
32k          22486238464
64k          18450656026
128k         55756656279
256k         40972905789
512k         32588820186
1024k        27946125963
64k (2)      17652777547

But I don't understand these results:

recordsize   DB pwrite() time
32k          71150897554
64k          56389523961
128k         43993550828
256k         38910794332
512k         36554213562
1024k        50276109556
512k (2)     59310872879
What's wrong with the pwrite() syscall?

Before every test (see the command sketch after the list):

- shutdown the test-zone
- destroy WAL and DB datasets
- create WAL and DB datasets with new parameters
- boot the test-zone
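
A rough sketch of those steps as commands, assuming a zone named testzone
and the dataset names used elsewhere in this thread (zone/dataset
delegation details omitted):

  zoneadm -z testzone halt
  zfs destroy -r databases/wal
  zfs destroy -r rpool/datafiles
  zfs create -o recordsize=64k -o compression=lzjb databases/wal
  zfs create -o recordsize=64k -o compression=lzjb rpool/datafiles
  zoneadm -z testzone boot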

Thanks
Carlo
Jim Klimov via illumos-zfs
2014-04-30 12:02:49 UTC
Post by Carlo Pradissitto via illumos-zfs
Hi Matt,
thanks for the answer.
What you said explains the results for the write() syscall (time in
...
I think this has to do with the fact that ZFS logical blocks
are currently 128KB max. Your 512KB IOs are thus split into
four separate logical blocks, each of which independently
undergoes compression and allocation (probably contiguous
since they were queued at the same time).

HTH,
//Jim
Carlo Pradissitto via illumos-zfs
2014-04-30 12:14:41 UTC
Hi Jim,
from ZFS man page:
"The default recordsize is 128 KB. The size specified must be a power of
two greater than or equal to 512 and less than or equal to 1 MB"

As shown in the following report, the recordsize is 512K (dblk); in fact
the file data is spread over 35 blocks (512K * 35 = 17920K):


***@globalZone:~# zdb -dddddd rpool/datafiles 74
Dataset rpool/datafiles [ZPL], ID 1178, cr_txg 3271209, 179M, 41 objects,
rootbp DVA[0]=<0:cabf45e00:200:STD:1> DVA[1]=<0:28ace2800:200:STD:1> [L0
DMU objset] fletcher4 lzjb LE contiguous unique unencrypted 2-copy
size=800L/200P birth=3287652L/3287652P fill=41
cksum=179bc32678:7aa3c89cd81:159c9b28b2aba:2b714fc9758c68

Object  lvl   iblk   dblk  dsize  lsize   %full  type
    74    2    16K   512K  4.41M  17.5M  100.00  ZFS plain file
(K=inherit) (Z=inherit)
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 34
path /dbpedia/arco.cpm
uid 0
gid 0
atime Tue Apr 29 10:38:11 2014
mtime Tue Apr 29 14:53:08 2014
ctime Tue Apr 29 14:53:08 2014
crtime Tue Apr 29 10:38:11 2014
gen 3284593
mode 100644
size 18088960
parent 42
links 1
pflags 40800000204
Indirect blocks:
0 L1 0:c8410cc00:a00 0:28869f000:a00 4000L/a00P F=35
B=3287652/3287652
0 L0 0:cd1e3fa00:20800 80000L/20800P F=1 B=3287652/3287652
80000 L0 0:cf58d5800:20800 80000L/20800P F=1 B=3284640/3284640
100000 L0 0:ceb76d000:20a00 80000L/20a00P F=1 B=3284682/3284682
180000 L0 0:ceb78da00:20e00 80000L/20e00P F=1 B=3284682/3284682
200000 L0 0:c8e91da00:21800 80000L/21800P F=1 B=3284723/3284723
280000 L0 0:c8e93f200:20a00 80000L/20a00P F=1 B=3284723/3284723
300000 L0 0:cfbe7e600:20a00 80000L/20a00P F=1 B=3284764/3284764
380000 L0 0:cfbe9f000:20a00 80000L/20a00P F=1 B=3284764/3284764
400000 L0 0:cec8b4800:20a00 80000L/20a00P F=1 B=3284807/3284807
480000 L0 0:cec8d5200:20a00 80000L/20a00P F=1 B=3284807/3284807
500000 L0 0:ce7ead400:20a00 80000L/20a00P F=1 B=3284849/3284849
580000 L0 0:ce7ecde00:20a00 80000L/20a00P F=1 B=3284849/3284849
600000 L0 0:ce810cc00:20a00 80000L/20a00P F=1 B=3284891/3284891
680000 L0 0:ce812d600:20a00 80000L/20a00P F=1 B=3284891/3284891
700000 L0 0:c90e87a00:20a00 80000L/20a00P F=1 B=3284933/3284933
780000 L0 0:c90ea8400:20a00 80000L/20a00P F=1 B=3284933/3284933
800000 L0 0:c9c65bc00:20a00 80000L/20a00P F=1 B=3284976/3284976
880000 L0 0:c9c67c600:20a00 80000L/20a00P F=1 B=3284976/3284976
900000 L0 0:cf054bc00:20a00 80000L/20a00P F=1 B=3285018/3285018
980000 L0 0:cf056c600:20a00 80000L/20a00P F=1 B=3285018/3285018
a00000 L0 0:cad38c000:20a00 80000L/20a00P F=1 B=3285061/3285061
a80000 L0 0:cad3aca00:20a00 80000L/20a00P F=1 B=3285061/3285061
b00000 L0 0:ce5fc6e00:20a00 80000L/20a00P F=1 B=3285105/3285105
b80000 L0 0:ce5fe7800:20a00 80000L/20a00P F=1 B=3285105/3285105
c00000 L0 0:cd4abb200:20a00 80000L/20a00P F=1 B=3285148/3285148
c80000 L0 0:cd4adbc00:20a00 80000L/20a00P F=1 B=3285148/3285148
d00000 L0 0:cc57ea600:20a00 80000L/20a00P F=1 B=3285192/3285192
d80000 L0 0:cc580b000:20a00 80000L/20a00P F=1 B=3285192/3285192
e00000 L0 0:cb1a0ea00:20a00 80000L/20a00P F=1 B=3285235/3285235
e80000 L0 0:cb1a2f400:20a00 80000L/20a00P F=1 B=3285235/3285235
f00000 L0 0:cc2e31c00:20a00 80000L/20a00P F=1 B=3285279/3285279
f80000 L0 0:cc2e52600:20a00 80000L/20a00P F=1 B=3285279/3285279
1000000 L0 0:ce0fab800:20a00 80000L/20a00P F=1 B=3285321/3285321
1080000 L0 0:ce0fcc200:20a00 80000L/20a00P F=1 B=3285321/3285321
1100000 L0 0:cc2cb8400:12400 80000L/12400P F=1 B=3285340/3285340

segment [0000000000000000, 0000000001180000) size 17.5M
Tim Chase via illumos-zfs
2014-04-30 12:33:59 UTC
Hello,

FYI:

Apparently Solaris has supported ZFS record sizes greater than 128KiB since
at least Solaris 11. There's a WIP patch
<https://github.com/behlendorf/zfs/commit/d1463eebde927201b1a40b355003a832558bc02e>
for ZoL to provide similar support. At this point, I'd suspect it's most
likely that something compatible will be picked up by all of the OpenZFS
implementations.

- Tim




Toomas Soome via illumos-zfs
2014-04-30 12:37:20 UTC
zpool v32 added a 1MB block size (which is currently the max in S11).

rgds,
toomas
Matthew Ahrens via illumos-zfs
2014-04-30 16:20:28 UTC
Post by Tim Chase via illumos-zfs
Hello,
Apparently Solaris has supported ZFS record sizes greater than 128KiB since
at least Solaris 11. There's a WIP patch
<https://github.com/behlendorf/zfs/commit/d1463eebde927201b1a40b355003a832558bc02e>
for ZoL to provide similar support. At this point, I'd suspect it's most
likely that something compatible will be picked up by all of the OpenZFS
implementations.
I've worked on completing the large block support that Brian Behlendorf
started; work is here:

https://github.com/ahrens/illumos/commits/largeblock

Note that the interface is compatible with Oracle ZFS, but the on-disk
format is not (due to different strategies for recording what features are
in use on disk; it should be possible to write a utility to verify that the
formats are actually compatible and then change the pool from version 32 to
feature flags).

--matt



Matthew Ahrens via illumos-zfs
2014-04-30 22:18:04 UTC
Post by Richard Elling via illumos-zfs
Post by Matthew Ahrens via illumos-zfs
Post by Tim Chase via illumos-zfs
Hello,
Apparently Solaris has supported ZFS record sizes greater than 128KiB since
at least Solaris 11. There's a WIP patch for ZoL to provide similar support.
At this point, I'd suspect it's most likely that something compatible will
be picked up by all of the OpenZFS implementations.
I've worked on completing the large block support that Brian Behlendorf
started; work is here:
https://github.com/ahrens/illumos/commits/largeblock
thanks!
Having experience in this area, it is a good thing that it is disabled by
default on Solaris 11.1. Oddly, it seems to be enabled for scrub/resilver
(more tracing needed, but I'm stuck sitting on a tarmac :-(. This is deadly
to performance when you mix workloads -- trains vs cars. Best
recommendation thus far is to leave the default recordsize at 128k, or
otherwise closer to your app's natural size.
-- richard

I haven't done a lot of performance testing yet -- would appreciate help on
that. My design keeps everything with max 128K (including things like i/o
aggregation), except for user data on datasets that have had the recordsize
explicitly increased.

--matt
Robert Milkowski via illumos-zfs
2014-05-02 12:46:28 UTC
Post by Richard Elling via illumos-zfs
Having experience in this area, it is a good thing that it is disabled by
default on Solaris 11.1. Oddly, it seems to be enabled for scrub/resilver
(more tracing needed, but I'm stuck sitting on a tarmac :-(. This is deadly
to performance when you mix workloads -- trains vs cars. Best
recommendation thus far is to leave the default recordsize at 128k, or
otherwise closer to your app's natural size.
Yes, the default should probably stay at 128KB (which is also the case in
Solaris 11).

FYI - we've deployed some environments with recordsize=1MB and gzip
compression. In one of the cases the compression ratio we get with our data
is over 20x, which changes the dynamics considerably (a logical block gets
compressed to <50KB). After benchmarking our application we found that the
performance with 128KB vs 1MB is about the same, but the compression ratio
gets significantly better, so we are saving disk space with no performance
impact.
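
A sketch of that kind of setup, assuming the pool already supports large
blocks and using an example dataset name:

  zfs create -o recordsize=1m -o compression=gzip tank/bigdata
  # after loading data, check how well it compresses
  zfs get recordsize,compression,compressratio tank/bigdata
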
--
Robert Milkowski
http://milek.blogspot.com
Richard Elling via illumos-zfs
2014-05-02 16:54:33 UTC
Post by Robert Milkowski via illumos-zfs
Post by Richard Elling via illumos-zfs
Having experience in this area, it is a good thing that it is disabled by
default on Solaris 11.1. Oddly, it seems to be enabled for scrub/resilver
(more tracing needed, but I'm stuck sitting on a tarmac :-(. This is deadly
to performance when you mix workloads -- trains vs cars. Best
recommendation thus far is to leave the default recordsize at 128k, or
otherwise closer to your app's natural size.
Yes, the default should probably stay at 128KB (which is also the case in
Solaris 11).
Agree. For general-purpose workloads, 1MB is a poor choice. For some large-object
workloads 1MB is a good idea. Maybe it is time to dust off my spacemaps-from-space
and take some aerial photos of the spacemaps for these large workloads :-)
Post by Robert Milkowski via illumos-zfs
FYI - we've deployed some environments with recordsize=1MB and gzip
compression. In one of the cases the compression ratio we get with our data
is over 20x, which changes the dynamics considerably (a logical block gets
compressed to <50KB). After benchmarking our application we found that the
performance with 128KB vs 1MB is about the same, but the compression ratio
gets significantly better, so we are saving disk space with no performance
impact.
Very interesting! This makes sense as the compressors can work better on
larger blocks.
-- richard
Richard Elling via illumos-zfs
2014-04-30 22:15:44 UTC
Post by Matthew Ahrens via illumos-zfs
Post by Tim Chase via illumos-zfs
Hello,
Apparently Solaris has supported ZFS record sizes greater than 128KiB since
at least Solaris 11. There's a WIP patch for ZoL to provide similar support.
At this point, I'd suspect it's most likely that something compatible will
be picked up by all of the OpenZFS implementations.
I've worked on completing the large block support that Brian Behlendorf
started; work is here:
https://github.com/ahrens/illumos/commits/largeblock
thanks!
Having experience in this area, it is a good thing that it is disabled by default on Solaris 11.1. Oddly, it seems to be enabled for scrub/resilver (more tracing needed, but I'm stuck sitting on a tarmac :-(. This is deadly to performance when you mix workloads -- trains vs cars. Best recommendation thus far is to leave the default recordsize at 128k, or otherwise closer to your app's natural size.
-- richard
Jim Klimov via illumos-zfs
2014-05-02 11:08:18 UTC
Post by Carlo Pradissitto via illumos-zfs
What's wrong with the pwrite() syscall?
* shutdown the test-zone
* destroy WAL and DB datasets
* create WAL and DB datasets with new parameters
* boot the test-zone
How many times did you run each test - once, or are the numbers above
an average of X runs each? Did you interleave them somehow, or run X1
runs of write() and X2 runs of pwrite() as two sequences?

I think that besides the possible difference between write() and pwrite()
that your question implies, there are also several differences in the
system between the runs. For example, do the tests work on the same data
in the database (i.e. is the work regarding processing, compression and
the size of compressed data always the same)? Also, new allocations may
be subject to the effects of fragmentation, since they search for large
enough "holes" to fit a block into, which may cause your tests with the
largest block sizes to be slower (searching for adequately big "holes"
may take longer, unless your pool is quite new and empty). Caching may
also be a factor, as well as other concurrent loads on the machine...

Is it possible for you to make the testing rig as deterministic as
possible, e.g. by putting the ZFS test data (WAL and DB) in a separate
pool on separate low-level storage (e.g. a slice of a disk), so that
between runs you export, destroy and recreate the pool, to minimize the
effects of fragmentation and clear out the related caches?
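
For example, a rough sketch of such a per-run reset, assuming a dedicated
test disk (the device name is only illustrative):

  # recreate the test pool from scratch before each run
  zpool destroy databases 2>/dev/null
  zpool create -O recordsize=64k -O compression=lzjb databases c7t1d0
  # ... run one test, then destroy again before the next run ...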

By the way, are you certain that your pool is aligned with the sector
sizes of your storage (512b or 4K disk sectors, or 256K/512K SSD pages,
etc.)? IOs with blocks that regularly cross sector boundaries might also
have an influence on the time it takes to process the storage calls,
although this is more of a factor for small IOs.
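
One quick way to check, as a sketch (the pool name is just an example): the
vdev ashift reported by zdb should be 12 for 4K-sector disks and 9 for
512-byte-sector disks:

  zdb -C databases | grep ashift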

Finally note, though you probably know this, that if your data is
(randomly) written in 64k pages, an update in the middle of a 1024k ZFS
logical block requires the system to read the whole 1024k (less if
compressed), update the data in memory, and write out the whole 1024k
(again, less if compressed).
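
A hypothetical illustration of that read-modify-write, using dd to issue a
single 64K write at a 64K-aligned offset into an existing file (the path is
just an example):

  # Write 64K at offset 5*64K = 320K without truncating the file.
  # With recordsize=64k this overwrites exactly one ZFS block; with a larger
  # recordsize (e.g. 1M) ZFS must first read the rest of the record the
  # write lands in, merge the 64K, and write the whole record back out.
  dd if=/dev/urandom of=/rpool/datafiles/testfile bs=64k count=1 seek=5 conv=notrunc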

With smaller write pages (e.g. 4K for storage of VM disk images - where
IOs are not always very random, since many files inside the VMs are stored
sequentially - or the 8k-16k typical for databases), it was argued on these
lists that the optimal backend (ZFS) block might be larger, about 32k-64k,
to strike a good balance. But I am not sure this holds for pages as large
as 64k (namely, that a 128K or larger backend block would still be better).

HTH,
//Jim Klimov
Carlo Pradissitto
2014-05-05 10:05:56 UTC
Hi Jim,
your advice is very interesting.
I actually ran the same test so many times that I can't remember how many!

The test is quite simple: a Java process reads from an RDF file and loads
its records into the DB.

The WAL and DB datasets come from different zpools, each based on a
different disk:

***@globalZone:~# zpool list
NAME        SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
databases   278G   447K   278G   0%  1.00x  ONLINE  -
rpool       278G  20.0G   258G   7%  1.18x  ONLINE  -

***@globalZone:~# zpool status
pool: databases
state: ONLINE
scan: none requested
config:

NAME         STATE     READ WRITE CKSUM
databases    ONLINE       0     0     0
  c7t1d0     ONLINE       0     0     0

errors: No known data errors

pool: rpool
state: ONLINE
scan: none requested
config:

NAME         STATE     READ WRITE CKSUM
rpool        ONLINE       0     0     0
  c7t0d0     ONLINE       0     0     0

errors: No known data errors

***@globalZone:~# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
databases/wal      31K   274G    31K  legacy
rpool/datafiles   179M   253G   179M  legacy


I thought it might be a metaslab fragmentation/allocation issue, but even
after dozens of attempts this is the metaslab layout I have:
***@globalZone:~# zdb -mm rpool |grep freepct
    segments      34   maxsize   409K   freepct    0%
    segments       2   maxsize   985K   freepct    0%
    segments       2   maxsize   986K   freepct    0%
    segments       7   maxsize   773K   freepct    0%
    segments    1900   maxsize  71.5M   freepct    5%
    segments    7171   maxsize   421M   freepct   90%
    segments   12581   maxsize  22.3M   freepct   26%
    segments     101   maxsize  1.50G   freepct   84%
    segments      52   maxsize  1.25G   freepct   72%
    segments      23   maxsize  1.56G   freepct   84%
    segments      17   maxsize  1.67G   freepct   85%
    segments     131   maxsize  95.0M   freepct   85%
    segments     938   maxsize  18.6M   freepct   86%
    segments    2144   maxsize   112M   freepct   83%
    segments    2921   maxsize  81.0M   freepct   51%
    segments    3194   maxsize  13.1M   freepct   21%
    segments    2966   maxsize  16.0M   freepct   18%
    segments     838   maxsize  57.8M   freepct   93%
    segments     510   maxsize  70.5M   freepct   87%
    segments     760   maxsize   137M   freepct   73%
    segments     801   maxsize   693M   freepct   90%
    segments    1522   maxsize  24.2M   freepct   80%
    segments     241   maxsize   303M   freepct   98%
    segments    2317   maxsize   190M   freepct   98%
    segments      48   maxsize  1.05G   freepct   99%
    segments    5266   maxsize   106M   freepct   86%
    segments     101   maxsize   496M   freepct   99%
    segments    3817   maxsize  1.55G   freepct   98%
    segments      21   maxsize  1.22G   freepct   99%
    segments       3   maxsize  1.66G   freepct   99%
    segments       1   maxsize     2G   freepct  100%   (x4)
    segments    3740   maxsize  1.48G   freepct   99%
    segments       1   maxsize     2G   freepct  100%   (x5)
    segments    3187   maxsize   320M   freepct   99%
    segments       1   maxsize     2G   freepct  100%   (x3)
    segments    6005   maxsize   109M   freepct   98%
    segments       1   maxsize     2G   freepct  100%   (x9)
    segments     517   maxsize  1.42G   freepct   99%
    segments       1   maxsize     2G   freepct  100%   (x12)
    segments     740   maxsize   829M   freepct   99%
    segments     327   maxsize  1.66G   freepct   99%
    segments       1   maxsize     2G   freepct  100%   (x70, all remaining metaslabs)

Aren't there enough free 1M holes in the outer, faster portion of the disk?
I would like to run some new tests with the datafiles on a flash disk.
If it were a metaslab issue, I would expect to always get the same result
with the same recordsize, or am I missing something?

The server is an old IBM xSeries, about ten years old, and the sector size is 4KB.

Your final note is certainly correct, but I don't simply want to discover
the best recordsize for this workload; I would like to understand why I
don't always get the same result with the same recordsize when using the
pwrite() syscall.

thanks a lot
Carlo