Hi Jim,
your advice is very interesting.
I actually ran the same test so many times that I can't remember how many!
The test is quite simple: a Java process reads records from an RDF file and loads
them into the DB.
The WAL and DB datasets come from two different zpools, each based on a
different disk:
***@globalZone:~# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
databases 278G 447K 278G 0% 1.00x ONLINE -
rpool 278G 20.0G 258G 7% 1.18x ONLINE -
***@globalZone:~# zpool status
pool: databases
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
databases ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
errors: No known data errors
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
errors: No known data errors
***@globalZone:~# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
databases/wal     31K   274G    31K  legacy
rpool/datafiles  179M   253G   179M  legacy
I thought it might be a metaslab fragmentation/allocation issue but, even
after dozens of attempts, this is how the metaslabs look:
***@globalZone:~# zdb -mm rpool |grep freepct
segments 34 maxsize 409K freepct 0%
segments 2 maxsize 985K freepct 0%
segments 2 maxsize 986K freepct 0%
segments 7 maxsize 773K freepct 0%
segments 1900 maxsize 71.5M freepct 5%
segments 7171 maxsize 421M freepct 90%
segments 12581 maxsize 22.3M freepct 26%
segments 101 maxsize 1.50G freepct 84%
segments 52 maxsize 1.25G freepct 72%
segments 23 maxsize 1.56G freepct 84%
segments 17 maxsize 1.67G freepct 85%
segments 131 maxsize 95.0M freepct 85%
segments 938 maxsize 18.6M freepct 86%
segments 2144 maxsize 112M freepct 83%
segments 2921 maxsize 81.0M freepct 51%
segments 3194 maxsize 13.1M freepct 21%
segments 2966 maxsize 16.0M freepct 18%
segments 838 maxsize 57.8M freepct 93%
segments 510 maxsize 70.5M freepct 87%
segments 760 maxsize 137M freepct 73%
segments 801 maxsize 693M freepct 90%
segments 1522 maxsize 24.2M freepct 80%
segments 241 maxsize 303M freepct 98%
segments 2317 maxsize 190M freepct 98%
segments 48 maxsize 1.05G freepct 99%
segments 5266 maxsize 106M freepct 86%
segments 101 maxsize 496M freepct 99%
segments 3817 maxsize 1.55G freepct 98%
segments 21 maxsize 1.22G freepct 99%
segments 3 maxsize 1.66G freepct 99%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 3740 maxsize 1.48G freepct 99%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 3187 maxsize 320M freepct 99%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 6005 maxsize 109M freepct 98%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 517 maxsize 1.42G freepct 99%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 740 maxsize 829M freepct 99%
segments 327 maxsize 1.66G freepct 99%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
segments 1 maxsize 2G freepct 100%
Aren't there enough free 1M holes in the outer, faster part of the disk?
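To read that output at a glance, I also summarize it with a quick grep/awk pass
over the same command (nothing zdb-specific, just bucketing the last column),
for example:

zdb -mm rpool | grep freepct | awk '{ gsub("%","",$NF);
    if ($NF >= 90) hi++; else if ($NF >= 50) mid++; else low++ }
    END { printf "90-100%%: %d   50-89%%: %d   0-49%%: %d metaslabs\n", hi, mid, low }'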
I would like to run some new tests with the datafiles on a flash disk.
If it is a metaslab issue, in that case I would expect to always get the same
result with the same recordsize, or am I missing something?
The server is an old IBM xSeries, about ten years old, and the sector size is 4 KB.
Your final note is certainly correct, but I don't simply want to discover the
best recordsize for this workload; I would like to understand why I don't
always get the same result with the same recordsize when using the pwrite()
syscall.
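To dig into that I am thinking of tracing the loader's pwrite() calls directly
with DTrace, along these lines (just a sketch; I'm assuming the writer shows up
as "java", so the predicate may need adjusting):

dtrace -n '
  syscall::pwrite*:entry /execname == "java"/ { self->ts = timestamp; }
  syscall::pwrite*:return /self->ts/ {
      /* per-call latency distribution, in nanoseconds */
      @lat["pwrite latency (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
  }'

If the latency distribution shifts between runs with the same recordsize, that
should at least tell me whether the variance is in the syscall path itself or
somewhere else.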
thanks a lot
Carlo
Post by Jim Klimov via illumos-zfs
Post by Carlo Pradissitto via illumos-zfs
What's wrong with the pwrite() syscall?
* shutdown the test-zone
* destroy WAL and DB datasets
* create WAL and DB datasets with new parameters
* boot the test-zone
How many times did you run each test - once, or are the numbers above
an average of X runs each? Did you interleave them somehow, or run X1
runs of write() and X2 runs of pwrite() as two sequences?
I think that besides the possible difference between write() and pwrite()
that your question implies, there are also several differences in the
system between the runs. For example, do the tests work on the same data
in the database (i.e. is the work regarding processing, compression and
the size of compressed data always the same?) Also, new allocations may
be subject to the results of fragmentation since they search for large
enough "holes" to fit a block into, which may cause your tests with the
largest block sizes to be slower (searching for adequately big "holes"
may take longer, unless your pool is quite new and empty). Caching may
also be a factor, as well as other concurrent loads on the machine...
Is it possible for you to make the testing rig as deterministic as
possible, e.g. by making the pool for the ZFS data tests (WAL and DB) a
separate pool on separate low-level storage (e.g. a slice on a disk),
so that between runs you export, destroy and recreate the pool, to
minimize the effects of its fragmentation and clear out the related
caches?
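For example, just as a sketch (pool and device names taken from your listing;
the zone name is my guess, adjust it to yours):

zoneadm -z testzone halt        # hypothetical zone name
zpool destroy databases
zpool create databases c7t1d0
zfs create -o mountpoint=legacy databases/wal
zoneadm -z testzone boot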
By the way, are you certain that your pool is aligned with the sector
sizes on your storage (512b or 4K disk sectors, or 256K/512K SSD
pages, etc.)? IOs with blocks that regularly cross sector boundaries
might have an influence on the time it takes to process the storage
calls as well, although this is more of a factor for small IOs.
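You can see what sector size the pools were created for from their cached
configs, e.g. (ashift=9 means 512-byte sectors, ashift=12 means 4K):

zdb -C rpool | grep ashift
zdb -C databases | grep ashift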
Finally note, though you probably know this, that if your data is
(randomly) written in 64k pages, the updates in the middle of a
1024k ZFS logical block require the system to read the whole 1024k
(less if compressed), update the data in memory and write out the
1024k (less if compressed).
With smaller write pages (i.e. 4K for storage of VM disk images -
where IOs are not always very random, since many files inside the
VMs are stored sequentially, or 8k-16k typical for databases), it
was argued on these lists that an optimal backend (ZFS) block might
better be larger, about 32k-64k, to gain an optimal balance. But I
am not sure this holds for pages as large as 64k (namely, that a
128K or larger backend block would still be more optimal).
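If you want to experiment with matching the dataset to the 64k write size,
something like the following should do (note that recordsize only applies to
files written after the change, so the datafiles would need to be rewritten
or recreated):

zfs set recordsize=64k rpool/datafiles
zfs get recordsize,compression rpool/datafiles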
HTH,
//Jim Klimov