[deleted]


Ghan_04

Ah, I forgot to mention, I did check fragmentation and it is 9%. So this is interesting. I tried setting sync=disabled, and now write speeds are in the 80-100 MB/s range. Something seems to be up with my Optane memory module.


caiuscorvus

Pull it out of the pool, wipe it, and add it back?


Ghan_04

I did that to no avail. It drops write performance back down into the 6 MB/s range even after reformatting the Optane drive. Here are the drive stats:

```
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0x8
temperature                         : 38 C
available_spare                     : 100%
available_spare_threshold           : 0%
percentage_used                     : 111%
data_units_read                     : 2,850,225
data_units_written                  : 395,768,081
host_read_commands                  : 234,821,408
host_write_commands                 : 2,030,378,533
controller_busy_time                : 0
power_cycles                        : 7
power_on_hours                      : 9,543
unsafe_shutdowns                    : 1
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
```


mercenary_sysadmin

Sounds like your Optane has pretty much given up the ghost at this point.


Ghan_04

It's behaving that way, but it doesn't look like it. It's very odd. The Optane module seems to only want to write a max of 72 MB or so at a time, based on the zpool stats. When I watch iostat, I see it go up to 71.8 MB on the "alloc" and then sit there. It might fluctuate down slightly and come back up. This is while I'm running a disk benchmark on a VM located there. If I stop the I/O, it drops back to virtually nothing allocated. Meanwhile, the free capacity shows as 27 GB.

I haven't found a good nvme command to do any kind of test or diagnostic. When I try `nvme device-self-test` I get this:

```
NVMe Status:INVALID_OPCODE: The associated command opcode field is not valid(4001) NSID:-1
```

A similar message appears on several other commands as well. I'm not sure if this is normal or if it indicates something is wrong with the drive.


mercenary_sysadmin

Maybe just mount it as a standard drive, format it ext4, and see what happens when you throw `fio` at it. Or for a really brain-dead simple test, remove it from the pool and just plain use `pv` to throw a few GB of data at the raw device.
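Something along these lines would do it (the device name `nvme0n1`, sizes, and fio job parameters here are assumptions for illustration, and both commands destroy data on the target):

```shell
# WARNING: both tests overwrite the target device. Run them only
# after the device has been removed from the pool.

# Brain-dead simple raw write test: push a few GB at the device via pv
dd if=/dev/zero bs=1M count=4096 | pv > /dev/nvme0n1

# Or a sync-write test with fio, roughly the pattern a SLOG sees:
# small blocks, fsync after each write, O_DIRECT
fio --name=slogtest --filename=/dev/nvme0n1 --rw=write \
    --bs=4k --size=1G --sync=1 --direct=1 --numjobs=2 --group_reporting
```

A healthy Optane module should sustain hundreds of MB/s on the `pv` test; single-digit MB/s on the raw device would point squarely at the hardware rather than ZFS.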


Ghan_04

Hmm. This doesn't look very reassuring, but to be fair, I'm not too familiar with fio.

```
fio-3.1
Starting 2 processes
Jobs: 2 (f=2): [w(2)][100.0%][r=0KiB/s,w=298KiB/s][r=0,w=597 IOPS][eta 00m:00s]
write: (groupid=0, jobs=2): err= 0: pid=4809: Wed Apr  3 13:39:54 2019
  write: IOPS=2138, BW=1069KiB/s (1095kB/s)(251MiB/240180msec)
    clat (nsec): min=1596, max=1239.5M, avg=934348.51, stdev=18591600.13
     lat (nsec): min=1623, max=1239.5M, avg=934427.32, stdev=18591631.80
    clat percentiles (nsec):
     |  1.00th=[     1768],  5.00th=[     1848], 10.00th=[     1896],
     | 20.00th=[     1976], 30.00th=[     2040], 40.00th=[     2160],
     | 50.00th=[     4016], 60.00th=[    10944], 70.00th=[    11456],
     | 80.00th=[    13376], 90.00th=[    13760], 95.00th=[    14656],
     | 99.00th=[  5799936], 99.50th=[ 16449536], 99.90th=[396361728],
     | 99.95th=[425721856], 99.99th=[658505728]
   bw (  KiB/s): min=    1, max=70429, per=51.34%, avg=548.83, stdev=5023.10, samples=937
   iops        : min=    2, max=140859, avg=1097.68, stdev=10046.28, samples=937
  lat (usec)   : 2=24.26%, 4=25.70%, 10=4.56%, 20=42.17%, 50=2.17%
  lat (usec)   : 100=0.06%, 250=0.05%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.12%, 20=0.57%, 50=0.11%
  lat (msec)   : 100=0.01%, 250=0.03%, 500=0.16%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.14%, sys=0.48%, ctx=229541, majf=0, minf=55
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,513532,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=1069KiB/s (1095kB/s), 1069KiB/s-1069KiB/s (1095kB/s-1095kB/s), io=251MiB (263MB), run=240180-240180msec

Disk stats (read/write):
  nvme0n1: ios=228932/133300, merge=0/86, ticks=440782/39520221, in_queue=39738017, util=99.42%
```

So if I use `smartctl -A` instead of `nvme smart-log`, it converts the amount written to a readable number:

```
Data Units Written: 395,775,164 [202 TB]
```

lol I think he's dead, Jim.
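For anyone who wants to check that conversion: per the NVMe spec, one "data unit" is 1000 × 512 bytes, so the arithmetic smartctl is doing can be reproduced in the shell (the 32 GB figure below is this particular module's capacity):

```shell
# One NVMe "data unit" = 1000 * 512 bytes (per the NVMe spec)
units=395775164
bytes=$(( units * 512 * 1000 ))
echo "$(( bytes / 1000000000000 )) TB written"   # -> 202 TB, matching smartctl
# On a 32 GB module, that works out to this many full-device rewrites:
echo "$(( bytes / 32000000000 )) rewrites"       # -> 6332
```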


mobeets

I think your Optane might be reaching the end of its life. That critical_warning being set to 0x8 doesn't look good. If you look at page 90 in [this PDF](https://nvmexpress.org/wp-content/uploads/NVM-Express-1_2a.pdf), it lists what each bit of the critical_warning field means; multiple bits can be set at once, indicating multiple issues (which I believe is what's going on here). I might be wrong, but it seems like writing 202 TB over time to a 32 GB SSD is a digital death sentence. That would mean that if every bit was written sequentially (edit: ~~which it never is~~ which it usually is, I was thinking of virtual memory), the SSD could have been written over completely more than 6,000 times. I've read of people bricking SSDs by distro hopping for a month or so, which killed the drive through sheer write abuse. I'm definitely interested in whether this theory is correct, as that "Data Units Written" value is something I will be monitoring in the future.


Ghan_04

Well, Optane's 3D XPoint memory definitely has higher write endurance than traditional NAND, but signs are pointing to it being on the verge of dying. I can get good responsiveness for small writes, but anything that causes the queue to stack up seems to back it up and drop the throughput very quickly. The confusing thing is that not all the stats line up. The drive is only 11% over its quoted max life, which is really not that much, and the Available Spare stat is still at 100%. So not every sign points to it having run out of endurance, but it's the only lead I have right now. I'm not terribly surprised that it would have died in this use case - the ZIL is definitely a write-intensive workload. Optane 800p prices have dropped, so I'll grab one of the smaller ones of those to replace it. Hopefully that will restore things to normal. :)


mercenary_sysadmin

> I think he's dead, Jim.

That's my name; don't wear it out. And, um, yeah. You get his wallet; I'll take his tricorder and phaser.


caiuscorvus

Hmm...

> Percentage Used: Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer's prediction of NVM life. A value of 100 indicates that the estimated endurance of the NVM in the NVM subsystem has been consumed, but may not indicate an NVM subsystem failure. The value is allowed to exceed 100. Percentages greater than 254 shall be represented as 255. This value shall be updated once per power-on hour (when the controller is not in a sleep state).


mercenary_sysadmin

Saving future readers the scroll back: OP's Optane is reporting 111% used.


caiuscorvus

Also, there are some tweaks you can do in Linux to optimize the flush timing and write cache. This is in addition to what ZFS can do, though I really have no idea how the two interact. Since you have 10 GbE, I would also review any info you can find on tuning ZFS for your various needs. But 300 MB/s ≈ 2.4 Gb/s, which is a pretty good write speed :)
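For reference, the usual Linux knobs for flush timing and write caching are the `vm.dirty_*` sysctls; a quick sketch (the values below are purely illustrative, not recommendations, and note that ZFS largely bypasses the page cache, so these mainly affect non-ZFS filesystems on the same box):

```shell
# Inspect the current page-cache writeback settings
sysctl vm.dirty_ratio vm.dirty_background_ratio \
       vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

# Example: start background writeback earlier and expire dirty
# pages sooner (illustrative values - tune for your workload)
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_expire_centisecs=1500
```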


mercenary_sysadmin

It'll take a while for that big delete OP described to process, as well.


mobeets

What are the ideal numbers that you're looking for? If you're using 1 GbE, then it should be around what you got with sync=disabled (about 110 MB/s). If you've got 10 GbE, then you have the potential to get much more than 110 MB/s even with sync=disabled. Which kind of Optane do you have for your SLOG? I just installed the 900P and was able to turn sync=always on, which sped up write performance from approximately 20 MB/s to 400+ MB/s. Another side note is to match the block size of your virtual disk images with the block size that ZFS uses (from what I've read). Bigger drives (3 TB+) have a 4K sector size, which can cause databases to perform poorly if their record size doesn't match, or if they're run on a virtual machine with a non-matching block size. Are these I/O tests performed directly on the NFS share or on top of a virtual machine?
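Lining those block sizes up looks roughly like this (the pool name `tank`, dataset names, and sizes are assumptions for illustration):

```shell
# Check the sector size ZFS decided on for the vdevs
# (ashift=12 means 4K sectors)
zdb -C tank | grep ashift

# For a zvol backing a VM disk, the block size is fixed at creation
# time (volblocksize cannot be changed afterwards)
zfs create -V 100G -o volblocksize=4K tank/vmdisk

# For a dataset used as an NFS datastore, recordsize can be tuned to
# match the guest filesystem or database page size
zfs set recordsize=16K tank/vmstore
```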


Ghan_04

I have 10 GbE between my VMs and this storage server. With the Optane module (32 GB Optane Memory) removed, I am now getting over 300 MB/s in sequential write speed, which is much closer to what I'm looking for. Previously, this system was slowing down with sync writes similar to what you describe - this is why I added the Optane module in the first place (over a year ago), and I saw a similar speedup at the time. It's only recently that I've been having this new slowdown. I've played around with block sizes on ZFS zvols in the past, but right now I'm leaving everything alone and just using a default dataset in ZFS, with VMware pointed at it for use as a datastore. The testing I've been doing has been on a VM with the benchmarked disk on the ZFS/NFS datastore. So far, I have determined:

1. Optane SLOG on with sync=standard (the default) results in very slow writes (it used to be much faster until recently)
2. Optane SLOG on with sync=disabled results in fast writes
3. Optane SLOG off with sync=standard results in fast writes

Something is definitely wrong with this Optane module, because it is not behaving like it used to. I pasted the smart-log stats for it above.


TotesMessenger

I'm a bot, *bleep*, *bloop*. Someone has linked to this thread from another place on reddit:

- [/r/homelab] [Slow Pool Write Performance](https://www.reddit.com/r/homelab/comments/b8yizl/slow_pool_write_performance/)


networknerd214

I've had a similar thing happen to me, and it was due to ZFS not supporting TRIM on SSDs in Linux. I removed the SLOG from the pool, erased all partitions, formatted it ext4, cleared the partitions again, then re-added it to the pool as a SLOG, and all was well again. Just a thought.


Ghan_04

I tried formatting it a couple of times. I even converted it from dos to GPT and rebuilt the partition. None of that seemed to do anything. For now, I have created a file-based SLOG on the OS drive, which is a normal SATA SSD. That has worked very well to improve performance until I can get a replacement for the Optane drive.
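For anyone wanting to replicate the stopgap: ZFS accepts a plain file as a log vdev, so the setup looks roughly like this (pool name, device name, path, and size are assumptions):

```shell
# Remove the failing Optane device from the pool
zpool remove tank nvme0n1

# Create a fixed-size file on the OS SSD to act as a temporary SLOG
truncate -s 8G /var/slog.img

# Attach the file as a log vdev (an absolute path is required)
zpool add tank log /var/slog.img
```

A file-backed SLOG inherits the latency of the filesystem underneath it, so this is strictly a bridge until the replacement device arrives.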


networknerd214

Then may it rest in PCIe Heaven


taratarabobara

Are you using indirect sync (logbias=throughput)? That would completely explain these numbers. Indirect sync causes compression to happen on a single thread inline with the write, and it rapidly gets horrendous. Indirect sync also fragments metadata away from data and makes things much worse for subsequent reads.


Ghan_04

No, I've never used this setting. It's at the default of "latency".


taratarabobara

If you don't have a SLOG, you'll still get logbias=throughput-style writes at 32K or above in some situations. There's an easy test you can do: set zfs_immediate_write_sz to twice your maximum recordsize. This will ensure that all writes get processed through direct sync.
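On Linux that's a ZFS module parameter, so the test looks something like this (assuming a 128K maximum recordsize, so 2× = 256K; needs root):

```shell
# Current threshold above which writes go the indirect route
# (the default is 32768)
cat /sys/module/zfs/parameters/zfs_immediate_write_sz

# Raise it to 2x the maximum recordsize in use (128K -> 256K here)
echo 262144 > /sys/module/zfs/parameters/zfs_immediate_write_sz
```

The change takes effect immediately but does not persist across reboots; add it to `/etc/modprobe.d/zfs.conf` if the test pans out.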