Nexenta Dedupe – Be Careful!

Just got done working on a friend’s Nexenta platform, and boy howdy was it tired.  NFS was slow, iSCSI was slow, and for all the slowness, we couldn’t see a problem with the system.  The GUI was reporting free RAM, and iostat showed that the disks weren’t being thrashed.  We didn’t see anything really out of the ordinary at first glance.

After some digging, we figured out that we were running out of RAM for the Dedupe tables.  It goes a little something like this.

Nexenta by default allocates 1/4 of your ARC cache (RAM) to metadata caching.  Your L2ARC map is considered metadata.  When you turn on dedupe, all of that dedupe information is stored in metadata.  The more you dedupe, the more RAM you use, the more L2ARC you use, the more RAM you use.

The system in question is a 48GB system, and it reported that it had free memory (6GB or so), so we were baffled.  If it’s got free RAM, what’s the holdup?  It seems that between the dedupe tables and the L2ARC map, we had outstripped the ARC’s ability to hold all of the metadata.  This caused _everything_ to be slow.  The solution?  You can either increase the percentage of RAM that can be used for metadata, increase the total RAM (thereby increasing the amount you can use for metadata caching), or you can turn off dedupe, copy everything off of the volume, then copy it back.  Since there’s currently no way to “undedupe” a volume, once that data has been created, you’re stuck with it until you remove the files.

So, without further ado, here’s how to figure out what’s going on in your system.

echo ::arc|mdb -k

This will display some interesting stats.  The most important in this situation is the last three lines :

arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

These numbers will change.  Things will get evicted, things will come back.  You don’t want to see the meta_used and meta_limit numbers this close.  You definitely don’t want to see the meta_max exceed the limit.  This is a great indicator that you’re out of RAM for metadata.
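If you want to keep an eye on this without squinting at the numbers, here is a rough Python sketch that reads the three arc_meta lines and flags the two conditions described above.  The function names and the 90% warning threshold are my own assumptions for illustration, not anything Nexenta ships, and the sample text is just the output shown above.

```python
import re

def parse_arc_meta(text):
    """Collect the arc_meta_* values (in MB) from `echo ::arc | mdb -k` output."""
    stats = {}
    pattern = r"^[ \t]*(arc_meta_\w+)\s*=\s*(\d+)\s*MB"
    for match in re.finditer(pattern, text, re.MULTILINE):
        stats[match.group(1)] = int(match.group(2))
    return stats

def meta_pressure(stats, warn_fraction=0.9):
    """Flag metadata pressure; warn_fraction is an arbitrary, hypothetical threshold."""
    used = stats["arc_meta_used"]
    limit = stats["arc_meta_limit"]
    peak = stats["arc_meta_max"]
    return {
        "used_fraction": used / limit,
        "near_limit": used >= warn_fraction * limit,
        "peak_exceeded_limit": peak > limit,
    }

# Sample taken from the output shown in this post.
sample = """\
arc_meta_used             =     11476 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB
"""

report = meta_pressure(parse_arc_meta(sample))
print(report)
```

On the sample above, metadata sits at roughly 96% of the limit, and the peak has already gone past it, which matches what we were seeing on the box.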

After quite a bit of futzing around, disabling dedupe, and shuffling data off of, then back onto the pool, things look better :

arc_meta_used             =      7442 MB
arc_meta_limit            =     12014 MB
arc_meta_max              =     12351 MB

Just by disabling dedupe, and blowing away the dedupe tables, it freed up almost 5GB of RAM.  Who knows how much was being swapped in and out of RAM.

Other things to check :

zpool status -D <volumename>

This gives you your standard volume status, but it also prints out the dedupe information.  This is good to figure out how much dedupe data there is.  Here’s an example :

DDT entries 7102900, size 997 on disk, 531 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    6.41M    820G    818G    817G    6.41M    820G    818G    817G
     2     298K   37.3G   37.3G   37.3G     656K   82.0G   82.0G   81.9G
     4    30.5K   3.82G   3.82G   3.81G     140K   17.5G   17.5G   17.5G
     8    43.9K   5.49G   5.49G   5.49G     566K   70.7G   70.7G   70.6G
    16      968    121M    121M    121M    19.1K   2.38G   2.38G   2.38G
    32      765   95.6M   95.6M   95.5M    33.4K   4.17G   4.17G   4.17G
    64       33   4.12M   4.12M   4.12M    2.77K    354M    354M    354M
   128        5    640K    640K    639K      943    118M    118M    118M
   256        2    256K    256K    256K      676   84.5M   84.5M   84.4M
    1K        1    128K    128K    128K    1.29K    164M    164M    164M
    4K        1    128K    128K    128K    5.85K    749M    749M    749M
   32K        1    128K    128K    128K    37.0K   4.63G   4.63G   4.62G
 Total    6.77M    867G    865G    864G    7.84M   1003G   1001G   1000G

This tells us that there are 7 million entries, with each entry taking up 997 bytes on disk, and 531 bytes in memory.  Simple math tells us how much space that takes up.

7102900 * 531 = 3771639900 bytes, and 3771639900 / 1024 / 1024 = 3596MB used in RAM

The same math tells us that there’s 6753MB used on disk, just to hold the dedupe tables.
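The same arithmetic can be sketched in a few lines of Python if you’d rather not do it by hand.  This just parses the “DDT entries” summary line from zpool status -D and multiplies it out; the function name is made up for this example.

```python
import re

MIB = 1024 * 1024

def ddt_footprint(summary_line):
    """Estimate DDT size from the 'DDT entries N, size X on disk, Y in core' line."""
    match = re.search(
        r"DDT entries (\d+), size (\d+) on disk, (\d+) in core", summary_line
    )
    if match is None:
        raise ValueError("not a DDT summary line")
    entries, bytes_on_disk, bytes_in_core = (int(g) for g in match.groups())
    return {
        "entries": entries,
        "disk_mb": entries * bytes_on_disk / MIB,  # per-entry size times entry count
        "core_mb": entries * bytes_in_core / MIB,
    }

footprint = ddt_footprint("DDT entries 7102900, size 997 on disk, 531 in core")
print(f"in RAM:  {footprint['core_mb']:.0f} MB")
print(f"on disk: {footprint['disk_mb']:.0f} MB")
```

The results land within a megabyte of the hand-worked figures above; the only difference is rounding versus truncating.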

The dedupe ratio on this system wasn’t even worth it.  Overall dedupe ratio was something like 1.15x.  Compression on that volume (which has nearly no overhead), after shuffling the data around, is at 1.42x.  So at the cost of CPU time (which there is plenty of), we get a better over-subscription ratio from compression than from deduplication.
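For what it’s worth, the ratio can be sanity-checked from the Total row of the DDT table.  The sketch below assumes the pool-wide dedupe ratio is referenced DSIZE divided by allocated DSIZE, which is my understanding of how ZFS reports it; the helper names are hypothetical, and the inputs are the rounded figures from the table, so the result is approximate.

```python
def parse_size(text):
    """Convert a zpool-style size such as '864G' or '121M' into bytes."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    suffix = text[-1].upper()
    if suffix in units:
        return float(text[:-1]) * units[suffix]
    return float(text)

def dedupe_ratio(allocated_dsize, referenced_dsize):
    """Assumed definition: referenced DSIZE / allocated DSIZE."""
    return parse_size(referenced_dsize) / parse_size(allocated_dsize)

# DSIZE values from the Total row of the table above.
ratio = dedupe_ratio("864G", "1000G")
saved_gib = (parse_size("1000G") - parse_size("864G")) / 1024**3

print(f"dedupe ratio: {ratio:.2f}x, space saved: {saved_gib:.0f}G")
```

That works out to a shade over 1.15x, or about 136G saved, in exchange for several GB of ARC metadata.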

There are definitely use-cases for deduplication, but his generic VM storage pool is not one of them (in my experience).

Posted in Nexenta

Nexenta and failing drives

Had a little hiccup today.  Apparently my Nexenta system has a failing drive, but it didn’t bother to notify me.

I’d been noticing that performance on one of my drive arrays was a little sluggish, but hadn’t had a chance to look into it.  So I took some time to dig into it the other day, and saw one disk doing half the work of the rest of the disks in the array.

A quick ‘fmadm faulty’ shows me where the problem lies :

TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
May 09 22:59:45 b6321014-be09-e24f-ded3-daefbf1bcc5a  DISK-8000-4Q   Critical

Host        : localhost
Platform    : S5520UR   Chassis_id  : …………
Product_sn  :

Fault class : fault.io.scsi.cmd.disk.dev.rqs.merr
Affects     : dev:///:devid=id1,sd@n5000c50020dcc347//scsi_vhci/disk@g5000c50020dcc347
                  faulted but still in service
FRU         : "SLOT 10 " (hc://:product-id=LSI-DE1600-SAS:server-id=:chassis-id=50080e5200c5b000:serial=mungedCESH:part=SEAGATE-ST31000424SS:revision=0005/ses-enclosure=3/bay=10/disk=0)
                  faulty

Description : The command was terminated with a non-recovered error condition
              that may have been caused by a flaw in the media or an error in
              the recorded data.
              Refer to http://sun.com/msg/DISK-8000-4Q for more information.

Response    : The device may be offlined or degraded.

Impact      : It is likely that continued operation will result in data
              corruption, which may eventually cause the loss of service or the
              service degradation.

Action      : Schedule a repair procedure to replace the affected device. Use
              'fmadm faulty' to find the affected disk.

Wonderful, I say.  Apparently sometimes notifications just don’t go out on things like that.  Apparently it’s supposed to be better in 3.1.

Anyway, just a heads up for anyone on 3.0.4 – if you start to feel like your system is going awry, give ‘fmadm faulty’ a run from the command line to make sure everything is working properly.

Posted in Nexenta

Micro-Cloud

Found this the other day – looks pretty sexy - http://www.supermicro.com/MicroCloud/

Posted in Hardware

Veeam

Yeah, it’s that simple.  Veeam is awesome.  Working with Veeam backups to local Nexenta storage, replicated over a 100Mbit WAN link to another Nexenta system for geo-redundancy.  Overall it works pretty slick.  Shoveling 500GB a day over a 100Mbit link strains the connection though.

Need to look into WAN accelerators if load is going to continue to be this high.
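To put a number on “strains the connection”, here’s a back-of-the-envelope sketch.  It assumes decimal gigabytes, a link that actually delivers its full 100Mbit, and no protocol overhead or compression, so real-world numbers will be worse.

```python
def transfer_hours(gigabytes, link_mbit_per_s):
    """Hours needed to move `gigabytes` over a link at full line rate."""
    bits = gigabytes * 1e9 * 8                # assumes decimal GB (10^9 bytes)
    seconds = bits / (link_mbit_per_s * 1e6)  # assumes no protocol overhead
    return seconds / 3600

hours = transfer_hours(500, 100)
utilization = hours / 24

print(f"{hours:.1f} hours per day, {utilization:.0%} of the link")
```

Even in that best case, the replication eats a bit over 11 hours a day, or close to half of the link’s total daily capacity.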

Posted in Uncategorized

Making HDD caddies

So the storage system that I purchased didn’t have any storage caddies.  I’ve taken to some 26 gauge sheet metal with some side cutters.  The results aren’t terrible for 20 bucks in sheet metal.  Pics to follow.

Posted in Uncategorized

The scenery, it be changing

So don’t mind the current theme of the site.  It’s new, and it’s horrid.  I’m working on getting a better theme in place, but for now this will have to do.

Thanks to Theron for forcing my hand to get the ball rolling.

Posted in Uncategorized

Budget Nexenta Build

So I’m working on putting together a Nexenta storage system for Theron over at conrey.org for his ESXi test box.  In the great world of eBay, I was able to acquire a Xyratex HS-1235 for a pittance.  Now, the system is definitely a slouch, but it’s got a 12 space hot-swap backplane, dual PSUs, dual 2.8GHz Xeon procs, 4GB RAM, and dual onboard gigabit NICs.  For 50 bucks plus S&H it’s going to be my testbed.  Never mind the fact that it’s only got two drive caddies and a plethora of PCI-X slots; all of that can be worked around.

I’ve thrown in a HighPoint RocketRaid 4 port SATA card, and am hoping for a few more to show up on my doorstep shortly to fill out the necessary SATA ports.  Once that’s complete, it’s off to the hardware store for some sheet metal and plexiglass to fashion some impromptu drive caddies.

Come to think of it, maybe I _should_ just order a new chassis…

Now – if anyone happens to have any 250GB-500GB Enterprise SATA drives that they want to get rid of cheap, I’m game.  Seagate Barracuda ES and WD RE3/4 drives are what I’m in the mood for.

Posted in Nexenta

Welcome

Welcome to everyone visiting this site.  My name is Matt, and I’m a storage guy.  This whole storage arena is pretty confusing, and it’s changing a lot today.  I started this blog because I’ve had a lot of questions along the way, as I’m sure many others have.  I hope to impart some knowledge along the way, and maybe glean a few things from my visitors too.

You might be familiar with some of my work over at zfsbuild.com – I wrote many of the articles that appeared on that site.  I’m off doing my own thing now though, and wanted to start fresh.  You can expect to hear about ZFS, Nexenta, EMC, Promise, StorageIO Control, Veeam, and pretty much anything that has to do with shared storage.

Well, that’s the intro, let’s get this party started!

Posted in Uncategorized