Tuesday, December 23, 2014

Difference Between ‘Ceph Osd Reweight’ and ‘Ceph Osd Crush Reweight’



// Ceph

From Gregory and Craig on the ceph-users mailing list:

"ceph osd crush reweight" sets the CRUSH weight of the OSD. This
weight is an arbitrary value (generally the size of the disk in TB or
something) and controls how much data the system tries to allocate to
the OSD.

"ceph osd reweight" sets an override weight on the OSD. This value is
in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
data that would otherwise live on this drive. It does *not* change the
weights assigned to the buckets above the OSD, and is a corrective
measure in case the normal CRUSH distribution isn't working out quite
right. (For instance, if one of your OSDs is at 90% and the others are
at 50%, you could reduce this weight to try and compensate for it.)
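
The (1-weight) arithmetic can be sketched with a small shell helper. This is only an illustration: `replaced_fraction` is a hypothetical name, the PG count is an example input, and CRUSH re-places data pseudo-randomly, so real clusters will only approximate these numbers.

```shell
#!/bin/sh
# Hypothetical helper (not a ceph command): given an override weight in [0,1]
# and the number of PGs currently mapped to an OSD, print the fraction of data
# and the approximate number of PGs that would be re-placed elsewhere.
replaced_fraction() {
    awk -v w="$1" -v pgs="$2" 'BEGIN {
        if (w < 0 || w > 1) { print "weight must be in [0,1]" > "/dev/stderr"; exit 1 }
        # fraction re-placed is (1 - weight); round the PG estimate to nearest
        printf "fraction=%.2f pgs=%d\n", 1 - w, int((1 - w) * pgs + 0.5)
    }'
}

# An override of 0.8 on an OSD holding 50 PGs re-places roughly a fifth of them.
replaced_fraction 0.8 50   # prints: fraction=0.20 pgs=10
```

An override of 1 leaves the OSD untouched, and an override of 0 moves everything away, which is what marking the OSD out does.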

Note that 'ceph osd reweight' is not a persistent setting. When an OSD
gets marked out, the osd weight will be set to 0. When it gets marked in
again, the weight will be changed to 1.

Because of this 'ceph osd reweight' is a temporary solution. You should
only use it to keep your cluster running while you're ordering more
hardware.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040961.html
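
Because the override weight is easy to lose track of, it can help to list the OSDs that currently carry one. The sketch below assumes the Giant-era `ceph osd tree` layout used in this post, where the last column of an OSD row is the reweight value; newer releases print additional columns, so the column position is an assumption. `overridden_osds` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "ceph osd tree" output on stdin and print every
# OSD whose override weight (assumed to be the last column) is not 1.
overridden_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $NF + 0 != 1) {
                print $i, "reweight=" $NF
                break
            }
    }'
}

# Sample rows in the format shown by this post's "ceph osd tree" output.
overridden_osds <<'EOF'
# id weight type name up/down reweight
-1 0.2 root default
-2 0.09998 host ceph-01
0 0.04999 osd.0 up 0
4 0.04999 osd.4 up 1
EOF
```

On a live cluster the same helper would be fed with `ceph osd tree | overridden_osds`.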

This question came up when one of my OSDs was marked down (on my old Cuttlefish cluster) and I noticed that only the other drives on the same machine seemed to fill up. That seems normal, since the weight of the host had not changed in the CRUSH map.

Testing

Testing on a simple cluster (Giant), with this rule in the CRUSH map:

    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit

Take the example of the 8 PGs in pool 3:

    $ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
    dumped all in format plain
    3.4 [0,2]
    3.5 [4,1]
    3.6 [2,0]
    3.7 [2,1]
    3.0 [2,1]
    3.1 [0,2]
    3.2 [2,1]
    3.3 [2,4]

Now I try ceph osd out:

    $ ceph osd out 0 # This is equivalent to "ceph osd reweight 0 0"
    marked out osd.0.

    $ ceph osd tree
    # id    weight   type name        up/down  reweight
    -1      0.2      root default
    -2      0.09998      host ceph-01
    0       0.04999          osd.0    up       0    # <-- reweight has set to "0"
    4       0.04999          osd.4    up       1
    -3      0.04999      host ceph-02
    1       0.04999          osd.1    up       1
    -4      0.04999      host ceph-03
    2       0.04999          osd.2    up       1

    $ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
    dumped all in format plain
    3.4 [2,4] # <-- [0,2] (move pg on osd.4)
    3.5 [4,1]
    3.6 [2,1] # <-- [2,0] (move pg on osd.1)
    3.7 [2,1]
    3.0 [2,1]
    3.1 [2,1] # <-- [0,2] (move pg on osd.1)
    3.2 [2,1]
    3.3 [2,4]
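
Rather than annotating the moved PGs by hand, the two listings can be compared mechanically. The following is a minimal sketch: `moved_pgs` is a hypothetical helper, and it assumes each file holds lines of the form "pgid [osds]", as produced by the `awk '{print $1,$15;}'` filter used in this post (the column number depends on the Ceph release).

```shell
#!/bin/sh
# Hypothetical helper: print every PG whose OSD set differs between two
# listings. $1 = listing saved before the change, $2 = listing saved after.
moved_pgs() {
    awk 'NR == FNR { before[$1] = $2; next }   # first file: remember mapping
         ($1 in before) && before[$1] != $2 {  # second file: report changes
             print $1, before[$1], "->", $2
         }' "$1" "$2"
}

# Demo with three of the PGs from the "ceph osd out 0" test.
before=$(mktemp) && after=$(mktemp)
printf '%s\n' '3.4 [0,2]' '3.5 [4,1]' '3.6 [2,0]' > "$before"
printf '%s\n' '3.4 [2,4]' '3.5 [4,1]' '3.6 [2,1]' > "$after"
moved_pgs "$before" "$after"
rm -f "$before" "$after"
```

For the sample data this reports 3.4 and 3.6 as moved and leaves 3.5 out, matching the annotations in the listing.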

Now I try the same thing at the CRUSH level, with ceph osd crush reweight:

    $ ceph osd crush reweight osd.0 0
    reweighted item id 0 name 'osd.0' to 0 in crush map

    $ ceph osd tree
    # id    weight   type name        up/down  reweight
    -1      0.15     root default
    -2      0.04999      host ceph-01               # <-- the weight of the host changed
    0       0                osd.0    up       1    # <-- crush weight is set to "0"
    4       0.04999          osd.4    up       1
    -3      0.04999      host ceph-02
    1       0.04999          osd.1    up       1
    -4      0.04999      host ceph-03
    2       0.04999          osd.2    up       1

    $ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
    dumped all in format plain
    3.4 [4,2] # <-- [0,2] (move pg on osd.4)
    3.5 [4,1]
    3.6 [2,4] # <-- [2,0] (move pg on osd.4)
    3.7 [2,1]
    3.0 [2,1]
    3.1 [4,2] # <-- [0,2] (move pg on osd.4)
    3.2 [2,1]
    3.3 [2,1]

The result of ceph osd out does not seem very logical: the weight assigned to the bucket "host ceph-01" is still higher than the others, yet only one of the three PGs moved to osd.4. This would probably be different with more PGs.

Trying with more pgs

    # Add more pg on my testpool
    $ ceph osd pool set testpool pg_num 128
    set pool 3 pg_num to 128

    # Check repartition
    $ for i in 0 1 2 4; do echo "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=48 pgs
    osd.1=78 pgs
    osd.2=77 pgs
    osd.4=53 pgs

    $ ceph osd reweight 0 0
    reweighted osd.0 to 0 (802)
    $ for i in 0 1 2 4; do echo "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=96 pgs
    osd.2=97 pgs
    osd.4=63 pgs
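
The counting loop above relies on `grep $i`, which also matches any OSD id that merely contains the digit (on a larger cluster, counting osd.1 would also count osd.10 or osd.11). A possible alternative is to split each OSD set and count exact ids. This is a sketch only: `pgs_per_osd` is a hypothetical helper, and it assumes input lines end with a bracketed OSD list such as "[0,2]".

```shell
#!/bin/sh
# Hypothetical helper: count PG replicas per OSD from lines whose last field
# is a bracketed OSD list, e.g. "3.4 [0,2]" or just "[0,2]". Other lines
# (such as "dumped all in format plain") are ignored.
pgs_per_osd() {
    awk '$NF ~ /^\[[0-9,]*\]$/ {
        set = $NF
        gsub(/\[|\]/, "", set)              # strip the surrounding brackets
        n = split(set, osds, ",")           # one entry per OSD id
        for (i = 1; i <= n; i++) count[osds[i]]++
    }
    END { for (id in count) print "osd." id "=" count[id] " pgs" }' |
        sort -t. -k2 -n                     # order by numeric OSD id
}

# Demo with the 8 PGs of pool 3 listed earlier in this post.
printf '%s\n' '3.4 [0,2]' '3.5 [4,1]' '3.6 [2,0]' '3.7 [2,1]' \
              '3.0 [2,1]' '3.1 [0,2]' '3.2 [2,1]' '3.3 [2,4]' | pgs_per_osd
```

On a live cluster the input would come from the same pipeline as in the post, for example `ceph pg dump 2>/dev/null | grep '^3\.' | awk '{print $15}' | pgs_per_osd`.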

The distribution seems fair. Why, in the same situation, was the distribution different with Cuttlefish?

    $ ceph osd reweight 0 1
    reweighted osd.0 to 0 (802)
    $ for i in 0 1 2 4; do echo "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=96 pgs
    osd.2=97 pgs
    osd.4=63 pgs

    $ ceph osd crush reweight osd.0 0
    reweighted osd.0 to 0 (802)

    $ for i in 0 1 2 4; do echo "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=87 pgs
    osd.2=88 pgs
    osd.4=81 pgs

With crush reweight, everything is normal: the PGs are spread almost evenly over the three remaining OSDs.
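
To compare the two runs more easily, the PG counts can be turned into percentage shares. The helper below is a sketch under the assumption that the input uses the "osd.N=K pgs" format printed by the loop in this post; `pg_shares` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: convert "osd.N=K pgs" lines on stdin into each OSD's
# percentage share of all PG replicas.
pg_shares() {
    awk -F'[= ]' '{ name[NR] = $1; pgs[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", name[i], (total ? 100 * pgs[i] / total : 0)
        }'
}

# Counts observed after "ceph osd crush reweight osd.0 0" in this post.
printf '%s\n' 'osd.0=0 pgs' 'osd.1=87 pgs' 'osd.2=88 pgs' 'osd.4=81 pgs' | pg_shares
```

For the crush reweight run this gives 34.0%, 34.4% and 31.6% for osd.1, osd.2 and osd.4. The counts from the plain reweight run (96, 97 and 63) work out to 37.5%, 37.9% and 24.6%.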

Trying with CRUSH legacy tunables

    $ ceph osd crush tunables legacy
    adjusted tunables profile to legacy
    root@ceph-01:~/ceph-deploy# for i in 0 1 2 4; do echo "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=87 pgs
    osd.2=88 pgs
    osd.4=81 pgs

    $ ceph osd crush reweight osd.0 0.04999
    reweighted item id 0 name 'osd.0' to 0.04999 in crush map

    $ ceph osd tree
    # id    weight   type name        up/down  reweight
    -1      0.2      root default
    -2      0.09998      host ceph-01
    0       0.04999          osd.0    up       0
    4       0.04999          osd.4    up       1
    -3      0.04999      host ceph-02
    1       0.04999          osd.1    up       1
    -4      0.04999      host ceph-03
    2       0.04999          osd.2    up       1

    $ for i in 0 1 2 4; do echo "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=78 pgs
    osd.2=77 pgs
    osd.4=101 pgs # <--- All pg from osd.0 and osd.4 is here when using legacy value (on host ceph-01)

So the difference comes from an evolution of the distribution algorithm (the CRUSH tunables), which now prefers a more global distribution when an OSD is marked out, instead of redistributing preferentially to nearby OSDs. Indeed, the old behavior can cause problems when there are only a few OSDs per host and they are nearly full.

With the legacy tunables, when some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.

