Re: [tor-bugs] #33785 [Internal Services/Tor Sysadmin Team]: cannot create new machines in ganeti cluster
#33785: cannot create new machines in ganeti cluster
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  defect                               |         Status:  new
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------
Comment (by anarcat):
note that allocating the instance to specific nodes (forcing placement with `-n` instead of going through the allocator) works properly:
{{{
root@fsn-node-01:~# gnt-instance add -o debootstrap+buster -t drbd \
    --no-wait-for-sync --disk 0:size=10G --disk 1:size=2G,name=swap \
    --backend-parameters memory=2g,vcpus=2 --net 0:ip=pool,network=gnt-fsn \
    --no-name-check --no-ip-check \
    -n fsn-node-05.torproject.org:fsn-node-04.torproject.org \
    test-01.torproject.org
Wed Apr 1 20:11:54 2020 - INFO: NIC/0 inherits netparams ['br0', 'openvswitch', '4000']
Wed Apr 1 20:11:54 2020 - INFO: Chose IP 116.202.120.188 from network gnt-fsn
Wed Apr 1 20:11:55 2020 * creating instance disks...
Wed Apr 1 20:12:07 2020 adding instance test-01.torproject.org to cluster config
Wed Apr 1 20:12:07 2020 adding disks to cluster config
Wed Apr 1 20:12:07 2020 * checking mirrors status
Wed Apr 1 20:12:07 2020 - INFO: - device disk/0: 2.20% done, 3m 47s remaining (estimated)
Wed Apr 1 20:12:07 2020 - INFO: - device disk/1: 1.00% done, 2m 6s remaining (estimated)
Wed Apr 1 20:12:07 2020 * checking mirrors status
Wed Apr 1 20:12:08 2020 - INFO: - device disk/0: 2.40% done, 4m 16s remaining (estimated)
Wed Apr 1 20:12:08 2020 - INFO: - device disk/1: 1.80% done, 1m 8s remaining (estimated)
Wed Apr 1 20:12:08 2020 * pausing disk sync to install instance OS
Wed Apr 1 20:12:08 2020 * running the instance OS create scripts...
}}}
creating a standalone (`plain`, non-DRBD) instance on the new network also works fine:
{{{
root@fsn-node-01:~# gnt-instance add -o debootstrap+buster -t plain \
    --no-wait-for-sync --disk 0:size=10G --disk 1:size=2G,name=swap \
    --backend-parameters memory=2g,vcpus=2 --net 0:ip=pool,network=gnt-fsn13-02 \
    --no-name-check --no-ip-check -n fsn-node-05.torproject.org \
    test-02.torproject.org
Wed Apr 1 20:17:03 2020 - INFO: NIC/0 inherits netparams ['br0', 'openvswitch', '4000']
Wed Apr 1 20:17:03 2020 - INFO: Chose IP 49.12.57.130 from network gnt-fsn13-02
Wed Apr 1 20:17:04 2020 * disk 0, size 10.0G
Wed Apr 1 20:17:04 2020 * disk 1, size 2.0G
Wed Apr 1 20:17:04 2020 * creating instance disks...
Wed Apr 1 20:17:05 2020 adding instance test-02.torproject.org to cluster config
Wed Apr 1 20:17:05 2020 adding disks to cluster config
Wed Apr 1 20:17:05 2020 * running the instance OS create scripts...
Wed Apr 1 20:17:18 2020 * starting instance...
}}}
so this is strictly a problem with the allocator.
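as an aside, the throwaway test instances from the two experiments above are easy to clean up:
{{{
# both commands ask for confirmation before destroying the instance
gnt-instance remove test-01.torproject.org
gnt-instance remove test-02.torproject.org
}}}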
It also seems that there are ways of debugging the allocator, as explained
here:
https://github.com/ganeti/ganeti/wiki/Common-Issues#htools-debugging-hailhbal
most notably, it suggests using the `hspace -L` command which, in our
case, gives us worrisome warnings:
{{{
root@fsn-node-01:~# hspace -L
Warning: cluster has inconsistent data:
  - node fsn-node-05.torproject.org is missing -3049 MB ram and 470 GB disk
  - node fsn-node-04.torproject.org is missing -5797 MB ram and 2 GB disk
  - node fsn-node-03.torproject.org is missing -14155 MB ram and 162 GB disk
The cluster has 5 nodes and the following resources:
  MEM 321400, DSK 4574256, CPU 60, VCPU 240.
There are 27 initial instances on the cluster.
Tiered (initial size) instance spec is:
  MEM 32768, DSK 1048576, CPU 8, using disk template 'drbd'.
Tiered allocation results:
  - 1 instances of spec MEM 19200, DSK 460800, CPU 8
  - 1 instances of spec MEM 19200, DSK 154880, CPU 8
  - most likely failure reason: FailDisk
  - initial cluster score: 7.92595903
  - final cluster score: 7.26099873
  - memory usage efficiency: 50.50%
  - disk usage efficiency: 85.56%
  - vcpu usage efficiency: 57.08%
Standard (fixed-size) instance spec is:
  MEM 128, DSK 1024, CPU 1, using disk template 'drbd'.
Normal (fixed-size) allocation results:
  - 44 instances allocated
  - most likely failure reason: FailDisk
  - initial cluster score: 7.92595903
  - final cluster score: 20.56542169
  - memory usage efficiency: 40.30%
  - disk usage efficiency: 60.61%
  - vcpu usage efficiency: 68.75%
}}}
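a quick way to sanity-check those warnings is to compare hspace's bookkeeping with what the nodes themselves report; something like this (a sketch, field names from the `gnt-node list` man page) shows the memory and disk figures the allocator should be working from:
{{{
# per-node totals and free memory/disk (MB), plus primary/secondary instance counts
gnt-node list -o name,mtotal,mnode,mfree,dtotal,dfree,pinst_cnt,sinst_cnt
}}}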
i also tried creating a tracing allocator, a small wrapper that saves a copy of the allocator's input before calling the real `hail`, installed as `/usr/lib/ganeti/iallocators/hail-trace`:
{{{
#!/bin/sh
# keep a copy of the iallocator request for later inspection
cp "$1" /tmp/allocator-input.json
# then hand the request off to the real allocator
exec /usr/lib/ganeti/iallocators/hail "$1"
}}}
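(the wrapper needs to be executable, i.e. `chmod +x /usr/lib/ganeti/iallocators/hail-trace`, or ganeti won't run it)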
then it can be used with the `-I hail-trace` parameter:
{{{
gnt-instance add -o debootstrap+buster -t drbd --no-wait-for-sync \
    --disk 0:size=10G --disk 1:size=2G,name=swap \
    --backend-parameters memory=2g,vcpus=2 --net 0:ip=pool,network=gnt-fsn \
    --no-name-check --no-ip-check -I hail-trace test-01.torproject.org
}}}
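to trace every allocation without passing `-I` each time, the wrapper could presumably also be set as the cluster-wide default allocator:
{{{
gnt-cluster modify --default-iallocator=hail-trace
}}}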
the saved input then allows us to run the allocator by hand:
{{{
root@fsn-node-01:~# /usr/lib/ganeti/iallocators/hail --verbose /tmp/allocator-input.json
Warning: cluster has inconsistent data:
  - node fsn-node-05.torproject.org is missing -3046 MB ram and 470 GB disk
  - node fsn-node-04.torproject.org is missing -5801 MB ram and 2 GB disk
  - node fsn-node-03.torproject.org is missing -14158 MB ram and 162 GB disk
Received request: Allocate (Instance {name = "test-01.torproject.org",
  alias = "test-01.torproject.org", mem = 2048, dsk = 12544,
  disks = [Disk {dskSize = 10240, dskSpindles = Nothing},
           Disk {dskSize = 2048, dskSpindles = Nothing}],
  vcpus = 2, runSt = Running, pNode = 0, sNode = 0, idx = -1,
  util = DynUtil {cpuWeight = 1.0, memWeight = 1.0, dskWeight = 1.0, netWeight = 1.0},
  movable = True, autoBalance = True, diskTemplate = DTDrbd8, spindleUse = 1,
  allTags = [], exclTags = [], dsrdLocTags = fromList [], locationScore = 0,
  arPolicy = ArNotEnabled,
  nics = [Nic {mac = Just "00:66:37:8b:0a:ba", ip = Just "pool", mode = Nothing,
               link = Nothing, bridge = Nothing,
               network = Just "f96e8644-a473-43db-874b-99f90e20af7b"}],
  forthcoming = False}) (AllocDetails 2 Nothing) Nothing
{"success":false,"info":"Request failed: Group default (preferred): No valid allocation solutions, failure reasons: FailMem: 8, FailN1: 12","result":[]}
}}}
which, interestingly, gives us the same warning.
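with the request captured, a `jq` one-liner along these lines (untested, field names per the iallocator protocol) can summarize what hail believes each node has free:
{{{
# list each node's free memory and disk (MB) as seen in the captured request
jq -r '.nodes | to_entries[]
       | "\(.key): \(.value.free_memory) MB mem free, \(.value.free_disk) MB disk free"' \
   /tmp/allocator-input.json
}}}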
still not sure where that warning is coming from, but i can't help but
wonder if the problem would go away after re-balancing the cluster.
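if re-balancing turns out to be the fix, hbal can simulate it before touching anything, roughly like this:
{{{
# dry run: show the moves hbal would make and the resulting cluster score
hbal -L
# same, but print the equivalent gnt-instance commands
hbal -L -C
# once the solution looks sane, submit the moves as jobs
hbal -L -X
}}}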
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33785#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online