Ansible Cleanup: Move systems to linux-system-roles networking #9695
Describe what you would like us to do:
Many older systems are configured using various templates, or were hand-configured when they were brought up. The 'new' ansible linux-system-roles for networking let us get rid of these templates and use a standard per-OS method to get things configured. The task is to collect the mac addresses for each system and then open a PR for the system host_vars data.
Old method:
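Roughly, per-host variables consumed by the existing network templates; something like this (variable names and values are illustrative, not from a real host):

```yaml
# Old-style host_vars consumed by the templates (placeholder values):
eth0_ip: 10.3.163.42
gw: 10.3.163.254
nm: 255.255.255.0
dns: 10.3.163.33
```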
New method:
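Roughly, a network_connections definition in the host's host_vars, handled by the linux-system-roles networking role; something along these lines (the mac, addresses, and search domain are placeholders):

```yaml
network_connections:
  - name: eth0
    type: ethernet
    mac: "aa:bb:cc:dd:ee:ff"   # placeholder; identifies the NIC whatever the kernel calls it
    state: up
    autoconnect: yes
    ip:
      dhcp4: no
      auto6: no
      address:
        - 10.3.163.42/24       # placeholder address
      gateway4: 10.3.163.254   # placeholder gateway
      dns:
        - 10.3.163.33
        - 10.3.163.34
      dns_search:
        - iad2.fedoraproject.org   # illustrative
```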
If some of this data isn't known, it's fine to ask for more help.
When do you need this to be done by? (YYYY/MM/DD)
Metadata Update from @mohanboddu:
I can take a look at this, but need some info to get started:
The ansible repository is at https://pagure.io/fedora-infra/ansible and I will create a list of systems and information later today.
Looks like uploading tarballs does not work. I am putting the tarball at https://smooge.fedorapeople.org/fedora-infra/ansible-ip-info.tgz
If @copperi doesn't have the time for this I can take it and start working on it.
@bodanel I have started, but sure you can work on this as well, there are about 500 machines and I have access to 50.
So far I have done:
pr_submitted - proxy05.fedoraproject.org (moved to linux-system-roles networking)
no response - proxy06.fedoraproject.org
no response - proxy09.fedoraproject.org
pr_submitted - proxy10.iad2.fedoraproject.org (moved to linux-system-roles networking)
pr_submitted - proxy101.iad2.fedoraproject.org (moved to linux-system-roles networking)
pr_submitted - proxy11.fedoraproject.org (moved to linux-system-roles networking)
pr_submitted - proxy110.iad2.fedoraproject.org (moved to linux-system-roles networking)
pr_submitted - proxy12.fedoraproject.org (moved to linux-system-roles networking)
no response - proxy13.fedoraproject.org
pr_submitted - proxy14.fedoraproject.org (moved to linux-system-roles networking)
n/a - proxy30.fedoraproject.org
n/a - proxy31.fedoraproject.org
n/a - proxy32.fedoraproject.org
n/a - proxy33.fedoraproject.org
n/a - proxy34.fedoraproject.org
n/a - proxy35.fedoraproject.org
n/a - proxy36.fedoraproject.org
n/a - proxy37.fedoraproject.org
n/a - proxy38.fedoraproject.org
n/a - proxy39.fedoraproject.org
n/a - proxy40.fedoraproject.org
Yes, please do this in multiple small PRs. We want to be able to merge and test them in blocks, which small PRs will work better for.
@smooge
When you can please have a look at https://pagure.io/fedora-infra/ansible/pull-request/481 and let me know if it is ok. If yes, I will start modifying blocks of servers.
buildhw-a64-0[1-6].iad2.fedoraproject.org
buildhw-a64-11.iad2.fedoraproject.org
buildhw-a64-19.iad2.fedoraproject.org
buildhw-a64-20.iad2.fedoraproject.org
buildhw-x86-02.iad2.fedoraproject.org
buildhw-x86-03.iad2.fedoraproject.org
buildhw-x86-04.iad2.fedoraproject.org
buildhw-x86-0[6-16].iad2.fedoraproject.org
Done. I'll keep updating the tickets as I push my modifications.
The following servers are also done:
buildvm-a32-0[1-27].iad2.fedoraproject.org
buildvm-a32-0[1-2].stg.iad2.fedoraproject.org
buildvm-a32-3[1-3].iad2.fedoraproject.org
@smooge
Can you please assign the ticket to me, at least so I can find it more easily?
The servers below are committed:
buildvm-a64-01.iad2.fedoraproject.org
buildvm-a64-01.stg.iad2.fedoraproject.org
buildvm-a64-02.iad2.fedoraproject.org
buildvm-a64-02.stg.iad2.fedoraproject.org
The following servers are done
buildvm-a64-[03-23].iad2.fedoraproject.org
Assigning to @bodanel :) Thanks for working on it.
Metadata Update from @kevin:
buildvm-ppc64le-[11-40].iad2.fedoraproject.org done.
Metadata Update from @bodanel:
Metadata Update from @smooge:
Hi, just a quick suggestion: can we open a tracker for this somewhere? It would make it easy to see what's done and what's left, and it would save time. It could be a rendered markdown file, something like this:
(Don't know why it is not rendered properly, just an idea)
so, one thing i have discovered/realized: we do not want to do this for vm's... only bare metal machines.
I think that should cut things down a lot. Basically only hosts that do not have vmhost: set on them.
Do you also want to clean up existing VMs and leave them on DHCP?
No, not on DHCP; basically VMs are set up with static IPs from our virt-install command and a few variables, so they always have the network setup we expect.
If we try to use linux system roles/networking on them, we have to install them, have the playbook fail, update the new mac address, and re-run the playbook.
This is not ideal; instead we should just assume they have the correct static networking set from when they were installed.
Does that make sense?
I'm a new contributor to the Fedora Infrastructure team and I'd like to help out with this issue.
If a task list would still be useful, I generated one using the filenames that were uploaded in the tarball. I can open a PR if someone directs me to a path to place this list.
I can then mark the completed systems as done and remove the virtual machines from the list. I can also begin editing the host_vars files as well with some direction.
Can you just attach the list here?
Thanks for working on this!
Full list found here: https://pagure.io/9695_ansible_cleanup/blob/master/f/systems.md
I'll work on removing the vmhost items from the task list today.
it's not on the list, but just in case - please do not do this on openQA worker host boxes (openqa-*-worker*). They have very specific network config requirements and you need to know how openQA works in order to change the networking config and make sure the changes work OK.
vmhost are fine to stay. They are bare metal. :)
it's 'buildvm*' that are all vm's...
Okay all 'buildvm*' hosts removed from the tasklist, reducing the number of hosts from 482 to 310.
Also all items mentioned as completed in the comments have been marked as completed in the tasklist.
I'm going to start working on a block of bvmhost-a64 machines. Can someone check over this first one and see if I'm configuring the bridge correctly? I also wasn't sure what to do with br0_dev.
So, we also need a mac address in a '- name: br0-port0' section... I am not sure how best to get the list of mac addresses to you. Could put them in a file on batcave01?
Basically the idea here is that when we specify mac address, linux system roles/networking will know exactly what interface we mean, without us having to call it eth0 or enasaskjdfjdkshgdysrjsdhfjs1 or whatever, and it will know to add that to a bridge, etc.
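To make that concrete, the host_vars for one of these hosts would gain something along these lines (a sketch: the mac and addresses are placeholders, and the exact port options should be checked against the role documentation):

```yaml
network_connections:
  - name: br0
    type: bridge
    interface_name: br0
    ip:
      dhcp4: no
      auto6: no
      address:
        - 10.3.171.10/24        # placeholder
      gateway4: 10.3.171.254    # placeholder
      dns:
        - 10.3.163.33
        - 10.3.163.34
      dns_search:
        - iad2.fedoraproject.org
  - name: br0-port0
    type: ethernet
    mac: "aa:bb:cc:dd:ee:ff"    # placeholder; picks the right NIC whatever the kernel calls it
    controller: br0             # attach this NIC to the br0 connection above (older role versions call this 'master')
    port_type: bridge
```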
Right that makes sense about the br0-port0 section. Sure, just let me know where the file is located on batcave01.
Got it! Also should the dns and dns_search sections just match the original "New method:" example?
ok, I just copied the ansible facts cache for bvmhost* to /var/tmp/bvmhost-facts-cache/ so you should be able to get macaddress: info from there.
Yes, dns* should also be set. :)
Alright I updated the host_vars file for the first bvmhost-a64 server seen here at https://pagure.io/fedora-infra/ansible/pull-request/663. After a successful build I will have time to work on additional bvmhost-a64 servers this week.
I just made a pull request for the majority of the bvmhost servers https://pagure.io/fedora-infra/ansible/pull-request/663.
Checklist updated.
Three servers were missing from the files dumped in the batcave01 /var/tmp/bvmhost-facts-cache directory:
Ready for additional mac addresses. Perhaps buildvm* servers?
I'm ready for another mac address dump.
copr* are in aws, can be excluded.
These 3 are all down. The first has a bad disk issue, the other 2 are being decommissioned.
vmhost* is in /var/tmp/vmhosts/ on batcave01. :)
Pull request made for vmhost* servers https://pagure.io/fedora-infra/ansible/pull-request/740.
Tasklist has been updated. I also went ahead and removed servers on AWS. Not that many remaining!
I put the remaining server names in a text file if that helps. I'm ready for the last mac address dump whenever.
So that text file now includes these:
openqa-a64-worker01.iad2.fedoraproject.org
openqa-a64-worker02.iad2.fedoraproject.org
openqa-a64-worker03.iad2.fedoraproject.org
openqa-p09-worker01.iad2.fedoraproject.org
openqa-x86-worker01.iad2.fedoraproject.org
openqa-x86-worker02.iad2.fedoraproject.org
openqa-x86-worker04.iad2.fedoraproject.org
as I mentioned above, please DO NOT just convert these over. They, especially the ones in the openqa_tap_workers group, have very specific networking requirements that are likely not covered by system roles yet. I'm happy to talk with you regarding those requirements any time, just ping me in IRC/Matrix or something.
Metadata Update from @bodanel:
The following can be dropped / removed:
sign-vault01 (it doesn't get managed by ansible)
kernel01 (it's not managed by us)
bvmhost-a64-10 (it's down, we will have to add it later when it's fixed)
These should possibly not be in ansible/inventory anymore?
virthost-cloud01 (this doesn't exist anymore)
buildhw-a64-07.iad2.fedoraproject.org (dead hw)
buildhw-a64-09.iad2.fedoraproject.org (dead hw)
buildhw-a64-10.iad2.fedoraproject.org (dead hw)
host1plus (no longer exists)
osuosl03 (no longer exists)
proxy07.fedoraproject.org (no longer exists)
virthost-rdu02 (no longer exists)
The rest of their ansible facts should be in /tmp9695 on batcave. ;)
so from a quick look at the system roles stuff, I don't think it supports most of the advanced network stuff we need on the openQA worker hosts. So I think the strategy will be just to switch the configuration of the main physical interface(s) and possibly the bridge interface to use system roles, and continue setting the rest up the way we currently do in the plays. A couple of initial questions to figure out:
I'm gonna poke around a bit more tomorrow and maybe try to sketch out the changes for one host to see how it'd look.
I quickly looked at the system roles documentation too and reached the same conclusion. I think it should be pretty straightforward to move the physical interface(s) over to use system-roles, assuming the two questions you brought up aren't an issue.
I'll be at work tomorrow, but I might have a few minutes free if you want to discuss changing over the first host.
Sorry, wasn't able to get back to this today, had some other things to work on. Will come back to it next week.
Pull request made for buildvm-s390x* servers https://pagure.io/fedora-infra/ansible/pull-request/757.
FYI, this is going to need to wait until after freeze at this point.
yeah, sorry, I didn't get around to it for the openqa workers in the end. will try to get to it after freeze.
Merged that last PR.
What's left here? openQA and that's it?
I did the openQA workers yesterday:
I think they're okay, nothing obviously blew up anyway. It's only possible to use system-roles to configure the regular physical interfaces; setup of the bridge and tap devices on the tap worker hosts is still in the openqa/worker role, I added a comment explaining this.
Thanks for the work on the openQA workers Adam!
Looks like there are a few stragglers left:
The ansible facts cache for those is in /tmp/9695 on batcave01.
Is there any way I can help or is it mostly done? Was just looking at open issues.
Mostly done. Last bit waiting for final freeze to be lifted.
Looks like I'll need the macs dumped for these again as they're gone from /tmp on batcave01:
Looks like there are a few stragglers left:
buildvmhost-s390x-01.s390.fedoraproject.org
ibiblio01.fedoraproject.org
ibiblio05.fedoraproject.org
internetx01.fedoraproject.org
osuosl01.fedoraproject.org
osuosl02.fedoraproject.org
qvmhost-x86-02.iad2.fedoraproject.org
retrace03.rdu-cc.fedoraproject.org
storinator01.rdu-cc.fedoraproject.org
virthost-cc-rdu01.fedoraproject.org
virthost-cc-rdu02.fedoraproject.org
virthost-cc-rdu03.fedoraproject.org
In /tmp/ticket-9695/ on batcave01.
Metadata Update from @kevin:
Pull request made for last batch of servers https://pagure.io/fedora-infra/ansible/pull-request/872.
I messed up your PR by sorting all the hosts and group vars files. ;(
Can you rework the PR for that?
Ya no worries, I'll sort that out soon.
I believe I was able to successfully merge the changes from sorting the yaml host vars. See new pull request https://pagure.io/fedora-infra/ansible/pull-request/898.
The PR seems to revert all the vars changes or something? In any case it's got like 300+ files changed. ;(
Should just be a few...
Alright that's no good. See new pull request here (crosses fingers) https://pagure.io/fedora-infra/ansible/pull-request/899.
ok. Got those to mostly work. A few issues I hit:
So, looking at all hosts, dropping those that are cloud hosts, I see the following that still need fixing:
And we should sort out the bond devices on ibiblio01/05, but that's going to be a bit complex. ;(
Yeah I had some issues figuring out the last round of hosts because they were a bit different.
If you dump those remaining hosts on over to batcave01 I can take a look.
What do we need to do to sort out ibiblio01/05?
Finally, what did you use to sort the YAML? I hacked together something quick with Python, but just curious.
ok, in /tmp/9695 on batcave01 is the facts cache for those hosts.
remember to drop the auto6: false. :)
For sorting, I used 'yq'. https://github.com/mikefarah/yq/releases
yq eval 'sortKeys(..)' filename
The first host, aarch64-test01.fedorainfracloud.org, is on AWS.
The remaining hosts all have the vmhost key set. This means they are VMs correct? And we want to update only bare metal machines?
So, when we first started out I wanted to do everything. Then, I decided that vm's were hard because we didn't want to have to try and figure out mac addresses on them, since they change every time the vm is installed. Then, I figured out that we can get around that by never hardcoding the mac for vm's.
So, now, I think I want to do every host except for aws/cloud hosts. The reason they are excluded is that they often have internal addresses and just map external in via the aws infra, so we can't set the network and if we did it would break things.
For hardware hosts, we want to specify the mac addresses of the hardware and which one(s) are used in the bridges.
For vm's, we want to specify mac address as:
mac0: "{{ ansible_default_ipv4.macaddress }}"
This works because we install the vm and pass it the ip and such, so when we connect to that ip via ansible and gather facts, we have the mac address.
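So a typical VM's host_vars only needs something like this (a sketch with placeholder addresses; exact variable names can vary per host):

```yaml
mac0: "{{ ansible_default_ipv4.macaddress }}"   # taken from gathered facts, never hardcoded
eth0_ipv4: 10.3.163.42                          # placeholder
eth0_ipv4_gw: 10.3.163.254                      # placeholder
```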
So, sorry for changing the scope here a bunch of times. ;(
Does that make sense?
No worries, let's do all the hosts then (except cloud of course)!
I think this makes sense, so for VMs I'll include the default interface name, eth0 etc., and for the mac address I'll just specify the "{{ ansible_default_ipv4.macaddress }}" variable? Which means I may or may not need dumps for the remaining hosts.
I think it would help if we could chat just for a bit to strategize the rest of this project.
I'm on vacation until jan 3rd and on-line somewhere irregularly. You're welcome to ping me or chime in if you see me active tho, happy to chat on it more.
I'm going to add the following to group_vars/all:
I'm noticing in group_vars/all, dns1 and dns2, but not dns_search1 and dns_search2 are defined. Should I add these variables to group_vars/all? Additionally eth0_nm is defined, should it be changed to eth0_ipv4_nm?
On the same note, many vms define eth0_ip, gw, and nm variables. Do these need to be changed to eth0_ipv4, eth0_ipv4_gw, and eth0_ipv4_nm to match the variables in group_vars/all?
yes to all that. ;)
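For reference, the group_vars/all addition I have in mind looks roughly like this (a sketch only; the search domains and defaults are guesses, and the real values live in the repo):

```yaml
# Illustrative defaults:
dns_search1: iad2.fedoraproject.org
dns_search2: fedoraproject.org

# Default network connection built from per-host variables:
network_connections:
  - name: eth0
    type: ethernet
    mac: "{{ mac0 }}"
    state: up
    ip:
      dhcp4: no
      address:
        - "{{ eth0_ipv4 }}/{{ eth0_ipv4_nm }}"   # assumes eth0_ipv4_nm is a prefix length
      gateway4: "{{ eth0_ipv4_gw }}"
      dns:
        - "{{ dns1 }}"
        - "{{ dns2 }}"
      dns_search:
        - "{{ dns_search1 }}"
        - "{{ dns_search2 }}"
```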
Pull request for default network connection added to group_vars/all https://pagure.io/fedora-infra/ansible/pull-request/924.
I'm making progress with writing a script to change all the nm host_vars at once. I've noticed some vms define just dns, while others define both dns1 and dns2.
If a vm host_var file defines dns1 and dns2 as the default values (10.3.163.33 and 10.3.163.34) should these variables just be deleted as the vm already has the default dns variables defined?
Also some vms define dns (8.8.8.8). In this case dns, dns1, and dns2 will all be defined. Is there a way around this?
I believe the dns variables are the last hurdle.
yes.
Change dns to dns1 and add 8.8.4.4 as dns2?
Awesome.
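So the dns handling in the script ends up roughly like this (illustrative values):

```yaml
# Host that defined only a single dns:
# before
dns: 8.8.8.8
# after
dns1: 8.8.8.8
dns2: 8.8.4.4

# Hosts that defined dns1/dns2 as the defaults (10.3.163.33 / 10.3.163.34)
# just have those lines removed, since group_vars/all already provides them.
```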
Alright new pull request. https://pagure.io/fedora-infra/ansible/pull-request/931.
Should possibly be the last one!
A new default network connection is defined in group_vars/all. All non-cloud VMs in host_vars were then edited to conform to this new default. Exceptions (either an IPv6 interface or multiple interfaces) were edited as well.
So, a bunch of testing and tweaking and a few more PR's and... we are finally done!
notifs-backend01 has issues, but that's not surprising. :(
Everything else is done as far as I can tell.
Many thanks to @petebuffon who worked so hard on this... kudos!
Metadata Update from @kevin: