labsdb1005 (mysql) maintenance for reimage
Closed, Resolved (Public)

Description

Scheduled for 15 February.

Announcement:

Hello!

Tools DB and Labs Postgres DB will be undergoing maintenance on 15 Feb
2017 for about 6 hours starting at 1700 UTC and will be unreachable
for some of the duration. Most users shouldn't experience issues if
their code reconnects properly when the server stops accepting
connections (we'll fail over to slaves when doing maintenance). Some
tables will not be available for a short period of time, but the tool
owners of those tables have already been notified (see
https://phabricator.wikimedia.org/T127164 ). We'll try to minimize
downtime as much as possible.

We will be upgrading the operating system from Ubuntu Precise to
Debian Jessie in preparation for EOL of Ubuntu Precise (in April
2017). We'll also take this opportunity to upgrade Tools DB to MariaDB 10.

All data should be preserved in this migration. Follow
https://phabricator.wikimedia.org/T123731 for more information!

Thanks!

Planning:

  • Switchover labsdb1005 to labsdb1004
  • Backup labsdb1005 somewhere
  • Reimage labsdb1005
  • Recover data
  • Switchover labsdb1004 back to labsdb1005

Event Timeline

jcrespo added a project: User-notice.

Adding user notice. In theory, no end users should be affected, but tools that have not been programmed to reconnect properly could have issues during and after the upgrade.
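
For tool owners wondering what "reconnects properly" could look like in practice, here is a minimal sketch, not something prescribed in this task. The helper name run_with_reconnect, its parameters, and the use of ConnectionError as the retryable error are all assumptions for illustration; a real tool would pass its own driver's connect function and that driver's error class.

```python
import time


def run_with_reconnect(connect, operation, retries=5, base_delay=1.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Run operation(conn), reconnecting with exponential backoff.

    connect   -- hypothetical zero-argument factory returning a new connection
    operation -- callable taking the connection and returning a result
    retry_on  -- exception types treated as "server went away" (assumption:
                 substitute the error class raised by your database driver)
    """
    conn = None
    for attempt in range(retries + 1):
        try:
            if conn is None:
                conn = connect()  # (re)open the connection lazily
            return operation(conn)
        except retry_on:
            conn = None  # discard the broken connection before retrying
            if attempt == retries:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

A tool would wrap each query in a call such as run_with_reconnect(my_connect, my_query), so a brief switchover shows up as a short delay rather than a crash.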

jcrespo moved this task from Pending comment to In progress on the DBA board.

04:23 < yuvipanda> marostegui: jynus I can verify that I can access labsdb1004 from tools, so no need to massage VLANs or firewalls
04:24 < yuvipanda> I do find that it has less databases than 1005 tho. not sure if that's expected
04:46 < yuvipanda> jynus: marostegui https://gerrit.wikimedia.org/r/#/c/337775/ will switch the aliases we ask people to use to labsdb1004 from 1005
07:15 < marostegui> yuvipanda: https://phabricator.wikimedia.org/P4935
07:16 < marostegui> I guess it is not too worrying

After a chat with Jaime, we moved those old databases on labsdb1005 to labsdb1005:/srv/tmp/old_dbs.
They held no data, and only one of them contained any tables (empty, with only the .frm files).
labsdb1004 and labsdb1005 now have the same number of databases.

For the backup data:

es1017 looks like a good candidate:

marostegui@es1017:~$ df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T  5.1T  5.9T  47% /srv

The total size of /srv on labsdb1005 is less than 950G:

root@labsdb1005:~# df -hT /srv/labsdb/
Filesystem                        Type  Size  Used Avail Use% Mounted on
/dev/mapper/labsdb1005--vg-labsdb ext4  8.0T  943G  7.1T  12% /srv/labsdb
root@labsdb1005:~# df -hT /srv/postgres/
Filesystem                          Type  Size  Used Avail Use% Mounted on
/dev/mapper/labsdb1005--vg-postgres ext4  4.0T   67M  4.0T   1% /srv/postgres
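
As a quick sanity check of the figures above, the following sketch adds up the used space on labsdb1005 and compares it with the space available on es1017. The helper to_bytes is a hypothetical name introduced here, and the parsing assumes the 1024-based unit suffixes that df -h prints.

```python
# Unit multipliers for df -h style suffixes (powers of 1024).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def to_bytes(size):
    """Convert a df -h style size such as '943G' or '5.9T' to bytes."""
    return float(size[:-1]) * UNITS[size[-1]]


# Values copied from the df output above.
needed = to_bytes("943G") + to_bytes("67M")  # /srv/labsdb + /srv/postgres
available = to_bytes("5.9T")                 # free space on es1017:/srv

fits = needed < available
print(f"needed {needed / UNITS['G']:.1f}G, "
      f"available {available / UNITS['G']:.1f}G, fits: {fits}")
```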

Change 337881 had a related patch set uploaded (by Jcrespo):
Upgrade toolsdb master to mariadb10

https://gerrit.wikimedia.org/r/337881

Change 337881 merged by Jcrespo:
Upgrade toolsdb master to mariadb10

https://gerrit.wikimedia.org/r/337881

Change 337911 had a related patch set uploaded (by Marostegui):
linux-host-entries: Remove precise from labsdb1005

https://gerrit.wikimedia.org/r/337911

Change 337911 merged by Jcrespo:
linux-host-entries: Remove precise from labsdb1005

https://gerrit.wikimedia.org/r/337911

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['labsdb1005.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201702152016_jynus_16501.log.

Completed auto-reimage of hosts:

['labsdb1005.eqiad.wmnet']

and were ALL successful.

Change 337990 had a related patch set uploaded (by Jcrespo):
toolsdb: Increase innodb log file size to 500M (1 GB total)

https://gerrit.wikimedia.org/r/337990

Change 337990 abandoned by Jcrespo:
toolsdb: Increase innodb log file size to 500M (1 GB total)

https://gerrit.wikimedia.org/r/337990

This is mostly done, with no major incidents. The servers were read-only for only a few seconds before and after the maintenance, during the switchover to the slave.

We will monitor in case someone has an issue.

Marostegui claimed this task.

I am closing this as nothing has been reported so far. If something arises, feel free to reopen the task!