Reboot for kernel upgrades
Closed, Resolved (Public)


We need to reboot (along with many other hosts) to apply some long-standing kernel updates. Downtime is expected to be a few minutes, as the reboot will happen together with the reboot of the entire codfw ganeti infrastructure.

This would normally be easy and not require a ticket, if not for the nature of the service. Hence filing this task as advance notice for interested parties. has some information on the things that may need some extra help recovering; we should probably update whatever is needed before (and possibly after) the reboot.

This will happen on June 21st.

Event Timeline

akosiaris triaged this task as Medium priority. Jun 12 2017, 9:19 AM

(As this has been marked with user-notice.)

How tentative is that date? Is it worth spreading the word and mentioning a specific date yet?

It's dependent on a kernel upgrade due to be released on the 19th. That has already been rescheduled once. I wish I could provide a degree of certainty but it is not dependent on something we control. But it does look like it's not going to be rescheduled. I would suggest we spread the word and just re-spread it if we have to reschedule.

The date is no longer tentative. It's now fixed.

It was included in the issue of Tech News that went out yesterday.

Per @akosiaris's last comment, I've taken the liberty of updating the task description to reflect the now non-tentative date.

Mentioned in SAL (#wikimedia-operations) [2017-06-21T10:35:28Z] <akosiaris> rebooting the entire codfw ganeti cluster for kernel upgrades. Silenced hosts in icinga already. T167643

Change 360629 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Switch codfw puppetdb hosts to eqiad

Change 360629 merged by Alexandros Kosiaris:
[operations/puppet@production] Switch codfw puppetdb hosts to eqiad

This has been completed successfully. I see 92 users (well, bots actually/probably) already in #en.wikipedia and another 64 in #de.wikipedia. ClueBot_NG is among them. That's a bit lower than the 120 and 75 respectively we had before the reboot, but that's probably clients not handling disconnections very well. Overall this has gone very well. I am gonna leave the task open for a few days for monitoring.
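For anyone wanting to put a number on "a bit lower", here is a rough sketch that turns the channel counts quoted above into a percentage of users that came back. The `retention` helper and the `counts` layout are made up for illustration and are not part of any tooling used for this reboot; only the four numbers come from the comment above.

```python
# Channel user counts quoted in the comment above: (before reboot, after reboot).
counts = {
    "#en.wikipedia": (120, 92),
    "#de.wikipedia": (75, 64),
}


def retention(before: int, after: int) -> float:
    """Return the share of pre-reboot users present after the reboot, in percent."""
    return 100.0 * after / before


for channel, (before, after) in counts.items():
    pct = retention(before, after)
    print(f"{channel}: {after}/{before} users back ({pct:.1f}%)")
```

This is only a snapshot taken shortly after the reboot, so the numbers would be expected to move as slower clients reconnect.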

No issues reported in a week, resolving.