Sprint retrospective - Postgres phase 2, June 22 - July 24, 2020

What worked

Well-structured
- Thanks to @helfi92 for organizing the work so well
- Concerns at the outset about whether we had signed up for too much work, but because the work was so well-defined, we were able to stay on track.
- It was large, but mostly mechanical, process
  - It could have been tempting to not continue to phase 2 and stick with entities.
  - Boring, but necessary and fruitful work in order to realize the other benefits of postgres down the road.
  - Got it done, like ripping a band-aid off
  - Over 30K lines of code changed, mostly in same area
  - @petemoore touched node code and survived!
    - @petemoore was also on PTO for the final two weeks of the sprint yet we still managed to get stuff done.
      - Rest of the team remained effective, efficient
Not afraid of creating throwaway, intermediate infra to make the process much easier.
- The tooling and test frameworks that @helfi92 & @djmitche made it easy for others to plug-in quickly.
- Look ahead to Object Service which may benefit from a similar pattern.
Merge conflict resolution skills got a workout
- Usually this is something teams shy away from, but the skills themselves are super-useful to have.
- Again, the scripting that was developed to assist with rebasing was great.
We used the minimum viable tooling to assist with migration
- We didn’t rely on an extra ORM or framework like Knex which meant that there was less for people to learn.
We established a virtuous help cycle
- At some point during the sprint, every person on the team helped everyone else
- Lots of pairing
- Useful and practical, again because everyone was on same page
The tree-closing window (TCW)
- We decided to rollback rather than risk a downtime of unknown duration.
- Versioning saved us, or at least gave us an acceptable fallback position where we lost less work.
  - Having multiple releases as fallback points for the TCW
- Flexibility in TCW timing with Release Management
  - We were able to schedule the TCW a week later than it would have normally taken place.
  - Because systems are generally better automated and more resilient than a few years ago. RelMan can be more flexible with timing, provided we engage them early and we have a solid rollback story.
Online migration
- This saved our bacon after the TCW rollback, allowing us to get back on track with the sprint. It is also a very useful tool to have in our back pocket for future work.
Zenhub
- The Zenhub tooling around issues proved better (more full-featured) than the github version.
- The blocking/blocker labelling was useful
  - e.g. When we decided to cut signing, we immediately knew how many other issues would be effective and could also be cut.
We managed to maintain a balance between external requests and sprint focus.

What didn’t

Zenhub
- Despite generally being a welcome addition to the sprint process, we ran into some issues with Zenhub that weren’t great:
  - Occasional blank pages, sometimes associated with switching tabs.
  - Labels that we added to issues sometimes went missing, i.e. our changes didn’t stick and we needed to redo some work.
  - The default layout requires large display.
    - @imbstack found it hard to use on a small laptop screen.
- Nice-to-have features we wish the tool had:
  - the ability to fullscreen a particular pipeline for triage.
  - clone an issue/card
The tree-closing window (TCW)
- We didn’t manage to accomplish all the work we planned due to slower performance on the production database than we saw in staging.
  - We still don’t understand the reason(s) behind the performance difference.
We could have had better communication with CloudOps in the run-up to the TCW.
- If we’re planning for people to need to work weekends, we should involve them in the process as early as possible.

Process changes for next sprint

Avoid merge conflicts by having a work item to merge other pieces of work
- When we’re all working on the same files (which is likely in the sprint model), designate a “merge master” to resolve conflicts.
Triage
- There was some uncertainty around where issues should go when filed: Icebox vs New Issues vs Product Backlog
- By default, new issues go to the Icebox.
  - This is a good default state because it means some small action is required to acknowledge issues that we will actually work on at some point.
- The team found Zenhub useful for triaging issues added to the Icebox because the volume is low.
  - …whereas Github was more useful for larger triage of New Issues/Product Backlog
Product owner responsibilities
- Now that we have access to it, the burndown chart is essential for visualizing progress and delays.
- Keeping on top of burndown chart is now a primary responsibility for the Product Owner, although the rest of the team show also be looking at it.
  - Be sure to compare the view with “In Progress” issues included alongside the view with only “Closed” issues.
  - To help avoid the initial plateau of productivity, future Product Owners are encouraged to front-load sprints with smaller issues/pieces of work to help people ramp up, especially when they are unfamiliar with the domain.
- When organizing the sprint, it’s important to consider factors that might introduce delays and schedule accordingly. Two such factors we encountered during this sprint were:
  1. Reviews for known large changes - some of the database table changes requires thousands of lines of code and proper reviews take time; and,
  2. Deployments - while most deployments can happen quickly, there are external factors like CloudOps availability and Firefox release schedules that we need to account for.
Stand-ups
- We decided that forcing people on the west coast to sacrifice their sleep to attend stand-ups was not kind.
  - Also, east coast people found it strange to reporting status midway through their day.
  - @petemoore was on PTO, so we didn’t get his feedback re: CEST timing
- New model will have localized stand-ups at 09:30 for each timezone
  - We will try to automate the announcement of the stand-ups in the Slack channel using the gcal integration.
  - We will also try the Zenhub slack integration to inject more status into the channel automatically. We understand that this may make the channel noisy at the start of future sprints.
  - It is still incumbent on the Product Owner and Scrum Master to review the daily stand-up notes to look for blockers or other problems.

Chris Cooper

Sprint retrospective - Postgres phase 2, June 22 - July 24, 2020

What worked

What didn’t

Process changes for next sprint