Only enterprise architects can save us from Enterprise Architecture

Enterprise architecture (EA) is a troublesome discipline. I think it’s fair to argue that the famous Bezos API mandate and the birth of AWS were both essentially enterprise architecture efforts, as was Netflix’s Simian Army. These efforts have clearly delivered a huge positive business impact, but it’s much harder to make that case for the version of EA that exists in government.

For this government version, if we look beyond the tendencies towards self-referential documentation and the use of frameworks that lack empirical grounding, there is an increasingly visible conflict with a growing body of knowledge about risk and resilience that is worth considering.

In Canada, Treasury Board’s Policy on Service and Digital requires the GC CIO to define an enterprise architecture, while the Directive on Service and Digital requires departmental CIOs to “align” with it.

EA is used to design a target architecture, but most people encounter it as a project gating mechanism, where pressure to “align” is applied to individual projects. Mostly this takes the form of EA arguing for centralization and deduplication, justified largely by cost savings.

This focus stands in sharp contrast with the literature on resilience, which largely views this sort of cost-optimization activity as stripping a system of its adaptive capacity.

What’s common to all of these approaches - robustness, redundancy, and resilience, especially through diversity and decentralization - is that they are not efficient. Making systems resilient is fundamentally at odds with optimization, because optimizing a system means taking out any slack.

Deb Chachra, How Infrastructure Works

Since this stuff can feel pretty abstract, let’s make it concrete with a look at Treasury Board’s Sign-in-Canada service, which is a key part of their “target enterprise architecture”.

The Government of Canada has ~208 Departments and Agencies, 86 of which issue their own user accounts. This is often held up as an example of inefficiency and duplication, and exactly the kind of thing that EA exists to fix. As TBS describes: “Sign‑in Canada is a proposal for a unified authentication mechanism for all government digital engagement with citizens.”

If you skip past the meetings required to get all 86 systems to use Sign-in-Canada, the end result would be a “star graph” style architecture: Sign-in-Canada at the center, with digital services connecting to it.

A “star graph”. Imagine the central point as a central sign-in service, or some other shared resource (maybe a shared drive, or a firewall) with other users/systems connecting to it.
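For the graph-minded, here is a minimal sketch of that shape in Python (using the networkx library, with the 86 spokes borrowed from the example above) and the property that makes it worth worrying about:

```python
import networkx as nx

# A star graph: one central hub (node 0) with 86 spokes, standing in
# for Sign-in-Canada and the services that would connect to it.
G = nx.star_graph(86)

print(G.degree[0])              # 86: every service depends on the hub
print(nx.node_connectivity(G))  # 1: removing one node (the hub) disconnects everything
```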

Prized for efficiency, and especially for central control, this star-graph-style architecture shows up everywhere in governments. To get to this architecture, EA practitioners apply steady pressure in those meetings (the gating functions of Enterprise Architecture Review Boards) to avoid new sign-in systems and to ensure new and existing systems connect to/leverage Sign-in-Canada.

In graph theory there is a term for networks formed under such conditions: “preferential attachment”, where new “nodes” in the network attach to existing popular nodes.

Networks formed under a preferential attachment model (called “scale-free” in the literature) have some really interesting (and well-studied) properties that I think are exactly what EA is trying to encourage: networks formed like this are surprisingly robust to random failures.
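To see this for yourself, here is a small sketch (Python again, with networkx; the network size is arbitrary) that grows a network by preferential attachment and shows the hubs-plus-long-tail degree distribution that results:

```python
import networkx as nx

# Grow a 1000-node network by preferential attachment (the Barabasi-Albert
# model): each new node attaches to m existing nodes, favouring popular ones.
G = nx.barabasi_albert_graph(n=1000, m=2, seed=42)

degrees = sorted((d for _, d in G.degree()), reverse=True)
print(degrees[:5])   # a handful of heavily connected hubs
print(degrees[-5:])  # the long tail of barely connected nodes
```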

If you imagine the power/cooling/rack space constraints of a traditional data center, and the challenge of staying within those limits while limiting the effects of random failures, the centralization/deduplication focus of EA is a huge benefit.

A demonstration from the online Network Science textbook of how scale-free networks are surprisingly difficult to destroy by randomly removing nodes.


But “scale-free” networks also have another property: they are very fragile to targeted attack. Only a handful of highly connected nodes need to be removed before the network is completely destroyed. If targeted attacks are suddenly the concern, the preferential attachment playbook starts to look like a problem rather than a solution.

A demonstration from the Network Science textbook showing how specifically targeting central nodes quickly destroys a scale-free network.
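Both textbook demonstrations are easy to reproduce in miniature. A rough simulation, under the same assumptions as the sketch above (an arbitrary 1000-node preferential-attachment network, 5% of nodes removed):

```python
import random
import networkx as nx

def giant_fraction(G, n_original):
    # Share of the original nodes still in the largest connected component.
    return len(max(nx.connected_components(G), key=len)) / n_original

N = 1000
G = nx.barabasi_albert_graph(n=N, m=2, seed=42)

# Random failure: remove 5% of nodes at random.
random_fail = G.copy()
random_fail.remove_nodes_from(random.sample(list(G.nodes), 50))

# Targeted attack: remove the 50 best-connected hubs instead.
hubs = [v for v, _ in sorted(G.degree, key=lambda nd: nd[1], reverse=True)[:50]]
targeted = G.copy()
targeted.remove_nodes_from(hubs)

print(giant_fraction(random_fail, N))  # barely dented by random failures
print(giant_fraction(targeted, N))     # sharply reduced by the targeted attack
```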

It’s these ideas that show why an EA practice narrowly focused on reuse/centralization/deduplication ends up conflicting with resilience engineering and modern security architecture.

Through that resilience lens, the success of Sign-in-Canada means a successful hack (the Okta breach gives us a preview) could paralyze 86 government organizations, something that isn’t currently possible.

In academic terms, what we’ve done is increase our system’s “fragility”: a well-known byproduct of the kinds of optimizations that EA is tasked with making.

We need to understand this mechanistic goal of optimization as creating this terrible fragility, and we need to try and think about how we can mitigate against this.

Paul Larcey, Illusion of Control (2023)

These system/network properties are well known enough that the US military has developed an algorithm that will induce fragility in human organizations. The military uses it to make networks (terror networks, in their case) more vulnerable to targeted attack.

The algorithm is called “greedy fragile” and it works by selecting nodes for “removal” via “shaping operations” (you can imagine what removing someone from a social network means in a military context), so that the resulting network is more centralized (“star-like”) and fragile; centralizing as a way to maximize the impact of a future attack.

Explaining the goal of military “shaping operations”, to make a network more “star-like” and fragile.
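The details of the published algorithm are beyond this post, but the intuition is easy to sketch. What follows is a hypothetical toy version in Python, not the actual “greedy fragile” implementation: greedily remove whichever node leaves behind the most centralized, star-like network, measured by Freeman degree centralization (1.0 for a perfect star).

```python
import networkx as nx

def degree_centralization(G):
    # Freeman degree centralization: 1.0 for a perfect star, near 0 when
    # every node has roughly the same number of connections.
    n = G.number_of_nodes()
    if n < 3:
        return 0.0
    degrees = [d for _, d in G.degree()]
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))

def greedy_fragilize(G, removals):
    # At each step, "remove" the node whose absence leaves the most
    # star-like (centralized, and therefore fragile) network behind.
    G = G.copy()
    for _ in range(removals):
        best = max(G.nodes,
                   key=lambda v: degree_centralization(nx.restricted_view(G, [v], [])))
        G.remove_node(best)
    return G
```

Run against the scale-free network from the earlier sketches, each removal nudges the survivors toward the star shape the figure describes.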

While it might sound uncharitable to lay the responsibility for systemic fragility at the feet of enterprise architecture, it is literally the mandate of these groups to identify these optimizations and make them happen. It’s worth saying that executives fixated on centralization, and security’s penchant for highly centralized security “solutions”, are big contributors too.

I would argue the 2022 hack of Global Affairs, which brought down the entire department for over a month, is an example of this fragility. When an entire department can fail as a single unit, that is an architectural failure as much as it is a security failure; one that says a lot about the level of centralization involved.

It’s worth saying that architecting for resilience definitely still counts as “enterprise architecture”, and in that way I think EA is actually more important than ever. However, as pointed out in How Infrastructure Works, it would be a big shift from current practice.

“Designing infrastructural systems for resilience rather than optimizing them for efficiency is an epistemological shift”

Deb Chachra, How Infrastructure Works

We very much need EA teams (and security architecture teams) to make that shift to focusing on resilience. The EA folks I’ve met are brilliant analysts, more than capable of updating their playbooks with ideas from complex systems, cell-based architecture, resilience patterns like the bulkhead pattern, chaos engineering, or Team Topologies, and using them to build more resilient architectures at every level: both system and organizational.
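To pick one of those patterns: the bulkhead pattern gives each downstream dependency its own small, isolated pool of capacity, so one slow or failing dependency can’t drag down everything else. A minimal sketch in Python (the dependency names and limits are hypothetical):

```python
import threading

class Bulkhead:
    """A compartment with a fixed number of concurrent slots."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full,
        # so a struggling dependency sheds load rather than spreading it.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# One compartment per dependency: a flood of slow geocoding calls can
# exhaust its 5 slots without touching the capacity reserved for payments.
payments = Bulkhead(max_concurrent=10)
geocoding = Bulkhead(max_concurrent=5)
```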

With Global Affairs, FINTRAC, and the RCMP all hit within a few weeks of each other here in early 2024, making resilience a priority across the government is crucial, and there is nobody better placed to do that than enterprise architects.

Modernise security to modernise government

With the resignation of the CIO of the Government of Canada, the person placed at the top of the Canadian public service to fix the existing approach to IT, there is lots of discussion about what’s broken and how to fix it.

Across these discussions, one thing stands out to me: IT security always seems to get a pass in discussions of fixing/modernising IT.

This post is an attempt to fix that.

As the article about the CIO points out, “All policies and programs today depend on technology”. IT Security’s Security Assessment and Authorization (SA&A) process applies to all IT systems, and therefore lands on the critical path of “all policies and programs”. This one process adds a 6-24 month delay to every initiative and somehow escapes any notice or criticism at all.

If you imagine some policy research, maybe a public consultation, and then implementation work, plus 6-24 months caught in the SA&A process, it should be clear that a single term in office may not be enough to craft and launch certain initiatives, let alone see benefits from them while in office. Hopefully all political parties can agree that fixing this is in their best interests.

As a pure audit process, the SA&A is divorced from the actual technical work of securing systems (strangely, done by ops groups or developers rather than by security groups), leaving lots of room to reshape this process without threatening actual security work. Improving this one process is probably the single most impactful change that can be made in government.

It’s also key to accelerating all other modernisation initiatives.

Everyone in Ottawa is well aware that within each department lie one or more ticking, likely political-career-ending, legacy IT timebombs. Whether it is the failure of the system itself, the failure of the initiative launched to fix it, or even just the political fallout from fixed-capacity systems failing to handle a surge in demand, every department has these, and the only question is who will be in office when it happens.

Though you’d never guess it, people inside the government actually do know how to build systems that can be modernised incrementally, changed quickly, rarely have user-visible downtime, and can expand to handle waves of traffic without falling over.

The architecture that allows this (known as microservices) was made mandatory by TBS in the 2017 Directive on Service and Digital (A.2.3.10.2 here). The Directive’s successor (the Enterprise Architecture Framework) doesn’t use the term directly, but requires that developers “design systems as highly modular and loosely coupled services” along with several other hallmarks of the microservices architecture that allow for building resilient digital services.

I think TBS was correct in its assessment that this architecture is key to many modernisation initiatives and to avoiding legacy system replacements just as inflexible as their predecessors. Treasury Board itself describes the difference between current practice and its target architecture as a “major shift”, and once again it’s a modernisation effort blocked by security.

AWS uses the same architecture to deliver its digital services and promotes it, along with the infrastructure and team structures needed to support it, under the banner “Modern Applications”. Substantially similar advice is given by Google and others, and many of these best practices have been worked into TBS policy since 2017.

While much of TBS IT policy has been refreshed, all core IT security guidance and patterns are built around pre-cloud ideas (ITSG-22 from 2007, ITSG-38 from 2009) and process (ITSG-33 from 2012).

While TBS might want to adopt microservices (created circa 2010 around the “death” of SOA), it’s the 2005-era 3-tier architecture (what ITSG-38 still calls the “preferred architecture”) that the infrastructure is set up to support, rather than the fancy compute clusters and cloud functions needed for the microservices architecture.

Similar conflicts exist with the Dev(Sec)Ops practices and “multidisciplinary teams” needed to support this architecture, which are likely not feasible given current interpretations and enforcement of ITSG-33’s AC-5 (“separation of duties”).

While TBS has updated its policy to require agile development practices, ITSG-33, the foundation of all government security process, is explicitly waterfall. And while adapting it to agile is theoretically possible, it’s developers who get exposure to agile methods, rather than the well-intentioned auditors and former network admins that populate most security groups. Surrounded by waterfall processes that no one seems equipped to modernise, systems built around continual change flounder.

While a big part of the point of this architecture is to fix the very visible problems governments have with availability and scalability, the fixed-capacity security appliances that security teams routinely place between the internet and departmental systems will undermine those benefits. No GC firewall will handle waves of traffic like AWS Lambda or Google Cloud Run, and a high-availability architecture won’t matter when security brings down the firewall in front of it to patch it.

The idea here is that in many cases modernising security is a precondition to successfully modernising anything else. For those who overlook security, the assumptions embedded in security’s tools, processes, and architectures will subtly but steadily undermine that effort. Treasury Board is filled with smart policy analysts learning this the hard way.

Security is the base of the modernisation pyramid… start there to fix things.