
The Complexity of AI Interpretability and Ethical Governance: A Deep Dive (opinion essay)

  • Writer: Irena Kolek
  • May 23
  • 8 min read

Hello Fellow Human,


Imagine a world where artificial intelligence systems make decisions that affect billions of human lives: from recommending medical treatments to determining who gets approved for a loan. These AI systems are increasingly becoming the invisible hand shaping critical aspects of our lives. They change our everyday experiences, yet most of us do not know how they work. They are black boxes that even their creators struggle to fully understand.

This concern was highlighted in an article by Dario Amodei, CEO of Anthropic, in which he addressed the urgency of interpretability: we do not seem to understand how some AI models reach their conclusions.


This profound statement made me sit down, dig deeper into AI's whats and whys, and eventually pen this essay-length opinion (one that raises more questions than answers): how can we trust AI models to make fair, accurate, and ethical decisions that impact not only our lives but also shape societal thinking?


To begin understanding this problem, one must start by grasping some basics of how AI works.


Black-box models


AI models broadly fall into two categories: black-box models and interpretable models.


Black-box models, like deep neural networks and large language models (e.g., GPT by OpenAI), excel at processing vast amounts of data and identifying complex patterns. However, their decision-making processes are so intricate that even experts struggle to explain how specific outputs are generated. When we call an AI model a "black box," we’re describing the fact that while we can see the input (what you feed into the model) and the output (what the model gives back), the internal process that leads from one to the other is hidden behind layers of complexity that are difficult, if not impossible, for humans to interpret clearly.


Black-box models are often powered by deep learning, a type of machine learning loosely inspired by the human brain. At the heart of this approach are artificial neural networks - webs of mathematical functions organized into layers: an input layer, one or more hidden layers, and an output layer.


The input layer takes in raw data (like a sentence, image, or audio clip). The hidden layers (often dozens or hundreds of them) then transform the data in complex ways, identifying patterns, correlations, and features. These transformations are shaped during training by millions or billions of weights and parameters. Finally, the output layer produces the result - a sentence completion, image label, or prediction.
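The flow through these layers can be sketched in a few lines of Python. This is a toy network with random, untrained weights (the layer sizes are chosen arbitrarily for illustration), but it shows why the hidden representation resists interpretation: it is just a list of numbers produced by stacked arithmetic, with no labels attached to what each number means.

```python
import math
import random

random.seed(0)

def layer(inputs, weights, biases):
    # Each neuron computes a weighted sum of its inputs plus a bias,
    # then squashes the result through a nonlinearity (tanh here).
    return [math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# A tiny 3-input -> 4-hidden -> 1-output network with random weights.
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)]]
b2 = [0.0]

x = [0.5, -1.0, 2.0]           # raw input
hidden = layer(x, W1, b1)      # hidden representation: 4 unlabeled numbers
output = layer(hidden, W2, b2) # final prediction, one number in (-1, 1)
print(hidden, output)
```

Real models repeat this step across hundreds of layers and billions of weights, which is where the "black box" effect comes from: each individual multiplication is simple, but the composition is not humanly readable.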


Each layer builds on the previous one, developing more abstract representations of the data. These black-box models are trained on colossal datasets - think terabytes of text scraped from books, websites, forums, and social media. During training, the model adjusts its internal parameters (weights) to reduce the difference between its predictions and the actual results.


This process is like teaching by trial-and-error millions of times, but without a curriculum. It learns what language sounds natural, what ideas tend to appear together, how words and concepts relate statistically. But this method also absorbs the biases in online content, the cultural assumptions of the most represented languages or regions, and toxic or harmful ideas that slip through filters.


Because of this, black-box models are especially vulnerable to subtle, systemic biases, which are difficult to detect or correct - not because the designers are careless, but because the scale and complexity make such biases hard to spot.


Interpretable models


In contrast, interpretable models, such as decision trees or rule-based systems, provide a clearer, step-by-step logic behind their outputs. These models are easier to scrutinize and audit, making them more suitable for applications requiring transparency and accountability.


Building and training these models begins with choosing the appropriate model type depending on the problem at hand: for instance, using linear regression to predict numerical outcomes, or decision trees to provide step-by-step logic behind classifications.


Once the model is chosen, the next step is collecting and preparing the data. This involves organizing the data into features and examples, handling missing values, converting categories into machine-readable formats, and ensuring the dataset is as clean and unbiased as possible, since these models directly rely on data patterns to generate decisions.


During the training phase, the model learns to associate input features with outcomes. For regression models, this means adjusting coefficients so that the resulting equation best fits the data, effectively learning how much weight each variable should have in determining the outcome. For decision trees, the training algorithm selects questions that best split the data into meaningful branches, forming a tree-like structure where each path leads to a specific prediction. Rule-based systems either rely on expert-written logic or learn rules directly from the data. In all these cases, the final model is one that can be directly inspected, with each decision traceable back to a transparent rule or calculation.
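To illustrate that traceability, here is a toy rule-based model for a loan decision. The thresholds and feature names are invented for this sketch, not taken from any real lending system, but the key property is genuine: every output comes with the exact rules that produced it.

```python
def loan_decision(income, debt_ratio, on_time_payments):
    """Toy rule-based model: each decision is traceable to explicit rules."""
    trace = []
    if income < 20_000:
        trace.append("income < 20,000 -> reject")
        return "reject", trace
    trace.append("income >= 20,000")
    if debt_ratio > 0.45:
        trace.append("debt_ratio > 0.45 -> reject")
        return "reject", trace
    trace.append("debt_ratio <= 0.45")
    if on_time_payments >= 12:
        trace.append("on_time_payments >= 12 -> approve")
        return "approve", trace
    trace.append("on_time_payments < 12 -> manual review")
    return "review", trace

decision, why = loan_decision(income=35_000, debt_ratio=0.30,
                              on_time_payments=18)
print(decision, why)
```

An auditor who disagrees with an outcome can point at the offending rule and change it; there is no equivalent move available inside a billion-weight neural network.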


After training, the model is evaluated using separate test data to assess its accuracy, reliability, and potential biases.


The transparent nature of interpretable models means that any issues can often be identified and corrected directly within the model’s structure. Once validated, the model is deployed for use in real-world scenarios. Because its logic is open and explainable, it becomes much easier for stakeholders to understand, trust, and refine the model’s decisions over time. This transparency makes interpretable models particularly valuable in high-stakes contexts where accountability, fairness, and user trust are paramount. However, while interpretable models foster transparency, they may lack the predictive power and scalability of black-box models.


This creates a paradox: Should we prioritize the accuracy and depth of black-box models or the transparency and ethical assurance of interpretable systems?


The Ethical Dilemma: Balancing Power and Transparency


The ethical implications of AI systems are particularly pronounced when we consider how decisions are made. Imagine a black-box AI model making parole decisions or allocating government funding. If those decisions cannot be adequately explained, it becomes nearly impossible to ensure fairness, prevent discrimination, or hold the model accountable. Moreover, the lack of interpretability opens the door to biased outcomes, whether intentional or inadvertent.


And this is a problem that needs to be discussed as AI usage spreads into our daily lives.


Data used to train AI systems often reflects societal biases, and if those biases are not identified and mitigated, AI can perpetuate or even amplify them. This risk becomes even more concerning when we examine who controls the development, deployment, and regulation of AI.


Most of the powerful AI systems are built, trained, and governed within a small number of Western countries - particularly the United States and parts of Europe - where the datasets, algorithms, and design principles tend to reflect Western values, priorities, and cultural assumptions. As a result, AI systems trained predominantly on Western-centric data may risk privileging certain worldviews and marginalizing others, such as those rooted in Asian, African, Middle Eastern, or Indigenous philosophies.


This prioritization of Western thought might not only shape what AI considers to be "neutral" or "rational," but it may also subtly encode power structures that can influence decision-making in a wide array of domains: from education, healthcare, and justice to media and employment.


In black-box models, where the internal logic is opaque, these embedded biases become nearly impossible to detect and even harder to challenge, especially for communities that have little say in how the systems are built.


The result is a form of ethical imperialism: AI systems impose a dominant moral and cultural framework onto global populations, often without consent or representation. For instance, AI tools used for content moderation, loan approval, or academic evaluation may judge individuals based on values or behavioural norms that are foreign to their local context. In doing so, these systems risk suppressing alternative ways of knowing, relating, or organizing society.


Without deliberate efforts to include diverse cultural inputs, linguistic variation, and value systems in AI design and governance, the global deployment of AI technologies could deepen inequality and reinforce existing power imbalances. It is therefore not only a technical challenge but a moral imperative to ensure that interpretability, accountability, and cultural plurality are built into AI from the ground up.


The Path Forward: Hybrid Models and Ethical Oversight


As the problems stated above ring all possible bells, one may ask: what solution could control the development path of AI models? How can we ensure that governing bodies follow an objective approach toward unbiased model training? How can we prevent AI systems from becoming new instruments of political power, potentially spreading propaganda on a global scale without anyone realizing it?


As we navigate the challenges of AI ethics and governance, one of the most promising and balanced solutions lies in hybrid model architecture - a structure that leverages the raw computational power and pattern recognition capabilities of black-box models while simultaneously embedding interpretability through more transparent, rule-based systems.


This layered structure enables the efficiency and power of black-box models while maintaining a safeguard against biased or erroneous outcomes through interpretable oversight.


Integrating robust monitoring systems and feedback loops is crucial. These systems would continuously assess AI outputs for consistency, bias, and ethical alignment. This approach also opens an entirely new ethical and philosophical dimension: the possibility of letting users choose which worldview or moral framework guides the AI’s reasoning.


In practical terms, this option could mean offering culturally contextualized AI pathways. Suppose a user in Tunisia is interacting with a chatbot. Instead of the model operating solely on generalized data shaped by Western norms of efficiency, consumer behavior, or liberal individualism, the system could allow for interaction governed by North African cultural ethics - such as community-centered values, specific legal customs, or religious guidelines. A user in Japan, on the other hand, might opt for an AI mode that places greater emphasis on harmony, social obligation, and long-term relational thinking - values often found in Eastern philosophies.


One may ask: technically speaking, how would that work? In a hybrid setup, the underlying black-box model would perform the heavy computational work - for example, recognizing speech, translating languages, detecting risk, or predicting outcomes. But before final decisions are presented or acted upon, a layer of interpretable logic could take over, tailoring the output through a culturally aware framework. These interpretable models would act as ethical filters or regulators - offering explanations, flagging inconsistencies, and allowing human oversight that is not only technical but culturally and morally situated.
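A minimal sketch of this layered design, using content moderation as the example: the black-box scorer below is a deterministic stand-in (a real system would call a trained classifier), and the regional thresholds and exempt terms are hypothetical, invented purely to show the shape of the idea.

```python
REGIONAL_RULES = {
    # Hypothetical, illustrative rule sets - not real policy.
    "US": {"threshold": 0.8, "exempt_terms": {"satire"}},
    "DZ": {"threshold": 0.6, "exempt_terms": {"proverb"}},
}

def black_box_score(text):
    # Stand-in for an opaque neural model rating "harmfulness" in [0, 1).
    return (sum(ord(c) for c in text) % 100) / 100

def moderate(text, region):
    """Interpretable layer: the final decision is an explicit, auditable rule."""
    score = black_box_score(text)
    rules = REGIONAL_RULES[region]
    if any(term in text.lower() for term in rules["exempt_terms"]):
        return "allow", f"matched an exempt term for {region}"
    if score > rules["threshold"]:
        return "flag", f"score {score:.2f} exceeds {region} threshold"
    return "allow", f"score {score:.2f} within {region} threshold"

print(moderate("hello", "US"))
```

The point of the split is auditability: the score stays opaque, but the decision that acts on it is a short, readable rule that each region's stakeholders can inspect and amend.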


Take, for example, an AI used in the education sector that recommends curriculum tracks for students. A black-box model could analyse vast historical data to find patterns in student performance, career success, and interests. But the final recommendations would be filtered through an interpretable model designed with regional or national educational values in mind. In a Western system, this might prioritize individual choice and future earnings; in contrast, a model shaped by collectivist values may highlight societal contribution, family expectations, or local development needs.


This approach could also be extended to global content moderation. A black-box model might identify potentially harmful speech with high accuracy. But a secondary, interpretable model could apply culturally specific definitions of what is deemed offensive, permissible, or contextually acceptable. This way, a post made in Algeria is not unfairly judged by U.S.-centric standards of speech norms, and vice versa.


Hybrid model architecture and its application


While this idea of hybrid model architecture sounds beautiful on paper, its real-life application would face serious challenges.


Localization of interpretable frameworks requires deep, ongoing collaboration with ethicists, linguists, historians, and community leaders in each culture or region. These frameworks must be dynamic and capable of adapting over time, as cultural norms evolve and change.


Empowering users to select the ethical or philosophical lens through which their AI operates requires thoughtful interface design and clear communication. Since most people don’t think in terms of “epistemological systems,” this process must feel intuitive. For instance, a setup interface could allow users to choose preferred cultural or moral frameworks - such as “communitarian,” “liberal individualist,” or “faith-based” - which would then tailor the AI’s behaviour accordingly.
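Such a setup screen might boil down to a small set of named presets that map a user's choice onto the system's behaviour. The preset names below are the ones mentioned above, but the parameters attached to them are purely hypothetical placeholders for whatever levers a real system would expose.

```python
# Hypothetical presets a setup interface could expose; the parameter
# names and values are illustrative, not a real product's options.
FRAMEWORK_PRESETS = {
    "communitarian":         {"weight_individual": 0.3, "weight_community": 0.7},
    "liberal individualist": {"weight_individual": 0.8, "weight_community": 0.2},
    "faith-based":           {"weight_individual": 0.4, "weight_community": 0.6},
}

def configure(choice):
    """Return the behaviour parameters for a user-selected framework."""
    if choice not in FRAMEWORK_PRESETS:
        raise ValueError(f"unknown framework: {choice}")
    return FRAMEWORK_PRESETS[choice]

print(configure("communitarian"))
```

The design choice that matters here is that the vocabulary shown to the user stays human ("communitarian", "faith-based") while the translation into model parameters happens behind the scenes.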


However, some core values need to remain constant across all proposed models to ensure interoperability and fairness: a shared foundation of universal ethical principles (data privacy, fairness, dignity, transparency, etc.) must be preserved, upon which more localized or personalized interpretations can be layered.

 

It goes without saying that supporting both a large-scale neural network and multiple interpretable reasoning layers - configured based on user selection - demands robust infrastructure, real-time adaptability, and careful optimization. It will also be more energy-intensive and costlier to maintain. Yet, despite all this, the benefits could be profound.


A hybrid AI system that integrates user-driven ethical frameworks can reduce the risk of ethical imperialism - the inadvertent imposition of one culture’s moral values on another. Instead, it fosters a truly pluralistic technological future. It offers people in Morocco, Cambodia, Brazil, or Scotland the ability to influence not only how AI functions technically but also how it aligns with their social and moral values. In this vision, AI is no longer a monolithic authority, but a reflective companion - one capable of adapting to human diversity rather than simplifying or erasing it.

 


