Our mission involves three interlinked objectives. First, we aim to advance the design of methods and tools for mechanistic interpretability, understanding, and control of foundation models, generative AI, and frontier AI systems. Second, we aim to build mechanistically interpretable and controllable computational models and simulations of real-world complex systems with advanced AI. Third, we aim to connect these two areas by drawing a mechanistic thread between the internal architecture of AI models and the real-world objects, processes, and entities they represent, enabling AI-based simulation, oversight, prediction, and control of complex systems across a variety of industries and sectors.
Our current roadmap involves developing tools that accelerate and scale methods for probing and reverse engineering the internal architecture of AI models, uncovering the mechanisms, features, and activation patterns that encode knowledge and underlie specific behaviours at the computational, algorithmic, and circuit levels. These tools are intended to provide human-understandable insight into the internals of complex models and to enable predictive control of model behaviours and outputs. This work is complemented by building mechanistically interpretable AI-based models that capture and simulate the dynamics of complex multi-agent and multi-body systems, which can then be mapped back to the real world for interpretable, AI-based model predictive control.
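To make the idea of probing internal activations concrete, the sketch below trains a simple linear probe to test whether a concept is linearly decodable from a layer's activations. This is a minimal illustration under stated assumptions, not a description of our tooling: the activations, dimensions, and concept signal are synthetic placeholders standing in for representations extracted from a real model.

```python
# Minimal, illustrative linear-probe sketch. The "activations" are random
# placeholders; in practice they would be extracted from a model's hidden
# layers (e.g. the residual stream) on labelled inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000

# Hypothetical concept direction: label-1 examples carry a weak signal
# along this direction, mimicking a concept the model might encode.
concept_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model))
activations += 0.5 * labels[:, None] * concept_direction

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
# High held-out accuracy suggests the concept is (approximately)
# linearly represented in this layer's activations.
```

In practice, probe accuracy on held-out data is one signal among several; causal interventions on the identified direction are needed to show that the feature actually drives behaviour rather than merely correlating with it.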
Our objective is to operationalise and commercialise cutting-edge research from the rapidly emerging fields of mechanistic interpretability, representation engineering, and the science of AI evaluation to improve the safety, reliability, understanding, and control of advanced AI models. Furthermore, we aim to extend the reach of mechanistic interpretability tools beyond language models to models of non-linguistic domains, objects, entities, and processes.
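For flavour, the sketch below shows one common representation-engineering technique, activation steering: a forward hook adds a fixed direction to one layer's activations at inference time, nudging the model's behaviour without retraining. The model, layer choice, and steering vector here are toy placeholders, assumed for illustration only.

```python
# Illustrative activation-steering sketch on a toy stand-in for a model.
# In practice the steering vector is typically derived from contrastive
# activations (e.g. the mean difference between two sets of prompts).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Toy placeholder for a stack of transformer blocks.
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(4)])

# Hypothetical steering direction.
steering_vector = torch.randn(d_model)

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output:
    # here we shift layer 2's activations along the steering direction.
    return output + 4.0 * steering_vector

handle = model[2].register_forward_hook(steer)
x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print("shift in output norm:", (steered - unsteered).norm().item())
```

The appeal of this style of intervention is that it ties a human-interpretable direction in activation space directly to a controllable change in model output, which is the kind of mechanistic handle on behaviour described above.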