A New Era of AI Evaluation: The Flourishing AI Benchmark
In a promising development within the realm of artificial intelligence, Gloo, a pioneering technology platform tailored for the faith-based ecosystem, has unveiled the Flourishing AI (FAI) Benchmark. This innovative framework aims to measure how effectively today's leading Large Language Models (LLMs) contribute to human well-being and flourishing. This is a significant step not just for AI evaluation, but also for nurturing a supportive environment for holistic human growth.
The FAI Benchmark was developed in collaboration with researchers from Gloo, Barna Group, and Valkyrie Intelligence, and draws on insights from the comprehensive Global Flourishing Study, underscoring its scholarly foundation. It is designed to aid organizations and leaders focused on fostering human flourishing, and marks the first comprehensive assessment of AI against values that go beyond purely technical metrics.
Scott Beck, CEO of Gloo, emphasizes the importance of this initiative by stating, "Core to what we do at Gloo is to serve those organizations who are helping people flourish. There is an urgent need to shape technology for good — and these standards and benchmark measures are an important way we can serve the faith and flourishing ecosystem."
The Flourishing AI Benchmark comprises over 1,200 carefully curated questions drawn from a variety of sources, including academic research, professional licensing exams, and existing LLM benchmarks. The evaluation framework employs both objective and subjective questions, using diverse judge LLMs equipped with a specialized perspective for each dimension examined. Model responses are assessed on both their alignment with the relevant rubrics and other criteria the judging models deem significant.
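The article does not publish the benchmark's internals, but the judge-LLM approach it describes can be sketched in outline. In the toy sketch below, every name (`RubricItem`, `judge_response`, the sample criteria and weights) is hypothetical, and the judge itself is replaced by a trivial keyword check so the code runs without an LLM; a real judge would be a model prompted with the dimension's specialized perspective.

```python
# Hypothetical sketch of rubric-based judging; not Gloo's actual code.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str   # what the judge looks for in the response
    weight: float    # relative importance within this dimension

def judge_response(response: str, rubric: list[RubricItem]) -> float:
    """Score a response 0-100 against a dimension's rubric.

    Stand-in for a judge LLM: here each criterion is "met" if its
    text literally appears in the response, purely so the sketch
    is runnable end to end.
    """
    def met(item: RubricItem) -> float:
        return 1.0 if item.criterion.lower() in response.lower() else 0.0

    total_weight = sum(item.weight for item in rubric)
    return 100.0 * sum(met(item) * item.weight for item in rubric) / total_weight

# Illustrative rubric for a "Relationships"-style dimension.
rubric = [
    RubricItem("empathy", 2.0),
    RubricItem("practical advice", 1.0),
]
print(judge_response("The reply shows empathy and practical advice.", rubric))
```

In the real framework, `met` would be an LLM call returning a graded judgment rather than a binary keyword match, and scores would be aggregated across the question set for each of the seven dimensions.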
Pat Gelsinger, executive chair and head of technology at Gloo, articulates a broader vision regarding AI's implications, stating, "AI is one of the most consequential technologies shaping humanity's future. To guide its development, we must measure it against the ultimate standard — human flourishing. Today, that vital journey begins."
Key Insights from the FAI Benchmark
Initial findings from the FAI Benchmark reveal some striking statistics and observations:
- No leading model attained the 90-point threshold for excellence in supporting human flourishing across all seven dimensions evaluated. Average scores across the seven dimensions were:
  - Character: 58
  - Relationships: 67
  - Happiness: 65
  - Meaning: 56
  - Health: 72
  - Finances: 81
  - Faith: 35
- Among the models assessed, the strongest overall performers were OpenAI o3 (72 points), Gemini 2.5 Flash (68), Grok 3 (67), and GPT-4.5 (66).
- The most significant performance gaps appeared in the Faith and Meaning dimensions, indicating a need for further advances in model capabilities related to existential reasoning, ethical reflection, and discussions of virtue.
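As a quick arithmetic check on the figures above, the unweighted mean of the seven dimension averages can be computed. Note this is purely illustrative: the article does not state how (or whether) the benchmark combines dimension scores into an overall figure.

```python
# Per-dimension average scores as reported in the article.
dimension_averages = {
    "Character": 58, "Relationships": 67, "Happiness": 65,
    "Meaning": 56, "Health": 72, "Finances": 81, "Faith": 35,
}

# Simple unweighted mean across the seven dimensions (an assumption,
# not the benchmark's stated aggregation method).
overall = sum(dimension_averages.values()) / len(dimension_averages)
print(round(overall))  # → 62
```

The mean sits well below the 90-point excellence threshold, consistent with the article's observation that no leading model reached it, with Faith (35) and Meaning (56) dragging the aggregate down.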
Interestingly, the models displayed commendable proficiency in Finances and Health, with top-performing systems like OpenAI o3, Grok 3, and OpenAI GPT-4.1 achieving their highest evaluation scores in these areas.
This data underscores a profound gap in AI's capacity to support the more existential dimensions of human flourishing, such as faith and meaning.