Pilot Evaluation Report — GrieVoice

Technical Evaluation Report: Grievance Intake Voice Agent Pilot Testing

Fictional demo narratives: The narrative examples in this GrieVoice field-writing set are fictional, synthetic case narratives created for demo testing. They are not real-life case studies and should not be read as reports about identifiable people, households, employers, or projects. View demo folders: https://loom.com/share/folder/6e443da671414fe3b352459d3c729cc2 and https://loom.com/share/folder/cc8f0ed1e8b2490c964b0b17175306a9

1. Executive Overview and Pilot Framework

This evaluation provides a high-level technical audit of the deployment of Gemini 1.5 Pro as an automated grievance intake agent. The pilot serves as a critical benchmark for AI-driven civic engagement, testing the model's ability to act as a secure, empathetic, and procedurally accurate intermediary between citizens and municipal or corporate entities. By analyzing the intersection of large language model (LLM) reasoning and real-time voice synthesis, this report identifies the agent's readiness for scale within high-stakes environments where trust and data integrity are non-negotiable.

The testing architecture for the pilot was structured as follows:

| Component | Architecture/Tooling |
| ------ | ------ |
| Agent Role | Gemini 1.5 Pro (Configured via Claude) |
| English Callers | HumeVoice Tester Agents |
| Multilingual Callers (Afrikaans/Setswana) | GPT 5.4 |

The evaluation utilized distinct simulated personas to pressure-test the agent’s versatility. The "Hesitant Whistleblower" (e.g., the Northshore Housing reporter) required the agent to navigate deep-seated fears of retaliation and the caller's behavioral markers, such as stating they were "limited in what they can access" to avoid drawing attention. In contrast, the "Urgent Community Reporter" (e.g., Nomsa Jacobs) tested the agent’s efficiency in extracting technical data points, such as the impact on 10–15 households, within a high-arousal, stress-heavy conversational context.

The following thematic analysis transitions from high-level framework to the specific grievances captured during the sessions.

2. Thematic Analysis of Reported Grievances

Categorizing grievances is a primary auditor function to assess the agent's domain-specific knowledge, entity extraction accuracy, and routing logic.

2.1 Infrastructure & Public Health

This category assessed the agent's ability to document immediate environmental hazards and service delivery failures.

Worcester (Zwelethemba Extension 7): Reported by Nomsa Jacobs.
Ground Truth: Sewage overflow ongoing for two weeks; dirty water pooling near homes.
Impact: 10 to 15 households affected; high-risk impact on children and elderly residents.
Kayamandi (Stellenbosch): Reported by Peter Adams.
Ground Truth: Constant dust and noise from road works occurring in the early morning.
Impact: Respiratory issues for older residents; contractors on-site have ignored verbal community complaints.

2.2 Labor & Employment Disputes

These interactions tested the agent's handling of contractual violations and workplace safety retaliation.

BrightMart (Worcester): Reported by Tandy Mokwena.
Ground Truth: Unpaid overtime for shifts on March 29 and April 5.
Evidence: Witnesses identified as colleagues Lerato and Jason; discrepancy noted on the formal payslip.
Fresh Pack Logistics (Paarl): Reported by Sipho Dlamini.
Ground Truth: Shift reduction from 4–5 shifts per week to 1–2 (or none) following a safety report.
Safety Incident: A broken, cracked loading ramp reported to the team leader in late March.

2.3 Corporate Governance & Whistleblowing

The agent managed complex reports involving financial misconduct and stock management.

Northshore Housing Services (Cape Town): Anonymous Whistleblower.
Ground Truth: Potential corruption/conflict of interest regarding two invoices dated February–April.
Details: Repeated payments to a company allegedly connected personally to an internal decision-maker.
Health Services NGO: Anonymous Whistleblower.
Ground Truth: Stock discrepancies since January; records indicate inventory that is physically missing.
Details: Concerns center on site administration, specifically a senior administrator and a supply staff member.

While the data capture was comprehensive, the audit must now evaluate the linguistic fluidity and prosodic nuances that facilitated these disclosures.

3. Linguistic and Conversational Performance Audit

For a Senior Auditor, the "UX Friction" must be weighed against the agent's "Intent." While linguistic fluidity was generally high, specific mechanical failures emerged under pressure.

Multilingual Handling and Protocol

In the Worcester Labour Dispute Registry, the agent successfully managed Setswana and Afrikaans inputs. Specifically, the agent accurately identified the "BrightMart" entity and the "March 29/April 5" dates within a multilingual framework. During the exchange regarding overtime pay (referred to in the source as "Kuba tugaore o dwelei kafakokwaneteng"), the agent maintained the intake protocol without reverting to English prematurely, proving the efficacy of the cross-lingual configuration.

Assistant Prosody Analysis

The agent projected three dominant emotional states to manage caller anxiety:

Calmness (0.35 - 0.52): Effectively de-escalated the "Worried" tone of labor callers.
Concentration (0.25 - 0.40): Reinforced the perception of active listening during the delivery of reference codes.
Interest (0.25 - 0.47): Encouraged disclosure in the Health NGO interaction.

However, a gap exists between projected empathy and actual perception. In the Health Services NGO call, the caller displayed "Contempt" and "Distress." The agent’s persistent "Calmness" risked being perceived as indifference rather than support, highlighting a need for more dynamic emotional mirroring.

User Interruptions and Recovery

The agent's recovery logic showed a critical "looping" vulnerability. In the Nomsa Jacobs call at 01:17:57 PM, the agent's prosody registered Confusion (0.28) during an interruption. At 01:18:02 PM, the agent repeated the phrase "We've reported it twice to the municipality call center already..." despite the user having just finished saying the exact same thing. This mechanical repetition suggests a failure in state-tracking during rapid-fire interruptions, leading to unnecessary UX friction.
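The looping failure can be pictured as a missing check before the agent speaks its queued sentence. The sketch below is only an illustration under stated assumptions: the function names, the 0.6 threshold, and the use of plain token overlap are hypothetical and are not taken from the pilot configuration. A production system would likely need a semantic comparison instead of word matching.

```python
import re


def _tokens(text: str) -> set:
    """Lowercase word tokens; punctuation other than apostrophes is dropped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap(pending: str, user_utterance: str) -> float:
    """Jaccard overlap between the agent's queued sentence and the user's last words."""
    a, b = _tokens(pending), _tokens(user_utterance)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def should_suppress(pending: str, user_utterance: str, threshold: float = 0.6) -> bool:
    """True when the queued sentence would mostly echo what the user just said.

    The threshold is an assumed value for illustration, not a tuned one.
    """
    return overlap(pending, user_utterance) >= threshold


# Example modeled on the Jacobs call: the queued sentence echoes the caller.
queued = "We've reported it twice to the municipality call center already..."
caller = "We've reported it twice to the municipality call center already."
if should_suppress(queued, caller):
    queued = None  # drop the echo instead of speaking it
```

A check of this kind only catches literal echoes. It does not decide what the agent should say next, which would still depend on the intake protocol.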

4. Procedural Integrity and Compliance Verification

Procedural accuracy in anonymity and case referencing is the cornerstone of the system's legal and ethical validity.

Anonymity Protocol and Behavioral Markers

The agent reliably distinguished callers who needed anonymity from those who waived it.

Anonymity Supported: In the Northshore Housing report, the agent successfully validated the whistleblower's behavioral marker: "I'm limited in what I can access without drawing attention to myself."
Identifying Data Captured: For Peter Adams, the agent correctly pivoted to identity capture when the user explicitly waived anonymity.

Case Reference System

The agent consistently delivered case reference codes, which are vital for longitudinal tracking.

- REF PX9 ZVFV RN (Sewage Complaint)
- REF WBJY-EEGR (Northshore Housing)
- REF-NCYG-J8RY (Health NGO)
- REF QY 5C 6 (Fresh Pack Logistics - Initial Attempt)
- REF. 6E96E6RQ (BrightMart Dispute)
- REF K RF E2 YVP (Fresh Pack Logistics - Second Attempt)
- REFQ LU2 (Kayamandi Dust Report)

Fact Corrections and ASR Bias

A significant failure in ASR (Automatic Speech Recognition) and contextual memory occurred during the Sipho Dlamini interaction. The agent misnamed the caller four distinct times (Sifiso, Sepo, Zippo, Cippo) before the final correction. This persistence, even after the user explicitly corrected the agent, points to an ASR bias and a lack of "Contextual Memory Refresh" in the prompt's secondary layers. The agent eventually apologized and updated the record, but the repeated error significantly eroded caller confidence.
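One way to picture the missing behavior is a session-level lock on a name the caller has explicitly corrected. The following is a minimal sketch, not the pilot's implementation: the `SessionNameLock` class, its method names, and the 0.5 similarity threshold are all hypothetical, and string similarity from Python's standard `difflib` stands in for whatever phonetic matching a real ASR pipeline would use.

```python
from difflib import SequenceMatcher


class SessionNameLock:
    """Pins a caller-confirmed name for the rest of a session (hypothetical helper)."""

    def __init__(self, threshold: float = 0.5):
        # Assumed similarity cut-off; it would need tuning on real ASR output.
        self.threshold = threshold
        self.locked_name = None

    def lock(self, corrected_name: str) -> None:
        """Store the spelling the caller confirmed."""
        self.locked_name = corrected_name.strip()

    def resolve(self, heard_name: str) -> str:
        """Return the locked name when the ASR candidate looks like a variant of it."""
        if self.locked_name is None:
            return heard_name
        score = SequenceMatcher(
            None, heard_name.lower(), self.locked_name.lower()
        ).ratio()
        return self.locked_name if score >= self.threshold else heard_name


# Usage modeled on the Dlamini call: after the correction, variants map back.
names = SessionNameLock()
names.lock("Sipho")
resolved = [names.resolve(v) for v in ["Sifiso", "Sepo", "Zippo", "Cippo"]]
```

The sketch applies only to name candidates, not to whole transcripts, and a low threshold like this one could wrongly merge two genuinely different names in the same call.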

5. Evaluative Conclusion and Deployment Readiness

The Gemini 1.5 Pro agent demonstrates a robust baseline for grievance intake, particularly in its ability to handle complex multilingual entity extraction and maintain professional protocols. However, the pilot reveals that "Calmness" is not a substitute for sophisticated state-tracking and memory.

Assessment of "So What?"

While the agent is technically capable of completing reports, the "looping" behavior during interruptions and the persistent misnaming of callers suggest that it is not yet "Retaliation-Proof." In high-stakes whistleblowing, these mechanical glitches can be interpreted by the caller as a lack of system intelligence, leading to abandonment.

Strategic Recommendations for Full Deployment

Contextual Memory Refresh: Implement a high-priority memory buffer that locks in user-corrected proper nouns (e.g., Sipho) to override ASR-biased phonetics for the duration of the session.
Interrupt Logic via Latent Semantic Analysis: Enhance the interrupt recovery logic to analyze if a user’s interruption contained new information. If so, the agent must suppress its "pending" sentence to avoid the mechanical looping observed in the Jacobs call.
Proactive Multimedia Request Trigger: Standardize the trigger for evidence collection. The agent asked Nomsa for photos early in the interaction but waited until the final wrap-up for the Health NGO whistleblower. Multimedia requests should be triggered as soon as a "Physical Evidence" intent is detected.
Dynamic Prosody Mirroring: Adjust the agent's emotional output to mirror the "Distress" levels of whistleblowers, shifting from "Calmness" to "Active Sympathy" to build deeper rapport.

Final Statement: Gemini 1.5 Pro represents a transformative leap in democratizing the grievance process. By addressing the identified friction points in memory and interruption handling, this system can provide a reliable, accessible, and secure channel for all citizens, regardless of their linguistic or emotional starting point.