AI Trials: Sept Pt 2

September's Ritual Reflections

Sep 14, 2024

Abstract 3D landscape with geometric mountains in blue and red hues, overlaid with transparent boxes containing various numbers like 20, 15, 8, 60, 20, 74, 18, and 15. Floating cubes and irregular shapes are scattered throughout the scene, adding to the futuristic, data-driven visual style. — Image created with Midjourney v6.1

Preface

This post is part of a year-long initiative where I employ AI to create content about holiday traditions worldwide. The objective is to observe how various AI tools perform and improve in content creation with minimal human intervention over time. This is the 2nd of 4 (maybe 5) articles for the month of September.

Prompts and interactions with different AI models will be documented as they occur, providing insights into the methodologies, challenges, and adjustments made throughout the project.

TL;DR

September Pt 2 focuses on testing the Chain of Thought (CoT) rubrics from August as potential replacements for the current scoring system. This involved generating articles, analyzing scoring patterns, and creating comparative charts to identify noteworthy issues and trends. The new rubrics performed exceptionally well, easing my concerns about having to fix the problems with the existing rubric.

Trial Elements

AI Models

Claude 3.5 Sonnet
ChatGPT-4o

Holidays

Defence Day - September 6th - Pakistan
Brazil's Independence Day - September 7th - Brazil
Ganesh Chaturthi - September 9th - Hindu communities worldwide

Goals

Evaluate the effectiveness of the CoT rubrics from August Pt 3
Analyze scoring patterns and trends for each AI model and rubric combination
Identify improvements for future evaluation methods based on the new rubrics' performance

Shots Fired

Zero-Shot Articles

To address the scoring inconsistencies discovered in September Pt 1, I decided to put the two CoT rubrics created by Claude and GPT to the test alongside my original rubric. Using the new templates from the previous experiment, I generated zero-shot articles for the three holidays covered in the prior article.

I employed my JSON Editor role to apply each rubric in a separate session, scoring the four article versions for each holiday.

A series of line charts displaying data trends across different categories labeled with codes like SE-LD-ZP, SE-VI-SP, and holidays such as Labor Day, Vietnam's Independence Day, and Yamashita Surrender Day. Each chart shows lines in various colors (blue, green, red, and orange) representing different data sets, with the lines generally trending upward, indicating increasing values with some variation across the charts.

One-Shot Articles

For the one-shot phase, I used the same sessions as the zero-shot articles, introducing my JSON Author and XML template. I instructed the AI to create new articles, using the zero-shot articles already in the session as examples. This approach fell somewhere between a true one-shot prompt and a super prompt - if I call it a "Limbo Prompt", that makes it a thing, right?

Claude:

Maintained a tight range of scores, consistent with past experiments
Scores were notably lower than the scores it has historically provided

ChatGPT-4o:

Exhibited a wider range of scores compared to previous experiments
Produced no ties, a stark contrast to earlier challenges with tie-breaking

Few-Shot Experiment

With 12 articles in hand, I moved on to few-shot prompts. For each event, I supplied the AI with 2 of the 4 one-shot articles from the other two events as examples. So, when writing about Defence Day, the AI received 2 articles each about Brazil's Independence Day and Ganesh Chaturthi as context.
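The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the prompt or tooling actually used: the article labels are placeholders, the function name is hypothetical, and picking the first two articles per event is an assumption, since the post does not say which 2 of the 4 were chosen.

```python
EVENTS = ["Defence Day", "Brazil's Independence Day", "Ganesh Chaturthi"]

# Placeholder labels standing in for the 4 one-shot articles per event.
one_shot_articles = {
    event: [f"{event} one-shot {i}" for i in range(1, 5)] for event in EVENTS
}


def build_few_shot_examples(articles_by_event, target_event, per_event=2):
    """Collect example articles from every event except the target.

    Assumption: the first `per_event` articles of each other event are used.
    """
    examples = []
    for event, articles in articles_by_event.items():
        if event == target_event:
            continue  # never show the model an article about the target event
        examples.extend(articles[:per_event])
    return examples


shots = build_few_shot_examples(one_shot_articles, "Defence Day")
for shot in shots:
    print(shot)
```

Each few-shot article therefore sees 4 examples: 2 from each of the other two events.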

A set of four horizontal bar charts comparing final scores of different articles across various evaluation methods: Claude's Rubric scored by Claude, Claude's Rubric scored by GPT, GPT's Rubric scored by Claude, and GPT's Rubric scored by GPT. Each chart shows bars in multiple colors representing different articles with scores ranging from around 7.0 to 10.0, highlighting variations in evaluations based on rubric and scorer. The charts emphasize differences in scoring between Claude and GPT, as well as between the rubrics used.

The scores derived from Claude's rubric were distributed much like those in the previous set. Those using GPT's rubric, however, had more in common with one another than in any of the prior sets.

When creating the rubrics, I noticed that GPT had included details on comparative scoring. I had toyed in the past with letting the AI use the articles themselves to establish the scoring scale, so I was curious how it might play out. The consistency in scoring and in favored articles across both AI models suggests the instructions define crisply how to score. If that persists, it would increase confidence in comparing scores across models.

120 Shots

I know what you're thinking. I haven't produced 120 articles with these templates, and older articles would contaminate the quality of the results. It's true, I only have 48 articles.

GPT:

Here's the breakdown of the math expressed in a structured format:
2 Scorers (Claude and GPT) * 2 Prompt Types (One-Shot and Few-Shot) * 2 Templates (C and GPT) * 3 Sets (BI, DD, GC) * 1 Score per Article = 48 Total Scores

For those still reading and counting (a.k.a. Ron), I did supply 1 article as a shot for each of 12 articles, and 4 articles as shots for each of another 12, to each of 2 AIs. That is (12 × 1 + 12 × 4) × 2 = 120 shots.
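For anyone who wants to check the shot count, here is the arithmetic as a small Python sketch. The variable names are mine. The numbers come straight from the paragraph above.

```python
ARTICLES_PER_PROMPT_TYPE = 12                      # 12 one-shot and 12 few-shot articles
SHOTS_PER_ARTICLE = {"one-shot": 1, "few-shot": 4}  # examples supplied per article
AI_MODELS = 2                                      # Claude and GPT

# Shots supplied to a single AI across both prompt types.
shots_per_model = sum(
    ARTICLES_PER_PROMPT_TYPE * shots for shots in SHOTS_PER_ARTICLE.values()
)
total_shots = shots_per_model * AI_MODELS

print(shots_per_model, total_shots)  # 60 120
```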

Two side-by-side horizontal bar charts displaying the final scores of various articles, each zoomed in on the range of 7.5 to 9.5. The left chart shows scores given by Claude, while the right chart shows scores given by GPT. Each chart features multicolored bars representing different articles, highlighting score variations across the evaluators. The data emphasizes the differences and similarities in scoring judgments between Claude and GPT, with most scores clustering around the upper 8 to lower 9 range, reflecting a generally high evaluation across articles.

What I find most interesting when reviewing the 48 together is that both AIs consistently favored many of the same articles across the few-shot prompts. This preference emerged naturally, without scoring them as a single set or even in a single session.
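One way to put a number on "favored the same articles" is to compare each scorer's top-ranked articles. The sketch below is an assumption about how that could be done, not the method used for the charts, and the scores in it are made-up placeholders rather than results from this trial.

```python
def top_k(scores, k):
    """Return the k highest-scoring article IDs as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def shared_favorites(scores_a, scores_b, k=3):
    """Articles that appear in both scorers' top k."""
    return top_k(scores_a, k) & top_k(scores_b, k)


# Placeholder scores for illustration only; NOT data from this experiment.
claude_scores = {"article-A": 8.6, "article-B": 9.1, "article-C": 8.2, "article-D": 8.9}
gpt_scores = {"article-A": 8.4, "article-B": 9.3, "article-C": 8.8, "article-D": 9.0}

print(sorted(shared_favorites(claude_scores, gpt_scores, k=2)))
# ['article-B', 'article-D']
```

Because the scorers ran in separate sessions, overlap in their top picks is a more meaningful signal than agreement on the raw numbers.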

AI Articles

Insights & Observations

The Good

The new CoT rubrics performed surprisingly well.
GPT's rubric yielded consistent scoring patterns across AI models, suggesting clear instructions that enable reliable comparisons.
The final charts indicate potential for a unified rubric producing comparable scores across different AI systems.

The Bad

GPT-4o's interactive charts look great, but turn out to be useless when downloaded.

Up Next

Apply the new CoT article template to several events over the coming weeks, assessing adaptability across multi-day and multi-country events.
Refine and expand on the prompting types used in this trial, seeking the point of diminishing returns.
Refine the CoT article template by testing unused AI suggestions and the results of its application to diverse holidays and cultural events.

Additional Tools

The tools behind the articles. No affiliations.

Arc: Browser supreme
ChatGPT-4o *: Alt text & visualizations
Mermaid Chart: When it got complicated and the code got messy…
Midjourney *: Article and AI article images
Rename X *: File renaming app for Mac
Type.ai *: Text editor

Paid items indicated by *

Abstract 3D scene with floating and stacked numbered cubes in shades of blue, red, and orange, set against a glowing background. The cubes, labeled with numbers such as 15, 17, and various single digits, appear to be dynamically arranged, creating a sense of depth and movement. — Image created with Midjourney v6.1

Quiet Evolution is about experimenting and sharing insights. If you find this helpful, coffee is always appreciated (no pressure). Proceeds are used strictly to cover AI costs; any excess goes to the American Cancer Society.

Appendix:

Flowchart depicting a multi-stage evaluation process with four sections: Zero-Shot Prompts, Super Prompts, One-Shot Prompts, and Few-Shot Prompts. Each section shows two templates, ChatGPT-4o’s and Claude’s, generating articles evaluated using JSON and ChatGPT-4o rubrics, resulting in scores for different articles. The diagram visually connects the roles, prompts, articles, evaluations, and scores across each section. — September Pt 2 Workflow

Disclaimer: The views and opinions expressed in this article are solely those of the author and do not reflect the official policy or position of Amazon Web Services (AWS). The author is a UX designer at AWS and has no involvement in, nor does their work pertain to, any collaborative agreements that AWS may have with Anthropic, the creators of Claude. The insights and analyses presented here are entirely independent and unrelated to any projects or initiatives between AWS and Anthropic. All content in this post is based on publicly available interfaces and is not influenced by the author's employer.