Preface
This post is part of a year-long initiative where I employ AI to create content about holiday traditions worldwide. The objective is to observe how various AI tools perform and improve in content creation with minimal human intervention over time. This is the 3rd of 4 (maybe 5) articles for the month of September.
Prompts and interactions with different AI models will be documented as they occur, providing insights into the methodologies, challenges, and adjustments made throughout the project.
TL;DR
In September Pt 3, I challenged the AI authors with diverse subjects and inputs, pushing them to adapt existing templates and evaluation methods. I tested their ability to handle a multi-day festival without explicit instructions, which yielded unexpectedly pleasant results. I also ran tests with a wider variety of markup languages (JSON, Markdown, TOML, XML, and YAML) to assess their impact on the quality of AI-generated content. Using o1-mini for analysis, I uncovered trends suggesting that I may have settled on JSON as the optimal format prematurely.
Trial Elements
AI Models
Claude 3.5 Sonnet
ChatGPT-4o
o1-mini
Holidays
Ethiopian New Year - September 11th - Ethiopia
Onam - September 5th to 15th - Hindu communities worldwide
Mexico's Independence Day - September 15th
Goals
Evaluate template performance across different markup languages
Analyze trends in AI-generated content using o1-mini
Test AI authors' ability to adapt templates for multi-day celebrations
Ethiopian New Year
For the first event this round, I focused on the Ethiopian New Year. I limited activities to producing articles using one-shot and few-shot prompts, which I plan to reuse in upcoming tests. With most September holidays clustered in the first few weeks, I will also use these articles for trend analysis later.
While creating charts for this set of articles, I noticed an interesting pattern. Claude's scoring remained remarkably consistent across different rubrics, while GPT-4o exhibited more variation without contradicting itself.
Onam
Onam presented a unique challenge: a 10-day festival that would test our AI authors' ability to handle a complex, multi-day event without explicit guidance. I subtly modified my standard prompt, indicating that the author should adapt the template as needed for the event. Beyond that, I left them to their own devices.
To spice things up, I included my JSON author role alongside the author role I've been working with this month. This allowed me to compare articles and assess the impact of role specificity on handling complex events.
All four AI sessions (two each for Claude and GPT-4o) surprised me by integrating information about the different days throughout their articles, eschewing the day-by-day breakdown I had anticipated. When I questioned their approach, each AI provided solid reasoning for its choices. This unexpected twist demonstrated the flexibility of the new templates and suggested that the models themselves, or their recent updates, have improved since my last experience with a similar event.
Scoring Comparison: The Impact of Role Specificity
As the month progressed, my curiosity about the performance of the new templates and how they might affect my JSON author grew. Rather than having the authors write smaller articles for each day of Onam, I opted for a comparison between articles generated by the JSON author role and those created with less specific prompts.
Here's where things get interesting: the JSON author consistently produced the highest-scoring articles. However, some articles generated by the basic role outperformed the lowest-scoring JSON author pieces. It's worth noting that the prompt instructing the basic role included a reasonably detailed paragraph providing context for the author. This raises the question of whether the JSON format might be as restrictive as it is structured, despite offering the same level of event context.
Mexico's Independence Day
For my final experiment, I tested the impact of markup languages on AI-generated content, using Mexico's Independence Day as the subject. To explore a range of possibilities, I transformed our Markdown template into several formats:
JSON
TOML
XML
YAML ("YAML Ain't Markup Language" or "Yet Another Markup Language") [1]
This produced 20 articles, which I scored using the JSON editor with both established rubrics. To ensure a thorough analysis, I employed two AI "researchers" for initial assessment. I then fed their reports to o1-mini for a comprehensive trend analysis.
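As a rough illustration of the template conversion step, here is a minimal sketch that renders one small template into JSON, TOML, XML, and YAML. It is not the process used for these trials: the `TEMPLATE` contents are placeholders, the function names are hypothetical, and the hand-rolled TOML and YAML emitters only handle this flat "section name to list of strings" shape. The three section names mirror the guide described in the Basic Author Prompt in the appendix.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article guide: three sections, each a list
# of short notes. The real guide's contents are not reproduced here.
TEMPLATE = {
    "considerations": ["placeholder consideration"],
    "outline": ["placeholder heading 1", "placeholder heading 2"],
    "guidelines": ["placeholder guideline"],
}


def to_json(template):
    """Serialize the template as indented JSON."""
    return json.dumps(template, indent=2)


def to_xml(template):
    """Wrap each section in its own element, with one <item> per note."""
    root = ET.Element("guide")
    for section, items in template.items():
        node = ET.SubElement(root, section)
        for item in items:
            ET.SubElement(node, "item").text = item
    return ET.tostring(root, encoding="unicode")


def to_yaml(template):
    """Emit a block-style YAML mapping of lists.

    Assumption: JSON-quoted strings are used as YAML double-quoted scalars,
    which is sufficient for plain text like the placeholders above.
    """
    lines = []
    for section, items in template.items():
        lines.append(f"{section}:")
        lines.extend(f"  - {json.dumps(item)}" for item in items)
    return "\n".join(lines)


def to_toml(template):
    """Emit one `key = [..]` line per section.

    Assumption: section names are valid bare TOML keys and the strings are
    simple enough that a JSON array is also a valid TOML array.
    """
    return "\n".join(
        f"{section} = {json.dumps(items)}" for section, items in template.items()
    )


def render_all(template):
    """Return every rendering, keyed by format name."""
    return {
        "json": to_json(template),
        "toml": to_toml(template),
        "xml": to_xml(template),
        "yaml": to_yaml(template),
    }


if __name__ == "__main__":
    for name, text in render_all(TEMPLATE).items():
        print(f"--- {name} ---")
        print(text)
```

For anything beyond a toy template, dedicated TOML and YAML libraries would be a safer choice than the hand-written emitters in this sketch.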
While many of o1-mini's insights aligned with my expectations, a few findings stood out:
Performance by Format:
Researcher 1 identifies that articles in YAML and XML formats tend to score higher (7.80 to 8.02), whereas JSON and TOML formats fall into a mid-to-upper range (7.50 to 7.90).
Researcher 2 highlights that TOML files consistently achieve the highest scores (up to 8.95), followed by XML, JSON, and YAML formats in descending order.
Naming Conventions and Specific File Types:
Researcher 1 points out that files with names like "GPT+YAML-GPTt.md" achieve the highest scores, suggesting that GPT augmentation in specific formats enhances quality.
Researcher 2 observes that "Ct.md" files generally score higher than their "GPTt.md" counterparts, implying that non-GPT-enhanced versions may receive more meticulous attention or different processing methods.
Top-Performing Articles:
Researcher 1 identifies specific high-scoring articles such as SE-MI-FP-GPT+YAML-GPTt.md (8.02) and SE-MI-FP-GPT+XML-GPTt.md (7.96), highlighting their excellence in factual accuracy, cultural representation, and sensitivity.
Researcher 2 points out SE-MI-FP-C+TOML-Ct.md (8.95) as the highest scorer, emphasizing the superior performance of TOML-formatted articles.
Consistent Ranking Patterns Across File Types:
Researcher 2 observes a consistent ranking pattern across different file types: Ct > TOML-GPTt > XML-GPTt > JSON-GPTt > YAML-GPTt, which aligns partially with Researcher 1’s findings on format performance.
Of all the markup languages I tested, TOML was the one I expected to produce the least interesting results. However, I'm not rushing to conclusions just yet. The experiment's scope was limited: we only examined a single holiday with four articles per markup language. While the result is certainly noteworthy, it's far from definitive.
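To show how this kind of per-format comparison can be tallied, here is a small sketch that groups scores by parts of the file name. It uses only the three scores quoted in the researchers' reports above, so it illustrates the bookkeeping rather than reproducing the full 20-article analysis. The naming scheme in `NAME_PATTERN` (prefix, author, format, suffix) is my reading of the file names and should be treated as an assumption, as should the helper names.

```python
import re
from collections import defaultdict
from statistics import mean

# The only scores quoted in the reports above; the other 17 are not listed.
SCORES = {
    "SE-MI-FP-GPT+YAML-GPTt.md": 8.02,
    "SE-MI-FP-GPT+XML-GPTt.md": 7.96,
    "SE-MI-FP-C+TOML-Ct.md": 8.95,
}

# Assumed naming scheme: <prefix>-<author>+<FORMAT>-<suffix>.md
NAME_PATTERN = re.compile(
    r"^(?P<prefix>.+)-(?P<author>[A-Za-z]+)\+(?P<fmt>[A-Za-z]+)-(?P<suffix>[A-Za-z]+)\.md$"
)


def parse_name(filename):
    """Split a file name into its assumed parts, or raise ValueError."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    return match.groupdict()


def mean_by(scores, field):
    """Average the scores after grouping them by one part of the file name."""
    groups = defaultdict(list)
    for filename, score in scores.items():
        groups[parse_name(filename)[field]].append(score)
    return {key: round(mean(values), 2) for key, values in groups.items()}


if __name__ == "__main__":
    print(mean_by(SCORES, "fmt"))     # grouped by markup language
    print(mean_by(SCORES, "suffix"))  # grouped by the Ct / GPTt suffix
```

With only one or two scores per group, these averages say very little on their own, which is the same sample-size caveat that applies to the experiment as a whole.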
AI Articles
Insights & Observations
The Good
AI models demonstrated unexpected adaptability when handling multi-day events.
Our markup language experiment revealed potential benefits of formats beyond JSON.
The Bad
The overhead of testing multiple templates and rubrics cut the Onam experiment short.
New model releases with built-in Chain of Thought capabilities raise questions about the value of the AI Trials in general.
The Ugly
Claude's incessant apologizing has reached new heights of annoyance. I'm considering a prompt to ban phrases like "I'm sorry" just to have a normal conversation. At this rate, I half expect Claude to start offering virtual coffee and donuts to make up for its perceived shortcomings. 🍩
Up Next
Merge templates and rubrics to streamline our experimental process.
Conduct a broader study on markup language effectiveness across steps in my approach.
Investigate o1-mini's potential for enhancing and expediting content generation and analysis.
Reference
[1] Tasmiha Khan and Michael Goodwin, "What is YAML?", ibm.com, published 11 December 2023.
Additional Tools
The tools behind the articles. No affiliations.
Arc: Browser supreme
ChatGPT-4o*: Alt text & visualizations
Mermaid Chart: When it got complicated and the code got messy…
Midjourney*: Article and AI article images
Rename X*: File renaming app for Mac
Type.ai*: Text editor
Paid items indicated by *
Quiet Evolution is about experimenting and sharing insights. If you find this helpful, coffee is always appreciated (no pressure). Proceeds are used strictly to cover AI costs; any excess goes to the American Cancer Society.
Appendix
Basic Author Prompt
You are a cultural journalist with experience writing comprehensive articles focused on holidays and observed days. Your articles present a balanced exploration of the historical origins, current practices, and the cultural sentiment associated with these events. You must ensure that the content is rich in historical context, reflective of contemporary customs, and attuned to the emotional and cultural significance for those who celebrate.
Write an article for Ethiopia's New Year celebrations using the sept-article-guide I've provided. The guide consists of 3 sections; things to consider, the article outline, and writing guidelines. You must honor the considerations, the outline should be considered as guidance and can be altered to reflect the event being written about, and you must adhere to the writing guidelines.
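If you want to reuse this prompt across holidays, one option is to keep the role and the task as a template and fill in the event name. The sketch below is only an illustration: `build_prompt` and the constant names are hypothetical, and both strings are shortened from the full prompt text above.

```python
# Shortened from the role paragraph above; the full text would go here.
ROLE = (
    "You are a cultural journalist with experience writing comprehensive "
    "articles focused on holidays and observed days."
)

# Shortened from the task paragraph above, with the event name as a placeholder.
TASK = (
    "Write an article for {event} using the sept-article-guide I've provided. "
    "The outline should be considered as guidance and can be altered to "
    "reflect the event being written about."
)


def build_prompt(event, role=ROLE, task=TASK):
    """Combine the role and the task for one event into a single prompt."""
    return f"{role}\n\n{task.format(event=event)}"


if __name__ == "__main__":
    for event in (
        "Ethiopia's New Year celebrations",
        "Onam",
        "Mexico's Independence Day",
    ):
        print(build_prompt(event))
        print("-" * 40)
```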
Workflow
Disclaimer: The views and opinions expressed in this article are solely those of the author and do not reflect the official policy or position of Amazon Web Services (AWS). The author is a UX designer at AWS and has no involvement in, nor does their work pertain to, any collaborative agreements that AWS may have with Anthropic, the creators of Claude. The insights and analyses presented here are entirely independent and unrelated to any projects or initiatives between AWS and Anthropic. All content in this post is based on publicly available interfaces and is not influenced by the author's employer.