Here is one account of AI being shit at multi-step activities outside of coding:
I think my request of “Hey Gemini, show me a list of all the articles I wrote over the last year and arrange them into categories by subject” is a straightforward one, and I came away from this experience surprised that Gemini shipped these features as bleeding edge AI to customers when it never really delivered for me.
I have the same experience, weekly. In general, using AI for this kind of search and analytics has been bad and more time-consuming than just doing it myself.
It is especially bad when you hook it up to other services like docs and email.
It works pretty well with plain text files. I think the reason it works there is that it writes scripts to search and chunk the text. That is, it’s doing non-AI work to search the docs and other files. Perhaps it’s good at orchestrating text work like this, not at doing the finding, sorting, and formatting work itself.
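To make that concrete, here is a minimal sketch of the kind of throwaway helper script it tends to generate, assuming a folder of Markdown notes and a made-up search term (the directory and query are placeholders, not anything the tools actually name):

```python
# Hypothetical sketch: grep-style, non-AI search over plain text notes,
# returning paragraph-sized chunks that mention a query. This is the boring
# work the model delegates to a script instead of "reading" every file.
from pathlib import Path

def find_chunks(notes_dir: str, query: str):
    """Yield (file, paragraph) pairs whose text mentions the query."""
    for path in Path(notes_dir).rglob("*.md"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for para in text.split("\n\n"):
            if query.lower() in para.lower():
                yield path, para.strip()

# Placeholder directory and query for illustration only.
for path, chunk in find_chunks("notes", "reimbursement"):
    print(f"{path}: {chunk[:80]}")
```

The model then only has to look at the handful of chunks that come back, which is a much smaller job than reading everything itself.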
Here are some example scenarios.
I ask Claude Code (and now Coworker) to look through all of my journal files (plain text, in markdown with frontmatter) and reformat them to a standard format and file name, adding keywords for my emotional state.
It writes scripts to search for and find each file, then processes them. I’m not sure how it does the sentiment analysis…but that would seem to be a perfect use for AI.
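The script it writes for that step looks roughly like the sketch below. The specifics are my guesses: the directory name, the blank `mood:` field, and the idea that the model fills the mood keywords in afterwards rather than in code.

```python
# Hypothetical sketch of the normalization script: split out YAML frontmatter,
# add a blank "mood:" field for the model to fill in afterwards, and rewrite
# each entry in one standard layout. Assumes well-formed "---" delimiters.
from pathlib import Path

def split_frontmatter(text: str):
    """Return (frontmatter, body); frontmatter is '' if none is present."""
    if text.startswith("---"):
        _, fm, body = text.split("---", 2)
        return fm.strip(), body.strip()
    return "", text.strip()

def normalize(journal_dir: str):
    for path in sorted(Path(journal_dir).glob("*.md")):
        fm, body = split_frontmatter(path.read_text(encoding="utf-8"))
        path.write_text(f"---\n{fm}\nmood:\n---\n\n{body}\n", encoding="utf-8")

normalize("journal")  # directory name is a placeholder
```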
I’ve done similar things with re-arranging the plain text files in my Obsidian vault.
Meanwhile, I ask it to find all of the receipts for my last trip by searching my Gmail.
Our return flight was canceled, so we wanted to file for the EU reimbursement you’re entitled to. It did sort of OK: it found an Airbnb receipt, but it failed to find the canceled and rescheduled airline emails.
When asked “simple” questions about the policy, its answers were ambiguous.
Most tasks I do with email are like that. You get the feeling that the AI - even Gemini - does not actually search all of your email.
Here is a test you can do. Ask it to tell you the top ten people you emailed over the past ten years and the common subjects with each. If you’re like me, that’s more like 20+ years of Gmail (I imported my Yahoo! email long ago).
Usually, this is a disappointment.
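For a sense of the scale involved, here is roughly the same question as a plain script over a Gmail Takeout mbox export. The filename is whatever Takeout gives you, and this only counts From headers rather than reconstructing real conversations, so treat it as a sketch, not the “right” answer:

```python
# Sketch: count top correspondents and a sample subject for each from a
# Gmail Takeout mbox export. The file path below is a placeholder.
import mailbox
from collections import Counter, defaultdict
from email.header import decode_header, make_header
from email.utils import parseaddr

counts = Counter()
subjects = defaultdict(Counter)

for msg in mailbox.mbox("takeout-all-mail.mbox"):
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if not sender:
        continue
    raw_subject = msg.get("Subject", "") or ""
    try:
        subject = str(make_header(decode_header(raw_subject)))
    except Exception:
        subject = raw_subject  # decades of mail means plenty of broken headers
    counts[sender] += 1
    subjects[sender][subject] += 1

for sender, n in counts.most_common(10):
    top_subject = subjects[sender].most_common(1)[0][0]
    print(f"{n:6d}  {sender:40s}  e.g. {top_subject[:60]}")
```

Even this dumb local loop takes a while over a couple of decades of mail, which gives you a feel for what you’re really asking a chat app to do.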
There’s all sorts of reasons this could be happening. My guess would be that the scale of the problem is out of scope for consumer-grade Gemini and the other AI chat apps. You’re talking about looking at, probably, hundreds of thousands of emails, each one individually.
However, because it’s all on Google and they are masters at understanding content (so that they can target ads), I would expect that there are existing systems to do this. That would mean the Gemini team coordinating with the Gmail team, which is likely a lot to ask.
The good news is that this is likely an app problem. When you look at the complex orchestration systems developers use to make large apps, there is a lot going on. The same things (clearly) don’t exist in the chatbots at the moment.
Something like NotebookLM is a good, early example of an orchestration system. I don’t use NotebookLM enough to be confident in this, but I think it is all built around one use case: “help me learn this topic.” A secondary use case is “help me present this topic.”
For example, it does not do well at all with “help me play D&D.”