Apple Says Generative AI Isn’t Good At Math - Forbes

Apple CEO Tim Cook responding to a question on how much 1+1 is.

OBSERVATIONS FROM THE FINTECH SNARK TANK

Conclusions from a new Apple study might make consumers rethink using ChatGPT, and other Generative AI tools, to get financial advice. They should also temper the plans of bank and credit union executives to use artificial intelligence (AI) to offer financial advice and guidance to consumers.

Americans Use Generative AI To Get Financial Advice

A survey from the Motley Fool revealed some surprising—and, frankly, hard to believe—statistics about Americans’ use of the Generative AI tool ChatGPT for financial advice. The study found that:

54% of Americans have used ChatGPT for finance recommendations. Six in 10 Gen Zers and Millennials, half of Gen Xers, and a third of Baby Boomers said they’ve received recommendations for at least one of eight financial products. Credit cards and checking accounts—cited by 26% and 23% of respondents, respectively—were the products most frequently asked about.
Half of consumers said they would use ChatGPT to get a recommendation. That said, few expressed interest in getting a recommendation for most products. For example, 25% said they’d want a recommendation from ChatGPT for a credit card, and the percentages go down from there.
Respondents were “somewhat satisfied” with ChatGPT’s recommendations. On a 5-point scale (1=not satisfied, 5=very satisfied), the average overall satisfaction rating was 3.7, ranging from 3.6 from Gen Zers and Baby Boomers to 3.8 from Millennials and 3.9 from Gen Xers.

According to the study, the most important factors determining consumers’ use of ChatGPT to find financial products are: 1) the performance and accuracy of the recommendations; 2) the ability to understand the logic behind the recommendations; and 3) the ability to verify the information the recommendation is based on.

But are the performance, accuracy, and, very importantly, logic behind ChatGPT’s recommendations sound? Apple’s report casts some doubt.

Generative AI Falls Short on Mathematical Reasoning

Generative AI tools can do lots of amazing things, but, as a new report from researchers at Apple demonstrates, large language models (LLMs) have some troubling limitations with “mathematical reasoning.” The Apple researchers concluded:

“Current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops across all models. Importantly, we demonstrate that LLMs struggle even when provided with multiple examples of the same question or examples containing similar irrelevant information. This suggests deeper issues in their reasoning processes that cannot be easily mitigated through few-shot learning or fine-tuning.”
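
To make that finding concrete, here is a minimal sketch in Python of the kind of test the researchers describe: the same word problem with and without a single clause that looks relevant but isn't. The problem text, names, and numbers are hypothetical and are not taken from the Apple study.

```python
# Hypothetical word problem; the names and numbers are made up for illustration.
BASE = (
    "Maria deposits $40 on Friday and $60 on Saturday. "
    "On Sunday she deposits twice what she deposited on Friday. "
)
# A clause that sounds relevant but does not change the total.
DISTRACTOR = "Five of Sunday's bills were older than the rest. "
QUESTION = "How much did Maria deposit in total?"


def build_prompt(with_distractor: bool) -> str:
    """Return the word problem, optionally with the irrelevant clause added."""
    return BASE + (DISTRACTOR if with_distractor else "") + QUESTION


def correct_answer(friday: int, saturday: int) -> int:
    """Ground truth: Friday + Saturday + (2 x Friday). The distractor plays no role."""
    return friday + saturday + 2 * friday


def is_correct(model_reply: int, friday: int = 40, saturday: int = 60) -> bool:
    """Score a model's numeric reply against the ground truth."""
    return model_reply == correct_answer(friday, saturday)


print(build_prompt(with_distractor=True))
print(correct_answer(40, 60))
```

The right answer is the same for both versions of the prompt. A model that is thrown off by the extra clause, for example by subtracting the five "older" bills, would be scored as wrong by `is_correct`.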

A recent TechCrunch article documented some of the seemingly simple mathematical calculations that LLMs get wrong. As the publication wrote, “Claude can’t solve basic word problems, Gemini fails to understand quadratic equations, and Llama struggles with straightforward addition.”

Why can’t LLMs do basic math? The problem, according to TechCrunch, is tokenization:

“The process of dividing data up into chunks (e.g., breaking the word “fantastic” into the syllables “fan,” “tas,” and “tic”), tokenization helps AI densely encode information. But because tokenizers — the AI models that do the tokenizing — don’t really know what numbers are, they frequently end up destroying the relationships between digits. For example, a tokenizer might treat the number “380” as one token but represent “381” as a pair of digits (“38” and “1”).”
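
The “380” versus “381” example can be reproduced with a toy tokenizer. This is only a sketch: the vocabulary below is invented for illustration, and the greedy longest-match rule is a simplification of what production tokenizers actually do.

```python
# Hypothetical vocabulary: it happens to contain "380" but not "381".
VOCAB = {"380", "38", "3", "8", "0", "1"}


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("380", VOCAB))  # ['380']
print(tokenize("381", VOCAB))  # ['38', '1']
```

Two numbers that differ by one end up with different token boundaries, so the model never sees them as neighbors on a number line.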

Machine Learning Has A Problem, As Well

Annoyingly, a lot of people use the term “machine learning” when referring to regression analysis or some other form of statistical analysis. According to the University of California at Berkeley, machine learning has three components:

A decision process. In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labeled or unlabeled, your algorithm will produce an estimate about a pattern in the data.
An error function. An error function evaluates the prediction of the model. If there are known examples, an error function can make a comparison to assess the accuracy of the model.
A model optimization process. If the model can fit better to the data points in the training set, then weights are adjusted to reduce the discrepancy between the known example and the model estimate. The algorithm will repeat this iterative “evaluate and optimize” process, updating weights autonomously until a threshold of accuracy has been met.
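
The three components can be seen together in a very small example. The sketch below is an illustration under assumptions, not Berkeley's code: the data is made up, and the update rule (a simple perceptron) is just one of many ways to run the “evaluate and optimize” loop.

```python
# Toy data (made up): small numbers are class -1, large numbers are class +1.
XS = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
YS = [-1, -1, -1, 1, 1, 1]


def predict(w: float, b: float, x: float) -> int:
    """Decision process: classify x as +1 or -1."""
    return 1 if w * x + b > 0 else -1


def error_rate(w: float, b: float) -> float:
    """Error function: share of known examples the model gets wrong."""
    wrong = sum(predict(w, b, x) != y for x, y in zip(XS, YS))
    return wrong / len(XS)


def train(max_epochs: int = 100, target_error: float = 0.0) -> tuple[float, float]:
    """Model optimization: adjust weights until the error target is met."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        if error_rate(w, b) <= target_error:
            break  # accuracy threshold reached
        for x, y in zip(XS, YS):
            if predict(w, b, x) != y:
                # Perceptron update: nudge the weights toward the right answer.
                w += y * x
                b += y
    return w, b


w, b = train()
print(error_rate(w, b))
```

Nobody sets the weights by hand here. The loop keeps evaluating and adjusting on its own until the error target is met or the epoch limit runs out.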

Regression analysis and most other forms of statistical analysis lack a model optimization process.

Here’s the real-world problem: While “investment” results are generally trackable, “spending” results are not. For the vast majority of people, however, how they spend is a bigger determinant of their financial performance than investing is.

The other challenge here is that we don’t simply spend to optimize our financial performance. We spend to optimize our emotional performance. How is a machine learning model going to track that?

AI Is Not Ready For Prime Time In Financial Advice

Providing financial advice and guidance is not a simple, straightforward task: the set of instructions needed to do it requires many “clauses.” In other words, the goals and objectives behind financial advice and guidance are complex, and it’s exactly these complex questions and instructions that Generative AI tools are not good at (according to Apple).

Bottom line: Banks and credit unions shouldn’t rely on AI to provide financial advice and guidance—right now. Maybe someday, but not now, and not for another 5, maybe 10, years. If vendors claim they’re using machine learning, ask them about their model optimization process. If they claim to have a large language model, ask them how it overcomes math computation limitations.
