Dear friends,

One reason for machine learning’s success is that our field welcomes a wide range of work. I can’t think of even one example where someone developed what they called a machine learning algorithm and senior members of our community criticized it saying, “that’s not machine learning!” Indeed, linear regression using a least-squares cost function was used by mathematicians Legendre and Gauss in the early 1800s — long before the invention of computers — yet machine learning has embraced these algorithms, and we routinely call them “machine learning” in introductory courses!

In contrast, about 20 years ago, I saw statistics departments at a number of universities look at developments in machine learning and say, “that’s not really statistics.” This is one reason why machine learning grew much more in computer science than statistics departments. (Fortunately, since then, most statistics departments have become much more open to machine learning.)

This contrast came to mind a few months ago, as I thought about how to talk about agentic systems that use design patterns such as reflection, tool use, planning, and multi-agent collaboration to produce better results than zero-shot prompting. I had been involved in conversations about whether certain systems should count as “agents.” Rather than having to choose whether or not something is an agent in a binary way, I thought, it would be more useful to think of systems as being agent-like to different degrees. Unlike the noun “agent,” the adjective “agentic” allows us to contemplate such systems and include all of them in this growing movement.

More and more people are building systems that prompt a large language model multiple times using agent-like design patterns. But there’s a gray zone between what clearly is not an agent (prompting a model once) and what clearly is (say, an autonomous agent that, given high-level instructions, plans, uses tools, and carries out multiple, iterative steps of processing). 

Rather than arguing over which work to include or exclude as being a true agent, we can acknowledge that there are different degrees to which systems can be agentic. Then we can more easily include everyone who wants to work on agentic systems. We can also encourage newcomers to start by building simple agentic workflows and iteratively make their systems more sophisticated. 
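
To make the idea concrete, here is a minimal sketch of one of the simplest agentic workflows: a reflection loop that prompts a model several times instead of once. It uses the OpenAI Python client purely for illustration; any chat-completion API could be substituted, and the model name and prompts are placeholders rather than recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str) -> str:
    """Single call to a chat model; swap in any LLM API you prefer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def reflective_answer(task: str, rounds: int = 2) -> str:
    """A simple reflection loop: draft, critique, revise."""
    draft = llm(f"Complete this task:\n{task}")
    for _ in range(rounds):
        critique = llm(f"Point out errors and weaknesses in this response:\n{draft}")
        draft = llm(
            f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the draft to address the critique."
        )
    return draft
```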

In the past few weeks, I’ve noticed that, while technical people and non-technical people alike sometimes use the word “agent,” mainly only technical people use the word “agentic” (for now!). So when I see an article that talks about “agentic” workflows, I’m more likely to read it, since it’s less likely to be marketing fluff and more likely to have been written by someone who understands the technology.

Let’s keep working on agentic systems and keep welcoming anyone who wants to join our field!

Keep learning,

Andrew 

A MESSAGE FROM DEEPLEARNING.AI

Grow your generative AI skills with DeepLearning.AI’s short courses! Learn how to build highly controllable agents in “AI Agents in LangGraph.” Enroll for free and get started

News

Apple’s Gen AI Strategy Revealed

Apple presented its plan to imbue its phones and computers with artificial intelligence. 

What’s new: Apple announced Apple Intelligence, a plethora of generative-AI features that integrate with iOS 18, iPadOS 18, and macOS Sequoia. The beta version of Apple Intelligence will be available in U.S. English prior to a wider rollout near the end of the year, starting with the iPhone 15 Pro and Mac computers that use M-series chips.

On-device and in the cloud: The new capabilities rely on a suite of language and vision models. Many of the models will run on-device, while workloads that require more processing power will run on a cloud powered by Apple chips. 

  • Semantic search analyzes the data on a device to better understand context such as the user’s routines and relationships. For example, if a user enters a prompt like, “Show me the files my boss shared with me the other day,” models can identify the user’s boss and the day in question.
  • Generative media capabilities are geared to fulfill preset functions. For instance, the text generator offers options to make writing more friendly, professional, or concise. Image generation focuses on tasks like making custom emojis from text prompts and turning rough sketches into polished images. 
  • Apple’s voice assistant Siri will accept text as well as voice prompts. It will also interact with apps, so Siri can, say, determine whether a meeting scheduled in the Calendar app will prevent a user from attending an event at a location designated in the Maps app. 
  • Starting later this year, Siri users will be able to converse with OpenAI’s ChatGPT without having an OpenAI account or paying a fee. Paid ChatGPT users will be able to log in for access to paid features. Apple plans to integrate other third-party large language models.
  • The underlying infrastructure is designed to maintain user privacy. Apple’s cloud won’t retain user data, and Apple won’t have privileged access to it. Queries to ChatGPT from users who are not logged into an OpenAI account will have their IP addresses masked. In addition, independent researchers can inspect the infrastructure code to verify assurances and find flaws.

How it works: Apple outlined the architecture that underpins the new features and compared two of its models against competitors.

  • All Apple models were trained on a mix of licensed, synthetic, and web-crawled data (filtered to remove personal and low-quality information). The models were fine-tuned to follow instructions via methods including reinforcement learning from human feedback. 
  • To adapt its models to specific tasks, Apple uses LoRA adapters: small sets of weights that plug into a pretrained model and adjust its weights at inference (see the sketch after this list). Adapters are included for many tasks, including summarization, proofreading, email replies, and answering questions.
  • Apple used quantization, a compression technique called low-bit palletization (also known as weight clustering), and other methods to improve speed and energy efficiency. On an iPhone 15 Pro, Apple clocked a generation rate of 30 tokens per second.
  • Apple hired human graders to test two of its models on an internal benchmark that covers tasks including brainstorming, classification, answering questions, rewriting, summarization, and safety. The graders preferred an on-device model of 3 billion parameters over Phi-3-mini, Mistral-7B, and Gemma-7B. They preferred a large language model designed to run in the cloud to DBRX-Instruct, GPT-3.5-Turbo, and Mixtral-8x22B, but not to GPT-4-Turbo.
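
The adapter approach described in the list above is straightforward to emulate with open tools. Below is a minimal sketch, not Apple’s implementation, of swapping task-specific LoRA adapters onto a shared base model using Hugging Face’s transformers and peft libraries; the model and adapter names are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "example-org/base-3b-instruct"  # hypothetical small instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Attach one adapter, then register another; each adapter adds only a small set of weights.
model = PeftModel.from_pretrained(
    base, "example-org/lora-summarization", adapter_name="summarization"
)
model.load_adapter("example-org/lora-proofreading", adapter_name="proofreading")

# Switch adapters per task at inference time; the base weights stay fixed.
model.set_adapter("proofreading")
inputs = tokenizer(
    "Please proofread: Their going to the store tomorow.", return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```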

Behind the news: While rivals like Microsoft and Google dove into generative AI, Apple moved more cautiously. During the 2010s, it invested heavily in its Siri voice assistant, but the technology was outpaced by subsequent developments. Since then, the famously secretive company has been perceived as falling behind big-tech rivals in AI.

Why it matters: While Apple’s big-tech competitors have largely put their AI cards on the table, Apple has held back. Now its strategy is on display: Proprietary foundation models, LoRA to fine-tune them to specific tasks, emphasis on the user experience over raw productivity, judicious use of edge and cloud computing, and deals with other model makers, all wrapped up in substantial privacy protections.
 
We’re thinking: Apple’s control over its product ecosystem gives the company an extraordinary distribution channel. That’s why Google reportedly paid Apple $20 billion in 2022 to provide the default search engine in Apple’s Safari web browser. This advantage means that, whatever its pace of development and strategy in AI, Apple’s competitive edge remains sharp.


Audio Generation Clear of Copyrights

Sonically minded developers gained a high-profile text-to-audio generator. 

What’s new: Stability AI released Stable Audio Open, which takes text prompts and generates 16kHz-resolution music or sound effects. The model’s code and weights are available for noncommercial use. You can listen to a few sample outputs here.

How it works: Stability AI promotes Stable Audio Open for generating not full productions but elements that will be assembled into productions. Although it’s similar to the earlier Stable Audio 2.0, it has important differences.

  • Stable Audio Open is available for download (a brief usage sketch follows this list). In contrast, Stable Audio 2.0 is available via API or web user interface.
  • The new model accepts only text input, while Stable Audio 2.0 accepts text or audio. It generates stereo clips up to 47 seconds long, rather than Stable Audio 2.0’s three minutes.
  • Its training dataset was drawn from open source audio databases that anyone can use without paying royalties. In contrast, Stable Audio 2.0 was trained on a commercial dataset.
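
Because the weights are downloadable, the model can be run locally. Here is a rough sketch based on the StableAudioPipeline integration in recent versions of Hugging Face’s diffusers library; the argument names follow the documented example but may differ across versions, and the prompt is just an illustration.

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the released weights (noncommercial license) from Hugging Face.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

# Text-only conditioning; the model produces short stereo clips, not full tracks.
audio = pipe(
    "warm analog synth arpeggio, 120 bpm",
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=20.0,
).audios

# Write the first waveform to disk at the pipeline's native sampling rate.
sf.write("arpeggio.wav", audio[0].T.float().cpu().numpy(), pipe.vae.sampling_rate)
```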

Behind the news: Stable Audio Open competes not only with Stable Audio 2.0 but also with a handful of recent models. ElevenLabs, known for voice cloning and generation, introduced Sound Effects, which generates brief sound effects from a text prompt. Users can input up to 10,000 prompt characters with a free account. For music generation, Udio and Suno offer web-based systems that take text prompts and generate structured compositions including songs with lyrics, voices, and full instrumentation. Users can generate a handful of compositions daily for free.

Why it matters: Stable Audio Open is pretrained on both music and sound effects, and it can be fine-tuned and otherwise modified. The fact that its training data was copyright-free guarantees that users won’t make use of proprietary sounds — a suitable option for those who prefer to steer clear of the music industry’s brewing intellectual property disputes.

We’re thinking: We welcome Stability AI’s latest contribution, but we don’t consider it open source. Its license doesn’t permit commercial use and thus, as far as we know, doesn’t meet the definition established by the Open Source Initiative. We urge the AI community toward greater clarity and consistency with respect to the term “open source.” 


Seoul AI Summit Spurs Safety Agreements

At meetings in Seoul, government and corporate officials from dozens of countries agreed to take action on AI safety.

What’s new: Attendees at the AI Seoul Summit and AI Global Forum, held concurrently in Seoul, formalized broad-strokes agreements to govern AI, The Guardian reported. Presented as a sequel to November’s AI summit at Bletchley Park outside London, the meetings yielded several multinational declarations and commitments from major tech firms.

International commitments: Government officials hammered out frameworks for promoting innovation while managing risk.

  • 27 countries and the European Union agreed to jointly develop risk thresholds in coming months. Thresholds may include a model’s ability to evade human oversight or help somebody create weapons of mass destruction. (Representatives from China didn’t join this agreement.)
  • 10 of those 27 countries (Australia, Canada, France, Germany, Italy, Japan, the Republic of Korea, the Republic of Singapore, the United Kingdom, and the United States) and the European Union declared a common aim to create shared policies while encouraging AI development. 
  • In a separate statement, those 10 nations and the EU laid out more specific goals including exchanging information on safety tests, building an international AI safety research network, and expanding AI safety institutes beyond those currently established in the U.S., UK, Japan, and Singapore.

Corporate commitments: AI companies agreed to monitor their own work and collaborate on further measures.

  • Established leaders (Amazon, Google, IBM, Meta, Microsoft, OpenAI, Samsung) and startups (Anthropic, Cohere, G42, Inflection, xAI) were among 16 companies that agreed to evaluate advanced AI models continually for safety risks. They agreed to abide by clear risk thresholds developed in concert with their home governments, international agreements, and external evaluators. If they deem that a model has surpassed a threshold, and that risk can’t be mitigated, they agreed to stop developing that model immediately.
  • 14 companies, including six that didn’t sign the agreement on risk thresholds, committed to collaborate with governments and each other on AI safety, including developing international standards.

Behind the news: Co-hosted by the UK and South Korean governments at the Korea Advanced Institute of Science and Technology, the meeting followed an initial summit held at Bletchley Park outside London in November. The earlier summit facilitated agreements to create AI safety institutes, test AI products before public release, and create an international panel akin to the Intergovernmental Panel on Climate Change to draft reports on the state of AI. The panel published an interim report in May. It will release its final report at the next summit in Paris in November 2024.

Why it matters: There was a chance that the Bletchley Park summit would be a one-off. The fact that a second meeting occurred is a sign that public and private interests alike want at least a seat at the table in discussions of AI safety. Much work remains to define terms and establish protocols, but plans for future summits indicate a clear appetite for further cooperation. 

We’re thinking: Andrew Ng spoke at the AI Global Forum on the importance of regulating applications rather than technology and chatted with many government leaders there. Discussions focused at least as much on promoting innovation as mitigating hypothetical risks. While some large companies continued to lobby for safety measures that would unnecessarily impede dissemination of cutting-edge foundation models and hamper open-source and smaller competitors, most government leaders seemed to give little credence to science-fiction risks, such as AI takeover, and express concern about concrete, harmful applications like the use of AI to interfere with democratic elections. These are encouraging shifts!


The LLM Will See You Now

A critical step in diagnosing illnesses is a conversation between doctor and patient to assemble a medical history, discuss approaches to managing symptoms, and so on. Can a large language model play the doctor’s role? Researchers trained one to do surprisingly well.

What's new: Articulate Medical Intelligence Explorer (AMIE), a chatbot built by Google researchers Tao Tu, Anil Palepu, Mike Schaekermann, and colleagues, showed better diagnostic ability and bedside manner than doctors in conversations with patients. The conversations covered a range of complaints including cardiovascular, respiratory, gastrointestinal, neurological, urological, obstetric, and gynecological conditions.

Key insight: A pretrained LLM that’s fine-tuned on conversations between doctors and patients can learn to mimic the doctor’s role. However, such models are limited because available datasets of real-world medical conversations don’t cover the full range of medical scenarios and include ambiguities, interruptions, implicit references and the like, posing difficulties for learning. Conversations generated by a pretrained LLM can cover more conditions in more articulate language. After fine-tuning on real-world conversations, further tuning on generated conversations can improve performance. In addition, after a conversation, critiquing the “doctor’s” performance can improve its ability to render diagnoses, suggest plans for managing symptoms, empathize with patients, and otherwise perform its role.

How it works: The authors fine-tuned a pretrained PaLM-2 on medical multiple-choice questions that describe symptoms, possible causes, and evidence for the correct diagnosis, as well as datasets for tasks like summarizing and continuing medical dialogs. They further fine-tuned the model on its own output, generated as described below (and sketched in code after the list).

  • Given a medical condition, the authors searched the web to retrieve background information about symptoms, management, and patient demographics. Using that information, they prompted PaLM-2 to generate a patient scenario like scenarios used to assess real-world medical interviewing skills. 
  • The authors prompted separate instances of PaLM-2 to play doctor and patient. They fed the generated scenario to the patient and prompted the models to produce a conversation. After each turn, a third instance of PaLM-2 decided whether the conversation was over based on whether the doctor had given a diagnosis and the patient had further questions (or either had said “goodbye”). 
  • Given the generated conversation, a fourth instance of PaLM-2 generated a critique of the doctor model’s empathy, professionalism, repetition, conversation flow, factual accuracy, and whether the doctor had asked questions that led to a diagnosis.  
  • Given the critique, the doctor initiated a second iteration of its conversation with the patient. 
  • The authors fine-tuned PaLM-2 to predict the next token in the second conversation. Then they repeated the process from the beginning a number of times, generating fresh conversations and fine-tuning the model. 
  • At inference, users conversed with the doctor model. Once the conversation was complete, the authors prompted the model to list 10 potential diagnoses.
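
A minimal sketch of the self-play loop described in the steps above (not the authors’ code): separate prompted instances of one model play doctor, patient, moderator, and critic, and the critiqued, revised conversation becomes fine-tuning data. The OpenAI client stands in for the fine-tuned PaLM-2, the prompts are simplified placeholders, and the revision step is compressed into a single call.

```python
from openai import OpenAI

client = OpenAI()  # stand-in for the fine-tuned PaLM-2 instances in the paper

def llm(role_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": role_prompt}],
    )
    return response.choices[0].message.content

def improved_dialog(scenario: str, max_turns: int = 20) -> str:
    transcript = ""
    for _ in range(max_turns):
        # Separate instances play doctor and patient.
        doctor = llm(
            f"You are a physician interviewing a patient.\nTranscript:\n{transcript}\nYour next turn:"
        )
        patient = llm(
            f"You are a patient in this scenario: {scenario}\nTranscript:\n{transcript}\n"
            f"Doctor: {doctor}\nYour next turn:"
        )
        transcript += f"Doctor: {doctor}\nPatient: {patient}\n"
        # A third instance decides whether the conversation is over.
        done = llm(
            "Has the doctor given a diagnosis and the patient no further questions? "
            f"Answer yes or no.\n{transcript}"
        )
        if done.strip().lower().startswith("yes"):
            break
    # A fourth instance critiques the doctor's performance.
    critique = llm(
        f"Critique the doctor's empathy, accuracy, conversation flow, and diagnostic questioning:\n{transcript}"
    )
    # The critique drives a second, improved conversation, which becomes fine-tuning data.
    return llm(
        f"Scenario: {scenario}\nConversation:\n{transcript}\nCritique:\n{critique}\n"
        "Rewrite the conversation so the doctor addresses the critique."
    )
```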

Results: Specialist physicians evaluated the doctor model’s performance in 149 conversations with human actors who played the roles of patients based on scenarios supplied by clinical providers. They compared the model’s output with that of 20 primary care physicians, based on the physicians’ own conversations with the actors.

  • The model included the correct diagnosis among its top three in about 90 percent of cases. The physicians included the correct diagnosis among their top three in 77 percent of the scenarios.
  • Specialist physicians also rated the conversations on 32 subjective qualities including relationship fostering, responding to emotions, understanding patient concerns, and explaining relevant information accurately. AMIE rated higher on 28 of the 32 qualities. For instance, the specialists judged AMIE’s responses to emotions favorable or very favorable about 83 percent of the time, versus 31 percent of the time for the primary care physicians.
  • The actors also rated the conversations they had with AMIE and the physicians on 26 qualities including whether the interviewer explained the condition and treatment, appeared honest and trustworthy, expressed caring and commitment, and valued the patient as a person. AMIE outperformed the physicians on 24 of those 26 qualities. For instance, the actors said that AMIE valued them as people 79 percent of the time, while the physicians did so 59 percent of the time.

Why it matters: LLMs can generate fine-tuning data that improves their own performance. By drawing on relevant, factually correct medical information from the web, LLMs can generate realistic conversations at scale, even in a highly technical, high-stakes discipline like medicine and despite the risk of dangerous hallucinations. Used as fine-tuning data, this output enables LLMs to converse with humans more effectively.

We're thinking: AI promises to spread intelligence far and wide. As the authors acknowledge, further work remains to demonstrate this work’s efficacy, ethics, security, and regulatory compliance in a clinical setting. Yet it’s an exciting glimpse of a world in which medical intelligence is fast, cheap, and widely available.
