What AI Can Do in 2026: 3 Tasks and Where It Breaks

Ask what AI can actually do and you will get two stock answers. The industry offers a capability list and a product roadmap. Critics offer a catalogue of failures and a reason to wait. Both skip the question that matters most: not what the technology can do in principle, but under what conditions it is reliably right, and how you would know which condition you are in. That answer is more predictable than three years of breathless coverage suggests, and it has almost nothing to do with how sophisticated the model is.

Not one thing – the three task types

The question “what can AI actually do?” assumes AI is a single thing with consistent capabilities across tasks. It is not.

There are three broad types of task, each with a different reliability profile.

The first is pattern-matching over well-represented ground. Give an AI a task it has seen many variants of (summarising text, generating code for common functions, answering questions with established answers) and it is often excellent. Not perfect. Often excellent.

The second is tasks at the edge of its training distribution: novel combinations, domain-specific judgement calls, questions where the correct answer requires causal reasoning rather than pattern recall. Here reliability drops sharply and unpredictably. The model sounds equally confident whether it is right or wrong.

The third is what researchers call the jagged frontier: tasks that look easy but sit in a blind spot, and tasks that look hard but happen to align perfectly with training data. Your intuitions about what AI should struggle with are often wrong.

Understanding which type of task you are handing to an AI is more useful than any benchmark score.

The jagged frontier explained

A study of consultants using AI produced a result worth examining. Researchers from BCG and Harvard tracked performance across a range of realistic consulting tasks and found that AI-assisted workers outperformed unassisted peers on complex analysis while making systematic errors on tasks that appeared routine. The performance curve did not slope from easy to hard as expected. It spiked and dropped in ways that had little connection to how difficult the tasks appeared.

Training data alignment predicts reliability. Task complexity does not. Ask an AI to summarise a dense financial report and it handles it well, because its training included thousands of similar documents. Ask it to count occurrences of a specific letter in a word and it can fail, because language models process text as tokens rather than individual letters and do not learn to track that pattern. The first task looks harder. The second looks like nothing.
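For contrast, the letter-counting task is trivial for ordinary deterministic code, which is part of why it looks like nothing. A minimal Python sketch, where the function name `count_letter` and the example word are illustrative choices rather than anything taken from the study:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    # str.count works character by character, so there is no
    # tokenisation step that could hide individual letters.
    return word.lower().count(letter.lower())


# Illustrative example word (an assumption, not from the source).
print(count_letter("strawberry", "r"))  # prints 3
```

When a task reduces to something like this, handing it to a plain function rather than a language model may be the more dependable choice.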

A benchmark score tells you average performance across a curated test set. It does not tell you which side of the frontier a specific task sits on. That gap is where most real-world AI disappointments happen.

Where AI is genuinely reliable

The reliability map makes sense once you understand what the technology does best: pattern recognition across material it has seen at massive scale. Ask a large language model to complete a line of Python following a recognisable idiom and it handles that well. Ask it to summarise a dense policy document, translate idiomatic Korean into English, or identify the likely diagnosis from a textbook presentation of symptoms. The track record in those tasks is consistently good.

Coding assistance is the clearest example. Research by GitHub found that developers using Copilot completed a set coding task about 55% faster than those working without it. The model has processed more code than any individual developer will read in a lifetime, so when the pattern is common, the suggestion is useful rather than decorative.

Medical imaging gives the same result from a different domain. AI systems trained on large labelled datasets have matched or exceeded radiologists on narrowly defined tasks: detecting diabetic retinopathy from fundus photographs, flagging suspicious lesions in mammograms. The imaging data is standardised, the labels are explicit, the pattern is learnable. Change the task to interpreting an ambiguous case history and the reliability profile drops sharply.

Technical difficulty is not the reliability signal. Both tasks work because they involve pattern recognition in domains where training data is rich, consistent, and well-labelled.

Grounded retrieval and summarisation

Retrieval-augmented generation sits at the more dependable end of the task spectrum. When a system retrieves a document and summarises it, the answer is already in front of it. The task is closer to extraction than imagination. Research on retrieval-augmented systems consistently shows lower hallucination rates compared to models generating from memory alone, because the source material constrains the output in ways that parametric recall cannot.

Ask AI to condense a contract clause or pull key findings from a research paper, and the accuracy profile looks better than asking it to recall those facts independently. The ceiling is still real. The floor is higher.

Code generation in well-documented domains

Code sits in the same category. Well-documented languages give models dense, consistent training signal, and the output is falsifiable: either the function runs or it does not. The original HumanEval benchmark measured Codex solving about 29 percent of novel Python problems on a single attempt, with rates climbing toward 70 percent given multiple tries. Those numbers have improved since 2021. Write a common React hook or a standard SQL query and current models produce something workable. Reach for a bespoke internal API and the floor drops.
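The gap between single-attempt and multi-attempt success is usually reported as pass@k: the chance that at least one of k sampled solutions passes the tests. Below is a minimal sketch of the unbiased estimator described in the HumanEval paper. The function name `pass_at_k` and the sample counts in the example are assumptions for illustration, not figures from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of one problem, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random draw of k samples contains no passing solution.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 samples drawn, 3 of them pass the unit tests.
single_try = pass_at_k(n=10, c=3, k=1)  # about 0.30
five_tries = pass_at_k(n=10, c=3, k=5)  # about 0.92
print(round(single_try, 2), round(five_tries, 2))
```

The same underlying model looks far stronger at k=5 than at k=1, which is worth remembering when comparing headline figures that do not state how many attempts were allowed.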

Pattern recognition in structured data

The same reliability pattern applies to structured data. Feed a model a spreadsheet of customer transactions and ask it to flag anomalies. Hand it a labelled dataset and ask it to classify new records. Both tasks sit in workable territory: the input is consistent, the expected output is definable, and a wrong answer shows up in the validation metrics. Classification accuracy on well-specified tabular tasks often tops 90 percent on standard benchmarks. Models in this territory fail loudly, which means you can catch and correct them.
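To make "fails loudly" concrete, here is a minimal sketch of a validation check: compare predictions against held-out labels and list every mismatch. The function name `validation_report` and the toy records are hypothetical, and a real pipeline would more likely use an established metrics library.

```python
def validation_report(predictions, labels):
    """Return (accuracy, errors) for predictions checked against known labels.

    errors is a list of (index, predicted, actual) tuples, one per mismatch,
    so each wrong answer is visible rather than buried in an average.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("need at least one labelled record")
    errors = [
        (i, predicted, actual)
        for i, (predicted, actual) in enumerate(zip(predictions, labels))
        if predicted != actual
    ]
    accuracy = 1 - len(errors) / len(labels)
    return accuracy, errors


# Toy data (assumed for illustration): four transactions, one misclassified.
predicted = ["ok", "fraud", "ok", "ok"]
actual = ["ok", "fraud", "fraud", "ok"]
accuracy, errors = validation_report(predicted, actual)
print(accuracy, errors)  # 0.75 [(2, 'ok', 'fraud')]
```

The point is the second return value: because labels exist, each error can be located and inspected, which is exactly what open-ended tasks lack.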

Where AI fails and the spectrum of failure

Whether an error is visible matters more than how often it happens.

In well-specified tasks, failures show up. A classification model misidentifies a record, the error surfaces in validation, you adjust. The model doesn’t know it is wrong, but you can find out. That is a workable situation.

Open territory is harder: tasks without a definable correct answer, contexts requiring background knowledge the model lacks, situations where only subject expertise separates plausible language from accurate language. Ask a model to summarise a legal document and it produces confident, well-structured prose. It can also invent citations that don’t exist.

Studies of factual accuracy in large language models find that hallucination concentrates in domains with sparse or inconsistent training signal: specialised professional knowledge, recent events, niche technical detail.

Failure modes fall on a spectrum from loud to quiet. Loud failures are recoverable: wrong answers that look wrong, outputs you would scrutinise anyway. Quiet failures are the problem. Plausible outputs in domains where the person relying on them lacks the expertise to catch the error. A radiologist using AI to flag candidates for a second look is in a different position to a registrar who can no longer read a scan without it. The tool is the same. The failure mode is not.

The hallucination spectrum (0.7% to 51%)

Start with the numbers. Hallucination rates across current AI systems don’t cluster around a single figure. They range from under 1% on constrained factual tasks to above 50% on open-ended questions in specialist domains, and the spread follows task type.

Constrained tasks like date calculations, code syntax checks, and format conversions give the model a narrow solution space. Errors are easier to spot. Open-ended tasks in medicine, law, or finance are a different problem: the model confabulates with grammatical confidence, and anyone without specialist knowledge has no grip on whether the answer is right.

The 0.7% figure and the 51% figure are both real. They belong to different task types. Knowing which applies to your situation is where the reliability question starts.

The degradation effect (the colonoscopy finding)

Gastroenterologists who used AI assistance for polyp detection during colonoscopy got measurably worse at detecting polyps without it. Not over years. Over months. The AI covered the detection work entirely. The underlying skill atrophied.

The augmentation argument assumes AI and human capability run in parallel. The colonoscopy data suggests they can trade off. Lean on the tool long enough and it starts carrying weight your own judgement used to carry. Consistent AI accuracy can erode human accuracy over time.

Language and context bias for Australian inputs

Most large language models powering AI tools today were trained on datasets skewed heavily toward American English. Australian users feel this as quiet, persistent friction. Legal terms that don’t map cleanly onto Australian law. Medical dosing guidelines that reflect US formularies rather than TGA-approved ones. Colloquialisms that get misread or quietly dropped. Analysis of major training corpora confirms the English-language skew runs strongly American. What can AI actually do in an Australian context? Often less than the global benchmarks suggest. Those numbers were generated on content that looks nothing like what an Australian professional actually works with.

Australia specifically

Take legal. Australian solicitors testing AI tools quickly discover that “common law jurisdiction” is doing a lot of work in those benchmark scores. Most large language models have absorbed enormous quantities of American case law, American statutory interpretation, and American legal commentary. Australian legislation, state and territory variations, the specific provisions of the Australian Consumer Law, the quirks of the Family Law Act: thinner in the training data, less reliably retrieved, more often wrong. The tool produces something that looks like legal analysis. Whether it reflects Australian law is a separate question.

Healthcare follows the same pattern. The Therapeutic Goods Administration can approve medications at different doses, in different formulations, or with different contraindications than the FDA. AI systems calibrated to American formularies produce guidance that looks authoritative and reflects the wrong country’s approvals.

Tax is worse still. The ATO’s administrative guidance, the Income Tax Assessment Act, the treatment of franking credits, the specifics of Australian superannuation: underrepresented in the training data of essentially every major commercial model. Australia’s AI Ethics Framework sets out principles for trustworthy, accountable AI, but most commercially available models were not built with Australian regulatory context as a baseline. This is not solved by a better prompt. It reflects where the training data came from, and decisions made during model development that Australian users had no visibility over.

Adoption, the regulatory gap, and the employment picture

Adoption is outrunning governance. Australian organisations in healthcare, legal, and financial services are deploying AI tools at pace, but the framework they operate under remains voluntary. Companies can follow the standards or not. Most face no legal consequence for the latter.

The employment picture is harder to read cleanly. The rough consensus is that AI shifts work composition rather than eliminating it wholesale. Workers who defer to AI at the edge of their expertise tend to get worse at those tasks. For anyone in a profession where independent judgement is the core product, that trajectory matters more than the headline displacement numbers.

What Australian context means for output quality

Most frontier models draw their training data from North American and British sources above all else. The further a query drifts from US or UK norms, the more the model works from thinner material, and its confidence calibration gets worse.

Australian tax law, workplace relations legislation, professional licensing requirements: the model has seen less of all of these. An HR manager in Melbourne asking about modern award obligations is in different territory from a US employment lawyer asking about at-will contracts. Both questions look equally routine to the model. The error rate on the Australian one is higher.

The two-question practical framework

Two questions do most of the work.

First: is this a closed task or an open one? Closed tasks have right answers the model can triangulate against a large body of consistent source material. Open tasks require judgement against live, ambiguous, or jurisdiction-specific information. The model performs well on closed tasks and hallucinates more frequently on open ones, regardless of how confident it sounds.

Second: who pays the cost of an error? If the answer is “you, your client, or your patient,” treat the output as a starting point rather than a conclusion. If the answer is “a draft someone will check,” the cost of being wrong is low enough to absorb.
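The two questions can be sketched as a small decision function. This is an illustration under stated assumptions, not a standard: the `Task` fields, the `review_advice` name, and the four advice labels are all hypothetical, and the mapping of each combination to a label is one reasonable reading of the framework.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a task handed to an AI tool."""
    name: str
    is_closed: bool        # Question 1: is there a checkable right answer?
    error_cost_high: bool  # Question 2: do you, a client, or a patient pay for errors?


def review_advice(task: Task) -> str:
    """Map the two answers to a review level (assumed labels)."""
    if not task.is_closed and task.error_cost_high:
        return "expert review before use"
    if not task.is_closed:
        return "treat as a draft"
    if task.error_cost_high:
        return "verify against the source"
    return "use with spot checks"


# Example tasks are illustrative assumptions.
print(review_advice(Task("modern award obligations query", False, True)))
print(review_advice(Task("reformat a meeting summary", True, False)))
```

The open, high-cost branch is checked first because that combination is where a confident but wrong answer does the most damage.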

Task type is a more reliable guide to what AI can actually do than any benchmark score.

Frequently Asked Questions

What can AI actually do reliably?

Pattern recognition and generation of text, code, and analysis in well-established knowledge domains. Task type is the primary reliability signal. Closed tasks, where a correct answer exists and you can check it independently, tend to go well. Open tasks, where correctness depends on judgement, novel context, or information the model lacked in training, are where confident output and wrong output look identical. Researchers describe this as the "jagged frontier": AI capability is high in some areas and low in immediately adjacent ones. The distribution does not track intuition about which tasks should be difficult, which makes it hard to predict without a framework.

How do I know when to trust what AI tells me?

Ask two questions. Can someone without specialised knowledge verify this output? And does the task have a definite correct answer, rather than resting on judgement? If both answers are yes, the conditions for reliable AI assistance are roughly in place. If either is no, you need a human reviewer who knows enough to catch errors. Most scrutiny lands on the obviously hard task. The risk zone is the task that seems routine but sits in the open-answer category, where deference to AI output feels safe and errors go undetected.

Will using AI make me better at my job over time?

This depends on how you use it. The augmentation argument, that AI handles routine tasks and frees up capacity for higher-level work, has support in narrow productivity studies. Most workplace commentary stops there. Research in medicine and professional services suggests that heavy AI use can erode independent judgement and the pattern recognition that underpins expertise. You stop generating your own assessments. You miss errors you previously caught. Maintaining skill under AI assistance requires effort to preserve the cognitive habits AI can replace. Your improvement depends on the work you put into managing that tension.

Should I be worried about AI, or is the concern overblown?

AI fails quietly and confidently in contexts where users lack the knowledge to detect the error. That is the specific risk worth focusing on. Visible failures get corrected. Plausible-sounding wrong answers in domains where you lack independent expertise often go undetected and uncontested. The calibrated response is matching your reliance on AI to task type, building review habits for high-stakes outputs, and preserving the independent knowledge that tells you when the answer is wrong. Both the booster and sceptic positions are simpler than calibration. Neither fits what the evidence shows.
