ChatGPT Health, a recently launched feature designed to analyse long-term health data from sources such as Apple Health, is facing scrutiny after an early real-world test revealed significant accuracy issues.

The concerns emerged when Geoffrey A. Fowler, a technology columnist at The Washington Post, described his experience of giving the tool access to nearly a decade of Apple Watch data. Fowler, who has worn an Apple Watch daily for years, allowed ChatGPT Health to review roughly 29 million recorded steps and approximately 6 million heart rate readings before asking the system to assess his cardiac health.

According to Fowler, the AI delivered a stark assessment, assigning his heart health a failing grade. The evaluation prompted him to immediately change his behaviour and seek medical advice. However, his doctor reportedly dismissed the AI’s conclusions, stating that Fowler was at extremely low risk of a heart attack and that additional testing would likely be unnecessary.

Further analysis indicated that ChatGPT Health relied heavily on VO2 max estimates generated by the Apple Watch. While the metric is commonly used as an indicator of cardiovascular fitness, Apple has consistently said the watch provides estimates meant for tracking trends rather than making clinical diagnoses. Accurate VO2 max measurements typically require laboratory testing, a distinction that was not reflected in the AI’s assessment.

Fowler also observed that changes in his historical resting heart rate data coincided with upgrades to newer Apple Watch models. These shifts were linked to improved sensors and updated measurement algorithms rather than changes in his health. ChatGPT Health interpreted the variations as medically meaningful signals, without accounting for changes in hardware or software over time.

Adding to the concerns was the system’s lack of consistency. When Fowler repeated the same query, ChatGPT Health produced different results, revising its evaluation from a failing grade to an average one. Subsequent attempts yielded scores ranging from poor to above average, raising questions about the reliability of its assessments.

Fowler also reported that the system struggled to retain basic personal information across conversations. Despite having access to recent blood test results, the AI did not consistently incorporate those data points into its analysis and repeatedly forgot details such as age, gender and recent vital signs.

