Using the same inputs and outputs as a human operator, this framework enables multimodal AI models to view the screen and decide on a series of mouse and keyboard actions to reach an objective.
Currently integrated with GPT-4-Vision as the default model.
Designed to run across operating systems and to work with various multimodal models.
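To make that loop concrete, here is a minimal sketch of the screenshot-decide-act cycle, assuming `pyautogui` for cross-platform mouse and keyboard control. The `ask_model` callable and the action dictionary it returns are hypothetical stand-ins for whatever multimodal model is wired in (GPT-4-Vision by default), not the project's actual protocol:

```python
import base64
import io

import pyautogui  # cross-platform mouse and keyboard control


def capture_screen_b64() -> str:
    """Grab the current screen and encode it as base64 PNG for a vision model."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def operate(objective: str, ask_model, max_steps: int = 10) -> None:
    """Loop: screenshot -> model decides an action -> execute it.

    `ask_model` is an assumed callable that sends the objective plus the
    current screenshot to a multimodal model and returns a dict such as
    {"action": "click", "x": 0.42, "y": 0.18},
    {"action": "type", "text": "..."}, or {"action": "done"}.
    """
    for _ in range(max_steps):
        decision = ask_model(objective, capture_screen_b64())
        if decision["action"] == "done":
            break
        if decision["action"] == "click":
            # Convert the model's percentage coordinates to absolute pixels.
            width, height = pyautogui.size()
            pyautogui.click(int(decision["x"] * width), int(decision["y"] * height))
        elif decision["action"] == "type":
            pyautogui.write(decision["text"], interval=0.02)
```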
At HyperwriteAI, we are developing Agent-1-Vision, a multimodal model designed for operating software and computer interfaces, with more accurate click location predictions.
We recognize that some operating system functions may be more efficiently executed with hotkeys, such as entering the browser address bar with `command + L` rather than simulating a mouse click at the correct XY location.
We plan to add these optimizations over time. However, many actions require accurately selecting visual elements on the screen, which demands precise XY mouse click locations.
A primary focus of this project is to refine the accuracy of determining these click locations. We believe this is essential for achieving a fully self-operating computer in the current technological landscape.
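As a rough illustration of this trade-off, the sketch below contrasts the two action types using `pyautogui`: a hotkey needs no localization at all, while a click is only as reliable as the model's predicted coordinates. The function names and the percentage-coordinate convention are assumptions for illustration:

```python
import platform

import pyautogui


def focus_address_bar() -> None:
    """Jump to the browser address bar via hotkey: no localization needed."""
    modifier = "command" if platform.system() == "Darwin" else "ctrl"
    pyautogui.hotkey(modifier, "l")


def click_at_percent(x_pct: float, y_pct: float) -> None:
    """Click a location the model predicted as screen percentages.

    Unlike the hotkey above, this succeeds only if the model's XY
    prediction is accurate, which is why click-location accuracy
    is a primary focus of the project.
    """
    width, height = pyautogui.size()
    pyautogui.click(int(x_pct * width), int(y_pct * height))
```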
Effortlessly conquer your inbox. Seamless organization, smart prioritization, and rapid responses, all with the power of AI at your fingertips.
Streamline your daily routine. From scheduling appointments and ordering food to online shopping and bill payments, let the power of AI optimize your everyday tasks for a smoother, more efficient lifestyle.
Enhance your research capabilities. Dive into a wealth of knowledge, retrieve accurate information, and uncover valuable insights, all through the brilliance of AI-driven search and thought.