Achieving a robust and universal semantic representation for action descriptions remains a key challenge in natural language understanding. Current approaches often struggle to capture the complexity of human actions, leading to inaccurate representations. To address this challenge, we propose an innovative framework that leverages multimodal learning.