Google recently published a Security Blog post detailing what it calls one of the largest defense upgrades to Gmail's spam filter in recent years. The upgrade is a new text classification system called RETVec (Resilient and Efficient Text Vectorizer). Google says it helps models understand adversarial text manipulations: emails filled with special characters, emojis, misspellings, and other junk that humans can still read but machines could not easily understand. Previously, spam messages filled with such characters easily slipped through Gmail's defenses.
While any spam filter would likely catch an email that reads, “Congratulations! A $1,000 balance has been added to your jackpot account,” that is not what such a message actually contains. Many of its letters are homoglyphs: characters pulled from the endless depths of the Unicode standard that look like part of the regular Latin alphabet but are not.
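To make the homoglyph trick concrete, here is a minimal Python sketch. The spoofed word and the helper name `non_latin_lookalikes` are made up for illustration and are not taken from Google's post. It only shows why a plain string comparison fails on look-alike characters, not how RETVec handles them.

```python
import unicodedata

def non_latin_lookalikes(text):
    """Return (character, Unicode name) pairs for letters outside basic ASCII."""
    found = []
    for ch in text:
        if ch.isalpha() and ord(ch) > 127:
            found.append((ch, unicodedata.name(ch, "UNKNOWN")))
    return found

plain = "Congratulations"
# Hypothetical spoofed word: the "o" and the first "a" are Cyrillic letters.
spoofed = "C\u043engr\u0430tulations"

# The two strings render almost identically but are not equal.
print(plain == spoofed)
# List the characters that are not ordinary Latin letters.
print(non_latin_lookalikes(spoofed))
```

A keyword filter that searches for the exact string "Congratulations" would miss the spoofed version, which is the gap the article describes.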
Google says RETVec is trained to be resilient to character-level manipulations including insertions, deletions, typos, homoglyphs, LEET substitutions, and more. The RETVec model is trained on top of a new character encoder that can efficiently encode all UTF-8 characters and words. As a result, RETVec works out of the box on over 100 languages without requiring a lookup table or a fixed vocabulary size.
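One way to picture a vocabulary-free character encoding is to turn each character's Unicode code point into a fixed-length vector of bits, so that any character in any language gets a representation without a lookup table. The Python sketch below is illustrative only. The 24-bit width, the 16-character word length, and the function names are assumptions for this example, not details stated in the article, and the sketch leaves out the trained model that sits on top of the encoder.

```python
BITS_PER_CHAR = 24  # assumed width; enough for any Unicode code point (max 0x10FFFF)
MAX_WORD_LEN = 16   # assumed fixed word length for this illustration

def encode_char(ch):
    """Encode one character as a list of bits, least significant bit first."""
    code_point = ord(ch)
    return [(code_point >> i) & 1 for i in range(BITS_PER_CHAR)]

def encode_word(word):
    """Encode a word as MAX_WORD_LEN rows of bits, truncating or zero-padding."""
    chars = word[:MAX_WORD_LEN]
    rows = [encode_char(c) for c in chars]
    padding = [[0] * BITS_PER_CHAR for _ in range(MAX_WORD_LEN - len(chars))]
    return rows + padding

# Latin "a" and Cyrillic "а" look alike but get different bit patterns,
# and neither needs an entry in a vocabulary.
latin_a = encode_char("a")
cyrillic_a = encode_char("\u0430")
print(latin_a != cyrillic_a)

matrix = encode_word("jackpot")
print(len(matrix), len(matrix[0]))
```

Because the output shape depends only on the two constants, the same encoder accepts emojis, accented letters, or scripts it has never seen, which is the property that removes the need for a fixed vocabulary.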
Thanks to RETVec, Gmail can now better recognize and filter spam
Google says the difference is dramatic. Methods that rely on fixed vocabulary sizes or lookup tables for homoglyphs are resource-intensive. RETVec, on the other hand, has only about 200,000 parameters instead of millions, so although Google’s spam-filtering cloud is large, the model is small enough to run on a local device. RETVec is open source, and Google hopes it will help put an end to homoglyph attacks.
RETVec is built as a TensorFlow machine learning model that uses visual similarity to determine the meaning of words rather than their actual character content. This approach has led to big improvements: Google says that replacing the previous text vectorizer in Gmail's spam classifier with RETVec improved the spam detection rate by 38% over the baseline and reduced the false positive rate by 19.4%. It also reduced the model's TPU usage by 83%, which is why Google calls the RETVec rollout one of its largest defense upgrades in recent years. The company has been testing RETVec internally for the past year and has now rolled it out to all Gmail accounts.