“Data Loss Prevention on images” in Patent Application Approval Process (USPTO 20210326461): Patent Application
2021 NOV 10 (NewsRx) -- By a
This patent application has not been assigned to a company or institution.
The following quote was obtained by the news editors from the background information supplied by the inventors: “Data Loss Prevention (DLP) involves monitoring of an organization’s sensitive data, including data at endpoint devices, data at rest, and data in motion. Conventional DLP approaches focus on a variety of products, including software agents at endpoints, physical appliances, virtual appliances, etc. As applications move to the cloud, users are accessing them directly, everywhere they connect, inevitably leaving blind spots as users bypass security controls in conventional DLP approaches while off-network. Encryption increases the problem because sensitive data is typically concealed in Secure Sockets Layer (SSL)/Transport Layer Security (TLS) traffic, which is difficult and expensive to inspect (in terms of cost, processing capability, and latency). Without visibility and control, organizations are at an increased risk of data loss, due either to unintentional or malicious reasons.
“Conventional techniques for catching data include the use of DLP dictionaries and engines. These approaches are used to detect Exact Data Matching (EDM), where specific keywords, classes of data, etc. are flagged. For example, DLP can detect social security numbers, credit card numbers, etc. based on the data format, such as in structured documents, etc. DLP can also detect specific keywords in the DLP dictionaries. However, DLP is difficult with unstructured documents. Unstructured documents are just that; documents that can be free-form and do not have a set structure but are still able to be scanned, captured, and analyzed. For true DLP, it is also important to support the analysis of unstructured documents.
“DLP dictionaries are fundamental to configuring DLP functionalities. A DLP dictionary contains a set of algorithms that are designed to detect specific kinds of information in user traffic. Some examples of predefined dictionaries include ABA Bank Routing Numbers, Adult Content, Citizen Service Numbers (
“However, each DLP dictionary, e.g., the predefined dictionaries and the custom dictionaries, contains its own violation threshold and confidence threshold, making it difficult for DLP dictionaries to work together. In conventional operation, tenants are required to create custom dictionaries to deal with expressions. For example, a use case can include “perform operation A if more than 10 CCNs are triggered, and perform operation B if more than 20 CCNs are triggered.” There is a need to introduce DLP expression flexibility with DLP dictionaries.
“Also, DLP functionality operates on files having searchable content, e.g., word processing files, text files, presentation files, source code, database files, emails, Portable Document Format (PDF) files, and the like. This means non-searchable files such as images are not capable of DLP scanning. This is problematic as image files can be posted to social media, used to capture sensitive data that is sent to circumvent DLP functionality, etc. In these cases, images can lead to data loss. There have been countless examples where an image is posted with some sensitive or embarrassing text in the background, e.g., on papers, on a white board, etc. There is a need to extend the DLP functionality to images.”
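As an editorial illustration only (not code from the filing), the format-based detection and the credit card number (CCN) threshold use case quoted above can be sketched in Python. The regular expression, the Luhn checksum filter, and the names `count_ccns` and `choose_operation` are assumptions made for this sketch; the application does not say how its dictionaries detect CCNs.

```python
import re

# Assumed pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CCN_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def count_ccns(text: str) -> int:
    """Count candidate card numbers in text that also pass the Luhn check."""
    count = 0
    for match in CCN_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            count += 1
    return count


def choose_operation(ccn_count: int) -> str:
    """Map a match count to an action, testing the higher threshold first."""
    if ccn_count > 20:
        return "operation B"
    if ccn_count > 10:
        return "operation A"
    return "no action"
```

Testing the higher threshold first matters here: a count of 25 satisfies both conditions in the quoted use case, and this sketch assumes the stricter operation should win.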
In addition to the background information obtained for this patent application, NewsRx journalists also obtained the inventors’ summary information for this patent application: “The present disclosure relates to systems and methods for Data Loss Prevention (DLP) on images. Specifically, a DLP service or system can detect an image or other non-searchable file in user traffic. When an image is detected, it is scanned to identify any text therein, such as via Optical Character Recognition (OCR). If there is identifiable text, it is extracted from the image and then matched against a plurality of DLP techniques including DLP engines that look for content matching DLP dictionaries associated with a DLP engine, Exact Data Matching (EDM) where the content is matched to see if it exactly matches specific content, and Indexed Data Matching (IDM) where the content is matched to some part of a document from a repository of documents. In addition to protecting sensitive material, the DLP on images approach can also ensure embarrassing content is blocked, such as from a social media post, blog, etc.
“Also, the present disclosure relates to systems and methods for Data Loss Prevention (DLP) expression building for a DLP engine. As described herein, a DLP service or system can utilize one or more dictionaries. A DLP dictionary is a set of data that includes specific kinds of information that are monitored for in user traffic. A DLP engine can include one or more DLP dictionaries that are used for detection. The present disclosure includes utilizing expressions to combine one or more DLP dictionaries in the DLP engine to provide an aggregate result. The DLP dictionaries can include predefined dictionaries and custom dictionaries. The present disclosure includes a user interface for users to enter expressions, evaluate the expressions, and store the expressions in a database for use in production.
“Also, the present disclosure relates to systems and methods for Data Loss Prevention (DLP) via Indexed Document Matching (IDM). As described herein, IDM is the ability to identify and protect content that matches the whole or some part of a document from a repository of documents. This feature provides data leak protection for unstructured documents. Specifically, techniques include identifying exact document matches, identifying the same text in a document as in an indexed document, identifying content that contains a subset of text in an indexed document, and identifying content that is similar but not exactly the same as the text in an indexed document. Customers can index files into multiple user-defined profiles or categories. The results of the identification can yield a score that can be matched to a threshold for detection. The technique can be summarized as similarity detection (i.e., same file, same text, similar text, etc.) and fragment identification (i.e., partial content match) to provide a score that is indicative of a match to an indexed document.
“In an embodiment, a method, instructions in a non-transitory computer-readable storage medium, and a DLP service executed by a cloud-based system are presented to perform steps. The steps include obtaining a file to be checked for Data Loss Prevention (DLP); determining a cryptographic hash of the file and comparing the cryptographic hash to corresponding cryptographic hashes of indexed files; responsive to a match between the cryptographic hash and one of the corresponding cryptographic hashes, determining a DLP match and performing an action based thereon; responsive to no match, extracting text from the file and creating an ordered sequence of hashes of variable length chunks of the extracted text; and determining the DLP match with one of the indexed files based on comparing the ordered sequence of hashes with a corresponding ordered sequence of hashes of the indexed files.
“The determining the DLP match based on the comparing the ordered sequence of hashes can utilize a match score based on a number of the hashes that match, and the DLP match is based on the match score being above a threshold. The threshold can be user-configurable in value and configurable across a different profile of the indexed files. The steps can further include, responsive to the DLP match based on the comparing the ordered sequences of hashes, performing an action based thereon. The steps can further include, prior to the obtaining the file, obtaining a lookup table for a tenant associated with a user of the file, wherein the lookup table includes the ordered sequence of hashes indexed to the indexed files. The lookup table can be created in an indexing tool, and wherein the indexed files cannot be recreated from data in the lookup table. The file can be of a first file type, and wherein the file can be determined to match one of the indexed files being a second file type, but having identical text therein. The file can be determined to match one of the indexed files having similar text therein.”
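The two-stage matching described in the summary above (a whole-file cryptographic hash first, then an ordered sequence of hashes over variable-length chunks of extracted text) can be illustrated with a minimal Python sketch. Several details are assumptions, since the filing as quoted does not specify them: SHA-256 as the hash, a chunk boundary after any word whose CRC32 is divisible by 4, a score equal to the fraction of chunk hashes matched in order, and the names `chunk_hashes`, `match_score`, and `dlp_match`. Text extraction is assumed to have been done by the caller.

```python
import hashlib
import zlib
from difflib import SequenceMatcher
from typing import Optional


def file_hash(data: bytes) -> str:
    """Cryptographic hash of the raw file bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def chunk_hashes(text: str, divisor: int = 4) -> list[str]:
    """Hash variable-length word chunks of the text, in document order.

    Assumed boundary rule: a chunk ends after any word whose CRC32 is
    divisible by `divisor`, so boundaries depend on content, not position.
    """
    hashes: list[str] = []
    chunk: list[str] = []
    for word in text.lower().split():
        chunk.append(word)
        if zlib.crc32(word.encode("utf-8")) % divisor == 0:
            hashes.append(hashlib.sha256(" ".join(chunk).encode("utf-8")).hexdigest())
            chunk = []
    if chunk:  # trailing words that never hit a boundary
        hashes.append(hashlib.sha256(" ".join(chunk).encode("utf-8")).hexdigest())
    return hashes


def match_score(candidate: list[str], indexed: list[str]) -> float:
    """Fraction of candidate chunk hashes that align, in order, with the indexed ones."""
    if not candidate:
        return 0.0
    matcher = SequenceMatcher(None, candidate, indexed, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(candidate)


def dlp_match(data: bytes, text: str, index: dict, threshold: float = 0.6) -> Optional[str]:
    """Return the name of the matching indexed file, or None.

    `index` maps a file name to (file_hash, chunk_hash_sequence); only hashes
    are stored, so the indexed text cannot be read back from the table.
    """
    digest = file_hash(data)
    for name, (indexed_digest, _) in index.items():
        if digest == indexed_digest:  # stage 1: exact file match
            return name
    candidate = chunk_hashes(text)
    for name, (_, indexed_chunks) in index.items():
        if match_score(candidate, indexed_chunks) >= threshold:  # stage 2
            return name
    return None
```

Because the second stage works on extracted text rather than raw bytes, a file of one type can match an indexed file of another type that carries identical text, which is one of the behaviors the summary describes.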
The claims supplied by the inventors are:
“1. A non-transitory computer-readable storage medium having computer-readable code stored thereon for programming one or more processors to perform steps of: detecting an image in monitored user traffic; scanning the image to identify any text and extracting any identified text therein; responsive to the extracting, scanning the extracted text with a plurality of Data Loss Prevention (DLP) techniques including one or more DLP engines where the extracted text is checked to trigger the one or more DLP engines, Exact Data Matching (EDM) where the extracted text is matched to see if it matches specific content, and Indexed Data Matching (IDM) where the extracted text is matched to some part of a document from a repository of documents; and performing one or more actions based on results of the plurality of DLP techniques.
“2. The non-transitory computer-readable storage medium of claim 1, wherein the steps further include inline monitoring the user traffic with a cloud service.
“3. The non-transitory computer-readable storage medium of claim 2, wherein the detecting the image is in part based on detecting the user traffic is associated with any of social media, electronic mail, and posts on Web sites.
“4. The non-transitory computer-readable storage medium of claim 1, wherein the plurality of DLP techniques include at least one DLP engine for detecting embarrassing content based on detecting the user traffic is associated with any of social media, electronic mail, and posts on Web sites.
“5. The non-transitory computer-readable storage medium of claim 1, wherein the plurality of DLP techniques include a plurality of the DLP engines.
“6. The non-transitory computer-readable storage medium of claim 5, wherein at least one of the plurality of DLP engines includes a predefined dictionary including adult content.
“7. The non-transitory computer-readable storage medium of claim 5, wherein at least one of the plurality of DLP engines includes a predefined dictionary and at least one of the plurality of DLP engines includes a custom dictionary.
“8. The non-transitory computer-readable storage medium of claim 1, wherein the scanning the image to identify any text and extracting any identified text therein includes detecting some text in the image via Optical Character Recognition; and extracting the some text when the some text is above a threshold amount.
“9. The non-transitory computer-readable storage medium of claim 1, wherein the one or more actions include any of blocking the image in a cloud service and providing a notification.
“10. A method comprising: detecting an image in monitored user traffic; scanning the image to identify any text and extracting any identified text therein; responsive to the extracting, scanning the extracted text with a plurality of Data Loss Prevention (DLP) techniques including one or more DLP engines where the extracted text is checked to trigger the one or more DLP engines, Exact Data Matching (EDM) where the extracted text is matched to see if it matches specific content, and Indexed Data Matching (IDM) where the extracted text is matched to some part of a document from a repository of documents; and performing one or more actions based on results of the plurality of DLP techniques.
“11. The method of claim 10, further comprising inline monitoring the user traffic with a cloud service.
“12. The method of claim 11, wherein the detecting the image is in part based on detecting the user traffic is associated with any of social media, electronic mail, and posts on Web sites.
“13. The method of claim 10, wherein the plurality of DLP techniques include at least one DLP engine for detecting embarrassing content based on detecting the user traffic is associated with any of social media, electronic mail, and posts on Web sites.
“14. The method of claim 10, wherein the plurality of DLP techniques include a plurality of the DLP engines.
“15. The method of claim 14, wherein at least one of the plurality of DLP engines includes a predefined dictionary including adult content.
“16. The method of claim 14, wherein at least one of the plurality of DLP engines includes a predefined dictionary and at least one of the plurality of DLP engines includes a custom dictionary.
“17. The method of claim 10, wherein the scanning the image to identify any text and extracting any identified text therein includes detecting some text in the image via Optical Character Recognition; and extracting the some text when the some text is above a threshold amount.
“18. The method of claim 10, wherein the one or more actions include any of blocking the image in a cloud service and providing a notification.
“19. A cloud-based system comprising: a plurality of enforcement nodes connected to one another; a central authority connected to the plurality of enforcement nodes; and a Data Loss Prevention (DLP) service executed between the plurality of enforcement nodes, wherein the DLP service is configured to detect an image in monitored user traffic; scan the image to identify any text and extracting any identified text therein; responsive to extraction of the identified text, scan the extracted text with a plurality of Data Loss Prevention (DLP) techniques including one or more DLP engines where the extracted text is checked to trigger the one or more DLP engines, Exact Data Matching (EDM) where the extracted text is matched to see if it matches specific content, and Indexed Data Matching (IDM) where the extracted text is matched to some part of a document from a repository of documents; and perform one or more actions based on results of the plurality of DLP techniques.
“20. The cloud-based system of claim 19, wherein the DLP service is performed with an inline monitoring service through the cloud-based system.”
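For readers who want a concrete picture of the flow recited in claims 1 and 8, the following Python sketch is offered as an editorial illustration under stated assumptions, not as the claimed implementation. The OCR step is passed in as a caller-supplied function (`ocr` is hypothetical), the "threshold amount" of text is assumed to be a minimum character count, and the helpers `dictionary_engine` and `exact_data_matcher` are simplified stand-ins for a DLP engine and EDM.

```python
from typing import Callable, Iterable

Technique = Callable[[str], bool]


def dictionary_engine(keywords: Iterable[str]) -> Technique:
    """Stand-in DLP engine: triggers if any dictionary keyword appears in the text."""
    lowered = [k.lower() for k in keywords]
    return lambda text: any(k in text.lower() for k in lowered)


def exact_data_matcher(values: Iterable[str]) -> Technique:
    """Stand-in EDM: triggers if any whitespace-separated token equals a protected value."""
    protected = set(values)
    return lambda text: any(token in protected for token in text.split())


def scan_image(
    image: bytes,
    ocr: Callable[[bytes], str],      # hypothetical OCR function supplied by the caller
    techniques: dict[str, Technique],
    min_chars: int = 8,               # assumed form of the "threshold amount" of text
) -> dict:
    """Extract text from an image, run every technique, and pick an action."""
    text = ocr(image).strip()
    if len(text) < min_chars:
        # Too little text to be worth scanning; let the image through.
        return {"action": "allow", "triggered": []}
    triggered = [name for name, check in techniques.items() if check(text)]
    return {"action": "block" if triggered else "allow", "triggered": triggered}
```

A usage example with a fake OCR function: `scan_image(b"...", lambda _: "Project Falcon confidential", {"dictionary": dictionary_engine(["confidential"])})` returns an action of `"block"` with `"dictionary"` listed as triggered.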
URL and more information on this patent application, see: Bhallamudi,
(Our reports deliver fact-based news of research and discoveries from around the world.)