Automating Unstructured Data Classification - Malek Ben Salem BSides NOVA 2018 (Hacking Illustrated Series InfoSec Tutorial Videos)

Automating Unstructured Data Classification
Malek Ben Salem
BSidesNOVA 2018

Organizations use documents to communicate, perform business transactions, collaborate, and innovate. These documents, which include e-mails, project reports, proposals, contracts, and design drafts, may carry confidential information and intellectual property. They have to be protected from unauthorized access, exfiltration, or loss, but because their contents are not equally sensitive, they need not all be protected at the same level. Identifying and properly labeling sensitive documents is therefore important. The current classification process is manual: document creators label documents according to their organization’s classification taxonomy when a document is created or uploaded to a file share. The taxonomy varies by organization, but generally has four levels of confidentiality (Public/Unrestricted, Internal Use, Restricted, and Highly Confidential). The impact of data disclosure or breach varies by confidentiality level, and so does the level of protection required for that data. Various security controls can be deployed to minimize the risk of losing or leaking this information, such as access controls, encryption, Data Loss Prevention deployments, and Enterprise Data Rights Management. These controls are not effective unless the sensitive or confidential information is properly identified. Manual classification, however, is not accurate. Employees seem to lack the training or discipline to label documents appropriately, which raises an organization’s information risk level. Worse, malicious users may intentionally label sensitive documents as non-sensitive in order to exfiltrate data without being detected. In summary, manual classification is often unreliable and error-prone. We developed an automated approach for classifying business documents using Natural Language Processing and Machine Learning techniques, in order to avoid the misclassification errors introduced by manual classification.
We use a real data set to show that our approach achieves high accuracy when predicting the confidentiality level of business documents, and that it is scalable.
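
The abstract does not say which NLP or machine learning techniques the talk uses, so the following is only a minimal sketch of the general idea, not the speaker's method. It assumes a bag-of-words multinomial Naive Bayes model as a stand-in, and the class name, training sentences, and labels assigned to them are all invented for illustration. Only the four confidentiality levels come from the abstract.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training set. These sentences and their labels are made up for this
# sketch; only the four level names are taken from the abstract.
TRAINING = [
    ("press release announcing our new office opening to the public", "Public/Unrestricted"),
    ("marketing brochure and public product announcement", "Public/Unrestricted"),
    ("team meeting notes and internal holiday schedule", "Internal Use"),
    ("internal newsletter about cafeteria menu and parking", "Internal Use"),
    ("customer contract terms and pricing proposal under nda", "Restricted"),
    ("supplier contract pricing and negotiation proposal", "Restricted"),
    ("unreleased design draft with patent claims and source code", "Highly Confidential"),
    ("merger plan and unreleased financial results patent filing", "Highly Confidential"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Hypothetical stand-in model: multinomial Naive Bayes, add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


texts, labels = zip(*TRAINING)
classifier = NaiveBayesClassifier().fit(texts, labels)
print(classifier.predict("draft pricing proposal for the customer contract"))
# prints "Restricted"
```

A real deployment would need far more training data, richer features than raw word counts, and an evaluation against held-out documents. The sketch only shows how a predicted label could replace, or be checked against, the label a document creator assigns by hand.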

Malek Ben Salem

Ms. Malek BEN SALEM is a cyber security research senior principal at Accenture Technology Labs. She is responsible for defining and leading the cyber security lab's research agenda. Malek holds a PhD in Computer Science from Columbia University, New York. She has been with Accenture Labs since 2011. She has authored several thought leadership and peer-reviewed academic publications. She has also been a Co-Principal Investigator on several DARPA projects, and is leading research on the use of behavioral biometrics for continuous authentication on desktops and mobile devices.


If you would like to republish one of the articles from this site on your webpage or print journal please contact IronGeek.

Copyright 2020, IronGeek
Louisville / Kentuckiana Information Security Enthusiast