At the same time, the cybersecurity industry is incorporating a growing number of AI-driven security solutions that rely on some type of trusted “ground truth” as a reference point.
How can these two seemingly opposing philosophies coexist?
This is not a hypothetical discussion.
Organizations are introducing AI models into their security practices that impact almost every aspect of their business. Can regulators, compliance officers, security professionals, and employees trust these security models at all?
Because AI models are sophisticated, opaque, automated, and continually evolving, it’s difficult to establish trust in an AI-dominant environment.
Yet without trust and accountability, some of these models might be considered risk-prohibitive and so could eventually be under-utilized, marginalized, or prohibited altogether.
One of the main stumbling blocks associated with AI trustworthiness revolves around data. More specifically, it revolves around ensuring data quality and integrity. AI and machine learning models are only as good as the data they consume (“garbage in, garbage out”).
Yet these obstacles haven’t discouraged cybersecurity vendors, which are increasingly eager to base their solutions on AI models.
By doing so, vendors are taking a leap of faith, assuming that the data (whether public or proprietary) their models are taking in sufficiently represent the real-life scenarios that these models will encounter going forward.
The data used to power AI-based cybersecurity systems face a number of further problems:
Bad actors can “poison” training data by manipulating the datasets (and even the pre-trained models) that the AI models rely on.
This could allow them to get around cybersecurity controls while the organization at risk doesn’t know that the ground truth it relies on to secure its infrastructure has been compromised.
Such poisoning could lead to misclassifications, such as security controls labeling malicious activity as benign. Or it could have a more profound impact by disrupting or disabling security controls altogether.
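To make the mislabeling risk concrete, here is a minimal sketch, assuming a deliberately simple one-feature detector that places its threshold midway between the average benign score and the average malicious score. The scores, labels, and function names are hypothetical and stand in for whatever a real product would learn from its data. The sketch shows how a few malicious-looking samples mislabeled as benign can shift the learned boundary far enough for a previously flagged event to pass.

```python
from statistics import mean


def train_threshold(samples):
    """Fit a toy 1-D detector: the midpoint between the two class means.

    samples: list of (score, label) pairs, label is "benign" or "malicious".
    """
    benign = [score for score, label in samples if label == "benign"]
    malicious = [score for score, label in samples if label == "malicious"]
    return (mean(benign) + mean(malicious)) / 2


def classify(score, threshold):
    """Flag anything scoring above the learned threshold."""
    return "malicious" if score > threshold else "benign"


# Hypothetical clean training data (score = some risk feature).
clean = [
    (1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
    (7.0, "malicious"), (8.0, "malicious"), (9.0, "malicious"),
]

# Hypothetical poison: malicious-looking samples injected with a benign label.
poison = [(7.5, "benign"), (8.0, "benign"), (8.5, "benign")]

clean_threshold = train_threshold(clean)              # 5.0
poisoned_threshold = train_threshold(clean + poison)  # 6.5

probe = 6.0  # an event the clean model would flag
print(classify(probe, clean_threshold))     # malicious
print(classify(probe, poisoned_threshold))  # benign
```

Nothing in the poisoned run raises an error or a warning, which is the point: the organization sees a working model whose reference point has quietly moved.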
AI models are built to address “noise.” But in the cyber world, malicious errors aren’t random.
Security professionals are faced with dynamic and sophisticated adversaries that learn and adapt over time.
Accumulating more security-related data might well improve AI-powered security models. At the same time, it could lead adversaries to change the way they operate.
This could diminish the effectiveness of existing data and AI models. It all boils down to understanding reality and the cause-effect relationships as accurately as possible.
For example, additional data points might help a traditional malware detection mechanism identify common threats. But they might degrade the AI model’s ability to identify new malware that diverges from known malicious patterns.
This is akin to how mutated viral variants can evade an immune system that was trained to identify the original viral strain.
What isn’t known
The unknown is prevalent in cyberspace, which is why many service providers urge their customers to build their security strategy on the assumption that they’ve already been breached.
The challenge for AI models comes from the fact that these unknowns (blind spots) are seamlessly incorporated into the models’ training datasets and therefore attain a stamp of approval and might not raise any alarms from AI-based security controls.
For example, some security vendors combine a slate of user attributes to create a personalized baseline of a user’s behavior. They then determine the expected permissible deviations from this baseline.
The premise is that these vendors can identify an existing norm that should serve as a reference point for their security models.
However, this assumption might not hold.
For example, undiscovered malware may already reside in the customer’s system, existing security controls may suffer from coverage gaps, or unsuspecting users may already be suffering from an ongoing account takeover.
It wouldn’t be far-fetched to assume that even standard security-related training datasets are rife with inaccuracies and misrepresentations.
Indeed, some of the benchmark datasets used by many leading AI algorithms and data science research have proven to be riddled with labeling flaws.
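The baseline problem can be sketched in a few lines. This is a minimal illustration, assuming a baseline defined as the mean of a user’s history plus or minus three standard deviations; the daily download volumes, the factor of three, and the function names are all hypothetical and not drawn from any specific vendor’s method. It compares a baseline learned from clean history with one learned while an account takeover was already under way.

```python
from statistics import mean, pstdev


def build_baseline(history, k=3.0):
    """Return the (low, high) band of permitted values: mean +/- k std devs."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, band):
    """True when the value falls outside the permitted band."""
    low, high = band
    return value < low or value > high


# Hypothetical daily download volume (MB) for one user.
clean_history = [10, 12, 11, 9, 13, 10, 12, 11]

# Same user, but an attacker was already exfiltrating data on some days
# while the baseline was being learned.
compromised_history = [10, 12, 60, 9, 55, 10, 65, 11]

clean_band = build_baseline(clean_history)
compromised_band = build_baseline(compromised_history)

todays_volume = 60
print(is_anomalous(todays_volume, clean_band))        # True
print(is_anomalous(todays_volume, compromised_band))  # False
```

The attacker’s past activity inflates both the mean and the spread, so the same 60 MB day that stands out against a clean baseline falls comfortably within the “normal” range of the contaminated one.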
Moreover, enterprise datasets can become obsolete, misleading, and/or erroneous over time unless the relevant data, and details of its lineage, are kept up-to-date and tied to their apposite context.
To make sensitive datasets accessible to security professionals within and across organizations, some technologies are gaining adoption. These include privacy-preserving and privacy-enhancing technologies, from de-identification to the creation of synthetic data.
The whole idea behind these technologies is to omit, alter, or hide sensitive information, such as personally identifiable information (PII).
But a consequence is that the inherent qualities and statistically significant attributes of the datasets might be lost along the way.
Moreover, what might seem like negligible “noise” could prove to be significant for some security models, impacting outputs in an unpredictable way.
All of these challenges are detrimental to the ongoing effort to enhance trust in an AI-dominated cybersecurity industry.
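As a rough illustration of how de-identification can erase a signal, consider a minimal sketch, assuming a detector that counts failed logins inside a sliding 60-second window and a privacy step that rounds every timestamp down to the hour. The event times, window length, bucket size, and function names are hypothetical choices made for the example.

```python
def max_events_in_window(timestamps, window_seconds=60):
    """Largest number of events inside any sliding window of the given length."""
    ts = sorted(timestamps)
    best = 0
    start = 0
    for end in range(len(ts)):
        # Shrink the window from the left until it spans < window_seconds.
        while ts[end] - ts[start] >= window_seconds:
            start += 1
        best = max(best, end - start + 1)
    return best


def coarsen(timestamps, bucket_seconds=3600):
    """Generalize timestamps by rounding each one down to its bucket."""
    return [t - (t % bucket_seconds) for t in timestamps]


# Hypothetical failed-login times, in seconds since the start of the hour.
burst = [100, 105, 110, 115, 120, 125]        # six failures in 25 seconds
spread = [100, 700, 1300, 1900, 2500, 3100]   # six failures over ~50 minutes

print(max_events_in_window(burst))    # 6
print(max_events_in_window(spread))   # 1

# After coarsening, the two patterns become the same data.
print(coarsen(burst) == coarsen(spread))        # True
print(max_events_in_window(coarsen(spread)))    # 6
```

Before generalization the burst is trivially separable from ordinary, spread-out failures. Afterward the two are indistinguishable, so a model trained or evaluated on the shared dataset either misses the burst pattern or flags everything.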
This is particularly true in the current environment where AI explainability, accountability, and robustness standards and frameworks are not widely accepted.
Efforts have begun to root out biases from datasets, enable privacy-preserving AI training, and reduce the amount of data required for AI training. But it will be harder to fully and continuously inoculate security-related datasets against the inaccuracies, manipulations, unknowns, and unknowables that are intrinsic to the nature of cyberspace.
Maintaining AI hygiene and data quality in ever-morphing, data-hungry digital enterprises may prove equally hard.
Therefore, it’s up to the data science and cybersecurity communities to:
- design and advocate for robust risk assessments and stress tests to enhance visibility and validation
- hard-code guardrails, and
- incorporate offsetting mechanisms that can ensure trust and stability in our digital ecosystem as AI and machine learning models progress and become a bigger part of the landscape
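One way to picture the first two items is a small stress-test harness. The following is a minimal sketch, assuming a stand-in “learned” detector that flags any transfer above 1,000 bytes and a hard-coded rule that always flags traffic to a blocked port; the threshold, the port number, the sample events, and every function name are hypothetical. It replays known-malicious samples under simple evasive perturbations and measures how many slip through, with and without the guardrail.

```python
def evasion_rate(detector, malicious_samples, perturbations):
    """Fraction of (sample, perturbation) pairs the detector fails to flag."""
    trials = 0
    evaded = 0
    for sample in malicious_samples:
        for perturb in perturbations:
            trials += 1
            if not detector(perturb(sample)):
                evaded += 1
    return evaded / trials


def with_guardrail(detector, hard_rule):
    """Wrap a detector with a fixed rule that the model cannot override."""
    def guarded(sample):
        return hard_rule(sample) or detector(sample)
    return guarded


def learned_detector(sample):
    """Stand-in for a trained model: flags large outbound transfers."""
    return sample["bytes_out"] > 1000


def blocked_port_rule(sample):
    """Hard-coded guardrail: traffic to a blocked port is always flagged."""
    return sample["dest_port"] in {4444}


def halve_volume(sample):
    """Evasion attempt: send half as much data per event."""
    return {**sample, "bytes_out": sample["bytes_out"] // 2}


def quarter_volume(sample):
    """Evasion attempt: send a quarter as much data per event."""
    return {**sample, "bytes_out": sample["bytes_out"] // 4}


known_malicious = [
    {"bytes_out": 1500, "dest_port": 4444},
    {"bytes_out": 3000, "dest_port": 4444},
]
perturbations = [halve_volume, quarter_volume]

baseline_rate = evasion_rate(learned_detector, known_malicious, perturbations)
guarded_rate = evasion_rate(
    with_guardrail(learned_detector, blocked_port_rule),
    known_malicious,
    perturbations,
)
print(baseline_rate)  # 0.75
print(guarded_rate)   # 0.0
```

A real assessment would use far richer perturbations and data, but the shape is the same: measure how the model behaves when adversaries adapt, and keep a layer of fixed rules that does not depend on what the model has learned.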