Import documents in OpenCTI with Ariane AI: Supercharge your entity extraction capabilities
At Filigran, we strongly believe AI is not here to replace the Analyst, but to assist them. All our AI features are developed to reduce Analyst fatigue, speed up qualification, and enhance Analyst intelligence. This is the goal of Ariane AI, the Artificial Intelligence we are building in the back end to power all AI features in our products.
The 6.6 release of OpenCTI brought many new features in Ariane AI, one of which is a new connector soberly dubbed import-document-ai. A seasoned OpenCTI user would quickly notice that the name comes from an existing connector, import-document. So what’s the relationship between the two? How does it work and how is it used in OpenCTI?
Too long, didn’t read
- This new import-document-ai connector extends import-document by making it possible to extract additional STIX Domain Objects (SDOs): Malwares, Intrusion Sets, Countries.
- Unlike the import-document connector, it does not rely on existing data in the platform. In other words, this is a platform-agnostic feature that works out of the box.
- It leverages AI’s Natural Language Processing (NLP) extraction capabilities, and every entity extracted is actually present in the text.
- As Ariane AI is available only as part of OpenCTI Enterprise Edition (EE), the same applies to this new feature.
Long read
How does Import-Document functionality work?
As a CTI knowledge base, OpenCTI is a platform that gathers a lot of information from varied sources and transposes it into the STIX format, enabling users to structure the intelligence they need. A significant part of this input information comes in the form of reports or articles, and manually extracting SDOs and STIX Cyber Observables (SCOs) is a cumbersome and error-prone task for analysts. A first connector, import-document, was implemented in late 2021 as an attempt to take away the heavy lifting of the extraction by relying on two extraction capabilities:
- extraction by regular expression (RegExp) of known patterns (IP addresses, URLs, domain names, vulnerabilities, etc.)
- extraction by platform matching: the configuration allows the user to set the entity types they want to extract (for instance Malware); then, for each run, all the Malware entity names on the platform are queried and searched for in the document with a full-text search.
This approach, although efficient, relies on a well-curated platform, and might be expensive to run depending on the amount of information already available in the platform. Besides, it cannot generalise to the extraction of new entities if these do not have a clear pattern for RegExp, for instance malware names and intrusion set names.
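To make the two extraction strategies described above more concrete, here is a minimal Python sketch. It is not the connector's actual code: the patterns, the function names and the `known_malware` list (which stands in for a query of entity names on the platform) are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified patterns; the connector's real patterns are not shown in this article.
PATTERNS = {
    "IPv4-Addr": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "Vulnerability": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def extract_by_regex(text):
    """Return (type, value) pairs for every pattern match in the text."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(0)))
    return found


def extract_by_platform_matching(text, known_names, entity_type="Malware"):
    """Search the text for entity names that already exist on the platform."""
    found = []
    for name in known_names:
        # Case-insensitive match of the exact known name, on word boundaries.
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            found.append((entity_type, name))
    return found


text = "The sample contacted 203.0.113.7 and exploited CVE-2021-44228 to drop Emotet and FooBarRat."
known_malware = ["Emotet", "TrickBot"]  # assumed stand-in for a platform query

regex_hits = extract_by_regex(text)
platform_hits = extract_by_platform_matching(text, known_malware)
print(regex_hits)
print(platform_hits)
```

In this sketch, the made-up name FooBarRat is never returned: it matches no pattern and is not yet known to the platform, which is exactly the limitation discussed above.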
How can AI help?
Let’s consider the following example, from Wikipedia: “In September 2020 it was reported that Kimsuky attempted to hack 11 officials of the United Nations Security Council.” Even though we may not actually know what Kimsuky is, we tend to understand that it refers to a group of malevolent cyber actors, i.e. an Intrusion Set in the STIX 2.1 framework. So it seems possible to guess which category a particular word belongs to by relying only on the small context of the surrounding sentence or paragraph.
This task is a well-known field of Natural Language Processing (NLP) dubbed Named Entity Recognition (NER). It has been broadly used and studied for the past fifteen years, but early models needed a lot of well-annotated data (a training set) to learn from in order to perform well. In the case of Intrusion Set NER, that means we would need a lot of different sentences containing an occurrence of such an entity, along with its exact position; definitely a long, manual and costly task. Fortunately, the technology behind Large Language Models (LLMs), namely transformers, has vastly benefited NLP tasks, and NER in particular: it is now possible to design and use new models that can be proficient with few examples, or even none at all! Furthermore, these models can be small and may run on commodity hardware, whereas LLMs often need GPUs.
What is the implementation of such a connector?
At Filigran, we strongly believe that AI should be an add-on capability that benefits the product and its users by automating tasks and saving analysts’ time. Therefore, an AI feature has to fit the existing workflows to enhance them; it is not the product’s duty to adapt to AI capabilities, but rather the other way around. Consequently, it felt natural to embed the aforementioned model in a connector, with the same triggers as import-document.
If the connector is correctly configured and deployed on the platform, it becomes available in the dropdown menu.
When selecting ImportDocumentAI, here is what happens :
- The platform launches a job on the connector.
- The connector retrieves the file to extract information from (txt, pdf, markdown and html files are supported) and parses the text from it.
- This text is sent to a webservice deployed in Filigran’s environment. It is through this request that the validity of the license is checked.
- The webservice performs the entity extraction by leveraging an NLP-AI model, and returns the detected entities to the connector.
- Given the entities extracted and the context, the connector generates a STIX bundle that is displayed as either an analysis workbench or a draft.
Then the user gets to see the result of the extraction and amend it if needed, before saving the results to the platform.
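As an illustration of the last step of this workflow, the sketch below turns a list of detected entities into a STIX 2.1 bundle using only the Python standard library. The shape of the webservice response (`label` and `text` keys), the `build_bundle` helper and the ExampleRAT name are hypothetical; the real connector's payloads and ID generation are not described in this article, and random UUIDs are used here only for brevity. Countries are left out, since a STIX location object needs additional properties such as a country code.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed mapping from extracted labels to STIX 2.1 object types.
STIX_TYPES = {"Malware": "malware", "Intrusion Set": "intrusion-set"}


def build_bundle(entities):
    """Turn extracted entities into a STIX 2.1 bundle represented as a plain dict."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")
    objects = []
    seen = set()
    for entity in entities:
        key = (entity["label"], entity["text"])
        if key in seen:  # the same name may be detected several times in a document
            continue
        seen.add(key)
        stix_type = STIX_TYPES[entity["label"]]
        obj = {
            "type": stix_type,
            "spec_version": "2.1",
            "id": f"{stix_type}--{uuid.uuid4()}",
            "created": now,
            "modified": now,
            "name": entity["text"],
        }
        if stix_type == "malware":
            obj["is_family"] = True  # property required by STIX 2.1; the value is an assumption
        objects.append(obj)
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}


# Hypothetical webservice response for a short report.
detected = [
    {"label": "Intrusion Set", "text": "Kimsuky"},
    {"label": "Malware", "text": "ExampleRAT"},
    {"label": "Intrusion Set", "text": "Kimsuky"},
]
bundle = build_bundle(detected)
print(json.dumps(bundle, indent=2))
```

The resulting bundle contains one object per distinct entity, which is the kind of content an analyst would then review in a workbench or a draft before saving it.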
Conclusion
The set of Filigran’s Ariane AI features has a new member: import-document-ai. This new connector, available to Enterprise Edition platforms, extends the traditional import-document connector by making it possible to extract three SDOs: Malwares, Intrusion Sets, Countries. Unlike its predecessor, import-document-ai does not rely on the state of the platform it runs with, so the extraction works quickly and out of the box!
This is just the beginning of the automation of information extraction. Next steps will include (but are not limited to): more SDOs extracted (Tools, Organizations, Sectors, …), relationships (with targets and uses as a priority) and continuous improvement of the existing model!
Join us on our Slack community channel to tell us what you think!