Wednesday, April 20, 2022

Microsoft Purview - Paint By Numbers Series (Part 1) - Sensitive Information Types





This document is not meant to replace any official documentation.  Those documents are continually updated and maintained by Microsoft Corporation.  If there is a discrepancy between this document and what you find in the Compliance User Interface (UI) or in the official documentation, you should always defer to that official documentation and contact your Microsoft Account team as needed.  Links to the data will be referenced both in the document steps as well as in the appendix.


All of the following steps should be done with test data, and where possible, testing should be performed in a test environment.  Testing should never be performed against production data.


Target Audience

The Sensitive Information Type (SIT) section of this blog series is aimed at Compliance officers who need to identify any PII and PHI data in their environment.


Document Scope

This document is meant to guide an administrator who is “net new” to Microsoft E5 Compliance through:

  • Creation of a Sensitive Information Type (SIT).
  • Modification of a Sensitive Information Type (SIT).
  • Testing of your Sensitive Information Type (SIT).




This document does not cover any other aspect of Microsoft E5 Compliance, including:

  • Exact Data Matches
  • Data Loss Prevention (DLP) for Exchange, OneDrive, Devices
  • Microsoft Cloud App Security (MCAS)
  • Records Management (retention and disposal)
  • Information Protection
  • Advanced eDiscovery


It is presumed that you have a pre-existing understanding of what Microsoft E5 Compliance does and how to navigate the User Interface (UI).

Overview of Document

  1. Use Case
  2. Definitions
  3. Notes
  4. Pre-requisites
  5. Create a new Sensitive Information Type (SIT)
  6. Modify an existing Sensitive Information Type (SIT)
  7. Test a Sensitive Information Type (SIT)
  8. Appendix and Links


Use Case

Sensitive Information Types (SITs) are used to flag data for Compliance based upon the content of the file or email, regardless of its location.  The Use Case here is to create a SIT that does not exist out-of-the-box or to modify an existing SIT that is lacking a keyword or pattern that needs to have a compliance policy applied to it.



Definitions

  1. Data Classification
    1. The core of the Compliance tool is the Microsoft Information Protection (MIP) engine.  This engine indexes existing data and then tracks any changes made to that data via the Compliance tool set (for example, labeling that data with sensitivity and governance labels).


  1. Trainable classifiers
    1. Trainable classifiers leverage Machine Learning (ML) to improve upon pre-built or “net new” keyword dictionaries.  Here are 3 use cases:
      1. Harassment and bad behavior – what is determined to be bad behavior or harassment changes over time.  ML allows for the system to learn this as more data is put into the Tenant.
      2. Vulgarity – not all vulgarity is known.  Often these are slang terms that no one organization knows on day one.
      3. Applications – Most organizations leverage standardized applications for admission, release forms, etc.  Only certain fields change (example – name, address, date of birth, etc.).


  1. Sensitive Information Types (SITs)
    1. A SIT is anything that might be unique that you want to track, label, block or set a retention label on.  For example – Credit Card information, Passport information, Social Security Numbers, Medical Record Numbers, Patient IDs, etc.
    2. You can create new SITs and you can modify SITs that you create.
    3. You cannot modify a Microsoft out-of-the-box SIT.  However, you can copy an out-of-the-box SIT and modify that copy as you see fit.


  1. Exact Data Matches (EDMs)
    1. EDMs are used when you want to group large amounts of data specific to your organization.  As the name implies, these are unique to the organization.  For example, instead of any SSN or even any 9-digit number, it would be a unique and specific SSN.  Here are a few examples of how EDMs would be used.
      1. Example #1 – PII specific to customers
        • Customer information such as Employee numbers, Social Security Numbers, Credit Card numbers, addresses, Dates of Birth, etc.
      2. Example #2 – PHI specific to patients
        • Patient information such as Medical Record Numbers, Patient IDs, Social Security Numbers, Credit Card numbers, addresses, Dates of Birth, etc.


  1. Content Search
    1. This allows a compliance officer to see where the data resides in an organization’s environment.  This does not access the data directly.  It searches the indexes created by EXO, SPO and OneDrive by default.  Using this search, the compliance officer can see where the data is and review the context, if necessary.
      1. Note – It does not allow for eDiscovery search, hold, and export.  Nor does it allow the compliance officer to apply labels or encryption, Data Loss Prevention (DLP) policies, or retention and disposal policies.  That is all done in other parts of the Compliance UI.


  1. Activity Explorer
    1. This allows the compliance officer or IT team to view which activities have been run in the Compliance tool and then drill down into what was done, when, and by whom.  It should be viewed as an auditing tool.



Notes

  • Replication times for Compliance changes to take effect
    • DLP policies will take approximately 15 minutes to take effect
    • Other Compliance items could take 24-48 hours to take effect



Pre-requisites

Your test account will need the following rights to run the activities in this blog series.  See the Appendix below for the link covering eDiscovery rights.

  • Compliance Administrator
  • eDiscovery Administrator
  • eDiscovery Manager


Data Classification Overview


  1. In the left-hand navigation pane, select Data Classification





  1. In the right-hand pane, you will see several options.  The first is Overview.  This is a dashboard view of your Sensitive Information Types (SITs) and where they are located in your environment.






  1. You will see links in each sub-pane in this dashboard.  These will allow you to drill down into the information provided, or you can click on the links across the top.  We will only be working in the Sensitive information types tab.


Sensitive Information Types (SITs)


Before we create or modify a SIT, let us look at the SIT pane and an existing SIT.

  1. Click on the Sensitive information types (SITs) tab.





  1. In the top right-hand corner, you will find a search field where you can search all of the SITs in your Tenant.





  1. If you click on one of these SITs, a window will pop-up on the right showing the details of the SIT.  You cannot modify an out-of-the-box SIT, but you can copy one and modify that copy.  More on that later on.  You can also run a test on a file to verify that the SIT is seeing your data.  Again, more on that later on.  Here is an example of that pop-up.






Creating a Sensitive Information Type (SIT)


First, we will create a new SIT.

  1. Click Create info type in the top-right of the work pane





  1. You will start a 4-step wizard. 


  1. First, give a Name and Description to your new SIT.


  1. Example – Name – Customer ID
  2. Example - Description - Customer Identity Number





  1. Next you will create a pattern to associate with your SIT.  Click Create pattern.



  1. A New Pattern pane will appear on the right side of the screen.


  1. Choose your confidence level for the SIT you are about to create (High, Medium, or Low).  I am going to choose High Confidence for my SIT.




  1. Click Add Primary Element.  You can choose from Regular Expressions, Keyword lists, Keyword dictionaries or Functions.  I will choose a Keyword list as it is the simplest to create.  Keyword dictionaries are more robust keyword lists, Regular expressions are more involved, and Functions are the most complicated elements to create.  We will not be covering those other options in this document.





  1. You can choose from an existing keyword list or create a new ID.
    • Example – ID = CustomerID.
  2. Under Keyword group #1, I am going to fill in only the Case insensitive field, separating the words with commas.  You can use any word or list of words, but I will be using alphanumeric strings.  You will notice I separated the items by commas only.  No spaces were used.
    • Example – A1A1A1,A2A2A2,A3A3A3,A4A4A4,A5A5A5
  3. I will be using Word Match, which is the default.
    • Note – You can add more than one keyword group if desired.




d. When you are sure you have what you want, click Done.
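
Before uploading anything, it can help to reason about what a case-insensitive, whole-word keyword list should and should not match.  The following is a minimal Python sketch of that idea using the example keywords from this step.  The function name `find_keyword_matches` is hypothetical, and the regular-expression approach is my own approximation for illustration only.  It is not how the Purview classification engine is implemented.

```python
import re
from typing import List

# Example keywords from this step: comma-separated, no spaces.
KEYWORD_LIST = "A1A1A1,A2A2A2,A3A3A3,A4A4A4,A5A5A5"


def find_keyword_matches(text: str, keyword_list: str = KEYWORD_LIST) -> List[str]:
    """Return every keyword occurrence found in text as a whole word, ignoring case.

    Illustration only: an approximation of a case-insensitive word match,
    not the actual Purview matching logic.
    """
    keywords = [k for k in keyword_list.split(",") if k]
    # \b word boundaries stop "A1A1A1" from matching inside "XA1A1A1Z".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(find_keyword_matches("Customer a1a1a1 called about A3A3A3."))
    print(find_keyword_matches("Part number XA1A1A1Z is unrelated."))
```

The second call returns an empty list because the keyword is embedded in a longer string, which is the behavior I would expect from a whole-word match.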


9. There is a proximity option between primary and secondary elements.  We will leave this at the default of 300 characters.
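
To picture what a proximity setting means, here is a small Python sketch that checks whether a supporting term appears within a given number of characters of a primary term.  The function name `supporting_within_proximity` is hypothetical, and the way the window is measured here (300 characters on either side of the primary match, with the supporting term fully inside that window) is an assumption for illustration.  The exact way the Compliance engine measures proximity may differ.

```python
import re

PROXIMITY_CHARS = 300  # the default value mentioned in this step


def supporting_within_proximity(
    text: str, primary: str, supporting: str, window: int = PROXIMITY_CHARS
) -> bool:
    """Return True if `supporting` appears near any occurrence of `primary`.

    Assumption: the window spans `window` characters before and after each
    primary match, and the supporting term must sit fully inside it.
    """
    for match in re.finditer(re.escape(primary), text, re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if re.search(re.escape(supporting), text[start:end], re.IGNORECASE):
            return True
    return False


if __name__ == "__main__":
    near = "A1A1A1 " + "x" * 100 + " customer id"
    far = "A1A1A1 " + "x" * 400 + " customer id"
    print(supporting_within_proximity(near, "A1A1A1", "customer id"))
    print(supporting_within_proximity(far, "A1A1A1", "customer id"))
```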





  1. You can add Supporting elements if you like.  Similar to Primary Elements, these can be Regular Expressions, Keyword lists, Keyword dictionaries or Functions.  Additionally, you can add a group of elements.  We will not be adding a Supporting element at this time.





  1. You can also add Additional checks.  Below are some examples of what these can be.  We will not be adding additional checks at this time.





  1. When you are satisfied with what you see, click Create.
  2. You can copy, edit or delete this pattern.  Copying it will create a duplicate of the pattern.  You can then modify this duplicate as needed.  This is useful if you want to create variations of the same pattern (keywords, functions, regular expressions) as part of your SIT.





  1. Click Next when you are ready.
  2. Confirm the confidence level of the pattern, then click Next.
  3. Perform one final review of your SIT and when you are satisfied, click Create.






Modifying a Sensitive Information Type (SIT)


Next, we will copy and modify an existing SIT.  We are doing this against Social Security Numbers so we only look for the numeric pattern of those numbers without the need to have the associated keyword (e.g. SSN or SocSecNum).  We will use this SIT in other parts of this blog series.


  1. Staying in the middle pane, click on Sensitive information types.






  1. On the right side of the pane, enter “SSN” in the search field.





  1. Select U.S. Social Security Number (SSN).





  1. In the right-hand pop-up, select Copy.






  1. Select the new copy (default name will be “U.S. Social Security Number (SSN) copy”), and in the right-hand pop-up, select Edit.




  1. In the wizard that appears, go to the Name section and rename the copy of the U.S. Social Security Number (SSN).  I have renamed mine “U.S. SSN – numbers only”.




  1. In the wizard that appears, click Next until you arrive at the Patterns section of the wizard.  You will find 4 patterns.



  1. For each pattern, click the pencil icon to edit the pattern.
  2. Find the Supporting element named “Keyword_ssn” and delete it.  Then click Update.





  1. Repeat the step above for the other 3 patterns.
  2. Once all the patterns are updated, click Next until you get to the end and then Save the modified SIT.
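
To get a feel for what "numbers only" detection looks like once the keyword requirement is removed, here is a rough Python sketch.  The regular expression and the function name `find_ssn_like_numbers` are my own assumptions for illustration.  They are not the patterns or functions that the built-in U.S. Social Security Number (SSN) SIT actually uses.

```python
import re
from typing import List

# Assumed shape: 3 digits, 2 digits, 4 digits, with the same optional
# separator (hyphen or space) in both positions, and no digits touching
# either end of the match.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}([- ]?)\d{2}\1\d{4}(?!\d)")


def find_ssn_like_numbers(text: str) -> List[str]:
    """Return substrings shaped like an SSN, with no keyword required nearby."""
    return [m.group(0) for m in SSN_LIKE.finditer(text)]


if __name__ == "__main__":
    # 123-45-6789 is a fictional value used only as test data.
    print(find_ssn_like_numbers("Number on file: 123-45-6789"))
    print(find_ssn_like_numbers("Order 123456789 shipped"))
    print(find_ssn_like_numbers("Tracking 1234567890"))
```

Notice that the second example matches even though it is an order number.  Without a supporting keyword, any number with this shape is a candidate, so expect more matches than the out-of-the-box SIT would return.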


Test a Sensitive Information Type (SIT)


  1. Click on either your new or modified SIT and click Test in the pop-up to the top-right.



  1. Click Upload File and browse to a file with your test data.



  1. Click Test.  If everything has been set up correctly, you should see something like the result below from my “U.S. SSN – numbers only” test.




  1. Click Finish.
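
If you do not already have a file of test data, a short script can generate one.  The sketch below writes a plain-text file containing the example Customer ID keywords and a fictional SSN-shaped number.  The file name `sit_test_data.txt`, the sample lines, and the assumption that a plain-text file is suitable for your upload are all mine, so adjust them to match your own SIT and environment.

```python
import tempfile
from pathlib import Path

# Fictional test data only. Never use production data for this.
SAMPLE_LINES = [
    "Customer ID A1A1A1 opened a support ticket.",
    "Follow-up for customer a3a3a3 is scheduled.",
    "Fictional number for pattern testing: 123-45-6789",
    "This line contains no sensitive data.",
]


def write_test_file(path: Path) -> Path:
    """Write the sample lines to `path` as UTF-8 text and return the path."""
    path.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical file name; written to the system temp directory.
    out = write_test_file(Path(tempfile.gettempdir()) / "sit_test_data.txt")
    print(f"Wrote {out}")
```

Including a line with no sensitive data and a lower-case keyword gives you a quick way to confirm both that the SIT fires where it should and that it stays quiet where it should not.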


Now that you have created, modified and tested a SIT, you are ready to move on to one of the other parts of this blog series.





Appendix and Links


Note: This solution is a sample and may be used with Microsoft Compliance tools for dissemination of reference information only. This solution is not intended or made available for use as a replacement for professional and individualized technical advice from Microsoft or a Microsoft certified partner when it comes to the implementation of a compliance and/or advanced eDiscovery solution and no license or right is granted by Microsoft to use this solution for such purposes. This solution is not designed or intended to be a substitute for professional technical advice from Microsoft or a Microsoft certified partner when it comes to the design or implementation of a compliance and/or advanced eDiscovery solution and should not be used as such.  Customer bears the sole risk and responsibility for any use. Microsoft does not warrant that the solution or any materials provided in connection therewith will be sufficient for any business purposes or meet the business requirements of any person or organization.

