Data supervision and security in large data repositories - QC-091

Preferred Disciplines: Computer science/engineering or related; Master, PhD or Post-Doc
Company: N/A
Project Length: 8 months each
Desired start date: ASAP
Location: Montreal, QC
No. of Positions: 2
Preferences: Universities in Quebec

About the Company: 

Our company provides on-premises or cloud-based security, archiving, migration and file share solutions across multiple platforms, including Microsoft Exchange/O365, Google, Amazon, Box, Dropbox, Twitter and Facebook. As organizations spread data across multiple cloud platforms, our product allows for simple consolidation of all that data into a single repository with unified indexing and eDiscovery features.

Project Description:

Regulatory compliance requires organizations to manage the data their members produce. The diversity of data sources to supervise (email archives, file shares, public and private clouds, etc.) and the large volume of data to process (several hundred million documents) pose great challenges in terms of data supervision and security.

Given the importance of personal data protection, this proposal explores new ways to detect improper dissemination of sensitive data and new methods to supervise user activity. The underlying goal is to automatically identify potential compliance and security violations. We want to combine natural language processing and data-mining techniques to i) extract information from document content and ii) exploit metadata from the documents and their users to build various communication networks. Our aim is to exploit such networks to answer compliance questions such as "What type of content is being exchanged?" and "Is sensitive content exposed?"

Recently, several cluster programming models and frameworks (Hadoop, MapReduce, Spark, etc.) have been proposed for large-scale distributed data processing. Our aim is to build a solution that leverages such frameworks.


Research Objectives:

  • Identify sensitive information in heterogeneous textual documents
  • Exploit communication networks to track information dissemination
  • Build statistical models to enforce document compliance


  • To be discussed

Expertise and Skills Needed:

  • Natural language processing (information extraction, information retrieval)
  • Machine learning
  • Graph mining
  • Data mining
  • Big data

For more information or to apply to this applied research position, please:

  1. Check your eligibility and find more information about open projects.
  2. Complete this webform. You will be asked to upload your CV. Remember to indicate the title of the project(s) you are interested in and obtain your professor’s approval to proceed!
  3. Interested students need to get the approval from their supervisor and send their CV along with a link to their supervisor’s university webpage by applying through the webform or directly to Simon Bousquet,