# pdf_parser
## All code related to PDF parsing

### The following elements are to be parsed from documents.
1. Documents
    1. Extracting dates from documents
    1. Classification Tags
    1. Extracting Key Entities from documents
        1. Patents
        1. References
        1. Entities
            1. Names
            1. Addresses
            1. Law Firms
            1. Contact Numbers
            1. Emails
    1. Association with Cases
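
As a rough illustration of two of the items above (dates and emails), the sketch below pulls them out of already-extracted text with regular expressions. It is not this repository's implementation. The function name `extract_entities` and both patterns are assumptions made for the example. The date pattern only covers ISO-style `YYYY-MM-DD` dates, and the email pattern is deliberately simple.

```python
import re

# Hypothetical patterns for illustration only; not taken from this repository.
# Matches ISO-style dates such as 2019-03-14 (no validation of month/day ranges).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Simplified email pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_entities(text):
    """Return the dates and emails found in `text`, in order of appearance."""
    return {
        "dates": DATE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Filed on 2019-03-14. Contact: jane.doe@example.com"
    print(extract_entities(sample))
```

For the sample string, this returns `{'dates': ['2019-03-14'], 'emails': ['jane.doe@example.com']}`. Names, addresses, and law firms usually need more than regular expressions, so they are left out of this sketch.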

### Setting up the code base.
1. Launch the terminal.
1. Enter the following command to go to the base directory:
   ```bash
   cd ~
   ```
1. Create a new directory `Code` and move into it:
   ```bash
   mkdir -p Code
   cd Code
   ```
1. Clone the repository by entering the following command:
   ```bash
   git clone gogs@git.fafadiatech.com:harsh/pdf_parser.git
   ```

### TODO LIST:
1. Implement OCR with Apache Tika.
1. Dockerise the full Apache Tika setup, including OCR.
1. Test the regular expressions (`re`) on scanned PDFs.