All code related to PDF parsing.

Harsh Parikh 2352bd3c3c added the intial codebase for parsing documents 2 years ago

pdf_parser

All the code related to PDF parsing.

The following elements are to be parsed from documents.

  1. Documents
    1. Extracting dates from documents
    2. Classification Tags
    3. Extracting Key Entities from documents
      1. Patents
      2. References
      3. Entities
        1. Names
        2. Addresses
        3. Law Firms
        4. Contact Numbers
        5. Emails
    4. Association with Cases
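
As a rough illustration of the entity extraction listed above, the following is a minimal sketch that pulls dates, emails, and contact numbers out of already-extracted text with regular expressions. It is not taken from this repository: the pattern names, the function names, the two date formats, and the US-style phone format are all assumptions made for the example, and real documents will need broader patterns.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only; not the repository's own rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumes US-style numbers such as (212) 555-0147 or 212-555-0147.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b")

# Each entry pairs a regex with the strptime format used to validate the match.
# Only two formats are covered here: 2019-04-01 and "March 5, 2019".
DATE_FORMATS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),
]


def extract_dates(text):
    """Return the sorted list of dates found in text."""
    found = []
    for pattern, fmt in DATE_FORMATS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(), fmt).date())
            except ValueError:
                # The text looked like a date but is not a valid one; skip it.
                continue
    return sorted(found)


def extract_contacts(text):
    """Return emails and contact numbers found in text, in order of appearance."""
    return {
        "emails": EMAIL_RE.findall(text),
        "contact_numbers": PHONE_RE.findall(text),
    }


# Made-up sample text standing in for the output of a PDF text extractor.
sample = (
    "Filed on March 5, 2019 and amended 2019-04-01. "
    "Contact jane.doe@example.com or (212) 555-0147."
)

if __name__ == "__main__":
    print(extract_dates(sample))
    print(extract_contacts(sample))
```

Validating each date candidate with `strptime` keeps strings that merely resemble a date (for example a capitalised word followed by numbers) out of the result. Names, addresses, and law firms are harder to capture with fixed patterns and would likely need a different approach.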

Setting up the code base.

  1. Launch the terminal.
  2. Enter the following command to go to your home directory:

    cd ~

  3. Make a new directory named Code and move into it:

    mkdir Code
    cd Code

  4. Clone the repository by entering the following command (git pull only works inside an existing repository):

    git clone gogs@git.fafadiatech.com:harsh/pdf_parser.git

TODO list:

  1. Implementing OCR on Apache Tika.
  2. Dockerising the complete Apache Tika setup with OCR.
  3. Testing the regular expressions (re) on scanned PDFs.