Contributors

Module to read and extract information from PDF's

Hello everyone. 

We have a client using Odoo 16 that needs to extract information from a PDF file and update a res.partner record with it. The PDF contains data such as name, address, ZIP code, VAT number, etc. Does anyone know of a module or Python library that could help us with this?

Thank you!

--

SAMUEL MACIAS OROPEZA

TECH LEAD

smacias@opensourceintegrators.com

P.O. BOX 940, HIGLEY, AZ 85236


by Samuel Macias Oropeza - 10:41 - 6 Sep 2023

Follow-Ups

  • Re: Module to read and extract information from PDF's
    Out of the box, Odoo can extract the text content of an ir.attachment.
    You just need to make sure the pdfminer.six Python library is installed.

    When that is the case, the attachment's text is extracted and written to a text field on ir.attachment (index_content).
    You can then run content searches on it or even implement business logic based on it.

    Reference:
    https://github.com/odoo/odoo/blob/55423cbdeeb1ce35fb257624ea0d04d4be99a943/addons/attachment_indexation/__manifest__.py#L13

    Thanks
    Daniel

    _______________________________________________
    Mailing-List: https://odoo-community.org/groups/contributors-15
    Post to: mailto:contributors@odoo-community.org
    Unsubscribe: https://odoo-community.org/groups?unsubscribe


    --
    DANIEL REIS
    MANAGING PARTNER

    M: +351 919 991 307
    E: dreis@OpenSourceIntegrators.com
    A: Avenida da República 3000, Estoril Office B, 3º Escr.34, 2649-517 Cascais


    by Daniel Reis - 09:26 - 7 Sep 2023
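
As a rough illustration of the "business logic" step mentioned in the reply above, the text extracted from the PDF could be parsed into a dictionary of res.partner values. This is only a sketch under stated assumptions: the labels ("Name:", "Address:", "ZIP Code:", "VAT:") and the helper parse_partner_text are hypothetical and depend entirely on the layout of the client's PDFs. The keys name, street, zip and vat are standard res.partner fields.

```python
import re

# Hypothetical label -> res.partner field mapping; adjust to the real PDF layout.
FIELD_PATTERNS = {
    "name": re.compile(r"^Name:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "street": re.compile(r"^Address:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "zip": re.compile(r"^ZIP(?: Code)?:[ \t]*(\S+)", re.MULTILINE | re.IGNORECASE),
    "vat": re.compile(r"^VAT(?: Number)?:[ \t]*(\S+)", re.MULTILINE | re.IGNORECASE),
}


def parse_partner_text(text):
    """Return a dict of res.partner values found in the extracted text.

    Fields whose label is not found are simply left out of the result.
    """
    vals = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text or "")
        if match:
            vals[field] = match.group(1).strip()
    return vals


if __name__ == "__main__":
    sample = (
        "Name: Acme Corp\n"
        "Address: 1 Main Street\n"
        "ZIP Code: 85236\n"
        "VAT: PT123456789\n"
    )
    print(parse_partner_text(sample))
```

Inside Odoo, the resulting dictionary could presumably be passed to partner.write(), with the input text taken from the attachment's index_content field. Validating the values first (VAT format, duplicate partners) is left out of the sketch.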
  • Re: Module to read and extract information from PDF's
    We are finding that invoice2data has become a bit less stable under its new maintainers. For years it was fairly static and fairly dedicated to Odoo; now it is more generalised. For this purpose it would also need some customization, and it really only suits cases where you know the document layout beforehand. We still use it, but wouldn't for a requirement like this.

    For a recent requirement to integrate with DMS and the Enterprise Documents module, automatically receiving documents and attaching them to the correct record, we went with the libraries listed below plus a simple custom front-end model for defining patterns. This was a back-scanning project of some 1m pages, with multi-page detection and multiple document types. Basically, you scan 150 pages on a scanner; the batch comes in, gets parsed, and is split at the detected page breaks into separate files, each with a copy of its extracted text, and each file is then automatically attached to the correct record.

    pdftotext works as advertised. tesseract has some dependencies and quirks, which is fine; it just needs some handling for errors and ambiguous results. To do really well, you would also want OpenCV or similar to fix contrast and deskew images from scanned files, but we found that, for the documents we were processing, it didn't add enough value to justify the overhead. We offered to clean this work up and contribute it to the OCA, but were refused on the basis that no one does OCR anymore.

    Alternatively, you can just push images to something like GVision. That was our first implementation. It is maybe 1/3 of the code, but it is harder to test in an isolated dev environment, and the results, while much more comprehensive, weren't really value for money for our use case.

    import pdftotext
    import pytesseract
    from pdf2image import convert_from_bytes
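
One way these three libraries might fit together is sketched below: read the PDF's text layer with pdftotext first, and fall back to rendering the pages with pdf2image and running them through pytesseract when the text layer is missing or nearly empty. This is an assumption about how such a pipeline could look, not the implementation described above. The function names, the min_chars threshold and the dpi value are hypothetical.

```python
import io


def text_layer_pages(pdf_bytes):
    """Return the embedded text of each page using pdftotext (Poppler)."""
    import pdftotext  # imported lazily so the sketch loads without it installed

    return list(pdftotext.PDF(io.BytesIO(pdf_bytes)))


def ocr_pages(pdf_bytes, dpi=300):
    """Render each page to an image and OCR it with Tesseract."""
    import pytesseract
    from pdf2image import convert_from_bytes

    images = convert_from_bytes(pdf_bytes, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def extract_pages(pdf_bytes, text_reader=text_layer_pages,
                  ocr_reader=ocr_pages, min_chars=20):
    """Prefer the text layer; fall back to OCR for scanned documents.

    If the text reader fails, or any page has fewer than `min_chars`
    characters of text, the whole document is sent through OCR instead.
    """
    try:
        pages = text_reader(pdf_bytes)
    except Exception:
        # Broad on purpose for a sketch: any failure just triggers OCR.
        pages = []
    if pages and all(len(page.strip()) >= min_chars for page in pages):
        return pages
    return ocr_reader(pdf_bytes)
```

The readers are passed in as parameters so the fallback logic can be tested in an isolated dev environment without Poppler or Tesseract installed. A per-page fallback, or OpenCV preprocessing before OCR, could be added in ocr_pages if the scans need it.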

    On Thu, Sep 7, 2023 at 8:51 AM Enric Tobella Alomar <notifications@odoo-community.org> wrote:
    You can try with invoice2data extractor.

    It can extract data from PDF (not only invoice info)

    by Graeme Gellatly - 11:26 - 6 Sep 2023
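
The batch-splitting step described in the reply above (a stack of scanned pages cut into separate documents at detected page breaks) could be sketched roughly as follows. This is a minimal illustration that assumes each document type can be recognised by a regular expression matching its first page. The function split_documents and the two example patterns are hypothetical; in the project described, the patterns were defined through a custom model.

```python
import re


def split_documents(pages, first_page_patterns):
    """Group page texts into documents.

    A new document starts at any page whose text matches one of the
    first-page patterns; pages that match nothing are appended to the
    document currently being built.
    """
    documents = []
    for text in pages:
        doc_type = next(
            (name for name, pattern in first_page_patterns.items()
             if pattern.search(text)),
            None,
        )
        # Start a new document on a recognised first page, or when the
        # batch begins with a page that matches no pattern.
        if doc_type is not None or not documents:
            documents.append({"type": doc_type, "pages": []})
        documents[-1]["pages"].append(text)
    return documents


# Hypothetical patterns for illustration only.
PATTERNS = {
    "invoice": re.compile(r"\bINVOICE\s+No\.", re.IGNORECASE),
    "delivery": re.compile(r"\bDELIVERY\s+NOTE\b", re.IGNORECASE),
}

if __name__ == "__main__":
    batch = [
        "INVOICE No. 1001\n...",
        "continued items",
        "Delivery Note 55",
        "INVOICE No. 1002",
    ]
    for doc in split_documents(batch, PATTERNS):
        print(doc["type"], len(doc["pages"]))
```

Each resulting group could then be written out as its own file and attached to the matching record. How the record is looked up (for example by a reference number found in the text) is outside this sketch.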
  • Re: Module to read and extract information from PDF's
    You can try the invoice2data extractor.

    It can extract data from PDFs (not only invoice information).

    --
    Enric Tobella Alomar
    CEO & Founder


    by Enric Tobella Alomar - 10:51 - 6 Sep 2023