20240905

I don't know why, but the today's meeting was exhausting.

Peeking (?) Under the Hood of PyMuPDF

pymupdf/PyMuPDF

There are several options such as jsvine/pdfplumber when it comes to PDF management, and our team uses PyMuPDF for that purpose.

Integer as a Flag

Redaction is removing visible texts or graphics from a PDF document. It's used to hide sensitive information or reduce the entropy of PDF pages.

To do redactions with PyMuPDF, we first need to use Page.add_redact_annot() to add redaction annotations, and they call Page.apply_redactions().

# document: fitz.Document
# rectangles: list[tuple[int, int, int, int]]

page: fitz.Page = document[0]  # first page
for rect in rectangles:
  page.add_redact_annot(rect)
page.apply_redactions()

Page.apply_redactions

What I stumbled upon today is the text option. The document says its default value is text=PDF_REDACT_TEXT_REMOVE | 0 but the constant PDF_REDACT_TEXT_REMOVE does not exist in the code base. So, in the reality, I'm expected to pass int to the argument.

It can be confirmed in the source code

def apply_redactions(
    page: pymupdf.Page, images: int = 2, graphics: int = 1, text: int = 0
) -> bool:

I may be totally wrong, but I prefer not using int as a kind of flag.

PyMongo has similar implementation.

PyMongo: collection - Collection level operations

pymongo.ASCENDING = 1

Ascending sort order.

pymongo.DESCENDING = -1

Descending sort order.

I remember that once I asked my colleague to use the constants pymongo.ASCENDING and pymongo.DESCENDING instead of integers 1 and -1 because when seeing 1 and -1 in the source code, it's not obvious what they mean. Whereas, the constants convey good amount of information.

If I were to implement a similar method, I would define constants in the same way as PyMongo because integers hardly have information/readability.

But I'm not sure. There may be some advantages of simply using integers. Please correct me if my understanding is stupid.

Binary Expressions as Flags

PyMuPDF/src_classic/helper-python.i

TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_INHIBIT_SPACES = 8
TEXT_DEHYPHENATE = 16
TEXT_PRESERVE_SPANS = 32
TEXT_MEDIABOX_CLIP = 64
TEXT_CID_FOR_UNKNOWN_UNICODE = 128

TEXTFLAGS_BLOCKS = (0
        | TEXT_PRESERVE_LIGATURES
        | TEXT_PRESERVE_WHITESPACE
        | TEXT_MEDIABOX_CLIP
        | TEXT_CID_FOR_UNKNOWN_UNICODE
        )

PyMuPDF/src_classic/utils.py

def get_text(
    page: Page,
    option: str = "text",
    clip: rect_like = None,
    flags: OptInt = None,
    textpage: TextPage = None,
    sort: bool = False,
    delimiters=None,
):
    formats = {
        "text": fitz.TEXTFLAGS_TEXT,
        "html": fitz.TEXTFLAGS_HTML,
        "json": fitz.TEXTFLAGS_DICT,
        "rawjson": fitz.TEXTFLAGS_RAWDICT,
        "xml": fitz.TEXTFLAGS_XML,
        "xhtml": fitz.TEXTFLAGS_XHTML,
        "dict": fitz.TEXTFLAGS_DICT,
        "rawdict": fitz.TEXTFLAGS_RAWDICT,
        "words": fitz.TEXTFLAGS_WORDS,
        "blocks": fitz.TEXTFLAGS_BLOCKS,
    }

    # ...

    if option == "blocks":
        return get_text_blocks(
            page, clip=clip, flags=flags, textpage=textpage, sort=sort
        )

My brain cells are dying due to fatigue, but I would like to note that the flags are managed with the power of 2.

For example, 3 = b11 means TEXT_PRESERVE_LIGATURES (= 1) and TEXT_PRESERVE_WHITESPACE (= 2) are turned on.

It is embarassing, but I have never seen this implementation before. I find it a great way of managing multiple flags; it's just easy to understand and manage.

I hope I get a chance to do this someday, in some projects.


Rice 800 Bagels 500 Yogurt 300 Protein shake 200

Total 1800 kcal

10k run

I watched a YouTube video saying that 60~70% of the max heart rate is best for jogging.

My max heart rate is around 200, maybe, and I tried to keep my heart rate around 120~140.

It turned out that my heart rate bumped to 150 and ended up at 170, even though I tried to run at the slowest speed possible.

How is it possible to keep a heart rate below 60% of max? Interesting.


MUST:

TODO:


index 20240904 20240906