9 Comments
Feb 1 · Liked by Cameron R. Wolfe, Ph.D.

Hey Cameron, love the depth of the article! I have a question for you regarding the retrieval and fine-tuning article: what are your thoughts on OpenAI releasing the ability for users to build custom GPTs? Does that do away with fine-tuning? Have you seen or tested their effectiveness? Thank you!

author

The functionality is really cool and useful, but I definitely don't think it does away with fine-tuning! Practitioners will still want to build smaller, customized models that they can host in-house and specialize on their own data. Depending on the use case, people might not be comfortable using centralized, proprietary models that they don't control.

Thank you for your answer!

Feb 1 · Liked by Cameron R. Wolfe, Ph.D.

How do you know there is not BEIR contamination and generally data contamination in the synthetic data generated by GPT-4?

The fine-tuned LLM (in your last/second to last paper) that used a mixture of synthetic data and other data that ended up beating some BEIR benchmark was surprising, and I’m wondering if that is a fair benchmark.

author

Great point! This is definitely possible. We probably need to standardize the reporting of contamination metrics as a community to ensure that we are seeing actual performance benefits and not simply training on the test set. However, this is somewhat difficult when GPT-4's training dataset is unknown/proprietary (we can only estimate contamination by downloading a ton of data from the internet).
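
One rough way to estimate contamination against a reference corpus is to check for n-gram overlap between benchmark examples and the downloaded text. The snippet below is only a minimal sketch under that assumption: the function names (`ngrams`, `contamination_rate`), the whitespace tokenization, and the default `n=8` are illustrative choices, not a method taken from the article or from any of the papers discussed.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, reference_corpus, n=8):
    """Fraction of test examples sharing at least one n-gram with the reference corpus.

    Hypothetical helper: a crude proxy for contamination, not a standardized metric.
    """
    corpus_ngrams = set()
    for doc in reference_corpus:
        corpus_ngrams |= ngrams(doc, n)

    if not test_examples:
        return 0.0

    # An example is flagged if any of its n-grams also appears in the corpus.
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(test_examples)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    tests = ["quick brown fox jumps", "something else entirely different here"]
    print(contamination_rate(tests, corpus, n=3))
```

A check like this only sees the text you managed to download, so a low overlap rate does not rule out contamination in a proprietary training set.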

Jan 22 · edited Jan 22 · Liked by Cameron R. Wolfe, Ph.D.

> The first step in applying DocLLM is to pass a document through an optical character recognition (ORC) system.

Do you have any suggestions for a reliable open-source OCR system?

By the way, there is a typo in your article ("ORC" should be "OCR").

author

I typically use Tesseract, but I know that its performance can lag behind certain proprietary solutions (e.g., Azure OCR API). I think OCR systems have improved rapidly over the last year, so I'm sure more open-source systems will be released soon.

Jan 22 · Liked by Cameron R. Wolfe, Ph.D.

Another notable article to keep me abreast of what's going on in the field. Thanks a lot Cameron!

author

Of course! Glad you liked the article :)
