OK, here is the spec:
From WhatsApp Web (already logged in, via Chrome, Firefox, whatever): select a predefined conversation, download the full message history, and store it in a sensible folder structure (i.e. chats/id_of_chat/images, audio, messages). The primary organisation file should ideally be both JSON AND CSV (CSV mainly for ease of checking...), with columns:
sender, senttimedate, raw_innerText, raw_text, has_audio, voicenote_codes, has_image, image_codes
senttimedate can ideally be subdivided into time and date... this information is just missing in some instances, as far as I can tell.
raw_innerText is what I get from doing an element.innerText in JavaScript on the relevant message "row" (I believe the divs are role="row" or something...).
raw_text is the raw text content of the message (if applicable)
has_audio, voicenote_codes, has_image, image_codes are pretty self-explanatory really. The codes are just there to temporarily tie the downloaded media to the relevant message before it is shuffled into the appropriate folder and renamed sensibly... use your discretion.
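To make the output format a bit more concrete, here is a rough Python sketch of the folder layout and the JSON + CSV writer. The function names (chat_dirs, normalise, write_messages), the file names messages.json / messages.csv, and joining multiple codes with ";" in the CSV are all just my assumptions for illustration, not requirements.

```python
import csv
import json
from pathlib import Path

# Column names taken from the spec above.
COLUMNS = ["sender", "senttimedate", "raw_innerText", "raw_text",
           "has_audio", "voicenote_codes", "has_image", "image_codes"]


def chat_dirs(root, chat_id):
    """Create chats/<chat_id>/{images,audio,messages} and return the paths."""
    base = Path(root) / "chats" / str(chat_id)
    dirs = {name: base / name for name in ("images", "audio", "messages")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def normalise(record):
    """Give every row the same columns; missing values (e.g. no timestamp) become None."""
    row = {col: record.get(col) for col in COLUMNS}
    row["has_audio"] = bool(row["has_audio"])
    row["has_image"] = bool(row["has_image"])
    row["voicenote_codes"] = list(row["voicenote_codes"] or [])
    row["image_codes"] = list(row["image_codes"] or [])
    return row


def write_messages(records, messages_dir):
    """Write the same rows to messages.json and messages.csv (hypothetical file names)."""
    rows = [normalise(r) for r in records]
    messages_dir = Path(messages_dir)
    json_path = messages_dir / "messages.json"
    csv_path = messages_dir / "messages.csv"
    json_path.write_text(json.dumps(rows, ensure_ascii=False, indent=2),
                         encoding="utf-8")
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            flat = dict(row)
            # CSV cells are flat, so lists of codes are joined with ";" (an assumption).
            flat["voicenote_codes"] = ";".join(row["voicenote_codes"])
            flat["image_codes"] = ";".join(row["image_codes"])
            writer.writerow(flat)
    return json_path, csv_path
```

Something like write_messages(records, chat_dirs(".", "id_of_chat")["messages"]) would then produce both files side by side, with the CSV being the one to eyeball.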
Images can be downloaded trivially, as a blob URL is embedded in the HTML.
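One thing to keep in mind: a blob: URL only resolves inside the page that created it, so Python cannot fetch it directly. A possible approach (an assumption on my part, not something I have settled on) is for the injected JS to read the blob and hand it over as a base64 data URL, which the Python side then decodes. The function name save_data_url, the image_code argument and the extension table below are hypothetical.

```python
import base64
from pathlib import Path

# Assumed mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {"image/jpeg": ".jpg", "image/png": ".png", "image/webp": ".webp"}


def save_data_url(data_url, images_dir, image_code):
    """Decode a base64 data URL and save it as <image_code><ext> in images_dir."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    # Strip the "data:" prefix and ";base64" suffix, ignore any extra parameters.
    mime = header[len("data:"):-len(";base64")].split(";")[0]
    ext = EXTENSIONS.get(mime, ".bin")
    target = Path(images_dir) / f"{image_code}{ext}"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(base64.b64decode(payload))
    return target
```

The image_code here is the same temporary code that goes into the image_codes column, so the saved file can be matched back to its message.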
Voice notes: I just cannot find any way to download them without simulating user input.
This is fine! BUT, I have had a lot of headaches with simulation libraries before: clicknium being fake open source and having random server failures at times, and Selenium just... its syntax seems to be lacking in some obvious ways... I haven't tried Puppeteer, maybe that's an option. But here's the thing: if I can just set up the Chrome profiles manually, then use pyautoit and pyautogui (along with some pytesseract) to navigate the browser, plus Tampermonkey to inject some JS, it seems robust and relatively immune to random browser changes breaking specific automation libraries.
I am so close to just finishing this up myself but gods, I'm just wasting so much time I may as well hand it over to someone who can get this done in a sensible way.
Input Options:
Conversation to target (may be group or individual - don't need to worry about activating the conversation, this is handled - each will be sorted into a separate folder)
Scrape method: COMPLETE / UPDATE ONLY (COMPLETE is typically used the first time round, from then on only UPDATES).
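As a rough illustration of how the two input options could behave in Python: COMPLETE replaces whatever is stored, UPDATE ONLY appends just the messages not seen before. The names ScrapeMethod, ScrapeOptions, message_key and apply_scrape are hypothetical, and treating (sender, senttimedate, raw_innerText) as the identity of a message is purely my assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ScrapeMethod(Enum):
    COMPLETE = "COMPLETE"
    UPDATE_ONLY = "UPDATE ONLY"


@dataclass
class ScrapeOptions:
    conversation: str  # group or individual chat to target
    method: ScrapeMethod = ScrapeMethod.COMPLETE


def message_key(msg):
    # Assumed identity of a message; there is no official message id in the spec.
    return (msg.get("sender"), msg.get("senttimedate"), msg.get("raw_innerText"))


def apply_scrape(options, stored, scraped):
    """Return the message list to save after a scrape pass."""
    if options.method is ScrapeMethod.COMPLETE:
        # First run: the scraped history replaces anything stored.
        return list(scraped)
    # UPDATE ONLY: keep stored messages, append only unseen ones, in order.
    seen = {message_key(m) for m in stored}
    merged = list(stored)
    for msg in scraped:
        key = message_key(msg)
        if key not in seen:
            seen.add(key)
            merged.append(msg)
    return merged
```

The obvious weakness of that key is that two identical messages from the same sender in the same minute would be collapsed into one, so a better identifier should be used if one turns out to be available.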
It is primarily just important that VOICE NOTES and IMAGES are downloaded and referenced in the output data format I suggested.
Ideally, it will be configurable to operate on a loop, iterating through a list of conversations, at first building the data available, then just updating as needed.
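The loop I have in mind could look roughly like this in Python. It is only a sketch: choose_method and run_loop are made-up names, the scrape callable stands in for whatever actually drives the browser, and deciding COMPLETE vs UPDATE ONLY by checking for an existing messages.json is my assumption.

```python
import time
from pathlib import Path


def choose_method(root, chat_id):
    """COMPLETE if no message file exists yet for this chat, else UPDATE ONLY."""
    messages_file = Path(root) / "chats" / chat_id / "messages" / "messages.json"
    return "UPDATE ONLY" if messages_file.exists() else "COMPLETE"


def run_loop(conversations, root, scrape, passes=1, pause_seconds=0.0):
    """Call scrape(chat_id, method) for each conversation, `passes` times over."""
    log = []
    for pass_index in range(passes):
        for chat_id in conversations:
            method = choose_method(root, chat_id)
            scrape(chat_id, method)  # placeholder for the real scraping step
            log.append((chat_id, method))
        # Wait between passes, but not after the last one.
        if pause_seconds and pass_index < passes - 1:
            time.sleep(pause_seconds)
    return log
```

With this shape, the first pass over a new conversation builds the full history and every later pass only tops it up, without the mode having to be set by hand for each chat.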
Preferably, use Python, user-input simulation and whatever browser automation is needed... A GUI is not strictly necessary as long as it just works.
I know some people use Node.js, but because I personally never bothered to get familiar with it, I would prefer to rule that out unless you can break down the setup instructions like I am a 5 year old.
NO APIs, and a WhatsApp Business account will not be used.