pretty bad amdgpu fault last night

User avatar
Grogan
Your Host
Posts: 480
Joined: Sat Aug 21, 2021 10:04 am
Location: Ontario, Canada

pretty bad amdgpu fault last night

Post by Grogan »

Unrecoverable. It ended in a hard reset and me fscking filesystem errors from my other OS (probably the stupid fucking journald databases). That's what I do after a bad shutdown, so the filesystem isn't mounted and I can fix serious errors if they exist. These weren't serious, just journal rollback and inconsistencies.

Anyway, I was playing Starfield, and had been for several hours (pausing the game a few times to smoke and eat etc.). I was on desiccated, barren, desert Earth heading back to my ship after a mission at an ancient NASA launch tower, bounding over dunes, jet pack jumping, breaking my legs on a high grav planet lol, when the screen froze (audio was still working). I sighed and waited for the inevitable game crash, but it was more serious. The display went into "no signal" mode and powered off, and I had no keyboard input to switch to a new tty on console. You can see the sequence of events: the ACPI power button shutdown worked, but something wouldn't unhook, so the filesystems didn't get unmounted and the kernel didn't shut down. The very last entry is me pressing the power button again. The event registered, but at that point there would have been nothing to dispatch it to (acpid had already shut down). The system wasn't halted, but the kernel was fuct. After that, I held the power button until power off.

I am not pleased. I have not had anything like that happen in 4 years. Back to driver recovery not working correctly. (Yesterday's kernel? Too hard to say. I'll have to see if it's an isolated incident. Starfield has crashed a few times since I've had it, but nothing like this has ever happened)

Code: Select all

Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32776, for process Starfield.exe pid 10223 thread vkd3d_queue pid 10291)
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00008000e0400000 from client 0x1b (UTCL2)
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: SQC (data) (0xa)
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x1
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32776, for process Starfield.exe pid 10223 thread vkd3d_queue pid 10291)
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x00008000e0400000 from client 0x1b (UTCL2)
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Apr 04 02:12:53 nicetry kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x0
Apr 04 02:13:04 nicetry kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=26306643, emitted seq=26306645
Apr 04 02:13:04 nicetry kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Starfield.exe pid 10223 thread vkd3d_queue pid 10291
Apr 04 02:13:04 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Apr 04 02:13:08 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: failed to suspend display audio
Apr 04 02:13:08 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Apr 04 02:13:08 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Apr 04 02:13:08 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
Apr 04 02:13:19 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
Apr 04 02:13:19 nicetry kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000900000).
Apr 04 02:13:19 nicetry kernel: [drm] VRAM is lost due to GPU reset!
Apr 04 02:13:19 nicetry kernel: [drm] PSP is resuming...
Apr 04 02:13:24 nicetry kernel: [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* send training msg failed.
Apr 04 02:13:24 nicetry kernel: [drm:psp_resume [amdgpu]] *ERROR* Failed to process memory training!
Apr 04 02:13:24 nicetry kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Apr 04 02:13:24 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(2) failed
Apr 04 02:13:24 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -62
Apr 04 02:13:24 nicetry kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -62
Apr 04 02:13:24 nicetry kernel: [drm] Skip scheduling IBs!
Apr 04 02:13:24 nicetry kernel: [drm] Skip scheduling IBs!

..... snipped multiple lines of the same shit

Apr 04 02:13:24 nicetry kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Apr 04 02:13:34 nicetry kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=522409, emitted seq=522411
Apr 04 02:13:34 nicetry kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Apr 04 02:13:34 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Apr 04 02:13:38 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: failed to suspend display audio
Apr 04 02:13:38 nicetry kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disallow df cstate
Apr 04 02:14:37 nicetry systemd-logind[485]: Power key pressed short.
Apr 04 02:14:37 nicetry systemd-logind[485]: Powering off...
Apr 04 02:14:37 nicetry systemd-logind[485]: System is powering down.
Apr 04 02:14:37 nicetry systemd[1]: Stopping Session c3 of User grogan...
Apr 04 02:14:37 nicetry systemd[1]: Removed slice Slice /system/modprobe.
Apr 04 02:14:37 nicetry systemd[1]: Stopped target Graphical Interface.
Apr 04 02:14:37 nicetry systemd[1]: Stopped target Multi-User System.
Apr 04 02:14:37 nicetry systemd[1]: Stopped target Login Prompts.
Apr 04 02:14:37 nicetry systemd[1]: Stopped target Sound Card.
Apr 04 02:14:37 nicetry systemd[1]: Stopped target Timer Units.
Apr 04 02:14:37 nicetry systemd[1]: archlinux-keyring-wkd-sync.timer: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Refresh existing PGP keys of archlinux-keyring regularly.
Apr 04 02:14:37 nicetry systemd[1]: man-db.timer: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Daily man-db regeneration.
Apr 04 02:14:37 nicetry systemd[1]: shadow.timer: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Daily verification of password and group files.
Apr 04 02:14:37 nicetry systemd[1]: systemd-tmpfiles-clean.timer: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Apr 04 02:14:37 nicetry systemd[1]: Stopping ACPI event daemon...
Apr 04 02:14:37 nicetry systemd[1]: Stopping Getty on tty1...
Apr 04 02:14:37 nicetry login[8985]: pam_unix(login:session): session closed for user grogan
Apr 04 02:14:37 nicetry systemd[1]: Starting Generate shutdown-ramfs...
Apr 04 02:14:37 nicetry systemd[1]: Stopping Authorization Manager...
Apr 04 02:14:37 nicetry systemd[1]: Stopping RealtimeKit Scheduling Policy Service...
Apr 04 02:14:37 nicetry systemd[1]: Stopping Load/Save OS Random Seed...
Apr 04 02:14:37 nicetry systemd[1]: Stopping Daemon for power management...
Apr 04 02:14:37 nicetry systemd[1]: rtkit-daemon.service: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped RealtimeKit Scheduling Policy Service.
Apr 04 02:14:37 nicetry systemd[1]: getty@tty1.service: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Getty on tty1.
Apr 04 02:14:37 nicetry systemd[1]: upower.service: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Daemon for power management.
Apr 04 02:14:37 nicetry systemd[1]: upower.service: Consumed 1.891s CPU time.
Apr 04 02:14:37 nicetry systemd-logind[485]: Session c3 logged out. Waiting for processes to exit.
Apr 04 02:14:37 nicetry systemd[1]: Removed slice Slice /system/getty.
Apr 04 02:14:37 nicetry systemd[1]: polkit.service: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped Authorization Manager.
Apr 04 02:14:37 nicetry acpid[483]: exiting
Apr 04 02:14:37 nicetry systemd[1]: acpid.service: Deactivated successfully.
Apr 04 02:14:37 nicetry systemd[1]: Stopped ACPI event daemon.
Apr 04 02:14:37 nicetry systemd[1]: acpid.service: Consumed 2.559s CPU time.
Apr 04 02:14:38 nicetry mkinitcpio[11134]: ==> Starting build: 'none'
Apr 04 02:14:38 nicetry mkinitcpio[11134]:   -> Running build hook: [sd-shutdown]
Apr 04 02:14:38 nicetry systemd[1]: systemd-random-seed.service: Deactivated successfully.
Apr 04 02:14:38 nicetry systemd[1]: Stopped Load/Save OS Random Seed.
Apr 04 02:14:38 nicetry mkinitcpio[11134]: ==> Build complete.
Apr 04 02:14:38 nicetry systemd[1]: mkinitcpio-generate-shutdown-ramfs.service: Deactivated successfully.
Apr 04 02:14:38 nicetry systemd[1]: Finished Generate shutdown-ramfs.
Apr 04 02:15:08 nicetry systemd-logind[485]: Power key pressed short.
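For anyone who wants to pick apart the GCVM_L2_PROTECTION_FAULT_STATUS words in a log like this, here is a minimal sketch in plain shell arithmetic. The bit positions are an assumption: they were chosen because they reproduce the fields the driver printed for 0x00201431 above (including the vmid:2 on the page fault line), not taken from the amdgpu register headers, so treat them as a guess for this GPU generation only. The function name is made up for the example.

```shell
# Unpack a GCVM_L2_PROTECTION_FAULT_STATUS word into named fields.
# ASSUMPTION: these bit positions are inferred from the values in the log
# above, not copied from the driver headers; verify them for your GPU.
decode_fault_status() {
    printf 'MORE_FAULTS: 0x%x\n'       $(( $1 & 0x1 ))           # bit 0
    printf 'WALKER_ERROR: 0x%x\n'      $(( ($1 >> 1) & 0x7 ))    # bits 1-3
    printf 'PERMISSION_FAULTS: 0x%x\n' $(( ($1 >> 4) & 0xF ))    # bits 4-7
    printf 'MAPPING_ERROR: 0x%x\n'     $(( ($1 >> 8) & 0x1 ))    # bit 8
    printf 'CLIENT_ID: 0x%x\n'         $(( ($1 >> 9) & 0x1FF ))  # bits 9-17
    printf 'RW: 0x%x\n'                $(( ($1 >> 18) & 0x1 ))   # bit 18
    printf 'VMID: 0x%x\n'              $(( ($1 >> 20) & 0xF ))   # bits 20-23
}

# First status word from the log.
decode_fault_status 0x00201431
```

The second status word in the log, 0x00000000, decodes to all zeros with the same layout, which matches what the driver printed for it.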

Re: pretty bad amdgpu fault last night

Post by Grogan »

Something else. This is the first time I've even had to look at logs in 4+ years. I got poxxed off with the way journald was usurping the socket, making it difficult to use syslogd without piping from journald first, and shut logging off altogether. I never once missed having logs on Manjaroo/Arch because nothing ever went wrong on that system. Stuff still went to the root console anyway and I could capture output if I wanted to. It was all just for games anyway, that system.

On my new rig I'm just letting journald do its thing (for now). I've got more hardware, so I don't care as much, but journald is retarded and I probably will try again to find a way to switch logging daemons (without having to keep volatile journals to pipe to syslogd). Arch is my main driver now (and my LFS is the alternate) so I'm going to want logs here.

Re: pretty bad amdgpu fault last night

Post by Grogan »

... and just guess what the actual problem is. The Assrock RX 6700 XT card beshat itself and died. Last night was a warning. I didn't have any more trouble until tonight, playing Far Cry 6. Well, it died in the same way, only this time for good. Those page faults were a symptom, not the cause.

I pulled the RX 570 card from the other computer and that's what I'm using now. Just what I didn't need tonight, in so many ways, physical, neurological and financial. This is Newegg, not Amazon, and I bought this card over a month ago now (not sure how long newegg covers RMA) but I can't wait until this is resolved because NOW, I've bought games that this thing will suck at. I want to play my Starfield game!

P.S. "Past Return Period" at Newegg. Probably by like 10 minutes (was purchased March 5), shortly after midnight by the time I was sure and went to Newegg to see the return period lol

Re: pretty bad amdgpu fault last night

Post by Grogan »

I'm thinking about this:
https://www.amazon.ca/GIGABYTE-Graphics ... 0CGC5P7H3/

Originally I wanted something in the RX 6700 series because I wanted to stay a series behind for the drivers. This is an RX 7700 XT. I've given up on sub-$500 lol

Fuck those stupid, cheaper brands. Chintzing you on 50 cents causes things like this.

P.S. Ordered... should have it by Saturday. Damnit, another $689 (with taxes and all).

Re: pretty bad amdgpu fault last night

Post by Grogan »

Sonofabitch... neither of those games will even run on this card. Neither Starfield nor Far Cry 6. In both cases I tried renaming their config files but neither game will launch. The card is way below system requirements, but that doesn't usually mean the game won't launch at all. Jeeze.

Ahh well, I guess I'll have to get back to Assassin's Creed and Mass Effect games because none of the ones I'm playing right now will work.

Re: pretty bad amdgpu fault last night

Post by Grogan »

Asrock's website seems very unhelpful; all I'm seeing are reasons they aren't going to help me. RMAs are only for "Authorized Dealers in the U.S." and I'm to contact my seller. It's like "if you're an end user, you can email support if you want, but...". Then they go on to tell you all the reasons they are going to ship your product back to you unrepaired.

I read the reviews at Newegg for my card, and the ones reporting failures say that Asrock has horrible support and never answers that email (one said he'd been waiting months and it's still not resolved).

I'm going to open a chat with someone at Newegg tomorrow, I'm not putting up with this "pass the buck" horseshit. If this would have happened last night I'd have been able to send it back to Newegg.
User avatar
Zema Bus
Your Co-Host
Posts: 240
Joined: Sun Feb 04, 2024 1:25 am

Re: pretty bad amdgpu fault last night

Post by Zema Bus »

Damn, sorry to hear that, and what awful timing. Hope you're able to get somewhere with Newegg.

Re: pretty bad amdgpu fault last night

Post by Grogan »

It seems Newegg is a circle jerk too. The "chat" is a bot that tells me "I'm sorry, I can't locate the information you requested" and "chat with a representative" leads to "we're sorry, live chat is unavailable" and it directs me to self help.

Also, I notice that this product was never eligible for refund, only replacement. "This item is covered by Newegg.ca's Replacement Only Return Policy" whereas other video cards and merchandise have 30 day refund and replacement policies. Asrock are probably assholes.

You should see the hostile language on Asrock's web site. Everything is no, no, no, no.
Repair / RMA (For USA) (RMA Policy)
»For USA Authorized Distributors Only

Email: rma@asrockamerica.com

»End users or indirect customer:

Thank you for choosing ASRock Motherboard. Due to the unique configuration of each computer, we recommend you to contact your dealer for technical support. If you would like contact us for technical support. Please send us an email to support@asrockamerica.com and give us a detailed discription of the problem along with the following information.
This is their "RMA Policy". These are rules for the seller to follow in accepting returns.

https://www.asrock.com/support/index.us.asp?cat=Policy

Why would any vendor sell Asrock products?

Re: pretty bad amdgpu fault last night

Post by Zema Bus »

Besides checking the ratings for hardware guess we should also be looking at their RMA policies and what other buyers have reported about their RMA experiences. I got lucky with the XFX Radeon 7 card that died, XFX's RMA process worked well and I got a full refund.

Maybe give their chat one more try, if you tried it at 7 AM PT or earlier it may have just been too early - down here Newegg is based on the west coast and the chat likely wouldn't be available that early.

Here's a Reddit thread from someone having a hard time with Asrock's RMA process, and a mod there with connections at Asrock got it to go through. When someone else in the thread also mentioned having trouble with them the mod suggested that he start his own thread with all the details and he'd look into it. Might be worth a shot.

Re: pretty bad amdgpu fault last night

Post by Grogan »

Yes, that's significant if something isn't going to fail in 30 days. I'm going to make sure I use the shit out of this new video card in the first 30 days lol

Reading that language on Asrock's website, I'd have never bought anything had I seen it before. We'll see how they do after I actually contact them. (I really can't say anything yet because I haven't even tried... just going by the language, and other complaints)

Right now, I've just decided to unstress for the weekend and not worry about this. I'm going to play some relaxing Odyssey tonight, my new card is shipped and will be here tomorrow, I'm going to see my family, then when I get home I'll pop the new card in and get back to my life. (it's easy... PSU connectors already there etc. and I don't even have to mess with drivers or anything)

Monday I'll sort this. It doesn't really matter now, a few days won't make any difference to this.

Re: pretty bad amdgpu fault last night

Post by Grogan »

I think the 7700 card should be fine, it's been out for about 8 months. RDNA3 should be fine on Linux 6.4+ and current stable mesa, so I'll be fine with current kernels and certainly my mesa builds. I just worry about glitchy behaviour on newer hardware series.

By the way, I took that locking latch right off my PCI-E card slot. That's going to end up causing cracking of a PCB, damaging the x16 slot, or raking a screwdriver across the motherboard. (I examined the faulty card with a magnifying glass and flashlight, and no, it's not damaged at the locking tongue but that's a common problem with graphics cards these days). It's not RAM sticks, it's a video card and this mechanism isn't appropriate. The only way to release that locking snap is to press on it with a long flat head screwdriver to unlatch it. It's hard to get at it with the cooler and all.

P.S. I don't even think those locking latches do much to keep the card seated. Any time you've forgotten it, the card wiggles fine but just doesn't come right out. So maybe it might keep the card from falling right out if you turned it upside down and shook it but I don't think it does much to keep the card aligned anyway.

Re: pretty bad amdgpu fault last night

Post by Grogan »

Card arrived as promised. Too late to install right now (have to leave) but that's about as good as it could get around here. Ordered 12:30 AM Friday, here 3:30 PM Saturday (and nobody else delivers on Saturday, Sunday and holidays but Amazon contracted couriers here)

Re: pretty bad amdgpu fault last night

Post by Grogan »

Fuck... nothing ever goes right. First of all, the card didn't seat right the first time (dropped right in, seemed good but evidently wasn't). It's a monster of a card, big metal support bracket supporting the weight of the card so it won't sag being held by the tang. It's not going to move now.

Then, when I got display it went right to the BIOS. It couldn't boot, my device didn't show up, because it switched itself to UEFI mode and greyed out the settings. All my other settings were intact, it didn't revert back to defaults or anything. Why the Hell would it do that, the CSM setting is just for booting (Legacy + UEFI), not for assigning PCI-E bus resources or anything. But that's what I had to do, restore defaults and then go through and remember every stinking setting I changed in that complex user interface lol

I guess the bios didn't like going back and forth between those cards.

Re: pretty bad amdgpu fault last night

Post by Grogan »

Well... no hesitation in either of those games now I tell you (Starfield and Far Cry 6) with maxed out settings. Though I haven't tried enabling the DXR settings... not sure how well our drivers can actually do the ray tracing on RDNA3 at this time, though my card has accelerators for it and they say Vulkan ray tracing is usable since Mesa 23.2.

Re: pretty bad amdgpu fault last night

Post by Zema Bus »

Glad to hear this one is working well :)

Re: pretty bad amdgpu fault last night

Post by Grogan »

Ray Tracing tested in the game Control. I chose that one to test it first because it's one where I'll notice it (one that needs it). That game has bugged me for a long time, the lighting and dithering and particles. Also, it's not that DXR ray tracing.

I have the Ray Tracing settings on High, with all the boxes checked and while it hurts a little, performance is still good. The smoke and fog and stuff looks better now with the bright lights shining on it.

Re: pretty bad amdgpu fault last night

Post by Grogan »

Well, sometimes things do go right. I got away with putting the RX 570 back in the old computer without taking it off the desk or laying it down or disconnecting anything (power supply switch off and capacitance drained though) with no difficulty. Other than getting tools etc. beforehand, I didn't even get up out of my chair when I did it, I just wheeled over to it. I'm just so sick of manhandling heavy computers. It's easy enough to get a card out like that, but it's a dumb way to install one, at least with big dual slot video cards, because gravity is working against you while doing up the mounting screws :lol:

Yet, the best laid plans... I was so cautious and methodical, and took my time last night and failed to seat the card properly. Looked good, couldn't see any fingers but needed to come out again and have a good push in.

Re: pretty bad amdgpu fault last night

Post by Zema Bus »

Imagine trying to install one of those high end monster 4 slot cards (I think they're all Nvidia) :)

Re: pretty bad amdgpu fault last night

Post by Grogan »

I think I understand what happened with the BIOS and CSM support. While the CSM module is just for backwards compatible booting (you still have a "UEFI BIOS") it conflicts with Resizable BAR support. I still don't understand what that has to do with booting from a master boot record, but it's a fact that a CSM module is incompatible. Installing the new card would have advertised that. So I guess the BIOS dutifully disabled the CSM module, leaving me unable to boot. I'll bet that behaviour fucks up a lot of people. I guess reloading defaults and going through and manually disabling/enabling things "unconfused" and un-greyed things and disabled resizable BAR support when I enabled the CSM.

I don't care about that, I've never had a graphics card utilizing that before. It just allows the hardware to negotiate a base address register size, dynamically. In theory it means the CPU has direct access to all the graphics memory at once (instead of in chunks) and it should reduce CPU overhead and latency but I doubt it would amount to a hill of beans in practice for me, it's reducing theoretical overhead which may help in some situations (e.g. really large data sets) but nothing will be waiting long. Something to keep in mind for next time, though.
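A rough way to see whether the large BAR actually took effect is to look at the sizes of the card's PCI memory regions in sysfs. This is a minimal sketch: it assumes the usual Linux layout where each line of a device's `resource` file holds three hex values (start, end, flags), and the function name is made up for the example. As far as I know, AMD cards typically expose a 256 MiB VRAM window when Resizable BAR is off and something close to the full VRAM size when it is on, but check that against your own hardware.

```shell
# Print the size in MiB of each populated region in a PCI "resource" file.
# Each line of the file is expected to hold: start end flags (all hex).
# Regions smaller than 1 MiB (I/O ports, small register blocks) show as 0 MiB.
bar_sizes_mib() {
    n=0
    while read -r start end flags; do
        # Unused regions are listed as all zeros; skip them.
        if [ "$(( $end ))" -ne 0 ]; then
            printf 'region %d: %d MiB\n' "$n" $(( ($end - $start + 1) / 1048576 ))
        fi
        n=$(( n + 1 ))
    done < "$1"
}
```

Usage would be something like `bar_sizes_mib /sys/bus/pci/devices/0000:03:00.0/resource`. That address is the one from the earlier log for the old card, so it is only a placeholder here; `lspci` will show where the current card sits.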

However, that is one bona-fide reason there (besides Secure Boot) to not use a CSM. I didn't know about this until I had to find out what happened to me last night.

To fix that, I'd have to tear down partitions (need to be GPT), restore data, adjust UUIDs in grub.cfg, get a boot loader working with UEFI... I think not today.

Re: pretty bad amdgpu fault last night

Post by Grogan »

Zema Bus wrote: Sun Apr 07, 2024 11:19 pm Imagine trying to install one of those high end monster 4 slot cards (I think they're all Nvidia) :)
I saw a video with Jay with one of those, it was like that, a big monster with a liquid cooling block (this was the "Megaman" PC build with the steampunk custom tubing). It needed a support strut from the bottom of the case lol

Re: pretty bad amdgpu fault last night

Post by Grogan »

So... giving this some more thought, the only explanation I can think of for the CSM and Resizable BAR incompatibility would be registers. When you enable the CSM compatibility module in order to boot from MBR code, you are probably using the same registers needed to initialize that Resizable BAR feature between your CPU and your PCI-Express hardware (it's not only graphics cards in theory but I don't know if anything else uses it. It could also be advantageous for NVME I'd think). There is probably no way on this hardware platform to do both with the way these things are implemented... the CPU needs those register addresses. That's my (somewhat) educated guess, anyway.

P.S. You know me. Whether this will matter to me in practice or not (maybe things like that DXR might be better with a few nanoseconds less latency in VRAM access while working with data sets), now I know something is "not right" (not optimal) and it's bugging me. But what I'd have to do to fix it (and I could still have trouble with these bootloaders) is just too much work. Not starting over, but a lot of data to move and configuration and filesystem adjustments (e.g. /boot will be the EFI mount and then where is my Bollux kernel going to go... there too?) :lol:

Re: pretty bad amdgpu fault last night

Post by Zema Bus »

One way to find out while leaving your current Arch install intact would be a second Arch install on a second drive, it could just be some small cheap drive just for testing purposes. Then you could see first hand if it makes enough of a difference to bother with. If the difference isn't that significant then you won't have the thought at the back of your mind, wondering whether you're leaving a significant performance boost on the table :)

Re: pretty bad amdgpu fault last night

Post by Grogan »

I would have just torn it down again and blasted back the tarballs again if it didn't work. I was getting close to doing just that :evil:

However, I finally have Arch booting with UEFI, no thanks to my BIOS. I had to go with the "default" EFI install again (like I did with ReFind) only with grub. You do that by using the --removable parameter, it sets it up like this:

Code: Select all

[nicetry BOOT]# pwd
/boot/EFI/BOOT
[nicetry BOOT]# ls -l
total 136
-rwx------ 1 root root 139264 Apr  9 04:03 BOOTX64.EFI
That's the "default" EFI boot program that a UEFI bios is hard coded to look for.

It looks like systemd automatically mounts the EFI partition (I didn't add it to fstab lol) at /boot.

Anyway, Resizable BAR support now. We'll see if I get my 5%

P.S. I even had to build a kernel on the other computer, tar it up, sftp it over (booted with Arch ISO) and drop it in because I found out too late (after I'd torn it down) that I need a relocatable kernel for UEFI booting. I built a kernel with EFI support enabled, but missed that.

Re: pretty bad amdgpu fault last night

Post by Grogan »

So, this changes how grub.cfg operates. I now have to stick my LFS kernel in the EFI partition (I'm using the EXTRAVERSION var in the Makefile to keep them separate) and now paths are relative to that because it's a mount, not a directory. Also, it's the UUID of the FAT32 EFI partition now, not the rootfs

It didn't complicate things too much though. I can still use a fairly simple, hand edited grub.cfg. It just needs a few more grub modules now for gpt partitions and FAT.

Code: Select all

set default="0"
terminal_input console
terminal_output console
set timeout=10
set timeout_style=menu
set menu_color_normal=white/black
set menu_color_highlight=black/light-gray
set linux_gfx_mode=text

### Add delayacct to the kernel command line for iotop ###

menuentry 'Arch Linux' {
        insmod gzio
        insmod part_gpt
        insmod fat
        search --no-floppy --fs-uuid --set=root 69A1-4857
        linux  /vmlinuz-6.8.4 root=/dev/nvme1n1p2 ro mitigations=off split_lock_detect=off loglevel=3 quiet
}

menuentry 'Bollux' {
        insmod gzio
        insmod part_gpt
        insmod fat
        search --no-floppy --fs-uuid --set=root 69A1-4857
        linux  /vmlinuz-6.8.4-bollux root=/dev/nvme1n1p4 ro mitigations=off split_lock_detect=off ipv6.disable=1
}
I can boot both OSes now, that's a new kernel for Bollux with the EFI stuff built in. So it loads from the EFI partition then loads its rootfs from /dev/nvme1n1p4. Not exactly self contained anymore, and it complicates backing up and/or restoring with tar, but oh well.
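Since both entries follow the same pattern (load the GPT and FAT modules, find the EFI partition by its UUID, then load a kernel by a path relative to that partition), the stanza can be sketched as a small generator. This is only an illustration: `esp_menuentry` is a made-up helper and the kernel command line is cut down to the bare minimum. The UUID of the FAT32 partition can be looked up with `blkid -s UUID -o value` on the partition device.

```shell
# Emit a GRUB menuentry for a kernel stored on the FAT32 EFI partition.
# Arguments: title, EFI partition UUID, kernel file name, root device.
# The kernel path is relative to the root of the EFI partition, not to /.
esp_menuentry() {
    title=$1 esp_uuid=$2 kernel=$3 rootdev=$4
    cat <<EOF
menuentry '$title' {
        insmod part_gpt
        insmod fat
        search --no-floppy --fs-uuid --set=root $esp_uuid
        linux /$kernel root=$rootdev ro
}
EOF
}

# Example with the values from the grub.cfg above.
esp_menuentry 'Bollux' 69A1-4857 vmlinuz-6.8.4-bollux /dev/nvme1n1p4
```

Extra kernel parameters would go on the end of the `linux` line, the same as in the hand-edited file.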

Re: pretty bad amdgpu fault last night

Post by Grogan »

Know why this works for me? Because it doesn't actually write to NVRAM. I wouldn't even need to run grub-install for this if I knew what I was doing (and I think I do now), I could manually copy the contents of the EFI partition and the bios would find EFI/BOOT/BOOTX64.EFI (case insensitive filesystem but I've always seen that shit upper case)
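A quick way to confirm that a mounted EFI partition carries the fallback loader is a case-insensitive search for it. This is a minimal sketch assuming GNU find (for `-ipath`); the function name is made up. For reference, the kind of install described above would be something like `grub-install --target=x86_64-efi --efi-directory=/boot --removable`, where everything except `--removable` is an assumption about this particular setup.

```shell
# Succeed if the given mount point contains EFI/BOOT/BOOTX64.EFI in any
# mix of upper and lower case (FAT is case insensitive, find is not).
# ASSUMPTION: GNU find, which provides -ipath.
has_fallback_loader() {
    [ -n "$(find "$1" -type f -ipath '*/efi/boot/bootx64.efi' 2>/dev/null)" ]
}
```

For example, `has_fallback_loader /boot && echo "fallback loader present"`, run as root if the partition is mounted with a restrictive umask.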

There's really no magic here like writing to master boot records. It's simply the bios that is to find the EFI boot program. It's supposed to be able to do it with EFI variables so you're not fixated on that one thing. I don't think any other EFI boot entries show up if the system sees this... I have to use the "boot override" setting to even boot with my Ventoy USB stick if the hard disk is set up like that. The system doesn't show any other UEFI boot devices if the hard disk is set up like this.

Interesting observations here for me, but the more I see the more I don't like this. That efibootmgr back end does things that don't work for me. There's nothing that clears this shit, it's the dumbest way to boot that I've ever seen. I can REMOVE boot entries using efibootmgr with a command apparently, but that's the very program that's screwed me up in the first place. I'm starting to think that I don't even have any of those boot entries, the tomfoolery comes from having a default/fallback on the hard drive... the system just stops there when it finds that. Not even the thumb drive will show up as a UEFI boot device for ordering, I have to go to the boot overrides in the Save Settings menu. This is what happened last time too, because I installed ReFind like that (it worked, but needed manual configuration because it took settings from the chroot... and I didn't want to waste time learning a boot loader I didn't want to use in the first place)

It's also silly to have your kernel and shit on a fragile, rudimentary filesystem like FAT32 with fake file permissions. Fucking Microsoft. This whole UEFI booting thing is their shit. To protect the EFI partition I have to do this:

Code: Select all

/dev/nvme1n1p1                                  /boot           vfat            umask=0077      0 0
(Arch/systemd actually stopped automatically mounting that, so I added it to fstab. It must have been when I removed something I didn't like from the grub configuration, some identifier it was using probably. I started out generously by pasting a bunch of stupid shit. It doesn't really need to be mounted, once booted, there's nothing that accesses anything there until I want to upgrade kernels)

The umask makes everything drwx------ and -rwx------ for directories and files. Umask kind of works in reverse, you're adding bits to "mask" out default permissions. Mounts are already owned by root by default, so that locks even me out. I have to su to go in there. I wouldn't do that with a normal boot directory on EXT4 (I change distro's permissions on /boot and use normal 644 for kernel images etc. so I can read and traverse if they've done it) but FAT is horrible. I think Arch is the only one that actually sets things up like this (well, they direct YOU to set things up like this... I could make my own grub configuration where the kernel isn't in the EFI directory, I don't have to listen to them). Kernels don't go in the EFI partition, boot loaders do.
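To make the "works in reverse" part concrete, here is a minimal Python sketch of the masking arithmetic. It assumes the vfat driver starts from a full 0777 for both files and directories before applying the mask (which matches the drwx------ and -rwx------ results above). `masked_mode` is just an illustrative helper, not anything from the kernel or mount.

```python
import stat


def masked_mode(base: int, umask: int) -> int:
    """Clear every permission bit in base that is set in umask."""
    return base & ~umask & 0o777


UMASK = 0o077  # the umask=0077 option from the fstab line
mode = masked_mode(0o777, UMASK)  # assumed 0777 starting point on vfat

print(oct(mode))                           # 0o700
print(stat.filemode(stat.S_IFDIR | mode))  # drwx------
print(stat.filemode(stat.S_IFREG | mode))  # -rwx------
```

The same arithmetic explains the usual 644 on a normal filesystem: a new file requested with 0666 under a umask of 022 gives `masked_mode(0o666, 0o022)`, which is 0o644.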

I actually do like this default fallback method though. It has no choice but to work and I'm not ever going to have a second boot loader instance installed (which is supposed to be the thing with those boot vars in NVRAM). It's all going to be done with grub.cfg. Even a Windows boot would be (though after the fact, once the install has broken your bootloader).

Post by Zema Bus »

I think that MSI board has the worst implementation of UEFI I've ever worked with, it's much better on my Gigabyte board. But UEFI in general isn't that great. I remember the first time I encountered it during a distro install, I found it disgusting to be forced to have a Microsoft file system on my machine that had always been free from anything Microsoft. I'm used to it now, but I still don't like it.

Post by Grogan »

Yes, and the BIOS update evidently didn't fix that. efibootmgr still fails to write to NVRAM on this board. I don't care though, this will always work.

There will be no more bios updates for me on this board though, MSI gives me the creeps and I doubt they are going to do anything with that now, if they haven't in 3 years.

Post by Grogan »

So I actually heard back from ASRock, and they sent me an RMA request form. I am to fill out the PDF. I thought, how the Hell am I supposed to do that? I don't have Adobe Acrobat (Adobe can do an acrobatic leap and disappear up their own rectum).

Then I remembered reading in release notes a long time ago that Firefox could fill out PDF forms. I tested it and saved the file and it had my changes. :thumbup:

Post by Zema Bus »

In the past I often used GIMP. It loads PDFs as regular images that I'd edit, then export out as a PDF again. It was a bit cumbersome but it worked.